Simple sequences are regions of low complexity made up of short sequence repeats (1-6 elements). The repeats may have a tandem organization or form segments of imperfect repeats, also called cryptic simple sequences. In DNA, as well as in proteins, regions of low complexity are extremely abundant. For example, about 71% of yeast proteins show significant overall simplicity as measured by the SIMPLE algorithm. This algorithm has been implemented in the SIMPLE v. 3.0 program, which analyzes simplicity in any nucleic acid or protein sequence and can be accessed online. Many short repeats are believed to have originated by DNA slippage and misalignment during replication, recombination or repair. We have studied the codon composition of gene regions that encode homopeptides in order to determine whether amino acid repeats correlate with trinucleotide repeats in the gene. A high correlation would be consistent with slippage, while a mixture of codons could indicate selection acting on the homopeptide region. In mammals, two populations of glutamine repeats can be clearly differentiated. The first is encoded by pure trinucleotide tracts (CAG) and the second by very mixed tracts (CAA/CAG). The latter type tends to be more conserved between human and mouse than the pure tracts. The results suggest that while a subset of polyglutamine segments may have originated recently by slippage, and may therefore be neutral, others appear to have been preserved throughout evolution.
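The codon-composition comparison described above can be illustrated with a minimal sketch. It is not the SIMPLE algorithm and not the procedure used in this study. It assumes an in-frame coding sequence, handles only the two glutamine codons, and uses a hypothetical function name (`homopeptide_codon_usage`), a hypothetical minimum run length (`min_len`) and an illustrative "purity" score, defined here as the fraction of a run encoded by its most frequent codon.

```python
from collections import Counter
from itertools import groupby

# Assumption: only glutamine codons are mapped, to keep the sketch short.
# A full codon table would be needed to analyze other homopeptides.
CODON_TO_AA = {"CAA": "Q", "CAG": "Q"}


def homopeptide_codon_usage(cds, min_len=5):
    """Find homopeptide runs in an in-frame coding sequence.

    Returns a list of (amino_acid, run_length, codon_counts, purity),
    where purity is the fraction of the run encoded by its most
    frequent codon (1.0 means a pure trinucleotide tract).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Codons outside the table become "X", which ends any current run.
    residues = [CODON_TO_AA.get(codon, "X") for codon in codons]

    results = []
    pos = 0
    for aa, group in groupby(residues):
        length = len(list(group))
        if aa != "X" and length >= min_len:
            counts = Counter(codons[pos:pos + length])
            purity = max(counts.values()) / length
            results.append((aa, length, dict(counts), purity))
        pos += length
    return results


# Toy sequences (made up for illustration, not real genes).
pure_tract = "ATG" + "CAG" * 8 + "TAA"      # run of 8 Q, purity 1.0
mixed_tract = "ATG" + "CAGCAA" * 4 + "TAA"  # run of 8 Q, purity 0.5

for name, seq in [("pure", pure_tract), ("mixed", mixed_tract)]:
    print(name, homopeptide_codon_usage(seq))
```

Under these assumptions, a purity near 1.0 would correspond to the pure CAG tracts that are consistent with slippage, while lower values would correspond to mixed CAA/CAG tracts. Any threshold separating the two populations would have to come from the data, not from this sketch.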