In mammals, the positive correlation between dN and repeat length is weak but statistically significant. This result is congruent with previous analyses of smaller datasets from the human and mouse genomes [13]. The purity of the AARs per gene or exon shows a similar trend. However, these weak correlations can be explained by the influence of the GC context surrounding the repeat. High GC content can generate a sequence context more prone to slippage [21] and thus to expansion of AARs. Indeed, we found an example of this in exons that have experienced GC-biased gene conversion in primates. Similarly, while there is an increase in the number of recent AARs in mammalian PSGs, these recent expansions are better explained by GC content than by positive selection acting on codons. Therefore it seems that, in contradiction to previous reports [15], the expansion of AARs is not causally associated with substitution rates. While purifying selection limits the expansion of AARs [e.g. 29], this appears to be distinct from the selective pressure on individual (aligned) amino acid sites. This means that these repeats experience not only different mutational processes but also particular selective constraints, leading to a more complex evolutionary scenario.
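The reasoning that a weak correlation can be accounted for by a shared dependence on GC content can be illustrated with a partial correlation. The following is a minimal sketch on synthetic data, not the analysis performed in this study: the variable names (`gc`, `repeat_length`, `dn`), the sample size and the simulated effect sizes are all assumptions chosen for illustration. Both simulated variables depend on GC content and not on each other, so the raw correlation is clearly positive while the correlation after regressing out GC is close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, zs):
    """Residuals of a simple least-squares regression of ys on zs."""
    n = len(ys)
    my = sum(ys) / n
    mz = sum(zs) / n
    slope = (sum((z - mz) * (y - my) for y, z in zip(ys, zs))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    return pearson(residuals(xs, zs), residuals(ys, zs))


# Synthetic, hypothetical data: GC content drives both variables independently.
rng = random.Random(0)
gc = [rng.gauss(0, 1) for _ in range(2000)]
repeat_length = [g + rng.gauss(0, 1) for g in gc]
dn = [g + rng.gauss(0, 1) for g in gc]

raw = pearson(repeat_length, dn)
partial = partial_correlation(repeat_length, dn, gc)
print(f"raw correlation: {raw:.3f}")
print(f"partial correlation controlling for GC: {partial:.3f}")
```

In real data, the covariate would be the measured GC content of the exon or of the region flanking the repeat, and a rank-based correlation may be more appropriate if the variables are strongly skewed.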
Our analyses, even of individual exons, suggest that increased substitution rates are not usually linked to the presence of AARs. However, it is possible that in some particular cases, as has been suggested for Drosophila, the expansion of AARs produces compensatory changes at neighbouring sites to accommodate the perturbation generated by the repeat [30]. We also cannot exclude the existence of adaptive evolution related to AARs [7], in the absence of a good neutral reference model for tri-nucleotide expansions in proteins. But our results do show that the selective pressure as measured by codon models is not related to putative adaptive evolution of AARs.
AARs in mammalian genes do not seem to affect gene expression significantly. Unlike repeats that disrupt the reading frame and have a strong effect on replication and transcription stability [31], tri-nucleotide repeats might be constrained in a different way. It seems that repeats located in the promoter region [32] have a stronger influence on transcription than do AARs, even those near the transcription start.
The analyses of molecular function confirmed an enrichment in the transcription factor, DNA binding, molecular transducer and binding categories, which is consistent with previous studies of polymorphic repeats [26]. The overrepresentation of transcription factor categories supports the existence of trans effects, as these repeats might alter the expression of the target genes and end up producing dramatic changes in the phenotype [7]. However, while the ice-binding protein is involved in hypothermic resistance in some Antarctic fishes [25], its overrepresentation in alanine-rich mammalian genes is probably due to an annotation bias.
In general, we found that AARs are located in proteins that interact with DNA, RNA, ligands or other proteins, so it is likely that they help to adapt or modulate the interaction capacity of these proteins. Longer proteins and repeat-rich proteins tend to have higher connectedness within interaction networks, suggesting that they contribute to an enlarged interaction surface and constitute more flexible subunits [36]. Some AARs have recently been associated with specific domains, such as signal peptides or transmembrane regions [16], pointing to their role in facilitating molecular interactions of extreme importance. For example, in the Drosophila ARC 70 cofactor complex, the -130 and -230 subunits contain an expansion of glutamine residues, a prevalent feature of sequence-specific activators in Drosophila