Starlet sea anemone proteome is exceptionally rich in tandem repeats
An increased number of sequenced genomes has led to a revised view of the origin of metazoan life. The eumetazoan clade contains taxa with nervous systems and muscle cells: Cnidaria, Ctenophora, and Bilateria [25]. In addition, the complete proteome of the starlet sea anemone, Nematostella vectensis, provides insight into gene invention dating back hundreds of millions of years to the last common Cnidarian-Bilaterian ancestor [26]. The N. vectensis genome allows us to examine the evolution of an ancestral genome that is rich in TRs. Examples of TR-proteins that fulfill the definition used here are shown in Figure . Proteins A7SW76 and A7S5V7 (UniProt IDs) share a repeated unit of 49 amino acids (yellow box) with copy numbers of 3.5 and 4. In addition, two other repeated units are found in A7SW76, with copy numbers of 6 (7 amino acid unit) and 3 (21 amino acid unit). The remaining portions of the proteins are not identified as TRs. A TR is defined by at least 3 repeated units, each with a minimal length of 3 amino acids, with no intervening sequences. These are non-overlapping sequences that are identified by the Xstream tool [20] (see Methods). The number of unique TRs may differ from that of TR-proteins, as the same TR-unit might occur in numerous proteins and the same TR-unit often appears in several segments of a particular protein (Figure ).
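The definition above (at least 3 back-to-back copies of a unit of at least 3 amino acids, non-overlapping) can be illustrated with a minimal sketch. This is not the Xstream algorithm: it only detects exact, complete copies, whereas Xstream tolerates variation and reports partial copies (such as the copy number of 3.5 mentioned above). The function name and the toy sequence are hypothetical.

```python
from typing import List, Tuple

def find_exact_tandem_repeats(
    seq: str, min_unit: int = 3, min_copies: int = 3
) -> List[Tuple[int, str, int]]:
    """Return (start, unit, copies) for exact tandem repeats in seq.

    Simplifying assumption: copies must be identical and complete.
    Reported repeats never overlap, because the scan resumes after
    the end of each repeat that is found.
    """
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None
        # A unit longer than (n - i) // min_copies cannot fit min_copies times.
        for unit_len in range(min_unit, (n - i) // min_copies + 1):
            unit = seq[i:i + unit_len]
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies:
                span = copies * unit_len
                # Keep the longest span; on ties the shorter unit wins.
                if best is None or span > best[2] * len(best[1]):
                    best = (i, unit, copies)
        if best is not None:
            hits.append(best)
            i += len(best[1]) * best[2]
        else:
            i += 1
    return hits

# Toy sequence: a 6-residue unit repeated 4 times between flanking residues.
toy = "MK" + "GPTSAC" * 4 + "WW"
repeats = find_exact_tandem_repeats(toy)
```

In this toy case the repeat starts at position 2 and consists of 4 copies of a 6-residue unit, so the TR-segment spans 24 residues.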
Figure 1 TR-containing proteins from N. vectensis. Graphical representation of two TR-containing proteins: A7SW76 and A7S5V7 (UniProt). Proteins contain TR with a unit length of 49 amino acids (yellow, TR-unit). This TR-unit appears twice on A7SW76, while the (more ...)
We compared the occurrence of protein TR sequences in N. vectensis with that in other complete proteomes, including fly (D. melanogaster), human, mouse, frog, and more. We restricted our analysis to those TRs for which all repeats share >80% sequence identity with the respective consensus repeat sequence (see Methods).
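The >80% identity criterion can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure in Methods: the repeat copies are taken to be already aligned and of equal length (no gaps), and the consensus is a column-wise majority vote. All function names are hypothetical.

```python
from collections import Counter
from typing import List

def consensus(copies: List[str]) -> str:
    """Column-wise majority residue for equal-length repeat copies."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def passes_identity_filter(copies: List[str], threshold: float = 0.8) -> bool:
    """True if every copy is more than `threshold` identical to the consensus."""
    cons = consensus(copies)
    return all(identity(c, cons) > threshold for c in copies)

# Third copy differs from the consensus at 1 of 10 positions (90% identity).
conserved = ["GPTSACGPTS", "GPTSACGPTS", "GPTSAAGPTS"]
# Third copy differs at 2 of 5 positions (60% identity).
diverged = ["ACDEF", "ACDEF", "AGHEF"]
```

With these toy inputs the first repeat passes the filter and the second does not.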
The complete proteome of N. vectensis includes 24,906 predicted proteins, about half of which are marked as fragments. As many as 3875 proteins were identified as containing TRs, and these comprise 3315 unique TR-units. When comparing the TRs in N. vectensis and other proteomes, the most striking difference is in the number of unique TRs. In N. vectensis, ~16% of all protein sequences are TR-containing proteins. With the same parameters, TR-containing sequences account for only 3% of the Drosophila melanogaster proteome (Figure ).
An exhaustive comparison of the TR properties for representative proteomes, spanning an evolutionary range from N. vectensis to human, was performed. The selected organisms represent major evolutionary branches, covering worm (C. elegans), insects (fly, bee and beetle), plant (Arabidopsis), vertebrates (chicken, frog), marine chordate (Ciona), cnidarian (Hydra), mammals (human, mouse), protozoan parasite (Leishmania) and choanoflagellate (M. brevicollis).
When the fraction of TR containing proteins (TR-proteins) relative to the total Nematostella proteome is considered (Figure ), we noted a 2-5 fold enrichment in the fraction of such proteins relative to the other organisms tested.
Figure 2 N. vectensis TRs relative to representative proteomes along the metazoan evolutionary tree. (A) The fraction of TR-proteins within the proteome tested for 14 model organisms. Representative organisms include plant, insects, worm, sea squirt, frog, and (more ...)
Most N. vectensis tandem repeat units are unique
While the fraction of TR-proteins in N. vectensis is higher than in any other tested organism, the usage of any particular TR is typically restricted. Most TR-units are used only once in N. vectensis (Figure ). In mouse and human, each TR appears on average in 2.7 and 2.1 proteins, respectively. For example, in the mouse proteome, there are ~3000 TR-proteins, yet they are composed of only ~1100 unique TRs. While the genomes of human and mouse are quite active in reusing their repeated sequences [27], no clear view has been presented for the dynamics of TRs in N. vectensis. Moreover, the fraction occupied by a TR within the TR-protein (i.e., TR-coverage) is substantially higher in N. vectensis (Figure ). This extreme TR-coverage (50%) reflects the fact that almost half of N. vectensis proteins are marked as fragments due to missing exons (and failures of genome prediction tools). The 25% coverage measured for human (Figure ) and mouse (not shown) is still higher than the coverage determined for the other organisms tested, mainly insects and frog.
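The two quantities used in this subsection, the average reuse of a TR-unit and the TR-coverage of a protein, reduce to simple ratios. The sketch below uses only counts quoted in the text; the function names and the example protein are hypothetical.

```python
from typing import List, Tuple

def reuse_ratio(n_tr_proteins: int, n_unique_trs: int) -> float:
    """Average number of TR-proteins per unique TR-unit."""
    return n_tr_proteins / n_unique_trs

def tr_coverage(protein_length: int, segments: List[Tuple[int, int]]) -> float:
    """Fraction of a protein occupied by TR-segments.

    Segments are (start, end) half-open intervals and are assumed
    not to overlap, as in the TR definition used here.
    """
    covered = sum(end - start for start, end in segments)
    return covered / protein_length

# Mouse: ~3000 TR-proteins built from ~1100 unique TRs.
mouse_reuse = reuse_ratio(3000, 1100)          # about 2.7
# N. vectensis: 3875 TR-proteins and 3315 unique TR-units.
nematostella_reuse = reuse_ratio(3875, 3315)   # about 1.17

# Hypothetical 300-residue fragment with one 150-residue TR-segment.
example_coverage = tr_coverage(300, [(10, 160)])  # 0.5
```

The ratio close to 1 for N. vectensis is the numerical counterpart of the statement that most of its TR-units are used only once.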
Characterizing the properties of TR-containing proteins from N. vectensis
The >3300 TRs of N. vectensis were statistically analyzed to determine the most prevalent repeat properties in terms of number of TR units, their length, and the overall regions they occupy. Figure shows the abundance of these TR-units. It is evident that a unit length of 10-12 amino acids is most prevalent (Figure ). Furthermore, an inverse correlation is observed between the number of TR-units within a TR-segment (defined in Figure ) and the length of that basic unit (tested for a range of 6-20 amino acids). Thus, in most cases, the total region that is occupied by TRs in N. vectensis remains within the range of 120 to 180 amino acids (Figure , pink). The average length of a TR-segment in a protein is 153 amino acids. Figure shows the accumulated level of variation relative to the TR consensus, where TRs with deviations ranging from 0 to accumulated changes in 20% of their amino acids (marked 0.2) are analyzed. The TRs from human and N. vectensis demonstrate clear differences. While most TRs in human have diverged substantially, this is not the case for N. vectensis (peaks at 0.06). It is possible that TRs in human and N. vectensis differ in their tendency to accumulate and maintain variations. A substantial number of TRs in human (229) and in N. vectensis (235) show no variations (marked 0), but most of them (80%) are short TRs (<6 amino acids). An attractive idea is that the higher variation observed in mouse (not shown) and human (Figure ) coincides with the high usability of each TR-unit in human and mouse but not in Nematostella (Figure ). A similar trend was proposed for the accelerated evolution of duplicated paralogous genes [28]. A similar analysis performed for the ~1200 TR-segments of Hydra indicated an intermediate rate of variation (Additional file 1).
Figure 3 Properties of the TRs in N. vectensis. (A) The distribution of the TR-unit length for all 3212 unique TR sequences from N. vectensis. A unit length of 10-12 amino acids is most frequent. The tail of the length distribution (of repeats longer than 60 (more ...)
We expanded the comparative study to proteomes that are evolutionarily closer to N. vectensis, such as Hydra. Recently, a Hydra genome (H. magnipapillata) was sequenced and a predicted protein set of 17,586 models was presented (Craig Venter Institute, ~6X sequence coverage, 1.3 Gb genome). The fraction of unique TR-units within the Hydra complete proteome is ~5% (Table ).
TR properties in 14 representative proteomes
The difference in the absolute number of proteins and the fraction of proteins marked as 'fragments' (Table ) is a proxy for the knowledge and annotation quality (exceptions are missing annotations for L. major and Hydra). The high fraction of partial proteins (i.e., 47% 'fragments' for N. vectensis) is consistent with their shorter average length (Table ). We included in the analysis Leishmania major, a protozoan parasite that underwent rapid evolution and for which many proteins are known to contain TRs. Table shows that the exceptionally high proportion of proteins with TRs in N. vectensis is a unique property of this organism and has not been reported to such an extent even in protozoan parasites.
To ensure that the trends seen in Table do not solely reflect the poor quality of the assembly and protein annotations reported [29], we repeated the analysis but eliminated all sequences that include undefined nucleotides (indicated by nucleotide 'x') or sequences that could not match their transcripts exactly (see Methods). Even after such filtration, the fraction of TRs for N. vectensis remains exceptionally high (11% of the entire proteome).
Amino acid composition of N. vectensis tandem repeats
We analyzed the over-representation and under-representation of amino acids in TR-proteins, in comparison with all other N. vectensis proteins. We found that several amino acids, most notably histidine (H), cysteine (C), tyrosine (Y), proline (P), and threonine (T), were enriched, with C and H being the most significantly so (Figure ). On the other hand, we observed that polar and charged amino acids are strongly depleted in the TR sequences. Specifically, the most depleted amino acids are phenylalanine (F), glutamic acid (E), and lysine (K). Similar enrichment and depletion in amino acid preference were evident when compared to the SwissProt database (~400,000 sequences). In addition to the distinct biophysical properties of these under-represented amino acids, it is of interest that their codons tend to be AT rich and that they are mostly coded by a limited number of codons (2 codons each).
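One common way to express such over- and under-representation is a log2 ratio of amino acid frequencies between the two sequence sets. The sketch below is an assumption about how this could be computed; the text does not state the exact statistic used, and the function names, the pseudocount, and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def aa_frequencies(seqs: List[str]) -> Dict[str, float]:
    """Relative frequency of each residue over a set of sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

def log2_enrichment(
    tr_seqs: List[str], background_seqs: List[str], pseudocount: float = 1e-6
) -> Dict[str, float]:
    """log2(frequency in TR set / frequency in background) per residue.

    Positive values indicate enrichment in the TR set, negative values
    depletion. The pseudocount avoids division by zero for residues
    absent from one of the sets.
    """
    tr = aa_frequencies(tr_seqs)
    bg = aa_frequencies(background_seqs)
    return {
        aa: math.log2((tr.get(aa, 0.0) + pseudocount) / (bg.get(aa, 0.0) + pseudocount))
        for aa in sorted(set(tr) | set(bg))
    }

# Toy sets: C and H are twice as frequent in the "TR" set as in the background,
# while K and E are absent from it.
scores = log2_enrichment(["CCHH"], ["CHKE"])
```

In this toy example C and H score close to +1 (two-fold enrichment), while K and E receive strongly negative scores.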
Figure 4 Amino acid composition in TR-segments. (A) Composition of amino acids in TR-segments relative to non-TR proteins from N. vectensis. The over-represented and under-represented amino acids are shown. (B) The TR-proteins from N. vectensis were compared to (more ...)
A significant difference in amino acid composition exists between N. vectensis TR and non-TR proteins (Figure ). We thus tested whether the amino acid composition in TRs of N. vectensis is similar to that of other organisms. Specifically, we compared the N. vectensis TRs to the ~970 unique TRs from human. When comparing the TR sequences (Figure ), a relative enrichment in tyrosine (Y) is evident, and to a lesser extent C, I, L, and V. Interestingly, relative to the human repeats, the TRs from N. vectensis are more enriched with amino acids that tend to form ordered structures (Figure , blue), suggesting that these repeats may be better suited to form structural units. Similar trends were evident in comparisons with other vertebrate TR-proteins. The most dominant predicted secondary structure associated with the TR repeats is the β-sheet (not shown).
Sequence robustness of tandem repeats revealed by multiple open reading frames
Among the 24,906 ORFs predicted for N. vectensis, 3875 TR-proteins were selected (Table ). We tested the extent of valid alternative reading frames among this large set of TR-proteins. In fact, due to the shortage of experimental evidence (ESTs and cDNAs), and the absence of any direct protein information, the annotated reading frame may not be correct. Typically, computational inference of a specific frame is based on an appropriate Kozak sequence near the initiating methionine, on the knowledge of codon usage bias, and on conservation criteria from homology and paralogy. At present, ~50% of the TR-proteins lack an initiating methionine, and a similar fraction (45%) holds for the rest of the predicted proteins.
We analyzed the properties of the alternative reading frames for TR-proteins. For this analysis, the set of all TR-segments from N. vectensis was compiled (4437 segments, JGI proteome). For each such TR-segment, we inspected all alternative reading frames (ARFs), a total of 22,185 potential ORFs. Averaging over all 5 frame shifts, 37.5% of these ARFs do not contain a stop codon (Figure ), i.e., they adhere to the minimal definition of being valid ORFs. To put this result within a statistical context, we performed a simulation for 4437 random ORF sequences, generated based on the nucleotide composition of the coding sequences of the N. vectensis TR-proteins. In this random simulation, only 4.3% of the ARFs did not contain a stop codon, thus hinting at the possible significance of this beneficial use of repeated sequences to yield valid ARFs. Somewhat surprisingly, the high fraction of valid ARFs also carried over to those of the reverse complement strand (Figure ). In our analysis, we found that only ~600 TR-proteins (14%) can be read exclusively in the annotated reading frame. On the other hand, for most of the sequences, there are at least 2-3 additional ORFs (Figure ), many of which are potentially long. For example, when limiting the analysis to ARFs (ORF +2 or ORF +3) with length >500 nucleotides, 107 such instances were found (Additional file 2).
Figure 5 Multiple valid ARFs in TR-segments. (A) Frequency of valid ARFs (i.e., an ORF without stop codons) for all 6 possible reading frames for 4437 TR-repeated segments ORFs from JGI N. vectensis proteome. The average frequency of valid ARFs in any of the alternative (more ...)
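The ARF test described above (a frame is valid if it contains no stop codon) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the standard stop codons are assumed, and the randomization shown here simply shuffles each segment, which preserves its nucleotide composition but is only one possible reading of the simulation described in the text.

```python
import random
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]

def frame_is_open(dna: str, offset: int) -> bool:
    """True if no complete codon read from `offset` is a stop codon."""
    return all(dna[i:i + 3] not in STOP_CODONS
               for i in range(offset, len(dna) - 2, 3))

def open_frames(dna: str) -> Dict[str, bool]:
    """Openness of all six frames; '+1' is taken as the annotated frame."""
    rc = reverse_complement(dna)
    result = {}
    for k in range(3):
        result[f"+{k + 1}"] = frame_is_open(dna, k)
        result[f"-{k + 1}"] = frame_is_open(rc, k)
    return result

def fraction_open_arfs(segments: List[str]) -> float:
    """Fraction of the 5 alternative frames per segment that are open."""
    flags = [is_open
             for dna in segments
             for label, is_open in open_frames(dna).items()
             if label != "+1"]
    return sum(flags) / len(flags)

def shuffled(dna: str, rng: random.Random) -> str:
    """Random sequence with the same nucleotide composition (assumed null model)."""
    letters = list(dna)
    rng.shuffle(letters)
    return "".join(letters)

# 4437 segments x 5 alternative frames gives the 22,185 ARFs quoted in the text.
n_arfs = 4437 * 5
```

A usage pattern would be to compare `fraction_open_arfs(segments)` with the same statistic on `[shuffled(s, rng) for s in segments]`, where `segments` holds the nucleotide sequences of the TR-segments.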
An analysis of the TR-proteins showed that the overall amino acid composition is rather similar for all 3 reading frames on the coding strand when the alternative frames are open (Figure , p-value for paired t-test <0.007). Note that in cases when the reverse complement frames are also valid, the translated sequences are also often similar in composition to the original frame (see Discussion). It is important to note that there is a unique scenario in which repeats of a nucleotide unit will necessarily result in the same amino acid sequence for all 3 reading frames on a strand, with cyclic shifts only at the start and end of the sequence. Figure demonstrates an example where a repeat of 13 nucleotides leads to a TR-unit of 13 amino acids. This phenomenon is a byproduct of the basic repeat unit (at the DNA level) having a length that is not a multiple of 3. We found that this phenomenon was present in only 18% of all TR-proteins (806 instances). Clearly, in such cases, the amino acid composition will remain fixed. Nonetheless, we found the composition to be fixed for almost all TR-proteins (Figure ).
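The special case of a nucleotide unit whose length is not a multiple of 3 can be checked directly. The sketch below uses the standard genetic code and a hypothetical 13-nucleotide unit chosen for illustration (it is not taken from the N. vectensis data). Because 13 is not divisible by 3, three DNA units (39 nucleotides) are needed to return to the same codon phase, giving a 13-residue protein unit, and the other two frames read the same periodic sequence with a shifted start.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str, offset: int = 0) -> str:
    """Translate all complete codons starting at `offset`."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))

unit = "GCTCCAGGAGCAC"        # hypothetical 13-nucleotide repeat unit
dna = unit * 6                # 78 nucleotides
frames = [translate(dna, k) for k in range(3)]

protein_unit = frames[0][:13]  # one 13-residue TR-unit in the first frame

# The first frame is periodic with a 13-residue unit.
periodic = frames[0][:13] == frames[0][13:26]
# The other two frames contain the same sequence, shifted by 4 and 8 codons.
shift_frame2 = frames[1][4:] == frames[0][:len(frames[1]) - 4]
shift_frame3 = frames[2][8:] == frames[0][:len(frames[2]) - 8]
# Hence the amino acid composition of one unit is identical in all frames.
same_composition = (Counter(frames[1][:13]) == Counter(protein_unit)
                    == Counter(frames[2][:13]))
```

The shifts of 4 and 8 codons follow from the arithmetic: the second frame starts a codon at nucleotide 13 (1 + 3×4) and the third frame at nucleotide 26 (2 + 3×8), both of which are unit boundaries.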
Limited conservation of N. vectensis TR-units along the phylogenetic tree
We tested to what extent the 3300 TRs from N. vectensis are evolutionarily conserved. We focused on representative proteomes for comparison with N. vectensis. Figure shows a tree view of the major branches from the metazoan-fungi separation and within the metazoan kingdom. The Venn diagrams of TR-proteins indicate the number of TR-proteins that are shared among human, mouse, and Nematostella (Figure ). A broader evolutionary perspective is presented by comparing Nematostella with Hydra (Cnidaria) and Monosiga (Figure ). Shared TR-proteins are defined by the identity of their TR-units, as calculated by Xstream [20]. We noted that even human and mouse share only 343 TR-proteins, while the overlap between N. vectensis and human and between N. vectensis and mouse is even lower (160 and 112, respectively). The analysis of unique TRs shows that 6% of the unique TRs from N. vectensis are also found in human and mouse (Figure ). Among these, only 64 TR-proteins are shared among all 3 proteomes. While some of the TR-units that are shared among the tested proteomes are rather long, ~50-60% of them are shorter than 7 amino acids. A similar comparison indicated that 10% and 7.4% of the TR-proteins are shared with Hydra and Monosiga, respectively (Figure ). Only 20 TR-proteins are conserved among all 5 tested species. We concluded that a limited number of TR-units are shared in evolution and that the expansion of the TR-proteome is characteristic of Nematostella and, to a much lesser degree, of Hydra. For more details, see Additional file 3.
Figure 6 Evolutionarily conserved N. vectensis TR-units. (A) A schematic phylogenetic tree showing the main branches in metazoan origin. Proteomes that are compared are indicated in blue. (B) N. vectensis shares 160 TR-units with human and 112 TR-units with mouse. (more ...)
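Since sharing is defined by the identity of TR-units, the Venn-diagram counts amount to set intersections over the unit sequences of each proteome. The sketch below assumes exact string identity of units and uses invented toy units; the function names and data are hypothetical and do not reproduce the counts reported above.

```python
from typing import Dict, Iterable, Set

def shared_tr_units(proteomes: Dict[str, Set[str]], names: Iterable[str]) -> Set[str]:
    """TR-units present in every one of the named proteomes."""
    return set.intersection(*(proteomes[n] for n in names))

def fraction_shared(proteomes: Dict[str, Set[str]], query: str, other: str) -> float:
    """Fraction of the query proteome's unique TR-units also found in `other`."""
    return len(proteomes[query] & proteomes[other]) / len(proteomes[query])

# Toy data: each proteome is represented by its set of unique TR-unit sequences.
toy = {
    "nematostella": {"GPTSAC", "PEPTID", "AAQQ", "HCHC"},
    "human": {"PEPTID", "KLMN"},
    "mouse": {"PEPTID", "AAQQ"},
}

in_all_three = shared_tr_units(toy, ["nematostella", "human", "mouse"])
with_mouse = shared_tr_units(toy, ["nematostella", "mouse"])
```

A stricter or looser notion of identity (for example, treating cyclic rotations of a unit as the same repeat) would change the counts, so the matching rule is a design choice that should follow the Xstream output being compared.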
Expression of TR-proteins from N. vectensis: the multi-ubiquitin proteins
Table summarizes instances of TRs that are shared among N. vectensis, human, mouse, and additional organisms. Among the longest TRs that were identified in human, mouse, and N. vectensis (with almost no variability in their sequence) is a repeat of 76 amino acids that represents multi-ubiquitin proteins. For example, D. melanogaster has 4 proteins with such TR-units, with copy number (n) of 4 (Q8MT02), 7 (Q9W418), 10 (A4V1F9) and 14 (Q8MSM5). In N. vectensis, this TR-unit is detected in 2 independent sequences with a copy number of 3 and 7 units (A7SV54 and A7SUP6, respectively). This ubiquitin domain of 76 amino acids appears within a wide spectrum of the taxonomical tree. Representative proteins containing this TR, with n = 3 to 17, are shown in Table . Hs1-Cortactin is a TR of 37 amino acids that appears with n = 3 to 7. The 114 amino acid repeat of Calx-beta appears in 7 protein sequences in N. vectensis. It is widely spread throughout the taxa, ranging in copy number from 3 to 41. The Calx-beta motif is present in the cytoplasmic domains of Na-Ca exchangers and in integrin-β4, which mediates signaling across the plasma membrane (see Additional file 4). In all these examples, the TR units are those reported by the Pfam family collection [30]. All N. vectensis examples listed in Table are supported experimentally and confirmed by EST expression from embryo, larva, and unfertilized egg (not shown).
Representatives of TRs shared between N. vectensis and other organisms.
A functional perspective of N. vectensis repeats
The fraction of Pfam and InterPro entries that are represented among all N. vectensis proteins is slightly lower than that for well-studied genomes (68% compared to 77% for all proteins in UniProt). When the set of TR-proteins is considered, only 31% appear with a Pfam entry that is repeated at least twice. Thus, most of the TR-proteins are undefined by Pfam and InterPro databases. Of course, the strict requirement of tandemness in the TRs excludes proteins with non-conserved linkers between domains (e.g., Annexin repeats) from the analysis.
We thus set out to test the appearance of repeats in the N. vectensis proteome in view of the 252 repeat types that are reported by InterPro (Additional file 5). Most of the repeats that are supported by Pfam (63%, see Methods) are not found in N. vectensis. For those found in N. vectensis, we focused on the 16 repeat types that are represented by at least 20 proteins in N. vectensis. When compared to human, no marked difference in copy number is shown for most of these repeats (Figure ). For 2 such repeat domains, N. vectensis exhibits a moderate preference for a higher copy number. For almost half of the instances, the opposite tendency is detected. Since many of the N. vectensis proteins are incomplete (i.e., annotated as fragments), this analysis may underestimate the actual copy number of the TRs in the full sequences. We conclude that, overall, the copy numbers for the Pfam repeats that are well-represented in N. vectensis are rather similar to the copy numbers of these repeats in human (Figure ).
Figure 7 Pfam repeated domains in N. vectensis proteome. Pfam repeat domains based on N. vectensis InterPro annotations. Only Pfam entries with >20 proteins are listed. The histogram indicates the log-ratio of copy number for a particular TR-unit in human (more ...)
Evolutionary divergence rate for tandem repeat units
The large number of TR-proteins in N. vectensis and the observation that they are mostly uniquely used in its proteins raise the question: which evolutionary forces act on such repeats? We set out to study whether the evolution of these tandem repeats has been subject to neutral, purifying, or positive selection. We thus applied an analysis based on the ratio of nonsynonymous to synonymous substitutions (Ka/Ks ratio). Due to the shortage of experimental evidence for many of the predicted proteins from N. vectensis, we limited this analysis to the TR-proteins that are supported by ESTs (obtained from the JGI genome center). Of the 94 sequences analyzed, ~40% exhibit a ratio of Ka/Ks > 1.0, while ~50% show the opposite trend. Among the TR-proteins with Ka/Ks > 1, several exhibit extremely high ratios, which strongly supports the notion of positive selection on these tandem repeats.
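The interpretation of the Ka/Ks ratio follows the usual convention: a ratio above 1 suggests positive selection, a ratio below 1 suggests purifying selection, and a ratio near 1 is consistent with neutral evolution. The sketch below only classifies precomputed Ka and Ks values; estimating Ka and Ks from aligned sequences is a separate step that is not shown. The function names, the tolerance parameter, and the example values are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def classify_selection(ka: float, ks: float, tolerance: float = 0.0) -> str:
    """Label a Ka/Ks pair as 'positive', 'purifying', 'neutral' or 'undefined'.

    `tolerance` widens the band around 1 that is treated as neutral.
    Ks == 0 gives an undefined ratio and is reported as such.
    """
    if ks == 0:
        return "undefined"
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return "positive"
    if ratio < 1 - tolerance:
        return "purifying"
    return "neutral"

def summarize(pairs: Iterable[Tuple[float, float]], tolerance: float = 0.0) -> Dict[str, float]:
    """Fraction of sequences falling into each selection class."""
    labels = Counter(classify_selection(ka, ks, tolerance) for ka, ks in pairs)
    total = sum(labels.values())
    return {label: count / total for label, count in labels.items()}

# Invented (Ka, Ks) values for four hypothetical TR-proteins.
example = [(0.30, 0.10), (0.05, 0.20), (0.02, 0.25), (0.10, 0.10)]
summary = summarize(example)
```

In practice a significance test on each ratio would be preferable to a fixed threshold, since ratios estimated from short or highly similar sequences can be noisy.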