Protein family birth and death
Protein families were constructed from all sequences representing 11 eukaryotic taxa using Markov Clustering (MCL; [17]), where MCL clusters with multiple sequences were defined as protein families. In total, 17,752 families were identified from 151,044 proteins of the following 11 species: Homo sapiens, Mus musculus, Gallus gallus, Drosophila melanogaster, Aedes aegypti, Bombyx mori, Caenorhabditis elegans, C. briggsae, Trichinella spiralis, Monosiga brevicollis and Saccharomyces cerevisiae (Additional file 1 and Additional file 2). These protein families have different taxonomic distributions, with the majority of them aligning with specific clades of nematodes, arthropods, or vertebrates (Table and Additional file 3). Only 810 protein families have members present in all 11 taxa (hereafter referred to as universal families). Nematodes have the highest number of specific families (6,613), among which 1,087 are specific to T. spiralis. This is the highest number of families unique to a single species. In contrast, the arthropod lineage has the fewest specific families (2,045), and G. gallus has only 60 species-unique families (Table ). The lineages leading to the Last Common Ancestor (LCA) of human and mouse (mammals) tend to have higher numbers of new family births, but it is the LCA of C. elegans and C. briggsae that has the highest number of family births. If normalized to branch lengths, family births in the LCA of mammals are the highest, followed by births in the LCA of C. elegans and C. briggsae. Trichinella spiralis and G. gallus have twice as many family deaths as their neighboring taxa. After diverging, 932 families disappeared in T. spiralis compared with fewer than 460 in C. elegans or C. briggsae, and 487 families disappeared in G. gallus compared with fewer than 200 in H. sapiens or M. musculus. Among all the organisms examined, the lineage leading to T. spiralis exhibited the most family deaths. Overall, the numbers of family births are higher than family deaths and vary more than deaths over the lineages examined.
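The family definition and the taxonomic categories used above can be illustrated with a minimal sketch. It assumes each MCL cluster is available as a list of (taxon, protein) pairs; the function name, the label strings, and the toy clusters are hypothetical and are not taken from the study's pipeline.

```python
def classify_families(clusters, all_taxa):
    """Label clusters by taxonomic distribution.

    clusters: dict mapping a cluster id to a list of (taxon, protein_id) pairs.
    all_taxa: every taxon included in the analysis.
    Clusters with a single sequence are skipped because only clusters with
    multiple sequences count as protein families.
    """
    all_taxa = set(all_taxa)
    labels = {}
    for cluster_id, members in clusters.items():
        if len(members) < 2:
            continue  # singleton cluster, not a family
        taxa = {taxon for taxon, _ in members}
        if taxa == all_taxa:
            labels[cluster_id] = "universal"
        elif len(taxa) == 1:
            labels[cluster_id] = "species-specific"
        else:
            labels[cluster_id] = "partial"
    return labels


# Hypothetical toy input with three taxa instead of eleven.
toy_taxa = ["H. sapiens", "C. elegans", "S. cerevisiae"]
toy_clusters = {
    "c1": [("H. sapiens", "p1"), ("C. elegans", "p2"), ("S. cerevisiae", "p3")],
    "c2": [("C. elegans", "p4"), ("C. elegans", "p5")],
    "c3": [("H. sapiens", "p6")],
    "c4": [("H. sapiens", "p7"), ("C. elegans", "p8")],
}
toy_labels = classify_families(toy_clusters, toy_taxa)
```

In this sketch a clade-specific family (for example, one restricted to nematodes) would fall under "partial"; separating clades would additionally require the species tree.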
Classification of protein families and domains
Birth and death evolutionary events
Duplication and deletion in universal protein families
We selected 804 universal protein families containing members present in all 11 taxa and investigated duplications and deletions among the members. Focusing on universal families helped minimize the effects of species adaptation and detect signals associated with genomic evolutionary events. Six of the 810 universal families were excluded because the large numbers of sequences (more than 1,000) in those families prohibited multiple sequence alignment and tree building. Of those examined, 12,507 duplications and 22,954 deletions were inferred, averaging 16 duplications and 29 deletions per family. In the majority of lineages, deletions outnumbered duplications; at the LCA of nematodes in particular, deletions were 28 times greater than duplications, suggesting that protein families became smaller (Table ). There appear to have been two duplication bursts, one in the LCA of metazoans with an average of 2.2 duplications per family, and one in the LCA of vertebrates, which averaged 1.95 duplications per family. All other branches had fewer than one duplication per family on average. Despite the variation in deletion events over different lineages, the numbers of deletions from the LCA to each present taxon were less variable than duplications. Comparing the terminal lineages, G. gallus had the fewest duplication events.
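The per-family averages quoted above follow directly from the reported totals. The short check below reproduces that arithmetic; the helper name is hypothetical, and the counts are the ones given in the text.

```python
def mean_events_per_family(total_events, n_families):
    """Average number of inferred events per family."""
    if n_families <= 0:
        raise ValueError("n_families must be positive")
    return total_events / n_families


n_universal = 810 - 6  # universal families minus the six excluded ones
mean_dup = mean_events_per_family(12507, n_universal)
mean_del = mean_events_per_family(22954, n_universal)
print(round(mean_dup), round(mean_del))  # prints "16 29"
```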
Domain birth and death
We successfully identified 5,106 domains from 123,084 proteins. Unlike protein families where less than 5 percent were universal (810 out of 17,752) and more than 20 percent were species specific, more than 20 percent of domains (1,172 out of 5,106) were universal and less than 6 percent were species specific (Table ).
Birth/death events of the 5,106 identified domains were inferred in the same manner as protein family birth/death events (Table ). Domains had fewer birth/death events than protein families. Consistent with that observed in protein families, there was a burst of domain births in the LCA of metazoans, and this was 2 times greater than that found in the LCA of arthropods and vertebrates after normalizing by branch lengths. However, different lineages exhibited dramatic variations in the number of birth/death events. The lineages leading to humans exhibited the largest number of domain births and the smallest number of domain deaths. In contrast, the lineages leading to T. spiralis showed the smallest number of domain births and the largest number of domain deaths. After the split, 614 domains disappeared in T. spiralis while approximately 250 domains disappeared in C. elegans and C. briggsae. Since T. spiralis is a nematode parasite and lateral gene transfer has been reported in parasitic nematodes [18], details of the 15 domains born in T. spiralis were examined. Interestingly, 13 out of 15 have been annotated as bacterial or viral protein domains (Additional file 4).
Domain duplications and deletions
Similar to family member duplications and deletions, domain duplications and deletions were analyzed for each phylogeny. For the purpose of comparability, only the 1,168 universal domains (domains present in all 11 species) were considered. In total, 49,958 duplications and 94,648 deletions were inferred for the universal domains; 5 domains were excluded because they had more than 1,000 members. As observed among universal protein family members, domain duplications and deletions varied substantially over the course of evolution, and sister lineages did not have similar numbers of duplications and deletions (Table ). However, domain duplications and deletions were more frequent than protein member duplications and deletions, averaging 43 duplications and 81 deletions per domain over the course of evolution for the species examined, starting with the LCA of metazoa.
Correlation between protein domain evolution and protein family evolution
Pearson’s correlation coefficients were used to investigate the relationship between domain evolution and protein family evolution (Table ). Coefficients between different events of the same target (i.e., between death and birth of protein families) were all negative, suggesting no significant correlation. As expected, duplications of universal domains positively correlated with duplications of universal protein families (r 4.50E-10), as did their deletions (r 8.35E-8). Protein domain deaths and protein family deaths also were positively correlated (r 4.54E-6). Unlike the close correlation between universal domain duplication and domain birth (r 8.35E-8), the correlation between protein family birth and duplication was minimal (r 0.051); protein family birth was more strongly related to domain birth and duplication. These results suggest that new protein family generation is involved in both domain duplication and new domain formation, and implicate a role for domain shuffling. It is interesting that there is no positive correlation (r = 0.553) between member deletions of the universal families and new protein family birth. It could indicate that lost members of universal families might not be a major source for new protein family formation. Conversely, there was a duplication burst in the LCA of the metazoa coincident with a large number of new families born in that lineage.
Pearson's correlation coefficients (bold text) and their significance (regular text) of different evolutionary events
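For readers who want to reproduce this kind of comparison, Pearson's correlation coefficient between two per-lineage event counts can be computed as in the sketch below. This is a generic textbook implementation, not the authors' code; the function name is hypothetical, and significance testing is deliberately left out.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Here each element of `x` and `y` would be the number of events of one type (for example, universal domain duplications and universal family duplications) inferred on the same lineage.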
Dynamic evolutionary changes over the phylogeny
Reconstructed birth/death events within protein families and domains provided opportunities to better understand evolution and adaptation. Over evolution, changes in the number of protein families differed from those of protein domains. As shown in Figure , most lineages (except for G. gallus and the LCA of nematodes) exhibited a gain in protein families, represented by the positive protein family indices, but the majority of the lineages also exhibited a loss in protein domains, represented by the negative domain indices. Nevertheless, the lineages leading to mammals exhibited domain gain (Figure ). In contrast to the large gains in both protein families and domains of the LCA of vertebrates, the LCA of nematodes exhibited dramatic losses in both of these parameters. Interestingly, all other lineages of nematodes had gains in protein families. Compared to the lineages of nematodes and vertebrates (except for the LCA), arthropod lineages (except for the LCA) exhibited either less gain or more loss. In fact, all arthropod lineages exhibited a loss in protein domains. Among all organisms examined, however, the largest loss in domains was observed in T. spiralis (Figure ). Overall the three metazoan clades showed different patterns of change. Consistent with the weak correlation between protein family birth and domain birth, little correlation in changes among protein families and domains was observed over the course of metazoan evolution.
Figure 1 Protein family and protein domain change indices. At each lineage, the index for protein family change is followed by that of domain change (separated by a slash ‘/’). The index for protein family change was calculated using the log ...
Given the lack of correlation between protein family and domain changes at all lineages, the results suggest that domain shuffling played a large role in the formation of new families. To measure this effect, we calculated the domain shuffling index, i.e., the log ratio of protein family birth to protein domain birth, for each lineage (Figure ). It is clear that the effects of domain shuffling in vertebrate lineages were less than those in arthropod and nematode lineages. This is in stark contrast to the strong increase in protein family complexity observed in vertebrate lineages. Meanwhile, domain shuffling appeared to have the strongest effects in the evolution of nematodes, where the terminal lineage of T. spiralis had the highest value (Figure ). Consistent with the smallest number of duplications in protein families and domains, the terminal lineage of G. gallus exhibited the smallest domain shuffling index.
Domain shuffling indices associated with the lineages over metazoan evolution. The indices are the log ratio of protein family birth and protein domain birth events inferred in the corresponding lineage.
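The domain shuffling index defined above is a log ratio, which can be sketched as follows. The logarithm base is not stated in the text, so base 2 is used here purely as an assumption; the function name is hypothetical.

```python
import math


def domain_shuffling_index(family_births, domain_births, base=2.0):
    """Log ratio of protein family births to protein domain births on one lineage.

    The base is an assumption; changing it rescales every index by the same
    factor and does not change the ranking of lineages.
    """
    if family_births <= 0 or domain_births <= 0:
        raise ValueError("birth counts must be positive")
    return math.log(family_births / domain_births) / math.log(base)
```

A lineage with many new families but few new domains receives a high index, which is the pattern expected when new families are assembled from existing domains.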
Complexity changes and domain shuffling indices did not inform us about temporal aspects of organism adaptation during evolution. To this end, we used the summation of the logarithm of protein family birth events and protein family death events, normalized by lineage branch lengths, as an adaptation index to define the speed of adaptation at the various lineages (Figure ). Although the values in Figure are additive and suggest a relatively constant but increasing adaptation index for all lineages, overall this index did not exhibit significant differences among the lineages, suggesting that the speed of adaptation has remained constant.
Adaptation indices associated with metazoan lineages. The indices were the summation of the logarithm of protein family birth events and death events, inferred at the corresponding lineages, normalized by the branch length of the lineage.
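One possible reading of the adaptation index is sketched below. It assumes the index is the sum of the two logarithms, log(births) + log(deaths), divided by the branch length, and that the natural logarithm is used; the wording would also allow log(births + deaths), and neither choice is confirmed by the text. The function name is hypothetical.

```python
import math


def adaptation_index(family_births, family_deaths, branch_length):
    """(log(births) + log(deaths)) / branch_length for one lineage.

    Assumes natural logarithms and a sum of logs; both are assumptions.
    """
    if family_births <= 0 or family_deaths <= 0:
        raise ValueError("event counts must be positive")
    if branch_length <= 0:
        raise ValueError("branch_length must be positive")
    return (math.log(family_births) + math.log(family_deaths)) / branch_length
```

Dividing by the branch length turns event counts into a rate, so a short branch with many births and deaths scores higher than a long branch with the same counts.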
Domain shuffling and protein family formation
The above data suggest that domain shuffling has a strong impact on protein family complexity and organism adaptation. Consistent with this, a large number of the domains in newly generated families were identified among existing domains. Figure shows the numbers of domains in the protein families born to the LCA of the three metazoan groups, and how they overlap with each other and those of the universal families. For example, 120 domains were found within the 115 families born to the LCA of nematodes, 56 of which were found in the universal families. In addition, 63 of the 120 domains were found in the families born at the LCA of arthropods and 57 were present in families born at the LCA of vertebrates. These data indicate that in the process of generating new protein families, existing protein domains play a major role that involves domain shuffling. For example, the PHD finger protein 3 of vertebrates (Cluster3894) could have been generated by first shuffling between members of the ancient proteins transcription elongation factor A (Cluster1010) and histone acetyltransferase (Cluster330), followed by the addition of a new functional domain (Figure ).
Distribution of protein domains among the protein families at the last common ancestor (LCA) of each of the three metazoan groups and the universal families.
Figure 5 A putative format for generating the vertebrate specific protein structure of PHD finger protein 3 (Cluster3894). The domain structure of PHD finger protein 3 was formed through domain shuffling between universal families, transcription elongation factor ...
The functions of families born at the LCAs of the three major clades and those born at the LCA of metazoans were investigated by biological process GO term enrichment/depletion. The GO terms enriched/depleted in these families closely align with adaptation of the species (Table ). The most significant GO terms for the families born at the LCA of nematodes are G-protein coupled receptor protein signaling pathway (p 3.54E-116), cell wall catabolic process (p 7.49E-6), trehalose biosynthetic process (p 2.37E-5), and cation transport (p 2.98E-4). The most significant GO terms for families born at the LCA of arthropods are chitin metabolic process (p = 1.08E-22), sodium ion transport (p 2.32E-18), response to stress (p 2.02E-11), and sensory perception of smell (p 4.74E-11). The top four enriched terms for the families at the LCA of vertebrates are G-protein-coupled receptor protein signaling pathway (p 7.41E-155), immune response (p 1.29E-39), regulation of cell growth (p = 2.57E-10), and cell communication (p 4.19E-10). On broader examination of the data, the top four significantly enriched GO terms for the families born at the LCA of metazoans are regulation of DNA-dependent transcription (p 1.29E-106), neurotransmitter transport (p 7.59E-11), multicellular organismal development (p = 1.25E-9), and acyl-CoA metabolic process (p
Enriched biological process GO terms in protein families born at the LCA of the three major metazoan groups and the LCA of metazoans
The functional associations of family deaths at the LCAs of nematodes, arthropods, and vertebrates were also investigated through biological process GO term enrichment (Table ). The top four enriched GO terms in family deaths at the LCA of nematodes are DNA catabolic process (p 8.53E-8), DNA repair (p 7.29E-5), regulation of Rho protein signal transduction (p 2.1E-4), and porphyrin biosynthetic process (p 2.65E-4); the top four enriched GO biological processes in family deaths at the LCA of arthropods are acyl-CoA metabolic process (p 1.75E-18), vitelline membrane formation (p = 9.07E-18), lipid transport (GO:0006869, p 2.08E-9), and sodium ion transport (p 1.81E-8); and those in family deaths at the LCA of vertebrates are G-protein coupled receptor protein signaling pathway (p 3.00E-12), intein-mediated protein splicing (p 1.00E-7), cell communication (p 3.28E-5), and chitin metabolic process (p
Enriched biological process GO terms in protein families that died at the LCAs of three major metazoan groups
Duplication of whole genome, protein families and domains
As stated earlier, two family/domain duplication bursts were observed at the LCAs of metazoans and vertebrates. In order to evaluate the effects of whole genome duplication on these two bursts, the numbers of universal families/domains involved in duplications and/or deletions at these two LCAs were examined (Figure). Results show that there are more families involved in duplications at the LCA of metazoans than at the LCA of vertebrates. Furthermore, when the numbers of families/domains involved in duplication only at these two LCAs were compared to those of families/domains involved in deletion only, the LCA of vertebrates had significantly lower values. The ratios of the vertebrate LCA were lower at 0.3 and 0.2 for family and domain, respectively, compared to the ratios of metazoan LCA at 6.7 and 4.0. These data strongly support whole genome duplication in the LCA of metazoans. Consistent with this, the universal families with only one member per species (113 families) had only 8 duplications at the LCA of vertebrates while the duplications at the LCA of metazoans numbered 61. In addition, the numbers of deletions and duplications were very similar at the LCA of vertebrates, but duplications were substantially greater than deletions at the LCA of metazoans for both universal protein families and universal domains (Figures and ). As such, the support for whole genome duplications at the LCA of metazoans is much stronger than support at the LCA of vertebrates.
Figure 6 Protein families (A) and protein domains (B) exhibiting duplication and/or deletion at the last common ancestor (LCA) of metazoans and at the LCA of vertebrates (bold). The numbers of protein families and domains without any duplication or deletions are ...
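The ratio compared between the two ancestors (families or domains with duplications only, divided by those with deletions only) can be computed as in the sketch below. It assumes the inferred events at one ancestor are available as one (duplications, deletions) pair per family or domain; the function name and input layout are hypothetical.

```python
def duplication_only_to_deletion_only_ratio(event_counts):
    """Ratio of duplication-only to deletion-only families at one ancestor.

    event_counts: iterable of (n_duplications, n_deletions) pairs, one pair
    per universal family (or domain). Entries with both kinds of event, or
    with neither, do not contribute to the ratio.
    """
    dup_only = 0
    del_only = 0
    for n_dup, n_del in event_counts:
        if n_dup > 0 and n_del == 0:
            dup_only += 1
        elif n_del > 0 and n_dup == 0:
            del_only += 1
    if del_only == 0:
        raise ValueError("no deletion-only entries; ratio is undefined")
    return dup_only / del_only
```

A ratio well above 1, as reported for the LCA of metazoans, means that most affected families gained copies without losing any; a ratio below 1, as reported for the LCA of vertebrates, indicates the opposite.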