Characterizing protein pair types
We computed the levels of domain architecture and primary sequence conservation for pairs of orthologous proteins, and compared these with the corresponding figures for paralogous proteins at the same evolutionary divergence. We considered sets of ortholog clusters as defined by InParanoid, between Homo sapiens and 40 other species. From these clusters, we extracted protein pairs falling into four types depending on their orthology status. See Figure 2 for an illustration of how the orthology status of a protein is defined. Ortholog (O) pairs contain one protein from Homo sapiens and one orthologous protein from another species. Inparalog (iP) pairs are pairs of proteins within either Homo sapiens or another species that are part of the same ortholog cluster, and that have thus arisen through a gene duplication following the divergence of the two species. Two types of outparalog pairs were considered: closest cross-species (oPx) and same-species (oPs) outparalogs. The former consist of a protein from one species and the protein in the other species to which it is most similar outside the ortholog cluster. The latter, analogously, consist of a protein in one species and the most similar protein in the same species outside the cluster.
Figure 2 Illustration of orthology definitions. Species A and B are compared. Proteins A1 and B1 are each other's closest cross-species homologs and are considered seed orthologs. Other proteins in A and B are inferred to have descended from the same ancestral (more ...)
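The pair types defined above can be sketched in code. This is only an illustration under stated assumptions: a cluster is represented as a dictionary from species name to protein identifiers, every cross-species pair within a cluster is treated as an O pair, and `similarity` is a hypothetical scoring function standing in for whatever similarity measure is actually used. None of these names come from InParanoid.

```python
from itertools import combinations, product

def cluster_pairs(cluster):
    """Enumerate O and iP pairs for one two-species ortholog cluster.

    cluster: dict mapping species name -> list of protein IDs (assumed layout).
    Returns (ortholog_pairs, inparalog_pairs, is_one_to_one).
    """
    (_sp_a, prots_a), (_sp_b, prots_b) = cluster.items()
    # O pairs: one protein from each species, both in the same cluster.
    ortholog_pairs = list(product(prots_a, prots_b))
    # iP pairs: two proteins of the same species in the same cluster.
    inparalog_pairs = list(combinations(prots_a, 2)) + list(combinations(prots_b, 2))
    # 1-1 orthologs: exactly one cluster member per species.
    is_one_to_one = len(prots_a) == 1 and len(prots_b) == 1
    return ortholog_pairs, inparalog_pairs, is_one_to_one

def closest_outside(protein, candidates, cluster_members, similarity):
    """Pick the outparalog partner: the most similar candidate outside the cluster.

    candidates: proteins of the other species (oPx) or the same species (oPs).
    similarity: hypothetical function (protein, protein) -> number.
    """
    outside = [c for c in candidates if c not in cluster_members and c != protein]
    return max(outside, key=lambda c: similarity(protein, c), default=None)

# Usage with made-up identifiers and similarity values.
example = {"human": ["H1", "H2"], "mouse": ["M1"]}
o_pairs, ip_pairs, one_to_one = cluster_pairs(example)
toy_scores = {("H1", "M2"): 40.0, ("H1", "M3"): 75.0}
partner = closest_outside(
    "H1", ["M1", "M2", "M3"], {"H1", "H2", "M1"},
    lambda a, b: toy_scores.get((a, b), 0.0),
)
```

Here the cluster yields two O pairs and one iP pair, is not a 1-1 case, and the oPx partner of `H1` is `M3`.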
We would expect function to be conserved more often in O pairs than in other pair types because, in the other pair types, gene duplications would have relaxed the pressure on one of the copies to retain the ancestral function. We further subdivided the O pair category into pairs from clusters with only a single member in each species (1-1 orthologs), and pairs where gene duplication has occurred in either species. In the non-duplicated case, function conservation should be more frequent than in the case with duplication, as the latter offers more opportunity for undergoing and retaining functional shifts (neo- [32] or subfunctionalization [33]).
Absolute domain architecture conservation
We developed a method to score the degree of domain architecture conservation between two proteins, called the DA-score. This extends the DCS score of Song et al. [20] by considering not only the domain content but also the actual alignment of domains (see Methods). In this way the domain order is taken into account. In this respect, it is similar to the domain architecture distance used by Björklund et al. [10], but inverted and normalized to form an actual similarity measure. As previously stated, while the method of Lin et al. [18] also provides a similarity measure for domain architectures, it is less straightforward and is optimized for detecting homology rather than for describing the degree of architectural difference between proteins.
To avoid biasing the results towards large clusters, we first calculated the average score over all protein pairs of a given pair category within each cluster, and then averaged these per-cluster values for each category. This is equivalent to normalizing by cluster size. Because our analysis is pairwise, a small number of large clusters, whether artifacts or consequences of intensive gene duplication within some gene families, could otherwise let a relatively small number of genes dominate the results. This might hide general trends and bias the analysis unfairly towards certain gene functions or families.
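The two-step averaging can be illustrated with a minimal sketch. The record layout `(cluster_id, pair_type, score)` and the function name are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean

def cluster_normalized_means(records):
    """Average of per-cluster averages for each pair type.

    records: iterable of (cluster_id, pair_type, score) tuples (assumed layout).
    """
    # Step 1: collect the scores of each pair type within each cluster.
    per_cluster = defaultdict(list)
    for cluster_id, pair_type, score in records:
        per_cluster[(pair_type, cluster_id)].append(score)
    # Step 2: reduce each cluster to one mean, then average those means.
    cluster_means = defaultdict(list)
    for (pair_type, _cluster_id), scores in per_cluster.items():
        cluster_means[pair_type].append(mean(scores))
    return {pair_type: mean(vals) for pair_type, vals in cluster_means.items()}

# A large cluster (four pairs) and a small one (one pair), with made-up scores.
records = [("c1", "O", 0.5)] * 4 + [("c2", "O", 1.0)]
normalized = cluster_normalized_means(records)
pooled = mean(score for _, _, score in records)
```

In this toy case the pooled mean is 0.6, pulled towards the large cluster, whereas the cluster-normalized mean is 0.75 because each cluster contributes equally.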
Figure 3A shows the domain architecture conservation for comparisons between Homo sapiens and successively more distant species (as defined by the NCBI Taxonomy) for the different pair types. Inparalog pairs, having diverged later than any of the other pair types, are in all cases the most conserved, and this difference becomes more pronounced the further back the speciation event happened. The graph shows a gradual drop in similarity across all pair types, except for a striking jump for outparalog pairs at the vertebrate/invertebrate border. The point of this accelerated outparalog divergence supports the theory of whole-genome duplications at the vertebrate root [34].
Figure 3 A. Absolute domain architecture conservation of ortholog and paralog pairs. The average DA-score for each pair type is shown for comparisons between human and other species, sorted by their distance to human. Inparalog pairs have diverged after the speciation (more ...)
As seen in Figure 3A, increasing evolutionary distance is associated with an increase in architectural differences, although the curves tend to plateau, since sequences cannot grow too different and still be recognized as homologs. Notably, outparalog pairs exhibit higher domain architecture conservation in prokaryotes than in plants. Possibly this is a consequence of prokaryotic proteins in general having fewer domains [35].
The analysis was also done for orthologs split into 1-1 orthologs and duplicated orthologs, as shown in Figure 4A. As expected, 1-1 orthologs have more conserved domain architecture. Supplementary Table 2 shows the fraction of these two pair types that have identical domain architectures. Notably, for 1-1 ortholog pairs, the fraction of pairs with identical architectures is on average 11.2% higher than for duplicated orthologs.
Figure 4 A. Absolute domain architecture conservation of duplicated orthologs and 1-1 orthologs. The average DA-score for each protein pair type is shown for comparisons between human and other species, sorted by their distance to human. Domain architecture is (more ...)
Figure 3B shows the mean sequence divergence for the same species comparisons. As a measure of sequence divergence, we calculated the evolutionary distance between each pair of proteins as the expected number of amino acid substitutions per position. Again, inparalog pairs were the most similar, followed by ortholog pairs and then outparalog pairs. Here, however, the ortholog and outparalog curves came much closer to each other for the more distant species. In other words, sequence divergence does not strongly distinguish orthologs from other homologs at high evolutionary distance. A clear jump is also seen in sequence divergence at the vertebrate/invertebrate border.
Figure 4B shows the mean sequence divergence for duplicated orthologs relative to 1-1 orthologs. For all species comparisons, the duplicated orthologs had on average diverged more than 1-1 orthologs. Interestingly, this gap is considerably wider for most vertebrates than for non-vertebrates. This may again be a consequence of large-scale neofunctionalization following whole-genome duplications at the root of the vertebrate lineage [34]. The fact that the gap is not larger for more distant vertebrates is harder to explain in this manner, and may instead hint at a generally higher degree of neofunctionalization following gene duplication within vertebrates.
While the overall trends in Figures 3 and 4 are clear, neither the DA-score nor the sequence divergence changes perfectly smoothly as we move to more distant species comparisons. This is to be expected for several reasons. The NCBI taxonomy is not perfect, and species that all share the same last common ancestor with human cannot be ranked relative to each other. Furthermore, evolutionary rates may vary between lineages, so that distance and branching order in a rooted topology will not always correspond perfectly.
Domain architecture conservation relative to primary sequence conservation
Given that InParanoid assigns orthology status using relative BLAST [36] scores, it is not surprising that orthologs on average exhibit lower sequence divergence than paralogs.
We therefore analysed how the DA-score is affected by sequence divergence, and whether this effect is the same for protein pairs of different orthology status. Data from all species comparisons were pooled, and the clusters were divided into bins based on average sequence divergence for each pair type. This can be seen as a way to normalize the DA-score by sequence divergence, in order to determine whether orthologs and non-orthologs differ in how well architecture is conserved as primary sequence diverges. The average DA-score was first calculated for all pair types within each cluster, and these values were then averaged across all clusters. The first step was done to avoid biasing the results towards large clusters.
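As a rough sketch of the binning step, the following assumes equal-width bins over a fixed divergence range; the bin borders actually used are those listed in the supplementary tables, so the bin layout, the function name, and the input format here are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def binned_mean_da(cluster_values, n_bins, max_divergence):
    """Mean DA-score per sequence-divergence bin for one pair type.

    cluster_values: iterable of (mean_divergence, mean_da_score), one entry
    per cluster, i.e. already averaged within each cluster (assumed layout).
    Bins are equal-width over [0, max_divergence]; this is an assumption.
    """
    width = max_divergence / n_bins
    bins = defaultdict(list)
    for divergence, da_score in cluster_values:
        # Values at or beyond the upper border go into the last bin.
        index = min(int(divergence / width), n_bins - 1)
        bins[index].append(da_score)
    return {index: mean(scores) for index, scores in sorted(bins.items())}

# Three clusters with made-up values, binned into four bins of width 0.5.
toy_clusters = [(0.1, 1.0), (0.2, 0.5), (1.9, 0.25)]
per_bin = binned_mean_da(toy_clusters, n_bins=4, max_divergence=2.0)
```

Running the same function once per pair type gives one curve per category, which can then be compared bin by bin.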
Are differences between categories significant in a bin analysis of this type? Sparsely populated bins might yield spuriously large differences between the mean values for the two categories just by chance. To avoid this we performed a randomization test for a significant (Bonferroni corrected p < 0.05) difference between the category means within each bin.
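One common form of such a test is a two-sided permutation test on the difference of means, followed by a Bonferroni correction across bins. The sketch below shows that generic form; the number of permutations, the test statistic, and the function names are assumptions and may differ from the procedure actually used.

```python
import random
from statistics import mean

def randomization_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference between group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign category labels at random, keeping the group sizes fixed.
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction so that the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Made-up DA-scores for two categories within a single bin.
separated = randomization_test(
    [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    n_permutations=2000,
)
identical = randomization_test([1, 2, 3], [1, 2, 3], n_permutations=200)
corrected = bonferroni([0.01, 0.5, 0.04])
```

With one p-value per bin, a bin counts as significant only if its corrected value stays below 0.05, which guards against sparsely populated bins producing large differences by chance.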
Figure 5A compares the mean DA-scores of ortholog pairs and cross-species outparalog pairs at different levels of sequence divergence. For sequence divergences above approximately 0.5 expected substitutions per site, orthologs have a significantly higher mean DA-score, and the difference increases with increasing sequence divergence. For shorter distances, a significant but very weak trend in the opposite direction can be observed. As a control, Figure 5B shows an equivalent comparison of paralog versus paralog, here inparalog pairs and same-species outparalog pairs. While some bins with significantly different means exist in this comparison as well, they are much fewer, and there is no visible separation between the categories. Figure 5C compares ortholog pairs with inparalog pairs, and again the ortholog pairs exhibit a significantly higher mean DA-score than the paralog pairs for most bins above about 0.7 expected substitutions per site. Supplementary Tables 3A-C show the exact bin borders, the number of clusters in each bin, the P-values, and the mean sequence divergence and DA-scores.
Figure 5 A. Domain architecture conservation across all species for pairs of orthologs and closest cross-species outparalogs. DA-score is averaged within ranges (bins) of sequence divergence. The scores for each pair category were first averaged within each cluster (more ...)
Conceivably, an analysis of this type may yield errors if the categories differ significantly in how their data points (clusters in this case) are distributed within each bin with respect to the binned variable. False positive trends resulting from such conditions should increase in proportion if bins are made broader, and diminish or disappear if bins are made narrower. To investigate this we repeated the analysis using 10, 20, and 100 bins, which all showed the same trend (see Additional file 2, Figure S1A-C, Additional file 3, Figure S2A-C, and Additional file 4, Figure S3A-C). This indicates that differences in sequence divergence distribution within individual bins cannot explain the significant differences in DA-score we see between the categories.
These results appear to hold for ortholog and paralog pairs above an evolutionary separation of about 0.5-0.7 expected substitutions per site. They indicate that conservation of domain architecture and conservation of primary amino acid sequence are semi-independent properties, in the sense that protein pairs at the same level of sequence conservation will often vary predictably in architecture conservation depending on their orthology status. We interpret this as a higher relative conservation of function for orthologous protein pairs, which in turn confers a higher relative conservation of domain architecture than for other homologs. This provides support for the widespread assumption that the domain architecture of a protein is informative with regard to its function.
Comparison with previous work
As stated previously, Lin et al. [18] investigated KOG [19] clusters and found that 81% of architectures in their dataset belonged to a single KOG only, and that 65% of the KOGs in their dataset contained only a single architecture. Additional file 1, Table S4 presents equivalent results for InParanoid clusters in our dataset. The overall trend across species appears to be similar. However, using InParanoid clusters, slightly fewer architectures (75% on average) are found only in a single cluster, and significantly more clusters (82% on average) contain only a single architecture. The differences between the study outcomes could stem either from our approach of using Pfam clans and collapsing repeat/motif families, which was not used by Lin et al. [18], or from differences between KOGs and InParanoid, in that the former tends to merge clusters which are distinct in the latter [37].
Degree of architectural similarity
In order to analyse the nature of architecture differences between proteins, we defined four basic classes of domain architecture similarity. Protein pairs with identical domain architectures belong to class Identical. Protein pairs with the same domain set but without all domains perfectly aligned (that is, pairs having a different order and possibly a different number of domains of each type) belong to class SameContentNotAligned. Protein pairs with overlapping but not identical domain sets belong to class DiffContent, and protein pairs with disjoint domain sets belong to class NoShared.
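The four classes can be expressed as a short decision rule. In this sketch, each architecture is assumed to be a list of domain identifiers in N- to C-terminal order; the identifiers in the usage lines are arbitrary examples, and the function name is hypothetical.

```python
def architecture_class(arch_a, arch_b):
    """Assign a protein pair to one of the four similarity classes.

    arch_a, arch_b: sequences of domain identifiers in N- to C-terminal
    order (assumed representation of a domain architecture).
    """
    if list(arch_a) == list(arch_b):
        return "Identical"
    set_a, set_b = set(arch_a), set(arch_b)
    if set_a == set_b:
        # Same domain types, but different order and/or copy number.
        return "SameContentNotAligned"
    if set_a & set_b:
        # At least one shared domain type, but not the same domain set.
        return "DiffContent"
    return "NoShared"

# Usage with arbitrary example identifiers.
classes = [
    architecture_class(["SH3", "SH2", "Pkinase"], ["SH3", "SH2", "Pkinase"]),
    architecture_class(["SH3", "SH2"], ["SH2", "SH3"]),
    architecture_class(["SH3", "SH2"], ["SH3", "SH3", "SH2"]),
    architecture_class(["SH3", "SH2"], ["SH3", "Pkinase"]),
    architecture_class(["SH3"], ["Pkinase"]),
]
```

The checks run from the strictest condition to the loosest, so each pair falls into exactly one class.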
We compared how the ortholog (O) and cross-species outparalog (oPx) pairs fell into these classes. The overall trend across all species is shown in Figure 6A. O pairs more often have identical domain architectures (true in 38 of 40 species), while oPx pairs more often have different domain content (true in 39 of 40 species). It is worth noting that it is very rare for two proteins to have the same domain content but not identical domain architectures (SameContentNotAligned). In fact, this is substantially less frequent than having different domain content. Apparently, when functional divergence is allowed, it normally involves a change in domain content rather than a mere shuffling of the existing domains.
Figure 6 A. Distribution of protein pairs across domain architecture similarity classes. All ortholog and cross-species paralog pairs in each species were divided into four qualitatively distinct classes (See Results), that are plotted as averaged fractions over (more ...)
The numbers for individual species comparisons are listed in Supplementary Table 5A. We used a χ² test to assess whether the distribution across the three classes Identical, SameContentNotAligned and DiffContent was significantly different between O and oPx pairs (the number of pairs in the NoShared class was often too small for the χ² test to be suitable). For 34 of the 40 species, the distributions were found to be significantly different (P < 0.05). All of the non-significant results were in prokaryotes.
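A test of this kind on a 2 x 3 table of counts (two pair types, three classes) can be sketched as follows. This is the standard Pearson test of independence, not necessarily the exact procedure used here, and the counts in the usage lines are invented. With two rows and three columns the test has 2 degrees of freedom, for which the p-value is exactly exp(-χ²/2).

```python
import math

def chi_square_2x3(row_o, row_opx):
    """Pearson chi-square test of independence for a 2 x 3 count table.

    row_o, row_opx: counts of O and oPx pairs in the three classes.
    Assumes every column total is positive.
    Returns (statistic, p_value, smallest_expected_count).
    """
    table = [list(row_o), list(row_opx)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            min_expected = min(min_expected, expected)
            statistic += (observed - expected) ** 2 / expected
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value, min_expected

# Invented counts: Identical, SameContentNotAligned, DiffContent.
stat, p_value, min_expected = chi_square_2x3([50, 5, 45], [20, 5, 75])
```

Returning the smallest expected count makes it easy to flag tables where the χ² approximation is doubtful, for example when an expected count falls below 5.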
Characterizing domain swapping events
For protein pairs in the classes SameContentNotAligned and DiffContent (i.e., those proteins that share at least one domain but do not have identical architectures), we further analysed the presence of the five different types of domain swapping events described in Figure (segment duplication/deletion, repetition difference, insertion/deletion of new domains, insertion/deletion of existing domains, domain shuffling).
The overall averaged results are shown in Figure 6B, and the per-species details in Additional file 1, Table S5B. Again, we compared the distributions among these categories for O versus oPx pairs. For 35 of the 40 species, the distributions among the categories repetition differences, insertion/deletion of new domains and insertion/deletion of existing domains were found to be significantly different (P < 0.05) under a χ² test. (The numbers of pairs in the other two categories were too small for the test to be suitable.) The non-significant species were all prokaryotes. Additionally, for two prokaryotes where the difference was significant, the applicability of the χ² test is questionable because at least one expected cell count was less than 5. The difference mainly arises because oPx pairs have fewer repetition differences than O pairs (true in 35 of 40 species) and more often undergo insertion/deletion of new domains (true in 34 of 40 species).
For both pair types, it is striking how rarely segment duplication/deletion, domain shuffling, and insertion/deletion of existing domains are observed. Almost all architectural differences can be explained by repetition differences or insertion/deletion of new domains. The fact that oPx pairs have a higher degree of insertion/deletion of new domains is consistent with their generally more relaxed functional constraints.
Position bias of domain architecture change events
As previous work [10] has shown that domain architecture changes seem to occur (or be fixed) preferentially at protein termini, we investigated whether such a trend holds true for this dataset as well. The change events within each orthology cluster were tallied, the fraction of events at each position was computed for each cluster, and these fractions were then averaged across all clusters. Additional file 1, Table S6 shows the distribution of events across the N-terminal, middle, and C-terminal categories. The results are broadly in agreement with previous studies [10]: terminal events are more common than architecture changes in the middle of proteins, with some bias towards the N-terminal end. The same pattern held across the different pair types.