This study utilizes three prokaryotic clades: Archaea, Proteobacteria, and Gram-positive bacteria. In this paper we use A, P, and G to denote the subgroup of an ortholog group restricted to the given clade (Archaea, Proteobacteria, and Gram-positive bacteria, respectively). The ortholog groups were extracted from the COG database [12] and were filtered so that each ortholog group has a unique homolog in each of the selected organisms. Such a stringent restriction leads to a trade-off between the number of species in a clade and the number of ortholog groups in the study. After confirming a high correlation between the values of cc(N, X) for four and six species (R2 was 86, 78, 80 depending on N [see Additional file 1]), we concluded that using four-element clades should still provide reliable results while allowing a broader range of ortholog groups to be considered (see Methods). The set of 63 ortholog groups obtained in this way was further divided into the "informational groups", containing 37 ortholog groups associated with functions related to information processing (i.e., translation, ribosomal proteins, transcription, DNA replication), and the "non-informational groups", containing the 26 remaining ortholog groups, which are mostly proteins involved in metabolism (see Additional file 2 for a full description).
2.1. Negative correlation of cc(N, X) and the relative root distance – global measurement
First, we tested whether cc(N, X) can be predicted from the information encoded in the phylogenetic tree of the ortholog group. Specifically, we used a value referred to as the relative root distance r(N, X), which approximates the ratio of the distance between the last common ancestors of the two subgroups to the average evolutionary distance of the sequences in the two subgroups (see Methods for the definition). Note that, by relying on the phylogenetic tree, this test uses information about both subgroups N and X. It therefore constitutes a global measurement of the ortholog groups (Figure ). We observed a negative correlation between cc(N, X) and the relative root distance r(N, X) (Figure ). The coefficient of determination, R2, was 0.74 and 0.72 when one of the two subgroups belonged to Archaea ((G, A) and (P, A), respectively), and 0.25 when both subgroups corresponded to bacterial clades. Hence, the value of r(N, X) is negatively correlated with cc(N, X) and can be used to predict the latter value. However, since r(N, X) is a global measure that uses information on both N and X, we cannot conclude that cc(N, X) can be predicted from N alone.
Figure 2 The dependency of the correlation between entropy profiles on the relative root distance for all three pairs of clades. Informational groups are shown as navy diamonds and non-informational groups as magenta squares. The linear regression line for ...
This experiment suggested a strong dependency of the similarity of entropy profiles on the shape of the ortholog tree. In addition, it pointed out the first of a series of differences in the properties of the informational and non-informational groups: the relative root distances in the set of informational groups are on average larger than the relative root distances in the set of non-informational groups (see summary in Table ).
Average correlation between entropy profiles for the various pairs of clades and average values of the entropy. P-values are computed based on the t-test.
2.2. Dependency of cc(N, X) on sequence conservation in group N – local measurement
The previous test demonstrated a negative correlation between cc(N, X) and the relative root distance r(N, X), which is computed on the basis of pairwise distances between protein sequences in X and N. Next, we tested whether cc(N, X) is correlated with sequence divergence within the ortholog subgroup N (Figure ). For this purpose, we measured the correlation between the negated average entropy E(N) of the subgroup N and the value of cc(N, X), for all choices of N and X (six experiments). We performed the same set of experiments using the percentage of perfectly conserved columns in N, PC(N), instead of E(N). We found that the two measures are strongly correlated (R2 > 0.95 for all subgroups) and that the results obtained with either of the two measures were very consistent. Therefore, we focused on the relation between E(N) and cc(N, X). Out of the six experiments, only the pairs (E(P), cc(A, P)) and (E(G), cc(A, G)) were correlated with R2 > 0.1 (0.17 and 0.38, respectively).
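The local quantities used in this section can be illustrated with a minimal sketch. It assumes that the entropy profile of a subgroup is the per-column Shannon entropy (in bits) of its aligned sequences, that cc(N, X) is the Pearson correlation between two such profiles taken over the same alignment columns, and that gap characters are treated like any other symbol; the exact definitions are given in Methods and may differ. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy (bits) of every column of an aligned subgroup.

    Assumption: all sequences have equal length and gaps count as symbols.
    """
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        total = len(column)
        freqs = [count / total for count in Counter(column).values()]
        profile.append(sum(p * math.log2(1.0 / p) for p in freqs))
    return profile


def average_entropy(profile):
    """Average entropy of a profile (the text uses its negated value)."""
    return sum(profile) / len(profile)


def percent_conserved(alignment):
    """Percentage of columns in which all sequences carry the same symbol."""
    columns = list(zip(*alignment))
    conserved = sum(1 for column in columns if len(set(column)) == 1)
    return 100.0 * conserved / len(columns)


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for a constant profile."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def cc(sub_n, sub_x):
    """Correlation between the entropy profiles of two subgroups."""
    return pearson(column_entropies(sub_n), column_entropies(sub_x))


# Toy four-sequence subgroups cut from one hypothetical alignment.
sub_a = ["MKVLA", "MKVLG", "MRVLA", "MKILS"]
sub_p = ["MKVLA", "MKVLA", "MRVLT", "MKVLS"]
print(-average_entropy(column_entropies(sub_a)))
print(percent_conserved(sub_a))
print(cc(sub_a, sub_p))
```

In this sketch both E(N) and PC(N) depend on subgroup N alone, whereas cc(N, X) requires the two subgroups to share alignment columns.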
Subsequently, we focused on comparing average properties of informational and non-informational groups. Although, on average, the entropy of non-informational subgroups is higher than that of informational subgroups (and the percent conservation lower), the difference is not statistically significant. In contrast, the values of cc(N, X) are significantly higher for non-informational groups (Table ). This clear difference between the two sets of ortholog groups is suggestive of non-uniformity of constraints on the informational groups. These constraints might preserve certain mutations specific to particular subgroups within the informational ortholog groups. Another striking observation was that, for non-informational ortholog groups, the average correlation coefficient is approximately the same for all pairs of clades, suggesting an additional level of uniformity of these groups.
2.3. Uncovering the relation between the cc(N, X) for different pairs of subgroups – semi-local measurement
Given the above observations, we sought to understand the separation of informational and non-informational groups in greater detail. We observed a reasonable correlation between our global measurement, the relative root distance r(N, X), and cc(N, X) (Section 2.1). In contrast, the correlation between our local measurement, the average entropy E(N), and cc(N, X) was very low (Section 2.2). Therefore, we considered an intermediate, semi-local, measurement of ortholog groups (Figure ). Specifically, we studied the dependency between cc(N, X) and cc(N, Y), where N, X, and Y are different subgroups of the same ortholog group corresponding to distinct clades. The coefficients of determination, R2, for the correlation between cc(N, Y) and cc(N, X) for the three possible combinations of subgroups were 0.78, 0.27 and 0.19, with the highest correlation for the pair (cc(A, G), cc(A, P)) and the lowest for the pair (cc(P, G), cc(P, A)) (Figure ). Just as in the previous measurements, we found that informational and non-informational proteins behave differently with respect to this measure: the values for non-informational groups showed higher correlation. Specifically, the R2 values for non-informational groups are 0.68, 0.27 and 0.31 (listed in the same order as above), while the corresponding values for the informational groups are 0.54, 0.11 and 0.02.
Figure 3 The dependency between correlation profiles cc(N, X) and cc(N, Y) for all three pairs of clades. Informational groups are shown as navy diamonds and non-informational groups as magenta squares. The linear regression line for the full set of points is shown ...
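The semi-local comparison can be sketched as follows, assuming that each ortholog group contributes one point (cc(N, X), cc(N, Y)) and that R2 is the coefficient of determination of a simple linear regression with intercept, which equals the squared Pearson correlation. The record layout, the function names, and all numeric values below are hypothetical and are not data from this study.

```python
def r_squared(xs, ys):
    """R2 of a simple linear regression of ys on xs (squared Pearson r)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return (sxy * sxy) / (sxx * syy)


def r_squared_by_class(records):
    """R2 for the full set of points and for each class of ortholog groups.

    Each record is (class_label, cc_nx, cc_ny) for one ortholog group.
    """
    result = {"all": r_squared([r[1] for r in records],
                               [r[2] for r in records])}
    for label in sorted({r[0] for r in records}):
        subset = [r for r in records if r[0] == label]
        result[label] = r_squared([r[1] for r in subset],
                                  [r[2] for r in subset])
    return result


# Invented toy values, for illustration only.
records = [
    ("informational", 0.35, 0.60),
    ("informational", 0.50, 0.42),
    ("informational", 0.62, 0.55),
    ("non-informational", 0.58, 0.61),
    ("non-informational", 0.66, 0.70),
    ("non-informational", 0.74, 0.72),
]
for label, value in r_squared_by_class(records).items():
    print(label, round(value, 2))
```

Computing the statistic separately per class, as done here, mirrors the comparison of informational and non-informational groups reported above.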
This provides yet more evidence for the observation that evolutionary pressure acts more uniformly on the non-informational groups than on the informational groups. These results also give further support to the observation that a significant fraction of the informational ortholog groups might be subject to lineage-specific evolutionary pressure. If so, this would imply that proteins in this group are not easily exchangeable between species through LGT. In contrast, the selective pressure acting on non-informational proteins is much more uniform and may more easily permit exchange of corresponding orthologs and corresponding xenologous displacement [13].
2.4. Lateral gene transfer and evolutionary pressure
The above observations suggested that proteins in informational ortholog groups may be less prone to exchange between lineages, while exchanges in the non-informational groups are more likely. To test whether this is indeed the case, we constructed evolutionary trees for all ortholog groups and manually looked for deviations from the species tree, which would imply lateral gene transfer (LGT) between the clades (see Material and methods). We found that only 3 out of 37 informational group trees had a signature of such putative LGTs, while most (18 out of 27) non-informational groups show such a signature of possible lateral gene transfer, consistent with our expectation. We found that non-informational groups have a higher correlation between cc(N, X) and cc(N, Y) than informational groups (Table ). Surprisingly, we observed no increase in the correlation between cc(A, P) and cc(A, G) for non-informational groups with LGT, and even a drop when only putative transfers from Archaea are considered. We also noted that the non-informational groups without the above-defined signature of LGT events show basic characteristics similar to those of the non-informational groups with such a signature.
Correlation (R2 value) between correlation coefficients for ortholog groups with putative LGT.
2.5. In-silico Lateral Gene Transfers (s-LGT) elucidate unifying role of Lateral Gene Transfer
We then explored more deeply this relation between LGT from Archaea to bacteria and the evolutionary pressure. Specifically, we performed a series of in-silico lateral gene transfers, s-LGT, where a random sequence from Proteobacteria or Gram-positive bacteria was replaced by a random sequence from Archaea. This process was repeated 100 times. Trends from the in-silico experiment agree with the trends seen in the real data (Table ). LGT does not always increase the correlation between the values of cc(N, X) and cc(N, Y) but can be seen as a unifying force within an ortholog group, as illustrated in Figure . That is, if we think of the correlation between (cc(N, X), cc(N, Y)) as a measure of the angle between (N, X) and (N, Y), then s-LGTs from Archaea shift the triangle A, P, G towards the equilateral shape (Figure ).
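The replacement step of the s-LGT procedure can be sketched as below. This is only an illustration under stated assumptions: each ortholog group is represented as a dictionary mapping a clade label ("A", "P", "G") to its list of aligned sequences, one replacement is made per ortholog group in each repetition, and the statistic to be averaged (for example an R2 value) is passed in as a function. These choices, and the names `s_lgt` and `average_over_s_lgt`, are hypothetical and not taken from the text.

```python
import random


def s_lgt(group, donor, recipient, rng):
    """Return a copy of `group` in which one random recipient sequence
    is replaced by one random donor sequence (the input is not modified)."""
    transferred = {clade: list(seqs) for clade, seqs in group.items()}
    donor_seq = rng.choice(group[donor])
    position = rng.randrange(len(transferred[recipient]))
    transferred[recipient][position] = donor_seq
    return transferred


def average_over_s_lgt(groups, donor, recipient, statistic,
                       n_rep=100, seed=0):
    """Average `statistic` over `n_rep` independent rounds of s-LGT.

    `statistic` takes the list of modified ortholog groups and returns
    a number; it is supplied by the caller.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = []
    for _ in range(n_rep):
        modified = [s_lgt(g, donor, recipient, rng) for g in groups]
        values.append(statistic(modified))
    return sum(values) / n_rep


# Toy ortholog group with four placeholder sequences per clade.
toy_group = {
    "A": ["MKVLA", "MKVLG", "MRVLA", "MKILS"],
    "P": ["MQVLT", "MQVLS", "MQILT", "MHVLT"],
    "G": ["MEVLN", "MEVLD", "MDVLN", "MEILN"],
}
after = s_lgt(toy_group, donor="A", recipient="P", rng=random.Random(1))
print(after["P"])
```

A call such as `average_over_s_lgt(groups, "A", "P", statistic)` would correspond to the A2P direction; swapping the recipient to "G" gives A2G.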
Results of in silico LGT (s-LGT) from Archaea to one of the bacterial clades (A2G or A2P). R2 values for s-LGT are the average over 100 simulations.
Figure 4 Graphical illustration of the unifying role of s-LGT from A to P: decreasing the value of the R2 for (cc(A, P), cc(A, G)) corresponds to increasing the angle A and X and increasing the value of the R2 for (cc(G, A), cc(G, P)) corresponds to decreasing ...