The Ortholuge approach for increasing the specificity of ortholog predictions is outlined in Figure . Based on the analyses described below, the details of this approach were formulated and the approach was validated using both prokaryotic and eukaryotic data sets. Ortholuge software is available [28] to assist with the analysis of data sets other than those reported here.
Figure 2 An overview of the Ortholuge method. (A) Flow-chart outlining the main steps of the method. (B) The three ratios computed by Ortholuge. The phylogenetic distances in the numerator (dark line) and denominator (dashed line) for each ratio are shown, overlaid (more ...)
Data sets exhibited little bias due to the automated sequence alignment trimming approach
We investigated the behaviour and utility of Ortholuge through analysis of diverse eukaryotic and bacterial RBH-derived datasets. For the initial test eukaryotic data set, we chose predicted mouse-rat-human orthologs from the expressed sequence tag (EST) data in TIGR's Eukaryotic Gene Ortholog (EGO) database [5] (for a mouse-rat comparison, with human as the outgroup). The majority of our subsequent analyses utilized the higher quality MGD-based dataset (see Methods describing datasets) and the RefSeq-based RBH dataset composed of these same species, as indicated. For the bacterial data set, we chose three gamma-proteobacteria: Escherichia coli, Pseudomonas putida, and Pseudomonas syringae (for a P. putida-P. syringae species comparison, with E. coli as the outgroup). Orthologs between these three species (and other sets of species subsequently examined) were predicted using a transitive RBH approach, applied to the deduced proteins from complete genome sequences [10
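The transitive RBH step can be illustrated with a minimal sketch. It assumes that "transitive" means all three pairwise reciprocal best hits must agree on the same three genes; the function names and the `best` data layout are hypothetical and are not taken from the Ortholuge software.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def transitive_rbh_triples(best, in1, in2, out):
    """Collect (ingroup1, ingroup2, outgroup) gene triples.

    `best[(x, y)]` is an assumed layout: a dict mapping each gene of
    genome x to its single best hit in genome y.
    A triple is kept only if all three pairwise RBH relations agree.
    """
    rbh_12 = reciprocal_best_hits(best[(in1, in2)], best[(in2, in1)])
    rbh_1o = dict(reciprocal_best_hits(best[(in1, out)], best[(out, in1)]))
    rbh_2o = dict(reciprocal_best_hits(best[(in2, out)], best[(out, in2)]))
    triples = []
    for g1, g2 in sorted(rbh_12):
        o = rbh_1o.get(g1)
        # Transitivity: both ingroup genes must pair with the same outgroup gene.
        if o is not None and rbh_2o.get(g2) == o:
            triples.append((g1, g2, o))
    return triples
```

In practice the best-hit tables would be derived from all-against-all BLAST results; that parsing step is omitted here.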
Accurate sequence alignment is critical for phylogenetic analysis; thus, we wished to improve the automated alignment and trimming components of the Ortholuge method. We therefore performed a comprehensive examination of biases in our automated alignment editing process (see Methods). A sample of RBH-predicted ortholog sequence sets was analyzed to devise the gap-masking and sequence trimming approaches. The sequence sets were examined to identify both gaps introduced by misalignments and gaps introduced through sequence insertions and deletions. Our observations suggested that some of the noise introduced by misalignment may be alleviated by removing the portions flanking gapped segments. We also noted that there was no appreciable effect on the sequence distances when the flanking sequences around gapped regions arising from true sequence variation were removed. We manually simulated gap masking over the sequences using various window length criteria to establish a gap-masking approach with a relatively conservative worst-case scenario. Both the trimming and gap-masking methods were evaluated for the introduction of ratio distribution biases by selected alignment characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming (Fig. ).
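To make the idea of masking gapped segments together with their flanking portions concrete, the following is a minimal sketch. The window length (`flank`), the gap character, and the function names are assumptions for illustration; the actual criteria used by Ortholuge are described in the Methods and are not reproduced here.

```python
def gap_mask_columns(alignment, flank=2):
    """Return the indices of alignment columns to keep.

    A column is masked if any sequence has a gap ('-') in it, or if it
    lies within `flank` columns of such a gapped column.
    `flank` is a hypothetical window length, not the published value.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    gapped = [i for i in range(length) if any(seq[i] == "-" for seq in alignment)]
    masked = set()
    for i in gapped:
        masked.update(range(max(0, i - flank), min(length, i + flank + 1)))
    return [i for i in range(length) if i not in masked]


def apply_mask(alignment, keep):
    """Rebuild each aligned sequence from the retained columns only."""
    return ["".join(seq[i] for i in keep) for seq in alignment]
```

Varying `flank` and recomputing the distance ratios is one way to check whether the masking itself shifts the ratio distributions.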
Figure 3 Ratio 1 (R1) ratio distribution curves for selected alignment characteristics. Higher quality mouse-rat-human ortholog sequence sets were analyzed to devise the gap-masking and sequence trimming approaches. These methods were evaluated for the introduction (more ...)
Ortholuge produces ratios which form distributions
Ortholuge was designed to overcome certain limitations of the RBH method, such as the problem illustrated in Figure . It does so by using ratios of phylogenetic distances between genes to evaluate orthology, with an outgroup species serving as a reference for the two ingroup species being compared (Fig. ). For these three species, the pairwise distances within each "ortholog triple" are calculated, and from them the three possible ratios are computed (Fig. ). With this approach, the problem illustrated in Figure would be detected because the human-cattle distance is unexpectedly larger than the human-mouse distance, which affects the ratio values. We ran Ortholuge on three mouse-rat-human datasets: two sets of RBH-predicted orthologs (one based on EGO data and the other based on RefSeq data) and a third high-quality curated set. For all datasets, human was the outgroup used to help predict more precise orthologs between mouse (ingroup1) and rat (ingroup2). The resulting Ortholuge phylogenetic distance ratios are shown in Figure and Supplemental Figure as histograms. For each of the three ratios, we tabulated the frequency of putative orthologous groups within certain ratio value ranges. Ratio1, Ratio2, and Ratio3 each form clear distributions. Ratio3 is generally located around a ratio value of 1, which is expected if the chosen outgroup is more distant relative to the ingroups. It is centered to the left or right of 1 depending on which of the two ingroups is closer to the outgroup. The Ratio1 and Ratio2 distributions are generally located at a ratio much lower than 1, reflecting the closer relationship between the ingroup species versus any ingroup to the outgroup. We ran our analyses on both protein and nucleotide sequences and found that for closely related species such as these, nucleotide sequences provide a better ratio distribution resolution. However, the overall ratio distributions are similar, even when using different methods of initial ortholog detection (see Figure of [Additional file 1
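The three ratios can be sketched in a few lines. The definitions below are inferred from the behaviour described in this section (Ratio1 and Ratio2 compare the ingroup-ingroup distance with an ingroup-outgroup distance, and Ratio3 equals Ratio2 divided by Ratio1); the class and function names are hypothetical, and the distances are assumed to have been computed already from the trimmed alignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrthologTriple:
    """Pairwise phylogenetic distances for one putative ortholog triple."""
    d_in1_in2: float  # ingroup1 to ingroup2
    d_in1_out: float  # ingroup1 to outgroup
    d_in2_out: float  # ingroup2 to outgroup


def ortholuge_ratios(t):
    """Return (Ratio1, Ratio2, Ratio3) under the inferred definitions."""
    if t.d_in1_out <= 0 or t.d_in2_out <= 0:
        raise ValueError("ingroup-outgroup distances must be positive")
    ratio1 = t.d_in1_in2 / t.d_in1_out
    ratio2 = t.d_in1_in2 / t.d_in2_out
    ratio3 = t.d_in1_out / t.d_in2_out  # equals ratio2 / ratio1 when ratio1 > 0
    return ratio1, ratio2, ratio3
```

For a triple in which the ingroups are much closer to each other than to the outgroup, Ratio1 and Ratio2 fall well below 1 and Ratio3 lies near 1, matching the distributions described above.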
Figure 4 Histogram illustrating the distribution of RBH-predicted (i.e. putative) orthologous groups across the three Ortholuge distance ratios. The results for predicted mouse-rat-human RBH ortholog sets (EGO RBH data set; 19,200 ortholog groups) are shown. Each (more ...)
We also performed this analysis with our bacterial P. putida-P. syringae-E. coli orthologs, comparing P. putida (ingroup1) and P. syringae (ingroup2) using E. coli as the outgroup. We observed very similar results: the eukaryotic and prokaryotic data sets are consistent both in the distributions formed and in the approximate position of the distributions. Since we expected most ssd-orthologs (see Introduction for definition) to evolve in a similar manner, we hypothesized that orthologs falling within the higher frequency ranges of the distributions are more likely to be ssd-orthologs compared to those that are outliers. In essence, what defines the species divergence is the divergence observed for most genes (i.e. the highest frequency ranges).
Ortholuge ratios can also be conveniently visualized in an R1 × R2 plot
Instead of histograms (Fig. ), an alternative way to represent Ortholuge ratios is to use a 2-dimensional plot of two Ortholuge ratios, where each putative ortholog group is represented by one point in the graph. In principle, any two of the three ratios can be used for the plot, since the three ratios are related: Ratio3 equals Ratio2 divided by Ratio1. Through subsequent analyses, we found that the Ratio1 and Ratio2 combination (i.e. an R1 × R2 plot) was the simplest to visualize and to work with.
For the R1 × R2 plots, the eukaryotic mouse-rat-human RBH-predicted putative orthologous groups appear to occupy three types of positions (Fig. and ). (1) The majority of points form a cluster (highest frequency range) at low Ratio1 and Ratio2 values. In fact, about 85% of orthologs have Ratio1 and Ratio2 values less than 1. (2) Some points with higher Ratio1 values are located along a curve that approaches, and then falls along, the line Ratio2 = 1. This is consistent with an unusually high divergence of a gene from ingroup 2. (3) Conversely, some points with higher Ratio2 values are located roughly along the line Ratio1 = 1. This is consistent with an unusually high divergence of a gene from ingroup 1. The RBH-predicted orthologous groups for the P. putida-P. syringae-E. coli species show a similar R1 × R2 plot (Fig. and ). Consistent with the eukaryotic results, the vast majority of orthologous groups for this prokaryotic analysis also exhibit Ratio1 and Ratio2 values less than 1.
Figure 5 Ortholuge R1 × R2 plots (Ratio1 versus Ratio2) for selected eukaryotic data, where each point represents one putative ortholog group. (A) Putative orthologous groups identified using RBH for mouse-rat-human (Figure 4 shows the corresponding histogram). (more ...)
Figure 6 Ortholuge R1 × R2 plots for the prokaryotic data, illustrating two ortholog data sets and a true-negative data set. (A) Putative orthologous groups from an RBH-predicted data set. (B) Probable true orthologs from a higher quality (more precise) (more ...)
We expected most ssd-orthologs to evolve in a similar manner, and found that most orthologous groups form a cluster (high frequency range) in an R1 × R2 plot. Therefore, we hypothesized that orthologous groups falling within the high frequency range are more likely to contain ssd-orthologs. Conversely, those outside of this range (i.e. high Ratio1 or Ratio2 values) are more likely to contain, in an ingroup, either an ortholog that has undergone unusual divergence, or a paralog.
"Higher quality" orthologous groups are found primarily in "low" Ortholuge ratio ranges, in R1 × R2 plots
The data sets of tentative orthologs predicted above by an RBH approach will certainly contain genes that are being falsely identified as orthologs. It is difficult, if not impossible, to obtain a dataset of this size that contains only true orthologs, due to the inherent nature of inference associated with evolutionary study. However, data sets of "higher" and "lower" quality can be constructed and examined (see Methods), to observe how their Ortholuge ratios change in comparison to each other. These data sets should contain a notably greater or smaller proportion of true orthologs, respectively.
We therefore examined the behaviour of Ortholuge ratios for a higher quality data set of probable orthologs. Curated orthologs between human, mouse, and rat genomes were acquired from the Mouse Genome Database (MGD). Figure and illustrate that this higher quality data set occupies a smaller area of the R1 × R2 plot. This smaller area is observed even when the number of points is normalized to the number plotted for the RBH-based data (data not shown). For this higher quality (more precise) data set there are notably fewer points along the lines Ratio1 = 1 and Ratio2 = 1 in the plot, compared to the RBH-based data plot in Figure and .
Conversely, we examined the ratios associated with a "lower quality" data set, involving RBH-predicted orthologs for bovine, human, and mouse, from TIGR's EGO database (with mouse as the outgroup). The incomplete state of the bovine genome data at the time of this analysis should lead to more falsely predicted orthologs, since some true orthologs will be missing from the bovine dataset (see Fig. for a scenario). These results are shown in Figure and . Note the higher number of points with a high Ratio2 value, falling along the line Ratio1 = 1; these points are consistent with how the ratio would behave if the bovine data contained paralogs that were notably more divergent than expected for most orthologs.
To gain a sense of the differences between plots of datasets of different quality, note that 97% of the high quality dataset points (Fig. ), 86% of the RBH-predicted ortholog group points (Fig. ), and only 73% of the low quality data set points (Fig. ) lie below Ratio1 and Ratio2 values of 1. These results suggest that true orthologs (or at least more precise ortholog data sets) tend to fall within the bulk of the highest frequency range (i.e. relatively "low" Ratio values in an R1 × R2 plot), while orthologs with unusual divergence patterns (non-ssd-orthologs) and paralogs have either high Ratio1 or high Ratio2 values.
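The summary statistic used in this comparison, the share of points with both Ratio1 and Ratio2 below 1, is straightforward to compute. The sketch below assumes each putative orthologous group is given as a (Ratio1, Ratio2) pair; the function name and the example values are hypothetical.

```python
def fraction_below_one(points):
    """Fraction of (ratio1, ratio2) points with both ratios below 1."""
    if not points:
        raise ValueError("at least one point is required")
    inside = sum(1 for r1, r2 in points if r1 < 1 and r2 < 1)
    return inside / len(points)


# Hypothetical example: two points in the main cluster, one near the
# line Ratio2 = 1 and one near the line Ratio1 = 1.
example_points = [(0.2, 0.3), (0.4, 0.2), (1.8, 1.0), (1.0, 2.5)]
example_fraction = fraction_below_one(example_points)
```

Applying the same function to data sets of differing quality gives a single number per data set that can be compared directly.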
For the prokaryotic analysis, a higher quality data set was compared to the RBH-based data set as well. Figure and illustrate the same trend as the eukaryotic data, with respect to how the R1 × R2 plots look for more precise and less precise ortholog data sets.
Known paralogs (true-negatives) introduced into orthologous groups generate either high Ratio1 or high Ratio2 values, as shown in a gene loss/incomplete genome simulation
The above comparisons of higher quality (more precise) and lower quality (less precise) ortholog data sets support our hypothesis that orthologs and paralogs fall within different regions of the R1 × R2 plot. However, a stronger argument can be made by examining specifically where falsely predicted orthologs (true paralogs) occur in such distributions. A true-negative data set was therefore constructed by removing genes from one of the ingroup gene data sets and then identifying the next best reciprocal BLAST hit with the other ingroup (ensuring transitivity of this introduction with the other ingroup and outgroup). Therefore a true negative is essentially an ortholog triple which has been transformed into a false positive by introducing a less similar sequence for one of the species sequences. These true negatives represent the types of ortholog predictions that would result from an RBH-method in scenarios such as Figure . Since we know that RBH can make incorrect predictions when a genome is incomplete or when gene loss has occurred, this analysis simulates what would occur with the RBH method in such cases. The benefit of this analysis is that we specifically know the true-negatives introduced, allowing us to examine how the Ortholuge ratios for these true-negatives (paralogs) behave.
For the E. coli-P. putida-P. syringae input ortholog groups, we constructed two true-negative data sets. In the first, we replaced P. putida genes with their next best RBH hit to P. syringae, resulting in ingroup1 paralogs. In the second, we replaced P. syringae genes with their next best RBH hit to P. putida, resulting in ingroup2 paralogs. For both, we conservatively introduced all possible paralogs into the analysis, resulting in roughly 50% of the genes converted to true-negatives (i.e. conservative, because most data sets would never contain this many true-negatives). The results from these two data sets (Fig. and ), show that these true-negatives overlap very little with the RBH-predicted orthologs (Fig. ) or with the high quality (more precise) orthologs (Fig. ). This demonstrates that even with all possible true paralogs simulated, very few of them are falling within the higher frequency ranges of the RBH distributions.
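The construction of an ingroup1 true-negative set can be sketched as follows. This is a simplified illustration: it assumes a precomputed ranking of ingroup1 candidates for each ingroup2 gene (`ranked_hits_in1`, a hypothetical input) and omits the reciprocity and transitivity checks described above.

```python
def introduce_ingroup1_paralogs(triples, ranked_hits_in1):
    """Replace the ingroup1 gene of each triple with its next best hit.

    `triples` holds (ingroup1, ingroup2, outgroup) gene identifiers.
    `ranked_hits_in1[g2]` is an assumed list of ingroup1 genes ordered
    from most to least similar to the ingroup2 gene g2.
    Triples with no alternative candidate are left out of the result.
    """
    true_negatives = []
    for g1, g2, out in triples:
        # Simulate loss of the original gene by skipping it in the ranking.
        candidates = [g for g in ranked_hits_in1.get(g2, []) if g != g1]
        if candidates:
            true_negatives.append((candidates[0], g2, out))
    return true_negatives
```

The ingroup2 case is symmetric: the ranking is keyed by the ingroup1 gene and the second element of each triple is replaced instead.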
We also constructed a third true-negative data set with all outgroup genes (E. coli) replaced by their next best RBH hit to both P. syringae and P. putida. The R1 × R2 plot (Figure ) shows that these true-negative cases plot at lower Ratio1 and Ratio2 values and do not separate well from what would be expected for true-orthologs. This is actually promising, since in the case of a paralog in an outgroup, the two ingroups should still be regarded as probable true orthologs and should still fall within the main cluster of true-orthologs, as we observe. In other words, since the goal of Ortholuge is to improve ortholog identification between the two ingroups, it is beneficial that an outgroup paralog does not generally interfere with the analysis.
Figure 7 R1 × R2 plots, for the prokaryotic data, illustrating the effect of introducing outgroup paralogs (outgroup ortholog true-negatives) in the analysis. Unlike for other figures of R1 × R2 plots in the paper, only ratio ranges from 0 to 2 (more ...)
Ortholuge ratio cut-offs, to separate orthologs from paralogs, can be determined based on an iterative-true-negative analysis
After determining that the introduced true-negatives almost never fall within certain ratio ranges, it became clear that ratio cut-offs could be derived to exclude most true-negatives, and thus improve the specificity (precision) of ortholog prediction. To do this, another strategy was employed to simulate the introduction of paralogs (true-negative ortholog predictions) and then formulate ortholog identification cut-offs. This second strategy, involving an iterative-true-negative analysis, allows one to view the variance in the proportion of true-negatives in a particular ratio range, and is also amenable to high-throughput use for the formulation of cut-offs. For both the eukaryotic (human-mouse-rat) RBH-predicted data set (RefSeq-based), and the prokaryotic RBH-predicted data set, we conservatively modeled an incomplete genome (or gene loss) scenario by randomly replacing 25% of the genes in the RBH-predicted data set with the "next best RBH" hit (i.e. a true-negative). This randomized introduction of true-negatives was iterated at least 50 times, and each iteration was evaluated by Ortholuge. The proportion of true-negative orthologs was averaged over all iterations and the standard deviation determined. We found that, once again, the ratio values of true-negative orthologs do not overlap well with those of the bulk of RBH-predicted orthologous groups (Figure and Supplemental Figures and ).
Figure 8 Example of the generation of cut-offs for classification of ssd-orthologs and probable paralogs, based on an iterative-true-negative analysis (i.e. based on an introduction of random sets of true-negatives). The particular analysis illustrated here is (more ...)
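The iterative-true-negative analysis can be sketched as a simple resampling loop. The version below is a simplification under stated assumptions: the ratios of each group before and after replacement are taken as precomputed inputs, and the function name, the `in_range` predicate, and the fixed seed are hypothetical choices made for reproducibility of the example.

```python
import random
import statistics


def true_negative_proportion(orthologs, true_negatives, in_range,
                             fraction=0.25, iterations=50, seed=0):
    """Mean and standard deviation of the true-negative share in a ratio range.

    `orthologs[group_id]` is the (ratio1, ratio2) pair of an RBH-predicted
    group; `true_negatives[group_id]` is the pair for the same group after
    replacement with the next best RBH hit. `in_range(r1, r2)` decides
    whether a point lies in the ratio range of interest.
    """
    if iterations < 2:
        raise ValueError("at least two iterations are needed for a standard deviation")
    rng = random.Random(seed)
    replaceable = sorted(true_negatives)
    n_replace = min(len(replaceable), round(fraction * len(orthologs)))
    proportions = []
    for _ in range(iterations):
        chosen = set(rng.sample(replaceable, n_replace))
        total = hits = 0
        for group_id, ratios in orthologs.items():
            is_tn = group_id in chosen
            r1, r2 = true_negatives[group_id] if is_tn else ratios
            if in_range(r1, r2):
                total += 1
                hits += is_tn
        proportions.append(hits / total if total else 0.0)
    return statistics.mean(proportions), statistics.stdev(proportions)
```

Evaluating this function over a series of adjacent ratio ranges gives, for each range, an estimate of how often true-negatives land there and how much that estimate varies between random draws.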
For both the prokaryotic and eukaryotic RBH-based data sets, this iterative true-negative analysis was used to determine ratio ranges where true paralogs were very unlikely to land and ranges where they were very likely to land. The borders of these ranges (described in Figure and Supplemental Figures and ) became the ratio cut-off values. This permitted classification of the RBH-predicted tentative orthologous groups into probable ssd-orthologs, probable paralogs, or "uncertain" categories. It should be noted that a more accurate name for the 'probable paralog' category might be 'probable non-ssd-ortholog,' because there may be true orthologs that have undergone unusual divergence in one ingroup species within this category. However, in such cases the non-ssd-orthologs may have functionally diverged, and therefore are cases that we would want to differentiate from our ssd-ortholog set. Regardless, for ease of comprehension, we propose to call those cases with very atypical ratios (in the range of what is observed for paralogs) "probable paralogs", since paralogs likely predominate in this region.
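Once cut-offs are chosen, the three-way classification is a simple rule. The sketch below assumes one pair of thresholds applied to both Ratio1 and Ratio2; the parameter names are hypothetical, and any numeric values passed in are placeholders rather than the cut-offs derived in this study.

```python
def classify_group(ratio1, ratio2, ortholog_max, paralog_min):
    """Classify a putative orthologous group from its Ratio1 and Ratio2.

    `ortholog_max` is the upper bound of the range where true-negatives
    are very unlikely; `paralog_min` is the lower bound of the range where
    they are very likely. Both are user-supplied, hypothetical parameters.
    """
    if ortholog_max >= paralog_min:
        raise ValueError("ortholog_max must be below paralog_min")
    if ratio1 >= paralog_min or ratio2 >= paralog_min:
        return "probable paralog"
    if ratio1 <= ortholog_max and ratio2 <= ortholog_max:
        return "probable ssd-ortholog"
    return "uncertain"
```

Moving `ortholog_max` lower trades sensitivity for specificity, which is the choice left open to users of the cut-offs.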
We chose a 25% true-negative introduction, since this is likely above a worst-case scenario in terms of the number of genes that may be missing in an incomplete genome, or most cases of naturally occurring gene loss. We felt it was important to "saturate" the data set with true-negatives, because any given RBH-based dataset will likely contain some proportion of false-positives in the putative orthologous groups (i.e. it is difficult to ensure one has a completely true-positive set of orthologs). Therefore, to effectively identify the ranges where true-negatives were becoming increasingly common, we needed to observe a large proportion of true-negatives. However, we did not want to transform a data set with all possible true-negatives, as this would not provide a sense of the variation in the proportion of true-negatives within a given ratio range. Note that we also chose to report the results here for a transformation of an RBH-predicted data set with the true-negatives (i.e. a RefSeq-based RBH analysis), rather than transforming a high quality dataset, since the RefSeq-based analysis could more easily be fully automated (i.e. it did not require developing a curated set of high quality orthologs). However, transformation of a eukaryotic high quality dataset with true-negatives generated similar cut-off values (data not shown). Through an iterative sampling approach we were able to generate standard deviations of the proportion of true-negatives in a given ratio range (Figure ), providing a clearer picture of the likelihood of a true-negative occurring in that range.
Ortholuge ratios in combination can help predict which gene in a given putative orthologous group is likely a paralog
A closer inspection of the Ortholuge ratios shows that they behave in a predictable fashion when the ortholog group contains one or more false-positives (Table ). For example, if ingroup1 is actually a paralog, then the distance between ingroup1-outgroup and the distance between ingroup1-ingroup2 would be larger than the norm for an ssd-ortholog. This would cause Ratio2 to increase (the degree of increase would depend on how diverged the paralog is from the missing 'true' ortholog), and Ratio1 to increase a slighter amount (depending on how distant the outgroup is). Conversely, if ingroup2 is actually the paralog, then Ratio1 would be expected to increase and Ratio2 to increase slightly. These predictable changes do indeed occur, as illustrated by an analysis of true-negatives (Figure and ), an analysis of a dataset of tentative orthologs identified by RBH using an incomplete genome (Figure and ), and an additional manual review of selected cases (data not shown). We propose that when unusual ratio ranges are identified for a given orthologous group, the relative changes can facilitate predictions regarding which of the two ingroups may contain a paralog (or non-ssd-ortholog).
Ortholuge ratios can help predict which gene in a given putative orthologous group is likely a paralog.
Note that an outgroup paralog cannot be well predicted; however, this does not affect the utility of Ortholuge, since the method is focused on characterizing the orthology of the two ingroups. It should also be noted that multiple-paralog scenarios (last three rows in Table ) are more complex. Though relatively easy to predict on paper, they are more difficult to distinguish in reality, because the amount of divergence for the two paralogs may vary greatly. In most cases they would resemble one of the first three scenarios, depending on which of the two paralogs was more diverged. Nevertheless, these rare cases (two paralogs in a group of three) will still most frequently display atypical ratios, and will not fall within probable ortholog cut-offs.
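The single-paralog reasoning above can be expressed as a small decision rule. This sketch covers only the two ingroup cases described in the text (a large increase in Ratio2 points to ingroup1, a large increase in Ratio1 points to ingroup2); the `atypical` threshold and the function name are hypothetical, and multiple-paralog scenarios are not distinguished.

```python
def suspect_ingroup(ratio1, ratio2, atypical=1.0):
    """Suggest which ingroup gene may be a paralog (or non-ssd-ortholog).

    Returns None when neither ratio reaches the `atypical` threshold,
    which is a placeholder value and not a published cut-off.
    """
    if ratio1 < atypical and ratio2 < atypical:
        return None
    if ratio2 > ratio1:
        # Ratio2 rises most when the ingroup1 gene is the divergent one.
        return "ingroup1"
    if ratio1 > ratio2:
        # Ratio1 rises most when the ingroup2 gene is the divergent one.
        return "ingroup2"
    return "undetermined"
```

A result from such a rule is only a pointer for manual review, since the size of each increase depends on how diverged the paralog is and how distant the outgroup is.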
Ortholuge in action: an estimation of probable ssd-orthologs and probable paralogs in RBH-based data sets
An example of ratio cut-offs generated based on our true-negative analysis is listed in Table (see also Figure and Supplemental Figures and ). Researchers are of course encouraged to choose their own cut-off to suit their needs (i.e. more sensitivity or specificity). However, based on our simulations, these cut-offs should effectively differentiate probable orthologs and paralogs for these data sets. We also propose that these cut-offs can identify those orthologs most closely following species divergence (i.e. ssd-orthologs) – orthologs which may be more functionally similar to each other versus those that have diverged at different evolutionary rates in each species.
Proportion of RBH-predicted orthologs that are likely ssd-orthologs and likely paralogs, according to Ortholuge analysis.
Using the derived ratio cut-offs, we have constructed several data sets of probable ssd-orthologs consisting of: mouse-rat comparisons (with human as the outgroup), and one for a P. putida-P. syringae comparison (with E. coli as the outgroup). These ssd-orthologs are particularly suited for comparative genomics analyses. In addition, notations are added to all the data analysed, indicating cases of probable gene duplication after species divergence ("possible in-paralog"), a scenario that can increase the likelihood of functional divergence of the genes. These higher quality sets of orthologs can be found via the Ortholuge website [28]. The proportion of ssd-orthologs in the RBH-predicted data sets is summarized in Table . Note that cases of in-paralogs are not included in the counts of ssd-orthologs in Table . Such cases, because a gene duplication makes their potential for functional divergence uncertain, are counted within the "uncertain" category.
Using the cut-offs, we were also able to estimate the proportion of RBH-predicted orthologs that are likely paralogs for these eukaryotic and prokaryotic data sets (Table ; see also data available on the Ortholuge website [28], which includes a classification of the EGO dataset using the RefSeq analysis cut-offs). For the prokaryotic data, about 5% of RBH-based predictions are probable paralogs. For the eukaryotic data, about 10% of the RBH-based predictions are probable paralogs. These are substantial proportions that support the need for a method like Ortholuge, particularly if one is trying to use RBH-predicted orthologs for downstream analyses that require stringent ortholog prediction (for example, for regulatory element detection).
Application of these cut-offs to classify the curated eukaryote and prokaryote datasets suggests that the false negative rate is in the range of 0.7% for the prokaryote data and 3% for the eukaryote data.
To facilitate the analysis of other datasets, we have developed Ortholuge software that can be used to characterize any existing dataset of orthologs. If no pre-existing ortholog dataset is available, Ortholuge can also construct such a dataset using an RBH-based approach applied to whole genome datasets (or other adequate datasets of genes from three organisms that a user supplies). Ortholuge was developed using Perl under Linux (SuSE 9.0 and RH 9.0) and operates in any UNIX environment, provided all the needed tools (see Methods) are available for the user's operating system. This freely available, open-source software can be obtained from the Ortholuge website [28