Figure 1 illustrates the basic steps in analyzing a genome using the DarkHorse algorithm, with Escherichia coli strain K12 as an example. In addition to protein sequences from the test genome and a reference database, program input includes two user-modifiable parameters: a list of self-definition keywords and/or taxonomy id numbers, and a filter threshold setting. The self-definition keywords determine phylogenetic granularity of the search and relative age of potential horizontal transfer events being examined. The filter threshold setting is a numerical value used to adjust stringency for relative database abundance or scarcity of sequences from species closely related to the test genome. These parameters can be varied independently or iteratively in repeated runs to fine-tune the scope of the analysis.
Figure 1. Flow diagram illustrating the DarkHorse workflow, with example numbers for Escherichia coli strain K12. Parallelograms indicate data, rectangles indicate processes. Parallelograms with dashed borders indicate intermediate data, output by one step and input ...
The process begins with a low stringency BLAST search, performed for all predicted genomic proteins against the reference database. All BLAST matches containing self-definition keywords and/or taxonomy id numbers are eliminated from these search results. For each genomic protein, the remaining BLAST alignments are filtered to select a candidate match set, based on both query-specific BLAST scores and the global filter threshold setting. Database proteins with the maximum bit score from each candidate set are used to calculate preliminary 'lineage probability index' (LPI) scores. LPI is a new metric introduced in this paper that is key to the genome-wide identification of horizontally transferred candidates. Organisms closely related to the query genome receive higher LPI scores than more distant ones, and groups of phylogenetically related organisms receive similar scores to each other, regardless of their abundance or scarcity in the reference database. Details of the procedure used to calculate LPI scores are presented in the Materials and methods section.
Preliminary LPI scores are used to re-order the candidate sets, now choosing the candidate with the maximum LPI score from each set as top-ranking. These revised top-ranking matches are then used to refine preliminary LPI scores in a second round of calculation. Final results are reported as a tab-delimited table; an example of this output is provided as Additional data file 1.
GenBank nr was chosen as the reference database for this study to obtain the widest possible diversity of potential matches, but the algorithm could alternatively be implemented using narrower or more highly curated databases. The set of query protein sequences must be large enough to fairly represent the full range of diversity present in the entire genome. The easiest way to ensure unbiased sampling is to include all predicted protein sequences from a genome, but this requirement might also be met in other ways, for example, with a large set of cDNA sequences. BLAST searches performed using predicted amino acid sequences were found to be more useful than nucleic acid searches, resulting in fewer false positive matches and giving a more favorable signal/noise ratio.
Parameter settings for the preliminary BLAST search are used as a coarse filter to reduce computation time and memory requirements, removing low scoring matches as early as possible. These initial settings need to be broad enough to include even very distant orthologs, but do not affect final LPI scores as long as no true protein orthologs have been prematurely eliminated. To reduce the frequency of single-domain matches to multi-domain proteins, initial filtering for this study included a requirement for each match to cover at least 60% of the query sequence length. BLAST bit score was used as a metric for subsequent ranking and filtering steps, to ensure fairness in analyzing sequences of varying lengths.
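The coarse pre-filter described above can be sketched as follows. This is a minimal illustration rather than the DarkHorse source code: the `Hit` record, its field names, and the `passes_coverage` helper are hypothetical, and the example hits are invented. The only value taken from the text is the requirement that a match cover at least 60% of the query length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST alignment (hypothetical record; field names are assumptions)."""
    query_id: str
    subject_id: str
    bit_score: float
    align_len: int  # aligned residues counted on the query
    query_len: int  # full length of the query protein


MIN_QUERY_COVERAGE = 0.60  # from the text: at least 60% of the query length


def passes_coverage(hit: Hit, min_cov: float = MIN_QUERY_COVERAGE) -> bool:
    """Keep a hit only if its alignment spans enough of the query."""
    return hit.align_len / hit.query_len >= min_cov


# Invented example: one near full-length match and one single-domain match.
hits = [
    Hit("q1", "full_length_match", 900.0, 580, 603),
    Hit("q1", "single_domain_match", 310.0, 200, 603),
]
kept = [h for h in hits if passes_coverage(h)]
```

Here only the near full-length alignment survives; the single-domain match covers about a third of the query and is dropped before any ranking takes place.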
Selection and ranking of candidate match sets
One well-known problem in using the BLAST search algorithm to rank candidate matches is that highly conserved proteins can generate multiple database hits with similar scores, and quantitative differences between the first hit and many subsequent matches may be statistically insignificant. No single, absolute threshold value is suitable as a significance cutoff for all proteins within a genome, because degree of sequence conservation varies tremendously. In addition to variability among proteins, mutation rates and database representation can also vary widely between taxa, so appropriate threshold values may need adjustment by query organism, as well as by individual protein.
To overcome these problems, DarkHorse considers bit score differences relative to other BLAST matches against the same genomic query, rather than considering absolute differences. For each query protein, a set of ortholog candidates is generated by selecting all matches that fall within an individually calculated bit score range. The minimum of this range is set as a percentage of the best available score for any non-self hit against that particular query. The percentage is equal to the global filter threshold setting chosen by the user, which can, in theory, vary between 0% and 100%. A zero value requires that all candidate matches for a particular query have bit scores exactly equal to the top non-self match. Filter threshold settings intermediate between 0% and 100% require that candidate matches have bit scores in a range within the specified percentage of the highest scoring non-self match. In practice, values between 0% and 20% are found to be most useful in identifying valid horizontal transfer candidates. The effects of threshold settings on the phylogeny of top-ranking candidates are illustrated for genomes from four different organisms in Tables to .
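The relative bit score window can be sketched in a few lines. This sketch assumes that "within the specified percentage of the highest scoring non-self match" means a lower bound of best score x (1 - threshold/100); that reading reproduces the behaviour described for 0% and 100%, but the exact formula is not given in this section. The function name and the example scores are hypothetical.

```python
def candidate_set(hits, threshold_pct):
    """Select ortholog candidates for one query.

    hits: list of (subject_id, bit_score) pairs for non-self matches.
    threshold_pct: global filter threshold, 0 to 100.
    Assumption: the lower bound is best * (1 - threshold_pct / 100).
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    floor = best * (1.0 - threshold_pct / 100.0)
    return [(sid, score) for sid, score in hits if score >= floor]


# Invented bit scores for a single query.
example = [("a", 500.0), ("b", 480.0), ("c", 430.0), ("d", 200.0)]
at_0 = [sid for sid, _ in candidate_set(example, 0)]
at_10 = [sid for sid, _ in candidate_set(example, 10)]
at_20 = [sid for sid, _ in candidate_set(example, 20)]
at_100 = [sid for sid, _ in candidate_set(example, 100)]
```

Because the bound is computed per query from that query's own best score, a highly conserved protein and a poorly conserved one are each judged against their own scale.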
Effect of filter threshold setting on best match lineages for E. coli
Effect of filter threshold setting on best match lineages for T. maritima
Once candidate match sets have been selected for each genomic protein, lineage information is retrieved from the taxonomy database. This information is used to calculate preliminary estimates of lineage frequencies among potential database orthologs of the query genome. These preliminary estimates are used as guide probabilities in a first round of candidate ranking, then later refined in a second round of ranking.
The probability calculation procedure, described in detail in the Materials and methods section, is based on the average relative position and frequency of lineage terms. More weight is given to broader, more general terms occurring at the beginning of a lineage (for example, kingdom, phylum, class), and less weight to narrower, more detailed terms that occur at the end (for example, family, genus, species). To compensate for the fact that some lineages contain more intermediate terms than others (for example, including super- and/or subclasses, orders, or families), the calculation normalizes for total number of terms, and weights each term according to its average position among all lineages tested, rather than an absolute taxonomic rank. The end result is a very fast, computationally simple technique to assign higher probability scores to lineages that occur more frequently, and lower scores to lineages that occur only rarely. Groups of phylogenetically related organisms receive similar lineage probability scores, even if actual matches to the query genome are unevenly distributed among individual members of the group.
The probability calculation is performed twice during each search for horizontal transfer candidates, once to obtain a set of preliminary guide probabilities, and a second time to obtain more refined LPI scores. Initial guide probabilities are calculated using one sequence from each candidate match set, selected on the basis of having the highest BLAST bit score in the set. Once guide probabilities are established, they are used to re-rank the members of each candidate set by lineage probability instead of bit score, in some cases resulting in the choice of a new top-ranking sequence. The lineage-probability calculation is then repeated using the revised set of top-ranking candidates as input, to obtain final LPI scores, which range between zero and one. Additional rounds of probability calculation and candidate selection would be possible but are unnecessary; lineage probability scores generally change only slightly between the preliminary guide step and final LPI assignments.
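The two-round structure can be illustrated with a deliberately simplified stand-in for the scoring step. Note the assumption: `lineage_frequencies` below is NOT the LPI formula (which weights lineage terms by position and is described in Materials and methods); it simply uses the relative frequency of each lineage label among the current top picks. Lineages are reduced to a single label, and all names and example values are hypothetical. Only the control flow (pick by bit score, score, re-rank by score, score again) follows the text.

```python
from collections import Counter


def lineage_frequencies(top_lineages):
    """Stand-in score: relative frequency of each lineage (not the real LPI)."""
    counts = Counter(top_lineages)
    total = sum(counts.values())
    return {lineage: n / total for lineage, n in counts.items()}


def two_round_ranking(candidate_sets):
    """candidate_sets: {query_id: [(lineage, bit_score), ...]}, each non-empty."""
    # Round 1: top pick per query is the candidate with the highest bit score.
    first = {q: max(c, key=lambda x: x[1])[0] for q, c in candidate_sets.items()}
    guide = lineage_frequencies(first.values())
    # Round 2: re-rank each set by guide score, breaking ties by bit score.
    second = {
        q: max(c, key=lambda x: (guide.get(x[0], 0.0), x[1]))[0]
        for q, c in candidate_sets.items()
    }
    return second, lineage_frequencies(second.values())


# Invented candidate sets; q3 has two near-equal candidates.
sets = {
    "q1": [("Shigella", 1000.0)],
    "q2": [("Shigella", 800.0)],
    "q3": [("Arabidopsis", 1261.0), ("Shigella", 1255.0)],
}
top_picks, final_scores = two_round_ranking(sets)
```

In this toy input, q3 initially points at the rarer lineage because of a marginally higher bit score; after re-ranking by the guide scores it is reassigned to the lineage that dominates the rest of the genome.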
Filter threshold optimization
Selecting a global filter threshold value of zero maximizes the opportunity to identify horizontal transfer candidates, but may result in false positives if sequences from closely related organisms have BLAST scores that are slightly, but not significantly, lower than the top hit. Using a higher value for the threshold filter, allowing a wider range of hits to be considered in the candidate set for each query, helps eliminate false positive horizontal transfer candidates by promoting matches from closely related species over those from more distant species. However, as the range of acceptable scores for match candidates is progressively broadened, sensitivity to potential horizontal transfer events is correspondingly decreased, and true examples of horizontal transfer may be overlooked.
The effects of filter threshold cutoff settings on phylogenetic distribution of corrected best matches were examined in detail for E. coli strain K12. In this example, all protein matches to the genus Escherichia were excluded under the user-specified definition of self. In addition, matches containing the terms 'cloning', 'expression', 'plasmid', 'synthetic', 'vector', and 'construct' were also excluded to remove artificial sequences that might originally have been derived from E. coli.
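A self-exclusion filter of this kind might look like the sketch below. It assumes case-insensitive substring matching against the database description line plus an optional taxonomy ID check; the actual matching rules used by DarkHorse are not specified here. The keyword list and taxonomy ID 562 come from the text, while the function name and the three example description lines are invented.

```python
from typing import Iterable, Optional

# Terms quoted in the text for removing artificial sequences.
ARTIFICIAL_TERMS = ("cloning", "expression", "plasmid", "synthetic", "vector", "construct")


def is_self(description: str, tax_id: Optional[int],
            keywords: Iterable[str], tax_ids: Iterable[int] = ()) -> bool:
    """Return True if a database match falls under the user's definition of self.

    Assumption: keywords are matched as case-insensitive substrings.
    """
    if tax_id is not None and tax_id in set(tax_ids):
        return True
    text = description.lower()
    return any(word.lower() in text for word in keywords)


self_terms = ("Escherichia",) + ARTIFICIAL_TERMS

# Invented description lines; None marks a taxonomy ID left unspecified.
matches = [
    ("beta-galactosidase [Escherichia coli]", 562),
    ("beta-galactosidase [Shigella boydii]", None),
    ("cloning vector lacZ alpha fragment", None),
]
non_self = [desc for desc, tid in matches if not is_self(desc, tid, self_terms)]
```

Plain substring matching is blunt: a term such as "expression" would also exclude a legitimate protein whose description happens to contain that word, so the keyword list deserves review for each test genome.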
Table summarizes the E. coli filter threshold results. BLAST matches above the initial screening threshold were found for 4,179 (97%) of the original 4,302 genomic query sequences. With a filter threshold cutoff of 0%, the great majority of lineage-corrected best matches are closely related Enterobacterial proteins, as expected. As the filter threshold is progressively broadened, this number increases from 4,000 to a maximum of 4,112, reflecting the promotion of matches from closely related species to a best candidate position. However, some E. coli proteins had no matches to Enterobacterial database entries, even at a filter threshold setting of 100%, where all BLAST hits above the initial screening minimum are considered equivalent. Matches to these sequences are found only in phage, eukaryotes, and more distantly related bacteria, and represent either database errors, gene loss in all other sequenced members of this lineage, hyper-mutated sequences unique to this strain of E. coli, or candidates for lateral acquisition.
Table shows detailed information for the eight eukaryotic sequences initially identified as best matches to E. coli. For each E. coli query sequence, the top hit match using a 0% threshold is shown first (bold). The second line for the same query (italicized) shows results at the lowest filter value where an alternative match with a higher LPI score was found. In five cases, increasing the filter threshold revealed additional BLAST matches to sequences with higher LPI values, suggesting the original match might be incorrect. In three cases, no better match was found, supporting statistical validity of the original result.
Effect of filter threshold setting and LPI score ranking on eukaryotic BLAST matches to E. coli
Interpreting BLAST search results for E. coli requires caution, because there is an especially high risk of finding matches to contaminating cloning vector and host sequences in genomic data for other organisms. This problem is illustrated by the first entry in Table , for the E. coli beta-galactosidase protein AAC74689, a common cloning vector component. The top ranking match for this query at a filter value of zero is Arabidopsis protein CAC43289. The BLAST alignment for this match is excellent, with 99% identity over all 603 amino acids of the query sequence, but application of a filter threshold setting of 2% reveals another extremely good match in the database, ZP_00698534 from E. coli's close relative Shigella boydii. In the original BLAST analysis, the Shigella protein received a bit score of 1,255, compared to 1,261 for the Arabidopsis protein, even though both proteins have the same percent identity and query coverage length. Clearly this difference in bit score is insignificant, and difficult to detect without adequate surveillance. Ranking the matches by decreasing LPI score solves this problem; the Arabidopsis match has an LPI score of 0.009, but the Shigella match has an LPI score of 0.98. This example shows how a combination of threshold range filtering and LPI score ranking can successfully eliminate false positive artifacts due to cloning vector contamination.
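The beta-galactosidase case can be worked through numerically. The accessions, bit scores, and LPI scores below are quoted from the paragraph above; the lower-bound formula (best score x (1 - threshold/100)) and the helper name `top_match` are assumptions made for illustration.

```python
# (accession, organism, bit score, LPI score), all quoted from the text above
hits = [
    ("CAC43289", "Arabidopsis", 1261.0, 0.009),
    ("ZP_00698534", "Shigella boydii", 1255.0, 0.98),
]


def top_match(hits, threshold_pct):
    """Filter by relative bit score, then rank the survivors by LPI score."""
    best = max(h[2] for h in hits)
    floor = best * (1.0 - threshold_pct / 100.0)  # assumed form of the bound
    in_range = [h for h in hits if h[2] >= floor]
    return max(in_range, key=lambda h: h[3])[0]


gap_pct = (1261.0 - 1255.0) / 1261.0 * 100.0  # relative bit-score gap, just under 0.5%
top_at_0 = top_match(hits, 0)  # only the Arabidopsis hit is in range
top_at_2 = top_match(hits, 2)  # both hits in range; Shigella wins on LPI
```

The two bit scores differ by less than half a percent, so a 2% window admits both, and the large difference in LPI score then decides the ranking.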
The second and third queries in Table , for the enzymes mannitol phosphate dehydrogenase and cytosine deaminase, also appear to have matched inappropriate database sequences when using a zero threshold setting. Using a filter threshold of 20% or lower overcomes these apparent errors, replacing them with nearly equal matches in a species closely related to the original query organism. In contrast, the fifth query of Table (AAC75891) illustrates the danger of setting threshold values that are too lenient. In this case, using a filter threshold of 80%, a BLAST hit from a phylogenetically closer organism (Salmonella) has been promoted even though it has only 28% identity to the query, versus 85% in the original top hit. This promotion is clearly unjustified.
For optimal DarkHorse performance, threshold values need to be set at a level that is neither too high nor too low. The best threshold setting for an individual query organism depends on the abundance of closely related sequences in the database used for BLAST searches. This value is difficult to measure directly, but can be calibrated approximately by measuring the maximum candidate set size returned using different threshold settings on a genome-wide basis, as shown in Figure . For this data set, the original BLAST search included a maximum possible number of 500 matches per query. Values shown in the graph indicate the highest number of candidate matches found for any single query in the test genome after filtering at the indicated threshold setting.
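The calibration measurement described above, the largest candidate set found for any single query at each threshold, can be sketched as follows. The function name and the bit scores are invented, and the lower bound again assumes the form best score x (1 - threshold/100).

```python
def max_candidate_set_size(bit_scores_by_query, threshold_pct):
    """Largest candidate set over all queries at one filter threshold.

    bit_scores_by_query: {query_id: [bit scores of non-self hits]}.
    """
    sizes = []
    for scores in bit_scores_by_query.values():
        if not scores:
            continue  # query had no hits above the initial screening criteria
        floor = max(scores) * (1.0 - threshold_pct / 100.0)
        sizes.append(sum(1 for s in scores if s >= floor))
    return max(sizes, default=0)


# Invented genome with three queries.
genome = {
    "q1": [500.0, 495.0, 470.0, 300.0],
    "q2": [200.0, 150.0],
    "q3": [900.0],
}
curve = {t: max_candidate_set_size(genome, t) for t in (0, 2, 10, 20, 100)}
```

Plotting such a curve for a real genome is one way to look for the plateau that signals a reasonable threshold; in real data the value is also capped by the maximum number of matches retained per query in the original BLAST search.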
Effect of filter threshold setting on maximum number of candidate set members per query.
For an organism like E. coli, with sequences available for many closely related species, the maximum number of candidate set members appears to reach a plateau when using a filter threshold setting of 10% to 20%. After that point, further broadening of the threshold compromises the effectiveness of the filtering process. For query organisms from more sparsely represented phylogenetic groups, such as the archaeon Thermoplasma acidophilum, there are very few examples of closely related species in the database. In these cases, a lower filter threshold cutoff value is appropriate. For some organisms, it may make sense to limit the filter threshold setting to zero, promoting only those matches whose scores are exactly equivalent to the initial top hit.
Threshold filtering can help eliminate statistical anomalies of BLAST scoring, but there are some types of database ambiguities it cannot resolve. One such example is the sixth entry in Table , a match between E. coli sequence AAC73796 and database entry BAB33410, isolated from snow pea pods (P. sativum). This match covers 100% of the E. coli query sequence at 100% identity, but only 46% of the pea protein. Sequences distantly related to the matched region exist in several other strains of E. coli, but were not recognized by threshold filtering because they fall below the minimum BLAST match retention criteria. No related sequences are found in any eukaryotes other than snow pea, even at an e-value of 10.0. If this were a true case of horizontal transfer, closeness of the match would imply a very recent event, and phylogenetic distribution would suggest direction of transfer as moving from E. coli to the seed pods of a eukaryotic plant. But this scenario is biologically unlikely. A more reasonable explanation is that the sequence identity is due to an undetected artifact introduced during cloning of the pea sequence. This sequence was obtained from a single isolated cDNA clone, and reported in a lone, unverified literature reference [38]. This type of error is difficult to avoid in uncurated databases like GenBank nr.
Definition of database 'self' sequences
The definition of 'self' sequences for a query organism is configured by a list of user-defined self-exclusion terms. These terms, which can be either names or taxonomy ID numbers, provide a simple way to adjust phylogenetic granularity of the search, and to compensate for over-representation of closely related sequences in the source database. Although the LPI scoring method is naturally more sensitive to transfer events between distantly related taxa than to closely related species, adjusting breadth of the self-definition keywords for a test organism can reveal potential horizontal transfer events that are either very recent or progressively more distant in time. In practice, this is accomplished by choosing a narrow initial self-definition, then iteratively adding one or more species with high LPI scores to the list of self-definition keywords in the next round of analysis. Query sequences acquired since the divergence of two related genomes can be identified by comparing LPI scores and associated lineages plus or minus one of the relatives as a self-exclusion term.
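Comparing two runs that differ by one self-exclusion term can be sketched as a simple ranking of per-query score changes. The function name, query identifiers, and scores are invented; in keeping with the ranking-based approach, no cutoff is applied and queries are only ordered by the size of the drop.

```python
def rank_by_lpi_drop(narrow, broad):
    """Order queries by how far their LPI score falls when self is broadened.

    narrow, broad: {query_id: LPI score} from two runs that differ only in
    the self-definition list. Queries missing from either run are skipped.
    """
    shared = narrow.keys() & broad.keys()
    drops = [(q, narrow[q] - broad[q]) for q in shared]
    return sorted(drops, key=lambda item: (-item[1], item[0]))


# Invented scores: query_B loses its close matches once a relative is excluded.
narrow_run = {"query_A": 0.99, "query_B": 0.99, "query_C": 0.05}
broad_run = {"query_A": 0.97, "query_B": 0.40, "query_C": 0.05}
ranked = rank_by_lpi_drop(narrow_run, broad_run)
largest_drop_query = ranked[0][0]
```

Queries near the top of this list are those whose best matches were confined to the newly excluded relative. Queries that lose all matches in the broader run do not appear at all and would need to be tracked separately.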
As an example of this process, the self-definition for E. coli strain K12 was first defined narrowly by a set of strain-specific names and NCBI taxonomy ID numbers (K12, 83333, 316407, 562). This self-definition includes strain K12, as well as matches where the E. coli strain is unspecified, but still permits matches to clearly identified genomic sequences from alternative strains, for example, O157:H7. A second self-definition list was created using genus name Escherichia alone, which eliminates all species and strains from this genus. The list was then iteratively broadened by adding the names Shigella and Salmonella. Table illustrates how this process changes the lineages of best matches chosen by DarkHorse. As the breadth of self-definition terms is expanded, the total number of matches declines, because fewer database proteins remain that meet minimum BLAST requirements. As the total number of Enterobacterial matches declines, matches to other classes of bacteria increase because they are the best remaining alternative. The maximum LPI value (LPImax), which is assigned to the lineage with the greatest number of matches, becomes progressively lower as the self-definition is expanded. The total number of matches having this LPImax value also declines, and the lineage associated with the LPImax becomes phylogenetically more distant from the original test genome. The histograms in Figure , grouped into bins of 0.02 units, show how the overall distribution of LPI scores changes from high to low as the number of closely related database taxa is depleted by broader self-definition terms. In this respect, using a coarser set of self-exclusion terms for an abundantly represented organism mimics the distribution of organisms that are more sparsely represented in the database.
Effect of self-definition keywords on best match lineages for E. coli
Effect of expanding E. coli self definition terms on LPI score distribution histograms. Filter threshold setting was 10%. (a) Self = Escherichia (b) Self = Escherichia + Shigella + Salmonella.
Table illustrates how changing self-definition keywords affects predictions of horizontal transfer for some individual protein examples. The first two rows in Table contain sequences that are highly conserved among all strains of E. coli, as well as many closely related species. Matches to protein AAC75738 have lower e-values than matches to AAC74994 simply because AAC75738 is a much shorter protein (61 versus 495 amino acids). In these two rows, self-definition keywords do not affect LPI scores, which remain at maximum for both keyword sets.
Effect of self-definition keywords on LPI scores for individual protein examples from E. coli strain K12
LPI scores are also unchanged by self-definition keywords for the query sequences shown in rows 3 and 4, but for a different reason. Both of these sequences appear likely to have been recently acquired by E. coli strain K12, since its divergence from other E. coli strains. The closest database alignments to protein AAC75802 are with two species of delta-Proteobacteria, Geobacter sulfurreducens and Desulfuromonas acetoxidans (not shown). This protein does not align well with any other strain of E. coli, nor with any other Enterobacterial genomes. Gene loss from such a large number of species seems unlikely as an alternative explanation to horizontal transfer.
Protein AAC75097 also appears to have been recently acquired by strain K12. Its origin is unclear; it aligns closely not only with a protein from Psychromonas ingrahamii, found in polar ice, but also with multiple examples among gamma-proteobacteria (Actinobacillus succinogenes and Mannheimia succiniciproducens), as well as epsilon-proteobacteria (Campylobacter jejuni) and eubacteria (several Lactobacillus and Streptococcus species). These organisms or their relatives could all potentially be found in human or bovine gut microflora, providing ample opportunity for gene exchange with both E. coli and each other. Differences in nucleotide composition between the proteins in rows 3 and 4 and the consensus for E. coli strain K12 (approximately 50% GC) also support recent lateral acquisition. Genomes from eubacteria in the Bacillus and Lactobacillus groups typically have a mean GC content around 35%.
The fifth row in Table illustrates an example of likely horizontal gene transfer that occurred less recently. Using the narrowest set of self-definition keywords, protein AAC76015 has an LPI score of 0.993, equal to the LPImax, but the score drops substantially when the self-definition is expanded to include all species in the genus Escherichia. Closest alignments to this protein are found in multiple species of gamma-proteobacteria from the Pseudomonas lineage, but not in any other Enterobacteria besides E. coli strains K12, 536, UTI89, and F11. The atypically high GC percentage of this E. coli sequence is also consistent with transfer from members of genus Pseudomonas, whose genomes typically have mean GC contents of 60% or higher.
Table illustrates a similar keyword expansion experiment performed with Arabidopsis thaliana. Adding Oryza to the self-definition list increases the number of bacterial matches from 162 to 812. Of these 812 matches, 336 are to cyanobacterial species, perhaps reflecting historical migration of chloroplast sequences derived from bacterial endosymbionts to the plant nucleus prior to the divergence of Arabidopsis and Oryza. The histograms in Figure show how expanding the self definition not only lowers the top LPI scores, but also clarifies the separation of matches into three distinct groups, representing viridiplantae (scores 0.5 to 0.7), metazoan, fungal, and apicomplexan eukaryotes (scores 0.3 to 0.4), and bacteria (scores below 0.03).
Effect of self-definition terms on best match lineages for A. thaliana
Effect of expanding A. thaliana self definition terms on LPI score distribution histograms. Filter threshold setting was 10%. (a) Self = Arabidopsis. (b) Self = Arabidopsis + Oryza.
One limitation to the technique of expanding self-definition terms is that it also reduces the total number of non-self BLAST matches. More than 90% of the original E. coli query sequences still have database matches above the BLAST initial screening criteria after excluding the three closest genera, but adding just a single genus to the Arabidopsis self-definition eliminated 20% of the original matches. For phylogenetic groups with less extensive database representation, exclusion of too many related groups may reduce the number of matches to a point where it is too low to reasonably represent the test genome.
LPI score significance
The DarkHorse algorithm does not provide explicit criteria for classifying sequences as horizontally transferred or not; rather it ranks all candidates within a genome relative to each other. Selecting a single absolute value as a universal cutoff between positive and negative candidates for horizontal transfer neither makes biological sense, nor can it be supported computationally in the absence of unambiguous, known, and generally accepted positive and negative examples. Score distributions vary widely according to the evolutionary history of a test organism, the definition of 'self' chosen, and the number of closely related sequences in the database that lie outside that definition of self for a particular query.
Despite the difficulty of defining exact classification boundaries, some solid general principles can be applied to interpreting LPI score distributions, as illustrated by histograms of binned data in Figures to . Query protein sequences with the highest LPI scores (LPImax) can be eliminated from consideration as horizontal transfer candidates with a high degree of confidence, because they are matched with proteins from lineages most closely related to the query organism. By definition, LPI scores must fall between zero and one. Within these limits, LPImax values cover a fairly broad range, with lower scores characteristic of organisms with few close relatives in the database, or with self-definition settings that have intentionally filtered out the closest relative sequences. Query protein sequences with intermediate LPI scores may or may not have been horizontally transferred, and will require analysis by independent methods to classify definitively. The number of query proteins with intermediate scores typically decreases as more closely related genomes are added to the underlying database. Scores at the lowest end of the LPI score distribution represent the best candidates for horizontal transfer, because their closest database matches belong to lineages that are most distantly related to the query organism. In the most extreme cases, if the closest match falls in a different kingdom, these sequences can have scores of 0.1 or lower.
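A binned view of an LPI score distribution can be produced with a short helper. The sketch below uses 0.02-unit bins, the width quoted for the E. coli histograms earlier in this section, and assumes scores are reported to at most four decimal places so that binning can be done with integer arithmetic. The function name and the example scores are hypothetical.

```python
from collections import Counter


def lpi_histogram(scores, bin_width=0.02):
    """Count LPI scores per bin; bin i covers [i * bin_width, (i + 1) * bin_width).

    Assumption: scores lie in [0, 1] with at most four decimal places. A score
    of exactly 1.0 is placed in the last bin.
    """
    scale = 10_000
    width = round(bin_width * scale)
    last_bin = scale // width - 1
    counts = Counter(min(round(s * scale) // width, last_bin) for s in scores)
    return dict(sorted(counts.items()))


# Invented scores: a cluster near LPImax, one intermediate, two low outliers.
scores = [0.98, 0.98, 0.993, 0.56, 0.05, 0.009]
histogram = lpi_histogram(scores)
lpi_max = max(scores)
```

Reading such a histogram from the top bin downward mirrors the interpretation given above: the bin containing LPImax holds queries that are unlikely transfer candidates, while the sparsely populated bins near zero hold the strongest candidates.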
LPI score distribution histogram for E. histolytica. Filter threshold setting was zero.
Bacterial and Archaeal examples
Two microbial organisms previously demonstrated by multiple bioinformatics methods to have high rates of horizontal gene transfer were re-analyzed for comparison using the DarkHorse algorithm. Euryarchaeotal species Thermoplasma acidophilum has been suggested to have experienced lateral gene exchange specifically with Sulfolobus solfataricus, a distantly related crenarchaeote that lives in the same ecological niche [39]. The hyperthermophilic bacterium Thermotoga maritima is believed to have undergone particularly high rates of horizontal gene exchange with archaeal species sharing its extreme habitat [40]. Each of these genomes was analyzed using its genus name as a self-exclusion term, and filter threshold cutoff values ranging from 0% to 40%.
The 1,494 predicted protein sequences of T. acidophilum had numerous best matches to distantly related organisms, including both Sulfolobus, as expected, and a variety of bacterial species (Table , Figure ; raw data in Additional data file 2). Using a filter threshold of zero, the LPI score for the Sulfolobus lineage was 0.42, substantially below the Picrophilus lineages, with LPI scores of 0.76 to 0.79. The number of query proteins with best matches to Sulfolobus proteins was 106, consistent with a previous study that found 93 laterally transferred proteins agreed upon by three different prediction methods, with an additional 90 agreed upon by two out of the three methods [34]. In addition, DarkHorse analysis identified 97 query sequences most closely matched to bacterial proteins that were not examined in previous studies. These matches included species like Thermotoga maritima, which may themselves have acquired archaeal sequences from a Thermoplasma relative. This multi-level data complexity undoubtedly contributes to the inconsistency of horizontal transfer predictions from different bioinformatic methods.
Effect of filter threshold on best match lineages for T. acidophilum
LPI score distribution histogram for T. acidophilum. Filter threshold setting was zero.
Table and Figure summarize LPI score distributions for Thermotoga maritima (raw data provided in Additional data file 3). Database matches scoring above the minimum BLAST criteria were found for 1,440 (78%) of 1,846 predicted proteins in the Thermotoga genome. With a cutoff filter value of 0, the largest group of matches, 617, was to bacteria of the Firmicutes/Clostridia lineage, generating LPI scores of 0.54 to 0.55 for these lineages. An LPImax value of 0.55 is much lower than that observed for many other microbial genomes, reflecting the absence of a truly close relative in the source database. The most abundant genus in the Clostridia group was Thermoanaerobacter, but this genus had only 265 matches. Other bacterial species from the Firmicutes lineage had LPI scores of 0.46 to 0.50, and more distant bacterial lineages had LPI scores between 0.33 and 0.41. At the lowest end of the score distribution were 208 matches to archaeal sequences, with LPI values of 0.1 or less. These archaeal matches represented 11.3% of the Thermotoga genome, consistent with previous reports suggesting that between 11% and 24% of proteins in this species have been laterally acquired [1]. The wide variability in literature predictions for numbers of horizontally transferred genes reflects the difficulty of assigning definitive classifications by any single bioinformatic method. However, LPI score distributions have captured and quantified the scarcity of orthologous sequences from closely related species in the source database, an important factor contributing to this discrepancy.
LPI score distribution histogram for T. maritima. Filter threshold setting was zero.
The parasitic amoeba Entamoeba histolytica is believed to have lost its mitochondria and many enzymes associated with aerobic metabolism as an adaptation to its parasitic lifestyle and anaerobic habitat in the human gut. At the same time, this organism appears to have gained a set of enzymes not found in other eukaryotes, supporting anaerobic fermentation pathways. These enzymes may have been obtained by lateral gene transfer from phagocytized bacterial prey. In support of this hypothesis, a previous study has identified 96 genes considered most likely to have been laterally acquired, using a combination of automated and manual phylogenetic methods [43].
To compare DarkHorse predictions with those obtained by other methods, the E. histolytica genome was analyzed using the genus name as a self-definition, and filter threshold settings of 0% to 40%. Out of 9,775 predicted protein sequences, only 3,573 (37%) had matches above the minimum BLAST criteria, reflecting the scarcity of database sequence relatives. The maximum number of best matches to a single query rose abruptly from 33 to 497 when raising the threshold filter setting from 0% to 2%. These results suggest that database coverage for this organism is so sparse that filter settings higher than zero, shown in Table , are probably too lenient.
Effect of filter threshold setting on best match lineages for E. histolytica
The LPI score distribution for E. histolytica is divided into several distinct phylogenetic clusters (Figure ; raw data in Additional data file 4). The low LPImax value of 0.56, associated with 694 matches to genus Dictyostelium, confirms the scarcity of related species in the database. Best matches with LPI scores between 0.3 and 0.5 were associated with a wide diversity of other eukaryotic organisms, including plants, animals, and fungi as well as protozoa. The bacterial cluster of best matches had LPI scores between 0.04 and 0.07, and archaeal best matches had scores below 0.02. Previous work did not distinguish between archaeal and bacterial matches in E. histolytica, but grouped them all together among the 96 predicted lateral transfer candidates. Finding the archaeal sequence matches is particularly interesting, because they represent potential evidence supporting the theory of archaeal contributions to virulence in bacterial human pathogens [10
Using a zero filter threshold cutoff, DarkHorse found non-eukaryotic best matches for 86 of the 96 E. histolytica genes previously identified as lateral transfer candidates. Of the ten differences, four were due to revisions in E. histolytica gene models: the older predicted Entamoeba sequences are no longer present in the current GenBank version of the genome. One disagreement occurred because the bacterial match proposed by Loftus et al. did not pass the initial DarkHorse BLAST pre-screening criteria for orthology, with an alignment length covering less than 60% of the query sequence [43]. One of the remaining five differences was found by DarkHorse to have a best match in Mastigamoeba balamuthi, and the other four had best matches to proteins in Dictyostelium discoideum. These are both amoeboid species representing close database relatives of E. histolytica. If these five E. histolytica sequences were laterally acquired, the transfer must have occurred prior to evolutionary divergence from other eukaryotic amoeboid species. It is possible that the Dictyostelium sequences matched here were not yet available when the previous analysis was done, in which case the corresponding earlier lateral transfer predictions would be false positives. If so, this highlights the importance of re-analyzing phylogenetic data as new sequences for relatives of the query organism become available.
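The comparison with the 96 previously reported candidates can be checked with a short tally. This is only a bookkeeping sketch: the counts are taken from the paragraph above, while the variable names and category labels are ours.

```python
# Counts reported in the text for the 96 earlier lateral transfer candidates.
previous_candidates = 96
agreeing = 86  # non-eukaryotic best match also found by DarkHorse

# Breakdown of the differences, as described in the text.
differences = {
    "gene model revised, sequence no longer in GenBank": 4,
    "bacterial match failed BLAST pre-screening (alignment < 60% of query)": 1,
    "best match in Mastigamoeba balamuthi": 1,
    "best match in Dictyostelium discoideum": 4,
}

total_differences = sum(differences.values())

# Differences explained by a eukaryotic (amoeboid) best match.
eukaryotic_best_matches = (
    differences["best match in Mastigamoeba balamuthi"]
    + differences["best match in Dictyostelium discoideum"]
)

# The categories should account for every earlier candidate.
assert agreeing + total_differences == previous_candidates

for reason, count in differences.items():
    print(f"{count:2d}  {reason}")
```

Only the five eukaryotic best matches represent a substantive disagreement between the two analyses; the other five differences stem from changes in the input data or from the pre-screening criteria.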
The most abundant bacterial and archaeal matches in the E. histolytica genome were to species known to inhabit the human digestive tract, including oral pathogen Tannerella forsythensis (45 matches), gut symbiont Bacteroides thetaiotaomicron (21 matches), and archaea from the genus Methanosarcina (40 matches). All 45 T. forsythensis matches point to a single bacterial cell surface-associated protein, BspA, previously shown to mediate dose-dependent binding to the human extracellular matrix components fibronectin and fibrinogen [44]. Sixteen best matches in Methanosarcina point to archaeal relatives of this same protein. Interestingly, there were no DarkHorse best matches to T. forsythensis or BspA in the genome of Dictyostelium discoideum, and only five matches to B. thetaiotaomicron and three to Methanosarcina.
The true biological relationships involved in E. histolytica gene evolution are quite complex, probably including multiple horizontal transfer events between eukaryotes, archaea, and bacteria that may themselves contain previously acquired archaeal sequences. Using a filter threshold setting of zero, DarkHorse identified an additional 60 archaeal and 350 bacterial best matches that were not described in the original E. histolytica genome paper. The most likely reason for this discrepancy is sub-optimal sensitivity of Pyphy [33], the automated phylogenetic tree building software used by Loftus et al., when dealing with complex data sets [43]. The Pyphy tree-building parameters were originally designed to find simple paralogous sequence relationships between closely related clades. Lower than expected Pyphy sensitivity has been described by other authors attempting to use it for horizontal gene transfer analysis across wide phylogenetic distances [34