We highlight a selection of recent research on computational methods for predicting bacterial horizontal gene transfer and the challenges associated with them. This research area continues to face controversy, but is becoming more critical as the importance of horizontal gene transfer in the evolution of medically and ecologically important prokaryotes is further appreciated.
Horizontal gene transfer (HGT) is an important driving force in prokaryotic evolution that allows bacteria to quickly acquire genes not only from similar strains, but also from distantly related species, including phages. This enables bacteria to adapt to changing environmental pressures, but can also lead to problems with treating bacterial illnesses, due to the exchange of antibiotic resistance genes or virulence factors.
Although HGT has been shown to be widespread across bacterial strains, the rate of HGT is still debated. Some argue that HGT is so prevalent among bacteria that the ability to reconstruct a tree of life should be seriously reconsidered, while other recent research indicates that HGT may not be very prevalent. Overall, methods for the prediction of HGT in bacteria continue to improve, and many would agree that construction of species trees, despite the prevalence of HGT, is worthwhile if appropriate methodologies are applied.
Most HGT computational prediction methods can be roughly grouped into two main categories: compositional methods, which identify anomalous sequence signatures within a prokaryotic genome suggestive of a region of HGT, and phylogenetic methods, which analyse the incongruence of a gene tree versus its associated species tree. We briefly highlight here some of the recent improvements in such methodologies and the challenges still faced.
Sequence composition methods depend on different species having differences in their genome signatures. These methods identify HGT by searching for genomic regions that have an abnormal sequence composition (G+C, dinucleotide bias, and so on) compared to the rest of the genome.
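To make the idea of an abnormal sequence composition concrete, the following is a minimal sketch that scans a genome in sliding windows and flags windows whose G+C fraction deviates strongly from the genome-wide window average. It is not the method of any program discussed here; the function names, window size, step and z-score cutoff are all illustrative assumptions, and a flagged window is only a candidate for closer inspection.

```python
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0

def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, end, z) for windows whose G+C fraction is an outlier.

    window, step and z_cutoff are arbitrary illustrative defaults.
    """
    starts = range(0, max(len(genome) - window + 1, 1), step)
    values = [gc_fraction(genome[s:s + window]) for s in starts]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    hits = []
    for s, value in zip(starts, values):
        z = (value - mu) / sigma
        if abs(z) >= z_cutoff:
            hits.append((s, s + window, z))
    return hits

# Toy genome: an AT-only background with a 1 kb GC-only insert at position 4000.
toy_genome = "AT" * 2000 + "GC" * 500 + "AT" * 2500
hits = atypical_windows(toy_genome)
```

On this toy genome only the window that lies entirely inside the insert passes the cutoff; the half-overlapping neighbours do not, which hints at why the boundaries of transferred regions are hard to place with window-based statistics.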
An in-depth study of HGT in the Salmonella lineage indicated that ancient horizontally transferred gene sequences tended to be more similar in sequence composition to their host than more recently acquired genes were, supporting the idea that transferred genes ameliorate to their host genome over time. Notably, however, very recently acquired prophage elements tended to have sequence compositions that were more similar to the host genome, which reflects not amelioration but rather specialization and adaptation to their hosts. Although this study may suggest that more sensitive measures of sequence composition are needed to better predict HGT events, these methods must be carefully designed so that they do not result in an increase in false positives. For example, a recent study of large viruses further supported previous work indicating that many genes with atypical sequence composition were not horizontally acquired and that, instead, the anomalous sequence composition was likely related to certain functions and gene features such as expression level. A recent comparison of several HGT prediction programs showed that these sequence-composition-based methods can predict very different classes of genes, warning that the use of a single method could give biased results. Even with these disadvantages, detecting HGT by sequence composition is still an attractive method, since it usually does not require more than the query genome for analysis.
New research is also producing more intelligent methods; one such method takes into account that single transfer events often include multiple genes and uses the genome location of putative HGTs to further refine predictions. Similarly, there have been many methods focused on the prediction of genomic islands (large regions of HGT), and the accuracy of such genomic island predictors has been increased through the coupling of sequence composition analysis with the identification of additional gene features such as the presence of mobility genes (e.g. integrases and transposases) or tRNAs and direct repeats (known integration sites) [11-15]. This research direction is likely to continue as more sophisticated composition-based methods are developed that also examine other sequence features.
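The use of genome location to refine predictions can be illustrated with a small sketch that merges neighbouring compositionally atypical genes into candidate islands and discards isolated hits. This is a hypothetical simplification and not the algorithm of any published genomic island predictor; the input (one boolean flag per gene, in genome order) and the parameters `max_gap` and `min_genes` are assumptions made for illustration. A fuller version might also score nearby mobility genes or tRNAs.

```python
def candidate_islands(atypical, max_gap=1, min_genes=3):
    """Group flagged genes into runs and keep runs with enough flagged genes.

    atypical  -- list of bools, one per gene in genome order (hypothetical input)
    max_gap   -- unflagged genes tolerated between two flagged genes in a run
    min_genes -- minimum number of flagged genes for a run to be reported
    Returns (first_gene_index, last_gene_index) pairs, inclusive.
    """
    runs = []
    start = last = None
    for index, flagged in enumerate(atypical):
        if not flagged:
            continue
        if start is None:
            start = last = index          # open the first run
        elif index - last - 1 <= max_gap:
            last = index                  # close enough: extend the current run
        else:
            runs.append((start, last))    # too far away: close and start anew
            start = last = index
    if start is not None:
        runs.append((start, last))
    return [(s, e) for s, e in runs if sum(atypical[s:e + 1]) >= min_genes]

flags = [False, True, True, False, True, False, False, False,
         True, False, False, True, True, True]
islands = candidate_islands(flags)
```

With these flags the isolated hit at gene 8 is dropped, while genes 1 to 4 and genes 11 to 13 are reported as candidate islands.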
Many HGT prediction methods look for incongruence between gene trees and an associated species tree. Such methods could increasingly benefit from having a more universal ‘tree of life’ to use as a species tree reference. One notable study attempted to build such a tree by identifying genes that were present in all species and that did not show signs of HGT. Identifying genes that have never been horizontally transferred is a difficult problem that remains controversial; however, some studies have suggested that particular genes could be more resistant to HGT and could therefore be better candidates for construction of a reference tree. Nevertheless, genes subject to HGT can still provide valuable phylogenetic information, and one study actually embraced using HGTs for tree construction, demonstrating that ancient gene transfers to the ancestor of red algae and green plants can act as informative events that support a common origin of these two groups. Another method that tries to construct a large genome tree, using a selected list of genes that are shared across most genomes, is AMPHORA (a pipeline for AutoMated PHylogenOmic inference). This automated pipeline uses 31 ‘marker’ genes, a hidden Markov model (HMM)-based multiple alignment program, and maximum likelihood to construct an organism tree for 578 species. The construction of these large trees is likely to lead to new insights and aid other analyses, but it is appreciated that they do not fully reflect bacterial evolution due to their lack of representation of HGTs. Therefore, complementing these approaches are phylogenetic methods that incorporate or predict HGT events [20-23]. These tools allow for reticulate evolutionary events, such as HGT, and result in a network-like phylogenetic tree that is often represented as a rooted, directed, acyclic graph; this is the same structure that is used by the Gene Ontology project.
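The notion of incongruence between a gene tree and a species tree can be sketched by comparing the clades (sets of descendant leaves) that each rooted tree implies. The example below is a toy illustration under strong assumptions: trees are written as nested Python tuples of leaf names, both trees contain the same leaves, and branch support and rooting uncertainty are ignored. The function names are hypothetical, and a conflicting clade is a lead to investigate rather than evidence of HGT on its own.

```python
def clades(tree):
    """Return the set of clades of a rooted tree given as nested tuples.

    Leaves are strings; each internal node contributes the frozenset of
    leaf names found beneath it.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves_below(child) for child in node))
        found.add(members)
        return members

    leaves_below(tree)
    return found

def conflicting_clades(gene_tree, species_tree):
    """Clades supported by the gene tree but absent from the species tree."""
    return clades(gene_tree) - clades(species_tree)

# Hypothetical five-taxon example: in the gene tree, A groups with D
# instead of with B as in the species tree.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = (("B", "C"), (("A", "D"), "E"))
conflicts = conflicting_clades(gene_tree, species_tree)
```

Here the gene tree places A next to D, so the clade {A, D} appears among the conflicts; the same comparison would report nothing for a transfer between sister taxa, since the clades would be unchanged.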
A software package called PhyloNet was recently published and includes many tools for HGT prediction and tree construction that should be useful for many researchers. It makes use of a recently created eNewick (‘extended Newick’) format for representing network-like trees, which is based on the well-established, classic Newick format.
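As a rough illustration of how a network can be written in a Newick-like string, the sketch below assumes the extended Newick convention in which a reticulation node is written once for each of its parents and tagged with `#`, an optional type (`H`, `LGT` or `R`) and an integer identifier. The example string and the helper function are hypothetical and do not use or reproduce the PhyloNet API; the code only counts tags and does not parse the network.

```python
import re
from collections import Counter

# Assumed tag syntax: '#', an optional type (H, LGT or R), then an integer id.
RETICULATION_TAG = re.compile(r"#(H|LGT|R)?(\d+)")

def reticulation_tags(enewick):
    """Count occurrences of each reticulation tag in an eNewick-style string."""
    return Counter(match.group(0) for match in RETICULATION_TAG.finditer(enewick))

# Hypothetical network: leaf B descends from node #H1, which has two parents.
network = "((A,(B)#H1),(#H1,C));"
tags = reticulation_tags(network)
```

In this string the tag `#H1` occurs twice, once with its subtree and once as a bare reference, which is how a single node with two parents is expressed in an otherwise tree-shaped notation. A plain Newick string contains no such tags.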
Despite these promising advances, phylogeny-based HGT prediction methods still have limitations that must be considered: transfers between sister branches in a tree (often very closely related species) usually cannot be detected, and transfers of sparsely distributed genes may go undetected if the gene tree is consistent with the species tree or is inconclusive. Future research is likely to try to overcome or at least minimize these limitations, either through increased species sampling or by combining the power of phylogenetic and sequence-composition-based approaches.
Prediction of HGTs in metagenomic datasets is somewhat limited due to the novelty of this type of genomic data, the fact that the organism sources of the sequences are unknown, and the use of short sequence reads, which can lower the statistical power of HGT prediction methods. However, one recent study has designed novel composition and phylogenetic methods for the detection of HGT in several environmental samples. The authors show that their composition method and phylogenetic method detect different levels of HGT, at 0.8-1.5% and 2-8%, respectively. They note that these differences are likely to be due to the types of HGT detected by each method, illustrating just how far we still have to go in HGT prediction and the significant potential there is to improve HGT prediction by integrating approaches.
Regions of HGT are repeatedly being found to contain virulence genes or other genes of medical and/or ecological importance, so improved prediction of such regions from primary sequence data will continue to be of significant interest. Considering that researchers are still working out the details of how genes are transferred by conjugation, and that we are unsure of the origins of most regions of predicted HGT, we should not be surprised that prediction of HGT still has a long way to go. New computational methods are likely to be developed that improve on algorithm design by inclusion of new biological insights gained from increased sampling of our genetic world, or by better statistical modelling. The role of phages and other vehicles of HGT, in particular, may help shape some predictive methods. Prediction of the more specific boundaries of regions of HGT is one research area that needs more focus. More accurate bioinformatic methods are becoming even more important now, and should be a major goal, as the number of completed microbial genomes increases dramatically and the number of sequences from metagenomic studies and next-generation sequencing eclipses all other sequence data combined. Research that provides unbiased analysis and reviews of the accuracy of HGT methods should be encouraged so that researchers can utilize those methods that work best for their data (akin to what has been done for phylogenomics methods and genomic island prediction methods). As sequence coverage of our genetic world continues to grow and HGT prediction methods continue to improve, hopefully the origins of many HGT events will become clearer, and we will better understand these events that have played such a pivotal role in bacterial adaptation.
We would like to acknowledge the reviewers of this article, including Dr Robert Beiko, who waived his anonymity and contributed useful suggestions.
The electronic version of this article is the complete one and can be found at: http://F1000.com/Reports/Biology/content/1/25
The authors declare that they have no competing interests.