Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
molecular evolution; statistical inference; phylogenetics; evolutionary tree; statistical bias; variance
Rate heterogeneity among lineages is a common feature of molecular evolution, and it has long impeded our ability to accurately estimate the age of evolutionary divergence events. The development of relaxed molecular clocks, which model variable substitution rates among lineages, was intended to rectify this problem. Major subtypes of pandemic HIV-1 group M are thought to exemplify closely related lineages with different substitution rates. Here, we report that inferring the time of most recent common ancestor of all these subtypes in a single phylogeny under a single (relaxed) molecular clock produces significantly different dates for many of the subtypes than does analysis of each subtype on its own. We explore various methods to ameliorate this problem. We conclude that current molecular dating methods are inadequate for dealing with this type of substitution rate variation in HIV-1. Through simulation, we show that heterotachy causes root ages to be overestimated.
molecular clock; rate variation; HIV-1
Statistical methods for molecular dating of viral origins have been used extensively to infer the time of most common recent ancestor for many rapidly evolving pathogens. However, there are a number of cases, in which epidemiological, historical, or genomic evidence suggests much older viral origins than those obtained via molecular dating. We demonstrate how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures. We use codon-based substitution models to infer the length of branches in viral phylogenies; these models produce estimates that are often considerably longer than those obtained with traditional nucleotide-based substitution models. Correcting the apparent underestimation of branch lengths suggests substantially older origins for measles, Ebola, and avian influenza viruses. This work helps to reconcile some of the inconsistencies between molecular dating and other types of evidence concerning the age of viral lineages.
measles virus; rinderpest virus; Ebola virus; avian influenza virus; molecular clock; substitution rate; codon model; purifying selection
Standard genotypic antiretroviral resistance testing, performed by bulk sequencing, does not readily detect variants that comprise <20% of the circulating HIV-1 RNA population. Nevertheless, it is valuable in selecting an antiretroviral regimen after antiretroviral failure. In patients with poor adherence, resistant variants may not reach this threshold. Therefore, deep sequencing would be potentially valuable for detecting minority resistant variants. We compared bulk sequencing and deep sequencing to detect HIV-1 drug resistance at the time of a second-line protease inhibitor (PI)-based antiretroviral regimen failure. Eligibility criteria were virologic failure (HIV-1 RNA load of >500 copies/ml) of a first-line nonnucleoside reverse transcriptase inhibitor-based regimen, with at least the M184V mutation (lamivudine resistance), and second-line failure of a lopinavir/ritonavir (LPV/r)-based regimen. An amplicon-sequencing approach on the Roche 454 system was used. Six patients with viral loads of >90,000 copies/ml and one patient with a viral load of 520 copies/ml were included. Mutations not detectable by bulk sequencing during first- and second-line failure were detected by deep sequencing during second-line failure. Low-frequency variants (>0.5% of the sequence population) harboring major protease inhibitor resistance mutations were found in 5 of 7 patients despite poor adherence to the LPV/r-based regimen. In patients with intermittent adherence to a boosted PI regimen, deep sequencing may detect minority PI-resistant variants, which likely represent early events in resistance selection. In patients with poor or intermittent adherence, there may be low evolutionary impetus for such variants to reach fixation, explaining the low prevalence of PI resistance.
Adaptive evolution frequently occurs in episodic bursts, localized to a few sites in a gene, and to a small number of lineages in a phylogenetic tree. A popular class of “branch-site” evolutionary models provides a statistical framework to search for evidence of such episodic selection. For computational tractability, current branch-site models unrealistically assume that all branches in the tree can be partitioned a priori into two rigid classes—“foreground” branches that are allowed to undergo diversifying selective bursts and “background” branches that are negatively selected or neutral. We demonstrate that this assumption leads to unacceptably high rates of false positives or false negatives when the evolutionary process along background branches strongly deviates from modeling assumptions. To address this problem, we extend Felsenstein's pruning algorithm to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework. This enables us to model the process at every branch-site combination as a mixture of three Markov substitution models—our model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch. When benchmarked on a previously published set of simulated sequences, our method consistently matched or outperformed existing branch-site tests in terms of power and error rates. Using three empirical data sets, previously analyzed for episodic selection, we discuss how modeling assumptions can influence inference in practical situations.
episodic selection; random effects model; evolutionary model; branch-site model
The imprint of natural selection on protein coding genes is often difficult to identify because selection is frequently transient or episodic, i.e. it affects only a subset of lineages. Existing computational techniques, which are designed to identify sites subject to pervasive selection, may fail to recognize sites where selection is episodic: a large proportion of positively selected sites. We present a mixed effects model of evolution (MEME) that is capable of identifying instances of both episodic and pervasive positive selection at the level of an individual site. Using empirical and simulated data, we demonstrate the superior performance of MEME over older models under a broad range of scenarios. We find that episodic selection is widespread and conclude that the number of sites experiencing positive selection may have been vastly underestimated.
Identifying regions of protein coding genes that have undergone adaptive evolution is important to answering many questions in evolutionary biology and genetics. In order to tease out genetic evidence for natural selection, genes from a diverse array of taxa must be analyzed, only a subset of which may have undergone adaptive evolution; the same gene region may be under stabilizing or relaxed selection in lineages leading to other taxa. Most current computational methods designed to detect the imprint of natural selection at a site in a protein coding gene assume the strength and direction of natural selection is constant across all lineages. Here, we present a method to detect adaptive evolution, even when the selective forces are not constant across taxa. Using a variety of well-characterized genes, we find evidence suggesting that natural selection is generally episodic and that modeling it as such reveals that many more sites are subject to episodic positive selection than previously appreciated.
During the late 1980s and early 1990s, an estimated 10,000 Romanian children were infected with HIV-1 subtype F nosocomially through contaminated needles and blood transfusions. However, the geographic source and origins of this epidemic remain unclear.
Here we used phylogenetic inference and “relaxed” molecular clock dating analysis to further characterize the Romanian HIV-1 subtype F epidemic.
These analyses revealed a major lineage of Romanian HIV sequences consisting nearly entirely of virus sampled from adolescents and children and a distinct cluster that included a much higher ratio of adult sequences. Divergence time estimates inferred the time of most recent common ancestor of subtype F1 sequences to be 1973 (1966–1980) and for all Angolan sequences to 1975 (1968–1980). The most common ancestor of the Romanian sequences was dated to 1978 (1972–1983) with pediatric and adolescent sequences interspersed throughout the lineage. The phylogenetic structure of the entire subtype F epidemic suggests that multiple introductions of subtype F into Romania occurred either from the Angolan epidemic or from more distant ancestors. Since the historical records note that the Romanian pediatric epidemic did not begin until the late 1980s, the inferred time of most recent common ancestor of the Romanian lineage of 1978 suggests that there were multiple introductions of subtype F occurred into the pediatric population from HIV already circulating in Romania.
Analysis of the subtype F HIV-1 epidemic in an historical context allows for a deeper appreciation of how the HIV pandemic has been influenced by socio-political events.
Phylogeography; Romania; Subtype F; Socio-political; HIV
Much molecular-evolution research is concerned with sequence analysis. Yet these sequences represent real, three-dimensional molecules with complex structure and function. Here I highlight a growing trend in the field to incorporate molecular structure and function into computational molecular-evolution work. I consider three focus areas: reconstruction and analysis of past evolutionary events, such as phylogenetic inference or methods to infer selection pressures; development of toy models and simulations to identify fundamental principles of molecular evolution; and atom-level, highly realistic computational modeling of molecular structure and function aimed at making predictions about possible future evolutionary events.
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance, and neither one test for episodic diversifying selection nor another for constant directional selection are able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance.
When exposed to treatment, HIV-1 and other rapidly evolving viruses have the capacity to acquire drug resistance mutations (DRAMs), which limit the efficacy of antivirals. There are a number of experimentally well characterized HIV-1 DRAMs, but many mutations whose roles are not fully understood have also been reported. In this manuscript we construct evolutionary models that identify the locations and targets of mutations conferring resistance to antiretrovirals from viral sequences sampled from treated and untreated individuals. While the evolution of drug resistance is a classic example of natural selection, existing analyses fail to detect the majority of DRAMs. We show that, in order to identify resistance mutations from sequence data, it is necessary to recognize that in this case natural selection is both episodic (it only operates when the virus is exposed to the drugs) and directional (only mutations to a particular amino-acid confer resistance while allowing the virus to continue replicating). The new class of models that allow for the episodic and directional nature of adaptive evolution performs very well at recovering known DRAMs, can be useful at identifying unknown resistance-associated mutations, and is generally applicable to a variety of biological scenarios where similar selective forces are at play.
Reports of a high frequency of the transmission of minority viral populations with drug-resistant mutations (DRM) are inconsistent with evidence that HIV-1 infections usually arise from mono- or oligoclonal transmission. We performed ultradeep sequencing (UDS) of partial HIV-1 gag, pol, and env genes from 32 recently infected individuals. We then evaluated overall and per-site diversity levels, selective pressure, sequence reproducibility, and presence of DRM and accessory mutations (AM). To differentiate biologically meaningful mutations from those caused by methodological errors, we obtained multinomial confidence intervals (CI) for the proportion of DRM at each site and fitted a binomial mixture model to determine background error rates for each sample. We then examined the association between detected minority DRM and the virologic failure of first-line antiretroviral therapy (ART). Similar to other studies, we observed increased detection of DRM at low frequencies (average, 0.56%; 95% CI, 0.43 to 0.69; expected UDS error, 0.21 ± 0.08% mutations/site). For 8 duplicate runs, there was variability in the proportions of minority DRM. There was no indication of increased diversity or selection at DRM sites compared to other sites and no association between minority DRM and AM. There was no correlation between detected minority DRM and clinical failure of first-line ART. It is unlikely that minority viral variants harboring DRM are transmitted and maintained in the recipient host. The majority of low-frequency DRM detected using UDS are likely errors inherent to UDS methodology or a consequence of error-prone HIV-1 replication.
Effective population screening of HIV and prevention of HIV transmission are only part of the global fight against AIDS. Community-level effects, for example those aimed at thwarting future transmission, are potential outcomes of treatment and may be important in stemming the epidemic. However, current clinical trial designs are incapable of detecting a reduction in future transmission due to treatment. We took advantage of the fact that HIV is an evolving pathogen whose transmission network can be reconstructed using genetic sequence information to address this shortcoming. Here, we use an HIV transmission network inferred from recently infected men who have sex with men (MSM) in San Diego, California. We developed and tested a network-based statistic for measuring treatment effects using simulated clinical trials on our inferred transmission network. We explored the statistical power of this network-based statistic against conventional efficacy measures and find that when future transmission is reduced, the potential for increased statistical power can be realized. Furthermore, our simulations demonstrate that the network statistic is able to detect community-level effects (e.g., reduction in onward transmission) of HIV treatment in a clinical trial setting. This study demonstrates the potential utility of a network-based statistical metric when investigating HIV treatment options as a method to reduce onward transmission in a clinical trial setting.
Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites, which rapidly escape host-immune pressure and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analyses options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.
Availability and documentation: http://www.datamonkey.org
The endangered Przewalski's horse is the closest relative of the domestic horse and is the only true wild horse species surviving today. The question of whether Przewalski's horse is the direct progenitor of domestic horse has been hotly debated. Studies of DNA diversity within Przewalski's horses have been sparse but are urgently needed to ensure their successful reintroduction to the wild. In an attempt to resolve the controversy surrounding the phylogenetic position and genetic diversity of Przewalski's horses, we used massively parallel sequencing technology to decipher the complete mitochondrial and partial nuclear genomes for all four surviving maternal lineages of Przewalski's horses. Unlike single-nucleotide polymorphism (SNP) typing usually affected by ascertainment bias, the present method is expected to be largely unbiased. Three mitochondrial haplotypes were discovered—two similar ones, haplotypes I/II, and one substantially divergent from the other two, haplotype III. Haplotypes I/II versus III did not cluster together on a phylogenetic tree, rejecting the monophyly of Przewalski's horse maternal lineages, and were estimated to split 0.117–0.186 Ma, significantly preceding horse domestication. In the phylogeny based on autosomal sequences, Przewalski's horses formed a monophyletic clade, separate from the Thoroughbred domestic horse lineage. Our results suggest that Przewalski's horses have ancient origins and are not the direct progenitors of domestic horses. The analysis of the vast amount of sequence data presented here suggests that Przewalski's and domestic horse lineages diverged at least 0.117 Ma but since then have retained ancestral genetic polymorphism and/or experienced gene flow.
wild horse; next-generation sequencing; mitochondrial DNA; nuclear DNA; phylogeny
Rapidly evolving viruses such as HIV-1 display extensive sequence variation in response to host-specific selection, while simultaneously maintaining functions that are critical to replication and infectivity. This apparent conflict between diversifying and purifying selection may be resolved by an abundance of epistatic interactions such that the same functional requirements can be met by highly divergent sequences. We investigate this hypothesis by conducting an extensive characterization of sequence variation in the HIV-1 nef gene that encodes a highly variable multifunctional protein. Population-based sequences were obtained from 686 patients enrolled in the HOMER cohort in British Columbia, Canada, from which the distribution of nonsynonymous substitutions in the phylogeny was reconstructed by maximum likelihood. We used a phylogenetic comparative method on these data to identify putative epistatic interactions between residues. Two interactions (Y120/Q125 and N157/S169) were chosen to further investigate within-host evolution using HIV-1 RNA extractions from plasma samples from eight patients. Clonal sequencing confirmed strong linkage between polymorphisms at these sites in every case. We used massively parallel pyrosequencing (MPP) to reconstruct within-host evolution in these patients. Experimental error associated with MPP was quantified by performing replicates at two different stages of the protocol, which were pooled prior to analysis to reduce this source of variation. Phylogenetic reconstruction from these data revealed correlated substitutions at Y120/Q125 or N157/S169 repeated across multiple lineages in every host, indicating convergent within-host evolution shaped by epistatic interactions.
coevolution; epistasis; HIV-1; next-generation sequencing; ancestral reconstruction; sequencing error
Over time, natural selection molds every gene into a unique mosaic of sites evolving rapidly or resisting change—an “evolutionary fingerprint” of the gene. Aspects of this evolutionary fingerprint, such as the site-specific ratio of nonsynonymous to synonymous substitution rates (dN/dS), are commonly used to identify genetic features of potential biological interest; however, no framework exists for comparing evolutionary fingerprints between genes. We hypothesize that protein-coding genes with similar protein structure and/or function tend to have similar evolutionary fingerprints and that comparing evolutionary fingerprints can be useful for discovering similarities between genes in a way that is analogous to, but independent of, discovery of similarity via sequence-based comparison tools such as Blast.
To test this hypothesis, we develop a novel model of coding sequence evolution that uses a general bivariate discrete parameterization of the evolutionary rates. We show that this approach provides a better fit to the data using a smaller number of parameters than existing models. Next, we use the model to represent evolutionary fingerprints as probability distributions and present a methodology for comparing these distributions in a way that is robust against variations in data set size and divergence. Finally, using sequences of three rapidly evolving RNA viruses (HIV-1, hepatitis C virus, and influenza A virus), we demonstrate that genes within the same functional group tend to have similar evolutionary fingerprints. Our framework provides a sound statistical foundation for efficient inference and comparison of evolutionary rate patterns in arbitrary collections of gene alignments, clustering homologous and nonhomologous genes, and investigation of biological and functional correlates of evolutionary rates.
adaptive evolution; codon models; evolutionary distance; machine classification
Although it is known that most HIV-1 infections worldwide result from exposure to virus in semen, it has not yet been established whether transmitted strains originate as RNA virions in seminal plasma or as integrated proviral DNA in infected seminal leukocytes. We present phylogenetic evidence that among six transmitting pairs of men who have sex with men, blood plasma virus in the recipient is consistently more closely related to the seminal plasma virus in the source. All sequences were subtype B, and the env C2V3 of transmitted variants tended to have higher mean isoelectric points, contain potential N-linked glycosylation sites, and favor CCR5 co-receptor usage. A statistically robust phylogenetically corrected analysis did not detect genetic signatures reliably associated with transmission, but further investigation of larger samples of transmitting pairs holds promise for determining which structural and genetic features of viral genomes are associated with transmission.
Although vaccines pose the best means of preventing influenza infection, strain selection and optimal implementation remain difficult due to antigenic drift and a lack of understanding global spread. Detecting viral movement by sequence analysis is complicated by skewed geographic and seasonal distributions in viral isolates. We propose a probabilistic method that accounts for sampling bias through spatiotemporal clustering and modeling regional and seasonal transmission as a binomial process. Analysis of H3N2 not only confirmed East-Southeast Asia as a source of new seasonal variants, but also increased the resolution of observed transmission to a country level. H1N1 data revealed similar viral spread from the tropics. Network analysis suggested China and Hong Kong as the origins of new seasonal H3N2 strains and the United States as a region where increased vaccination would maximally disrupt global spread of the virus. These techniques provide a promising methodology for the analysis of any seasonal virus, as well as for the continued surveillance of influenza.
As evidenced by several historic vaccine failures, the design and implementation of the influenza vaccine remains an imperfect science. The virus's rapid rate of evolution makes the selection of representative strains for vaccine composition a difficult process. From a global health viewpoint, how to optimally implement a limited stockpile of vaccines is another fundamental question that remains unanswered. An understanding of how influenza spreads around the world would greatly aid the design and implementation process, but regional and seasonal bias in collected virus samples hampers epidemiologic analysis. Here, we show that it is possible to counter this data bias through probabilistic modeling and represent the global viral spread as a network of seeding events between different regions of the world. On a local scale, our technique can output the most likely origins of a virus circulating in a given location. On a global scale, we can pinpoint regions of the world that would maximally disrupt viral transmission with an increase in vaccine implementation. We demonstrate our method on seasonal H3N2 and H1N1 and foresee similar application to other seasonal viruses, including swine-origin H1N1, once more seasonal data is collected.
Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene and organism specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.
Evolution in protein-coding DNA sequences can be modeled at three levels: nucleotides, amino acids or codons that encode the amino acids. Codon models incorporate nucleotide and amino acid information, and allow the estimation of the rate at which amino acids are replaced () versus the rate at which they are preserved (). The ratio has been used in thousands of studies to detect molecular footprints of natural selection. A serious limitation of most codon models is the unrealistic assumption that all non-synonymous substitutions occur at the same rate. Indeed, amino acid models have consistently demonstrated that different residues are exchanged more or less frequently, depending on incompletely understood factors. We derive and validate a computational approach for inferring codon models which combine the power to investigate natural selection with data-driven amino acid substitution biases from alignments. The addition of amino acid properties can lead to more powerful and accurate methods for studying natural selection and the evolutionary history of protein-coding sequences. The pattern of amino acid substitutions specific to a given alignment can be used to compare and contrast the evolutionary properties of different genes, providing an evolutionary analog to protein family comparisons.
Noroviruses are the most common cause of viral gastroenteritis. An increase in the number of globally reported norovirus outbreaks was seen the past decade, especially for outbreaks caused by successive genogroup II genotype 4 (GII.4) variants. Whether this observed increase was due to an upswing in the number of infections, or to a surveillance artifact caused by heightened awareness and concomitant improved reporting, remained unclear. Therefore, we set out to study the population structure and changes thereof of GII.4 strains detected through systematic outbreak surveillance since the early 1990s. We collected 1383 partial polymerase and 194 full capsid GII.4 sequences. A Bayesian MCMC coalescent analysis revealed an increase in the number of GII.4 infections during the last decade. The GII.4 strains included in our analyses evolved at a rate of 4.3–9.0×10−3 mutations per site per year, and share a most recent common ancestor in the early 1980s. Determinants of adaptation in the capsid protein were studied using different maximum likelihood approaches to identify sites subject to diversifying or directional selection and sites that co-evolved. While a number of the computationally determined adaptively evolving sites were on the surface of the capsid and possible subject to immune selection, we also detected sites that were subject to constrained or compensatory evolution due to secondary RNA structures, relevant in virus-replication. We highlight codons that may prove useful in identifying emerging novel variants, and, using these, indicate that the novel 2008 variant is more likely to cause a future epidemic than the 2007 variant. While norovirus infections are generally mild and self-limiting, more severe outcomes of infection frequently occur in elderly and immunocompromized people, and no treatment is available. The observed pattern of continually emerging novel variants of GII.4, causing elevated numbers of infections, is therefore a cause for concern.
Noroviruses, known as the viruses that cause the ‘stomach flu’ or as the ‘cruise ship virus’, cause sporadic cases and large outbreaks of gastrointestinal illness in humans. An increase in norovirus outbreaks was reported globally around 2002. Doubts remained as to whether this increase was real, or caused by improved detection-techniques and increased awareness. This study was performed to address this ambiguity, and to determine the possible virological causes for such changes. Using a population genetic approach, we studied sequences of epidemic norovirus strains collected through time and we indeed demonstrated expanding epidemic dynamics. Global epidemics were caused by subsequent variants of norovirus, observed in 2002, 2004 and 2006 and at a smaller scale in 1996, whereas no evidence for such epidemic evolutionary patterns occurring previous to these peaks. Based on the sequences analyzed the strains of the genotype under study here were shown to have circulated at least since the early 1980s, and likely earlier. We showed that not only surface exposed sites on the outside of the virus shell were under selective pressure, involved in avoiding host immune responses, but also codons that are apparently conserved for the purpose of virus replication.
Current public health efforts often use molecular technologies to identify and contain communicable disease networks, but not for HIV. Here, we investigate how molecular epidemiology can be used to identify highly-related HIV networks within a population and how voluntary contact tracing of sexual partners can be used to selectively target these networks.
We evaluated the use of HIV-1 pol sequences obtained from participants of a community-recruited cohort (n=268) and a primary infection research cohort (n=369) to define highly related transmission clusters and the use of contact tracing to link other individuals (n=36) within these clusters. The presence of transmitted drug resistance was interpreted from the pol sequences (Calibrated Population Resistance v3.0).
Phylogenetic clustering was conservatively defined when the genetic distance between any two pol sequences was <1%, which identified 34 distinct transmission clusters within the combined community-recruited and primary infection research cohorts containing 160 individuals. Although sequences from the epidemiologically-linked partners represented approximately 5% of the total sequences, they clustered with 60% of the sequences that clustered from the combined cohorts (O.R. 21.7; p=<0.01). Major resistance to at least one class of antiretroviral medication was found in 19% of clustering sequences.
Phylogenetic methods can be used to identify individuals who are within highly related transmission groups, and contact tracing of epidemiologically-linked partners of recently infected individuals can be used to link into previously-defined transmission groups. These methods could be used to implement selectively targeted prevention interventions.
molecular epidemiology; HIV; surveillance; contact tracing; drug resistance
Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (≈5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.
There are nine different subtypes of the main group of HIV-1, each originating as a distinct subepidemic of HIV-1. The distribution of subtypes is often unique to a given geographic region of the world and constitutes a useful epidemiological and surveillance resource. The effects of viral subtype on disease progression, treatment outcome and vaccine design are being actively researched, and the importance of accurate subtyping procedures is clear. In HIV-1, subtype assignment is complicated by frequent recombination among co-circulating strains, creating new genetic mosaics or recombinant forms: 43 have been characterized to date, and many more likely exist. We present an automated phylogenetic method (SCUEAL) to accurately characterize both simple and complex HIV-1 mosaics. Using computer simulations and biological data we demonstrate that SCUEAL performs very well under various conditions, especially when some of the existing classification procedures fail. Furthermore, we show that a small, but noticeable proportion of subtype characterization stored in public databases may be incomplete or incorrect. The computational technique introduced here should provide a much more accurate characterization of HIV-1 strains, especially novel recombinants, and lead to new insights into molecular history, epidemiology and geographical distribution of the virus.
Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a ÔhiddenÕ population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.
Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well-suited to modeling tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually-transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly-available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: firstly, IDUs tended to emulate the recruitment behavior of their own recruiter; and secondly, the recruitment of like peers (homophily) was dependent on the number of recruits.
SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure that has no representation in Markov chain-based models can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.
Spidermonkey is a new component of the Datamonkey suite of phylogenetic tools that provides methods for detecting coevolving sites from a multiple alignment of homologous nucleotide or amino acid sequences. It reconstructs the substitution history of the alignment by maximum likelihood-based phylogenetic methods, and then analyzes the joint distribution of substitution events using Bayesian graphical models to identify significant associations among sites.
Availability: Spidermonkey is publicly available both as a web application at http://www.data-monkey.org and as a stand-alone component of the phylogenetic software package HyPhy, which is freely distributed on the web (http://www.hyphy.org) as precompiled binaries and open source.
We develop a model-based phylogenetic maximum likelihood test for evidence of preferential substitution toward a given residue at individual positions of a protein alignment—directional evolution of protein sequences (DEPS). DEPS can identify both the target residue and sites evolving toward it, help detect selective sweeps and frequency-dependent selection—scenarios that confound most existing tests for selection, and achieve good power and accuracy on simulated data. We applied DEPS to alignments representing different genomic regions of influenza A virus (IAV), sampled from avian hosts (H5N1 serotype) and human hosts (H3N2 serotype), and identified multiple directionally evolving sites in 5/8 genomic segments of H5N1 and H3N2 IAV. We propose a simple descriptive classification of directionally evolving sites into 5 groups based on the temporal distribution of residue frequencies and document known functional correlates, such as immune escape or host adaptation.
directional selection; evolution of influenza; maximum likelihood; episodic selection