A phylogeny is a tree-based model of common ancestry that is an indispensable tool for studying biological variation. Phylogenies play a special role in the study of rapidly evolving populations such as viruses, where the proliferation of lineages is constantly being shaped by the mode of virus transmission, by adaptation to immune systems, and by patterns of human migration and contact. These processes may leave an imprint on the shapes of virus phylogenies that can be extracted for comparative study; however, tree shapes are intrinsically difficult to quantify. Here we present a comprehensive study of phylogenies reconstructed from 38 different RNA viruses from 12 taxonomic families that are associated with human pathologies. To accomplish this, we have developed a new procedure for studying phylogenetic tree shapes based on the ‘kernel trick’, a technique that maps complex objects into a statistically convenient space. We show that our kernel method outperforms nine different tree balance statistics at correctly classifying phylogenies that were simulated under different evolutionary scenarios. Using the kernel method, we observe patterns in the distribution of RNA virus phylogenies in this space that reflect modes of transmission and pathogenesis. For example, viruses that can establish persistent chronic infections (such as HIV and hepatitis C virus) form a distinct cluster. Although the visibly ‘star-like’ shape characteristic of trees from these viruses has been well-documented, we show that established methods for quantifying tree shape fail to distinguish these trees from those of other viruses. The kernel approach presented here potentially represents an important new tool for characterizing the evolution and epidemiology of RNA viruses.
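For context, a conventional tree balance statistic of the kind used as a baseline above can be computed in a few lines. The sketch below implements the Colless index, which sums the absolute difference in tip counts between the two subtrees at every internal node. Whether the Colless index was one of the nine statistics compared in the study is an assumption, and the nested-tuple tree encoding and the function name `colless_index` are illustrative only.

```python
def colless_index(tree):
    """Return (number of tips, Colless imbalance) for a rooted binary tree.

    Hypothetical encoding: a tip is any non-tuple value and an internal
    node is a 2-tuple of subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless_index(left)
    n_right, c_right = colless_index(right)
    # Each internal node contributes |tips on left - tips on right|.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


balanced = (("A", "B"), ("C", "D"))
ladder = ((("A", "B"), "C"), "D")

print(colless_index(balanced))  # (4, 0)
print(colless_index(ladder))    # (4, 3)
```

A statistic like this reduces the whole tree to one number, so trees with quite different shapes can receive similar scores. That loss of information is the kind of limitation a kernel-based comparison of tree shapes is intended to avoid.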
Human immunodeficiency virus type 1 (HIV-1) V3 loop sequence can be used to infer viral coreceptor use. The effect of input copy number on population-based sequencing of the V3 loop of HIV-1 was examined through replicate deep and population-based sequencing of samples with known tropism, a heterogeneous clinical sample (624 population-based sequences and 47 deep-sequencing replicates), and a large cohort of clinical samples from phase III clinical trials of maraviroc including the MOTIVATE/A4001029 studies (n = 1,521). Proviral DNA from two independent samples from each of 101 patients from the MOTIVATE/A4001029 studies was also analyzed. Cumulative technical error occurred at a rate of 3 × 10⁻⁴ mismatches/bp, without observed effect on inferred tropism. Increasing PCR replication increased minority species detection with an ∼10% minority population detected in 18% of cases using a single replicate at a viral load of 1,072 copies/ml and in 44% of cases using three replicates. The nucleotide prevalences detected by population-based and deep sequencing were highly correlated (Spearman's ρ, 0.73), and the accuracy increased with increasing input copy number (P < 0.001). Triplicate sequencing was able to predict tropism changes in the MOTIVATE/A4001029 studies for both low (P = 0.05) and high (P = 0.02) viral loads. Sequences derived from independently extracted and processed samples of proviral DNA for the same patient were equivalent to replicates from the same extraction (P = 0.45) and had correlated position-specific scoring matrix scores (Spearman's ρ, 0.75; P ≪ 0.001); however, concordance in tropism inference was only 83%. Input copy number and PCR replication are important factors in minority species detection in samples with significant heterogeneity.
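As a back-of-envelope check on the replicate figures above, one can ask what detection rate three replicates would give if each replicate detected the minority population independently with the single-replicate probability of 18%. Independence and a constant per-replicate probability are assumptions made here for illustration; they are not claims from the study, and the function name is hypothetical.

```python
def detection_probability(p_single, replicates):
    """Chance that at least one of `replicates` runs detects the variant.

    Assumes every run is independent and has the same detection
    probability `p_single` (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_single) ** replicates


p3 = detection_probability(0.18, 3)
print(round(p3, 3))  # 0.449
```

Under these assumptions the expected triplicate detection rate is about 45%, which is close to the 44% reported for three replicates.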
At the early stage of infection, human immunodeficiency virus (HIV)-1 predominantly uses the CCR5 coreceptor for host cell entry. The subsequent emergence of HIV variants that use the CXCR4 coreceptor in roughly half of all infections is associated with an accelerated decline of CD4+ T-cells and rate of progression to AIDS. The presence of a ‘fitness valley’ separating CCR5- and CXCR4-using genotypes is postulated to be a biological determinant of whether the HIV coreceptor switch occurs. Using phylogenetic methods to reconstruct the evolutionary dynamics of HIV within hosts enables us to discriminate between competing models of this process. We have developed a phylogenetic pipeline for the molecular clock analysis, ancestral reconstruction, and visualization of deep sequence data. These data were generated by next-generation sequencing of HIV RNA extracted from longitudinal serum samples (median 7 time points) from 8 untreated subjects with chronic HIV infections (Amsterdam Cohort Studies on HIV-1 infection and AIDS). We used the known dates of sampling to directly estimate rates of evolution and to map ancestral mutations to a reconstructed timeline in units of days. HIV coreceptor usage was predicted from reconstructed ancestral sequences using the geno2pheno algorithm. We determined that the first mutations contributing to CXCR4 use emerged about 16 (per subject range 4 to 30) months before the earliest predicted CXCR4-using ancestor, which preceded the first positive cell-based assay of CXCR4 usage by 10 (range 5 to 25) months. CXCR4 usage arose in multiple lineages within 5 of 8 subjects, and ancestral lineages following alternate mutational pathways before going extinct were common. We observed highly patient-specific distributions and time-scales of mutation accumulation, implying that the role of a fitness valley is contingent on the genotype of the transmitted variant.
At the start of infection, human immunodeficiency virus (HIV) generally requires a specific protein receptor (CCR5) on the cell surface to bind and enter the cell. In roughly half of all HIV infections, the virus population eventually switches to using a different receptor (CXCR4). This ‘HIV coreceptor switch’ is associated with an accelerated rate of progression to AIDS. Although it is not known why this switch occurs in some infections and not others, it is thought to be shaped by constraints on how HIV can evolve from one mode to another. In this study, we test this hypothesis by reconstructing the evolutionary histories of HIV within 8 patients known to have undergone an HIV coreceptor switch. Each history is recreated from samples of HIV genetic sequences that were derived from repeated blood samples by next-generation sequencing, an emerging technology that is rapidly becoming an essential tool in the study of rapidly-evolving populations such as viruses or cancerous cells. Because we have samples from different points in time, we can use models of evolution to extrapolate back in time to the ancestors of each infection. Our analysis reveals patient-specific dynamics in HIV evolution that shed new light on the determinants of the coreceptor switch.
The heterogametic sex chromosomes (i.e. mammalian Y and avian W) do not usually recombine with the homogametic sex chromosomes, which is known to lead to rapid degeneration of Y and W due to the accumulation of deleterious mutations. On the other hand, some 96% of amphibian species have homomorphic, i.e. non-degenerate, sex chromosomes. Nicolas Perrin's fountain-of-youth hypothesis states that this is a result of recombination between X and Y chromosomes in sex-reversed individuals. In this study, I model the consequences of such recombination for the dynamics of a deleterious mutation occurring in Y chromosomes. As expected, even relatively low levels of sex reversal help to purge deleterious mutations. However, the population-dynamic consequences of this depend on the type of selection that operates on the population undergoing sex reversal. Under fecundity selection, sex reversal can be beneficial for some parameter values, whereas under survival selection, it seems to be always harmful.
Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites that rapidly escape host-immune pressure, and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analysis options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.
Availability and documentation: http://www.datamonkey.org
Environmental metagenomics provides snippets of genomic sequences from all organisms in an environmental sample and is an unprecedented resource of information for investigating microbial population genetics. Current analytical methods, however, are poorly equipped to handle metagenomic data, particularly of short, unlinked sequences. A custom analytical pipeline was developed to calculate dN/dS ratios, a common metric to evaluate the role of selection in the evolution of a gene, from environmental metagenomes sequenced using 454 technology of flow-sorted populations of marine Synechococcus, the dominant cyanobacteria in coastal environments. The large majority of genes (98%) have evolved under purifying selection (dN/dS<1). The metagenome sequence coverage of the reference genomes was not uniform and genes that were highly represented in the environment (i.e. high read coverage) tended to be more evolutionarily conserved. Of the genes that may have evolved under positive selection (dN/dS>1), 77 out of 83 (93%) were hypothetical. Notable among annotated genes, ribosomal protein L35 appears to be under positive selection in one Synechococcus population. Other annotated genes, in particular a possible porin, a large-conductance mechanosensitive channel, an ATP binding component of an ABC transporter, and a homologue of a pilus retraction protein had regions of the gene with elevated dN/dS. With the increasing use of next-generation sequencing in metagenomic investigations of microbial diversity and ecology, analytical methods need to accommodate the peculiarities of these data streams. By developing a means to analyze population diversity data from these environmental metagenomes, we have provided the first insight into the role of selection in the evolution of Synechococcus, a globally significant primary producer.
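To make the dN/dS metric concrete, the sketch below computes a crude ratio of nonsynonymous to synonymous differences per site for two aligned, in-frame sequences, in the spirit of simple counting methods. It is not the custom pipeline described above: it assumes the standard genetic code, ignores codons that differ at more than one position, treats changes to stop codons as nonsynonymous, and applies no correction for multiple substitutions. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (0 to 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1.0 / 3.0
    return total


def pn_ps_ratio(seq1, seq2):
    """Crude pN/pS for two aligned in-frame sequences; None if undefined."""
    syn_sites = nonsyn_sites = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        ndiff = sum(a != b for a, b in zip(c1, c2))
        if ndiff > 1:
            continue  # simplification: multi-hit codons are ignored
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2.0
        syn_sites += s
        nonsyn_sites += 3.0 - s
        if ndiff == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    if syn_sites == 0 or nonsyn_sites == 0 or syn_diffs == 0:
        return None
    return (nonsyn_diffs / nonsyn_sites) / (syn_diffs / syn_sites)


ratio = pn_ps_ratio("ATGGGGGCTTTTAAA", "ATGGGAGCTTTTAGA")
print(round(ratio, 3))  # 0.233
```

In this toy pair there is one synonymous change (GGG to GGA) and one nonsynonymous change (AAA to AGA), but far more nonsynonymous than synonymous sites, so the ratio falls below 1, the pattern described above as purifying selection.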
Microbial diseases are important selective agents in social insects and one major defense mechanism is the secretion of cuticular antimicrobial compounds. We hypothesized that given differences in group size, social complexity, and nest type the secretions of these antimicrobials will be under different selective pressures. To test this we extracted secretions from nine wasp species of varying social complexity and nesting habits and assayed their antimicrobial compounds against cultures of Staphylococcus aureus. These data were then combined with phylogenetic data to provide an evolutionary context. Social species showed significantly higher (18x) antimicrobial activity than solitary species and species with paper nests showed significantly higher (11x) antimicrobial activity than those which excavated burrows. Mud-nest species showed no antimicrobial activity. Solitary, burrow-provisioning wasps diverged at more basal nodes of the phylogenetic trees, while social wasps diverged from the most recent nodes. These data suggest that antimicrobial defences may have evolved in response to ground-dwelling pathogens but the most important variable leading to increased antimicrobial strength was increase in group size and social complexity.
Initial in vitro studies of bevirimat resistance failed to observe mutations in the clinically significant QVT motif in SP1 of HIV-1 gag. This study presents a novel screening method involving mixed, clinically derived gag-protease recombinant HIV-1 samples to more accurately mimic the selection of resistance seen in vivo. Bevirimat resistance was investigated via population-based sequencing performed with a large, initially antiretroviral-naïve cohort before (n = 805) and after (n = 355) standard HIV therapy (without bevirimat). The prevalence of any polymorphism in the motif comprising Q, V, and T was ∼6%, 29%, and 12%, respectively, and did not change appreciably over the course of therapy. From these samples, three groups of 10 samples whose bulk sequences were wild type at the QVT motif were used to generate gag-protease recombinant viruses that captured the existing diversity. Groups were mixed and passaged with various bevirimat concentrations for 9 weeks. gag variations were assessed by amplicon-based “deep” sequencing using a GS FLX sequencer (Roche). Unscreened mutations were present in all groups, and a V370A minority not originally detected by bulk sequencing was present in one group. V370A, occurring together with another preexisting, unscreened resistance mutation, was selected in all groups in the presence of a bevirimat concentration above 0.1 μM. For the two groups with V370A levels below consistent detectability by deep sequencing, the initial selection of V370A required 3 to 4 weeks of exposure to a narrow range of bevirimat concentrations, whereas for the group with the V370A minority, selection occurred immediately. This approach provides quasispecies diversity that facilitates the selection of mutations observed in clinical trials and, coupled with deep sequencing, could represent an efficient in vitro screening method for detecting resistance mutations.
Human immunodeficiency virus type 1 (HIV-1) genomes often carry one or more mutations associated with drug resistance upon transmission into a therapy-naïve individual. We assessed the prevalence and clinical significance of transmitted drug resistance (TDR) in chronically-infected therapy-naïve patients enrolled in a multi-center cohort in North America. Pre-therapy clinical significance was quantified by plasma viral load (pVL) and CD4+ cell count (CD4) at baseline. Naïve bulk sequences of HIV-1 protease and reverse transcriptase (RT) were screened for resistance mutations as defined by the World Health Organization surveillance list. The overall prevalence of TDR was 14.2%. We used a Bayesian network to identify co-transmission of TDR mutations in clusters associated with specific drugs or drug classes. Aggregate effects of mutations by drug class were estimated by fitting linear models of pVL and CD4 on weighted sums over TDR mutations according to the Stanford HIV Database algorithm. Transmitted resistance to both classes of reverse transcriptase inhibitors was significantly associated with lower CD4, but had opposing effects on pVL. In contrast, position-specific analyses of TDR mutations revealed substantial effects on CD4 and pVL at several residue positions that were being masked in the aggregate analyses, and significant interaction effects as well. Residue positions in RT with predominant effects on CD4 or pVL (D67 and M184) were re-evaluated in causal models using an inverse probability-weighting scheme to address the problem of confounding by other mutations and demographic or risk factors. We found that causal effect estimates of mutations M184V/I ( pVL) and D67N/G ( and pVL) were compensated by K103N/S and K219Q/E/N/R. 
As TDR becomes an increasing dilemma in this modern era of highly-active antiretroviral therapy, these results have immediate significance for the clinical management of HIV-1 infections and our understanding of the ongoing adaptation of HIV-1 to human populations.
DNA barcoding and other DNA sequence-based techniques for investigating and estimating biodiversity require explicit methods for associating individual sequences with taxa, as it is at the taxon level that biodiversity is assessed. For many projects, the bioinformatic analyses required pose problems for laboratories whose prime expertise is not in bioinformatics. User-friendly tools are required for both clustering sequences into molecular operational taxonomic units (MOTU) and for associating these MOTU with known organismal taxonomies.
Here we present jMOTU, a Java program for the analysis of DNA barcode datasets that uses an explicit, determinate algorithm to define MOTU. We demonstrate its usefulness for both individual specimen-based Sanger sequencing surveys and bulk-environment metagenetic surveys using long-read next-generation sequencing data. jMOTU is driven through a graphical user interface, and can analyse tens of thousands of sequences in a short time on a desktop computer. A companion program, Taxonerator, that adds traditional taxonomic annotation to MOTU, is also presented. Clustering and taxonomic annotation data are stored in a relational database, and are thus amenable to subsequent data mining and web presentation.
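The idea of defining MOTU with an explicit, deterministic rule can be illustrated with a small sketch. The example below groups sequences by single-linkage clustering: any two sequences that differ at no more than `cutoff` positions are placed in the same cluster. This is not jMOTU's actual algorithm, which is not described here; it assumes pre-aligned sequences of equal length, and the function name `cluster_motu` is hypothetical.

```python
def cluster_motu(seqs, cutoff):
    """Single-linkage clustering of aligned sequences by mismatch count.

    seqs:   dict mapping a sequence name to an aligned sequence string
    cutoff: maximum number of mismatches for two sequences to be linked
    """
    names = list(seqs)
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            mismatches = sum(x != y for x, y in zip(seqs[a], seqs[b]))
            if mismatches <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


barcodes = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGTACAA",
    "s4": "TTTTGGGG",
}
print(cluster_motu(barcodes, cutoff=1))  # [['s1', 's2', 's3'], ['s4']]
```

Note that s1 and s3 differ at two positions but still share a cluster because each is within one mismatch of s2. This chaining is a property of single linkage, and the all-pairs comparison used here would need a faster search strategy to scale to tens of thousands of sequences.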
jMOTU efficiently and robustly identifies the molecular taxa present in survey datasets, and Taxonerator decorates the MOTU with putative identifications. jMOTU and Taxonerator are freely available from http://www.nematodes.org/.
Rapidly evolving viruses such as HIV-1 display extensive sequence variation in response to host-specific selection, while simultaneously maintaining functions that are critical to replication and infectivity. This apparent conflict between diversifying and purifying selection may be resolved by an abundance of epistatic interactions such that the same functional requirements can be met by highly divergent sequences. We investigate this hypothesis by conducting an extensive characterization of sequence variation in the HIV-1 nef gene that encodes a highly variable multifunctional protein. Population-based sequences were obtained from 686 patients enrolled in the HOMER cohort in British Columbia, Canada, from which the distribution of nonsynonymous substitutions in the phylogeny was reconstructed by maximum likelihood. We used a phylogenetic comparative method on these data to identify putative epistatic interactions between residues. Two interactions (Y120/Q125 and N157/S169) were chosen to further investigate within-host evolution using HIV-1 RNA extractions from plasma samples from eight patients. Clonal sequencing confirmed strong linkage between polymorphisms at these sites in every case. We used massively parallel pyrosequencing (MPP) to reconstruct within-host evolution in these patients. Experimental error associated with MPP was quantified by performing replicates at two different stages of the protocol, which were pooled prior to analysis to reduce this source of variation. Phylogenetic reconstruction from these data revealed correlated substitutions at Y120/Q125 or N157/S169 repeated across multiple lineages in every host, indicating convergent within-host evolution shaped by epistatic interactions.
coevolution; epistasis; HIV-1; next-generation sequencing; ancestral reconstruction; sequencing error
Over time, natural selection molds every gene into a unique mosaic of sites evolving rapidly or resisting change—an “evolutionary fingerprint” of the gene. Aspects of this evolutionary fingerprint, such as the site-specific ratio of nonsynonymous to synonymous substitution rates (dN/dS), are commonly used to identify genetic features of potential biological interest; however, no framework exists for comparing evolutionary fingerprints between genes. We hypothesize that protein-coding genes with similar protein structure and/or function tend to have similar evolutionary fingerprints and that comparing evolutionary fingerprints can be useful for discovering similarities between genes in a way that is analogous to, but independent of, discovery of similarity via sequence-based comparison tools such as Blast.
To test this hypothesis, we develop a novel model of coding sequence evolution that uses a general bivariate discrete parameterization of the evolutionary rates. We show that this approach provides a better fit to the data using a smaller number of parameters than existing models. Next, we use the model to represent evolutionary fingerprints as probability distributions and present a methodology for comparing these distributions in a way that is robust against variations in data set size and divergence. Finally, using sequences of three rapidly evolving RNA viruses (HIV-1, hepatitis C virus, and influenza A virus), we demonstrate that genes within the same functional group tend to have similar evolutionary fingerprints. Our framework provides a sound statistical foundation for efficient inference and comparison of evolutionary rate patterns in arbitrary collections of gene alignments, clustering homologous and nonhomologous genes, and investigation of biological and functional correlates of evolutionary rates.
adaptive evolution; codon models; evolutionary distance; machine classification
Most of our knowledge about how antiretrovirals and host immune responses influence the HIV-1 protease gene is derived from studies of subtype B virus. We investigated the effect of protease resistance-associated mutations (PRAMs) and population-based HLA haplotype frequencies on polymorphisms found in CRF01_AE pro.
We used all CRF01_AE protease sequences retrieved from the LANL database and obtained regional HLA frequencies from the dbMHC database. Polymorphisms and major PRAMs in the sequences were identified using the Stanford Resistance Database, and we performed phylogenetic and selection analyses using HyPhy. HLA binding affinities were estimated using the Immune Epitope Database and Analysis.
Overall, 99% of CRF01_AE sequences had at least 1 polymorphism and 10% had at least 1 major PRAM. Three polymorphisms (L10V, K20RMI and I62V) were associated with the presence of a major PRAM (P < 0.05). Compared to the subtype B consensus, six additional polymorphisms (I13V, E35D, M36I, R41K, H69K, L89M) were identified in the CRF01_AE consensus; all but L89M were located within epitopes recognized by HLA class I alleles. Of the predominant HLA haplotypes in the Asian regions of CRF01_AE origin, 80% were positively associated with the observed polymorphisms, and HLA binding affinity was estimated to decrease 19–40 fold with the observed polymorphisms at positions 35, 36 and 41.
Polymorphisms in the CRF01_AE protease gene were common, and polymorphisms at residues 10, 20 and 62 most likely represent selection by use of protease inhibitors, whereas R41K and H69K were more likely attributable to recognition of epitopes by the HLA haplotypes of the host population.
CRF01_AE; HIV; HLA; polymorphisms; protease; resistance
The accumulation of deleterious mutations can drastically reduce population mean fitness. Self-fertilization is thought to be an effective means of purging deleterious mutations. However, widespread linkage disequilibrium generated and maintained by self-fertilization is predicted to reduce the efficacy of purging when mutations are present at multiple loci.
We tested the ability of self-fertilizing populations to purge deleterious mutations at multiple loci by exposing obligately self-fertilizing populations of Caenorhabditis elegans to a range of elevated mutation rates and found that mutations accumulated, as evidenced by a reduction in mean fitness, in each population. Therefore, purging in obligate selfing populations is overwhelmed by an increase in mutation rate. Surprisingly, we also found that obligate and predominantly self-fertilizing populations exposed to very high mutation rates exhibited consistently greater fitness than those subject to lesser increases in mutation rate, which contradicts the assumption that increases in mutation rate are negatively correlated with fitness. The high levels of genetic linkage inherent in self-fertilization could drive this fitness increase.
Compensatory mutations can be more frequent under high mutation rates and may alleviate a portion of the fitness lost due to the accumulation of deleterious mutations through epistatic interactions with deleterious mutations. The prolonged maintenance of tightly linked compensatory and deleterious mutations facilitated by self-fertilization may be responsible for the fitness increase as linkage disequilibrium between the compensatory and deleterious mutations preserves their epistatic interaction.
Many software packages have been developed to address the need for generating phylogenetic trees intended for print. With an increased use of the web to disseminate scientific literature, there is a need for phylogenetic trees to be viewable across many types of devices and feature some of the interactive elements that are integral to the browsing experience. We propose a novel approach for publishing interactive phylogenetic trees.
jsPhyloSVG is an open-source solution for rendering dynamic phylogenetic trees. It is capable of generating complex and interactive phylogenetic trees across all major browsers without the need for plugins. It is novel in supporting the ability to interpret the tree inference formats directly, exposing the underlying markup to data-mining services. The library source code, extensive documentation and live examples are freely accessible at www.jsphylosvg.com.
In 1977, H1N1 influenza A virus reappeared after a 20-year absence. Genetic analysis indicated that this strain was missing decades of nucleotide sequence evolution, suggesting an accidental release of a frozen laboratory strain into the general population. Recently, this strain and its descendants were included in an analysis attempting to date the origin of pandemic influenza virus without accounting for the missing decades of evolution. Here, we investigated the effect of using viral isolates with biologically unrealistic sampling dates on estimates of divergence dates. Not accounting for missing sequence evolution produced biased results and increased the variance of date estimates of the most recent common ancestor of the re-emergent lineages and across the entire phylogeny. Reanalysis of the H1N1 sequences excluding isolates with unrealistic sampling dates indicates that the 1977 re-emergent lineage was circulating for approximately one year before detection, making it difficult to determine the geographic source of reintroduction. We suggest that a new method is needed to account for viral isolates with unrealistic sampling dates.
Compensatory mutations improve fitness in genotypes that contain deleterious mutations but have no beneficial effects otherwise. As such, compensatory mutations represent a very specific form of epistasis. We show that intragenic compensatory mutations occur non-randomly over gene sequence. Compensatory mutations are more likely to appear at some sites than others. Moreover, the sites of compensatory mutations are more likely than expected by chance to be near the site of the original deleterious mutation. Furthermore, compensatory mutations tend to occur more commonly in certain regions of the protein even when controlling for clustering around the site of the deleterious mutation. These results suggest that compensatory evolution at the protein level is partially predictable and may be convergent.
compensatory mutation; deleterious mutations; experimental evolution; epistasis; primary structure
Piscidins constitute a family of cationic antimicrobial peptides that are thought to play an important role in the innate immune response of teleosts. On the one hand they show a remarkable diversity, which indicates that they are shaped by positive selection, but on the other hand they are ancient and have specific targets, suggesting that they are constrained by purifying selection. Until now piscidins had only been found in fish species from the superorder Acanthopterygii but we have recently identified a piscidin gene in Atlantic cod (Gadus morhua), thus showing that these antimicrobial peptides are not restricted to evolutionarily modern teleosts. Nucleotide diversity was much higher in the regions of the piscidin gene that code for the mature peptide and its pro domain than in the signal peptide. Maximum likelihood analyses with different evolution models revealed that the piscidin gene is under positive selection. Charge or hydrophobicity-changing amino acid substitutions observed in positively selected sites within the mature peptide influence its amphipathic structure and can have a marked effect on its function. This diversification might be associated with adaptation to new habitats or rapidly evolving pathogens.
Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (≈5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. 
Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.
There are nine different subtypes of the main group of HIV-1, each originating as a distinct subepidemic of HIV-1. The distribution of subtypes is often unique to a given geographic region of the world and constitutes a useful epidemiological and surveillance resource. The effects of viral subtype on disease progression, treatment outcome and vaccine design are being actively researched, and the importance of accurate subtyping procedures is clear. In HIV-1, subtype assignment is complicated by frequent recombination among co-circulating strains, creating new genetic mosaics or recombinant forms: 43 have been characterized to date, and many more likely exist. We present an automated phylogenetic method (SCUEAL) to accurately characterize both simple and complex HIV-1 mosaics. Using computer simulations and biological data we demonstrate that SCUEAL performs very well under various conditions, especially when some of the existing classification procedures fail. Furthermore, we show that a small, but noticeable proportion of subtype characterization stored in public databases may be incomplete or incorrect. The computational technique introduced here should provide a much more accurate characterization of HIV-1 strains, especially novel recombinants, and lead to new insights into molecular history, epidemiology and geographical distribution of the virus.
Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a ‘hidden’ population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.
Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well-suited to modeling tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually-transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly-available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: firstly, IDUs tended to emulate the recruitment behavior of their own recruiter; and secondly, the recruitment of like peers (homophily) was dependent on the number of recruits.
SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure that has no representation in Markov chain-based models can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.
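To make the parsing step concrete, the following is a minimal sketch of the inside pass of the inside-outside algorithm, which computes the probability of a string under an SCFG. It is not the HyPhy implementation: the function name, the dictionary-based rule encoding, the restriction to Chomsky normal form, and the toy grammar in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


def inside_probability(tokens, unary_rules, binary_rules, start="S"):
    """Probability that `start` derives `tokens` under an SCFG in Chomsky normal form.

    unary_rules  maps (A, terminal) -> probability of the rule A -> terminal
    binary_rules maps (A, B, C)     -> probability of the rule A -> B C
    (This rule encoding is a hypothetical choice for the sketch.)
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    # inside[i][j][A] = probability that nonterminal A derives tokens[i..j] inclusive
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]

    # Base case: spans of length 1 are produced by terminal rules.
    for i, token in enumerate(tokens):
        for (lhs, terminal), prob in unary_rules.items():
            if terminal == token:
                inside[i][i][lhs] += prob

    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point: left = i..k, right = k+1..j
                for (lhs, left, right), prob in binary_rules.items():
                    p_left = inside[i][k].get(left, 0.0)
                    p_right = inside[k + 1][j].get(right, 0.0)
                    if p_left and p_right:
                        inside[i][j][lhs] += prob * p_left * p_right

    return inside[0][n - 1].get(start, 0.0)
```

As a usage example, with the toy grammar S → S S (probability 0.3) and S → a (probability 0.7), the string "aa" has a single derivation with probability 0.3 × 0.7 × 0.7 = 0.147, and `inside_probability(list("aa"), {("S", "a"): 0.7}, {("S", "S", "S"): 0.3})` returns that value up to floating-point error. Summing over all split points is what lets the same table handle branching structure with several possible derivations.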
Spidermonkey is a new component of the Datamonkey suite of phylogenetic tools that provides methods for detecting coevolving sites from a multiple alignment of homologous nucleotide or amino acid sequences. It reconstructs the substitution history of the alignment by maximum likelihood-based phylogenetic methods, and then analyzes the joint distribution of substitution events using Bayesian graphical models to identify significant associations among sites.
Availability: Spidermonkey is publicly available both as a web application at http://www.data-monkey.org and as a stand-alone component of the phylogenetic software package HyPhy, which is freely distributed on the web (http://www.hyphy.org) as precompiled binaries and open source.
We develop a model-based phylogenetic maximum likelihood test for evidence of preferential substitution toward a given residue at individual positions of a protein alignment—directional evolution of protein sequences (DEPS). DEPS can identify both the target residue and sites evolving toward it, help detect selective sweeps and frequency-dependent selection—scenarios that confound most existing tests for selection, and achieve good power and accuracy on simulated data. We applied DEPS to alignments representing different genomic regions of influenza A virus (IAV), sampled from avian hosts (H5N1 serotype) and human hosts (H3N2 serotype), and identified multiple directionally evolving sites in 5/8 genomic segments of H5N1 and H3N2 IAV. We propose a simple descriptive classification of directionally evolving sites into 5 groups based on the temporal distribution of residue frequencies and document known functional correlates, such as immune escape or host adaptation.
directional selection; evolution of influenza; maximum likelihood; episodic selection
Microsatellites have been used extensively in the field of comparative genomics. Microsatellites in coding regions provide a simple model of how genotypic changes undergo selection, because they are directly expressed in the phenotype as altered proteins. The simplest of these tandem repeats in coding regions are the tri-nucleotide repeats, which produce a repeat of a single amino acid when translated into proteins. Tri-nucleotide repeats are often disease-associated, and are also known to be unstable with respect to both expansion and contraction. This makes them sensitive markers for studying proteome evolution in closely related species.
The evolutionary history of the family of malaria-causing parasites, Plasmodia, is complex because of the life cycle of the organism, in which it interacts with a number of different hosts and goes through a series of tissue-specific stages. This study shows that the divergence between the primate and rodent malarial parasites has resulted in a lineage-specific change in the simple amino acid repeat (SAAR) distribution that is correlated with A–T content. The paper also shows that this altered use of amino acids in SAARs is consistent with the repeat distributions being under selective pressure.
The study shows that simple amino acid repeat distributions can be used to group related species and to examine their phylogenetic relationships. This study also shows that an outgroup species with a similar A–T content can be distinguished based only on the amino acid usage in repeats, and suggests that this might be a useful feature for proteome clustering. The lineage-specific use of amino acids in repeat regions suggests that comparative studies of SAAR distributions between proteomes give an insight into the mechanisms of expansion and the selective pressures acting on the organism.
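As an illustration of what a simple amino acid repeat is in practice, the sketch below scans a protein sequence for runs of a single residue. It is not the method used in the study: the function name and the minimum run length of five residues are assumptions chosen for the example, and the study's own repeat definition may differ.

```python
from itertools import groupby


def find_simple_repeats(protein, min_length=5):
    """Return (residue, start, length) for each run of one amino acid.

    `min_length` is a hypothetical threshold; only runs at least this
    long are reported. `start` is a 0-based index into `protein`.
    """
    runs = []
    position = 0
    # groupby yields each maximal block of consecutive identical residues.
    for residue, block in groupby(protein):
        length = sum(1 for _ in block)
        if length >= min_length:
            runs.append((residue, position, length))
        position += length
    return runs
```

For example, `find_simple_repeats("MKNNNNNNSQQQQQA")` reports one asparagine run of length 6 starting at index 2 and one glutamine run of length 5 starting at index 9. Counting such runs per residue across a proteome would give one simple form of the repeat distribution discussed above.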
After acute HIV infection, CD8+ T cells are able to control viral replication to a set point. This control is often lost after superinfection, although the mechanism behind this remains unclear. In this study, we illustrate in an HLA-B27+ subject that loss of viral control after HIV superinfection coincides with rapid recombination events within two narrow regions of Gag and Env. Screening for CD8+ T cell responses revealed that each of these recombination sites (∼50 aa) encompassed distinct regions containing two immunodominant CD8 epitopes (B27-KK10 in Gag and Cw1-CL9 in Env). Viral escape and the subsequent development of variant-specific de novo CD8+ T cell responses against both epitopes were illustrative of the significant immune selection pressures exerted by both responses. Comprehensive analysis of the kinetics of CD8 responses and viral evolution indicated that the recombination events quickly facilitated viral escape from both dominant WT- and variant-specific responses. These data suggest that the ability of a superinfecting strain of HIV to overcome preexisting immune control may be related to its ability to rapidly recombine in critical regions under immune selection pressure. These data also support a role for cellular immune pressures in driving the selection of new recombinant forms of HIV.
We assessed the effect of herpes simplex virus type 2 (HSV-2) acquisition on the plasma HIV RNA and CD4 cell levels among individuals with primary HIV infection using a retrospective cohort analysis. We studied 119 adult, antiretroviral-naive, recently HIV-infected men with a negative HSV-2–specific enzyme immunoassay (EIA) result at enrollment. HSV-2 acquisition was determined by seroconversion on HSV-2 EIA, confirmed by Western blot analysis. Ten men acquired HSV-2 infection a median of 1.3 years after HIV infection (HSV-2 incidence rate of 7.4 per 100 person-years of follow-up). The median time of follow-up after acquiring HSV-2 infection was 303 days. All men except 1 were asymptomatic during HSV-2 acquisition, and only 1 HSV-2 seroconverter, who was asymptomatic, had a transient increase in blood HIV load (0.5 log10 copies/mL over 11 days). The HSV-2 incidence rate was high in our cohort of recently HIV-infected individuals; however, HSV-2 acquisition did not significantly change the plasma HIV dynamics and CD4 cell levels.
HIV RNA; incident herpes simplex virus-2; viral dynamics