Next-generation (“deep”) sequencing has increasing applications, both clinically and in disparate fields of research. This study investigates the accuracy and reproducibility of “deep” sequencing as applied to co-receptor prediction using the V3 loop of Human Immunodeficiency Virus-1. Despite increasing use in HIV co-receptor prediction, the accuracy and reproducibility of deep sequencing technology, and the factors which can affect it, have received only limited investigation. To address this gap, repeated deep sequencing results were generated using the Roche GS-FLX (454) from a number of sources including a non-homogeneous clinical sample (N = 47 replicates over 18 deep sequencing runs), and a large clinical cohort from the MOTIVATE and A4001029 studies (N = 1521). For repeated measurements of a non-homogeneous clinical sample, increasing input copy number both decreased variance in the measured proportion of non-R5-using virus (p<<0.001 and 0.02 for single replicates and triplicates respectively) and increased measured viral diversity (p<0.001; multiple measures). Sequences with a mean abundance below 1% showed a 2-fold increase in median coefficient of variation (CV) in repeated measurements of a non-homogeneous clinical sample, and a 2.7-fold increase in CV in the MOTIVATE/A4001029 dataset, compared to sequences with ≥1% abundance. An unexpected source of error was read position, with low-accuracy reads occurring more frequently towards the edge of sequencing regions (p<<0.001). Overall, the primary source of variability was sampling error caused by low input copy number/minority species prevalence, though other sources of error including sequence-intrinsic, temporal, and read-position-related errors were detected.
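The comparison of coefficients of variation between low- and high-abundance sequences can be illustrated with a minimal sketch. It assumes a hypothetical dictionary mapping each variant to its measured proportions across replicates; the function names, the example data, and the use of the sample standard deviation are illustrative assumptions, not the study's actual pipeline.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of the replicate values."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


def median_cv_by_abundance(replicate_freqs, threshold=0.01):
    """Return (median CV below threshold, median CV at or above threshold).

    replicate_freqs is a hypothetical mapping: variant -> list of proportions,
    one per replicate. A group with no variants yields None.
    """
    low, high = [], []
    for freqs in replicate_freqs.values():
        cv = coefficient_of_variation(freqs)
        (low if mean(freqs) < threshold else high).append(cv)
    return (median(low) if low else None, median(high) if high else None)


# Made-up replicate measurements, for illustration only.
example = {
    "variant_a": [0.40, 0.50, 0.60],     # majority variant
    "variant_b": [0.002, 0.004, 0.006],  # minority variant below 1%
}
low_cv, high_cv = median_cv_by_abundance(example)
print(low_cv, high_cv)
```

In this toy input the minority variant has the larger CV, which is the direction of the effect reported above; the magnitudes carry no meaning.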
A population of human immunodeficiency virus (HIV) within a host often descends from a single transmitted/founder virus. The high mutation rate of HIV, coupled with long delays between infection and diagnosis, makes isolating and characterizing this strain a challenge. In theory, ancestral reconstruction could be used to recover this strain from sequences sampled in chronic infection; however, the accuracy of phylogenetic techniques in this context is unknown. To evaluate the accuracy of these methods, we applied ancestral reconstruction to a large panel of published longitudinal clonal and/or single-genome-amplification HIV sequence data sets with at least one intrapatient sequence set sampled within 6 months of infection or seroconversion (n = 19,486 sequences, median [interquartile range] = 49 [20 to 86] sequences/set). The consensus of the earliest sequences was used as the best possible estimate of the transmitted/founder. These sequences were compared to ancestral reconstructions from sequences sampled at later time points using both phylogenetic and phylogeny-naive methods. Overall, phylogenetic methods conferred a 16% improvement in reproducing the consensus of early sequences, compared to phylogeny-naive methods. This relative advantage increased with intrapatient sequence diversity (P < 10⁻⁵) and the time elapsed between the earliest and subsequent samples (P < 10⁻⁵). However, neither approach performed well for reconstructing ancestral indel variation, especially within indel-rich regions of the HIV genome. Although further improvements are needed, our results indicate that phylogenetic methods for ancestral reconstruction significantly outperform phylogeny-naive alternatives, and we identify experimental conditions and study designs that can enhance accuracy of transmitted/founder virus reconstruction.
IMPORTANCE When HIV is transmitted into a new host, most of the viruses fail to infect host cells. Consequently, an HIV infection tends to be descended from a single “founder” virus. A priority target for vaccine research, these transmitted/founder viruses are difficult to isolate since newly infected individuals are often unaware of their status for months or years, by which time the virus population has evolved substantially. Here, we report on the potential use of evolutionary methods to reconstruct the genetic sequence of the transmitted/founder virus from its descendants at later stages of an infection. These methods can recover this ancestral sequence with an overall error rate of about 2.3%, recovering about 15% more information than if we had ignored the evolutionary relationships among viruses. Although there is no substitute for sampling infections at earlier points in time, these methods can provide useful information about the genetic makeup of transmitted/founder HIV.
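A majority-rule consensus is one simple phylogeny-naive estimate of a founder sequence, and the fraction of mismatched positions is one simple way to score a reconstruction against it. The sketch below assumes pre-aligned, equal-length sequences; the function names and the alphabetical tie-break are illustrative assumptions and are not taken from the study.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-rule consensus of equal-length aligned sequences.

    Ties are broken alphabetically so the result is deterministic
    (an arbitrary choice made for this sketch).
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        top = max(counts.values())
        out.append(min(base for base, c in counts.items() if c == top))
    return "".join(out)


def mismatch_fraction(estimate, reference):
    """Fraction of aligned positions at which two sequences differ."""
    if len(estimate) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(estimate, reference)) / len(reference)


# Toy sequences, for illustration only.
early = ["ACGT", "ACGT", "ACGA"]   # earliest time point
late = ["ACGA", "ACCA", "ACGA"]    # later time point
founder_estimate = consensus(early)
naive_reconstruction = consensus(late)
print(mismatch_fraction(naive_reconstruction, founder_estimate))
```

A phylogenetic reconstruction would replace `consensus(late)` with the inferred root state of a tree; that step is outside the scope of this sketch.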
HIV-1 tropism can be predicted using V3 genotypic algorithms. The performance of these prediction algorithms for non-B subtypes is poorly characterized. Here, we use these genotypic algorithms to predict viral tropism of HIV-1 subtypes A, B, C, and D and determine their apparent sensitivity, specificity, and concordance against a recombinant phenotypic assay, the original Trofile assay. This is a substudy of an epidemiological study (Pfizer A4001064). Plasma samples were selected to represent a large number of DM/X4 and R5 viruses. The HIV-1 env gene V3 loop was genotyped by Sanger sequencing (N=260) or 454 “deep” sequencing (N=280). Sequences were scored with g2p[coreceptor], PSSM X4/R5, PSSM SI/NSI, and PSSM subtype C matrices. Overall, tropism prediction for non-B subtypes had concordance and apparent sensitivity and specificity similar to those for subtype B in predicting Trofile's results, both by population sequencing (81.3%, 65.6%, and 90.5% versus 84.2%, 78.5%, and 88.2%) and by 454 “deep” sequencing (82.3%, 80.0%, and 83.6% versus 86.8%, 92.0%, and 82.6%) using g2p[coreceptor]. By population sequencing, subtype A had lower sensitivity, whereas subtype D had lower specificity for non-R5 predictions, both in comparison to subtype B. 454 “deep” sequencing improved subtype A sensitivity but not subtype D specificity. Subtype C had greater concordance than subtype B regardless of sequencing method. In conclusion, genotypic tropism prediction algorithms may be applied to non-B HIV-1 subtypes with caution. Collective analysis of non-B subtypes revealed a performance similar to subtype B, whereas a subtype-specific analysis revealed overestimation (subtype D) or underestimation (subtype A).
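Apparent sensitivity, specificity, and concordance against a reference assay follow from a 2×2 table of calls. The sketch below treats "non-R5" as the positive class; the label strings, the function name, and the toy calls are assumptions made for illustration.

```python
def tropism_performance(predicted, reference, positive="non-R5"):
    """Return (sensitivity, specificity, concordance) of predicted calls
    against reference calls, treating `positive` as the positive class."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty call lists of equal length")
    tp = fp = tn = fn = 0
    for pred, ref in zip(predicted, reference):
        if ref == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both classes")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    concordance = (tp + tn) / len(predicted)
    return sensitivity, specificity, concordance


# Toy calls, for illustration only.
genotype_calls = ["non-R5", "non-R5", "R5", "R5", "R5"]
phenotype_calls = ["non-R5", "R5", "non-R5", "R5", "R5"]
sens, spec, conc = tropism_performance(genotype_calls, phenotype_calls)
print(sens, spec, conc)
```

The values are "apparent" because the reference assay is itself imperfect, so disagreement does not always mean the genotypic call was wrong.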
Primer IDs (pIDs) are random oligonucleotide tags used in next-generation sequencing to identify sequences that originate from the same template. These tags are produced by degenerate primers during the reverse transcription of RNA molecules into cDNA. The use of pIDs helps to track the number of RNA molecules carried through amplification and sequencing, and allows resolution of inconsistencies between reads sharing a pID. Three potential issues complicate the above applications. First, multiple cDNAs may share a pID by chance; we found that while preventing any cDNAs from sharing a pID may be unfeasible, it is still practical to limit the number of these collisions. Secondly, a pID must be observed in at least three sequences to allow error correction; as such, pIDs observed only one or two times must be rejected. If the sequencing product contains copies from a high number of RT templates but produces few reads, our findings indicate that rejecting such pIDs will discard a great deal of data. Thirdly, the use of pIDs could influence amplification and sequencing. We examined the effects of several intrinsic and extrinsic factors on sequencing reads at both the individual and ensemble level.
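Both the collision problem and the read-count threshold can be illustrated with a short sketch. It assumes that tags are drawn uniformly at random from all 4^k sequences of length k and that reads sharing a pID have equal length; the function names and the simple per-position majority vote are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def prob_no_collision(n_templates, tag_length):
    """Probability that n templates all receive distinct tags
    (birthday-problem product, assuming uniform random tags)."""
    m = 4 ** tag_length
    p = 1.0
    for i in range(n_templates):
        p *= max(0.0, 1.0 - i / m)
    return p


def expected_collision_fraction(n_templates, tag_length):
    """Expected fraction of templates whose tag is shared with
    at least one other template, under the same assumption."""
    m = 4 ** tag_length
    return 1.0 - (1.0 - 1.0 / m) ** (n_templates - 1)


def consensus_by_pid(reads, min_reads=3):
    """Group (pid, sequence) pairs by pID, drop pIDs seen fewer than
    min_reads times, and take a per-position majority vote."""
    groups = defaultdict(list)
    for pid, seq in reads:
        groups[pid].append(seq)
    result = {}
    for pid, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few reads to out-vote a sequencing error
        result[pid] = "".join(
            Counter(col).most_common(1)[0][0] for col in zip(*seqs)
        )
    return result


# Toy reads, for illustration only.
reads = [
    ("AAAA", "ACGT"), ("AAAA", "ACGT"), ("AAAA", "ACGA"),
    ("CCCC", "ACGT"), ("CCCC", "ACGT"),
]
print(consensus_by_pid(reads))
print(prob_no_collision(1000, 8), expected_collision_fraction(1000, 8))
```

With these assumptions, avoiding every collision quickly becomes improbable as the number of templates grows, while the expected fraction of colliding templates can remain small, which is consistent with the distinction drawn above.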
HLA-restricted immune escape mutations that persist following HIV transmission could gradually spread through the viral population, thereby compromising host antiviral immunity as the epidemic progresses. To assess the extent and phenotypic impact of this phenomenon in an immunogenetically diverse population, we genotypically and functionally compared linked HLA and HIV (Gag/Nef) sequences from 358 historic (1979–1989) and 382 modern (2000–2011) specimens from four key cities in the North American epidemic (New York, Boston, San Francisco, Vancouver). Inferred HIV phylogenies were star-like, with approximately two-fold greater mean pairwise distances in modern versus historic sequences. The reconstructed epidemic ancestral (founder) HIV sequence was essentially identical to the North American subtype B consensus. Consistent with gradual diversification of a “consensus-like” founder virus, the median “background” frequencies of individual HLA-associated polymorphisms in HIV (in individuals lacking the restricting HLA[s]) were ∼2-fold higher in modern versus historic HIV sequences, though these remained notably low overall (e.g. in Gag, medians were 3.7% in the 2000s versus 2.0% in the 1980s). HIV polymorphisms exhibiting the greatest relative spread were those restricted by protective HLAs. Despite these increases, when HIV sequences were analyzed as a whole, their total average burden of polymorphisms that were “pre-adapted” to the average host HLA profile was only ∼2% greater in modern versus historic eras. Furthermore, HLA-associated polymorphisms identified in historic HIV sequences were consistent with those detectable today, with none identified that could explain the few HIV codons where the inferred epidemic ancestor differed from the modern consensus. Results are therefore consistent with slow HIV adaptation to HLA, but at a rate unlikely to yield imminent negative implications for cellular immunity, at least in North America. 
Intriguingly, temporal changes in protein activity of patient-derived Nef (though not Gag) sequences were observed, suggesting functional implications of population-level HIV evolution on certain viral proteins.
Upon HIV transmission, many – though not all – immune escape mutations selected in the previous host will revert to the consensus residue. The persistence of certain escape mutations following transmission has led to concerns that these could gradually accumulate in circulating HIV sequences over time, thereby undermining host antiviral immune potential as the epidemic progresses. As certain immune-driven mutations reduce viral fitness, their spread through the population could also have consequences for the average replication capacity and/or protein function of circulating HIV sequences. Here, we characterized HIV sequences, linked to host immunogenetic information, from patients enrolled in historic (1979–1989) and modern (2000–2011) HIV cohorts from four key cities in the North American epidemic. We reconstructed the epidemic's ancestral (founder) HIV sequence and assessed the subsequent extent to which known HIV immune escape mutations have spread in the population. Our data support the gradual spread of many - though not all - immune escape mutations in HIV sequences over time, but to an extent that is unlikely to have major immediate immunologic consequences for the North American epidemic. Notably, in vitro assessments of ancestral and patient-derived HIV sequences suggested functional implications of ongoing HIV evolution for certain viral proteins.
We applied an efficient method to characterize the relative fitness levels of multiple nonnucleoside reverse transcriptase (NNRTI)-resistant HIV-1 variants by simultaneous competitive culture and 454 deep sequencing. Using this method, we show that the Y181V mutation in the HIV-1 reverse transcriptase in particular confers a clear selective advantage to the virus over 14 other NNRTI resistance mutations in the presence of etravirine in vitro.
A phylogeny is a tree-based model of common ancestry that is an indispensable tool for studying biological variation. Phylogenies play a special role in the study of rapidly evolving populations such as viruses, where the proliferation of lineages is constantly being shaped by the mode of virus transmission, by adaptation to immune systems, and by patterns of human migration and contact. These processes may leave an imprint on the shapes of virus phylogenies that can be extracted for comparative study; however, tree shapes are intrinsically difficult to quantify. Here we present a comprehensive study of phylogenies reconstructed from 38 different RNA viruses from 12 taxonomic families that are associated with human pathologies. To accomplish this, we have developed a new procedure for studying phylogenetic tree shapes based on the ‘kernel trick’, a technique that maps complex objects into a statistically convenient space. We show that our kernel method outperforms nine different tree balance statistics at correctly classifying phylogenies that were simulated under different evolutionary scenarios. Using the kernel method, we observe patterns in the distribution of RNA virus phylogenies in this space that reflect modes of transmission and pathogenesis. For example, viruses that can establish persistent chronic infections (such as HIV and hepatitis C virus) form a distinct cluster. Although the visibly ‘star-like’ shape characteristic of trees from these viruses has been well-documented, we show that established methods for quantifying tree shape fail to distinguish these trees from those of other viruses. The kernel approach presented here potentially represents an important new tool for characterizing the evolution and epidemiology of RNA viruses.
Human immunodeficiency virus type 1 (HIV-1) V3 loop sequence can be used to infer viral coreceptor use. The effect of input copy number on population-based sequencing of the V3 loop of HIV-1 was examined through replicate deep and population-based sequencing of samples with known tropism, a heterogeneous clinical sample (624 population-based sequences and 47 deep-sequencing replicates), and a large cohort of clinical samples from phase III clinical trials of maraviroc including the MOTIVATE/A4001029 studies (n = 1,521). Proviral DNA from two independent samples from each of 101 patients from the MOTIVATE/A4001029 studies was also analyzed. Cumulative technical error occurred at a rate of 3 × 10⁻⁴ mismatches/bp, without observed effect on inferred tropism. Increasing PCR replication increased minority species detection, with an ∼10% minority population detected in 18% of cases using a single replicate at a viral load of 1,072 copies/ml and in 44% of cases using three replicates. The nucleotide prevalences detected by population-based and deep sequencing were highly correlated (Spearman's ρ, 0.73), and the accuracy increased with increasing input copy number (P < 0.001). Triplicate sequencing was able to predict tropism changes in the MOTIVATE/A4001029 studies for both low (P = 0.05) and high (P = 0.02) viral loads. Sequences derived from independently extracted and processed samples of proviral DNA for the same patient were equivalent to replicates from the same extraction (P = 0.45) and had correlated position-specific scoring matrix scores (Spearman's ρ, 0.75; P ≪ 0.001); however, concordance in tropism inference was only 83%. Input copy number and PCR replication are important factors in minority species detection in samples with significant heterogeneity.
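Why input copy number and replication matter for minority detection can be seen from a simple binomial sampling argument. The sketch below computes only the probability that at least one copy of a variant enters the reaction, assuming independent sampling of templates; it ignores PCR bias and the detection limits of the sequencing method, so it is an upper bound on detectability and not a model of the percentages reported above. The function name and parameters are hypothetical.

```python
def prob_variant_sampled(frequency, copies_per_replicate, replicates=1):
    """Probability that a variant at the given frequency is present in at
    least one of the replicates, assuming independent binomial sampling
    of input templates."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    total_copies = copies_per_replicate * replicates
    return 1.0 - (1.0 - frequency) ** total_copies


# A 10% minority variant with 10 input copies, one versus three replicates.
single = prob_variant_sampled(0.10, 10)
triplicate = prob_variant_sampled(0.10, 10, replicates=3)
print(single, triplicate)
```

Under this assumption, three replicates behave like a single reaction with three times the input, which is why replication helps most when the input copy number per reaction is low.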
At the early stage of infection, human immunodeficiency virus (HIV)-1 predominantly uses the CCR5 coreceptor for host cell entry. The subsequent emergence of HIV variants that use the CXCR4 coreceptor in roughly half of all infections is associated with an accelerated decline of CD4+ T-cells and rate of progression to AIDS. The presence of a ‘fitness valley’ separating CCR5- and CXCR4-using genotypes is postulated to be a biological determinant of whether the HIV coreceptor switch occurs. Using phylogenetic methods to reconstruct the evolutionary dynamics of HIV within hosts enables us to discriminate between competing models of this process. We have developed a phylogenetic pipeline for the molecular clock analysis, ancestral reconstruction, and visualization of deep sequence data. These data were generated by next-generation sequencing of HIV RNA extracted from longitudinal serum samples (median 7 time points) from 8 untreated subjects with chronic HIV infections (Amsterdam Cohort Studies on HIV-1 infection and AIDS). We used the known dates of sampling to directly estimate rates of evolution and to map ancestral mutations to a reconstructed timeline in units of days. HIV coreceptor usage was predicted from reconstructed ancestral sequences using the geno2pheno algorithm. We determined that the first mutations contributing to CXCR4 use emerged about 16 (per subject range 4 to 30) months before the earliest predicted CXCR4-using ancestor, which preceded the first positive cell-based assay of CXCR4 usage by 10 (range 5 to 25) months. CXCR4 usage arose in multiple lineages within 5 of 8 subjects, and ancestral lineages following alternate mutational pathways before going extinct were common. We observed highly patient-specific distributions and time-scales of mutation accumulation, implying that the role of a fitness valley is contingent on the genotype of the transmitted variant.
At the start of infection, human immunodeficiency virus (HIV) generally requires a specific protein receptor (CCR5) on the cell surface to bind and enter the cell. In roughly half of all HIV infections, the virus population eventually switches to using a different receptor (CXCR4). This ‘HIV coreceptor switch’ is associated with an accelerated rate of progression to AIDS. Although it is not known why this switch occurs in some infections and not others, it is thought to be shaped by constraints on how HIV can evolve from one mode to another. In this study, we test this hypothesis by reconstructing the evolutionary histories of HIV within 8 patients known to have undergone an HIV coreceptor switch. Each history is recreated from samples of HIV genetic sequences that were derived from repeated blood samples by next-generation sequencing, an emerging technology that is rapidly becoming an essential tool in the study of rapidly-evolving populations such as viruses or cancerous cells. Because we have samples from different points in time, we can use models of evolution to extrapolate back in time to the ancestors of each infection. Our analysis reveals patient-specific dynamics in HIV evolution that shed new light on the determinants of the coreceptor switch.
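The idea of using known sampling dates to estimate a rate of evolution and extrapolate back to an ancestor can be illustrated with a root-to-tip regression. This is a deliberately simple stand-in technique, not the molecular clock pipeline described above: it fits a least-squares line through (sampling day, genetic distance from the root) pairs, takes the slope as the rate, and takes the x-intercept as the estimated date of the root. All names and data below are hypothetical.

```python
def root_to_tip_regression(days, distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling
    day. Returns (rate per day, intercept, estimated root day)."""
    n = len(days)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two (day, distance) pairs")
    mean_x = sum(days) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("sampling days must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    if rate <= 0:
        raise ValueError("no positive clock signal in these data")
    root_day = -intercept / rate  # day at which the fitted distance is zero
    return rate, intercept, root_day


# Made-up data: sampling days and distances in substitutions per site.
days = [0, 100, 200]
distances = [0.01, 0.02, 0.03]
rate, intercept, root_day = root_to_tip_regression(days, distances)
print(rate, intercept, root_day)
```

A negative root day simply means the inferred ancestor predates the first sample, as expected for a chronic infection.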
The evolution of drug resistance mutations in plasma samples is relatively well-characterized. However, the viral population and diversity in other body compartments such as peripheral blood mononuclear cells (PBMC) remain poorly understood. Previous studies have mostly focused on protease and reverse transcriptase drug resistance mutations (DRMs). In this study, we used 454 “deep” sequencing technology to observe and quantify longitudinally the prevalence of resistance mutations associated with the integrase inhibitor, raltegravir, in plasma versus PBMC samples from a San Francisco-based cohort. Four heavily treatment-experienced subjects were monitored in this study over a median of 1.2 years since the initiation of raltegravir-containing regimens. We observed a consistent discordance in the prevalence of DRMs, but not resistance pathway(s), in the plasma versus PBMC viral populations. In the final paired samples that were tested while the subjects were on a raltegravir-containing regimen, DRM prevalence reached 100% in plasma but remained 1% in PBMC on day 177 post-therapy in Subject 3180 (Q148H/G140S), 100% in plasma and 36% in PBMC on day 224 in Subject 3242 (N155H), 78% in plasma and 11–12% in PBMC on day 338 in Subject 3501 (Q148H/G140S), and 100% in plasma and 0% in PBMC on day 197 in Subject 3508 (Y143R). Furthermore, absolute sequence homology comparison between the two compartments revealed that 21% to 99% of PBMC sequences had no match in plasma, whereas 14% to 100% of plasma sequences had no match in PBMC. Overall, our observations suggested that plasma and PBMC hosted drastically different HIV-1 populations even after a prolonged exposure to raltegravir selection pressure.
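The between-compartment comparison can be sketched as the fraction of sequences in one compartment that have no identical counterpart in the other. The sketch assumes that "match" means exact string identity over the same amplicon, which is an interpretation of "absolute sequence homology" and may differ from the criterion actually used; the function name and data are hypothetical.

```python
def unmatched_fraction(query_seqs, reference_seqs):
    """Fraction of query sequences with no identical sequence in the
    reference set (exact string identity, an assumption of this sketch)."""
    if not query_seqs:
        raise ValueError("query set is empty")
    reference = set(reference_seqs)
    return sum(1 for s in query_seqs if s not in reference) / len(query_seqs)


# Toy reads from two compartments of one sample pair.
pbmc = ["AAA", "AAC", "AAG", "AAA"]
plasma = ["AAA", "TTT"]
print(unmatched_fraction(pbmc, plasma))   # PBMC reads absent from plasma
print(unmatched_fraction(plasma, pbmc))   # plasma reads absent from PBMC
```

The measure is asymmetric, so it is reported once in each direction, as in the two ranges quoted above.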
Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites that rapidly escape host-immune pressure, and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analysis options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.
Availability and documentation: http://www.datamonkey.org
Environmental metagenomics provides snippets of genomic sequences from all organisms in an environmental sample and is an unprecedented resource of information for investigating microbial population genetics. Current analytical methods, however, are poorly equipped to handle metagenomic data, particularly short, unlinked sequences. A custom analytical pipeline was developed to calculate dN/dS ratios, a common metric to evaluate the role of selection in the evolution of a gene, from environmental metagenomes of flow-sorted populations of marine Synechococcus, the dominant cyanobacteria in coastal environments, sequenced using 454 technology. The large majority of genes (98%) have evolved under purifying selection (dN/dS<1). The metagenome sequence coverage of the reference genomes was not uniform, and genes that were highly represented in the environment (i.e. high read coverage) tended to be more evolutionarily conserved. Of the genes that may have evolved under positive selection (dN/dS>1), 77 out of 83 (93%) were hypothetical. Notable among annotated genes, ribosomal protein L35 appears to be under positive selection in one Synechococcus population. Other annotated genes, in particular a possible porin, a large-conductance mechanosensitive channel, an ATP binding component of an ABC transporter, and a homologue of a pilus retraction protein, had regions of the gene with elevated dN/dS. With the increasing use of next-generation sequencing in metagenomic investigations of microbial diversity and ecology, analytical methods need to accommodate the peculiarities of these data streams. By developing a means to analyze population diversity data from these environmental metagenomes, we have provided the first insight into the role of selection in the evolution of Synechococcus, a globally significant primary producer.
Initial in vitro studies of bevirimat resistance failed to observe mutations in the clinically significant QVT motif in SP1 of HIV-1 gag. This study presents a novel screening method involving mixed, clinically derived gag-protease recombinant HIV-1 samples to more accurately mimic the selection of resistance seen in vivo. Bevirimat resistance was investigated via population-based sequencing performed with a large, initially antiretroviral-naïve cohort before (n = 805) and after (n = 355) standard HIV therapy (without bevirimat). The prevalence of any polymorphism in the motif comprising Q, V, and T was ∼6%, 29%, and 12%, respectively, and did not change appreciably over the course of therapy. From these samples, three groups of 10 samples whose bulk sequences were wild type at the QVT motif were used to generate gag-protease recombinant viruses that captured the existing diversity. Groups were mixed and passaged with various bevirimat concentrations for 9 weeks. gag variations were assessed by amplicon-based “deep” sequencing using a GS FLX sequencer (Roche). Unscreened mutations were present in all groups, and a V370A minority not originally detected by bulk sequencing was present in one group. V370A, occurring together with another preexisting, unscreened resistance mutation, was selected in all groups in the presence of a bevirimat concentration above 0.1 μM. For the two groups with V370A levels below consistent detectability by deep sequencing, the initial selection of V370A required 3 to 4 weeks of exposure to a narrow range of bevirimat concentrations, whereas for the group with the V370A minority, selection occurred immediately. This approach provides quasispecies diversity that facilitates the selection of mutations observed in clinical trials and, coupled with deep sequencing, could represent an efficient in vitro screening method for detecting resistance mutations.
Human immunodeficiency virus type 1 (HIV-1) genomes often carry one or more mutations associated with drug resistance upon transmission into a therapy-naïve individual. We assessed the prevalence and clinical significance of transmitted drug resistance (TDR) in chronically-infected therapy-naïve patients enrolled in a multi-center cohort in North America. Pre-therapy clinical significance was quantified by plasma viral load (pVL) and CD4+ cell count (CD4) at baseline. Naïve bulk sequences of HIV-1 protease and reverse transcriptase (RT) were screened for resistance mutations as defined by the World Health Organization surveillance list. The overall prevalence of TDR was 14.2%. We used a Bayesian network to identify co-transmission of TDR mutations in clusters associated with specific drugs or drug classes. Aggregate effects of mutations by drug class were estimated by fitting linear models of pVL and CD4 on weighted sums over TDR mutations according to the Stanford HIV Database algorithm. Transmitted resistance to both classes of reverse transcriptase inhibitors was significantly associated with lower CD4, but had opposing effects on pVL. In contrast, position-specific analyses of TDR mutations revealed substantial effects on CD4 and pVL at several residue positions that were being masked in the aggregate analyses, and significant interaction effects as well. Residue positions in RT with predominant effects on CD4 or pVL (D67 and M184) were re-evaluated in causal models using an inverse probability-weighting scheme to address the problem of confounding by other mutations and demographic or risk factors. We found that causal effect estimates of mutations M184V/I (on pVL) and D67N/G (on CD4 and pVL) were compensated by K103N/S and K219Q/E/N/R.
As TDR becomes an increasing dilemma in this modern era of highly-active antiretroviral therapy, these results have immediate significance for the clinical management of HIV-1 infections and our understanding of the ongoing adaptation of HIV-1 to human populations.
Rapidly evolving viruses such as HIV-1 display extensive sequence variation in response to host-specific selection, while simultaneously maintaining functions that are critical to replication and infectivity. This apparent conflict between diversifying and purifying selection may be resolved by an abundance of epistatic interactions such that the same functional requirements can be met by highly divergent sequences. We investigate this hypothesis by conducting an extensive characterization of sequence variation in the HIV-1 nef gene that encodes a highly variable multifunctional protein. Population-based sequences were obtained from 686 patients enrolled in the HOMER cohort in British Columbia, Canada, from which the distribution of nonsynonymous substitutions in the phylogeny was reconstructed by maximum likelihood. We used a phylogenetic comparative method on these data to identify putative epistatic interactions between residues. Two interactions (Y120/Q125 and N157/S169) were chosen to further investigate within-host evolution using HIV-1 RNA extractions from plasma samples from eight patients. Clonal sequencing confirmed strong linkage between polymorphisms at these sites in every case. We used massively parallel pyrosequencing (MPP) to reconstruct within-host evolution in these patients. Experimental error associated with MPP was quantified by performing replicates at two different stages of the protocol, which were pooled prior to analysis to reduce this source of variation. Phylogenetic reconstruction from these data revealed correlated substitutions at Y120/Q125 or N157/S169 repeated across multiple lineages in every host, indicating convergent within-host evolution shaped by epistatic interactions.
coevolution; epistasis; HIV-1; next-generation sequencing; ancestral reconstruction; sequencing error
Over time, natural selection molds every gene into a unique mosaic of sites evolving rapidly or resisting change—an “evolutionary fingerprint” of the gene. Aspects of this evolutionary fingerprint, such as the site-specific ratio of nonsynonymous to synonymous substitution rates (dN/dS), are commonly used to identify genetic features of potential biological interest; however, no framework exists for comparing evolutionary fingerprints between genes. We hypothesize that protein-coding genes with similar protein structure and/or function tend to have similar evolutionary fingerprints and that comparing evolutionary fingerprints can be useful for discovering similarities between genes in a way that is analogous to, but independent of, discovery of similarity via sequence-based comparison tools such as Blast.
To test this hypothesis, we develop a novel model of coding sequence evolution that uses a general bivariate discrete parameterization of the evolutionary rates. We show that this approach provides a better fit to the data using a smaller number of parameters than existing models. Next, we use the model to represent evolutionary fingerprints as probability distributions and present a methodology for comparing these distributions in a way that is robust against variations in data set size and divergence. Finally, using sequences of three rapidly evolving RNA viruses (HIV-1, hepatitis C virus, and influenza A virus), we demonstrate that genes within the same functional group tend to have similar evolutionary fingerprints. Our framework provides a sound statistical foundation for efficient inference and comparison of evolutionary rate patterns in arbitrary collections of gene alignments, clustering homologous and nonhomologous genes, and investigation of biological and functional correlates of evolutionary rates.
adaptive evolution; codon models; evolutionary distance; machine classification
The difference between regional rates of HIV-associated dementia (HAD) in patients infected with different subtypes of HIV suggests that genetic determinants exist within HIV that influence the ability of the virus to replicate in the central nervous system (in Uganda, Africa, subtype D HAD rate is 89%, while subtype A HAD rate is 24%). HIV-1 nef is a multifunctional protein with known toxic effects in the brain compartment. The goal of the current study was to identify if specific three-dimensional nef structures may be linked to patients who developed HAD. HIV-1 nef structures were computationally derived for consensus brain and non-brain sequences from a panel of patients infected with subtype B who died due to varied disease pathologies and consensus subtype A and subtype D sequences from Uganda. Site directed mutation analysis identified signatures in brain structures that appear to change binding potentials and could affect folding conformations of brain-associated structures. Despite the large sequence variation between HIV subtypes, structural alignments confirmed that viral structures derived from patients with HAD were more similar to subtype D structures than to structures derived from patient sequences without HAD. Furthermore, structures derived from brain sequences of patients with HAD were more similar to subtype D structures than they were to their own non-brain structures. The potential finding of a brain-specific nef structure indicates that HAD may result from genetic alterations that alter the folding or binding potential of the protein.
Most of our knowledge about how antiretrovirals and host immune responses influence the HIV-1 protease gene is derived from studies of subtype B virus. We investigated the effect of protease resistance-associated mutations (PRAMs) and population-based HLA haplotype frequencies on polymorphisms found in CRF01_AE pro.
We used all CRF01_AE protease sequences retrieved from the LANL database and obtained regional HLA frequencies from the dbMHC database. Polymorphisms and major PRAMs in the sequences were identified using the Stanford Resistance Database, and we performed phylogenetic and selection analyses using HyPhy. HLA binding affinities were estimated using the Immune Epitope Database and Analysis.
Overall, 99% of CRF01_AE sequences had at least 1 polymorphism and 10% had at least 1 major PRAM. Three polymorphisms (L10V, K20RMI and I62V) were associated with the presence of a major PRAM (P < 0.05). Compared to the subtype B consensus, six additional polymorphisms (I13V, E35D, M36I, R41K, H69K, L89M) were identified in the CRF01_AE consensus; all but L89M were located within epitopes recognized by HLA class I alleles. Of the predominant HLA haplotypes in the Asian regions of CRF01_AE origin, 80% were positively associated with the observed polymorphisms, and HLA binding affinity was estimated to decrease 19–40 fold with the observed polymorphisms at positions 35, 36 and 41.
Polymorphisms in the CRF01_AE protease gene were common, and polymorphisms at residues 10, 20 and 62 most likely represent selection by use of protease inhibitors, whereas R41K and H69K were more likely attributable to recognition of epitopes by the HLA haplotypes of the host population.
CRF01_AE; HIV; HLA; polymorphisms; protease; resistance
Compensatory mutations improve fitness in genotypes that contain deleterious mutations but have no beneficial effects otherwise. As such, compensatory mutations represent a very specific form of epistasis. We show that intragenic compensatory mutations occur non-randomly over the gene sequence: they are more likely to appear at some sites than others. Moreover, the sites of compensatory mutations are more likely than expected by chance to be near the site of the original deleterious mutation. Furthermore, compensatory mutations tend to occur more commonly in certain regions of the protein even when controlling for clustering around the site of the deleterious mutation. These results suggest that compensatory evolution at the protein level is partially predictable and may be convergent.
compensatory mutation; deleterious mutations; experimental evolution; epistasis; primary structure
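One simple way to ask whether compensatory sites lie "nearer than expected by chance" to a deleterious site is a Monte Carlo test on distance along the primary sequence. The sketch below is only an illustration under a uniform null (every other site equally likely), which is not necessarily the null model used in the study. The function name, site numbers, and gene length are hypothetical.

```python
import random


def proximity_p_value(deleterious_site, compensatory_sites, gene_length,
                      n_perm=2000, seed=0):
    """One-sided Monte Carlo p-value for clustering around a deleterious site.

    Null model (an ASSUMPTION of this sketch): compensatory sites are drawn
    uniformly, without replacement, from all other positions 1..gene_length.
    """
    rng = random.Random(seed)
    candidates = [s for s in range(1, gene_length + 1) if s != deleterious_site]
    k = len(compensatory_sites)
    observed = sum(abs(s - deleterious_site) for s in compensatory_sites) / k

    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(candidates, k)
        mean_dist = sum(abs(s - deleterious_site) for s in draw) / k
        if mean_dist <= observed:  # random draw at least as clustered as observed
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical example: five compensatory sites close to a deleterious site at 100.
p = proximity_p_value(100, [98, 101, 103, 99, 105], gene_length=500)
print(f"p-value under uniform null: {p:.4f}")
```

Because the abstract also reports that some sites are intrinsically more likely to carry compensatory mutations, a more faithful null would weight sites unequally. The uniform choice here is purely for brevity.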
Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (≈5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. 
Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.
There are nine different subtypes of the main group of HIV-1, each originating as a distinct subepidemic of HIV-1. The distribution of subtypes is often unique to a given geographic region of the world and constitutes a useful epidemiological and surveillance resource. The effects of viral subtype on disease progression, treatment outcome and vaccine design are being actively researched, and the importance of accurate subtyping procedures is clear. In HIV-1, subtype assignment is complicated by frequent recombination among co-circulating strains, creating new genetic mosaics or recombinant forms: 43 have been characterized to date, and many more likely exist. We present an automated phylogenetic method (SCUEAL) to accurately characterize both simple and complex HIV-1 mosaics. Using computer simulations and biological data we demonstrate that SCUEAL performs very well under various conditions, especially when some of the existing classification procedures fail. Furthermore, we show that a small, but noticeable proportion of subtype characterization stored in public databases may be incomplete or incorrect. The computational technique introduced here should provide a much more accurate characterization of HIV-1 strains, especially novel recombinants, and lead to new insights into molecular history, epidemiology and geographical distribution of the virus.
Human populations are structured by social networks, in which individuals tend to form relationships based on shared attributes. Certain attributes that are ambiguous, stigmatized or illegal can create a 'hidden' population, so-called because its members are difficult to identify. Many hidden populations are also at an elevated risk of exposure to infectious diseases. Consequently, public health agencies are presently adopting modern survey techniques that traverse social networks in hidden populations by soliciting individuals to recruit their peers, e.g., respondent-driven sampling (RDS). The concomitant accumulation of network-based epidemiological data, however, is rapidly outpacing the development of computational methods for analysis. Moreover, current analytical models rely on unrealistic assumptions, e.g., that the traversal of social networks can be modeled by a Markov chain rather than a branching process.
Here, we develop a new methodology based on stochastic context-free grammars (SCFGs), which are well-suited to modeling tree-like structure of the RDS recruitment process. We apply this methodology to an RDS case study of injection drug users (IDUs) in Tijuana, México, a hidden population at high risk of blood-borne and sexually-transmitted infections (i.e., HIV, hepatitis C virus, syphilis). Survey data were encoded as text strings that were parsed using our custom implementation of the inside-outside algorithm in a publicly-available software package (HyPhy), which uses either expectation maximization or direct optimization methods and permits constraints on model parameters for hypothesis testing. We identified significant latent variability in the recruitment process that violates assumptions of Markov chain-based methods for RDS analysis: firstly, IDUs tended to emulate the recruitment behavior of their own recruiter; and secondly, the recruitment of like peers (homophily) was dependent on the number of recruits.
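To make the parsing step more concrete, the sketch below implements the inside pass (the half of the inside-outside algorithm that computes the probability of a string) for an SCFG in Chomsky normal form. It is a generic textbook version, not the HyPhy implementation. The toy grammar and the single-symbol encoding of a recruitment string are hypothetical and do not reproduce the encoding used in the study.

```python
from collections import defaultdict


def inside_probability(tokens, unary, binary, start="S"):
    """Probability that `start` derives `tokens` under an SCFG in Chomsky normal form.

    unary:  {(A, terminal): prob} for rules A -> terminal
    binary: {(A, B, C): prob}     for rules A -> B C
    """
    n = len(tokens)
    # inside[(i, j)][A] = P(A derives tokens[i..j]), indices inclusive
    inside = defaultdict(lambda: defaultdict(float))

    # Spans of length 1: apply terminal rules.
    for i, tok in enumerate(tokens):
        for (a, term), p in unary.items():
            if term == tok:
                inside[(i, i)][a] += p

    # Longer spans: sum over every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                left, right = inside[(i, k)], inside[(k + 1, j)]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        inside[(i, j)][a] += p * left[b] * right[c]

    return inside[(0, n - 1)][start]


# Hypothetical toy grammar: S -> S S (0.3) | 'r' (0.7)
TOY_UNARY = {("S", "r"): 0.7}
TOY_BINARY = {("S", "S", "S"): 0.3}

print(inside_probability(["r", "r"], TOY_UNARY, TOY_BINARY))
print(inside_probability(["r", "r", "r"], TOY_UNARY, TOY_BINARY))
```

For the three-token string, the two possible binary bracketings each contribute the same probability, and the inside table sums them. This marginalisation over tree shapes is what lets a grammar describe branching structure that a Markov chain cannot represent.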
SCFGs provide a rich probabilistic language that can articulate complex latent structure in survey data derived from the traversal of social networks. Such structure that has no representation in Markov chain-based models can interfere with the estimation of the composition of hidden populations if left unaccounted for, raising critical implications for the prevention and control of infectious disease epidemics.
We develop a model-based phylogenetic maximum likelihood test for evidence of preferential substitution toward a given residue at individual positions of a protein alignment—directional evolution of protein sequences (DEPS). DEPS can identify both the target residue and sites evolving toward it, help detect selective sweeps and frequency-dependent selection—scenarios that confound most existing tests for selection, and achieve good power and accuracy on simulated data. We applied DEPS to alignments representing different genomic regions of influenza A virus (IAV), sampled from avian hosts (H5N1 serotype) and human hosts (H3N2 serotype), and identified multiple directionally evolving sites in 5/8 genomic segments of H5N1 and H3N2 IAV. We propose a simple descriptive classification of directionally evolving sites into 5 groups based on the temporal distribution of residue frequencies and document known functional correlates, such as immune escape or host adaptation.
directional selection; evolution of influenza; maximum likelihood; episodic selection
Spidermonkey is a new component of the Datamonkey suite of phylogenetic tools that provides methods for detecting coevolving sites from a multiple alignment of homologous nucleotide or amino acid sequences. It reconstructs the substitution history of the alignment by maximum likelihood-based phylogenetic methods, and then analyzes the joint distribution of substitution events using Bayesian graphical models to identify significant associations among sites.
Availability: Spidermonkey is publicly available both as a web application at http://www.data-monkey.org and as a stand-alone component of the phylogenetic software package HyPhy, which is freely distributed on the web (http://www.hyphy.org) as precompiled binaries and open source.
After acute HIV infection, CD8+ T cells are able to control viral replication to a set point. This control is often lost after superinfection, although the mechanism behind this remains unclear. In this study, we illustrate in an HLA-B27+ subject that loss of viral control after HIV superinfection coincides with rapid recombination events within two narrow regions of Gag and Env. Screening for CD8+ T cell responses revealed that each of these recombination sites (∼50 aa) encompassed distinct regions containing two immunodominant CD8 epitopes (B27-KK10 in Gag and Cw1-CL9 in Env). Viral escape and the subsequent development of variant-specific de novo CD8+ T cell responses against both epitopes were illustrative of the significant immune selection pressures exerted by both responses. Comprehensive analysis of the kinetics of CD8 responses and viral evolution indicated that the recombination events quickly facilitated viral escape from both dominant WT- and variant-specific responses. These data suggest that the ability of a superinfecting strain of HIV to overcome preexisting immune control may be related to its ability to rapidly recombine in critical regions under immune selection pressure. These data also support a role for cellular immune pressures in driving the selection of new recombinant forms of HIV.
We assessed the effect of herpes simplex virus type 2 (HSV-2) acquisition on the plasma HIV RNA and CD4 cell levels among individuals with primary HIV infection using a retrospective cohort analysis. We studied 119 adult, antiretroviral-naive, recently HIV-infected men with a negative HSV-2–specific enzyme immunoassay (EIA) result at enrollment. HSV-2 acquisition was determined by seroconversion on HSV-2 EIA, confirmed by Western blot analysis. Ten men acquired HSV-2 infection a median of 1.3 years after HIV infection (HSV-2 incidence rate of 7.4 per 100 person-years of follow-up). The median time of follow-up after acquiring HSV-2 infection was 303 days. All men except 1 were asymptomatic during HSV-2 acquisition, and only 1 HSV-2 seroconverter, who was asymptomatic, had a transient increase in blood HIV load (0.5 log10 copies/mL over 11 days). The HSV-2 incidence rate was high in our cohort of recently HIV-infected individuals; however, HSV-2 acquisition did not significantly change the plasma HIV dynamics and CD4 cell levels.
HIV RNA; incident herpes simplex virus-2; viral dynamics
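The incidence figure above follows the usual events-per-person-time definition. The short sketch below shows that arithmetic and back-calculates the approximate follow-up implied by the reported numbers. The total person-years are not stated in the abstract, so the back-calculated value is an inference from rounded figures, and the function names are hypothetical.

```python
def incidence_rate_per_100py(events, person_years):
    """Incidence rate expressed per 100 person-years of follow-up."""
    return 100.0 * events / person_years


def implied_person_years(events, rate_per_100py):
    """Back-calculate total follow-up from an event count and a reported rate."""
    return 100.0 * events / rate_per_100py


# Reported in the abstract: 10 seroconversions, 7.4 per 100 person-years.
follow_up = implied_person_years(10, 7.4)
print(f"implied follow-up: about {follow_up:.0f} person-years")
print(f"round trip: {incidence_rate_per_100py(10, follow_up):.1f} per 100 person-years")
```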