Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Theor Biol. Author manuscript; available in PMC 2010 November 21.
Published in final edited form as:
PMCID: PMC2760689

Modeling Sequence Evolution in Acute HIV-1 Infection


We describe a mathematical model and Monte-Carlo (MC) simulation of viral evolution during acute infection. We consider both synchronous and asynchronous processes of viral infection of new target cells. The model enables an assessment of the expected sequence diversity in new HIV-1 infections originating from a single transmitted viral strain, estimation of the most recent common ancestor (MRCA) of the transmitted viral lineage, and estimation of the time to coalesce back to the MRCA. We also calculate the probability of the MRCA being the transmitted virus or an evolved variant. Excluding insertions and deletions, we assume HIV-1 evolves by base substitution without selection pressure during the earliest phase of HIV-1 infection prior to the immune response. Unlike phylogenetic methods that follow a lineage backwards to coalescence, we compare the observed data to a model of the diversification of a viral population forward in time. To illustrate the application of these methods, we provide detailed comparisons of the model and simulations results to 306 envelope sequences obtained from 8 newly infected subjects at a single time point. The data from 6/8 patients were in good agreement with model predictions, and hence compatible with a single-strain infection evolving under no selection pressure. The diversity of the samples from the other two patients was too great to be explained by the model, suggesting multiple HIV-1-strains were transmitted. The model can also be applied to longitudinal patient data to estimate within-host viral evolutionary parameters.

Keywords: HIV-1, population dynamics, viral evolution


The HIV-1 population in a chronically infected individual is subject to continuous immune selection (Richman et al., 2003; Wei et al., 2003), and evolves to become a complex set of related viruses, often referred to as a quasispecies, through the course of an infection (Lee et al., 2009; Shankarappa et al., 1999; Wolinsky et al., 1996). A reduction in viral complexity at transmission was originally noted in the context of mother to infant transmission (Wolinsky et al., 1992), and has been extensively studied in recent years (Derdeyn et al., 2004; Dickover et al., 2006; Edwards et al., 2006; Painter et al., 2003). During sexual transmission of HIV-1, a genetic bottleneck usually occurs since a limited number of viral strains are transmitted from the complex quasispecies typically found in a donor (Delwart et al., 2001; Derdeyn et al., 2004; Zhang et al., 1993), although other studies have found multiple transmitted variants at a relatively high frequency (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Infection with multiple genetic variants has been associated with genital tract ulcers and use of hormonal contraceptives (Sagar et al., 2004). Recent studies have identified patients in the earliest weeks of infection, many prior to the selective pressure imposed by the newly infected host’s initial immune response (Abrahams et al., 2009; Keele et al., 2008; Salazar-Gonzalez et al., 2008). Sequence data from the HIV-1 env gene collected during acute HIV-1 infection in 102 subjects by Keele et al. (2008) show a wide range of diversity, with the average number of bases differing between pairs of sequences from the same patient varying between 0.01% and 2.18%, suggesting that both single and multiple viral strain transmissions may have occurred in this cohort.

In this paper, we develop simple models of HIV-1 evolution early in infection with the aim of quantitatively assessing whether infections were established by single or multiple viral strains. Further, in the case of single strain, i.e., homogeneous, infections we aim to identify the most-likely initiating strain or a close descendent that gave rise to the observed lineage.

We derive analytical results and approximate formulas. We used Monte Carlo (MC) simulations to capture the randomness in early HIV-1 evolution and to compare with our analytical results. The analytical results were derived from idealized models whereas the MC simulations allowed for more accurate models that incorporate unequal base composition and an evolutionary based substitution matrix that defines the frequency at which base i is replaced by base j during a mutation event. Previously, Monte Carlo methods have been used to study the within-host dynamics of HIV-1 infection (Heffernan and Wahl, 2005; Kamina et al., 2001; Ribeiro and Bonhoeffer, 1999; Ruskin, 2002; Tan and Wu, 1998; Tuckwell and Le Corfec, 1998), but here our focus is on sequence evolution and not viral and T cell dynamics as in these earlier works. Keele and colleagues applied a variant of this model to the analysis of 102 B-clade infected patients (Keele et al., 2008) and of 18 experimentally SIV-infected rhesus macaques (Keele, 2009). Abrahams et al. (2009) have used the same techniques to analyze a set of 69 C-clade infected subjects.

In this study we provide a complete mathematical description of the model, and explore the implications of varying the baseline assumptions, the input parameters, as well as purely analytical versus computational outcomes. We use just 8 of the patients described in Keele et al. (Keele et al., 2008) to illustrate nuances in the application of our model. These 8 subjects were chosen to be representative of the full set of 102 patients with 80% being characterized as having homogenous infection and 20% heterogeneous infections.

The main problem we focus on here is developing a systematic, reasoned way to determine whether a single strain or multiple stains of HIV-1 infected an individual. This is not straight-forward; even if an individual is infected by a single strain, with time this transmitted virus will diversify. Thus, given a set of sequences one needs to compute how much diversity would be expected by a given time from infection. For sequences collected early after infection, if the diversity is much greater than what is expected from a homogeneous infection, then multiple strain infection is a likely explanation. Other signatures of multiple strain infection might also be present, e.g., the sequences cluster into groups with very different most recent common ancestors. Our model also provides an estimate of the time from the origin of the most recent common ancestor (MRCA) of the sampled variants. If the sampled variants are representative of a lineage that was initiated by the infecting strain, the time to the MRCA should correspond to the time of infection; if the lineage arose in the donor, the time to the MRCA would be longer than the infection; if the lineage arose post-infection in the newly infected individual as a consequence of selection, the time estimated to the MRCA would be less than the time from infection.

To characterize HIV-1 evolution in samples in the earliest weeks of infection, we have modeled the events that occur between virus transmission and peak replication, approximately 21-35 days later. Our model assumes random drift prior to the initial immune response and exponential viral expansion prior to peak viremia. We ignored the effects of selection—based on the premise that the sequences we analyze were obtained sufficiently early in infection that immune responses would not yet provide substantial selective pressure. Although other sources of selection may be present, these also are ignored in our model. The effects of recombination were not modeled either. Comparing the expectations based on the mathematical model and the simulations to observed acute infection sequence data enables us to explore the validity of these assumptions in real infections. We show that under the hypothesis of exponential growth in the infected cell population the frequency distribution of the genetic distances between pairs of HIV-1 sequences follow an approximate Poisson distribution and star-phylogeny topology. This was first shown by Slatkin and Hudson in the context of the evolution of mitochondrial DNA (Slatkin and Hudson, 1991).

Since the precise time of initial HIV-1 infection cannot usually be known with certainty, the status of acute/early subjects can be classified using the “Fiebig” staging system (Table 1), which is based on an orderly appearance of viral RNA, antigen and antibodies in plasma during early infection (Fiebig et al., 2005; Fiebig et al., 2003). Prior to stage I, where plasma viral RNA first becomes detectable, is the eclipse phase. The length of this period can be roughly estimated based on clinical histories from a series of studies (Clark et al., 1991; Gaines et al., 1988; Lindback et al., 2000a; Lindback et al., 2000b; Schacker et al., 1996) and suggests an average length of about 10 days (range 7-21 days, see Table 1). For each patient in our dataset the Fiebig stage is known. This provides a rough estimate of the time since infection, which can be compared with the estimated time since the MRCA computed with our models from the sampled sequences.

Table 1
Fiebig stage classification for sub-stages of HIV-1 primary infection, and the average and cumulative duration of each phase. The duration of each stage is presented in Keele et al. (Keele et al., 2008). We obtained 95% CIs, shown in parenthesis, by recalculating ...

Coalescent theory can also be used to find common ancestors and estimate the time to coalescence. Further, results have been obtained using coalescent theory for the type of linear birth death process that underlies the model described in this paper (Kuhner et al., 1998; Rannala, 1997). When the infection is indeed homogeneous and it meets our model assumptions (exponential viral growth, no selection, etc.), we are under a very particular evolutionary scenario that allows us to use more direct and computationally efficient methods to derive the approximate time to the MRCA. Even though appropriate, Bayesian estimation methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007), as the ones provided by the software BEAST by (Drummond and Rambaut, 2007), are computationally intensive even when the number of sequences is small as they perform a whole suite of additional tests that are redundant for our particular evolutionary scenario. The methods presented below are computationally efficient, and our model results can be obtained in minutes for the entire dataset, while in our hands BEAST runs for a single patient took over 3 hours. Further, as we shall show, the estimated time to the MRCA using BEAST and using our method are very similar. Thus, we believe the methods we present below enable a rapid exploration of the implications of a forward evolution model, provide consistent results with a coalescence model, and allow us to compare the model to the data to infer biologically interesting information from the sample.


Mathematical Models and Monte-Carlo Simulations

HIV-1 can be transmitted from one individual to another either through the transmission of virus particles or infected cells. Since transmission of virus will quickly generate infected cells, we describe the transmission as if it occurred through infected cells.

In the case of sexually transmitted HIV-1, most often only a single viral sequence initiates the new infection, or only a single sequence grows out to yield a detectable level of viremia {Delwart, 2001 #61; Derdeyn, 2004 #20; Zhang, 1993 #59; Abrahams, 2009 #185}. Thus, we shall model the case in which infection is initiated by a single HIV-1 sequence. The simplest implementation of this hypothesis is to assume that a single cell, carrying this sequence as a provirus, starts the infection. We shall also examine the case in which multiple cells carrying identical sequences start the infection. When the predictions of these models fail to explain the extent of sequence heterogeneity observed, this is evidence that either multiple sequences have been transmitted, or that immune or other selective pressures are driving the observed level of diversification (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999).

We assume that the initial infected cell produces virus; most are cleared, but some successfully infect a new generation of cells. The number of secondary infections caused by one infected cell placed in the population of cells of an uninfected individual and fully susceptible to infection is called the basic reproductive ratio, R0. While data on early viral kinetics in humans is limited, the available data in humans infected with HIV-1 and in monkeys infected with SIV and SHIV show that the virus grows exponentially until a viral load peak is attained a few weeks after infection (Little et al., 1999; Nowak et al., 1997; Stafford et al., 2000). Following the peak, viral levels decline and establish a set-point. At the set-point each infected cell, on average, successfully infects one other cell during its lifetime. If it infected more than one other cell viral levels would increase, whereas if it infected less than one cell viral levels would fall. Thus, once the set-point is established we can assume that the number of infected cells remains constant.

In our model we assume a homogeneous infection in which the virus grows exponentially with no selection pressure, no recombination, no occurrence of back mutations and a constant mutation rate across positions and across lineages. We also assume the virus grows with a fixed generation length and that each infected cell produces the same number of progeny. Clearly, the approximation of exponential growth must break down as target cells become limited and the number of cells infected per generation must decrease from R0, stabilizing at 1 when the viral set point is established. However, we present evidence indicating that what is important in determining viral diversity is the number of reverse transcription events that have occurred along a lineage, i.e., the number of generations that separate the initial infection from the time at which sequences were obtained, and not the number of cells infected at each generation. In our first, simplest model, we also assume that the infection is synchronous, and thus can be characterized by discrete generations, with the virus infecting exactly R0 new cells at each generation.

Let all genomes be of same length NB, let ε be the reverse transcriptase point mutation rate, which we assume to be constant throughout the genome, and let s0 denote the infecting strain. Then, after the first replication cycle, the R0 daughter cells will each differ from the infecting strain at exactly n bases (positions) with a probability given by


After the second replication cycle, the total number of mutations in the new generation of infected cells will follow the probability distribution given by

P(mutations=n[mid ]gen.=2)=k=0nBinom(nk;NB,ε)Binom(k;NB,ε)=Binom(n;2NB,ε).

In other words, n mutations in the second generation are the result of all possible combinations of k mutations occurring in the first replication, and n-k occurring in the second, for all integers k between 0 and n. Mathematically, this is given by the convolution of two Binomial distributions with the same probability parameter (ε, in our case), which is in turn a Binomial distribution (Casella and Berger, 1990). One can see that after a replication cycles,

P(mutations=n[mid ]gen.=a)=Binom(n;aNB,ε).

Let HD0 denote the Hamming distance, i.e., the number of base differences from the infecting strain s0. In general, given n mutations, we have HD0n , as some of the n mutations may occur at the same site. However, after a replication cycles, the probability that one site has mutated at most once is


Therefore, the probability that at least one site across the entire genome has mutated more than once is Q = 1-PNB , which is approximately 12NBa(a1)ε2 when ε is small.

For the HIV-1 base substitution rate ε, we use the value of 2.16 × 10-5 base substitutions per replication cycle. The base substitution rate we use is derived from the results of (Mansky and Temin, 1995), where after a single round of HIV-1 replication they found an overall mutation rate of 3.4 × 10-5, but this included insertions and deletions. Restricting their results to only base substitutions, we calculated from their data a base substitution rate of 2.16 × 10-5 per base per replication. The sequences that we analyze later are envelope gene sequences with NB ~2600 bases. Substituting these values into Q, we find that the probability of two or more mutations at the same site increases with the number of replication cycles, yet even for a=50 one finds Q < 0.0015. Hence we expect that we can ignore back mutations when we analyze patient sequence data obtained early in infection.

Under this assumption, HD0 coincides with the number of mutations and hence it follows the same Binomial distribution:

P(HD0=d[mid ]a)=Binom(d;aNB,ε)+O(a2ε2NB)(aNBd)εd(1ε)aNBd

Notice that this formula assigns a nonzero probability to the Hamming distance being larger than the total number of bases NB, and this is an artifact of our approximations. In fact, using Chebyshev’s inequality (Feller, 1957), one can show (see Materials and Methods) that P(mutations>NB[mid ]a)O(a2ε2NB)<<1, and thus indeed, within our approximation, P(mutations > NB) ≈ 0.

Synchronous Infection, Mathematical Model

In the synchronous model, we assume that each cell infects R0 other cells at the end of its life cycle. At generation a from the infection event, there will be R0a infected cells, each one carrying one HIV-1 genome. Further, at generation a


In order to analyze single time point data, it is useful to compute the intersequence HD distribution rather than HD0 (since for the latter we would need to assume that we know the infecting sequence s0). In other words, rather than comparing any given sequence to the founder sequence, we take all possible sequence pairs in the sample and compute their relative HD. Given two sequences, s1 and s2, picked at random at time t, that have evolved independently from the common ancestor s0, the distribution of HD[s1, s2] is

P(HD[s1,s2]=d[mid ]a)=k=0dP(HD0=k[mid ]a)P(HD0=dk[mid ]a)==k=0dBinom(k;aNB,ε)Binom(dk;aNB,ε)=Binom(d;2aNB,ε)

The assumption of independent evolution of sequences s1 and s2 from the common ancestor s0 is equivalent to assuming that the phylogeny of these sequences is star-like with s0 being the only common ancestor. However, it is also conceivable that the common ancestor of s1 and s2 is not the transmitted sequence s0 but rather a sequence that evolved from s0. This is an important issue, as the identity of the transmitted sequence is generally not known. As illustrated in Figure 1, the probability of s1 and s2 coalescing m generations from the ancestor, with m < a, is the number of all sequences coalescing m generations from the ancestor (minus the one we have already picked), divided by the total number of sequences at generation a (again, minus the one we picked already)

P(MRCA[s1,s2]am[mid ]s1,s2[set membership]x)=R0am1R0a1.

As shown in Material and Methods, this probability approaches zero rapidly as a and m get large.

Figure 1
Probability of early coalescence when R0=2

We can use Eq. (6) to determine the probability that two sequences from a given individual coalesce at a time point other than at the transmitted sequence. For example, for samples obtained a few weeks into the infection and under no selective pressure, e.g. a=5 and R0=6 (see below) the probability of coalescence later than generation 0, from Eq. (6) is 0.167 for m=1, 0.028 for m=2 and 0.0045 for m=3. For greater values of a, the probability of coalescing at generation 3 or higher is < 0.005. Thus, we expect that use of Eq. (5) and the assumption of a star-like phylogeny to hold for all patient samples in which a single viral strain was transmitted. However, even for large a, the probability of coalescing 1 or 2 generations away from the founder strain still remains non-negligible: even at generation 40, the probability of coalescence later than generation 0, is still 0.167 for m=1, and 0.028 for m=2. In a set of sequences from an individual, these estimates of less than a 0.5% chance of being 3 generations away from the actual founder, a 3% chance of being 2 generations away, and a 17% chance of being 1 generation away hold for each independent pair of comparisons.

Samples that diverge from a star phylogeny often show patterns of shared mutations across sequences. The above calculation shows that coalescence not at the founder strain is most likely one generation away from the actual founder. Therefore, a mutation observed very early in the infection, could possibly carry on and develop as a second lineage. Also, we can not rule out that the founders of these two lineages were simply transmitted from the donor and hence coalesced in the donor. For infections transmitted during the acute phase before much diversification occurred in the donor this may even be likely. Lastly, we observe that since the base substitution rate ε is so small it is very likely that the sequences at a=0 and a=1 are the same. In fact, with probability (1-ε)RoNB all R0 proviruses at generation 1 are identical to the founder; e.g., with R0=6 and NB=2600 this probability is 0.71.

For ease of notation, let the intersequence Hamming distance distribution be denoted HDI . Thus,

P(HDI=d[mid ]a)=P(HD[s1,s2]=d[mid ]a)=Binom(d;2NBa,ε).

We also define some basic quantities based on the calculated Hamming distance distribution from the initial strain and the intersequence Hamming distance distribution between all possible sequence pairs. The divergence at generation a is defined as the average Hamming distance per base from the initial founder strain. From Eq. (4), this can be estimated as the mean of the binomial distribution divided by the number of bases, and is εa . The diversity at generation a is the average intersequence Hamming distance per base between sequence pairs and from Eq. (7), diversity = 2εa. The variance of the intersequence HD distribution at generation a divided by the number of bases is 2εa(1-ε). We also introduce the % sequence identity, defined as the proportion of sequences identical to the sequence, s0, of the infecting strain. As the virus evolves, the population diversifies from the founder strain and the proportion of the total population that is identical to the MRCA decays exponentially. By setting d = 0 in Eq. (4), we have % sequence identity = Binom(0;aNB) = (1-ε)aNB .

Finally, we wish to compute the expected maximum HDI. This is motivated by the observation that more extensive diversity is seen in some transmission cases, presumably the result of infections established by two or more distinct HIV-1 strains (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Thus, a strategy for systematic classification of homogeneous, single strain infections versus heterogeneous infections is needed. To address this, we ask what is the maximum diversity that one would expect in a single strain infection within a time frame compatible with the subject’s Fiebig stage at the time of sampling. The maximum diversity expected depends on the sample size: the larger the sample, the higher the expected maximum, as outliers become more likely to be sampled.,The expected maximum HD at generation a is computed in the Materials and Methods.

Monte Carlo Simulation of the Synchronous Infection Model

The mathematical model derived above assumes that all mutations have equal probability. However, in HIV-1 the four bases are not equally frequent and the rates of base substitution are not all equal. Further, we neglected more than one mutation occurring at the same site, although this potentially can occur. To examine the influence of these biological details on the predictions of HIV-1 sequence evolution, and to capture stochastic effects, we developed a simple Monte Carlo simulation and compared its predictions with those of the mathematical model.

The simulation starts at “generation 0” with one cell infected with a single HIV-1 sequence NB bases long. The initial sequence is generated randomly with base frequencies of 0.19 (T and C), 0.37 (A) and 0.24 (G), based on typical frequencies measured in the full-length envelope gene (see Materials and Methods). We assume that the initially infected cell infects R0 other cells synchronously. We chose R0 =6, representing the average value of R0 across 10 different patients studied by (Stafford et al., 2000). During the infection of these cells mutations can occur. In the simulation, the sites for the occurrence of the base substitutions are chosen randomly from the NB bases. The probability that base i is replaced by j is given by a 4×4 transition matrix deduced from a maximum likelihood General Time Reversible (GTR) model of substitutions that occur in the full length HIV-1 envelope gene (for the matrix and the details see Materials and Methods). Unlike in our analytical model, we allow the same site to mutate more than once. For a given sample size NS, the MC simulation produces an HD frequency distribution for each generation step within a certain time frame chosen by the user. At each generation exactly NS sequences are sampled (with replacement) and the intersequence HD distribution is thus constructed.

We implemented the MC code so that the user has the option of running either one or multiple iterations. When multiple iterations are done, the HD frequencies computed at each generation step are averaged over all iterations, thus retrieving the mean field description described by the analytical method. On the other hand, when not averaging over all iterations, one captures the stochastic effects due to rare mutations, e.g. the run-to-run variance, which, in real life, corresponds to the host-to-host variation. This is completely neglected in the analytical method, which is a mean field description.

In the synchronous model the population of infected cells grows exponentially as R0a. Once the number of infected cells reaches 104, which is close to some estimates of the effective population size Ne of the virus in a mature infection (Achaz et al., 2004; Leigh Brown, 1997; Wakeley, 2008), there is no need to expand the population further in order to study the diversity of genetic sequences as long as we allow the proviral sequences contained in these 104 to continue to evolve. We thus cap the total population of infected cells at 104 and let those cells produce 104 R0 progeny. We then randomly sample 104 cells out of the 104 R0, and continue the simulation. In this way, we are able to maintain the population of infected cells constant over the rest of the simulation, which is equivalent to what occurs when the viral set-point is reached. We confirmed that our simulation results (i.e., means and 95% CIs) did not depend on the cutoff value of infected cells as long as it is above 103 (data not shown).

Using the MC simulation, we computed divergence, diversity, maximum HD and % sequence identity (defined in the previous section) as functions of time, and confirmed that their dependence on the number of generation steps matched the dependency found analytically. Figure 2 shows this comparison as well as 95% confidence intervals computed from 1,000 MC simulations using a typical number of sampled sequences, NS=30.

Figure 2
Synchronous Infection Model

Asynchronous Infection

The synchronous infection model is a simplification. In reality, cell infections occur at random times by the viruses released from an infected cell. Viral production starts on average about 24 hours after a cell is initially infected (Perelson et al., 1996), and most likely continues until cell death. In order to understand the dependence of our results on the synchronous approximation, we devised a second model in which infections are asynchronous. Again we will assume each cell infects R0 others during its lifetime. While each of the R0 infections could occur at different times, here we take a first step in assessing the role of asynchrony by assuming the infections occurs at two times. Each infected cell infects α cells at an intermediate time, and the rest at the end of its life-time. Even though this scenario is again a simplification, it is useful to analyze the differences from the previous model, and as we shall show below it preserves the same basic mathematical structure as before. However, because an infected cell contributes progeny to the next generation before the end of its life there is now a faster growth rate of the infected cell population.

Let τa and 2τa be the times at which a newly infected cell infects other cells, and suppose it infects α cells at τa, and γ at 2τa, with α+γ=R0, the total number of cells one cell infected during its lifetime. Notice that if γ=0, then infections only occur at one time, α=R0, and τa is the generation time. Hence this model reduces to the synchronous model with generation time τs= τa, where τs denotes the duration of a single generation step in the synchronous model.

Denote by I(a,t) the number of infected cells of age a at time t. Here, by age, we mean the age of the infecting genome, measured in terms of the number of times the viral genome has undergone reverse transcription, rather than the cell’s physical age. Thus, a is equivalent to the generation variable we used in the synchronous model. As a consequence, at any given time t, under the synchronous model all cells belong to the same age group, whereas in the asynchronous model, the infecting genomes have undergone a minimum of [left ceiling]t2τa[right ceiling] replication cycles, and a maximum of [left floor]tτa[right floor] (see Figure 3), where [left ceiling] [right ceiling] and [left floor] [right floor] are the ceiling and floor functions defined as [left ceiling]x[right ceiling] = max {n [set membership] Z | nx} and [left ceiling]x[right ceiling] = min {n [set membership] Z | nx}, where Z is the set of integers.

Figure 3
Schematic diagrams of the synchronous and asynchronous MC infection models

Let I0 denote the initial number of infected cells and let n=[left floor]tτa[right floor], i.e., n represents time measured in units of τa. In the following we shall use n as the time variable. A general expression for I(a,n) is derived in Materials and Methods, and is given by

I(a,n)=I0αa(γα)na(ana)fora=[left ceiling]n2[right ceiling],,nand0otherwise.

As illustrated in Figure 3, at any given time n, the HIV-1 strains in an individual have undergone a minimum of [left ceiling]n2[right ceiling] and a maximum of n replication cycles. Then, as a function of time, the random variable HD0 follows the probability distribution

P(HD0=d[mid ]n)=a=[left ceiling]n2[right ceiling]n(aNBd)εd(1ε)aNBdI(a,n)a=[left ceiling]n2[right ceiling]nI(a,n)

In Materials and Methods we show the following relationship

a=[left ceiling]n2[right ceiling]nI(a,n)=I02[var phi](α2)n[(1+[var phi])n+1(1[var phi])n+1]=I0αnFn

where [var phi]=1+4γα2 and Fn=(1+[var phi])n+1(1[var phi])n+12n+1[var phi]. By substituting Eq. (10) into Eq. (9) we obtain

P(HD0=d[mid ]n)=1Fna=[left ceiling]n2[right ceiling]n(ana)(aNBd)(γα)naεd(1ε)aNBd,

with mean μn = λnεNB and variance σn2=λnε(1ε)NB, where

λn=n1+[var phi]2[var phi]+1[var phi][var phi]2.

Notice, as one would expect, that in neither the synchronous nor the asynchronous model, does the probability distribution depend on the initial number of infected cells, I0.

We show in Materials and Methods that up to terms in O(ε2) P(HD0 =d | n) is a Poisson distribution with mean μn = λnεNB, in other words

P(HD0=d[mid ]n)Pois(d,λnεNB)+O(ε2).

As a consequence, given that the intersequence HD distribution, HDI, is to a good approximation the self-convolution of the HD0 distribution, we get that HDI also follows a Poisson distribution with parameter 2λnεNB . Therefore, analogously to the synchronous model, divergence = λnεNB , diversity = 2λnεNB = variance and % sequence identity = Exp(-2λnεNB) . The expected maximum HD is calculated as in the synchronous infection model, Eq. (8) and (9), except we now insert Pk(a)=P(HD0=k[mid ]a) from Eq. (11).

In the synchronous model, the intersequence Hamming distance, Eq. (5), follows a binomial distribution, which is also well approximated by a Poisson distribution with equal mean since 2aNB is large, ε is small and λ = 2aNBε is of reasonable size (Casella and Berger, 1990). Hence in both scenarios the HD follows essentially the same probability distribution, but with different divergence and diversity, implying a different rate of evolution due to the asynchrony in infection.

Let N(n) be the total number of newly infected cells at time n. For I0=1 and R0 >> 1 , we show in Material and Methods that N(n)(α2(1+[var phi]))n=(α2(1+[var phi]))[left floor]tτa[right floor], whereas for the same time t, in the synchronous model, N(n)(R0)[left floor]tτs[right floor], where τs denotes the generation time in the synchronous model. Notice that τaτs see Materials and Methods), hence the two scenarios both provide descriptions of an exponentially growing infected cell population but with different exponents.

We also devised a partially randomized model, in which the α, instead of being a fixed parameter, is a random number uniformly distributed with mean R0/2, and then γ is always chosen to be R0-α. This introduces an additional variability the effect of which is inversely proportional to the total population size. Therefore, large fluctuations from the above description will occur within the first 2-3 generation steps, but will be negligible after that, thus recovering the above mean field description.

An important difference between the asynchronous and synchronous models is in the dependence on R0. In both models, divergence and diversity depend linearly on the number of generation steps. In the synchronous model, the slope of the diversity is 2ε does not depend on R0 and it. In the asynchronous model, the slope has the additional factor

λnn=1+[var phi]2[var phi],

with [var phi]=1+4γα2 and γ+α=R0. One sees that in the asynchronous model there is a dependence on R0. Let α=ηR0 for some 0<η≤1, and γ=R0(1-η). Then, one gets the expression of the slope in diversity as


Notice that the when η=1, we retrieve as a special case the synchronous model, and indeed Eq. (15) yields exactly 2ε. However, when η1, the above expression is strictly smaller than 2ε, and approaches 2ε as R0 increases. Figure 4 illustrates the dependence of the slope of the diversity as a function of the number of generation steps on R0.

Figure 4
The slope of the diversity versus time for different values of η and R0

Table 2 summarizes the above formulas for both the synchronous and the asynchronous models. In both models, divergence, diversity, and variance increase linearly as a function of time. Diversity and variance are expressed by the same mathematical formula, which is a consequence of the fact that the intersequence HDs follow a Poisson distribution. Divergence, diversity, and HDI variance do not depend on the number of bases per sequence. In contrast, the maximum HD and the % sequence identity depend on the sequence length NB. In the synchronous infection model, all the quantities are independent of the basic reproductive ratio, while in the asynchronous model, they do depend on the R0, as discussed above. Quantitatively, however, the asynchronous infection model provides minor corrections compared to the synchronous infection model.

Table 2
Analytical formulas for the basic quantities in the synchronous and asynchronous infection models. The expression for λn is given by Eq. (14)

Monte Carlo simulation of the asynchronous infection model

We keep the basic framework used for the synchronous MC simulation, and choose parameters R0=6, ε=2.16×10-5, and τa=1.5 days (see Materials and Methods). Furthermore, we simulate the situation where I0=1 and each infected cells infects R0/2 cells at τa, and the remaining R0/2 at time 2τa. We also devised a second scenario under which each infected cells infects α cells at time τa, where α is drawn from a random uniform distribution with mean R0/2, and R0-α at time 2τa. We confirmed that both types of simulations, after generation step 4, are in excellent agreement, and also confirmed our analytical results described in the previous section. Furthermore, both % divergence and % diversity grow linearly with time (in units of τa) at rates of 1.79 × 10-5 and 3.57 × 10-5, respectively (i.e., 1.2 ×10-5 and 2.4 ×10-5 per day, respectively).

At times τa, 2τa, 3τa, and so on, we sample NS sequences, construct the HD distribution and compute the divergence, diversity, HDI variance, % sequence identity and the expected maximum HD. As for the synchronous model, we find that the MC results agree well with the analytical results given in Table 2 (Figure 5).

Figure 5
Asynchronous infection model

We examined the dependency of our results on the values chosen for the parameters τ, R0, I0 and ε (Figure 6). In agreement with our calculations (Table 2), the diversity increases linearly with time (not shown) at a rate proportional to ε , inversely proportional to τa (Figures 6A and 6B), dependent on R0 (Figure 6C) and independent of I0 (not shown). The mutation rate controls the rate of increase of the divergence and the diversity. The larger the mutation rate, the faster the genomes mutate, hence the steeper the growth of the diversity (Figure 6A). The greater the generation time, the slower the genomes diversify, hence the smaller the growth in diversity (Figure 6B). The computed diversity depends on R0 (Figure 6C). At low values there is a strong dependence, e.g., from R0=2 to R0=6 there is 15.9% increase in the slope of the diversity. On the other hand, for larger values of R0, the dependence is much weaker (Figure 6C).

Figure 6
Parameter dependence (Asynchronous infection model)

Sample Size Dependence

When patients are sampled, only a limited number of sequences are obtained. The Monte Carlo simulations provide a means of assessing the variability introduced by limited sequencing and can be used to choose the number of samples to be sequenced based on the trade-offs between cost and accuracy. Figure 7 shows the variation in the HD frequency distribution as the sample size NS varies. Clearly, the larger the sample size, the smaller the variation.

Figure 7
Sample size dependence of the Poisson fit

We also asked what is the probability of erroneously classifying an infection as homogeneous due to poor sampling. In other words, what is the chance of missing a second lineage that represents a small percentage of the entire population when NS sequences are sampled. We assumed a range of hypothetical low population prevalence, from 0.1% to 20%, and computed the probability that, out of a sample size NS, all sequences come from the dominant lineage. This is given by a binomial probability of getting zero “successes” out of NS trials, where a “success” is defined as a sequence belonging to the minor lineage, and the probability p of success is given by its population prevalence


Thus, for example, for NS ≥30, we can be 95% confident that any missed variant would comprise less than 10% of the total viral population (see Fig. pS9 in (Keele et al., 2008)). The number of sequences sampled from our 8 patients ranged from 16 to 50. When NS =16, we are 95% certain that we have sampled all lineages that represent at least 20% of the population. On the other hand, when NS =50, we are 95% confident that any missed lineage would have to represent 5% or less of the total population.

Analysis of Sequence Data from 8 Acute Patients

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. Laboratory testing showed that 7 subjects were in Fiebig stage II, both viral RNA and p24 antigen were detected, and one subject was in Fiebig stage III, HIV-1 IgM in addition to viral RNA and p24 was detected (see Table 1). Detailed information of each subject is summarized in Table 3. Furthermore, samples were analyzed using the Hypermut tool (LANL) to test for overall enrichment of G-to-A hypermutation patterns with APOBEC3G/F signatures (Bourara et al., 2007; Harris and Liddament, 2004; Simon et al., 2005). Because these mutational patterns occur at a faster rate than ordinary mutations, they violate one of our model assumptions (constant mutation rate across sites and lineages). Only one sample, SUMA, presented mutations (three) that were found to carry the APOBEC3G/F signature. When all mutations were kept, the sample did not fit the Poisson model (χ2 goodness of fit p-value = 0.04, where a p<0.05 represents significant departure from the Poisson distribution), whereas a good fit was restored upon removing the APOBEC signatures (χ2 goodness of fit p-value = 0.243) (Keele et al., 2008).

Table 3
Patient characteristics at time of sampling, number of envelope sequences obtained and their length, and GenBank accession numbers

For each of the 8 samples we computed the HD0 and HDI frequencies, as well as the divergence, diversity, HDI variance, % sequence identity and maximum HDI. All quantities are summarized in Table 4. In order to compute the HD0 distribution, we used the LANL tool Consensus Maker (LANL) to compute a consensus sequence for each of the homogeneous samples, and then assumed that the consensus sequence was the founder sequence s0. We then proceeded to analyze the data in two steps: first, we used our models to infer what the expected maximum HD would be under a homogeneous infection, and used this to distinguish homogeneous infections from heterogeneous ones. Then, we applied our models to the homogeneous samples in order to infer a plausible time since the infection.

Table 4
Analysis of 8 HIV-1 acute sequence samples using the asynchronous infection model. For each subject, we calculated divergence, diversity and maximum intersequence HD. The parameter λ of the best fitting Poisson distribution is given for the 6 ...

The eight early infection samples presented two distinct behaviors: either overall low diversity (max HD below 0.5%) and unimodal intersequence HD distribution, or overall high diversity (max HD above 2.5%; patients 1051 and BORI) and multimodal intersequence HD distribution (see Table 4 and Figure 8). We hypothesized that the underlying difference between the two patterns was whether a single HIV-1 strain entered the host or if multiple distinct strains entered. In order to exclude single strain infection in the high diversity samples, we computed the maximum HD one could reasonably expect to achieve at a given point in the infection and compared this to predictions based on the Fiebig stage. To do this, we used the maximum HD probability distribution computed in the asynchronous model, Eq. (22) and (23), with P taken from equation (13), using the patient specific base length NB and sample size NS given in Table 3. To be conservative, we then computed the combined weight in the upper 2.5% tail that would be achieved at each generation step a. The maximum HD at generation step a is less than or equal to this value 97.5% of the time. For patient 1051 (Fiebig stage III), with probability 0.975 a minimum of 241 generation steps would be needed to achieve the observed maximum HD of 71 if the infection were homogeneous. This is not compatible with Fiebig stage III, which corresponds to a maximum of 37 days since infection (Table 1). Similarly, patient BORI (Fiebig stage II) yielded 261 generation steps for a maximum HD of 77, again incompatible with the Fiebig stage. We thus conclude that the high diversity samples were from multiple strain infections, whereas the low diversity samples were compatible with a single strain infection.

Figure 8
Poisson fit of intersequence Hamming distance distribution

An alternative way for distinguishing between homogeneous and heterogeneous infections can be obtained by looking at both the mean diversity and the variance of each sample. According to our analytical results, when the infection is truly homogeneous, the measured average diversity and its variance should be approximately equal, which is one of the basic properties of a Poisson distribution. However, since measured mean HD and variance are affected by sample size and stochastic effects in early evolution, they may differ. Thus, for a homogeneous infection we can require that the mean (diversity) and variance be located between the upper and lower 95% confidence limits computed from MC simulations based on the number of sequences sampled per patient (Figure 9). All of the six low diversity subjects (WEAU, WITO, TRJO, SUMA, 1054, 1056) satisfy this condition, while subjects 1051 and BORI violate it. Thus we classify subjects 1051 and BORI as ”heterogeneous” infections (i.e., infections initiated by two or more founder strains) and the other 6 subjects as “homogeneous” infections. The limitation of this classification is the assumption that the sequence population diversifies without any selection, e.g. under neutral evolution. This assumption does not hold for Fiebig stage III or higher, and this classification becomes problematic. The area outside the cone in Figure 9 indicates deviations from a Poisson distribution. In early Fiebig stages this suggests a heterogeneous infection, whereas in later stages it indicates either immune selection or a heterogeneous infection or both. Also due to purifying selection, samples from a heterogeneous infection could be located inside the cone in later Fiebig stages. However, to be classified as homogenous we also require that the average % diversity in these samples be less than that expected based on the upper limit of the cumulative duration for the Fiebig stage, i.e., day 34 for Fiebig stage II (Figure 9, vertical dashed line)

Figure 9
Highlighter plots and formal classification diagram

The above classification was also confirmed by visual inspection through the LANL tool Highlighter (LANL). The visualization tool represents each sequence in the sample with a horizontal line, and places a mark whenever the sequence presents a mutation from the consensus. Figure 9 shows two exemplary behaviors found in acute infection samples: the homogeneous sample SUMA has few randomly distributed mutations, whereas the heterogeneous sample 1051 presents a majority of sequences that are similar to the consensus and a second group of sequences that not only have many more mutations relative to the consensus, but these mutations are shared, indicative of a second lineage.

Examining Neutral Evolution: Star-like Phylogeny

During rapid exponential growth in the absence of selection, small samples of sequences are likely to have evolved from the founder sequence following a star-like phylogeny, i.e., they all coalesce at the founder (Wakeley, 2008). When this is the case, the intersequence HD frequency distribution, HDI, coincides with the self-convolution of the HD0 frequency distribution. To test whether this is the case and hence if the observed samples evolved following a star-like phylogeny, we compared the observed HDI frequencies from those computed assuming a star-like phylogeny. Given HD0 frequencies X0, X1, ... Xm, where Xi is the number of sequences with HD0=i, we constructed the theoretical HDI frequencies by computing the HD self-convolution frequencies Y0,Y1,...,Yn, where n=2m, as follows


The above formula is derived from the arithmetic convolution of the frequencies Xk with themselves. For example, Y0=X02, Y1 = X0X1, Y2=X0X2+X12, and so on.

Figure 10 compares the theoretical frequencies Y0,Y1,...,Yn (red lines), with the observed ones (the histograms). For 4 out of the 6 low diversity samples there was a perfect agreement between the calculated frequencies and the actual intersequence HD frequency. For the other two (WEAU and TRJO) a 5% and 15% difference was found, respectively. These very low differences, if any, in observed and computed frequencies suggest that all 6 samples evolved following an approximate star-like phylogeny.

Figure 10
Comparison between observed intersequence HD frequencies and the theoretical frequencies calculated from Eq. (19)

Estimating Time Since the MRCA

We applied the synchronous and asynchronous models to characterize early homogeneous infections and estimate the time since the MRCA. We did this using both the theoretical Poisson model and the output from the MC simulations.

In a homogeneous infection we expect the intersequence HD to follow a Poisson distribution, as in Eq. (13). We used a maximum likelihood method to fit a Poisson distribution to the observed HDI frequencies and estimate the Poisson parameter λ, which is proportional to the time or the number of generations since the most recent common ancestor. Given the vector of observed frequencies Y = (Y0,..., Yn), where Yi is the number of sequence pairs with HD=i, the log likelihood function is defined as


By minimizing the log likelihood function, we obtain


the mean of the intersequence Hamming distance distribution.

We then used λ to estimate the number of generation steps, n, since the most recent common ancestor (MRCA), using the following:

n={λ2εNBsynch.model[var phi]1+[var phi](λεNB1[var phi][var phi]2)asynch.model}

where n=a in the synchronous model, and n=[left floor]tτa[right floor] in the asynchronous model. Then t, the time since the MRCA, is given by t=s for the synchronous model, and t=a for the asynchronous model. As illustrated in Materials and Methods, we chose values of 2 and 1.5 days for τs and τa, respectively.

We assessed the goodness of the fit using a χ2 goodness-of-fit test statistic calculated from a singular value decomposition of the covariance matrix (see Materials and Methods). In a goodness of fit test the null hypothesis is that the two distributions being tested are statistically the same, hence a low p-value rejects the null hypothesis. All 6 low-diversity samples yielded a goodness-of-fit p-value of 0.1 or higher, suggesting that the observed distribution is consistent with a Poisson.

We fixed the values of base length NB and sample size NS appropriate for each patient and then, under either the synchronous or the asynchronous model, we generated the HD intersequence distribution for generation steps n1 through n2. For example, a typical output could be generated by choosing NB=2,600, NS=30 and generation steps 1 through 100. At each generation step, NS sequences were randomly sampled from it and the intersequence HD distribution of the sample was calculated. From the set of n2-n1+1 intersequence HD distributions generated, we used a Kolmogorov-Smirnov statistic (Casella and Berger, 1990) to pick the HD distribution that best fit the HD distribution computed from the observed sequence data. We repeated this 103 times. From n*, the average of generation step of the 103 best-fitting distributions, we estimate the time since the MRCA, as done above for the Poisson model.

For all 6 low-diversity samples the time of infection could be estimated based on patients’ Fiebig stage. Table 4 summarizes the estimates both from fitting the Poisson distribution and the MC output using the asynchronous model. The computed estimated time since infection estimated by each patient’s Fiebig stage were within the 95% CIs predicted by the MC simulations. In contrast, the Poisson fit yielded slightly lower estimates (with respect to the given Fiebig stage) for subjects SUMA and 1056. All Poisson estimates were systematically lower than the MC ones. This is due to the fact that run-to-run variation, while accounted for in the MC fitting, is neglected when fitting the Poisson distribution. To understand this, consider for example a base length of NB=2600. From Eq. (12), under the asynchronous model, the average Hamming distance from the founder strain, after one generation step will be 0.05. In this time, only 3 infected cells are produced, and, on average there will correspondingly be 0.15 mutations. However, in any given run, we can only have a discrete number of mutations. In particular, the first three cells will be identical 85% of the time (hence a mean HD of 0), 14% of the time there will be exactly one mutation across all three sequences (hence a mean HD of 0.333), and so on. Thus, though averaging over enough runs will lead to the correct mean value, most of the runs (85%) will have less than the expected number of mutations (after the first reverse transcription), and the rest will have more. This initial difference will persist through the generations and appear as an offset in a plot of mean HD versus generations (not shown). Averaging over 1,000 MC runs one finds that a sample of sequences with mean HD from the founder of 0.8 coalesce at a MRCA that is always 17 generation steps back (or 25.5 days) when fitting the Poisson distribution, but in the MC method 85% of the time the inferred number of generation steps will be higher (19 or 20 instead of 17) leading to a mean of 18.2 generation steps (or 27.3 days) Thus explaining the roughly 2 day discrepancy in the time to MRCA between these methods as shown in Table 4. Because the variation in the MC is equivalent to host-to-host variation, the MC 95% CIs are more likely to capture the true time estimates than the Poisson fits.


Early HIV-1 infection tends to be characterized by a viral population with limited sequence diversity in many but not all individuals. A recent study examined sequence data obtained from 102 individuals (Keele et al., 2008), and using the methods developed in this paper characterized the infections as being consistent with a single transmitted strain in 78 cases and multiple strains in 21 cases with 3 being borderline. In order to do this classification the authors used the model of early HIV-1 infection developed in this paper. Even though the biological conclusion that some people are infected by multiple strains and others by one strain only has already been published (Delwart et al., 2001; Derdeyn et al., 2004; Long et al., 2002; Sagar et al., 2004; Zhang et al., 1993), our goal here was to provide, in a self-contained manner, the mathematical underpinnings for the analysis of sequences obtained by single genome amplification methods.

Our goals were not novel in the sense that coalescent and Bayesian inference methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007; Kuhner et al., 1998; Rannala, 1997), when restricted to homogeneous infections, can provide the same results. In fact, comparing the estimated time to the MRCA using BEAST (Drummond and Rambaut, 2007) and using our method on the 53 patients that had a homogeneous infection and no overt enrichment for APOBEC3 mutations from Keele et al. (Keele et al., 2008), we found the results to be similar (correlation coefficient = 0.973, best fitting slope = 1.036, see Figure 11). However, the main difference between these methods and ours is that whereas the former simulate genealogies, under similar assumptions, our simulation models follow the entire population. While for most of the problems studied in the literature simulating genealogies is the correct approach, what one gives up in that approach is simplicity when other processes like recombination and selection need to be modeled. A forward simulation like ours, on the other hand, is usually not feasible because of the problem size, for example for studying HIV-1 at the population level, but is readily applied to the important problem of modeling the evolutionary events in early infection and characterization of viral transmission. Because of the small number of generations since the extreme bottleneck, we can follow the entire population in a simulation as well as provide analytical calculations for various quantities summarizing viral evolution. In this paper, we set up a basic framework in which we neglect recombination and selection, but it is easy to see that with our methods these pose no problems of principle, whereas the analysis involving ancestral recombination graphs (Minichiello and Durbin, 2006), or ancestral selection graphs (Sharma, 1977) or both combined would be prohibitive for this data.

Figure 11
Comparison between BEAST results and Poisson model results

For cases of a single strain infection, our model suggests that the consensus sequence, which corresponds to our best estimate of the MRCA given the Poisson model described here, is either the transmitted strain or a strain one to two mutations away from the transmitted strain (Keele et al., 2008). Identifying the transmitted/founder strain has great biological significance as one can then determine if it has particular properties that allowed it to be transmitted. Even if our identification is only accurate to within one or two base accuracy this will still provide an enormous advance as it is the virus that selectively expands, and we can use this information to search for particular attributes of the virus that allowed it to expand. For example, preliminary examination of putative transmitted viruses suggests that they may generally be more uniformly refractive to neutralization than viruses obtained at the chronic stage of infection, which show heterogeneity in susceptibility to antibodies directed at their receptor binding surfaces (Keele et al., 2008). Also there are subtype specific attributes in terms of variable loop lengths and the number of glycosylation sites (the loops appear to be shorter and less glycosylated in the transmitted viruses of A and C clade infections (Chohan et al., 2005; Derdeyn et al., 2004; Sagar et al., 2006), but not B clade transmitted viruses (Frost et al., 2005)). Data is just beginning to accrue that will enable further exploration of potential viral transmission signatures and phenotypic traits, and the models described in this paper allow one to easily determine whether or not the MRCA or consensus sequence of an early sample is indeed a reasonable estimate of the transmitted virus.

Another important issue that our model addressed is the number of viruses from a single patient that one must sequence to determine if the infection is homogeneous within some confidence band and the number of sequences needed to obtain a reasonable estimate of the time to the MRCA. In order to achieve this, we computed pairwise Hamming distance frequencies of each given sample. Other types of statistical analyses that involve pairwise Hamming distances have been devised elsewhere (Gilbert et al., 2005), however the focus there was to simply compare two different populations, not to infer time estimates. Furthermore, we showed that under the homogeneous infection hypothesis, the HD frequencies follow a star-phylogeny and Poisson distribution. This was also shown in (Slatkin and Hudson, 1991), where the Poisson was used to calculate a population growth rate. Here we used it to test the homogeneity of the sample, and then to get time estimates since the MRCA. The probability of misclassifying an infection as homogeneous is easily calculable from our model. For example, we showed that with a sample of 30 sequences the probability of mistakenly classifying an infection as homogeneous due to not detecting a second minor lineage (less than 10% of the population) is at most 5%. However, now with deep sequencing methods, such as 454 technology (Margulies et al., 2005), 100,000 or more sequences of limited length can be obtained. Such technology may indeed show examples of what we have called homogeneous infections that are not truly homogenous, and could enable us to discern if minor lineages exist at very low frequencies.

The model assumes that the viral population grows exponentially with no selection pressure and no recombination. These are reasonable assumptions to make early in the infection, before the host’s immune response begins and while the infected cell population is still much smaller than the population of target cells. Further, early in infection derived from a single strain viral diversity is low and recombination may have only a small effect — as recombination between identical sequences leaves the sequences unchanged. Later in the infection, the exponential growth phase stops and selection pressure from the host environment gets established, both of which break the star-phylogeny evolution and hence our model is no longer valid.

In our mathematical models we neglect the occurrence of a mutation occurring more than once at the same site since it occurs with probability O(a2ε2NB). For our values of ε = 2.16×10-5 and NB ~ 2,600, this probability remains below 0.006 throughout the first 100 replication cycles. After that, the model assumptions may fail because of immune pressure.

We assumed that the viral diversity evolves under a star-like phylogeny, and that all genomes coalesce at the founder strain. We used this assumption to construct the intersequence HD distribution and compute sample diversity. The probability of not coalescing at the founder strain is O(R0m), where R0 is the basic reproductive ratio and m is the number of generations away from the founder strain. So, for example, if we sample after 10 generation steps, the probability of any two randomly chosen sequences to coalesce one generation step back is ~ R0-9. However, the probability of coalescing 1 or 2 generations earlier than the actual founder strain is not negligible. Even though we assumed that any two sequences coalesce at the founder strain, there is in fact less than a 0.5% chance of being 3 generations away from the actual founder, a 3% chance of being 2 generations away, and a 17% chance of being 1 generation away. For a set of 20 to 50 sequences, the probability of at least one pair coalescing after the actual founder is virtually certain, but the exponential growth of the infected cell population will severely limit the violation of star phylogeny.

It has been estimated that a free virus has a half-life of less than an hour (Ramratnam et al., 1999), whereas an infected cell has an average lifespan of ~2 days (Markowitz et al., 2003; Perelson et al., 1996). Therefore, the virus present in plasma in a given sample is a reflection of the viral genomes contained in HIV-1 producing cells. Thus, as a simplification, we chose in our models to follow the infected cell population and not the virus. At each generation step, we assumed that an infected cell successfully infects R0 other cells. In effect, we are following the integrated HIV-1 provirus rather than viral particles. The experimental data we analyzed was obtained by sequencing virus and not provirus. On average, each cell produces as many as 5 × 104 virions (Chen et al., 2007), which makes the viral population at any given time point much larger than the infected cell population. Sampling with replacement of the provirus pool is therefore a good approximation, introducing a fractional error on the estimated probabilities of O(10-4 NS), where NS is the number of sequences sampled. Because we never sample over 60 sequences, this is a negligible correction.

Finally, we assumed the mutation rate is constant across different lineages and along the entire genome. This assumption is violated if APOBEC3G/F causes G-to-A mutation at an enhanced rate even in sequences that are not overtly hypermutated. Keele et al. (Keele et al., 2008) showed that overall APOBEC enriched samples follow our evolutionary model upon removing the G-to-A mutations with APOBEC3G/F signature. For the point mutation rate, we used the value 2.16×10-5 per base per replication cycle. (Mansky and Temin, 1995) estimated a mutation rate of 3.14×10-5 per base per replication cycle, but their estimate counted all types of mutations, including gaps. Because we exclude gaps throughout our analyses, we removed their contribution from the Mansky calculation and obtained the value 2.16×10-5. The value from Mansky et al. was obtained in vitro from HeLa cells and reproduced in two different cell lines (Mansky, 1996; Mansky and Temin, 1995). More recent in vitro studies found a slightly smaller point mutation rate of 2.2 × 10-5 (including deletions and insertions) in one case (Huang and Wooley, 2005), and a larger rate of 5.4 × 10-5 in another (Gao et al., 2004). The use of a base substitution rate different from 2.16×10-5 would result in larger estimates for the number of days to the MRCA than obtained in Table 4, if the rate were smaller, and shorter if it were larger.

Within both the analytical and the MC model descriptions, we have envisioned two scenarios, one in which all infection events are synchronous, and one in which they are asynchronous and occur at two distinct times. Again, the advantage of the former is its simplicity. It yields a straightforward mathematical description, a population that grows exponentially in time with base R0. However, it is biologically unrealistic that each cell would infect all other cells in at a single time. To explore how the synchronicity affects the model, we broke this assumption in the simplest possible way, e.g. by allowing infections to happen at two times. Even though this second scenario is still biologically unrealistic, nonetheless, it shows that the intersequence HD distributions do not change nature: they are still essentially Poisson distributions, and the increase in diversity is again driven by the number of RT cycles that have occurred. However, whereas in the synchronous model the increase in diversity depends on the base substitution rate ε and the generation time expressed in days, in the asynchronous model it also depends on R0. This poses an additional difficulty in estimating time since MRCA, as studies have shown the wide variability in estimates of R0 across patients (Little et al., 1999; Stafford et al., 2000).

For the six patients that we classified as having been infected by a single viral strain, we estimated time since the MRCA by computing the divergence (i.e. the mean Hamming distance per base pair) from the founder strain. Because we do not know the actual founder strain, we constructed the sample consensus sequence and assumed it to be identical to the MRCA. In doing this, we are implicitly assuming no bottleneck after the virus has entered the host. Should a sublineage prevail over the rest of the population after entering the host, this would cause the consensus to represent the dominant sublineage rather than the actual infecting strain; an estimate to that MRCA that is statistically less than the minimal time from infection based on Fiebig stage would point to such a scenario, and suggest selection. All these caveats factor into our time since MRCA estimates, and would similarly impact the use coalescent methods in such estimates.

We have presented a simple description of an early homogeneous infection. This restricts the validity of our model to early infections, before selection pressure can be detected and while the infected cell population is still small with respect to the target cell population. Furthermore, in some cases there may be evidence of purifying selection even early in the infection, which could explain why for some of the samples we get too early time estimates, suggesting a bottleneck in viral evolution posterior to infection. Using the SNAP program (LANL), we looked at the dS/dN ratios of all 6 low diversity patients. Four out 6 (1054, SUMA, TRJO, WEAU) had ratios greater than 1, suggesting purifying selection (to be published elsewhere). Future developments of our model could allow correction for such a scenario, for example by using a modified base substitution rate.

Despite its limitations, our model successfully distinguishes homogenous from heterogeneous infections, and predicts the time evolution of divergence, diversity, maximum diversity and % sequence identity in single-strain HIV-1-infections.

Materials and Methods

Sequence Data Analysis

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. All subjects gave informed consent, and plasma collections were performed with institutional review board and other regulatory approvals. Blood specimens were generally collected in acid citrate dextrose and plasma separated and stored at -20 to -70°C. To determine how the far into the acute phase the of infection the samples were taken, they were tested for HIV-1 RNA, p24 antigen, quantitative Chiron bDNA 3.0 or Roche Amplicor viral RNA assays; Coulter or Roche p24 Ag assays; Genetic Systems Anti-HIV-1/2 Plus O and Abbott AntiHIV-1/2 3rd Generation EIAs; and Genetic Systems HIV-1 Western Blot Kit. Based on these test results, subjects were staged according to the Fiebig laboratory classification system for acute and early HIV-1 infection (see Table 1).

We employed single genome amplification (SGA) of plasma HIV-1 RNA followed by direct sequence analysis of uncloned env amplicons (Keele et al., 2008) as described in Salazar et al. (Salazar-Gonzalez et al., 2008). The number of env sequences, ranging from 16 to 50, analyzed per subject, their GenBank accession numbers are given in Keele et al. (Keele et al., 2008), and the subjects’ Fiebig stages are shown in Table 1. Sequences were aligned with GeneCutter (LANL) and the alignment was hand-checked. We removed insertions and deletions using the “Gap Strip Tool” (, which removes all columns that contain one or more gaps from a nucleotide alignment.

Mathematical Derivations

Number of mutations

We show that P(mutations>NB[mid ]a)O(a2ε2NB)<<1. Chebyshev’s inequality (Feller, 1957) states that, given a random variable X with mean μ and variance σ2, for any positive integer k one has

Prob([mid ]Xμ[mid ]kσ)1k2.

Let X be the number of mutations. Then μ = aεNB and σ2 = aεNB (1-ε). We pick k such that + μ = NB + 1, from which it follows that:



To see that the probability of coalescence before generation 0 approaches zero exponentially as a and m get large, i.e. the expression in Eq. (6) is O(R0m), recall that




The Expected Maximum HDI

Given a sample of NS sequences drawn from the viral population at generation a derived from a single strain infection, and a positive integer M, we examine how the NS HD0 values should be distributed so that the maximum intersequence Hamming distance is exactly M. Notice that HDI = M if, for any given k0, there is at least one sequence such that HD0 = k and another one such that HD0 = M -k . Now, if k>[left floor]M2[right floor], where [left floor] x [right floor] is the floor functions defined as the maximum integer less than or equal to x, and there is more than one sequence with HD0 = k , then for two such sequences their relative Hamming distance will be HDI = 2k > M , assuming that for a single strain infection the MRCA is s0. Therefore, for the maximum to be exactly M, there can be only one sequence with HD0=k,k>[left floor]M2[right floor], and at least one with HD0 = M -k , and all the others with HD0 ≤ M -k. Using the abbreviated notation Pk(a)=P(HD0=k[mid ]a)=Binom(k;aNB,ε), the probability that the maximum HDI = M is thus Pk(a) multiplied by the probability of having all the other NS-1 sequences with HD0M -k and at least one with HD0 = M -k , yielding:

P(HDmax=M[mid ]a)=Pk(a)[(i=0MkPi(a))NS1(i=0Mk1Pi(a))NS1]

In the special case when M is even and k=M2, we need at least two sequences such that HD0=M2, which is expressed in the following


Putting everything together, we obtain

P(HDmax=M[mid ]a)=k=M+12MNSPk(a)[(i=0MkPi(a))NS1(i=0Mk1Pi(a))NS1]

when M is odd, and

P(HDmax=M[mid ]a)=(i=0M2Pi(a))NS(i=0M21Pi(a))NSNSPM2(a)(i=0M21Pi(a))NS1++k=M2+1MNSPk(a)[(i=0MkPi(a))NS1(i=0Mk1Pi(a))NS1]

when M is even. Here we are assuming that a>0 and that any two sequences coalesce at generation 0. Therefore, the expected maximum HD at generation a is given by

HDmax=M=0NBM×P(HDmax=M[mid ]a).

Asynchronous Model, Eq. (8)

Of all infected cells I(a+1,n+1) at time n+1, α I(a,n) were infected by cells of age a that were “born” at time n, whereas γI(a,n-1) were infected by cells of age a that were born at time n-1. In other words, we have the following relationship


Using the binomial expansion (x+1yx)=(xyx)+(xyx1), one can verify that for [left ceiling]n2[right ceiling]an the expression I(a,n)=I0αa(γα)na(ana) satisfies Eq. (24) as shown below:


Asynchronous Model, Eq. (10)


N(n)=a=[left ceiling]n2[right ceiling]nI(a,n).

By taking the summations on both side of Eq. (24), one obtains that N(n) satisfies N(n+1) = αN(n)+γN(n-1). Let λ1=α2(1+[var phi]), and λ2=α2(1[var phi]), where [var phi]=1+4γα2. Then, for any integers a and b, the function so defined: Na,b(n)=aλ1n+bλ2n is such that for every n it satisfies the identity Na,b(n+1) = αNa,b(n)+γNa,b(n-1). In fact, because λ1 and λ2 are the roots of the second degree equation x2 = αx + γ, it follows that for i=1,2 we have λi2=αλi+γ. In particular, for every n > 1, λin+1=αλin+γλin1, and as a consequence


Therefore, we need to find a and b such that N(0)=I0 and N(1)=αI0, which are the boundary conditions. This yields a linear system with solutions: a=I01+[var phi]2[var phi] and b=I01[var phi]2[var phi]. Hence, substituting these values into the expression of Na,b(n), we get N(n)=I0(α2)n(1+[var phi])n+1(1[var phi])n+12[var phi]. As a side result, notice that if we let Sn=a=[left ceiling]n2[right ceiling]n(ana)(γ)na then it follows that Sn=(1+[var phi])n+1(1[var phi])n+12n+1[var phi]. Finally, notice that for large values of n=(1+[var phi])n+1(1[var phi])n+12[var phi](1+[var phi])n, and therefore N(n)I0(α2(1+[var phi]))n.

Asynchronous Model, Eq. (13)

In order to show that the HD0 probability distribution is such that P(d,n) ≈ Pois(d,λ) + O(ε2) for some suitable parameter λ, we show that up to terms in ε2, all cumulants of this distribution are equal, which is a property that uniquely characterizes a Poisson distribution. The cumulant generating function is given by

KHD(n)(ξ)=LogMHD(n)(ξ)=εNBFnj1ξjj!pj(ε)a=[left ceiling]n2[right ceiling]na(ana)(γα2)na,

where MHD(n)(ξ) is the moment generating function

MHD(n)(ξ)=1Fna=[left ceiling]n2[right ceiling]n(ana)(aNBd)(γα2)na[εeξ+(1ε)]aNB.

In Eq. (26), pj(ε) is a polynomial in ε of degree j-1 such that pj(0) = 1 for all j’s. Let X(n)=a=[left ceiling]n2[right ceiling]na(ana)(γα2)na. One can show that for sufficiently large n, the following holds X(n)Fnn1+[var phi]2[var phi]+1[var phi]2[var phi]2. This can be shown by first noticing that Fn+1Fn1+[var phi]2+O(en) because of the identity

FnFn1=12(1+[var phi])n+1(1[var phi])n+1(1+[var phi])n(1[var phi])n=121+[var phi](1[var phi])(1[var phi]1+[var phi])n1(1[var phi]1+[var phi])n

And since θ=1[var phi]1+[var phi]<1, it follows that FnFn11+[var phi]2+θn(1+[var phi]2)+O(θ2n). Given X(n)=a=[left ceiling]n2[right ceiling]na(ana)(γα2)na, for n >> 1, we have X(n)Fnn1+[var phi]2[var phi]+1[var phi]2[var phi]2. In fact, let x=γα2=[var phi]214. Then

Fn+([var phi]214)(n[var phi]+1[var phi]2)Fn1=[partial differential][partial differential]x[xFn]=[partial differential][partial differential]x[Σ(ana)xna+1]=(n+1)FnΣa(ana)xna

As a consequence, X(n)=nFnn[var phi]214[var phi]Fn1[var phi]214[var phi]2Fn1. Dividing both side by Fn completes the proof.

Given this notation, we can write KHD(n)(ξ)=εNB(j1ξjj!pj(ε))X(n)Fn. As a consequence, for large n, we can approximate the m-th cumulant as

κm(n)εNBpm(ε)(n1+[var phi]2[var phi]+1[var phi]2[var phi]2)(n1+[var phi]2[var phi]+1[var phi]2[var phi]2)εNB+O(ε2NB),

and up to terms in O(ε2NB), all cumulants are equal (the m-th cumulant does not depend on m but only on n), which proves that the distribution is approximately Poisson.

Generation times τs and τa

The times at which cells are infected differ in the synchronous and asynchronous models. However, the growth rate of the infected cell population in the two models should both be chosen equal to the best estimate of the actual growth rate. Setting these growth rates equal will lead to a relationship between τs and τa. Current estimates for the generation time are roughly two days (Markowitz et al., 2003). Thus, lets assume τs=2 days. In order to derive the duration of τa, notice that LogR0 (N(n)) is linear in n for sufficiently large n, where N(n) is the quantity defined in Eq. (25). In particular, this quantity grows linearly with a rate per τa given by

ρa=LogR0N(n+1)N(n)LogR0(α1+[var phi]2).

In the synchronous model, at any time t=a, the total number of newly infected cells is N(n)=I0R0n. Therefore, the synchronous growth rate ρs in the logarithmic scale is 1, i.e.,


If the mean generation time is 2 days, it follows that the average growth rate is 0.5/day. Comparing the rate of increase per day between the two models, we get


where τs=2 days. Hence τa=2ρa2LogR0(α1+[var phi]2). In particular, when α=γ=R02, τa=1.5 days and the total number of newly infected cells at time t=a is given by

N(n)=I0(R04)n(1+[var phi])n+1(1[var phi])n+12[var phi],

where [var phi]=1+8R0. We could also assume given the data that the average generation time in an asynchronous model is 2 days and then use the procedure given above to deduce a corresponding τs. In using a τa of 1.5 days half the infected cells are implicitly assumed to live for 3 days. While the average measured lifetime is 2 days the variance around the mean is not known and hence it is unclear if this value of τa is realistic.

Chi square test for dependent data cells



where Y is the vector of intersequence HD frequencies of length n=max(HD), E denotes the expectation operator E(Y)=(E(Y0),...E(Yn)), and Σ-1 is calculated by inverting the covariance matrix Σi,j = σi,j = E(YiYj)-E(Yi)E(Yj) on its non-singular space using Singular Value Decomposition. This is motivated by the non-independence of our observations, since they are drawn from a convolution of two identical Poisson distributions with mean λ. In particular E(Yk)=c22k1e2λλkk!, where c is a normalization constant and k=0, ... ,n. Since the convolution operation imposes [left floor]n2[right floor]+1 linear constraints on Σ’s columns, the degrees of freedom of the χ2 statistic are df[left floor]n2[right floor]1.

Monte Carlo Simulations

The number of bases that undergo mutation is determined by the mutation rate ε. However, once a base is chosen to be mutated a matrix of base transitions is then used to determine which base it is mutated into. This same method is used in both the synchronous and asynchronous models. The base frequencies and substitution matrix employed in the MC simulations were obtained from a general time reversible (GTR) evolutionary model (Yang, 1994) optimized in conjunction with a maximum likelihood phylogenetic tree. Relative base substitution frequencies are listed in Table 5; these were based on the GTR probability matrix (Korber et al., 2000) and a maximum likelihood estimate of rate variation per site, and indicate the probability of change of one nucleotide to another per unit time, assuming a branch length of 1 (there is extensive change with a branch length of 1). In the MC simulation the equilibrium base frequencies for this model were A=0.373, G=0.241, C=0.193, and T=0.192. The transition matrix was computed from Table 5 by setting the diagonal elements to zero and renormalizing each row.

Table 5
Transition Matrix. Rows are “from”, columns are “to”

The maximum likelihood tree used to derive this matrix in Table 5 included 445 HIV-1 clade B Envelope sequences from the Los Alamos database (LANL) combined with sequences from Keele et al. (Keele et al., 2008) (code described in (Korber et al., 2000)). While this model was used for the simulations presented in this paper, the use of the HIV-1 specific GTR model turned out to have little impact on the simulation results; a neutral model with an even base distribution and equal substitution rates yielded essentially the same results for the simulations (data not shown).


Portions of this work were done under the auspices of the U. S. Department of Energy under contract DE-AC52-06NA25396 and supported in part by the Center for HIV/AIDS Vaccine Immunology (AI67854), the Bill & Melinda Gates Foundation Grand Challenges program (37874), the University of Alabama at Birmingham Center for AIDS Research (AI27767), the University of Rochester Developmental Center for AIDS Research (P30-AI078498) and NIH grants AI083115, AI028433, and RR0655. We thank Marcus Daniels for technical assistance.


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


  • Abrahams MR, Anderson JA, Giorgi EE, Seoighe C, Mlisana K, Ping LH, Athreya GS, Treurnicht FK, Keele BF, Wood N, Salazar-Gonzalez JF, Bhattacharya T, Chu H, Hoffman I, Galvin S, Mapanje C, Kazembe P, Thebus R, Fiscus S, Hide W, Cohen MS, Karim SA, Haynes BF, Shaw GM, Hahn BH, Korber BT, Swanstrom R, Williamson C. Quantitating the multiplicity of infection with human immunodeficiency virus type 1 subtype C reveals a non-poisson distribution of transmitted variants. J Virol. 2009;83:3556–67. [PMC free article] [PubMed]
  • Achaz G, Palmer S, Kearney M, Maldarelli F, Mellors JW, Coffin JM, Wakeley J. A robust measure of HIV-1 population turnover within chronically infected individuals. Mol Biol Evol. 2004;21:1902–12. [PubMed]
  • Bourara K, Liegler TJ, Grant RM. Target cell APOBEC3C can induce limited G-to-A mutation in HIV-1. PLoS Pathog. 2007;3:1477–1485. [PMC free article] [PubMed]
  • Casella G, Berger RL. Statistical inference. Brooks/Cole Pub. Co.; Pacific Grove, Calif.: 1990.
  • Chen HY, Di Mascio M, Perelson AS, Ho DD, Zhang L. Determination of virus burst size in vivo using a single-cycle SIV in rhesus macaques. Proc Natl Acad Sci U S A. 2007;104:19079–84. [PubMed]
  • Chohan B, Lang D, Sagar M, Korber B, Lavreys L, Richardson B, Overbaugh J. Selection for human immunodeficiency virus type 1 envelope glycosylation variants with shorter V1-V2 loop sequences occurs during transmission of certain genetic subtypes and may impact viral RNA levels. J. Virol. 2005;79:6528–6531. [PMC free article] [PubMed]
  • Clark SJ, Saag MS, Decker WD, Campbell-Hill S, Roberson JL, Veldkamp PJ, Kappes JC, Hahn BH, Shaw GM. High titers of cytopathic virus in plasma of patients with symptomatic primary HIV-1 infection. N. Engl. J. Med. 1991;324:954–960. [PubMed]
  • Delwart EL, Magierowska M, Royz M, Foley B, Peddada L, Smith R, Heldebrand C, Conrad A, Busch MP. Homogeneous quasispecies in 16 out of 17 individuals during very early HIV-1 primary infection. AIDS. 2001;15:1–7. [PubMed]
  • Derdeyn CA, Decker JM, Bibollet-Ruche F, Mokili JL, Muldoon M, Denham SA, Heil ML, Kasolo F, Musonda R, Hahn BH, Shaw GM, Korber BT, Allen S, Hunter E. Envelope-constrained neutralization-sensitive HIV-1 after heterosexual transmission. Science. 2004;303:2019–2022. [PubMed]
  • Dickover R, Garratty E, Yusim K, Miller C, Korber B, Bryson Y. Role of maternal autologous neutralizing antibody in selective perinatal transmission of human immunodeficiency virus type 1 escape variants. J. Virol. 2006;80:6525–33. [PMC free article] [PubMed]
  • Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:214. [PMC free article] [PubMed]
  • Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 2005;22:1185–1192. [PubMed]
  • Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4:e88. [PMC free article] [PubMed]
  • Edwards CT, Holmes EC, Wilson DJ, Viscidi RP, Abrams EJ, Phillips RE, Drummond AJ. Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. BMC Evol Biol. 2006;6:28. [PMC free article] [PubMed]
  • Feller W. An introduction to probability theory and its applications. John Wiley and Sons, Inc.; 1957.
  • Fiebig EW, Heldebrant CM, Smith RI, Conrad AJ, Delwart EL, Busch MP. Intermittent low-level viremia in very early primary HIV-1 infection. J Acquir Immune Defic Syndr. 2005;39:133–7. [PubMed]
  • Fiebig EW, Wright DJ, Rawal BD, Garrett PE, Schumacher RT, Peddada L, Heldebrant C, Smith R, Conrad A, Kleinman SH, Busch MP. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS. 2003;17:1871–1879. [PubMed]
  • Frost SD, Liu Y, Pond SL, Chappey C, Wrin T, Petropoulos CJ, Little SJ, Richman DD. Characterization of human immunodeficiency virus type 1 (HIV-1) envelope variation and neutralizing antibody responses during transmission of HIV-1 subtype B. J. Virol. 2005;79:6523–7. [PMC free article] [PubMed]
  • Gaines H, von Sydow M, Pehrson PO, Lundbegh P. Clinical picture of primary HIV infection presenting as a glandular-fever-like illness. BMJ. 1988;297:1363–8. [PMC free article] [PubMed]
  • Gao F, Chen Y, Levy DN, Conway JA, Kepler TB, Hui H. Unselected mutations in the human immunodeficiency virus type 1 genome are mostly nonsynonymous and often deleterious. J Virol. 2004;78:2426–33. [PMC free article] [PubMed]
  • Gilbert PB, Rossini AJ, Shankarappa R. Two-sample tests for comparing intra-individual genetic sequence diversity between populations. Biometrics. 2005;61:106–17. [PubMed]
  • Harris RS, Liddament MT. Retroviral restriction by APOBEC proteins. Nat. Rev. Immunol. 2004;4:868–877. [PubMed]
  • Heffernan JM, Wahl LM. Monte Carlo estimates of natural variation in HIV infection. J Theor Biol. 2005;236:137–53. [PubMed]
  • Huang KJ, Wooley DP. A new cell-based assay for measuring the forward mutation rate of HIV-1. J Virol Methods. 2005;124:95–104. [PubMed]
  • Kamina A, Makuch RW, Zhao H. A stochastic modeling of early HIV-1 population dynamics. Math Biosci. 2001;170:187–98. [PubMed]
  • Keele B, Li H, Learn GH, Hraber PT, Giorgi EE, Grayson T, Sun C, Chen Y, Yeh WW, Letvin NL, Mascola JR, Nabel GJ, Hayens BF, Bhattacharya T, Perelson AS, Korber BT, Hahn BH, Shaw GS. Low Dose Rectal Inoculation of Rhesus Macaques by SIVsmE660 or SIVmac251 Recapitulates Human Mucosal Infection by HIV-1. J Exp Med To appear. 2009 [PMC free article] [PubMed]
  • Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, Sun C, Grayson T, Wang S, Li H, Wei X, Jiang C, Kirchherr JL, Gao F, Anderson JA, Ping LH, Swanstrom R, Tomaras GD, Blattner WA, Goepfert PA, Kilby JM, Saag MS, Delwart EL, Busch MP, Cohen MS, Montefiori DC, Haynes BF, Gaschen B, Athreya GS, Lee HY, Wood N, Seoighe C, Perelson AS, Bhattacharya T, Korber BT, Hahn BH, Shaw GM. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 2008 [PubMed]
  • Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Timing the ancestor of the HIV-1 pandemic strains. Science. 2000;288:1789–96. [PubMed]
  • Kuhner MK, Smith LP. Comparing likelihood and Bayesian coalescent estimation of population parameters. Genetics. 2007;175:155–65. [PubMed]
  • Kuhner MK, Yamato J, Felsenstein J. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics. 1998;149:429–34. [PubMed]
  • LANL Hypermut.
  • LANL Consensus Maker.
  • LANL Highlighter.
  • LANL GeneCutter.
  • LANL Los Alamos Database.
  • Lee HY, Topham DJ, Park SY, Hollenbaugh J, Treanor J, Mosmann TR, Jin X, Ward BM, Miao H, Holden-Wiltse J, Perelson AS, Zand M, Wu H. Simulation and prediction of the adaptive immune response to influenza a virus infection. J Virol. 2009;83:7151–65. [PMC free article] [PubMed]
  • Leigh Brown AJ. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc Natl Acad Sci U S A. 1997;94:1862–5. [PubMed]
  • Lindback S, Karlsson AC, Mittler J, Blaxhult A, Carlsson M, Briheim G, Sonnerborg A, Gaines H, Karolinska Institutet Primary HIV Infection Study Group Viral dynamics in primary HIV-1 infection. AIDS. 2000a;14:2283–2291. [PubMed]
  • Lindback S, Thorstensson R, Karlsson AC, von Sydow M, Flamholc L, Blaxhult A, Sonnerborg A, Biberfeld G, Gaines H, Karolinska Institute Primary HIV Infection Study Group Diagnosis of primary HIV-1 infection and duration of follow-up after HIV exposure. AIDS. 2000b;14:2333–2339. [PubMed]
  • Little SJ, McLean AR, Spina CA, Richman DD, Havlir DV. Viral dynamics of acute HIV-1 infection. J. Exp. Med. 1999;190:841–850. [PMC free article] [PubMed]
  • Long EM, Rainwater SM, Lavreys L, Mandaliya K, Overbaugh J. HIV type 1 variants transmitted to women in Kenya require the CCR5 coreceptor for entry, regardless of the genetic complexity of the infecting virus. AIDS Res. Hum. Retroviruses. 2002;18:567–576. [PubMed]
  • Mansky LM. Forward mutation rate of human immunodeficiency virus type 1 in a T lymphoid cell line. AIDS Res Hum Retroviruses. 1996;12:307–14. [PubMed]
  • Mansky LM, Temin HM. Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virol. 1995;69:5087–5094. [PMC free article] [PubMed]
  • Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80. [PMC free article] [PubMed]
  • Markowitz M, Louie M, Hurley A, Sun E, Di Mascio M, Perelson AS, Ho DD. A novel antiviral intervention results in more accurate assessment of human immunodeficiency virus type 1 replication dynamics and T-cell decay in vivo. J. Virol. 2003;77:5037–5038. [PMC free article] [PubMed]
  • Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79:910–22. [PubMed]
  • Nowak MA, Lloyd AL, Vasquez GM, Wiltrout TA, Wahl LM, Bischofberger N, Williams J, Kinter A, Fauci AS, Hirsch VM, Lifson JD. Viral dynamics of primary viremia and antiretroviral therapy in simian immunodeficiency virus infection. J Virol. 1997;71:7518–25. [PMC free article] [PubMed]
  • Painter SL, Biek R, Holley DC, Poss M. Envelope variants from women recently infected with clade A human immunodeficiency virus type 1 confer distinct phenotypes that are discerned by competition and neutralization experiments. J Virol. 2003;77:8448–61. [PMC free article] [PubMed]
  • Perelson AS, Neumann AU, Markowitz M, Leonard JM, Ho DD. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science. 1996;271:1582–6. [PubMed]
  • Ramratnam B, Bonhoeffer S, Binley J, Hurley A, Zhang L, Mittler JE, Markowitz M, Moore JP, Perelson AS, Ho DD. Rapid production and clearance of HIV-1 and hepatitis C virus assessed by large volume plasma apheresis. Lancet. 1999;354:1782–5. [PubMed]
  • Rannala B. Gene genealogy in a population of variable size. Heredity. 1997;78(Pt 4):417–23. [PubMed]
  • Ribeiro RM, Bonhoeffer S. A stochastic model for primary HIV infection: optimal timing of therapy. AIDS. 1999;13:351–7. [PubMed]
  • Richman DD, Wrin T, Little SJ, Petropoulos CJ. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc Natl Acad Sci U S A. 2003;100:4144–9. [PubMed]
  • Ritola K, Pilcher CD, Fiscus SA, Hoffman NG, Nelson JA, Kitrinos KM, Hicks CB, Eron JJ, Jr., Swanstrom R. Multiple V1/V2 env variants are frequently present during primary infection with human immunodeficiency virus type 1. J. Virol. 2004;78:11208–11218. [PMC free article] [PubMed]
  • Ruskin HJ, Pandey RB, Liu Y. Viral load and stochastic mutation in a Monte Carlo simulation of HIV. Physica A. 2002;293:315–323.
  • Sagar M, Wu X, Lee S, Overbaugh J. Human immunodeficiency virus type 1 V1-V2 envelope loop sequences expand and add glycosylation sites over the course of infection, and these modifications affect antibody neutralization sensitivity. J Virol. 2006;80:9586–98. [PMC free article] [PubMed]
  • Sagar M, Kirkegaard E, Long EM, Celum C, Buchbinder S, Daar ES, Overbaugh J. Human immunodeficiency virus type 1 (HIV-1) diversity at time of infection is not restricted to certain risk groups or specific HIV-1 subtypes. J Virol. 2004;78:7279–83. [PMC free article] [PubMed]
  • Salazar-Gonzalez JF, Bailes E, Pham KT, Salazar MG, Guffey MB, Keele BF, Derdeyn CA, Farmer P, Hunter E, Allen S, Manigart O, Mulenga J, Anderson JA, Swanstrom R, Haynes BF, Athreya GS, Korber BT, Sharp PM, Shaw GM, Hahn BH. Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification by single-genome amplification and sequencing. J Virol. 2008;82:3952–70. [PMC free article] [PubMed]
  • Schacker T, Collier AC, Hughes J, Shea T, Corey L. Clinical and epidemiologic features of primary HIV infection. Ann. Intern. Med. 1996;125:257–264. [PubMed]
  • Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL, Mullins JI. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 1999;73:10489–502. [PMC free article] [PubMed]
  • Sharma JC. Inheritance of defective extremities. Indian J Med Res. 1977;66:120–6. [PubMed]
  • Simon V, Zennou V, Murray D, Huang Y, Ho DD, Bieniasz PD. Natural variation in Vif: differential impact on APOBEC3G/3F and a potential role in HIV-1 diversification. PLoS Pathog. 2005;1:e6. [PMC free article] [PubMed]
  • Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991;129:555–62. [PubMed]
  • Stafford MA, Corey L, Cao Y, Daar ES, Ho DD, Perelson AS. Modeling plasma virus concentration during primary HIV infection. J. Theor. Biol. 2000;203:285–301. [PubMed]
  • Tan WY, Wu H. Stochastic modeling of the dynamics of CD4+ T-cell infection by HIV and some Monte Carlo studies. Math Biosci. 1998;147:173–205. [PubMed]
  • Tuckwell HC, Le Corfec E. A stochastic model for early HIV-1 population dynamics. J Theor Biol. 1998;195:451–63. [PubMed]
  • Vernazza PL, Eron JJ, Fiscus SA, Cohen MS. Sexual transmission of HIV: infectiousness and prevention. AIDS. 1999;13:155–66. [PubMed]
  • Wakeley J. Coalescent Theory, an Introduction. Roberts & Co.; 2008.
  • Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong PD, Shaw GM. Antibody neutralization and escape by HIV-1. Nature. 2003;422:307–312. [PubMed]
  • Wolinsky SM, Wike CM, Korber BT, Hutto C, Parks WP, Rosenblum LL, Kunstman KJ, Furtado MR, Munoz JL. Selective transmission of human immunodeficiency virus type-1 variants from mothers to infants. Science. 1992;255:1134–7. [PubMed]
  • Wolinsky SM, Korber BT, Neumann AU, Daniels M, Kunstman KJ, Whetsell AJ, Furtado MR, Cao Y, Ho DD, Safrit JT. Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science. 1996;272:537–42. [PubMed]
  • Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol. 1994;39:105–11. [PubMed]
  • Zhang LQ, MacKenzie P, Cleland A, Holmes EC, Brown AJ, Simmonds P. Selection for specific sequences in the external envelope protein of human immunodeficiency virus type 1 upon primary infection. J. Virol. 1993;67:3345–56. [PMC free article] [PubMed]