Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2760689

Formats

Article sections

Authors

Related links

J Theor Biol. Author manuscript; available in PMC 2010 November 21.

Published in final edited form as:

Published online 2009 August 4. doi: 10.1016/j.jtbi.2009.07.038

PMCID: PMC2760689

NIHMSID: NIHMS142119

Ha Youn Lee,^{1,}^{2,}^{*} Elena E. Giorgi,^{1,}^{3,}^{*} Brandon F. Keele,^{4} Brian Gaschen,^{1} Gayathri S. Athreya,^{1} Jesus F. Salazar-Gonzalez,^{4} Kimmy T. Pham,^{4} Paul A. Goepfert,^{4} J. Michael Kilby,^{4} Michael S. Saag,^{4} Eric L. Delwart,^{5} Michael P. Busch,^{5} Beatrice H. Hahn,^{4} George M. Shaw,^{4} Bette T. Korber,^{1,}^{6} Tanmoy Bhattacharya,^{1,}^{6} and Alan S. Perelson^{1,}^{§}

The publisher's final edited version of this article is available at J Theor Biol

See other articles in PMC that cite the published article.

We describe a mathematical model and Monte-Carlo (MC) simulation of viral evolution during acute infection. We consider both synchronous and asynchronous processes of viral infection of new target cells. The model enables an assessment of the expected sequence diversity in new HIV-1 infections originating from a single transmitted viral strain, estimation of the most recent common ancestor (MRCA) of the transmitted viral lineage, and estimation of the time to coalesce back to the MRCA. We also calculate the probability of the MRCA being the transmitted virus or an evolved variant. Excluding insertions and deletions, we assume HIV-1 evolves by base substitution without selection pressure during the earliest phase of HIV-1 infection prior to the immune response. Unlike phylogenetic methods that follow a lineage backwards to coalescence, we compare the observed data to a model of the diversification of a viral population forward in time. To illustrate the application of these methods, we provide detailed comparisons of the model and simulations results to 306 envelope sequences obtained from 8 newly infected subjects at a single time point. The data from 6/8 patients were in good agreement with model predictions, and hence compatible with a single-strain infection evolving under no selection pressure. The diversity of the samples from the other two patients was too great to be explained by the model, suggesting multiple HIV-1-strains were transmitted. The model can also be applied to longitudinal patient data to estimate within-host viral evolutionary parameters.

The HIV-1 population in a chronically infected individual is subject to continuous immune selection (Richman et al., 2003; Wei et al., 2003), and evolves to become a complex set of related viruses, often referred to as a quasispecies, through the course of an infection (Lee et al., 2009; Shankarappa et al., 1999; Wolinsky et al., 1996). A reduction in viral complexity at transmission was originally noted in the context of mother to infant transmission (Wolinsky et al., 1992), and has been extensively studied in recent years (Derdeyn et al., 2004; Dickover et al., 2006; Edwards et al., 2006; Painter et al., 2003). During sexual transmission of HIV-1, a genetic bottleneck usually occurs since a limited number of viral strains are transmitted from the complex quasispecies typically found in a donor (Delwart et al., 2001; Derdeyn et al., 2004; Zhang et al., 1993), although other studies have found multiple transmitted variants at a relatively high frequency (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Infection with multiple genetic variants has been associated with genital tract ulcers and use of hormonal contraceptives (Sagar et al., 2004). Recent studies have identified patients in the earliest weeks of infection, many prior to the selective pressure imposed by the newly infected host’s initial immune response (Abrahams et al., 2009; Keele et al., 2008; Salazar-Gonzalez et al., 2008). Sequence data from the HIV-1 *env* gene collected during acute HIV-1 infection in 102 subjects by Keele et al. (2008) show a wide range of diversity, with the average number of bases differing between pairs of sequences from the same patient varying between 0.01% and 2.18%, suggesting that both single and multiple viral strain transmissions may have occurred in this cohort.

In this paper, we develop simple models of HIV-1 evolution early in infection with the aim of quantitatively assessing whether infections were established by single or multiple viral strains. Further, in the case of single strain, i.e., homogeneous, infections we aim to identify the most-likely initiating strain or a close descendent that gave rise to the observed lineage.

We derive analytical results and approximate formulas. We used Monte Carlo (MC) simulations to capture the randomness in early HIV-1 evolution and to compare with our analytical results. The analytical results were derived from idealized models whereas the MC simulations allowed for more accurate models that incorporate unequal base composition and an evolutionary based substitution matrix that defines the frequency at which base *i* is replaced by base *j* during a mutation event. Previously, Monte Carlo methods have been used to study the within-host dynamics of HIV-1 infection (Heffernan and Wahl, 2005; Kamina et al., 2001; Ribeiro and Bonhoeffer, 1999; Ruskin, 2002; Tan and Wu, 1998; Tuckwell and Le Corfec, 1998), but here our focus is on sequence evolution and not viral and T cell dynamics as in these earlier works. Keele and colleagues applied a variant of this model to the analysis of 102 B-clade infected patients (Keele et al., 2008) and of 18 experimentally SIV-infected rhesus macaques (Keele, 2009). Abrahams et al. (2009) have used the same techniques to analyze a set of 69 C-clade infected subjects.

In this study we provide a complete mathematical description of the model, and explore the implications of varying the baseline assumptions, the input parameters, as well as purely analytical versus computational outcomes. We use just 8 of the patients described in Keele *et al*. (Keele et al., 2008) to illustrate nuances in the application of our model. These 8 subjects were chosen to be representative of the full set of 102 patients with 80% being characterized as having homogenous infection and 20% heterogeneous infections.

The main problem we focus on here is developing a systematic, reasoned way to determine whether a single strain or multiple stains of HIV-1 infected an individual. This is not straight-forward; even if an individual is infected by a single strain, with time this transmitted virus will diversify. Thus, given a set of sequences one needs to compute how much diversity would be expected by a given time from infection. For sequences collected early after infection, if the diversity is much greater than what is expected from a homogeneous infection, then multiple strain infection is a likely explanation. Other signatures of multiple strain infection might also be present, e.g., the sequences cluster into groups with very different most recent common ancestors. Our model also provides an estimate of the time from the origin of the most recent common ancestor (MRCA) of the sampled variants. If the sampled variants are representative of a lineage that was initiated by the infecting strain, the time to the MRCA should correspond to the time of infection; if the lineage arose in the donor, the time to the MRCA would be longer than the infection; if the lineage arose post-infection in the newly infected individual as a consequence of selection, the time estimated to the MRCA would be less than the time from infection.

To characterize HIV-1 evolution in samples in the earliest weeks of infection, we have modeled the events that occur between virus transmission and peak replication, approximately 21-35 days later. Our model assumes random drift prior to the initial immune response and exponential viral expansion prior to peak viremia. We ignored the effects of selection—based on the premise that the sequences we analyze were obtained sufficiently early in infection that immune responses would not yet provide substantial selective pressure. Although other sources of selection may be present, these also are ignored in our model. The effects of recombination were not modeled either. Comparing the expectations based on the mathematical model and the simulations to observed acute infection sequence data enables us to explore the validity of these assumptions in real infections. We show that under the hypothesis of exponential growth in the infected cell population the frequency distribution of the genetic distances between pairs of HIV-1 sequences follow an approximate Poisson distribution and star-phylogeny topology. This was first shown by Slatkin and Hudson in the context of the evolution of mitochondrial DNA (Slatkin and Hudson, 1991).

Since the precise time of initial HIV-1 infection cannot usually be known with certainty, the status of acute/early subjects can be classified using the “Fiebig” staging system (Table 1), which is based on an orderly appearance of viral RNA, antigen and antibodies in plasma during early infection (Fiebig et al., 2005; Fiebig et al., 2003). Prior to stage I, where plasma viral RNA first becomes detectable, is the eclipse phase. The length of this period can be roughly estimated based on clinical histories from a series of studies (Clark et al., 1991; Gaines et al., 1988; Lindback et al., 2000a; Lindback et al., 2000b; Schacker et al., 1996) and suggests an average length of about 10 days (range 7-21 days, see Table 1). For each patient in our dataset the Fiebig stage is known. This provides a rough estimate of the time since infection, which can be compared with the estimated time since the MRCA computed with our models from the sampled sequences.

Fiebig stage classification for sub-stages of HIV-1 primary infection, and the average and cumulative duration of each phase. The duration of each stage is presented in Keele *et al*. (Keele et al., 2008). We obtained 95% CIs, shown in parenthesis, by recalculating **...**

Coalescent theory can also be used to find common ancestors and estimate the time to coalescence. Further, results have been obtained using coalescent theory for the type of linear birth death process that underlies the model described in this paper (Kuhner et al., 1998; Rannala, 1997). When the infection is indeed homogeneous and it meets our model assumptions (exponential viral growth, no selection, etc.), we are under a very particular evolutionary scenario that allows us to use more direct and computationally efficient methods to derive the approximate time to the MRCA. Even though appropriate, Bayesian estimation methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007), as the ones provided by the software BEAST by (Drummond and Rambaut, 2007), are computationally intensive even when the number of sequences is small as they perform a whole suite of additional tests that are redundant for our particular evolutionary scenario. The methods presented below are computationally efficient, and our model results can be obtained in minutes for the entire dataset, while in our hands BEAST runs for a single patient took over 3 hours. Further, as we shall show, the estimated time to the MRCA using BEAST and using our method are very similar. Thus, we believe the methods we present below enable a rapid exploration of the implications of a forward evolution model, provide consistent results with a coalescence model, and allow us to compare the model to the data to infer biologically interesting information from the sample.

HIV-1 can be transmitted from one individual to another either through the transmission of virus particles or infected cells. Since transmission of virus will quickly generate infected cells, we describe the transmission as if it occurred through infected cells**.**

In the case of sexually transmitted HIV-1, most often only a single viral sequence initiates the new infection, or only a single sequence grows out to yield a detectable level of viremia {Delwart, 2001 #61; Derdeyn, 2004 #20; Zhang, 1993 #59; Abrahams, 2009 #185}. Thus, we shall model the case in which infection is initiated by a single HIV-1 sequence. The simplest implementation of this hypothesis is to assume that a single cell, carrying this sequence as a provirus, starts the infection. We shall also examine the case in which multiple cells carrying identical sequences start the infection. When the predictions of these models fail to explain the extent of sequence heterogeneity observed, this is evidence that either multiple sequences have been transmitted, or that immune or other selective pressures are driving the observed level of diversification (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999).

We assume that the initial infected cell produces virus; most are cleared, but some successfully infect a new generation of cells. The number of secondary infections caused by one infected cell placed in the population of cells of an uninfected individual and fully susceptible to infection is called the basic reproductive ratio, *R _{0}*. While data on early viral kinetics in humans is limited, the available data in humans infected with HIV-1 and in monkeys infected with SIV and SHIV show that the virus grows exponentially until a viral load peak is attained a few weeks after infection (Little et al., 1999; Nowak et al., 1997; Stafford et al., 2000). Following the peak, viral levels decline and establish a set-point. At the set-point each infected cell, on average, successfully infects one other cell during its lifetime. If it infected more than one other cell viral levels would increase, whereas if it infected less than one cell viral levels would fall. Thus, once the set-point is established we can assume that the number of infected cells remains constant.

In our model we assume a homogeneous infection in which the virus grows exponentially with no selection pressure, no recombination, no occurrence of back mutations and a constant mutation rate across positions and across lineages. We also assume the virus grows with a fixed generation length and that each infected cell produces the same number of progeny. Clearly, the approximation of exponential growth must break down as target cells become limited and the number of cells infected per generation must decrease from *R _{0}*, stabilizing at 1 when the viral set point is established. However, we present evidence indicating that what is important in determining viral diversity is the number of reverse transcription events that have occurred along a lineage, i.e., the number of generations that separate the initial infection from the time at which sequences were obtained, and not the number of cells infected at each generation. In our first, simplest model, we also assume that the infection is synchronous, and thus can be characterized by discrete generations, with the virus infecting exactly

Let all genomes be of same length *N _{B}*, let

$$P(\text{mutations}=n)=\left(\begin{array}{c}\hfill {N}_{B}\hfill \\ \hfill n\hfill \end{array}\right){\epsilon}^{n}{(1-\epsilon )}^{{N}_{B}-n}\text{Binom}(n;{N}_{B},\epsilon ).$$

(1)

After the second replication cycle, the total number of mutations in the new generation of infected cells will follow the probability distribution given by

$$P(\text{mutations}=n\mathit{gen}.=2)=\sum _{k=0}^{n}\text{Binom}(n-k;{N}_{B},\epsilon )\text{Binom}(k;{N}_{B},\epsilon )=\text{Binom}(n;2{N}_{B},\epsilon ).$$

In other words, *n* mutations in the second generation are the result of all possible combinations of *k* mutations occurring in the first replication, and *n-k* occurring in the second, for all integers *k* between 0 and *n*. Mathematically, this is given by the convolution of two Binomial distributions with the same probability parameter (ε, in our case), which is in turn a Binomial distribution (Casella and Berger, 1990). One can see that after *a* replication cycles,

$$P(\text{mutations}=n\mathit{gen}.=a)=\text{Binom}(n;a{N}_{B},\epsilon ).$$

Let *HD _{0}* denote the Hamming distance, i.e., the number of base differences from the infecting strain

$$P={(1-\epsilon )}^{a}+a\epsilon {(1-\epsilon )}^{a-1}.$$

(2)

Therefore, the probability that at least one site across the entire genome has mutated more than once is *Q* = 1-*P*^{NB} , which is approximately $\frac{1}{2}{N}_{B}a(a-1){\epsilon}^{2}$ when *ε* is small.

For the HIV-1 base substitution rate ε, we use the value of 2.16 × 10^{-5} base substitutions per replication cycle. The base substitution rate we use is derived from the results of (Mansky and Temin, 1995), where after a single round of HIV-1 replication they found an overall mutation rate of 3.4 × 10^{-5}, but this included insertions and deletions. Restricting their results to only base substitutions, we calculated from their data a base substitution rate of 2.16 × 10^{-5} per base per replication. The sequences that we analyze later are envelope gene sequences with N_{B} ~2600 bases. Substituting these values into *Q*, we find that the probability of two or more mutations at the same site increases with the number of replication cycles, yet even for *a*=*50* one finds *Q* < 0.0015. Hence we expect that we can ignore back mutations when we analyze patient sequence data obtained early in infection.

Under this assumption, *HD _{0}* coincides with the number of mutations and hence it follows the same Binomial distribution:

$$P(H{D}_{0}=da)=\text{Binom}(d;a{N}_{B},\epsilon )+O\left({a}^{2}{\epsilon}^{2}{N}_{B}\right)\approx \left(\begin{array}{c}\hfill a{N}_{B}\hfill \\ \hfill d\hfill \end{array}\right){\epsilon}^{d}{(1-\epsilon )}^{a{N}_{B}-d}$$

(3)

Notice that this formula assigns a nonzero probability to the Hamming distance being larger than the total number of bases *N _{B}*, and this is an artifact of our approximations. In fact, using Chebyshev’s inequality (Feller, 1957), one can show (see Materials and Methods) that $P(\text{mutations}>{N}_{B}a)\le O\left(\frac{{a}^{2}{\epsilon}^{2}}{{N}_{B}}\right)1$, and thus indeed, within our approximation,

In the synchronous model, we assume that each cell infects *R*_{0} other cells at the end of its life cycle. At generation *a* from the infection event, there will be ${R}_{0}^{a}$ infected cells, each one carrying one HIV-1 genome. Further, at generation *a*

$$P(H{D}_{0}=d,a)=\left(\begin{array}{c}\hfill a{N}_{B}\hfill \\ \hfill d\hfill \end{array}\right){\epsilon}^{d}{(1-\epsilon )}^{a{N}_{B}-d}.$$

(4)

In order to analyze single time point data, it is useful to compute the intersequence HD distribution rather than HD_{0} (since for the latter we would need to assume that we know the infecting sequence *s _{0}*). In other words, rather than comparing any given sequence to the founder sequence, we take all possible sequence pairs in the sample and compute their relative HD. Given two sequences,

$$\begin{array}{c}\hfill P(HD[{s}_{1},{s}_{2}]=da)=\sum _{k=0}^{d}P(H{D}_{0}=ka)P(H{D}_{0}=d-ka)=\hfill & =\sum _{k=0}^{d}\text{Binom}(k;a{N}_{B},\epsilon )\text{Binom}(d-k;a{N}_{B},\epsilon )=\text{Binom}(d;2a{N}_{B},\epsilon )\hfill \hfill \end{array}$$

(5)

The assumption of independent evolution of sequences *s _{1}* and

$$P(MRCA[{s}_{1},{s}_{2}]\le a-m{s}_{1},{s}_{2}x)=\frac{{R}_{0}^{a-m}-1}{{R}_{0}^{a}-1}.$$

(6)

As shown in Material and Methods, this probability approaches zero rapidly as *a* and *m* get large.

We can use Eq. (6) to determine the probability that two sequences from a given individual coalesce at a time point other than at the transmitted sequence. For example, for samples obtained a few weeks into the infection and under no selective pressure, e.g. *a*=5 and *R _{0}*=6 (see below) the probability of coalescence later than generation 0, from Eq. (6) is 0.167 for

Samples that diverge from a star phylogeny often show patterns of shared mutations across sequences. The above calculation shows that coalescence not at the founder strain is most likely one generation away from the actual founder. Therefore, a mutation observed very early in the infection, could possibly carry on and develop as a second lineage. Also, we can not rule out that the founders of these two lineages were simply transmitted from the donor and hence coalesced in the donor. For infections transmitted during the acute phase before much diversification occurred in the donor this may even be likely. Lastly, we observe that since the base substitution rate ε is so small it is very likely that the sequences at a=0 and a=1 are the same. In fact, with probability (1-*ε*)^{RoNB} all *R*_{0} proviruses at generation 1 are identical to the founder; e.g., with *R _{0}*=

For ease of notation, let the intersequence Hamming distance distribution be denoted *HD _{I}* . Thus,

$$P(H{D}_{I}=da)=P(HD[{s}_{1},{s}_{2}]=da)=\text{Binom}(d;2{N}_{B}a,\epsilon ).$$

(7)

We also define some basic quantities based on the calculated Hamming distance distribution from the initial strain and the intersequence Hamming distance distribution between all possible sequence pairs. The divergence at generation *a* is defined as the average Hamming distance per base from the initial founder strain. From Eq. (4), this can be estimated as the mean of the binomial distribution divided by the number of bases, and is *εa* . The diversity at generation *a* is the average intersequence Hamming distance per base between sequence pairs and from Eq. (7), diversity = 2*εa*. The variance of the intersequence HD distribution at generation *a* divided by the number of bases is *2εa(1-ε)*. We also introduce the % sequence identity, defined as the proportion of sequences identical to the sequence, *s _{0,}* of the infecting strain. As the virus evolves, the population diversifies from the founder strain and the proportion of the total population that is identical to the MRCA decays exponentially. By setting

Finally, we wish to compute the expected maximum *HD _{I}*. This is motivated by the observation that more extensive diversity is seen in some transmission cases, presumably the result of infections established by two or more distinct HIV-1 strains (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Thus, a strategy for systematic classification of homogeneous, single strain infections versus heterogeneous infections is needed. To address this, we ask what is the maximum diversity that one would expect in a single strain infection within a time frame compatible with the subject’s Fiebig stage at the time of sampling. The maximum diversity expected depends on the sample size: the larger the sample, the higher the expected maximum, as outliers become more likely to be sampled.,The expected maximum

The mathematical model derived above assumes that all mutations have equal probability. However, in HIV-1 the four bases are not equally frequent and the rates of base substitution are not all equal. Further, we neglected more than one mutation occurring at the same site, although this potentially can occur. To examine the influence of these biological details on the predictions of HIV-1 sequence evolution, and to capture stochastic effects, we developed a simple Monte Carlo simulation and compared its predictions with those of the mathematical model.

The simulation starts at “generation 0” with one cell infected with a single HIV-1 sequence *N _{B}* bases long. The initial sequence is generated randomly with base frequencies of 0.19 (T and C), 0.37 (A) and 0.24 (G), based on typical frequencies measured in the full-length envelope gene (see Materials and Methods). We assume that the initially infected cell infects

We implemented the MC code so that the user has the option of running either one or multiple iterations. When multiple iterations are done, the HD frequencies computed at each generation step are averaged over all iterations, thus retrieving the mean field description described by the analytical method. On the other hand, when not averaging over all iterations, one captures the stochastic effects due to rare mutations, e.g. the run-to-run variance, which, in real life, corresponds to the host-to-host variation. This is completely neglected in the analytical method, which is a mean field description.

In the synchronous model the population of infected cells grows exponentially as ${R}_{0}^{a}$. Once the number of infected cells reaches *10 ^{4}*, which is close to some estimates of the effective population size

Using the MC simulation, we computed divergence, diversity, maximum HD and % sequence identity (defined in the previous section) as functions of time, and confirmed that their dependence on the number of generation steps matched the dependency found analytically. Figure 2 shows this comparison as well as 95% confidence intervals computed from 1,000 MC simulations using a typical number of sampled sequences, *N _{S}*=

The synchronous infection model is a simplification. In reality, cell infections occur at random times by the viruses released from an infected cell. Viral production starts on average about 24 hours after a cell is initially infected (Perelson et al., 1996), and most likely continues until cell death. In order to understand the dependence of our results on the synchronous approximation, we devised a second model in which infections are asynchronous. Again we will assume each cell infects *R _{0}* others during its lifetime. While each of the R

Let *τ _{a}* and 2

Denote by *I(a,t)* the number of infected cells of age *a* at time *t*. Here, by age, we mean the age of the infecting genome, measured in terms of the number of times the viral genome has undergone reverse transcription, rather than the cell’s physical age. Thus, *a* is equivalent to the generation variable we used in the synchronous model. As a consequence, at any given time *t,* under the synchronous model all cells belong to the same age group, whereas in the asynchronous model, the infecting genomes have undergone a minimum of $\frac{t}{2{\tau}_{a}}$ replication cycles, and a maximum of $\frac{t}{{\tau}_{a}}$ (see Figure 3), where and are the ceiling and floor functions defined as *x* = max {*n* *Z* | *n* ≤ *x*} and *x* = min {*n* *Z* | *n* ≤ *x*}, where *Z* is the set of integers.

Let *I _{0}* denote the initial number of infected cells and let $n=\frac{t}{{\tau}_{a}}$, i.e.,

$$I(a,n)={I}_{0}{\alpha}^{a}{\left(\frac{\gamma}{\alpha}\right)}^{n-a}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}a=\frac{n}{2},\dots ,n\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}0\phantom{\rule{thickmathspace}{0ex}}\text{otherwise}.$$

(8)

As illustrated in Figure 3, at any given time *n*, the HIV-1 strains in an individual have undergone a minimum of $\frac{n}{2}$ and a maximum of *n* replication cycles. Then, as a function of time, the random variable *HD _{0}* follows the probability distribution

$$P(H{D}_{0}=dn)=\frac{\sum _{a=\frac{n}{2}n\left(\begin{array}{c}\hfill a{N}_{B}\hfill \\ \hfill d\hfill \end{array}\right){\epsilon}^{d}{(1-\epsilon )}^{a{N}_{B}-d}I(a,n)}^{\sum _{a=\frac{n}{2}nI(a,n)}^{}}}{}$$

(9)

In Materials and Methods we show the following relationship

$$\sum _{a=\frac{n}{2}nI(a,n)=\frac{{I}_{0}}{2}}^{}$$

(10)

where $\mathrm{=\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}}$ and ${F}_{n}=\frac{{(1+)n+1}^{-}}{}$. By substituting Eq. (10) into Eq. (9) we obtain

$$P(H{D}_{0}=dn)=\frac{1}{{F}_{n}}\sum _{a=\frac{n}{2}n\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)\left(\begin{array}{c}\hfill a{N}_{B}\hfill \\ \hfill d\hfill \end{array}\right){\left(\frac{\gamma}{\alpha}\right)}^{n-a}{\epsilon}^{d}{(1-\epsilon )}^{a{N}_{B}-d},}^{}$$

(11)

with mean *μ _{n}* =

$${\lambda}_{n}=n\frac{1+2+\frac{1-{2}^{}.}{}}{}$$

(12)

Notice, as one would expect, that in neither the synchronous nor the asynchronous model, does the probability distribution depend on the initial number of infected cells, *I _{0}*.

We show in Materials and Methods that up to terms in *O*(*ε*^{2}) *P*(*HD*_{0} =*d* | *n)* is a Poisson distribution with mean *μ _{n}* =

$$P(H{D}_{0}=dn)\approx Pois(d,{\lambda}_{n}\epsilon {N}_{B})+O\left({\epsilon}^{2}\right).$$

(13)

As a consequence, given that the intersequence HD distribution, *HD _{I}*, is to a good approximation the self-convolution of the

In the synchronous model, the intersequence Hamming distance, Eq. (5), follows a binomial distribution, which is also well approximated by a Poisson distribution with equal mean since *2aN _{B}* is large,

Let *N(n)* be the total number of newly infected cells at time *n*. For *I _{0}*=

We also devised a partially randomized model, in which the α, instead of being a fixed parameter, is a random number uniformly distributed with mean *R*_{0}/2, and then γ is always chosen to be *R*_{0}-*α*. This introduces an additional variability the effect of which is inversely proportional to the total population size. Therefore, large fluctuations from the above description will occur within the first 2-3 generation steps, but will be negligible after that, thus recovering the above mean field description.

An important difference between the asynchronous and synchronous models is in the dependence on *R _{0}*. In both models, divergence and diversity depend linearly on the number of generation steps. In the synchronous model, the slope of the diversity is 2ε does not depend on

$$\frac{{\lambda}_{n}}{n}=\frac{1+2,}{}$$

(14)

with $\mathrm{=\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}}$ and *γ*+*α*=*R _{0}*. One sees that in the asynchronous model there is a dependence on

$$\epsilon \left(1+\frac{1}{\sqrt{1+4\frac{1-\eta}{{R}_{0}{\eta}^{2}}}}\right).$$

(15)

Notice that the when *η*=1, we retrieve as a special case the synchronous model, and indeed Eq. (15) yields exactly 2ε. However, when *η*≠*1*, the above expression is strictly smaller than 2ε, and approaches 2ε as *R _{0}* increases. Figure 4 illustrates the dependence of the slope of the diversity as a function of the number of generation steps on

Table 2 summarizes the above formulas for both the synchronous and the asynchronous models. In both models, divergence, diversity, and variance increase linearly as a function of time. Diversity and variance are expressed by the same mathematical formula, which is a consequence of the fact that the intersequence HDs follow a Poisson distribution. Divergence, diversity, and *HD _{I}* variance do not depend on the number of bases per sequence. In contrast, the maximum HD and the % sequence identity depend on the sequence length

We keep the basic framework used for the synchronous MC simulation, and choose parameters *R _{0}*=

At times *τ _{a}, 2τ_{a}*, 3

We examined the dependency of our results on the values chosen for the parameters *τ, R _{0}, I*

When patients are sampled, only a limited number of sequences are obtained. The Monte Carlo simulations provide a means of assessing the variability introduced by limited sequencing and can be used to choose the number of samples to be sequenced based on the trade-offs between cost and accuracy. Figure 7 shows the variation in the HD frequency distribution as the sample size *N _{S}* varies. Clearly, the larger the sample size, the smaller the variation.

We also asked what is the probability of erroneously classifying an infection as homogeneous due to poor sampling. In other words, what is the chance of missing a second lineage that represents a small percentage of the entire population when *N _{S}* sequences are sampled. We assumed a range of hypothetical low population prevalence, from 0.1% to 20%, and computed the probability that, out of a sample size

$$P=\text{Binom}(0;{N}_{S},p)$$

(16)

Thus, for example, for *N _{S}* ≥30, we can be 95% confident that any missed variant would comprise less than 10% of the total viral population (see Fig. pS9 in (Keele et al., 2008)). The number of sequences sampled from our 8 patients ranged from 16 to 50. When

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. Laboratory testing showed that 7 subjects were in Fiebig stage II, both viral RNA and p24 antigen were detected, and one subject was in Fiebig stage III, HIV-1 IgM in addition to viral RNA and p24 was detected (see Table 1). Detailed information of each subject is summarized in Table 3. Furthermore, samples were analyzed using the Hypermut tool (LANL) to test for overall enrichment of G-to-A hypermutation patterns with APOBEC3G/F signatures (Bourara et al., 2007; Harris and Liddament, 2004; Simon et al., 2005). Because these mutational patterns occur at a faster rate than ordinary mutations, they violate one of our model assumptions (constant mutation rate across sites and lineages). Only one sample, SUMA, presented mutations (three) that were found to carry the APOBEC3G/F signature. When all mutations were kept, the sample did not fit the Poisson model (χ^{2} goodness of fit p-value = 0.04, where a p<0.05 represents significant departure from the Poisson distribution), whereas a good fit was restored upon removing the APOBEC signatures (χ^{2} goodness of fit p-value = 0.243) (Keele et al., 2008).

Patient characteristics at time of sampling, number of envelope sequences obtained and their length, and GenBank accession numbers

For each of the 8 samples we computed the *HD _{0}* and

Analysis of 8 HIV-1 acute sequence samples using the asynchronous infection model. For each subject, we calculated divergence, diversity and maximum intersequence HD. The parameter λ of the best fitting Poisson distribution is given for the 6 **...**

The eight early infection samples presented two distinct behaviors: either overall low diversity (max HD below 0.5%) and unimodal intersequence HD distribution, or overall high diversity (max HD above 2.5%; patients 1051 and BORI) and multimodal intersequence HD distribution (see Table 4 and Figure 8). We hypothesized that the underlying difference between the two patterns was whether a single HIV-1 strain entered the host or if multiple distinct strains entered. In order to exclude single strain infection in the high diversity samples, we computed the maximum HD one could reasonably expect to achieve at a given point in the infection and compared this to predictions based on the Fiebig stage. To do this, we used the maximum HD probability distribution computed in the asynchronous model, Eq. (22) and (23), with *P* taken from equation (13), using the patient specific base length *N _{B}* and sample size

An alternative way for distinguishing between homogeneous and heterogeneous infections can be obtained by looking at both the mean diversity and the variance of each sample. According to our analytical results, when the infection is truly homogeneous, the measured average diversity and its variance should be approximately equal, which is one of the basic properties of a Poisson distribution. However, since measured mean HD and variance are affected by sample size and stochastic effects in early evolution, they may differ. Thus, for a homogeneous infection we can require that the mean (diversity) and variance be located between the upper and lower 95% confidence limits computed from MC simulations based on the number of sequences sampled per patient (Figure 9). All of the six low diversity subjects (WEAU, WITO, TRJO, SUMA, 1054, 1056) satisfy this condition, while subjects 1051 and BORI violate it. Thus we classify subjects 1051 and BORI as ”heterogeneous” infections (i.e., infections initiated by two or more founder strains) and the other 6 subjects as “homogeneous” infections. The limitation of this classification is the assumption that the sequence population diversifies without any selection, e.g. under neutral evolution. This assumption does not hold for Fiebig stage III or higher, and this classification becomes problematic. The area outside the cone in Figure 9 indicates deviations from a Poisson distribution. In early Fiebig stages this suggests a heterogeneous infection, whereas in later stages it indicates either immune selection or a heterogeneous infection or both. Also due to purifying selection, samples from a heterogeneous infection could be located inside the cone in later Fiebig stages. However, to be classified as homogenous we also require that the average % diversity in these samples be less than that expected based on the upper limit of the cumulative duration for the Fiebig stage, i.e., day 34 for Fiebig stage II (Figure 9, vertical dashed line)

The above classification was also confirmed by visual inspection through the LANL tool Highlighter (LANL). The visualization tool represents each sequence in the sample with a horizontal line, and places a mark whenever the sequence presents a mutation from the consensus. Figure 9 shows two exemplary behaviors found in acute infection samples: the homogeneous sample SUMA has few randomly distributed mutations, whereas the heterogeneous sample 1051 presents a majority of sequences that are similar to the consensus and a second group of sequences that not only have many more mutations relative to the consensus, but these mutations are shared, indicative of a second lineage.

During rapid exponential growth in the absence of selection, small samples of sequences are likely to have evolved from the founder sequence following a star-like phylogeny, i.e., they all coalesce at the founder (Wakeley, 2008). When this is the case, the intersequence *HD* frequency distribution, *HD _{I}*, coincides with the self-convolution of the

$${\stackrel{\u2012}{Y}}_{k}=\frac{1}{2}\sum _{i=0}^{k}{X}_{k-i}{X}_{i}+\frac{1}{2}{\delta}_{k,k-i}{X}_{k}\phantom{\rule{thickmathspace}{0ex}}\mathrm{for}\phantom{\rule{thickmathspace}{0ex}}k=1,...,n.$$

(17)

The above formula is derived from the arithmetic convolution of the frequencies *X _{k}* with themselves. For example, ${\stackrel{\u2012}{Y}}_{0}={X}_{0}^{2}$,

Figure 10 compares the theoretical frequencies _{0},_{1},...,* _{n}* (red lines), with the observed ones (the histograms). For 4 out of the 6 low diversity samples there was a perfect agreement between the calculated frequencies and the actual intersequence HD frequency. For the other two (WEAU and TRJO) a 5% and 15% difference was found, respectively. These very low differences, if any, in observed and computed frequencies suggest that all 6 samples evolved following an approximate star-like phylogeny.

We applied the synchronous and asynchronous models to characterize early homogeneous infections and estimate the time since the MRCA. We did this using both the theoretical Poisson model and the output from the MC simulations.

In a homogeneous infection we expect the intersequence HD to follow a Poisson distribution, as in Eq. (13). We used a maximum likelihood method to fit a Poisson distribution to the observed *HD _{I}* frequencies and estimate the Poisson parameter λ, which is proportional to the time or the number of generations since the most recent common ancestor. Given the vector of observed frequencies

$$LL\left(\lambda \right)=\mathrm{log}(\frac{{Y}_{0}+...+{Y}_{n}!}{{Y}_{0}!...{Y}_{n}!}\underset{{\left[\frac{{e}^{-\lambda}{\lambda}^{i}}{i!}\right]}^{{Y}_{i}}}{\overset{}{i=0n}}).$$

(18)

By minimizing the log likelihood function, we obtain

$$\lambda =\frac{\sum _{i=0}^{n}i{Y}_{i}}{\sum _{i=0}^{n}{Y}_{i}},$$

(19)

the mean of the intersequence Hamming distance distribution.

We then used λ to estimate the number of generation steps, *n*, since the most recent common ancestor (MRCA), using the following:

$$n=\{\begin{array}{cc}\frac{\lambda}{2\epsilon {N}_{B}}\hfill & \mathrm{synch}.\phantom{\rule{thickmathspace}{0ex}}\text{model}\hfill \\ \frac{\mathrm{1+(\frac{\lambda}{\epsilon {N}_{B}}-\frac{1-{2}^{}}{)}\mathrm{asynch}.\phantom{\rule{thickmathspace}{0ex}}\text{model}\hfill}}{}\phantom{\}}\hfill \end{array}$$

(20)

where *n*=*a* in the synchronous model, and $n=\frac{t}{{\tau}_{a}}$ in the asynchronous model. Then *t*, the time since the MRCA, is given by *t*=*nτ _{s}* for the synchronous model, and

We assessed the goodness of the fit using a χ^{2} goodness-of-fit test statistic calculated from a singular value decomposition of the covariance matrix (see Materials and Methods). In a goodness of fit test the null hypothesis is that the two distributions being tested are statistically the same, hence a low p-value rejects the null hypothesis. All 6 low-diversity samples yielded a goodness-of-fit p-value of 0.1 or higher, suggesting that the observed distribution is consistent with a Poisson.

We fixed the values of base length *N _{B}* and sample size

For all 6 low-diversity samples the time of infection could be estimated based on patients’ Fiebig stage. Table 4 summarizes the estimates both from fitting the Poisson distribution and the MC output using the asynchronous model. The computed estimated time since infection estimated by each patient’s Fiebig stage were within the 95% CIs predicted by the MC simulations. In contrast, the Poisson fit yielded slightly lower estimates (with respect to the given Fiebig stage) for subjects SUMA and 1056. All Poisson estimates were systematically lower than the MC ones. This is due to the fact that run-to-run variation, while accounted for in the MC fitting, is neglected when fitting the Poisson distribution. To understand this, consider for example a base length of *N _{B}*=2600. From Eq. (12), under the asynchronous model, the average Hamming distance from the founder strain, after one generation step will be 0.05. In this time, only 3 infected cells are produced, and, on average there will correspondingly be 0.15 mutations. However, in any given run, we can only have a discrete number of mutations. In particular, the first three cells will be identical 85% of the time (hence a mean HD of 0), 14% of the time there will be exactly one mutation across all three sequences (hence a mean HD of 0.333), and so on. Thus, though averaging over enough runs will lead to the correct mean value, most of the runs (85%) will have less than the expected number of mutations (after the first reverse transcription), and the rest will have more. This initial difference will persist through the generations and appear as an offset in a plot of mean HD versus generations (not shown). Averaging over 1,000 MC runs one finds that a sample of sequences with mean HD from the founder of 0.8 coalesce at a MRCA that is always 17 generation steps back (or 25.5 days) when fitting the Poisson distribution, but in the MC method 85% of the time the inferred number of generation steps will be higher (19 or 20 instead of 17) leading to a mean of 18.2 generation steps (or 27.3 days) Thus explaining the roughly 2 day discrepancy in the time to MRCA between these methods as shown in Table 4. Because the variation in the MC is equivalent to host-to-host variation, the MC 95% CIs are more likely to capture the true time estimates than the Poisson fits.

Early HIV-1 infection tends to be characterized by a viral population with limited sequence diversity in many but not all individuals. A recent study examined sequence data obtained from 102 individuals (Keele et al., 2008), and using the methods developed in this paper characterized the infections as being consistent with a single transmitted strain in 78 cases and multiple strains in 21 cases with 3 being borderline. In order to do this classification the authors used the model of early HIV-1 infection developed in this paper. Even though the biological conclusion that some people are infected by multiple strains and others by one strain only has already been published (Delwart et al., 2001; Derdeyn et al., 2004; Long et al., 2002; Sagar et al., 2004; Zhang et al., 1993), our goal here was to provide, in a self-contained manner, the mathematical underpinnings for the analysis of sequences obtained by single genome amplification methods.

Our goals were not novel in the sense that coalescent and Bayesian inference methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007; Kuhner et al., 1998; Rannala, 1997), when restricted to homogeneous infections, can provide the same results. In fact, comparing the estimated time to the MRCA using BEAST (Drummond and Rambaut, 2007) and using our method on the 53 patients that had a homogeneous infection and no overt enrichment for APOBEC3 mutations from Keele et al. (Keele et al., 2008), we found the results to be similar (correlation coefficient = 0.973, best fitting slope = 1.036, see Figure 11). However, the main difference between these methods and ours is that whereas the former simulate genealogies, under similar assumptions, our simulation models follow the entire population. While for most of the problems studied in the literature simulating genealogies is the correct approach, what one gives up in that approach is simplicity when other processes like recombination and selection need to be modeled. A forward simulation like ours, on the other hand, is usually not feasible because of the problem size, for example for studying HIV-1 at the population level, but is readily applied to the important problem of modeling the evolutionary events in early infection and characterization of viral transmission. Because of the small number of generations since the extreme bottleneck, we can follow the entire population in a simulation as well as provide analytical calculations for various quantities summarizing viral evolution. In this paper, we set up a basic framework in which we neglect recombination and selection, but it is easy to see that with our methods these pose no problems of principle, whereas the analysis involving ancestral recombination graphs (Minichiello and Durbin, 2006), or ancestral selection graphs (Sharma, 1977) or both combined would be prohibitive for this data.

For cases of a single strain infection, our model suggests that the consensus sequence, which corresponds to our best estimate of the MRCA given the Poisson model described here, is either the transmitted strain or a strain one to two mutations away from the transmitted strain (Keele et al., 2008). Identifying the transmitted/founder strain has great biological significance as one can then determine if it has particular properties that allowed it to be transmitted. Even if our identification is only accurate to within one or two base accuracy this will still provide an enormous advance as it is the virus that selectively expands, and we can use this information to search for particular attributes of the virus that allowed it to expand. For example, preliminary examination of putative transmitted viruses suggests that they may generally be more uniformly refractive to neutralization than viruses obtained at the chronic stage of infection, which show heterogeneity in susceptibility to antibodies directed at their receptor binding surfaces (Keele et al., 2008). Also there are subtype specific attributes in terms of variable loop lengths and the number of glycosylation sites (the loops appear to be shorter and less glycosylated in the transmitted viruses of A and C clade infections (Chohan et al., 2005; Derdeyn et al., 2004; Sagar et al., 2006), but not B clade transmitted viruses (Frost et al., 2005)). Data is just beginning to accrue that will enable further exploration of potential viral transmission signatures and phenotypic traits, and the models described in this paper allow one to easily determine whether or not the MRCA or consensus sequence of an early sample is indeed a reasonable estimate of the transmitted virus.

Another important issue that our model addressed is the number of viruses from a single patient that one must sequence to determine if the infection is homogeneous within some confidence band and the number of sequences needed to obtain a reasonable estimate of the time to the MRCA. In order to achieve this, we computed pairwise Hamming distance frequencies of each given sample. Other types of statistical analyses that involve pairwise Hamming distances have been devised elsewhere (Gilbert et al., 2005), however the focus there was to simply compare two different populations, not to infer time estimates. Furthermore, we showed that under the homogeneous infection hypothesis, the HD frequencies follow a star-phylogeny and Poisson distribution. This was also shown in (Slatkin and Hudson, 1991), where the Poisson was used to calculate a population growth rate. Here we used it to test the homogeneity of the sample, and then to get time estimates since the MRCA. The probability of misclassifying an infection as homogeneous is easily calculable from our model. For example, we showed that with a sample of 30 sequences the probability of mistakenly classifying an infection as homogeneous due to not detecting a second minor lineage (less than 10% of the population) is at most 5%. However, now with deep sequencing methods, such as 454 technology (Margulies et al., 2005), 100,000 or more sequences of limited length can be obtained. Such technology may indeed show examples of what we have called homogeneous infections that are not truly homogenous, and could enable us to discern if minor lineages exist at very low frequencies.

The model assumes that the viral population grows exponentially with no selection pressure and no recombination. These are reasonable assumptions to make early in the infection, before the host’s immune response begins and while the infected cell population is still much smaller than the population of target cells. Further, early in infection derived from a single strain viral diversity is low and recombination may have only a small effect — as recombination between identical sequences leaves the sequences unchanged. Later in the infection, the exponential growth phase stops and selection pressure from the host environment gets established, both of which break the star-phylogeny evolution and hence our model is no longer valid.

In our mathematical models we neglect the occurrence of a mutation occurring more than once at the same site since it occurs with probability *O*(*a ^{2}ε^{2}N_{B}*). For our values of

We assumed that the viral diversity evolves under a star-like phylogeny, and that all genomes coalesce at the founder strain. We used this assumption to construct the intersequence HD distribution and compute sample diversity. The probability of not coalescing at the founder strain is $O\left({R}_{0}^{-m}\right)$, where *R _{0}* is the basic reproductive ratio and

It has been estimated that a free virus has a half-life of less than an hour (Ramratnam et al., 1999), whereas an infected cell has an average lifespan of ~2 days (Markowitz et al., 2003; Perelson et al., 1996). Therefore, the virus present in plasma in a given sample is a reflection of the viral genomes contained in HIV-1 producing cells. Thus, as a simplification, we chose in our models to follow the infected cell population and not the virus. At each generation step, we assumed that an infected cell successfully infects *R _{0}* other cells. In effect, we are following the integrated HIV-1 provirus rather than viral particles. The experimental data we analyzed was obtained by sequencing virus and not provirus. On average, each cell produces as many as 5 × 10

Finally, we assumed the mutation rate is constant across different lineages and along the entire genome. This assumption is violated if APOBEC3G/F causes G-to-A mutation at an enhanced rate even in sequences that are not overtly hypermutated. Keele et al. (Keele et al., 2008) showed that overall APOBEC enriched samples follow our evolutionary model upon removing the G-to-A mutations with APOBEC3G/F signature. For the point mutation rate, we used the value 2.16×10^{-5} per base per replication cycle. (Mansky and Temin, 1995) estimated a mutation rate of 3.14×10^{-5} per base per replication cycle, but their estimate counted all types of mutations, including gaps. Because we exclude gaps throughout our analyses, we removed their contribution from the Mansky calculation and obtained the value 2.16×10^{-5}. The value from Mansky et al. was obtained *in vitro* from HeLa cells and reproduced in two different cell lines (Mansky, 1996; Mansky and Temin, 1995). More recent *in vitro* studies found a slightly smaller point mutation rate of 2.2 × 10^{-5} (including deletions and insertions) in one case (Huang and Wooley, 2005), and a larger rate of 5.4 × 10^{-5} in another (Gao et al., 2004). The use of a base substitution rate different from 2.16×10^{-5} would result in larger estimates for the number of days to the MRCA than obtained in Table 4, if the rate were smaller, and shorter if it were larger.

Within both the analytical and the MC model descriptions, we have envisioned two scenarios, one in which all infection events are synchronous, and one in which they are asynchronous and occur at two distinct times. Again, the advantage of the former is its simplicity. It yields a straightforward mathematical description, a population that grows exponentially in time with base *R _{0}*. However, it is biologically unrealistic that each cell would infect all other cells in at a single time. To explore how the synchronicity affects the model, we broke this assumption in the simplest possible way, e.g. by allowing infections to happen at two times. Even though this second scenario is still biologically unrealistic, nonetheless, it shows that the intersequence HD distributions do not change nature: they are still essentially Poisson distributions, and the increase in diversity is again driven by the number of RT cycles that have occurred. However, whereas in the synchronous model the increase in diversity depends on the base substitution rate ε and the generation time expressed in days, in the asynchronous model it also depends on

For the six patients that we classified as having been infected by a single viral strain, we estimated time since the MRCA by computing the divergence (i.e. the mean Hamming distance per base pair) from the founder strain. Because we do not know the actual founder strain, we constructed the sample consensus sequence and assumed it to be identical to the MRCA. In doing this, we are implicitly assuming no bottleneck after the virus has entered the host. Should a sublineage prevail over the rest of the population after entering the host, this would cause the consensus to represent the dominant sublineage rather than the actual infecting strain; an estimate to that MRCA that is statistically less than the minimal time from infection based on Fiebig stage would point to such a scenario, and suggest selection. All these caveats factor into our time since MRCA estimates, and would similarly impact the use coalescent methods in such estimates.

We have presented a simple description of an early homogeneous infection. This restricts the validity of our model to early infections, before selection pressure can be detected and while the infected cell population is still small with respect to the target cell population. Furthermore, in some cases there may be evidence of purifying selection even early in the infection, which could explain why for some of the samples we get too early time estimates, suggesting a bottleneck in viral evolution posterior to infection. Using the SNAP program (LANL), we looked at the dS/dN ratios of all 6 low diversity patients. Four out 6 (1054, SUMA, TRJO, WEAU) had ratios greater than 1, suggesting purifying selection (to be published elsewhere). Future developments of our model could allow correction for such a scenario, for example by using a modified base substitution rate.

Despite its limitations, our model successfully distinguishes homogenous from heterogeneous infections, and predicts the time evolution of divergence, diversity, maximum diversity and % sequence identity in single-strain HIV-1-infections.

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. All subjects gave informed consent, and plasma collections were performed with institutional review board and other regulatory approvals. Blood specimens were generally collected in acid citrate dextrose and plasma separated and stored at -20 to -70°C. To determine how the far into the acute phase the of infection the samples were taken, they were tested for HIV-1 RNA, p24 antigen, quantitative Chiron bDNA 3.0 or Roche Amplicor viral RNA assays; Coulter or Roche p24 Ag assays; Genetic Systems Anti-HIV-1/2 Plus O and Abbott AntiHIV-1/2 3^{rd} Generation EIAs; and Genetic Systems HIV-1 Western Blot Kit. Based on these test results, subjects were staged according to the Fiebig laboratory classification system for acute and early HIV-1 infection (see Table 1).

We employed single genome amplification (SGA) of plasma HIV-1 RNA followed by direct sequence analysis of uncloned *env* amplicons (Keele et al., 2008) as described in Salazar *et al*. (Salazar-Gonzalez et al., 2008). The number of *env* sequences, ranging from 16 to 50, analyzed per subject, their GenBank accession numbers are given in Keele *et al*. (Keele et al., 2008), and the subjects’ Fiebig stages are shown in Table 1. Sequences were aligned with GeneCutter (LANL) and the alignment was hand-checked. We removed insertions and deletions using the “Gap Strip Tool” (http://www.hiv.lanl.gov/content/sequence/GAPSTREEZE/strip_ready.html), which removes all columns that contain one or more gaps from a nucleotide alignment.

We show that $P(\text{mutations}>{N}_{B}a)\le O\left(\frac{{a}^{2}{\epsilon}^{2}}{{N}_{B}}\right)1$. Chebyshev’s inequality (Feller, 1957) states that, given a random variable *X* with mean μ and variance *σ ^{2}*, for any positive integer

$$\mathrm{Pr}ob(X-\mu \ge k\sigma )\le \frac{1}{{k}^{2}}.$$

Let *X* be the number of mutations. Then *μ* = *aεN _{B}* and

$$\mathrm{Pr}ob(H{D}_{0}\ge {N}_{B}+1)\le \frac{a\epsilon (1-\epsilon )}{{N}_{B}{\left(1+\frac{1}{{N}_{B}}-a\epsilon \right)}^{2}}\approx \frac{a\epsilon (1-\epsilon )}{{N}_{B}}+O\left(\frac{{a}^{2}{\epsilon}^{2}}{{N}_{B}}\right)<<1.$$

To see that the probability of coalescence before generation 0 approaches zero exponentially as *a* and *m* get large, i.e. the expression in Eq. (6) is $O\left({R}_{0}^{-m}\right)$, recall that

$${x}^{n}-1=(x-1)({x}^{n-1}+...+x+1).$$

Therefore

$$\frac{{R}_{0}^{a-m}-1}{{R}_{0}^{a}-1}=\frac{{R}_{0}^{a-m-1}+...+{R}_{0}+1}{{R}_{0}^{a-1}+...+{R}_{0}+1}=O\left({R}_{0}^{a-m-1-a+1}\right)=O\left({R}_{0}^{-m}\right).$$

Given a sample of *N*_{S} sequences drawn from the viral population at generation *a* derived from a single strain infection, and a positive integer *M*, we examine how the *N*_{S} *HD*_{0} values should be distributed so that the maximum intersequence Hamming distance is exactly *M*. Notice that *HD _{I}* =

$$P(H{D}_{\mathrm{max}}=Ma)={P}_{k}^{\left(a\right)}\left[{\left(\sum _{i=0}^{M-k}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}-{\left(\sum _{i=0}^{M-k-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}\right]$$

In the special case when *M* is even and $k=\frac{M}{2}$, we need at least two sequences such that $H{D}_{0}=\frac{M}{2}$, which is expressed in the following

$${\left(\sum _{i=0}^{\frac{M}{2}}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}}-{\left(\sum _{i=0}^{\frac{M}{2}-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}}-{N}_{S}{P}_{\frac{M}{2}}^{\left(a\right)}{\left(\sum _{i=0}^{\frac{M}{2}-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}$$

(21)

Putting everything together, we obtain

$$P(H{D}_{\mathrm{max}}=Ma)=\sum _{k=\frac{M+1}{2}}^{M}{N}_{S}{P}_{k}^{\left(a\right)}\left[{\left(\sum _{i=0}^{M-k}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}-{\left(\sum _{i=0}^{M-k-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}\right]$$

(22)

when *M* is odd, and

$$\begin{array}{c}\hfill P(H{D}_{\mathrm{max}}=Ma)={\left(\sum _{i=0}^{\frac{M}{2}}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}}-{\left(\sum _{i=0}^{\frac{M}{2}-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}}-{N}_{S}{P}_{\frac{M}{2}}^{\left(a\right)}{\left(\sum _{i=0}^{\frac{M}{2}-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}+\hfill & \hfill & +\sum _{k=\frac{M}{2}+1}^{M}{N}_{S}{P}_{k}^{\left(a\right)}\left[{\left(\sum _{i=0}^{M-k}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}-{\left(\sum _{i=0}^{M-k-1}{P}_{i}^{\left(a\right)}\right)}^{{N}_{S}-1}\right]\hfill \end{array}$$

(23)

when *M* is even. Here we are assuming that *a*>0 and that any two sequences coalesce at generation 0. Therefore, the expected maximum *HD* at generation *a* is given by

$${\stackrel{\u2012}{HD}}_{\mathrm{max}}=\sum _{M=0}^{{N}_{B}}M\times P(H{D}_{\mathrm{max}}=Ma).$$

Of all infected cells *I(a*+*1*,*n*+*1)* at time *n*+*1*, α *I(a,n)* were infected by cells of age *a* that were “born” at time *n*, whereas *γI(a,n-1)* were infected by cells of age *a* that were born at time *n-1*. In other words, we have the following relationship

$$I(a+1,n+1)=\alpha I(a,n)+\gamma I(a,n-1).$$

(24)

Using the binomial expansion $\left(\begin{array}{c}\hfill x+1\hfill \\ \hfill y-x\hfill \end{array}\right)=\left(\begin{array}{c}\hfill x\hfill \\ \hfill y-x\hfill \end{array}\right)+\left(\begin{array}{c}\hfill x\hfill \\ \hfill y-x-1\hfill \end{array}\right)$, one can verify that for $\frac{n}{2}\le a\le n$ the expression $I(a,n)={I}_{0}{\alpha}^{a}{\left(\frac{\gamma}{\alpha}\right)}^{n-a}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)$ satisfies Eq. (24) as shown below:

$$\begin{array}{cc}\hfill & \alpha I(a,n)+\gamma I(a,n-1)={\alpha}^{a+1}{I}_{0}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\frac{\gamma}{a}\right)}^{n-a}+\gamma {\alpha}^{a}{I}_{0}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a-1\hfill \end{array}\right){\left(\frac{\gamma}{\alpha}\right)}^{n-a-1}=\hfill \\ \hfill & ={\alpha}^{a}{I}_{0}{\left(\frac{\gamma}{\alpha}\right)}^{n-a}\left[\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)+\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a-1\hfill \end{array}\right)\right]={\alpha}^{a+1}{I}_{0}{\left(\frac{\gamma}{\alpha}\right)}^{n-a}\left(\begin{array}{c}\hfill a+1\hfill \\ \hfill n-a\hfill \end{array}\right)=I(a+1,n+1).\hfill \end{array}$$

Let

$$N\left(n\right)=\sum _{a=\frac{n}{2}nI(a,n).}^{}$$

(25)

By taking the summations on both side of Eq. (24), one obtains that *N(n)* satisfies *N*(*n*+1) = *αN*(*n*)+*γN*(*n*-1). Let ${\lambda}_{1}=\frac{\alpha}{2}(1+)$, and ${\lambda}_{2}=\frac{\alpha}{2}(1-)$, where $\mathrm{=\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}}$. Then, for any integers *a* and *b*, the function so defined: ${N}_{a,b}\left(n\right)=a{\lambda}_{1}^{n}+b{\lambda}_{2}^{n}$ is such that for every *n* it satisfies the identity *N _{a,b}*(

$$\begin{array}{cc}\hfill \alpha {N}_{a,b}& \left(n\right)+\gamma {N}_{a,b}(n-1)=\alpha (a{\lambda}_{1}^{n}+b{\lambda}_{2}^{n})+\gamma (a{\lambda}_{1}^{n-1}+b{\lambda}_{2}^{n-1})=\hfill \\ \hfill & a(\alpha {\lambda}_{1}^{n}+\gamma {\lambda}_{1}^{n-1})+b(\alpha {\lambda}_{2}^{n}+\gamma {\lambda}_{2}^{n-1})=a{\lambda}_{1}^{n+1}+b{\lambda}_{2}^{n+1}={N}_{a,b}(n+1)\hfill \end{array}$$

Therefore, we need to find *a* and *b* such that *N(0)*=*I _{0}* and

In order to show that the *HD _{0}* probability distribution is such that

$${K}_{HD}^{\left(n\right)}\left(\xi \right)=Log{M}_{HD}^{\left(n\right)}\left(\xi \right)=\frac{\epsilon {N}_{B}}{{F}_{n}}\sum _{j\ge 1}\frac{{\xi}^{j}}{j!}{p}_{j}\left(\epsilon \right)\sum _{a=\frac{n}{2}na\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\frac{\gamma}{{\alpha}^{2}}\right)}^{n-a},}^{}$$

(26)

where ${M}_{HD}^{\left(n\right)}\left(\xi \right)$ is the moment generating function

$${M}_{HD}^{\left(n\right)}\left(\xi \right)=\frac{1}{{F}_{n}}\sum _{a=\frac{n}{2}n\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)\left(\begin{array}{c}\hfill a{N}_{B}\hfill \\ \hfill d\hfill \end{array}\right){\left(\frac{\gamma}{{\alpha}^{2}}\right)}^{n-a}{[\epsilon {e}^{\xi}+(1-\epsilon )]}^{a{N}_{B}}.}^{}$$

In Eq. (26), *p _{j}*(

$$\frac{{F}_{n}}{{F}_{n-1}}=\frac{1}{2}\frac{{(1+)n+1}^{-}}{}$$

(27)

And since $\theta =\frac{1-1+1}{}$, it follows that $\frac{{F}_{n}}{{F}_{n-1}}\approx \frac{1+2}{+}$. Given $X\left(n\right)={\sum}_{a=\frac{n}{2}na\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\frac{\gamma}{{\alpha}^{2}}\right)}^{n-a}}^{}$, for *n* >> 1, we have $\frac{X\left(n\right)}{{F}_{n}}\approx n\frac{1+2+\frac{1-2{2}^{}}{}}{}$. In fact, let $x=\frac{\gamma}{{\alpha}^{2}}=\frac{{2}^{-}4}{}$. Then

$${F}_{n}+(\frac{{2}^{-}4}{)}(\frac{n\mathrm{+1}{2}^{})}{{F}_{n-1}}$$

As a consequence, $X\left(n\right)=n{F}_{n}-n\frac{{2}^{-}4{F}_{n-1}-\frac{{2}^{-}4{2}^{}{F}_{n-1}}{}}{}$. Dividing both side by *F _{n}* completes the proof.

Given this notation, we can write ${K}_{HD}^{\left(n\right)}\left(\xi \right)=\epsilon {N}_{B}\left({\sum}_{j\ge 1}\frac{{\xi}^{j}}{j!}{p}_{j}\left(\epsilon \right)\right)\frac{X\left(n\right)}{{F}_{n}}$. As a consequence, for large *n*, we can approximate the *m*-th cumulant as

$${\kappa}_{m}^{\left(n\right)}\approx \epsilon {N}_{B}{p}_{m}\left(\epsilon \right)(n\frac{1+2+\frac{1-2{2}^{}}{)}\approx (n\frac{1+2+\frac{1-2{2}^{}}{)}\epsilon {N}_{B}+O\left({\epsilon}^{2}{N}_{B}\right),}{}}{}$$

and up to terms in *O*(*ε*^{2}*N _{B}*), all cumulants are equal (the m-th cumulant does not depend on

The times at which cells are infected differ in the synchronous and asynchronous models. However, the growth rate of the infected cell population in the two models should both be chosen equal to the best estimate of the actual growth rate. Setting these growth rates equal will lead to a relationship between *τ _{s}* and

$${\rho}_{a}=Lo{g}_{{R}_{0}}\frac{N(n+1)}{N\left(n\right)}\approx Lo{g}_{{R}_{0}}(\alpha \frac{1+2}{}).$$

In the synchronous model, at any time *t*=*a*, the total number of newly infected cells is $N\left(n\right)={I}_{0}{R}_{0}^{n}$. Therefore, the synchronous growth rate *ρ _{s}* in the logarithmic scale is 1, i.e.,

$${\rho}_{s}=Lo{g}_{{R}_{0}}\frac{N(n+1)}{N\left(n\right)}=Lo{g}_{{R}_{0}}\frac{{R}_{0}^{n+1}}{{R}_{0}^{n}}=1.$$

If the mean generation time is *2* days, it follows that the average growth rate is 0.5/day. Comparing the rate of increase per day between the two models, we get

$${\rho}_{a}\frac{\Delta {\tau}_{a}}{\Delta days}={\rho}_{s}\frac{\Delta {\tau}_{s}}{\Delta days}=\frac{1}{2},$$

where *τ _{s}*=2 days. Hence ${\tau}_{a}=2{\rho}_{a}\approx 2{\mathit{Log}}_{{R}_{0}}(\alpha \frac{1+2}{})$. In particular, when $\alpha =\gamma =\frac{{R}_{0}}{2}$,

$$N\left(n\right)={I}_{0}{\left(\frac{{R}_{0}}{4}\right)}^{n}\frac{{(1+)n+1}^{-}}{}$$

where $\mathrm{=\sqrt{1+\frac{8}{{R}_{0}}}}$. We could also assume given the data that the average generation time in an asynchronous model is 2 days and then use the procedure given above to deduce a corresponding τ_{s}. In using a *τ _{a}* of 1.5 days half the infected cells are implicitly assumed to live for 3 days. While the average measured lifetime is 2 days the variance around the mean is not known and hence it is unclear if this value of

Define

$${\chi}^{2}={(Y-E\left(Y\right))}^{t}{\Sigma}^{-1}(Y-E\left(Y\right)),$$

(28)

where *Y* is the vector of intersequence HD frequencies of length *n*=max(*HD*), E denotes the expectation operator *E*(*Y*)=(*E*(*Y*_{0}),...*E*(*Y _{n}*)), and Σ

The number of bases that undergo mutation is determined by the mutation rate ε. However, once a base is chosen to be mutated a matrix of base transitions is then used to determine which base it is mutated into. This same method is used in both the synchronous and asynchronous models. The base frequencies and substitution matrix employed in the MC simulations were obtained from a general time reversible (GTR) evolutionary model (Yang, 1994) optimized in conjunction with a maximum likelihood phylogenetic tree. Relative base substitution frequencies are listed in Table 5; these were based on the GTR probability matrix (Korber et al., 2000) and a maximum likelihood estimate of rate variation per site, and indicate the probability of change of one nucleotide to another per unit time, assuming a branch length of 1 (there is extensive change with a branch length of 1). In the MC simulation the equilibrium base frequencies for this model were A=0.373, G=0.241, C=0.193, and T=0.192. The transition matrix was computed from Table 5 by setting the diagonal elements to zero and renormalizing each row.

The maximum likelihood tree used to derive this matrix in Table 5 included 445 HIV-1 clade B Envelope sequences from the Los Alamos database (LANL) combined with sequences from Keele *et al.* (Keele et al., 2008) (code described in (Korber et al., 2000)). While this model was used for the simulations presented in this paper, the use of the HIV-1 specific GTR model turned out to have little impact on the simulation results; a neutral model with an even base distribution and equal substitution rates yielded essentially the same results for the simulations (data not shown).

Portions of this work were done under the auspices of the U. S. Department of Energy under contract DE-AC52-06NA25396 and supported in part by the Center for HIV/AIDS Vaccine Immunology (AI67854), the Bill & Melinda Gates Foundation Grand Challenges program (37874), the University of Alabama at Birmingham Center for AIDS Research (AI27767), the University of Rochester Developmental Center for AIDS Research (P30-AI078498) and NIH grants AI083115, AI028433, and RR0655. We thank Marcus Daniels for technical assistance.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

- Abrahams MR, Anderson JA, Giorgi EE, Seoighe C, Mlisana K, Ping LH, Athreya GS, Treurnicht FK, Keele BF, Wood N, Salazar-Gonzalez JF, Bhattacharya T, Chu H, Hoffman I, Galvin S, Mapanje C, Kazembe P, Thebus R, Fiscus S, Hide W, Cohen MS, Karim SA, Haynes BF, Shaw GM, Hahn BH, Korber BT, Swanstrom R, Williamson C. Quantitating the multiplicity of infection with human immunodeficiency virus type 1 subtype C reveals a non-poisson distribution of transmitted variants. J Virol. 2009;83:3556–67. [PMC free article] [PubMed]
- Achaz G, Palmer S, Kearney M, Maldarelli F, Mellors JW, Coffin JM, Wakeley J. A robust measure of HIV-1 population turnover within chronically infected individuals. Mol Biol Evol. 2004;21:1902–12. [PubMed]
- Bourara K, Liegler TJ, Grant RM. Target cell APOBEC3C can induce limited G-to-A mutation in HIV-1. PLoS Pathog. 2007;3:1477–1485. [PMC free article] [PubMed]
- Casella G, Berger RL. Statistical inference. Brooks/Cole Pub. Co.; Pacific Grove, Calif.: 1990.
- Chen HY, Di Mascio M, Perelson AS, Ho DD, Zhang L. Determination of virus burst size in vivo using a single-cycle SIV in rhesus macaques. Proc Natl Acad Sci U S A. 2007;104:19079–84. [PubMed]
- Chohan B, Lang D, Sagar M, Korber B, Lavreys L, Richardson B, Overbaugh J. Selection for human immunodeficiency virus type 1 envelope glycosylation variants with shorter V1-V2 loop sequences occurs during transmission of certain genetic subtypes and may impact viral RNA levels. J. Virol. 2005;79:6528–6531. [PMC free article] [PubMed]
- Clark SJ, Saag MS, Decker WD, Campbell-Hill S, Roberson JL, Veldkamp PJ, Kappes JC, Hahn BH, Shaw GM. High titers of cytopathic virus in plasma of patients with symptomatic primary HIV-1 infection. N. Engl. J. Med. 1991;324:954–960. [PubMed]
- Delwart EL, Magierowska M, Royz M, Foley B, Peddada L, Smith R, Heldebrand C, Conrad A, Busch MP. Homogeneous quasispecies in 16 out of 17 individuals during very early HIV-1 primary infection. AIDS. 2001;15:1–7. [PubMed]
- Derdeyn CA, Decker JM, Bibollet-Ruche F, Mokili JL, Muldoon M, Denham SA, Heil ML, Kasolo F, Musonda R, Hahn BH, Shaw GM, Korber BT, Allen S, Hunter E. Envelope-constrained neutralization-sensitive HIV-1 after heterosexual transmission. Science. 2004;303:2019–2022. [PubMed]
- Dickover R, Garratty E, Yusim K, Miller C, Korber B, Bryson Y. Role of maternal autologous neutralizing antibody in selective perinatal transmission of human immunodeficiency virus type 1 escape variants. J. Virol. 2006;80:6525–33. [PMC free article] [PubMed]
- Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:214. [PMC free article] [PubMed]
- Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 2005;22:1185–1192. [PubMed]
- Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4:e88. [PMC free article] [PubMed]
- Edwards CT, Holmes EC, Wilson DJ, Viscidi RP, Abrams EJ, Phillips RE, Drummond AJ. Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. BMC Evol Biol. 2006;6:28. [PMC free article] [PubMed]
- Feller W. An introduction to probability theory and its applications. John Wiley and Sons, Inc.; 1957.
- Fiebig EW, Heldebrant CM, Smith RI, Conrad AJ, Delwart EL, Busch MP. Intermittent low-level viremia in very early primary HIV-1 infection. J Acquir Immune Defic Syndr. 2005;39:133–7. [PubMed]
- Fiebig EW, Wright DJ, Rawal BD, Garrett PE, Schumacher RT, Peddada L, Heldebrant C, Smith R, Conrad A, Kleinman SH, Busch MP. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS. 2003;17:1871–1879. [PubMed]
- Frost SD, Liu Y, Pond SL, Chappey C, Wrin T, Petropoulos CJ, Little SJ, Richman DD. Characterization of human immunodeficiency virus type 1 (HIV-1) envelope variation and neutralizing antibody responses during transmission of HIV-1 subtype B. J. Virol. 2005;79:6523–7. [PMC free article] [PubMed]
- Gaines H, von Sydow M, Pehrson PO, Lundbegh P. Clinical picture of primary HIV infection presenting as a glandular-fever-like illness. BMJ. 1988;297:1363–8. [PMC free article] [PubMed]
- Gao F, Chen Y, Levy DN, Conway JA, Kepler TB, Hui H. Unselected mutations in the human immunodeficiency virus type 1 genome are mostly nonsynonymous and often deleterious. J Virol. 2004;78:2426–33. [PMC free article] [PubMed]
- Gilbert PB, Rossini AJ, Shankarappa R. Two-sample tests for comparing intra-individual genetic sequence diversity between populations. Biometrics. 2005;61:106–17. [PubMed]
- Harris RS, Liddament MT. Retroviral restriction by APOBEC proteins. Nat. Rev. Immunol. 2004;4:868–877. [PubMed]
- Heffernan JM, Wahl LM. Monte Carlo estimates of natural variation in HIV infection. J Theor Biol. 2005;236:137–53. [PubMed]
- Huang KJ, Wooley DP. A new cell-based assay for measuring the forward mutation rate of HIV-1. J Virol Methods. 2005;124:95–104. [PubMed]
- Kamina A, Makuch RW, Zhao H. A stochastic modeling of early HIV-1 population dynamics. Math Biosci. 2001;170:187–98. [PubMed]
- Keele B, Li H, Learn GH, Hraber PT, Giorgi EE, Grayson T, Sun C, Chen Y, Yeh WW, Letvin NL, Mascola JR, Nabel GJ, Hayens BF, Bhattacharya T, Perelson AS, Korber BT, Hahn BH, Shaw GS. Low Dose Rectal Inoculation of Rhesus Macaques by SIVsmE660 or SIVmac251 Recapitulates Human Mucosal Infection by HIV-1. J Exp Med To appear. 2009 [PMC free article] [PubMed]
- Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, Sun C, Grayson T, Wang S, Li H, Wei X, Jiang C, Kirchherr JL, Gao F, Anderson JA, Ping LH, Swanstrom R, Tomaras GD, Blattner WA, Goepfert PA, Kilby JM, Saag MS, Delwart EL, Busch MP, Cohen MS, Montefiori DC, Haynes BF, Gaschen B, Athreya GS, Lee HY, Wood N, Seoighe C, Perelson AS, Bhattacharya T, Korber BT, Hahn BH, Shaw GM. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 2008 [PubMed]
- Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Timing the ancestor of the HIV-1 pandemic strains. Science. 2000;288:1789–96. [PubMed]
- Kuhner MK, Smith LP. Comparing likelihood and Bayesian coalescent estimation of population parameters. Genetics. 2007;175:155–65. [PubMed]
- Kuhner MK, Yamato J, Felsenstein J. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics. 1998;149:429–34. [PubMed]
- LANL Hypermut. http://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html.
- LANL Consensus Maker. http://www.hiv.lanl.gov/content/sequence/CONSENSUS/consensus.html.
- LANL Highlighter. http://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/highlighter.html.
- LANL. SNAP http://www.hiv.lanl.gov/content/sequence/SNAP/SNAP.html.
- LANL GeneCutter. http://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html.
- LANL Los Alamos Database. http://www.hiv.lanl.gov/content/index.
- Lee HY, Topham DJ, Park SY, Hollenbaugh J, Treanor J, Mosmann TR, Jin X, Ward BM, Miao H, Holden-Wiltse J, Perelson AS, Zand M, Wu H. Simulation and prediction of the adaptive immune response to influenza a virus infection. J Virol. 2009;83:7151–65. [PMC free article] [PubMed]
- Leigh Brown AJ. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc Natl Acad Sci U S A. 1997;94:1862–5. [PubMed]
- Lindback S, Karlsson AC, Mittler J, Blaxhult A, Carlsson M, Briheim G, Sonnerborg A, Gaines H, Karolinska Institutet Primary HIV Infection Study Group Viral dynamics in primary HIV-1 infection. AIDS. 2000a;14:2283–2291. [PubMed]
- Lindback S, Thorstensson R, Karlsson AC, von Sydow M, Flamholc L, Blaxhult A, Sonnerborg A, Biberfeld G, Gaines H, Karolinska Institute Primary HIV Infection Study Group Diagnosis of primary HIV-1 infection and duration of follow-up after HIV exposure. AIDS. 2000b;14:2333–2339. [PubMed]
- Little SJ, McLean AR, Spina CA, Richman DD, Havlir DV. Viral dynamics of acute HIV-1 infection. J. Exp. Med. 1999;190:841–850. [PMC free article] [PubMed]
- Long EM, Rainwater SM, Lavreys L, Mandaliya K, Overbaugh J. HIV type 1 variants transmitted to women in Kenya require the CCR5 coreceptor for entry, regardless of the genetic complexity of the infecting virus. AIDS Res. Hum. Retroviruses. 2002;18:567–576. [PubMed]
- Mansky LM. Forward mutation rate of human immunodeficiency virus type 1 in a T lymphoid cell line. AIDS Res Hum Retroviruses. 1996;12:307–14. [PubMed]
- Mansky LM, Temin HM. Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virol. 1995;69:5087–5094. [PMC free article] [PubMed]
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80. [PMC free article] [PubMed]
- Markowitz M, Louie M, Hurley A, Sun E, Di Mascio M, Perelson AS, Ho DD. A novel antiviral intervention results in more accurate assessment of human immunodeficiency virus type 1 replication dynamics and T-cell decay in vivo. J. Virol. 2003;77:5037–5038. [PMC free article] [PubMed]
- Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79:910–22. [PubMed]
- Nowak MA, Lloyd AL, Vasquez GM, Wiltrout TA, Wahl LM, Bischofberger N, Williams J, Kinter A, Fauci AS, Hirsch VM, Lifson JD. Viral dynamics of primary viremia and antiretroviral therapy in simian immunodeficiency virus infection. J Virol. 1997;71:7518–25. [PMC free article] [PubMed]
- Painter SL, Biek R, Holley DC, Poss M. Envelope variants from women recently infected with clade A human immunodeficiency virus type 1 confer distinct phenotypes that are discerned by competition and neutralization experiments. J Virol. 2003;77:8448–61. [PMC free article] [PubMed]
- Perelson AS, Neumann AU, Markowitz M, Leonard JM, Ho DD. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science. 1996;271:1582–6. [PubMed]
- Ramratnam B, Bonhoeffer S, Binley J, Hurley A, Zhang L, Mittler JE, Markowitz M, Moore JP, Perelson AS, Ho DD. Rapid production and clearance of HIV-1 and hepatitis C virus assessed by large volume plasma apheresis. Lancet. 1999;354:1782–5. [PubMed]
- Rannala B. Gene genealogy in a population of variable size. Heredity. 1997;78(Pt 4):417–23. [PubMed]
- Ribeiro RM, Bonhoeffer S. A stochastic model for primary HIV infection: optimal timing of therapy. AIDS. 1999;13:351–7. [PubMed]
- Richman DD, Wrin T, Little SJ, Petropoulos CJ. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc Natl Acad Sci U S A. 2003;100:4144–9. [PubMed]
- Ritola K, Pilcher CD, Fiscus SA, Hoffman NG, Nelson JA, Kitrinos KM, Hicks CB, Eron JJ, Jr., Swanstrom R. Multiple V1/V2 env variants are frequently present during primary infection with human immunodeficiency virus type 1. J. Virol. 2004;78:11208–11218. [PMC free article] [PubMed]
- Ruskin HJ, Pandey RB, Liu Y. Viral load and stochastic mutation in a Monte Carlo simulation of HIV. Physica A. 2002;293:315–323.
- Sagar M, Wu X, Lee S, Overbaugh J. Human immunodeficiency virus type 1 V1-V2 envelope loop sequences expand and add glycosylation sites over the course of infection, and these modifications affect antibody neutralization sensitivity. J Virol. 2006;80:9586–98. [PMC free article] [PubMed]
- Sagar M, Kirkegaard E, Long EM, Celum C, Buchbinder S, Daar ES, Overbaugh J. Human immunodeficiency virus type 1 (HIV-1) diversity at time of infection is not restricted to certain risk groups or specific HIV-1 subtypes. J Virol. 2004;78:7279–83. [PMC free article] [PubMed]
- Salazar-Gonzalez JF, Bailes E, Pham KT, Salazar MG, Guffey MB, Keele BF, Derdeyn CA, Farmer P, Hunter E, Allen S, Manigart O, Mulenga J, Anderson JA, Swanstrom R, Haynes BF, Athreya GS, Korber BT, Sharp PM, Shaw GM, Hahn BH. Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification by single-genome amplification and sequencing. J Virol. 2008;82:3952–70. [PMC free article] [PubMed]
- Schacker T, Collier AC, Hughes J, Shea T, Corey L. Clinical and epidemiologic features of primary HIV infection. Ann. Intern. Med. 1996;125:257–264. [PubMed]
- Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL, Mullins JI. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 1999;73:10489–502. [PMC free article] [PubMed]
- Sharma JC. Inheritance of defective extremities. Indian J Med Res. 1977;66:120–6. [PubMed]
- Simon V, Zennou V, Murray D, Huang Y, Ho DD, Bieniasz PD. Natural variation in Vif: differential impact on APOBEC3G/3F and a potential role in HIV-1 diversification. PLoS Pathog. 2005;1:e6. [PMC free article] [PubMed]
- Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991;129:555–62. [PubMed]
- Stafford MA, Corey L, Cao Y, Daar ES, Ho DD, Perelson AS. Modeling plasma virus concentration during primary HIV infection. J. Theor. Biol. 2000;203:285–301. [PubMed]
- Tan WY, Wu H. Stochastic modeling of the dynamics of CD4+ T-cell infection by HIV and some Monte Carlo studies. Math Biosci. 1998;147:173–205. [PubMed]
- Tuckwell HC, Le Corfec E. A stochastic model for early HIV-1 population dynamics. J Theor Biol. 1998;195:451–63. [PubMed]
- Vernazza PL, Eron JJ, Fiscus SA, Cohen MS. Sexual transmission of HIV: infectiousness and prevention. AIDS. 1999;13:155–66. [PubMed]
- Wakeley J. Coalescent Theory, an Introduction. Roberts & Co.; 2008.
- Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong PD, Shaw GM. Antibody neutralization and escape by HIV-1. Nature. 2003;422:307–312. [PubMed]
- Wolinsky SM, Wike CM, Korber BT, Hutto C, Parks WP, Rosenblum LL, Kunstman KJ, Furtado MR, Munoz JL. Selective transmission of human immunodeficiency virus type-1 variants from mothers to infants. Science. 1992;255:1134–7. [PubMed]
- Wolinsky SM, Korber BT, Neumann AU, Daniels M, Kunstman KJ, Whetsell AJ, Furtado MR, Cao Y, Ho DD, Safrit JT. Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science. 1996;272:537–42. [PubMed]
- Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol. 1994;39:105–11. [PubMed]
- Zhang LQ, MacKenzie P, Cleland A, Holmes EC, Brown AJ, Simmonds P. Selection for specific sequences in the external envelope protein of human immunodeficiency virus type 1 upon primary infection. J. Virol. 1993;67:3345–56. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |