Search tips
Search criteria 


Logo of biostsLink to Publisher's site
Biostatistics. 2011 October; 12(4): 750–762.
Published online 2011 May 19. doi:  10.1093/biostatistics/kxr011
PMCID: PMC3169671

Recursive partitioning of resistant mutations for longitudinal markers based on a U-type score

Chengcheng Hu*
Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, AZ 85724, USA, ude.anozira.liame@ccuh


Development of human immunodeficiency virus resistance mutations is a major cause of failure of antiretroviral treatment. We develop a recursive partitioning method to correlate high-dimensional viral sequences with repeatedly measured outcomes. The splitting criterion of this procedure is based on a class of U-type score statistics. The proposed method is flexible enough to apply to a broad range of problems involving longitudinal outcomes. Simulation studies are performed to explore the finite-sample properties of the proposed method, which is also illustrated through analysis of data collected in 3 phase II clinical trials testing the antiretroviral drug efavirenz.

Keywords: Antiretroviral drugs, Longitudinal data, Recursive partitioning, Repeated measurements, Resistance mutations, Tree method


Highly active antiretroviral therapy (HAART), which usually consists of 3 or more different antiretroviral drugs, has been very effective in suppressing the replication of human immunodeficiency virus (HIV) and slowing the progression to acquired immune deficiency syndrome (AIDS). After HAART initiation, viral load, which is quantified as the number of HIV RNA copies per milliliter of blood plasma, quickly declines. Under the selective pressure of antiretroviral drugs, however, viruses may develop drug resistance mutations, resulting in a slower decline or even a rebound of the viral load. These mutations tend to persist after change in therapy and impact the effect of subsequent treatment regimens, as there is considerable cross-resistance among drugs within the same drug class. Furthermore, minority species of virus with drug resistance mutations, present since initial infection, may emerge as the dominant species under drug pressure. Antiretroviral resistance is a major cause of treatment failure. In clinical studies of HIV-infected patients, resistant mutations are generally monitored by sequencing the relevant portions of the HIV genome; in this paper, we focus on the reverse transcriptease (RT) region of the HIV genome, corresponding to the classes of drugs used in the studies that motivated our research. These genetic regions have a total length of several hundred codons, and high dimensionality complicates the effort to understand the relationship between HIV genotype and viral response to therapy. To apply traditional statistical methods to viral genetic data, we need to reduce the dimensionality of the viral genome and to classify the genetic sequences into a small number of relatively homogeneous classes. Such method may also be useful in stratification of patients in clinical trials and in guiding treatment choices.

Recursive partitioning is of particular interest in the study of HIV-resistant mutations. Such methods naturally accommodate high-dimensional data, allow exploration of complex interactions of different codons, and provide easily interpretable results. For an introduction to recursive partitioning methods, the readers are referred to the seminal work of Breiman and others (1984), while Zhang and Singer (1999) gives a very nice summary of more recent developments. This method has been employed to investigate the association between drug-resistant mutations and a single measurement of viral load or 50% inhibitory concentration (IC50) of a certain drug, which is a drug susceptibility phenotype. See Sevin and others (2000), Foulkes and DeGruttola (2002), Foulkes and others (2004), and Beerenwinkel and others (2002) for details.

Recursive partitioning methods have also been developed for repeatedly measured responses. Zhang (1998) studied multiple binary responses with joint probability distribution from the exponential family. The method of Segal (1992) for longitudinal responses requires the observations to be spaced equally in time, and the covariance structure needs to be modeled. Lee (2006) proposed a recursive partitioning method based on generalized estimating equation (GEE) models, which also requires a common set of observation time points for all study participants or the covariance structure needs to be modeled.

In this article, we develop a tree-based method to classify HIV genetic sequences with respect to longitudinal outcome measures such as plasma HIV-1 RNA (viral load) and CD4 count. Although in most studies, all participants follow the same schedule of clinical visits, at which the biomarkers will be measured, the actual visit time could be weeks or even months away from the scheduled time. Also, trajectory of a patient's longitudinal viral load can be quite erratic and parametric modeling could be very difficult. As examples, Figure 1 shows viral RNA trajectories of 6 subjects who started combination therapy containing the drug efavirenz. Two subjects, represented by solid curves, kept the viral load suppressed below 400 copies/ml (the horizontal line) since a few weeks after treatment initiation. Two other subjects, represented by broken curves, initially had suppressed viral load but then it rebounded back to high levels. Viral load of the remaining 2 subjects (dotted curves) was never suppressed. We see it would be challenging to find an appropriate parametric model for viral load trajectories—the variable is generally measured at different time points for different subjects, and the mean structure is hard to model. The method we propose in Section 2 is fully nonparametric and does not require modeling the mean and covariance structure; it also allows irregular times of measurement. Performance of the method is studied by simulation in Section 3 under realistic settings with moderate sample sizes, and in Section 4, the method is applied to data from 3 phase II clinical trials, from which the 6 subjects shown in Figure 1 are selected. Discussions and potential extensions are relegated to Section 5.

Fig. 1.

Viral load trajectories of 6 participants in the DMP-266 studies. 1


2.1. Definition of U-type score

For a study of n subjects, assume that the viral load of the ith subject (i = 1,…,n) was measured at baseline ti0 = 0 and ni subsequent time points 0 < ti1 < (...) < ti,ni, with values Yi0,Yi1,…,Yi,ni, respectively. Viral load declines in the first few months of successful treatment; we restrict our attention to this phase of the study to focus on therapeutic efficacy. Similar to May and DeGruttola (2007), for 2 different subjects indexed by i and j and one viral measurement of each person measured at tik > 0 and tjl > 0, respectively, we define

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx1_ht.jpg

This score compares 2 outcome measurements of 2 different subjects; if the ith subject achieved a lower viral load at tik than did the jth subject at time tjltik, a score of 1 is assigned. On the other hand, if the ith subject had a higher viral load at tik than did the jth subject at time tjltik, a score of − 1 is assigned. Since viral load is expected to decline, we assign a score of 0 if a subject achieved a lower viral load level at a later time than did another subject, as we are unable to judge which performance was superior.

To compare the rates of decline for any 2 subjects, we sum over scores of all pairwise comparisons between their viral load measurements

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx2_ht.jpg

This score compares the full viral load trajectories of the 2 subjects. A positive score implies that the ith subject had a better response to the therapy than did the jth subject.

2.2. Properties of U-type score

When the viral loads of 2 subjects follow exactly the same downward trend and the observations at different times follow the same process, we can easily show that the U-type score comparing them has expectation 0. Prospective studies, including clinical trials, normally require the study participants to visit clinics at a fixed set of time points for collection of laboratory and clinical data. The date at which the visit is scheduled depends on the convenience of both the study subject and the clinic staff, and hence could differ from the pre-set timeline by a few days or even weeks. For the ith subject, let tik* be the kth time point (k = 0,…,m) of scheduled clinical visit. Note that {ti0*,…,tim*} could differ from the {ti0,…,ti,ni} since the subject might miss one or more of the clinical visits. Let Δik be an indicator of whether the ith subject had the kth scheduled visit. In the decline phase of an HIV study, we can generally assume that {ti0*,…,tim*} and {Δi0,…,Δim} are identically and independently distributed across all study participants. We also assume that the visit indicators are independent of all other data, that is, visits are missing completely at random.

Let Yik* be the (possibly unobserved) viral load of the ith subject at the scheduled time tik*. It can be seen that D(i,j) = (ninj) − 1k = 1ml = 1mΔikΔjlD{(Yik*,tik*),(Yjl*,tjl*)}. If the viral load process is the same for both the ith subject and the jth subject, viral load data of the ith and jth subjects are identically distributed. For any k and l, it can be easily seen that E D{(Yik*,tik*),(Yjl*,tjl*)} = E D{(Yjk*,tjk*),(Yil*,til*)} = − E D{(Yil*,til*),(Yjk*,tjk*)}. In the special case of k = l, this reduces to E D{(Yik*,tik*),(Yjk*,tjk*)} = 0. Since the visit indicators Δik's are assumed to be independent of all other data, we get ED(i,j) = 0.

When the viral load processes for subjects i and j are different, assume Yik* = μi(tik*) + ϵi(tik*) and Yjl* = μj(tjl*) + ϵj(tjl*), where μi(·) and μj(·) are deterministic functions representing the means of the time-varying viral loads for the 2 subjects, and ϵi(·) and ϵj(·) are independently and identically distributed mean-zero error processes that are also independent of all other data. If μj(t) > μi(t) for all positive time t in the range we consider, we can easily prove that for 1 ≤ klm, E [D{(Yik*,tik*),(Yjl*,tjl*)} + D{(Yil*,til*),(Yjk*,tjk*)}] > 0, and for 1 ≤ km, E D{(Yik*,tik*),(Yjk*,tjk*)} > 0. We then have ED(i,j) > 0. Thus, we can be sure that when the viral load of subjects i and j are identically distributed, the expectation of D(i,j) is 0, and if the viral load of the ith subject has a mean curve lower than that of the jth subject, the expectation of D(i,j) is positive. These properties demonstrate that D(·,·) provides a valid pairwise comparison measure between any 2 subjects.

2.3. Alternative definition of score

Flexibility in the definition of the pairwise comparison score allows accommodation of special characteristics of the outcome measures. For example, in a study with longitudinal viral loads as the outcomes, we can revise the score D(·,·) to take into account information about viral response. If a subject fails to attain viral load suppression below a certain threshold in the first few months of receiving a new therapy, the viral load is unlikely to become suppressed later on the same treatment, and the therapy is considered to have failed. If one subject always has higher viral load levels than does another, but neither attains viral load levels below a fixed threshold, both are considered treatment failures and we may not want to distinguish between them. Also, if the viral load of a subject who complies with therapy initially becomes suppressed but rebounds later, viral load is unlikely to be suppressed again on the same treatment, which is also considered to have failed. Again we might not want to distinguish between 2 such subjects. Regardless of the nature of treatment failure, it may be most appropriate to consider any subject who failed to have had a worse response than any other subject who did not fail, within the same interval of time.

Investigators may choose to consider subjects whose viral loads are never suppressed to have worse treatment response than those who have an initial viral suppression followed by a rebound. Alternatively, because subjects of the latter description are at more risk of developing new resistance mutations than are those who never responded to treatment, we may consider transient response to be worse than no response. The choice may depend on the goal of the analysis: focusing on biological activity would lead to different choices for the score definition than focusing on clinical benefit.

Viral suppression usually occurs within a few weeks after initiation of an effective new treatment. In HIV/AIDS studies, the most common thresholds for viral suppression are 400 copies/ml or 50 copies/ml, depending on assay characteristics. Then definition of viral rebound varies across studies; in the studies we consider, the definition was 2 consecutive viral loads of over 400 copies/ml (or 50 copies/ml) or one single viral load of over 4000 copies/ml after an initial viral suppression. For the ith subject, let

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx3_ht.jpg

Now, we can define an alternative score comparing 2 different subjects indexed by i and j:

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx4_ht.jpg

where D(i,j) is defined in (2). In this definition, an initial viral suppression followed by rebound is considered a response superior to no suppression.

Another marker of interest in HIV studies is CD4 T-lymphocyte count, which is expected to rise after initiation of an effective therapy. A more rapid rise is considered to be a more favorable response; hence, a score can be defined as in (1), reversing the directions of the inequalities involving the outcomes.

2.4. Construction of trees

Here, we introduce a recursive partitioning method to correlate baseline variables of potentially high dimensions, like viral genetic mutations, with trajectories of a longitudinal biomarker, like viral load. A splitting criterion based on the U-type score D(·,·) defined in (2) measures the difference between 2 mutually exclusive groups of study subjects. Based on a viral genetic mutation or other baseline variable chosen by the splitting criterion, the cohort of all subjects (defined as the root node) is divided into 2 subgroups, called daughter nodes. The same splitting rule is then applied recursively to each daughter node to build a large tree. A pruning procedure is then developed to remove some branches to obtain a tree with proper size. The methods are motivated by those of Breiman and others (1984) and also by the survival tree methodology of LeBlanc and Crowley (1993), who used the log-rank statistic as the splitting criterion.

For a study of n subjects, assume all had the relevant genetic sequence at study entry. For the ith subject (i = 1,…,n), let Zir be an indicator of mutation at the rth codon (r = 1,…,R). Let v be an arbitrary node, which is a subset of study participants. For any r, let wr be the subset of v with all study subjects having a mutation at the rth codon: wr = {i[set membership]v:Zir = 1}. Thus, v\wr consists of subjects with a wild-type rth codon. Let

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx5_ht.jpg

where An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx6_ht.jpg is an estimator for the standard deviation of the sum of U-type scores in the formula above. As for general tree methods, a binary split of v can also be defined by dichotomizing a baseline continuous or ordinal variable Z, where the 2 daughter nodes are defined as {Zz} and {Z < z} for a threshold z or the split can be defined by dichotomizing a baseline nominal (unordered categorical) variable W, where the 2 daughter nodes are defined as {W[set membership]A} and {W¬A} for a subset A of all levels of W.

The variance estimator An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx6_ht.jpg can be obtained from general theory on two-sample U-statistics, which is covered in textbooks like Serfling (1980). For an arbitrary binary split of v into w and v\w, let m1 = |w| and m2 = |v\w|, the sizes of w and v\w, respectively, and let m = m1 + m2. Define Uw = (m1m2) − 1i[set membership]wj[set membership]v\wD(i,j). We have An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx15_ht.jpg(Uw E Uw)→dN(0,σ102/p + σ012/(1 − p)), where σ102 = Cov{D(i,j),D(i,j)}, σ012 = Cov{D(i,j),D(i,j)}, and p is the limit of m1/m. Here, i,i[set membership]w, and ii, while j,j[set membership]v\w, and jj. These quantities can be easily estimated from the data: An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx10_ht.jpg, and An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx11_ht.jpg.

As mentioned earlier, a binary split can be defined based on any ordered or unordered baseline variable (Breiman and others, 1984). For brevity of notation, however, we assume that there is no baseline variable other than those mutation indicators. We now split the node v based on the mutation status at the r*th codon, where G(r*,v) = max1 ≤ rRG(r,v). The 2 daughter nodes are then wr* and v\wr*, with G(v)[equivalent]G(r*,v) a goodness of split measure indicating the difference in viral load trajectories of the 2 daughter nodes. We apply this procedure first to the root node of all subjects, and then recursively to the resulting daughter nodes. The procedure does not stop until every terminal node satisfies one of the criteria: the node size is smaller than a prespecified number, say 5; the node contains only homogeneous subjects with respect to the splitting criterion, that is, G(r,·) = 0 for all 1 ≤ rR or all subjects in the node have the same baseline viral sequence. This splitting process generally results in a large tree, and pruning is needed to avoid overfitting.

2.5. Determination of proper tree size

To describe the pruning procedure, we introduce some notation. For any tree or any branch of a tree T, T~ denotes the set of all terminal nodes of T, and To[equivalent]T\T~ denotes all internal nodes of T. When T1 is a subtree of T2, we write T1[succeeds, equal]T2, while T1[precedes]T2 specifies that T1 is a proper subtree of T2. Also T0 denotes the large tree obtained in the splitting procedure.

The goodness of split measure for a tree or a branch of a tree T, G(T)=v[set membership]ToG(v), is the sum of goodness of split measures for all internal nodes of T. When further splits are made, G(T) can only increase. As in Breiman and others (1984) and LeBlanc and Crowley (1993), we add a penalty term to the splitting criterion, penalizing the complexity of the tree. For α ≥ 0, define Gα(T)=G(T)α|To|. Thus, α is the penalty deducted from the splitting criterion for each additional split. Our goal is to find a subtree of T0 that maximizes Gα(·) for any α ≥ 0. As pointed out by LeBlanc and Crowley (1993), α can be chosen between 2 and 4. Using α = 2 is in the same spirit of the Akaike information criterion, while a penalty of α = 4 is roughly equivalent to setting the significance level at 0.05 for each split.

For any tree T, if T[succeeds, equal]T and Gα(T) = maxT*[succeeds, equal]TGα(T*), we call T an optimally pruned subtree of T for the penalty value α. For α ≥ 0, let T(α) denote the smallest optimally pruned subtree of T with respect to α. For any node v of T, let Tv be the branch of T rooted at v. As in Breiman and others (1984), we can establish the existence of T(α) by induction on |T| and further prove that T(α) = {v[set membership]T:Gα(Tv) > 0 for any ancestor vof v}.

Note that T0(0) = T0. To identify T0(α) for all α ≥ 0, for any node v[set membership]T0, we define

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx7_ht.jpg

Here, T0v is the branch of T0 rooted at v. Let α1 = minv[set membership]T0g0(v). Then T0 = T0(α) for all α < α1 and T1 = T0(α1) can be obtained by removing from T0 every branch T0v with g0(v) = α1. Now, for any v[set membership]T1, define

An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx8_ht.jpg

Let α2 = minv[set membership]T1g1(v). Then T1 = T0(α) for α1α < α2 and T2 = T0(α2) can be obtained by removing from T1 every branch T1v with g1(v) = α2. This process is repeated until for some S > 0, TS is reduced to the tree containing the root node of T0 only. We then have a sequence of trees TS[precedes](...)[precedes]T1[precedes]T0 and a sequence of numbers = αS + 1 > αS > (...) > α1 > α0 = 0 so that Ts = T0(α) for αsα < αs + 1.

Like LeBlanc and Crowley (1993), we define the cost for a node v as R(v)=v[set membership]To0vG(v) if v[set membership]T0o, and 0 otherwise. For any T[succeeds, equal]T0, let R(T)=v[set membership]T~R(v) and Rα(T)=R(T)+α|T~|. This implies that Gα(T) = G(T0) + αRα(T) and our proposed pruning process can be described in terms of Rα(·) instead of Gα(·), and all optimal properties of the pruning process can be proved as in Chapter 10 of Breiman and others (1984).

Since Gα is calculated from the same data used to grow and prune the tree, it is a biased estimate of the expected Gα for the same tree applied to a new independent data set. If the sample size is large, we can randomly split the data into a training sample and a test sample. A tree will be grown and pruned on the training sample, and the goodness of split measures can be calculated using the test sample, resulting in an unbiased estimate of Gα.

Data sets of moderate size do not allow for separate training and test samples, so resampling methods are used to calculate a proper estimate of Gα as in LeBlanc and Crowley (1993). Let X1 and X2 be 2 samples of data, and let G(X1;X2,T) be the G statistic calculated from X1 for a tree T built on X2. Then G(T) = G(X;X,T) for the data set X, and what we need is an estimator for G*(T) = E{G(X*;X,T)}, where X* is a sample independent of X. Let αs = An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx16_ht.jpg for 1 ≤ sS − 1. We use the following resampling method to identify the proper tree size. First, draw B bootstrap samples from X and denote them by Xb(b = 1,…,B). Then grow a large tree Tb for each b, and for each s find Tb(αs), the optimally pruned subtree of Tb for the per split penalty αs. The overoptimism in G(Tb(αs)) can be estimated by Obs = G(Xb;Xb,Tb(αs)) − G(X;Xb,Tb(αs)). Then An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx12_ht.jpg is a reasonable estimate for the overoptimism in G(T(αs)). For any fixed per split penalty α, we can then choose the optimal subtree as the T(αs) that maximizes An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx13_ht.jpg is maximized at s = S − 1 and An external file that holds a picture, illustration, etc.
Object name is biostskxr011fx14_ht.jpg, the optimal subtree contains only the root node.


Simulation studies are carried out under various settings. In each setting, n = 200 subjects have baseline (t = 0) viral load and viral genetic sequences and are scheduled to have 5 viral load measurements at t = 1,2,…,5. Thus, ni = 5 for i = 1,…,n. For any scheduled t > 0, the actual measurement time follows a uniform distribution on [t − 0.25,t + 0.25]. For r = 1,…,30, let Cir be the indicator of mutation at the rth codon, which are all independent in the simulation. Let Yik be the log10 viral load at time tik for the ith subject. In all scenarios, we assume Yik = β0 + β1itik + ϵik, where the random error ϵik~N(0,σ2) is independent of other variables. We set σ = log102, so that a change of one standard deviation in the error is equivalent to doubling or halving the viral load on the original scale. The baseline mean β0 is set to be log10(50000), close to the median observation in the efavirenz studies to be analysed in the next section. The slope β1i depends on the resistance status of each patient. All simulated viral loads are imputed at 50, reflecting the threshold of accurate quantification for commonly used assays. At any split the left daughter node is associated with better response. Data are simulated under 4 general scenarios: I.

  • I. Each codon has a 0.5 chance to be mutant and none is resistant. Two values of the slope β1i are simulated; β1i = − log10(2) (rapid decline) and β1i = − log10(2)/2.5 (slow decline). Here, the correct tree is one with root node only. II.
  • II. Each codon has a 0.5 chance to be mutant, and a mutant first codon confers resistance to the study drug. For resistant subjects β1i = 0, and for other subjects sensitive to drugs, we simulate 2 separate settings of rapid decline (β1i = − log10(2)) and slow decline (β1i = − log10(2)/2.5). Here, the correct tree has a single split on codon 1. III.
  • III. Each codon has a 0.5 chance to be mutant, and a subject is resistant to drugs only if the first and second codons are both mutant. Hence, about 25% of all subjects are resistant. For resistant subjects, the slope β1i = 0, and for the sensitive subjects, we simulate 2 separate settings of rapid decline (β1i = − log10(2)) and slow decline (β1i = − log10(2)/2.5). Here, the correct tree has 2 splits on the first 2 codons in either order, and the second split should be at the right daughter node of the root. IV.
  • IV. The first codon has probability of 1/3 to be mutant and the other codons have 0.5 chance to mutant. A subject is resistant if Ci1 = 1 with slope β1i = 0. A subject is sensitive to drugs if Ci1 = Ci2 = 0, with the slope β1i set to be negative. If Ci1 = 0 and Ci2 = 1, the viral load declines initially at the first 3 time points after baseline and then rebounds to baseline level. For these rebounding subjects, the slope β1i is negative for k ≤ 3 and β1i = 0 for k = 4,5. Under this scenario, the cohort is divided into 3 subsets of approximately equal sizes: resistant, sensitive, and rebounding. Again we simulate 2 separate settings of rapid decline (β1i = − log10(2)) and slow decline (β1i = − log10(2)/2.5) for the sensitive subjects and for the initial decline phase of the rebounding subjects. Multiple trees are considered correct: root node is split on codon 1 and its left daughter node is split on codon 2, root node is split on codon 2 and its left daughter node is split on codon 1, and also this tree with an extra split added to the right daughter node based on codon 1.

One thousand data sets are generated for each setting. In the pruning process, we use 50 bootstrap samples to calculate the overoptimism in the goodness of split measure. Two values are used for the per-split penalty (α = 2 and 4). Table 1 provides the proportions of getting various types of trees for each setting. The correct trees under all scenarios have been described above. When there should be at least one split but the resulting tree has none we refer to it as a null tree. A noisy tree contains all correct split(s) and also some extra noise. A partially correct tree (denoted by “partial” in the table) contains some but not all correct splits, and an incorrect tree has none of the correct split(s) but some noise.

Table 1.

Results of simulation studies

For the first scenario of no resistance mutations, the method performs well for all settings, with > 95% power to obtain the correct tree. Not surprisingly, the performance is slightly better in the slow decline setting, where observations at different time points are closer to each other. For the second scenario with one resistant mutation, the method performs well in all settings, with > 97% power to obtain the correct tree. For the small number of data sets where the correct tree is not obtained, all selected trees have the root node correctly split on codon 1, with some added noise. For the third scenario with interaction effect of 2 codons, the power to identify the correct tree is very high (96%) in the rapid decline setting. In the slow decline setting, viral load trajectories of resistant and sensitive subjects are not as well separated as in the rapid decline setting, and the power is moderately high ( > 80%). For the fourth scenario with resistant, sensitive and rebounding subjects, the power is very high ( > 97%) in the rapid decline setting and is moderately high ( > 76%) in the slow decline setting. The 2 penalty values α = 2 or 4 induced similar performance in all scenarios. From Table 1, it can be seen that the average tree sizes are close for these 2 α values.

Simulation studies are also carried out for the smaller sample sizes of n = 100 and n = 30, and results are summarized in Section 1 of the web-based supplementary material available at Biostatistics online. For n = 100, the proposed method still performs well except for the slow decline settings of scenarios III and IV, where the power is low to detect the resistant mutations. This is due to the small number of subjects with resistant mutations (e.g., 25 in scenario III). When n = 30 we see some further reduction in the power to detect resistant mutations.

For scenarios II and III, we also study through simulation the power of a parametric approach using the true functional form of the viral load over time. For each node, a GEE model is fitted and a splitting criterion based on the residual sum of squares is used. The algorithm has an almost 100% power to identify the correct resistant codons when n = 200 and 100. The power deteriorated substantially only in the slow decline setting of scenario III when n = 30. As discussed in Section 1, viral trajectories can be quite complicated as they reflect a complex interaction of viral dynamics, evolutionary dynamics, intercurrent illnesses, and behavior. These factors induce a high level of heterogeneity in the data, so parametric modeling will be very difficult, and sometimes even impossible.


Data from 3 phase II clinical trials of efavirenz (DMP266-003, 004 and 005) were analyzed using the proposed methods. Most study participants were previously exposed to nucleoside reverse transcriptase inhibitors (NRTIs), often leading to development of various resistance mutations to NRTIs. The drug under study belongs to the drug class of nonnucleoside reverse transcriptase inhibitors (NNRTIs) and proved to be useful in construction of potent regimens. Because most study participants received efavirenz in combination with 2 NRTIs, baseline NRTI mutations could predict response to treatment. The details of these studies can be found in Bacheler and others (2000). A total of 156 subjects had viral genetic sequences and viral load available at baseline, and at least one viral load measurement after study entry. The December 2009 version of the International AIDS Society-USA drug resistance mutations list (Johnson and others, 2009) specifies 25 NRTI and NNRTI resistance sites, and 12 of them appeared in the baseline data of these 3 studies. They are all considered as potential predictors for future viral load, along with the baseline viral load dichotomized at 50,000 copies/ml, which is close to the median of 55,115 copies/ml. These predictors are summarized in Table 2.

Table 2.

Frequencies of baseline mutations and high viral load

We first investigate plasma HIV-1 RNA trajectories in the first 12 weeks after study entry. The U-type score D(·,·) defined in (2) serves as the basis of the first analysis; another analysis makes use of the alternative score D~(·,·) defined in (3). In the latter, viral suppression is defined as HIV-1 RNA level below 400 copies/ml, and rebound, as 2 consecutive values above 400 copies/ml or a single one above 4000 copies/ml after an initial suppression. As viral load testing did not always happen on the scheduled day, we allow a 1-week window before and after the scheduled dates. In this way, we include all viral loads measured in the first 13 weeks in the 12-week analysis. In calculation of both scores, viral load measurements of 2 different subjects taken within 7 days of each other were considered to have occurred at the same time. During the period of time under consideration, the median number of viral load measurements was 7, and nearly 95% of subjects had at least 4 viral load measurements. For the alternative score, the numbers of subjects with V(i) = 0 (viral suppression), 1 (initial viral suppression followed by rebound) and 2 (no suppression) were 55(35%), 37(24%), and 64(41%), respectively. To identify appropriate size of a tree, the resampling method mentioned at the end of Section 2 was used to correct for the overoptimism in G(T), with 50 bootstrap samples. The per-split penalty α is first set to be 2, and analysis is repeated with α = 4.

Figure 2 shows the tree constructed from original scores with α set to be 2; subjects with better response were included in the left daughter nodes. We see that the root node was split on baseline viral load, with low levels associated with better response. For subjects with low baseline viral load, baseline mutation at RT184 predicted worse response, as did mutation at RT210 for those with high viral load. Subjects with low viral load and mutant RT184 were split based on RT69, and such subjects with wild-type RT69 were further split based on RT219. The mean profile for trajectories in each terminal node is shown in Figure 1 of the web-based supplementary material available at Biostatistics online. The alternative scores gave a very similar tree (not shown), but did not contain the split based on RT210 for subjects with high baseline viral loads. A smaller tree is expected, since the alternative score contains less information than did the original, but the 2 types of scores gave a quite consistent picture about resistant mutations. When α was set to be 4, the resulting tree was unchanged for both scores.

Fig. 2.

Tree for DMP-266 studies based on viral load (VL) of the first 12 weeks.

When viral load trajectories in the first 16 weeks are studied, setting α = 2 and using the original scores gave a similar tree (Figure 3) to that for response in the first 12 weeks. The tree has one fewer split—that associated with RT69. The mean profile for trajectories in each terminal node is shown in Figure 2 of the web-based supplementary material available at Biostatistics online. The alternative scores, on the other hand, did not give any split. When α is set at 4, both types of scores gave the tree with no split. We see that the association between baseline viral load and mutations with viral response during the 16-week period was weaker. This may reflect the impact of new resistance mutations after study entry.

Fig. 3.

Tree for DMP-266 studies based on viral load (VL) of the first 16 weeks.

As suggested by a referee we also carried out a traditional variable selection procedure using GEE models. All predictors significant at 0.10 level in single-covariate models were included in a backward elimination process at the 0.05 significance level. Similar models were obtained for both the 12-week and the 16-week periods, which show that high baseline viral load along with mutations at RT75 and RT101 were associated with poor outcome, while mutation at RT106 was associated with favorable outcome. One cause of the difference in results from this approach and our recursive partitioning model could be the additive versus interaction nature of the methods. Also, the proposed recursive partitioning method gives a predictive model, while GEE is used to find association between predictors and the outcome.


We have proposed a recursive partitioning method to correlate baseline variables, which could be of high dimension, with the entire observed trajectory of a longitudinally measured marker. No functional form is assumed for the marker trajectory, and no parametric assumptions are made on the marker distribution. This method is especially useful for markers like the viral load in HIV/AIDS studies where parametric modeling can be very complicated.

The definition of the U-type score is very flexible and can depend on the nature of specific markers and the goals of analysis. We explored different ways of defining the score for markers expected to fall (e.g. viral load) and those expected to rise (e.g. CD4 T-cell count). We can also incorporate qualitative information like suppression and rebound.

We showed through simulation that our method works very well for modest sample sizes. The pruning process is able to simultaneously identify signals in the data and exclude noise. Choice of per-split penalty α = 2 or 4 had little impact on trees—an observation that also applied in the analysis of DMP-266 data. The choice of α did not affect the resulting tree for the strong signal observed during the first 12 weeks of the study. During that time period, we see that similar trees were generated from the original and alternative scores, despite the very different score definitions. During the 16-week period in which the associations appeared to be somewhat weaker, the original scores, containing more information, led to a greater number of splits, as did the choice of α = 2 compared to α = 4.

In the development of the score, we assumed that longitudinal marker measurements are missing completely at random. More general missing mechanism will be explored in future research. When viral loads are missing at random, the missingness mechanism will be modeled, and an inverse probability-weighted version of D(i,j) can be defined.

In Section 3, we see that the power of the proposed method is low for some scenarios when n ≤ 100. The score comparing 2 viral load observations (Y1,t1) and (Y2,t2) from 2 different subjects is set to 0 if Y1 < Y2 but t1 > t2. This score avoids specification of the mean trajectory and parametric assumptions for the longitudinal viral load but does not utilize all information contained in such pair of observations. Making use of such information, when possible, could improve the power of the proposed method. Further research on this issue will use smoothing techniques in the calculation of scores.


National Institutes of Health (AI51164).

Supplementary Material

Supplementary Data:


The authors would like to thank Bristol-Myers Squibb Company for providing the DMP 266 data, and 2 reviewers and the associate editor for helpful suggestions and comments. Conflict of Interest: None declared.


  • Bacheler L, Anton E, Kudish P, Baker D, Bunville J, Krakowski K, Bolling L, Aujay MXW, Ellis D, Becker M. and others. Human immunodeficiency virus type 1 mutations selected in patients failing efavirenz combination therapy. Antimicrobial Agents and Chemotherapy. 2000;44:2475–2484. [PMC free article] [PubMed]
  • Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J. Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proceedings of the National Academy of Sciences of the United States of America. 2002;99:8271–8276. [PubMed]
  • Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall/CRC Press; 1984.
  • Foulkes AS, DeGruttola V. Characterizing the relationship between HIV-1 genotype and phenotype: prediction based classification. Biometrics. 2002;58:145–156. [PubMed]
  • Foulkes AS, DeGruttola V, Hertogs K. Combining genotype groups and recursive partitioning: an application to human immunodeficiency virus type 1 genetics data. Applied Statistician. 2004;53:311–323.
  • Johnson VA, Brun-Vézinet F, Clotet B, Günthard HF, Kuritzkes DR, Pillay D, Schapiro JM, Richman DD. Update of the drug resistance mutations in HIV-1: December 2009. Topics in HIV Medicine. 2009;17:138–145. [PubMed]
  • LeBlanc M, Crowley J. Survival trees by goodness of split. Journal of the American Statistical Association. 1993;88:457–467.
  • Lee SK. On classification and regression trees for multiple responses and its application. Journal of Classification. 2006;23:123–141.
  • May S, DeGruttola V. Nonparametric tests for dependent observations obtained at varying time points. Biometrics. 2007;63:194–200. [PubMed]
  • Segal MR. Tree-structured methods for longitudinal data. Journal of the American Statistical Association. 1992;87:407–418.
  • Serfling RJ. Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
  • Sevin AD, De Gruttola V, Nijhuis M, Schapiro JM, Foulkes AS, Para MF, Boucher CA. Methods for investigation of the relationship between drug-susceptibility phenotype and human immunodeficiency virus type-1 genotype with applications to AIDS clinical trials group 333. Journal of Infectious Diseases. 2000;182:59–67. [PubMed]
  • Zhang HP. Classification trees for multiple binary responses. Journal of the American Statistical Association. 1998;93:180–193.
  • Zhang HP, Singer B. Recursive Partitioning in the Health Sciences. New York: Springer; 1999.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press