Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC3169671

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. METHODS
- 3. SIMULATION STUDIES
- 4. PHASE II EFAVIRENZ STUDIES
- 5. DISCUSSION
- SUPPLEMENTARY MATERIAL
- FUNDING
- Supplementary Material
- References

Authors

Related links

Biostatistics. 2011 October; 12(4): 750–762.

Published online 2011 May 19. doi: 10.1093/biostatistics/kxr011

PMCID: PMC3169671

Chengcheng Hu^{*}

Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, AZ 85724, USA, Email: ude.anozira.liame@ccuh

Victor Degruttola, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA;

Received 2010 July 20; Revised 2011 April 11; Accepted 2011 April 20.

Copyright © The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

This article has been cited by other articles in PMC.

Development of human immunodeficiency virus resistance mutations is a major cause of failure of antiretroviral treatment. We develop a recursive partitioning method to correlate high-dimensional viral sequences with repeatedly measured outcomes. The splitting criterion of this procedure is based on a class of *U*-type score statistics. The proposed method is flexible enough to apply to a broad range of problems involving longitudinal outcomes. Simulation studies are performed to explore the finite-sample properties of the proposed method, which is also illustrated through analysis of data collected in 3 phase II clinical trials testing the antiretroviral drug efavirenz.

Highly active antiretroviral therapy (HAART), which usually consists of 3 or more different antiretroviral drugs, has been very effective in suppressing the replication of human immunodeficiency virus (HIV) and slowing the progression to acquired immune deficiency syndrome (AIDS). After HAART initiation, viral load, which is quantified as the number of HIV RNA copies per milliliter of blood plasma, quickly declines. Under the selective pressure of antiretroviral drugs, however, viruses may develop drug resistance mutations, resulting in a slower decline or even a rebound of the viral load. These mutations tend to persist after change in therapy and impact the effect of subsequent treatment regimens, as there is considerable cross-resistance among drugs within the same drug class. Furthermore, minority species of virus with drug resistance mutations, present since initial infection, may emerge as the dominant species under drug pressure. Antiretroviral resistance is a major cause of treatment failure. In clinical studies of HIV-infected patients, resistant mutations are generally monitored by sequencing the relevant portions of the HIV genome; in this paper, we focus on the reverse transcriptease (RT) region of the HIV genome, corresponding to the classes of drugs used in the studies that motivated our research. These genetic regions have a total length of several hundred codons, and high dimensionality complicates the effort to understand the relationship between HIV genotype and viral response to therapy. To apply traditional statistical methods to viral genetic data, we need to reduce the dimensionality of the viral genome and to classify the genetic sequences into a small number of relatively homogeneous classes. Such method may also be useful in stratification of patients in clinical trials and in guiding treatment choices.

Recursive partitioning is of particular interest in the study of HIV-resistant mutations. Such methods naturally accommodate high-dimensional data, allow exploration of complex interactions of different codons, and provide easily interpretable results. For an introduction to recursive partitioning methods, the readers are referred to the seminal work of Breiman *and others* (1984), while Zhang and Singer (1999) gives a very nice summary of more recent developments. This method has been employed to investigate the association between drug-resistant mutations and a single measurement of viral load or 50% inhibitory concentration (IC_{50}) of a certain drug, which is a drug susceptibility phenotype. See Sevin *and others* (2000), Foulkes and DeGruttola (2002), Foulkes *and others* (2004), and Beerenwinkel *and others* (2002) for details.

Recursive partitioning methods have also been developed for repeatedly measured responses. Zhang (1998) studied multiple binary responses with joint probability distribution from the exponential family. The method of Segal (1992) for longitudinal responses requires the observations to be spaced equally in time, and the covariance structure needs to be modeled. Lee (2006) proposed a recursive partitioning method based on generalized estimating equation (GEE) models, which also requires a common set of observation time points for all study participants or the covariance structure needs to be modeled.

In this article, we develop a tree-based method to classify HIV genetic sequences with respect to longitudinal outcome measures such as plasma HIV-1 RNA (viral load) and CD4 count. Although in most studies, all participants follow the same schedule of clinical visits, at which the biomarkers will be measured, the actual visit time could be weeks or even months away from the scheduled time. Also, trajectory of a patient's longitudinal viral load can be quite erratic and parametric modeling could be very difficult. As examples, Figure 1 shows viral RNA trajectories of 6 subjects who started combination therapy containing the drug efavirenz. Two subjects, represented by solid curves, kept the viral load suppressed below 400 copies/ml (the horizontal line) since a few weeks after treatment initiation. Two other subjects, represented by broken curves, initially had suppressed viral load but then it rebounded back to high levels. Viral load of the remaining 2 subjects (dotted curves) was never suppressed. We see it would be challenging to find an appropriate parametric model for viral load trajectories—the variable is generally measured at different time points for different subjects, and the mean structure is hard to model. The method we propose in Section 2 is fully nonparametric and does not require modeling the mean and covariance structure; it also allows irregular times of measurement. Performance of the method is studied by simulation in Section 3 under realistic settings with moderate sample sizes, and in Section 4, the method is applied to data from 3 phase II clinical trials, from which the 6 subjects shown in Figure 1 are selected. Discussions and potential extensions are relegated to Section 5.

For a study of *n* subjects, assume that the viral load of the *i*th subject (*i* = 1,…,*n*) was measured at baseline *t*_{i0} = 0 and *n*_{i} subsequent time points 0 < *t*_{i1} < < *t*_{i,ni}, with values *Y*_{i0},*Y*_{i1},…,*Y*_{i,ni}, respectively. Viral load declines in the first few months of successful treatment; we restrict our attention to this phase of the study to focus on therapeutic efficacy. Similar to May and DeGruttola (2007), for 2 different subjects indexed by *i* and *j* and one viral measurement of each person measured at *t*_{ik} > 0 and *t*_{jl} > 0, respectively, we define

(1)

This score compares 2 outcome measurements of 2 different subjects; if the *i*th subject achieved a lower viral load at *t*_{ik} than did the *j*th subject at time *t*_{jl} ≥ *t*_{ik}, a score of 1 is assigned. On the other hand, if the *i*th subject had a higher viral load at *t*_{ik} than did the *j*th subject at time *t*_{jl} ≤ *t*_{ik}, a score of − 1 is assigned. Since viral load is expected to decline, we assign a score of 0 if a subject achieved a lower viral load level at a later time than did another subject, as we are unable to judge which performance was superior.

To compare the rates of decline for any 2 subjects, we sum over scores of all pairwise comparisons between their viral load measurements

(2)

This score compares the full viral load trajectories of the 2 subjects. A positive score implies that the *i*th subject had a better response to the therapy than did the *j*th subject.

When the viral loads of 2 subjects follow exactly the same downward trend and the observations at different times follow the same process, we can easily show that the *U*-type score comparing them has expectation 0. Prospective studies, including clinical trials, normally require the study participants to visit clinics at a fixed set of time points for collection of laboratory and clinical data. The date at which the visit is scheduled depends on the convenience of both the study subject and the clinic staff, and hence could differ from the pre-set timeline by a few days or even weeks. For the *i*th subject, let *t*_{ik}^{*} be the *k*th time point (*k* = 0,…,*m*) of scheduled clinical visit. Note that {*t*_{i0}^{*},…,*t*_{im}^{*}} could differ from the {*t*_{i0},…,*t*_{i,ni}} since the subject might miss one or more of the clinical visits. Let Δ_{ik} be an indicator of whether the *i*th subject had the *k*th scheduled visit. In the decline phase of an HIV study, we can generally assume that {*t*_{i0}^{*},…,*t*_{im}^{*}} and {Δ_{i0},…,Δ_{im}} are identically and independently distributed across all study participants. We also assume that the visit indicators are independent of all other data, that is, visits are missing completely at random.

Let *Y*_{ik}^{*} be the (possibly unobserved) viral load of the *i*th subject at the scheduled time *t*_{ik}^{*}. It can be seen that *D*(*i*,*j*) = (*n*_{i}*n*_{j})^{ − 1}∑_{k = 1}^{m}∑_{l = 1}^{m}Δ_{ik}Δ_{jl}*D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})}. If the viral load process is the same for both the *i*th subject and the *j*th subject, viral load data of the *i*th and *j*th subjects are identically distributed. For any *k* and *l*, it can be easily seen that * E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})} = * E* *D*{(*Y*_{jk}^{*},*t*_{jk}^{*}),(*Y*_{il}^{*},*t*_{il}^{*})} = − * E* *D*{(*Y*_{il}^{*},*t*_{il}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})}. In the special case of *k* = *l*, this reduces to * E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})} = 0. Since the visit indicators Δ_{ik}'s are assumed to be independent of all other data, we get E*D*(*i*,*j*) = 0.

When the viral load processes for subjects *i* and *j* are different, assume *Y*_{ik}^{*} = *μ*_{i}(*t*_{ik}^{*}) + *ϵ*_{i}(*t*_{ik}^{*}) and *Y*_{jl}^{*} = *μ*_{j}(*t*_{jl}^{*}) + *ϵ*_{j}(*t*_{jl}^{*}), where *μ*_{i}(·) and *μ*_{j}(·) are deterministic functions representing the means of the time-varying viral loads for the 2 subjects, and *ϵ*_{i}(·) and *ϵ*_{j}(·) are independently and identically distributed mean-zero error processes that are also independent of all other data. If *μ*_{j}(*t*) > *μ*_{i}(*t*) for all positive time *t* in the range we consider, we can easily prove that for 1 ≤ *k*≠*l* ≤ *m*, * E* [*D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})} + *D*{(*Y*_{il}^{*},*t*_{il}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})}] > 0, and for 1 ≤ *k* ≤ *m*, * E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})} > 0. We then have E*D*(*i*,*j*) > 0. Thus, we can be sure that when the viral load of subjects *i* and *j* are identically distributed, the expectation of *D*(*i*,*j*) is 0, and if the viral load of the *i*th subject has a mean curve lower than that of the *j*th subject, the expectation of *D*(*i*,*j*) is positive. These properties demonstrate that *D*(·,·) provides a valid pairwise comparison measure between any 2 subjects.

Flexibility in the definition of the pairwise comparison score allows accommodation of special characteristics of the outcome measures. For example, in a study with longitudinal viral loads as the outcomes, we can revise the score *D*(·,·) to take into account information about viral response. If a subject fails to attain viral load suppression below a certain threshold in the first few months of receiving a new therapy, the viral load is unlikely to become suppressed later on the same treatment, and the therapy is considered to have failed. If one subject always has higher viral load levels than does another, but neither attains viral load levels below a fixed threshold, both are considered treatment failures and we may not want to distinguish between them. Also, if the viral load of a subject who complies with therapy initially becomes suppressed but rebounds later, viral load is unlikely to be suppressed again on the same treatment, which is also considered to have failed. Again we might not want to distinguish between 2 such subjects. Regardless of the nature of treatment failure, it may be most appropriate to consider any subject who failed to have had a worse response than any other subject who did not fail, within the same interval of time.

Investigators may choose to consider subjects whose viral loads are never suppressed to have worse treatment response than those who have an initial viral suppression followed by a rebound. Alternatively, because subjects of the latter description are at more risk of developing new resistance mutations than are those who never responded to treatment, we may consider transient response to be worse than no response. The choice may depend on the goal of the analysis: focusing on biological activity would lead to different choices for the score definition than focusing on clinical benefit.

Viral suppression usually occurs within a few weeks after initiation of an effective new treatment. In HIV/AIDS studies, the most common thresholds for viral suppression are 400 copies/ml or 50 copies/ml, depending on assay characteristics. Then definition of viral rebound varies across studies; in the studies we consider, the definition was 2 consecutive viral loads of over 400 copies/ml (or 50 copies/ml) or one single viral load of over 4000 copies/ml after an initial viral suppression. For the *i*th subject, let

Now, we can define an alternative score comparing 2 different subjects indexed by *i* and *j*:

(3)

where *D*(*i*,*j*) is defined in (2). In this definition, an initial viral suppression followed by rebound is considered a response superior to no suppression.

Another marker of interest in HIV studies is CD4 T-lymphocyte count, which is expected to rise after initiation of an effective therapy. A more rapid rise is considered to be a more favorable response; hence, a score can be defined as in (1), reversing the directions of the inequalities involving the outcomes.

Here, we introduce a recursive partitioning method to correlate baseline variables of potentially high dimensions, like viral genetic mutations, with trajectories of a longitudinal biomarker, like viral load. A splitting criterion based on the *U*-type score *D*(·,·) defined in (2) measures the difference between 2 mutually exclusive groups of study subjects. Based on a viral genetic mutation or other baseline variable chosen by the splitting criterion, the cohort of all subjects (defined as the root node) is divided into 2 subgroups, called daughter nodes. The same splitting rule is then applied recursively to each daughter node to build a large tree. A pruning procedure is then developed to remove some branches to obtain a tree with proper size. The methods are motivated by those of Breiman *and others* (1984) and also by the survival tree methodology of LeBlanc and Crowley (1993), who used the log-rank statistic as the splitting criterion.

For a study of *n* subjects, assume all had the relevant genetic sequence at study entry. For the *i*th subject (*i* = 1,…,*n*), let *Z*_{ir} be an indicator of mutation at the *r*th codon (*r* = 1,…,*R*). Let *v* be an arbitrary node, which is a subset of study participants. For any *r*, let *w*_{r} be the subset of *v* with all study subjects having a mutation at the *r*th codon: *w*_{r} = {*i**v*:*Z*_{ir} = 1}. Thus, *v*\*w*_{r} consists of subjects with a wild-type *r*th codon. Let

where is an estimator for the standard deviation of the sum of *U*-type scores in the formula above. As for general tree methods, a binary split of *v* can also be defined by dichotomizing a baseline continuous or ordinal variable *Z*, where the 2 daughter nodes are defined as {*Z* ≥ *z*} and {*Z* < *z*} for a threshold *z* or the split can be defined by dichotomizing a baseline nominal (unordered categorical) variable *W*, where the 2 daughter nodes are defined as {*W**A*} and {*W*¬*A*} for a subset *A* of all levels of *W*.

The variance estimator can be obtained from general theory on two-sample *U*-statistics, which is covered in textbooks like Serfling (1980). For an arbitrary binary split of *v* into *w* and *v*\*w*, let *m*_{1} = |*w*| and *m*_{2} = |*v*\*w*|, the sizes of *w* and *v*\*w*, respectively, and let *m* = *m*_{1} + *m*_{2}. Define *U*_{w} = (*m*_{1}*m*_{2})^{ − 1}∑_{iw}∑_{jv\w}*D*(*i*,*j*). We have (*U*_{w} − * E* *U*_{w})→_{d}*N*(0,*σ*_{10}^{2}/*p* + *σ*_{01}^{2}/(1 − *p*)), where *σ*_{10}^{2} = Cov{*D*(*i*,*j*),*D*(*i*,*j*^{′})}, *σ*_{01}^{2} = Cov{*D*(*i*,*j*),*D*(*i*^{′},*j*)}, and *p* is the limit of *m*_{1}/*m*. Here, *i*,*i*^{′}*w*, and *i*≠*i*^{′}, while *j*,*j*^{′}*v*\*w*, and *j*≠*j*^{′}. These quantities can be easily estimated from the data: , and .

As mentioned earlier, a binary split can be defined based on any ordered or unordered baseline variable (Breiman *and others*, 1984). For brevity of notation, however, we assume that there is no baseline variable other than those mutation indicators. We now split the node *v* based on the mutation status at the *r*^{*}th codon, where *G*(*r*^{*},*v*) = max_{1 ≤ r ≤ R}*G*(*r*,*v*). The 2 daughter nodes are then *w*_{r*} and *v*\*w*_{r*}, with *G*(*v*)*G*(*r*^{*},*v*) a goodness of split measure indicating the difference in viral load trajectories of the 2 daughter nodes. We apply this procedure first to the root node of all subjects, and then recursively to the resulting daughter nodes. The procedure does not stop until every terminal node satisfies one of the criteria: the node size is smaller than a prespecified number, say 5; the node contains only homogeneous subjects with respect to the splitting criterion, that is, *G*(*r*,·) = 0 for all 1 ≤ *r* ≤ *R* or all subjects in the node have the same baseline viral sequence. This splitting process generally results in a large tree, and pruning is needed to avoid overfitting.

To describe the pruning procedure, we introduce some notation. For any tree or any branch of a tree *T*, $\stackrel{~}{T}$ denotes the set of all terminal nodes of *T*, and $\stackrel{o}{T}T\backslash \stackrel{~}{T}$ denotes all internal nodes of *T*. When *T*_{1} is a subtree of *T*_{2}, we write *T*_{1}*T*_{2}, while *T*_{1}*T*_{2} specifies that *T*_{1} is a proper subtree of *T*_{2}. Also *T*_{0} denotes the large tree obtained in the splitting procedure.

The goodness of split measure for a tree or a branch of a tree *T*, $G\left(T\right)={\sum}_{v\stackrel{o}{T}}$, is the sum of goodness of split measures for all internal nodes of *T*. When further splits are made, *G*(*T*) can only increase. As in Breiman *and others* (1984) and LeBlanc and Crowley (1993), we add a penalty term to the splitting criterion, penalizing the complexity of the tree. For *α* ≥ 0, define ${G}_{\alpha}\left(T\right)=G\left(T\right)-\alpha \left|\stackrel{o}{T}\right|$. Thus, *α* is the penalty deducted from the splitting criterion for each additional split. Our goal is to find a subtree of *T*_{0} that maximizes *G*_{α}(·) for any *α* ≥ 0. As pointed out by LeBlanc and Crowley (1993), *α* can be chosen between 2 and 4. Using *α* = 2 is in the same spirit of the Akaike information criterion, while a penalty of *α* = 4 is roughly equivalent to setting the significance level at 0.05 for each split.

For any tree *T*, if *T*^{′}*T* and *G*_{α}(*T*^{′}) = max_{T*T}*G*_{α}(*T*^{*}), we call *T*^{′} an optimally pruned subtree of *T* for the penalty value *α*. For *α* ≥ 0, let *T*(*α*) denote the smallest optimally pruned subtree of *T* with respect to *α*. For any node *v* of *T*, let *T*_{v} be the branch of *T* rooted at *v*. As in Breiman *and others* (1984), we can establish the existence of *T*(*α*) by induction on |*T*| and further prove that *T*(*α*) = {*v**T*:*G*_{α}(*T*_{v′}) > 0foranyancestor*v*^{′}of*v*}.

Note that *T*_{0}(0) = *T*_{0}. To identify *T*_{0}(*α*) for all *α* ≥ 0, for any node *v**T*_{0}, we define

Here, *T*_{0v} is the branch of *T*_{0} rooted at *v*. Let *α*_{1} = min_{vT0}*g*_{0}(*v*). Then *T*_{0} = *T*_{0}(*α*) for all *α* < *α*_{1} and *T*_{1} = *T*_{0}(*α*_{1}) can be obtained by removing from *T*_{0} every branch *T*_{0v} with *g*_{0}(*v*) = *α*_{1}. Now, for any *v**T*_{1}, define

Let *α*_{2} = min_{vT1}*g*_{1}(*v*). Then *T*_{1} = *T*_{0}(*α*) for *α*_{1} ≤ *α* < *α*_{2} and *T*_{2} = *T*_{0}(*α*_{2}) can be obtained by removing from *T*_{1} every branch *T*_{1v} with *g*_{1}(*v*) = *α*_{2}. This process is repeated until for some *S* > 0, *T*_{S} is reduced to the tree containing the root node of *T*_{0} only. We then have a sequence of trees *T*_{S}*T*_{1}*T*_{0} and a sequence of numbers *∞* = *α*_{S + 1} > *α*_{S} > > *α*_{1} > *α*_{0} = 0 so that *T*_{s} = *T*_{0}(*α*) for *α*_{s} ≤ *α* < *α*_{s + 1}.

Like LeBlanc and Crowley (1993), we define the cost for a node *v* as $R\left(v\right)={\sum}_{{v}^{\prime}{\stackrel{o}{T}}_{0v}}$ if $v\stackrel{o}{{T}_{0}}$, and 0 otherwise. For any *T**T*_{0}, let $R\left(T\right)={\sum}_{v\stackrel{~}{T}}$ and ${R}_{\alpha}\left(T\right)=R\left(T\right)+\alpha \left|\stackrel{~}{T}\right|$. This implies that *G*_{α}(*T*) = *G*(*T*_{0}) + *α* − *R*_{α}(*T*) and our proposed pruning process can be described in terms of *R*_{α}(·) instead of *G*_{α}(·), and all optimal properties of the pruning process can be proved as in Chapter 10 of Breiman *and others* (1984).

Since *G*_{α} is calculated from the same data used to grow and prune the tree, it is a biased estimate of the expected *G*_{α} for the same tree applied to a new independent data set. If the sample size is large, we can randomly split the data into a training sample and a test sample. A tree will be grown and pruned on the training sample, and the goodness of split measures can be calculated using the test sample, resulting in an unbiased estimate of *G*_{α}.

Data sets of moderate size do not allow for separate training and test samples, so resampling methods are used to calculate a proper estimate of *G*_{α} as in LeBlanc and Crowley (1993). Let *X*_{1} and *X*_{2} be 2 samples of data, and let *G*(*X*_{1};*X*_{2},*T*) be the *G* statistic calculated from *X*_{1} for a tree *T* built on *X*_{2}. Then *G*(*T*) = *G*(*X*;*X*,*T*) for the data set *X*, and what we need is an estimator for *G*^{*}(*T*) = *E*{*G*(*X*^{*};*X*,*T*)}, where *X*^{*} is a sample independent of *X*. Let *α*_{s}^{′} = for 1 ≤ *s* ≤ *S* − 1. We use the following resampling method to identify the proper tree size. First, draw *B* bootstrap samples from *X* and denote them by *X*_{b}(*b* = 1,…,*B*). Then grow a large tree *T*_{b} for each *b*, and for each *s* find *T*_{b}(*α*_{s}^{′}), the optimally pruned subtree of *T*_{b} for the per split penalty *α*_{s}^{′}. The overoptimism in *G*(*T*_{b}(*α*_{s}^{′})) can be estimated by *O*_{bs} = *G*(*X*_{b};*X*_{b},*T*_{b}(*α*_{s}^{′})) − *G*(*X*;*X*_{b},*T*_{b}(*α*_{s}^{′})). Then is a reasonable estimate for the overoptimism in *G*(*T*(*α*_{s}^{′})). For any fixed per split penalty *α*, we can then choose the optimal subtree as the *T*(*α*_{s}^{′}) that maximizes is maximized at *s* = *S* − 1 and , the optimal subtree contains only the root node.

Simulation studies are carried out under various settings. In each setting, *n* = 200 subjects have baseline (*t* = 0) viral load and viral genetic sequences and are scheduled to have 5 viral load measurements at *t* = 1,2,…,5. Thus, *n*_{i} = 5 for *i* = 1,…,*n*. For any scheduled *t* > 0, the actual measurement time follows a uniform distribution on [*t* − 0.25,*t* + 0.25]. For *r* = 1,…,30, let *C*_{ir} be the indicator of mutation at the *r*th codon, which are all independent in the simulation. Let *Y*_{ik} be the log_{10} viral load at time *t*_{ik} for the *i*th subject. In all scenarios, we assume *Y*_{ik} = *β*_{0} + *β*_{1i}*t*_{ik} + *ϵ*_{ik}, where the random error *ϵ*_{ik}~*N*(0,*σ*^{2}) is independent of other variables. We set *σ* = log_{10}2, so that a change of one standard deviation in the error is equivalent to doubling or halving the viral load on the original scale. The baseline mean *β*_{0} is set to be log_{10}(50000), close to the median observation in the efavirenz studies to be analysed in the next section. The slope *β*_{1i} depends on the resistance status of each patient. All simulated viral loads are imputed at 50, reflecting the threshold of accurate quantification for commonly used assays. At any split the left daughter node is associated with better response. Data are simulated under 4 general scenarios: I.

- I.Each codon has a 0.5 chance to be mutant and none is resistant. Two values of the slope
*β*_{1i}are simulated;*β*_{1i}= − log_{10}(2) (rapid decline) and*β*_{1i}= − log_{10}(2)/2.5 (slow decline). Here, the correct tree is one with root node only. II. - II.Each codon has a 0.5 chance to be mutant, and a mutant first codon confers resistance to the study drug. For resistant subjects
*β*_{1i}= 0, and for other subjects sensitive to drugs, we simulate 2 separate settings of rapid decline (*β*_{1i}= − log_{10}(2)) and slow decline (*β*_{1i}= − log_{10}(2)/2.5). Here, the correct tree has a single split on codon 1. III. - III.Each codon has a 0.5 chance to be mutant, and a subject is resistant to drugs only if the first and second codons are both mutant. Hence, about 25% of all subjects are resistant. For resistant subjects, the slope
*β*_{1i}= 0, and for the sensitive subjects, we simulate 2 separate settings of rapid decline (*β*_{1i}= − log_{10}(2)) and slow decline (*β*_{1i}= − log_{10}(2)/2.5). Here, the correct tree has 2 splits on the first 2 codons in either order, and the second split should be at the right daughter node of the root. IV. - IV.The first codon has probability of 1/3 to be mutant and the other codons have 0.5 chance to mutant. A subject is resistant if
*C*_{i1}= 1 with slope*β*_{1i}= 0. A subject is sensitive to drugs if*C*_{i1}=*C*_{i2}= 0, with the slope*β*_{1i}set to be negative. If*C*_{i1}= 0 and*C*_{i2}= 1, the viral load declines initially at the first 3 time points after baseline and then rebounds to baseline level. For these rebounding subjects, the slope*β*_{1i}is negative for*k*≤ 3 and*β*_{1i}= 0 for*k*= 4,5. Under this scenario, the cohort is divided into 3 subsets of approximately equal sizes: resistant, sensitive, and rebounding. Again we simulate 2 separate settings of rapid decline (*β*_{1i}= − log_{10}(2)) and slow decline (*β*_{1i}= − log_{10}(2)/2.5) for the sensitive subjects and for the initial decline phase of the rebounding subjects. Multiple trees are considered correct: root node is split on codon 1 and its left daughter node is split on codon 2, root node is split on codon 2 and its left daughter node is split on codon 1, and also this tree with an extra split added to the right daughter node based on codon 1.

One thousand data sets are generated for each setting. In the pruning process, we use 50 bootstrap samples to calculate the overoptimism in the goodness of split measure. Two values are used for the per-split penalty (*α* = 2 and 4). Table 1 provides the proportions of getting various types of trees for each setting. The correct trees under all scenarios have been described above. When there should be at least one split but the resulting tree has none we refer to it as a null tree. A noisy tree contains all correct split(s) and also some extra noise. A partially correct tree (denoted by “partial” in the table) contains some but not all correct splits, and an incorrect tree has none of the correct split(s) but some noise.

For the first scenario of no resistance mutations, the method performs well for all settings, with > 95% power to obtain the correct tree. Not surprisingly, the performance is slightly better in the slow decline setting, where observations at different time points are closer to each other. For the second scenario with one resistant mutation, the method performs well in all settings, with > 97% power to obtain the correct tree. For the small number of data sets where the correct tree is not obtained, all selected trees have the root node correctly split on codon 1, with some added noise. For the third scenario with interaction effect of 2 codons, the power to identify the correct tree is very high (96%) in the rapid decline setting. In the slow decline setting, viral load trajectories of resistant and sensitive subjects are not as well separated as in the rapid decline setting, and the power is moderately high ( > 80%). For the fourth scenario with resistant, sensitive and rebounding subjects, the power is very high ( > 97%) in the rapid decline setting and is moderately high ( > 76%) in the slow decline setting. The 2 penalty values *α* = 2 or 4 induced similar performance in all scenarios. From Table 1, it can be seen that the average tree sizes are close for these 2 *α* values.

Simulation studies are also carried out for the smaller sample sizes of *n* = 100 and *n* = 30, and results are summarized in Section 1 of the web-based supplementary material available at *Biostatistics* online. For *n* = 100, the proposed method still performs well except for the slow decline settings of scenarios III and IV, where the power is low to detect the resistant mutations. This is due to the small number of subjects with resistant mutations (e.g., 25 in scenario III). When *n* = 30 we see some further reduction in the power to detect resistant mutations.

For scenarios II and III, we also study through simulation the power of a parametric approach using the true functional form of the viral load over time. For each node, a GEE model is fitted and a splitting criterion based on the residual sum of squares is used. The algorithm has an almost 100% power to identify the correct resistant codons when *n* = 200 and 100. The power deteriorated substantially only in the slow decline setting of scenario III when *n* = 30. As discussed in Section 1, viral trajectories can be quite complicated as they reflect a complex interaction of viral dynamics, evolutionary dynamics, intercurrent illnesses, and behavior. These factors induce a high level of heterogeneity in the data, so parametric modeling will be very difficult, and sometimes even impossible.

Data from 3 phase II clinical trials of efavirenz (DMP266-003, 004 and 005) were analyzed using the proposed methods. Most study participants were previously exposed to nucleoside reverse transcriptase inhibitors (NRTIs), often leading to development of various resistance mutations to NRTIs. The drug under study belongs to the drug class of nonnucleoside reverse transcriptase inhibitors (NNRTIs) and proved to be useful in construction of potent regimens. Because most study participants received efavirenz in combination with 2 NRTIs, baseline NRTI mutations could predict response to treatment. The details of these studies can be found in Bacheler *and others* (2000). A total of 156 subjects had viral genetic sequences and viral load available at baseline, and at least one viral load measurement after study entry. The December 2009 version of the International AIDS Society-USA drug resistance mutations list (Johnson *and others*, 2009) specifies 25 NRTI and NNRTI resistance sites, and 12 of them appeared in the baseline data of these 3 studies. They are all considered as potential predictors for future viral load, along with the baseline viral load dichotomized at 50,000 copies/ml, which is close to the median of 55,115 copies/ml. These predictors are summarized in Table 2.

We first investigate plasma HIV-1 RNA trajectories in the first 12 weeks after study entry. The *U*-type score *D*(·,·) defined in (2) serves as the basis of the first analysis; another analysis makes use of the alternative score $\stackrel{~}{D}(\xb7,\xb7)$ defined in (3). In the latter, viral suppression is defined as HIV-1 RNA level below 400 copies/ml, and rebound, as 2 consecutive values above 400 copies/ml or a single one above 4000 copies/ml after an initial suppression. As viral load testing did not always happen on the scheduled day, we allow a 1-week window before and after the scheduled dates. In this way, we include all viral loads measured in the first 13 weeks in the 12-week analysis. In calculation of both scores, viral load measurements of 2 different subjects taken within 7 days of each other were considered to have occurred at the same time. During the period of time under consideration, the median number of viral load measurements was 7, and nearly 95% of subjects had at least 4 viral load measurements. For the alternative score, the numbers of subjects with *V*(*i*) = 0 (viral suppression), 1 (initial viral suppression followed by rebound) and 2 (no suppression) were 55(35%), 37(24%), and 64(41%), respectively. To identify appropriate size of a tree, the resampling method mentioned at the end of Section 2 was used to correct for the overoptimism in *G*(*T*), with 50 bootstrap samples. The per-split penalty *α* is first set to be 2, and analysis is repeated with *α* = 4.

Figure 2 shows the tree constructed from original scores with *α* set to be 2; subjects with better response were included in the left daughter nodes. We see that the root node was split on baseline viral load, with low levels associated with better response. For subjects with low baseline viral load, baseline mutation at RT184 predicted worse response, as did mutation at RT210 for those with high viral load. Subjects with low viral load and mutant RT184 were split based on RT69, and such subjects with wild-type RT69 were further split based on RT219. The mean profile for trajectories in each terminal node is shown in Figure 1 of the web-based supplementary material available at *Biostatistics* online. The alternative scores gave a very similar tree (not shown), but did not contain the split based on RT210 for subjects with high baseline viral loads. A smaller tree is expected, since the alternative score contains less information than did the original, but the 2 types of scores gave a quite consistent picture about resistant mutations. When *α* was set to be 4, the resulting tree was unchanged for both scores.

When viral load trajectories in the first 16 weeks are studied, setting *α* = 2 and using the original scores gave a similar tree (Figure 3) to that for response in the first 12 weeks. The tree has one fewer split—that associated with RT69. The mean profile for trajectories in each terminal node is shown in Figure 2 of the web-based supplementary material available at *Biostatistics* online. The alternative scores, on the other hand, did not give any split. When *α* is set at 4, both types of scores gave the tree with no split. We see that the association between baseline viral load and mutations with viral response during the 16-week period was weaker. This may reflect the impact of new resistance mutations after study entry.

As suggested by a referee we also carried out a traditional variable selection procedure using GEE models. All predictors significant at 0.10 level in single-covariate models were included in a backward elimination process at the 0.05 significance level. Similar models were obtained for both the 12-week and the 16-week periods, which show that high baseline viral load along with mutations at RT75 and RT101 were associated with poor outcome, while mutation at RT106 was associated with favorable outcome. One cause of the difference in results from this approach and our recursive partitioning model could be the additive versus interaction nature of the methods. Also, the proposed recursive partitioning method gives a predictive model, while GEE is used to find association between predictors and the outcome.

We have proposed a recursive partitioning method to correlate baseline variables, which could be of high dimension, with the entire observed trajectory of a longitudinally measured marker. No functional form is assumed for the marker trajectory, and no parametric assumptions are made on the marker distribution. This method is especially useful for markers like the viral load in HIV/AIDS studies where parametric modeling can be very complicated.

The definition of the *U*-type score is very flexible and can depend on the nature of specific markers and the goals of analysis. We explored different ways of defining the score for markers expected to fall (e.g. viral load) and those expected to rise (e.g. CD4 T-cell count). We can also incorporate qualitative information like suppression and rebound.

We showed through simulation that our method works very well for modest sample sizes. The pruning process is able to simultaneously identify signals in the data and exclude noise. Choice of per-split penalty *α* = 2 or 4 had little impact on trees—an observation that also applied in the analysis of DMP-266 data. The choice of *α* did not affect the resulting tree for the strong signal observed during the first 12 weeks of the study. During that time period, we see that similar trees were generated from the original and alternative scores, despite the very different score definitions. During the 16-week period in which the associations appeared to be somewhat weaker, the original scores, containing more information, led to a greater number of splits, as did the choice of *α* = 2 compared to *α* = 4.

In the development of the score, we assumed that longitudinal marker measurements are missing completely at random. More general missing mechanism will be explored in future research. When viral loads are missing at random, the missingness mechanism will be modeled, and an inverse probability-weighted version of *D*(*i*,*j*) can be defined.

In Section 3, we see that the power of the proposed method is low for some scenarios when *n* ≤ 100. The score comparing 2 viral load observations (*Y*_{1},*t*_{1}) and (*Y*_{2},*t*_{2}) from 2 different subjects is set to 0 if *Y*_{1} < *Y*_{2} but *t*_{1} > *t*_{2}. This score avoids specification of the mean trajectory and parametric assumptions for the longitudinal viral load but does not utilize all information contained in such pair of observations. Making use of such information, when possible, could improve the power of the proposed method. Further research on this issue will use smoothing techniques in the calculation of scores.

Supplementary material is available at http://biostatistics.oxfordjournals.org.

National Institutes of Health (AI51164).

The authors would like to thank Bristol-Myers Squibb Company for providing the DMP 266 data, and 2 reviewers and the associate editor for helpful suggestions and comments. *Conflict of Interest:* None declared.

- Bacheler L, Anton E, Kudish P, Baker D, Bunville J, Krakowski K, Bolling L, Aujay MXW, Ellis D, Becker M. and others. Human immunodeficiency virus type 1 mutations selected in patients failing efavirenz combination therapy. Antimicrobial Agents and Chemotherapy. 2000;44:2475–2484. [PMC free article] [PubMed]
- Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J. Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proceedings of the National Academy of Sciences of the United States of America. 2002;99:8271–8276. [PubMed]
- Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall/CRC Press; 1984.
- Foulkes AS, DeGruttola V. Characterizing the relationship between HIV-1 genotype and phenotype: prediction based classification. Biometrics. 2002;58:145–156. [PubMed]
- Foulkes AS, DeGruttola V, Hertogs K. Combining genotype groups and recursive partitioning: an application to human immunodeficiency virus type 1 genetics data. Applied Statistician. 2004;53:311–323.
- Johnson VA, Brun-Vézinet F, Clotet B, Günthard HF, Kuritzkes DR, Pillay D, Schapiro JM, Richman DD. Update of the drug resistance mutations in HIV-1: December 2009. Topics in HIV Medicine. 2009;17:138–145. [PubMed]
- LeBlanc M, Crowley J. Survival trees by goodness of split. Journal of the American Statistical Association. 1993;88:457–467.
- Lee SK. On classification and regression trees for multiple responses and its application. Journal of Classification. 2006;23:123–141.
- May S, DeGruttola V. Nonparametric tests for dependent observations obtained at varying time points. Biometrics. 2007;63:194–200. [PubMed]
- Segal MR. Tree-structured methods for longitudinal data. Journal of the American Statistical Association. 1992;87:407–418.
- Serfling RJ. Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
- Sevin AD, De Gruttola V, Nijhuis M, Schapiro JM, Foulkes AS, Para MF, Boucher CA. Methods for investigation of the relationship between drug-susceptibility phenotype and human immunodeficiency virus type-1 genotype with applications to AIDS clinical trials group 333. Journal of Infectious Diseases. 2000;182:59–67. [PubMed]
- Zhang HP. Classification trees for multiple binary responses. Journal of the American Statistical Association. 1998;93:180–193.
- Zhang HP, Singer B. Recursive Partitioning in the Health Sciences. New York: Springer; 1999.

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |