2.1. Definition of U-type score

For a study of *n* subjects, assume that the viral load of the *i*th subject (*i* = 1,…,*n*) was measured at baseline *t*_{i0} = 0 and *n*_{i} subsequent time points 0 < *t*_{i1} < ⋯ < *t*_{i,ni}, with values *Y*_{i0}, *Y*_{i1},…, *Y*_{i,ni}, respectively. Viral load declines in the first few months of successful treatment; we restrict our attention to this phase of the study to focus on therapeutic efficacy. Similar to May and DeGruttola (2007), for 2 different subjects indexed by *i* and *j* and one viral load measurement of each, taken at *t*_{ik} > 0 and *t*_{jl} > 0, respectively, we define

*D*{(*Y*_{ik},*t*_{ik}),(*Y*_{jl},*t*_{jl})} = 1 if *Y*_{ik} < *Y*_{jl} and *t*_{ik} ≤ *t*_{jl}; = − 1 if *Y*_{ik} > *Y*_{jl} and *t*_{ik} ≥ *t*_{jl}; = 0 otherwise. (1)

This score compares 2 outcome measurements of 2 different subjects; if the *i*th subject achieved a lower viral load at *t*_{ik} than did the *j*th subject at time *t*_{jl} ≥ *t*_{ik}, a score of 1 is assigned. On the other hand, if the *i*th subject had a higher viral load at *t*_{ik} than did the *j*th subject at time *t*_{jl} ≤ *t*_{ik}, a score of − 1 is assigned. Since viral load is expected to decline, we assign a score of 0 if a subject achieved a lower viral load level at a later time than did another subject, as we are unable to judge which performance was superior.
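For illustration, the pairwise comparison just described can be written as a small function. This is a sketch in Python; the function and argument names are ours, not the paper's notation:

```python
def pairwise_score(y_ik, t_ik, y_jl, t_jl):
    """U-type score comparing one measurement from each of two subjects.

    Returns 1 if subject i did better (lower viral load no later in time),
    -1 if subject i did worse, and 0 when the comparison is ambiguous
    (e.g. a lower viral load achieved at a later time, or a tie).
    """
    if y_ik < y_jl and t_ik <= t_jl:
        return 1
    if y_ik > y_jl and t_ik >= t_jl:
        return -1
    return 0
```

Note that the score is antisymmetric in the two subjects, which is what makes the expectation arguments of Section 2.2 work.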

To compare the rates of decline for any 2 subjects, we sum the scores of all pairwise comparisons between their viral load measurements:

*D*(*i*,*j*) = (*n*_{i}*n*_{j})^{ − 1}∑_{k = 1}^{n_{i}}∑_{l = 1}^{n_{j}}*D*{(*Y*_{ik},*t*_{ik}),(*Y*_{jl},*t*_{jl})}. (2)

This score compares the full viral load trajectories of the 2 subjects. A positive score implies that the *i*th subject had a better response to the therapy than did the *j*th subject.
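The trajectory score can be sketched as follows (Python; names are ours, and the vectors passed in are the post-baseline measurements of each subject):

```python
def trajectory_score(y_i, t_i, y_j, t_j):
    """D(i, j): average of all pairwise U-type scores between the
    post-baseline measurements of subjects i and j.

    A positive value favours subject i (faster decline)."""
    total = 0
    for yi, ti in zip(y_i, t_i):
        for yj, tj in zip(y_j, t_j):
            if yi < yj and ti <= tj:
                total += 1          # i strictly better, no later in time
            elif yi > yj and ti >= tj:
                total -= 1          # i strictly worse, no earlier in time
    # divide by n_i * n_j, the number of pairwise comparisons
    return total / (len(y_i) * len(y_j))
```

By construction the score is antisymmetric: swapping the two subjects flips its sign.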

2.2. Properties of U-type score

When the viral loads of 2 subjects follow exactly the same downward trend and the observations at different times follow the same process, we can easily show that the *U*-type score comparing them has expectation 0. Prospective studies, including clinical trials, normally require the study participants to visit clinics at a fixed set of time points for collection of laboratory and clinical data. The actual date of each visit depends on the convenience of both the study subject and the clinic staff, and hence could differ from the pre-set schedule by a few days or even weeks. For the *i*th subject, let *t*_{ik}^{*} be the *k*th time point (*k* = 0,…,*m*) of scheduled clinical visit. Note that {*t*_{i0}^{*},…,*t*_{im}^{*}} could differ from {*t*_{i0},…,*t*_{i,ni}} since the subject might miss one or more of the clinical visits. Let Δ_{ik} be an indicator of whether the *i*th subject made the *k*th scheduled visit. In the decline phase of an HIV study, we can generally assume that {*t*_{i0}^{*},…,*t*_{im}^{*}} and {Δ_{i0},…,Δ_{im}} are independently and identically distributed across all study participants. We also assume that the visit indicators are independent of all other data, that is, visits are missing completely at random.

Let *Y*_{ik}^{*} be the (possibly unobserved) viral load of the *i*th subject at the scheduled time *t*_{ik}^{*}. It can be seen that *D*(*i*,*j*) = (*n*_{i}*n*_{j})^{ − 1}∑_{k = 1}^{m}∑_{l = 1}^{m}Δ_{ik}Δ_{jl}*D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})}. If the viral load process is the same for the *i*th and *j*th subjects, their viral load data are identically distributed. For any *k* and *l*, it can easily be seen that *E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})} = *E* *D*{(*Y*_{jk}^{*},*t*_{jk}^{*}),(*Y*_{il}^{*},*t*_{il}^{*})} = − *E* *D*{(*Y*_{il}^{*},*t*_{il}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})}. In the special case of *k* = *l*, this reduces to *E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})} = 0. Since the visit indicators Δ_{ik} are assumed to be independent of all other data, we get *E* *D*(*i*,*j*) = 0.

When the viral load processes for subjects *i* and *j* are different, assume *Y*_{ik}^{*} = *μ*_{i}(*t*_{ik}^{*}) + *ϵ*_{i}(*t*_{ik}^{*}) and *Y*_{jl}^{*} = *μ*_{j}(*t*_{jl}^{*}) + *ϵ*_{j}(*t*_{jl}^{*}), where *μ*_{i}(·) and *μ*_{j}(·) are deterministic functions representing the means of the time-varying viral loads for the 2 subjects, and *ϵ*_{i}(·) and *ϵ*_{j}(·) are independently and identically distributed mean-zero error processes that are also independent of all other data. If *μ*_{j}(*t*) > *μ*_{i}(*t*) for all positive time *t* in the range we consider, we can easily prove that for 1 ≤ *k* ≠ *l* ≤ *m*, *E* [*D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jl}^{*},*t*_{jl}^{*})} + *D*{(*Y*_{il}^{*},*t*_{il}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})}] > 0, and for 1 ≤ *k* ≤ *m*, *E* *D*{(*Y*_{ik}^{*},*t*_{ik}^{*}),(*Y*_{jk}^{*},*t*_{jk}^{*})} > 0. We then have *E* *D*(*i*,*j*) > 0. Thus, when the viral loads of subjects *i* and *j* are identically distributed, the expectation of *D*(*i*,*j*) is 0, and if the viral load of the *i*th subject has a mean curve lower than that of the *j*th subject, the expectation of *D*(*i*,*j*) is positive. These properties demonstrate that *D*(·,·) provides a valid pairwise comparison measure between any 2 subjects.

2.3. Alternative definition of score

Flexibility in the definition of the pairwise comparison score allows accommodation of special characteristics of the outcome measures. For example, in a study with longitudinal viral loads as the outcomes, we can revise the score *D*(·,·) to take into account information about viral response. If a subject fails to attain viral load suppression below a certain threshold in the first few months of receiving a new therapy, the viral load is unlikely to become suppressed later on the same treatment, and the therapy is considered to have failed. If one subject always has higher viral load levels than does another, but neither attains viral load levels below a fixed threshold, both are considered treatment failures and we may not want to distinguish between them. Also, if the viral load of a subject who complies with therapy initially becomes suppressed but rebounds later, viral load is unlikely to be suppressed again on the same treatment, which is also considered to have failed. Again we might not want to distinguish between 2 such subjects. Regardless of the nature of treatment failure, it may be most appropriate to consider any subject who failed to have had a worse response than any other subject who did not fail, within the same interval of time.

Investigators may choose to consider subjects whose viral loads are never suppressed to have worse treatment response than those who have an initial viral suppression followed by a rebound. Alternatively, because subjects of the latter description are at more risk of developing new resistance mutations than are those who never responded to treatment, we may consider transient response to be worse than no response. The choice may depend on the goal of the analysis: focusing on biological activity would lead to different choices for the score definition than focusing on clinical benefit.

Viral suppression usually occurs within a few weeks after initiation of an effective new treatment. In HIV/AIDS studies, the most common thresholds for viral suppression are 400 copies/ml and 50 copies/ml, depending on assay characteristics. The definition of viral rebound varies across studies; in the studies we consider, it was 2 consecutive viral loads over 400 copies/ml (or 50 copies/ml), or a single viral load over 4000 copies/ml, after an initial viral suppression. For the *i*th subject, let

Now, we can define an alternative score comparing 2 different subjects indexed by *i* and *j*:

where *D*(*i*,*j*) is defined in (2). In this definition, an initial viral suppression followed by rebound is considered a response superior to no suppression.
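One way to realize such an alternative score is sketched below in Python. The three-level `status` encoding (2 = sustained suppression, 1 = suppression followed by rebound, 0 = never suppressed) is our own illustrative device, not the paper's notation; subjects with different statuses are ordered by status alone, and ties fall back to the trajectory score *D*(*i*,*j*):

```python
def alternative_score(d_ij, status_i, status_j):
    """Alternative pairwise score using viral-response status.

    status: 2 = sustained suppression, 1 = suppression then rebound,
    0 = never suppressed (a hypothetical encoding for illustration).
    d_ij: the trajectory score D(i, j) from (2), used to break status ties.
    """
    if status_i > status_j:
        return 1       # subject i had the categorically better response
    if status_i < status_j:
        return -1      # subject j had the categorically better response
    return d_ij        # same response category: compare trajectories
```

Reversing the ordering of statuses 1 and 0 would implement the alternative choice discussed below, in which transient response is considered worse than no response.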

Another marker of interest in HIV studies is CD4 T-lymphocyte count, which is expected to rise after initiation of an effective therapy. A more rapid rise is considered to be a more favorable response; hence, a score can be defined as in (1), reversing the directions of the inequalities involving the outcomes.

2.4. Construction of trees

Here, we introduce a recursive partitioning method to correlate baseline variables of potentially high dimensions, like viral genetic mutations, with trajectories of a longitudinal biomarker, like viral load. A splitting criterion based on the

*U*-type score

*D*(·,·) defined in (2) measures the difference between 2 mutually exclusive groups of study subjects. Based on a viral genetic mutation or other baseline variable chosen by the splitting criterion, the cohort of all subjects (defined as the root node) is divided into 2 subgroups, called daughter nodes. The same splitting rule is then applied recursively to each daughter node to build a large tree. A pruning procedure is then developed to remove some branches to obtain a tree with proper size. The methods are motivated by those of Breiman

*and others* (1984) and also by the survival tree methodology of

LeBlanc and Crowley (1993), who used the log-rank statistic as the splitting criterion.

For a study of *n* subjects, assume all had the relevant genetic sequence measured at study entry. For the *i*th subject (*i* = 1,…,*n*), let *Z*_{ir} be an indicator of mutation at the *r*th codon (*r* = 1,…,*R*). Let *v* be an arbitrary node, which is a subset of study participants. For any *r*, let *w*_{r} be the subset of *v* consisting of all study subjects with a mutation at the *r*th codon: *w*_{r} = {*i* ∈ *v*: *Z*_{ir} = 1}. Thus, *v*\*w*_{r} consists of subjects with a wild-type *r*th codon. Let

*G*(*r*,*v*) = |∑_{i ∈ w_r}∑_{j ∈ v\w_r}*D*(*i*,*j*)|/σ̂,

where σ̂ is an estimator for the standard deviation of the sum of *U*-type scores in the formula above. As for general tree methods, a binary split of *v* can also be defined by dichotomizing a baseline continuous or ordinal variable *Z*, where the 2 daughter nodes are defined as {*Z* ≥ *z*} and {*Z* < *z*} for a threshold *z*, or by dichotomizing a baseline nominal (unordered categorical) variable *W*, where the 2 daughter nodes are defined as {*W* ∈ *A*} and {*W* ∉ *A*} for a subset *A* of all levels of *W*.

The variance estimator σ̂^{2} can be obtained from the general theory of two-sample *U*-statistics, which is covered in textbooks such as Serfling (1980). For an arbitrary binary split of *v* into *w* and *v*\*w*, let *m*_{1} = |*w*| and *m*_{2} = |*v*\*w*|, the sizes of *w* and *v*\*w*, respectively, and let *m* = *m*_{1} + *m*_{2}. Define *U*_{w} = (*m*_{1}*m*_{2})^{ − 1}∑_{i ∈ w}∑_{j ∈ v\w}*D*(*i*,*j*). We have √*m*(*U*_{w} − *E* *U*_{w})→_{d}*N*(0,*σ*_{10}^{2}/*p* + *σ*_{01}^{2}/(1 − *p*)), where *σ*_{10}^{2} = Cov{*D*(*i*,*j*),*D*(*i*,*j*^{′})}, *σ*_{01}^{2} = Cov{*D*(*i*,*j*),*D*(*i*^{′},*j*)}, and *p* is the limit of *m*_{1}/*m*. Here, *i*, *i*^{′} ∈ *w* with *i* ≠ *i*^{′}, while *j*, *j*^{′} ∈ *v*\*w* with *j* ≠ *j*^{′}. These quantities can be easily estimated from the data by their empirical counterparts.
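A moment-based version of this standardization can be sketched as follows. This Python sketch uses the empirical variances of the row and column means of the score matrix as stand-ins for the paper's exact estimators of *σ*_{10}^{2} and *σ*_{01}^{2}, which are not reproduced here:

```python
import math

def split_statistic(scores):
    """Standardized two-sample U-statistic for one candidate split.

    scores is an m1 x m2 list of lists with scores[i][j] = D(i, j) for
    subject i in one daughter node and subject j in the other
    (both nodes must contain at least 2 subjects).
    Returns |U_w| / se(U_w), or 0.0 when the estimated variance is 0.
    """
    m1, m2 = len(scores), len(scores[0])
    u = sum(sum(row) for row in scores) / (m1 * m2)
    # sigma10^2: variability of the per-i conditional means E[D | i]
    row_means = [sum(row) / m2 for row in scores]
    s10 = sum((r - u) ** 2 for r in row_means) / (m1 - 1)
    # sigma01^2: variability of the per-j conditional means E[D | j]
    col_means = [sum(scores[i][j] for i in range(m1)) / m1 for j in range(m2)]
    s01 = sum((c - u) ** 2 for c in col_means) / (m2 - 1)
    se = math.sqrt(s10 / m1 + s01 / m2)
    return abs(u) / se if se > 0 else 0.0
```

Large values of the statistic indicate a split separating subjects with clearly different viral load trajectories.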

As mentioned earlier, a binary split can be defined based on any ordered or unordered baseline variable (Breiman *and others*, 1984). For brevity of notation, however, we assume that there are no baseline variables other than the mutation indicators. We now split the node *v* based on the mutation status at the *r*^{*}th codon, where *G*(*r*^{*},*v*) = max_{1 ≤ r ≤ R}*G*(*r*,*v*). The 2 daughter nodes are then *w*_{r*} and *v*\*w*_{r*}, with *G*(*v*) ≡ *G*(*r*^{*},*v*) a goodness-of-split measure indicating the difference in viral load trajectories between the 2 daughter nodes. We apply this procedure first to the root node of all subjects, and then recursively to the resulting daughter nodes. The procedure does not stop until every terminal node satisfies one of the following criteria: the node size is smaller than a prespecified number, say 5; the node contains only homogeneous subjects with respect to the splitting criterion, that is, *G*(*r*,·) = 0 for all 1 ≤ *r* ≤ *R*; or all subjects in the node have the same baseline viral sequence. This splitting process generally results in a large tree, and pruning is needed to avoid overfitting.
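The recursive splitting step can be sketched as follows (Python; the representation of nodes as lists of subject indices and trees as nested dicts is our own, and `stat_fn` is a placeholder for the standardized *U*-score statistic *G*):

```python
def grow_tree(node, covariates, stat_fn, min_size=5):
    """Recursively split a node (a list of subject indices).

    covariates[i][r] is the 0/1 mutation indicator Z_ir for subject i;
    stat_fn(w, c) returns the goodness-of-split statistic G for candidate
    daughter nodes w and c.  Splitting stops when the node is too small,
    homogeneous at every codon, or no split gives a positive statistic.
    """
    if len(node) < min_size:
        return {"members": node}
    best = None
    n_cov = len(covariates[node[0]])
    for r in range(n_cov):
        w = [i for i in node if covariates[i][r] == 1]
        c = [i for i in node if covariates[i][r] == 0]
        if not w or not c:
            continue  # node is homogeneous at codon r
        g = stat_fn(w, c)
        if best is None or g > best[0]:
            best = (g, r, w, c)
    if best is None or best[0] <= 0:
        return {"members": node}  # terminal node
    g, r, w, c = best
    return {"members": node, "codon": r, "gain": g,
            "left": grow_tree(w, covariates, stat_fn, min_size),
            "right": grow_tree(c, covariates, stat_fn, min_size)}
```

Applying `grow_tree` to the root node of all subjects yields the large tree *T*_{0} that the pruning procedure below operates on.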

2.5. Determination of proper tree size

To describe the pruning procedure, we introduce some notation. For any tree or any branch of a tree *T*, let T̃ denote the set of all terminal nodes of *T* and *T* − T̃ the set of all internal nodes of *T*. When *T*_{1} is a subtree of *T*_{2}, we write *T*_{1} ⪯ *T*_{2}, while *T*_{1} ≺ *T*_{2} specifies that *T*_{1} is a proper subtree of *T*_{2}. Also, *T*_{0} denotes the large tree obtained in the splitting procedure.

The goodness-of-split measure for a tree or a branch of a tree *T*, *G*(*T*), is the sum of the goodness-of-split measures over all internal nodes of *T*. When further splits are made, *G*(*T*) can only increase. As in Breiman *and others* (1984) and LeBlanc and Crowley (1993), we add a penalty term to the splitting criterion, penalizing the complexity of the tree. For *α* ≥ 0, define *G*_{α}(*T*) = *G*(*T*) − *α*|*T* − T̃|, where |*T* − T̃| is the number of internal nodes of *T*. Thus, *α* is the penalty deducted from the splitting criterion for each additional split. Our goal is to find a subtree of *T*_{0} that maximizes *G*_{α}(·) for any *α* ≥ 0. As pointed out by LeBlanc and Crowley (1993), *α* can be chosen between 2 and 4. Using *α* = 2 is in the same spirit as the Akaike information criterion, while a penalty of *α* = 4 is roughly equivalent to setting the significance level at 0.05 for each split.

For any tree *T*, if *T*^{′} ⪯ *T* and *G*_{α}(*T*^{′}) = max_{T* ⪯ T}*G*_{α}(*T*^{*}), we call *T*^{′} an optimally pruned subtree of *T* for the penalty value *α*. For *α* ≥ 0, let *T*(*α*) denote the smallest optimally pruned subtree of *T* with respect to *α*. For any node *v* of *T*, let *T*_{v} be the branch of *T* rooted at *v*. As in Breiman *and others* (1984), we can establish the existence of *T*(*α*) by induction on |*T*| and further prove that *T*(*α*) = {*v* ∈ *T*: *G*_{α}(*T*_{v′}) > 0 for any ancestor *v*^{′} of *v*}.

Note that *T*_{0}(0) = *T*_{0}. To identify *T*_{0}(*α*) for all *α* ≥ 0, for any node *v* ∈ *T*_{0}, we define

*g*_{0}(*v*) = *G*(*T*_{0v})/|*T*_{0v} − T̃_{0v}|,

the average goodness of split per internal node of the branch rooted at *v*. Here, *T*_{0v} is the branch of *T*_{0} rooted at *v*. Let *α*_{1} = min_{v ∈ T_0}*g*_{0}(*v*). Then *T*_{0} = *T*_{0}(*α*) for all *α* < *α*_{1}, and *T*_{1} = *T*_{0}(*α*_{1}) can be obtained by removing from *T*_{0} every branch *T*_{0v} with *g*_{0}(*v*) = *α*_{1}. Now, for any *v* ∈ *T*_{1}, define *g*_{1}(*v*) analogously on *T*_{1}.

Let *α*_{2} = min_{v ∈ T_1}*g*_{1}(*v*). Then *T*_{1} = *T*_{0}(*α*) for *α*_{1} ≤ *α* < *α*_{2}, and *T*_{2} = *T*_{0}(*α*_{2}) can be obtained by removing from *T*_{1} every branch *T*_{1v} with *g*_{1}(*v*) = *α*_{2}. This process is repeated until, for some *S* > 0, *T*_{S} is reduced to the tree containing only the root node of *T*_{0}. We then have a sequence of trees *T*_{S} ≺ ⋯ ≺ *T*_{1} ≺ *T*_{0} and a sequence of numbers ∞ = *α*_{S + 1} > *α*_{S} > ⋯ > *α*_{1} > *α*_{0} = 0 such that *T*_{s} = *T*_{0}(*α*) for *α*_{s} ≤ *α* < *α*_{s + 1}.
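The weakest-link computation of the *α* sequence can be sketched as follows. In this Python sketch, trees are nested dicts with optional "gain", "left", and "right" keys (an ad hoc representation of ours, with "gain" holding a node's goodness-of-split value):

```python
import copy

def prune_sequence(tree):
    """Weakest-link pruning: return the (alpha_s, number of internal nodes)
    pairs obtained by repeatedly collapsing every branch whose per-split
    gain G(T_v) / |internal nodes of T_v| equals the current minimum."""

    def is_internal(t):
        return "left" in t

    def totals(t):
        # (sum of gains, number of internal nodes) of branch t
        if not is_internal(t):
            return 0.0, 0
        gl, nl = totals(t["left"])
        gr, nr = totals(t["right"])
        return t["gain"] + gl + gr, 1 + nl + nr

    def min_link(t):
        # smallest per-split gain g(v) over internal nodes of t
        if not is_internal(t):
            return float("inf")
        g, n = totals(t)
        return min(g / n, min_link(t["left"]), min_link(t["right"]))

    def collapse(t, alpha):
        if not is_internal(t):
            return t
        g, n = totals(t)
        if g / n <= alpha:
            return {"members": t["members"]}  # prune this branch
        t["left"] = collapse(t["left"], alpha)
        t["right"] = collapse(t["right"], alpha)
        return t

    t = copy.deepcopy(tree)  # leave the caller's tree intact
    seq = []
    while is_internal(t):
        alpha = min_link(t)
        t = collapse(t, alpha)
        seq.append((alpha, totals(t)[1]))
    return seq
```

The returned sequence corresponds to *α*_{1} < ⋯ < *α*_{S}, ending with the tree reduced to its root node.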

Like LeBlanc and Crowley (1993), we define a cost for each node *v* and, for any *T* ⪯ *T*_{0}, a corresponding total cost *R*_{α}(*T*). This implies that *G*_{α}(*T*) = *G*(*T*_{0}) + *α* − *R*_{α}(*T*), so our proposed pruning process can be described in terms of *R*_{α}(·) instead of *G*_{α}(·), and all optimality properties of the pruning process can be proved as in Chapter 10 of Breiman *and others* (1984).

Since *G*_{α} is calculated from the same data used to grow and prune the tree, it is a biased estimate of the expected *G*_{α} for the same tree applied to a new independent data set. If the sample size is large, we can randomly split the data into a training sample and a test sample. A tree will be grown and pruned on the training sample, and the goodness of split measures can be calculated using the test sample, resulting in an unbiased estimate of *G*_{α}.

Data sets of moderate size do not allow for separate training and test samples, so resampling methods are used to calculate a proper estimate of *G*_{α}, as in LeBlanc and Crowley (1993). Let *X*_{1} and *X*_{2} be 2 samples of data, and let *G*(*X*_{1};*X*_{2},*T*) be the *G* statistic calculated from *X*_{1} for a tree *T* built on *X*_{2}. Then *G*(*T*) = *G*(*X*;*X*,*T*) for the data set *X*, and what we need is an estimator for *G*^{*}(*T*) = *E*{*G*(*X*^{*};*X*,*T*)}, where *X*^{*} is a sample independent of *X*. Let *α*_{s}^{′} = (*α*_{s}*α*_{s + 1})^{1/2} for 1 ≤ *s* ≤ *S* − 1. We use the following resampling method to identify the proper tree size. First, draw *B* bootstrap samples from *X* and denote them by *X*_{b} (*b* = 1,…,*B*). Then grow a large tree *T*_{b} for each *b*, and for each *s* find *T*_{b}(*α*_{s}^{′}), the optimally pruned subtree of *T*_{b} for the per-split penalty *α*_{s}^{′}. The overoptimism in *G*(*T*_{b}(*α*_{s}^{′})) can be estimated by *O*_{bs} = *G*(*X*_{b};*X*_{b},*T*_{b}(*α*_{s}^{′})) − *G*(*X*;*X*_{b},*T*_{b}(*α*_{s}^{′})). Then *Ō*_{s} = *B*^{ − 1}∑_{b = 1}^{B}*O*_{bs} is a reasonable estimate of the overoptimism in *G*(*T*(*α*_{s}^{′})). For any fixed per-split penalty *α*, we can then choose the optimal subtree as the *T*(*α*_{s}^{′}) that maximizes the bias-corrected penalized criterion *G*_{α}(*T*(*α*_{s}^{′})) − *Ō*_{s}. If this criterion is maximized at *s* = *S* − 1 and is nonpositive there, the optimal subtree contains only the root node.
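The bootstrap correction for overoptimism can be sketched generically as follows. In this Python sketch, `grow_fn` and `g_fn` are placeholders supplied by the user: `grow_fn(sample, alpha)` builds (and prunes at penalty `alpha`) a tree on `sample`, and `g_fn(eval_data, tree)` plays the role of the paper's *G*(*X*_{1};*X*_{2},*T*):

```python
import random

def bias_corrected_g(data, grow_fn, g_fn, alphas, n_boot=50, seed=0):
    """Bootstrap estimate of the optimism-corrected criterion G - O_bar.

    For each penalty in `alphas`, draws n_boot bootstrap samples, estimates
    the overoptimism O_bs = G(X_b; X_b, T_b) - G(X; X_b, T_b), averages it,
    and subtracts the average from the apparent G of the tree grown on the
    full data.  Returns {alpha: corrected G}.
    """
    rng = random.Random(seed)
    corrected = {}
    for a in alphas:
        optimism = 0.0
        for _ in range(n_boot):
            boot = [rng.choice(data) for _ in data]  # resample with replacement
            t_b = grow_fn(boot, a)
            optimism += g_fn(boot, t_b) - g_fn(data, t_b)
        optimism /= n_boot
        t = grow_fn(data, a)
        corrected[a] = g_fn(data, t) - optimism
    return corrected
```

The penalty value whose corrected criterion is largest then determines the final tree size.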