1.  Recursive Partitioning for Monotone Missing at Random Longitudinal Markers 
Statistics in medicine  2012;32(6):978-994.
The development of HIV resistance mutations reduces the efficacy of specific antiretroviral drugs used to treat HIV infection, and cross-resistance within classes of drugs is common. Recursive partitioning has been extensively used to identify resistance mutations associated with a reduced virologic response measured at a single time point; here we describe a statistical method that accommodates a large set of genetic or other covariates and a longitudinal response. This recursive partitioning approach for continuous longitudinal data uses the kernel of a U-statistic as the splitting criterion, and avoids the need for parametric assumptions regarding the relationship between observed response trajectories and covariates. We propose an extension of this approach that allows longitudinal measurements to be monotone missing at random by making use of inverse probability weights. We assess the performance of our method using extensive simulation studies, and apply it to data collected by the Forum for Collaborative HIV Research as part of an investigation of the viral genetic mutations associated with reduced clinical efficacy of the drug abacavir.
PMCID: PMC3754451  PMID: 22941582
Inverse Probability Weighting; Recursive Partitioning; U-statistics
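As a rough illustration of the inverse probability weighting idea in entry 1, the sketch below computes weights for a subject under monotone dropout. It assumes the per-visit retention probabilities have already been estimated by some model; the function name `monotone_ipw` and its arguments are hypothetical, and the abstract does not say how the authors estimate these probabilities or how the weights enter the U-statistic splitting criterion.

```python
def monotone_ipw(retention_probs, last_observed):
    """Inverse probability weights for one subject under monotone dropout.

    retention_probs[t] is an (assumed, externally estimated) probability that
    the subject is still observed at visit t given that they were observed at
    visit t-1. Because dropout is monotone, the probability of being observed
    at visit t is the running product of these conditional probabilities, and
    the weight is its reciprocal.
    """
    weights = []
    cumulative = 1.0
    for t, p in enumerate(retention_probs):
        if t > last_observed:
            break  # no weight is needed for visits after dropout
        if not 0.0 < p <= 1.0:
            raise ValueError("retention probabilities must lie in (0, 1]")
        cumulative *= p
        weights.append(1.0 / cumulative)
    return weights


# Toy usage with made-up probabilities: the subject attends visits 0, 1 and 2.
example_weights = monotone_ipw([1.0, 0.8, 0.5], last_observed=2)
```

Observations from later visits receive larger weights because they stand in for similar subjects who had already dropped out.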
2.  Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm 
Nucleic Acids Research  2010;38(15):e158.
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen–Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, DJS, using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas DJS failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remainder is a mixture of many short compositionally homogeneous domains and relatively few long ones.
PMCID: PMC2926622  PMID: 20571085
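The Jensen–Shannon segmentation step mentioned in entry 2 can be sketched as follows: for every candidate cut, compare the entropy of the whole sequence with the length-weighted entropies of the two parts, and keep the cut with the largest difference. This is a generic, quadratic-time illustration with hypothetical function names; it does not implement IsoPlotter's halting criterion or homogeneity test, which the abstract does not specify.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def best_split(seq):
    """Return (position, divergence) for the cut that maximizes the
    length-weighted Jensen-Shannon divergence between seq[:pos] and seq[pos:].
    Returns (None, 0.0) if no cut gives a positive divergence."""
    n = len(seq)
    h_whole = shannon_entropy(seq)
    best_pos, best_djs = None, 0.0
    for i in range(1, n):
        djs = (h_whole
               - (i / n) * shannon_entropy(seq[:i])
               - ((n - i) / n) * shannon_entropy(seq[i:]))
        if djs > best_djs:
            best_pos, best_djs = i, djs
    return best_pos, best_djs


# Toy sequence: an AT-only stretch followed by a GC-only stretch.
position, divergence = best_split("ATATATAT" + "GCGCGCGC")
```

A recursive segmenter would apply `best_split` again to each side of the cut; deciding when to stop that recursion is exactly the difficulty the abstract describes.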
3.  Evaluating uses of data mining techniques in propensity score estimation: a simulation study† 
In propensity score modeling, it is a standard practice to optimize the prediction of exposure status based on the covariate information. In a simulation study, we examined in what situations analyses based on various types of exposure propensity score (EPS) models using data mining techniques such as recursive partitioning (RP) and neural networks (NN) produce unbiased and/or efficient results.
We simulated data for a hypothetical cohort study (n=2000) with a binary exposure/outcome and 10 binary/continuous covariates with seven scenarios differing by non-linear and/or non-additive associations between exposure and covariates. EPS models used logistic regression (LR) (all possible main effects), RP1 (without pruning), RP2 (with pruning), and NN. We calculated c-statistics (C), standard errors (SE), and bias of exposure-effect estimates from outcome models for the PS-matched dataset.
Data mining techniques yielded higher C than LR (mean: NN, 0.86; RP1, 0.79; RP2, 0.72; and LR, 0.76). SE tended to be greater in models with higher C. Overall bias was small for each strategy, although NN estimates tended to be the least biased. C was not correlated with the magnitude of bias (correlation coefficient [COR]=−0.3, p=0.1) but was correlated with increased SE (COR=0.7, p<0.001).
Effect estimates from EPS models by simple LR were generally robust. NN models generally provided the least numerically biased estimates. C was not associated with the magnitude of bias but was associated with increased SE.
PMCID: PMC2905676  PMID: 18311848
propensity score; logistic regression; neural networks; recursive partitioning
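The c-statistic used in entry 3 to compare propensity score models is the probability that a randomly chosen exposed subject receives a higher estimated score than a randomly chosen unexposed subject. A minimal pairwise sketch, with a hypothetical function name and made-up toy scores, might look like this:

```python
def c_statistic(scores, exposed):
    """Concordance (c) statistic for a set of propensity scores.

    scores:  estimated probabilities of exposure, one per subject
    exposed: 1 if the subject was actually exposed, 0 otherwise
    Every exposed/unexposed pair counts 1 if the exposed subject has the
    higher score, 0.5 for a tie, and 0 otherwise.
    """
    pos = [s for s, e in zip(scores, exposed) if e == 1]
    neg = [s for s, e in zip(scores, exposed) if e == 0]
    if not pos or not neg:
        raise ValueError("need at least one exposed and one unexposed subject")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy usage: three of the four exposed/unexposed pairs are ordered correctly.
toy_c = c_statistic([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is why a higher C need not translate into less biased effect estimates.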
4.  On the Adaptive Partition Approach to the Detection of Multiple Change-Points 
PLoS ONE  2011;6(5):e19754.
With an adaptive partition procedure, we can partition a “time course” into consecutive non-overlapping intervals such that the population means/proportions of the observations in two adjacent intervals are significantly different at a given significance level. However, the widely used recursive combination or partition procedures do not guarantee a global optimization. We propose a modified dynamic programming algorithm to achieve a global optimization. Our method can provide consistent estimation results. In a comprehensive simulation study, our method shows an improved performance when it is compared to the recursive combination/partition procedures. In practice, this level can be determined based on a cross-validation procedure. As an application, we consider the well-known Pima Indian Diabetes data. We explore the relationship among the diabetes risk and several important variables including the plasma glucose concentration, body mass index and age.
PMCID: PMC3101215  PMID: 21629694
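To show why dynamic programming can reach a global optimum where greedy recursive splitting may not, the sketch below finds the partition of a series into a fixed number of consecutive segments that minimizes the total within-segment sum of squares. This is the textbook segmentation recursion, not the authors' modified algorithm: the significance-level constraint between adjacent intervals is omitted, and the number of segments `k` is an assumed input.

```python
def optimal_partition(y, k):
    """Return the sorted change-point indices that split y into k consecutive
    segments with minimal total within-segment sum of squared deviations.
    Assumes 1 <= k <= len(y)."""
    n = len(y)
    # Prefix sums let each segment cost be evaluated in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in y:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def segment_cost(i, j):
        """Sum of squared deviations from the mean for y[i:j]."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting y[:j] into m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + segment_cost(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers from the last segment to the first.
    cuts, j = [], n
    for m in range(k, 1, -1):
        j = back[m][j]
        cuts.append(j)
    return sorted(cuts)


# Toy series with two obvious level shifts.
change_points = optimal_partition([1, 1, 1, 5, 5, 5, 9, 9, 9], 3)
```

Every combination of boundaries is considered implicitly through the table, so the result cannot be trapped by an early poor split.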
5.  An Introduction to Tree-Structured Modeling with Application to Quality of Life (QOL) Data 
Nursing research  2011;60(4):247-255.
Investigators addressing nursing research are faced increasingly with the need to analyze data that involve variables of mixed types and are characterized by complex nonlinearity and interactions. Tree-based methods, also called recursive partitioning, are gaining popularity in various fields. In addition to efficiency and flexibility in handling multifaceted data, tree-based methods offer ease of interpretation.
To introduce tree-based methods, discuss their advantages and pitfalls in application, and describe their potential use in nursing research.
In this paper, (a) an introduction to tree-structured methods is presented, (b) the technique is illustrated via quality of life (QOL) data collected in the Breast Cancer Education Intervention (BCEI) study, and (c) implications for their potential use in nursing research are discussed.
As illustrated by the QOL analysis example, tree methods generate interesting and easily understood findings that cannot be uncovered via traditional linear regression analysis. The expanding breadth and complexity of nursing research may entail the use of new tools to improve efficiency and gain new insights. In certain situations, tree-based methods offer an attractive approach that helps address such needs.
PMCID: PMC3136208  PMID: 21720217
breast cancer survivors; CART; quality of life; tree-based methods; random forests
6.  Simulation via direct computation of partition functions 
In this paper, we demonstrate the efficiency of simulations via direct computation of the partition function under various macroscopic conditions, such as different temperatures or volumes. The method can compute partition functions by flattening histograms, through, for example, the Wang-Landau recursive scheme, outside the energy space. This method offers a more general and flexible framework for handling various types of ensembles, especially ones in which computation of the density of states is not convenient. It can be easily scaled to large systems, and it is flexible in incorporating Monte Carlo cluster algorithms or molecular dynamics. High efficiency is shown in simulating large Ising models, in finding ground states of simple protein models, and in studying the liquid-vapor phase transition of a simple fluid. The method is very simple to implement and we expect it to be efficient in studying complex systems with rugged energy landscapes, e.g., biological macromolecules.
PMCID: PMC3133746  PMID: 17930362
7.  Ultrasensitive Allele-Specific PCR Reveals Rare Preexisting Drug-Resistant Variants and a Large Replicating Virus Population in Macaques Infected with a Simian Immunodeficiency Virus Containing Human Immunodeficiency Virus Reverse Transcriptase 
Journal of Virology  2012;86(23):12525-12530.
It has been proposed that most drug-resistant mutants, resulting from a single-nucleotide change, exist at low frequency in human immunodeficiency virus type 1 (HIV-1) and simian immunodeficiency virus (SIV) populations in vivo prior to the initiation of antiretroviral therapy (ART). To test this hypothesis and to investigate the emergence of resistant mutants with drug selection, we developed a new ultrasensitive allele-specific PCR (UsASP) assay, which can detect drug resistance mutations at a frequency of ≥0.001% of the virus population. We applied this assay to plasma samples obtained from macaques infected with an SIV variant containing HIV-1 reverse transcriptase (RT) (RT-simian-human immunodeficiency [SHIV]mne), before and after they were exposed to a short course of efavirenz (EFV) monotherapy. We detected RT inhibitor (RTI) resistance mutations K65R and M184I but not K103N in 2 of 2 RT-SHIV-infected macaques prior to EFV exposure. After three doses over 4 days of EFV monotherapy, 103N mutations (AAC and AAT) rapidly emerged and increased in the population to levels of ∼20%, indicating that they were present prior to EFV exposure. The rapid increase of 103N mutations from <0.001% to 20% of the viral population indicates that the replicating virus population size in RT-SHIV-infected macaques must be 10^6 or more infected cells per replication cycle.
PMCID: PMC3497681  PMID: 22933296
8.  Baseline predictors of treatment outcome in Internet-based alcohol interventions: a recursive partitioning analysis alongside a randomized trial 
BMC Public Health  2013;13:455.
Internet-based interventions are seen as attractive for harmful users of alcohol and lead to desirable clinical outcomes. Some participants will however not achieve the desired results. In this study, harmful users of alcohol have been partitioned into subgroups with low, intermediate or high probability of positive treatment outcome, using recursive partitioning classification tree analysis.
Data were obtained from a randomized controlled trial assessing the effectiveness of two Internet-based alcohol interventions. The main outcome variable was treatment response, a dichotomous outcome measure for treatment success. Candidate predictors for the classification analysis were first selected using univariate regression. Next, a decision tree model to classify participants into categories with a low, medium and high probability of treatment response was constructed using recursive partitioning software.
Based on literature review, 46 potentially relevant baseline predictors were identified. Five variables were selected using univariate regression as candidate predictors for the classification analysis. Two variables were found most relevant for classification and selected for the decision tree model: ‘living alone’, and ‘interpersonal sensitivity’. Using sensitivity analysis, the robustness of the decision tree model was supported.
Harmful alcohol users in a shared living situation, with high interpersonal sensitivity, have a significantly higher probability of positive treatment outcome. The resulting decision tree model may be used as part of a decision support system but is on its own insufficient as a screening algorithm with satisfactory clinical utility.
Trial registration
Netherlands Trial Register (Cochrane Collaboration): NTR-TC1155.
PMCID: PMC3662562  PMID: 23651767
Alcohol; Internet; Intervention; Outcome predictors; RCT; Recursive partitioning
9.  Random forest methodology for model-based recursive partitioning: the mobForest package for R 
BMC Bioinformatics  2013;14:125.
Recursive partitioning is a non-parametric modeling technique, widely used in regression and classification problems. Model-based recursive partitioning is used to identify groups of observations with similar values of parameters of the model of interest. The mob() function in the party package in R implements the model-based recursive partitioning method. This method produces predictions based on single tree models. Predictions obtained through single tree models are very sensitive to small changes to the learning sample. We extend the model-based recursive partitioning method to produce predictions based on multiple tree models constructed on random samples achieved either through bootstrapping (random sampling with replacement) or subsampling (random sampling without replacement) on learning data.
Here we present an R package called “mobForest” that implements bagging and random forests methodology for model-based recursive partitioning. The mobForest package constructs a large number of model-based trees and the predictions are aggregated across these trees, resulting in more stable predictions. The package also includes functions for computing predictive accuracy estimates and plots, residuals plot, and variable importance plot.
The mobForest package implements a random forest type approach for model-based recursive partitioning. The R package along with its source code is available at
PMCID: PMC3626834  PMID: 23577585
Random forests; Model-based recursive partitioning; Ensemble; R
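The aggregation idea in entry 9 (many trees fitted to bootstrap samples or subsamples, with predictions averaged) can be illustrated independently of R. The sketch below is not the mobForest or party API: it uses a one-split regression stump as a stand-in for a model-based tree, and every name and default (`fit_stump`, `bagged_predictor`, `n_trees`, `frac`) is an assumption made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump: a threshold on x and the mean of y
    on each side. Stands in for a single (unstable) tree model."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if thr is None or x <= thr else mr


def bagged_predictor(xs, ys, n_trees=50, replace=True, frac=0.632, seed=0):
    """Average the predictions of stumps fitted to random resamples.

    replace=True  -> bootstrapping (sampling with replacement)
    replace=False -> subsampling a fraction `frac` without replacement
    """
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        if replace:
            idx = [rng.randrange(n) for _ in range(n)]
        else:
            idx = rng.sample(range(n), max(2, int(frac * n)))
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / len(trees)


# Toy data: a step from 0 to 10 halfway along x.
toy_x = list(range(20))
toy_y = [0.0] * 10 + [10.0] * 10
predict = bagged_predictor(toy_x, toy_y, replace=False)
```

Each stump places its threshold slightly differently depending on the resample, and averaging smooths out that instability.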
10.  Loss of Heterozygosity (LOH) Profiles – Validated Risk Predictors for Progression to Oral Cancer 
A major barrier to oral cancer prevention has been the lack of validated risk predictors for oral premalignant lesions (OPLs). In 2000, we proposed a loss of heterozygosity (LOH) risk model in a retrospective study. This paper validated the previously reported LOH profiles as risk predictors and developed refined models via the largest longitudinal study to date of low-grade OPLs from a population-based patient group. Analysis involved a prospective cohort of 296 patients with primary mild/moderate oral dysplasia enrolled in the Oral Cancer Prediction Longitudinal Study. LOH status was determined in these OPLs. Patients were classified into high-risk or low-risk profiles to validate the 2000 model. Risk models were refined using recursive partitioning and Cox regression analyses. The prospective cohort validated that the high-risk lesions (3p &/or 9p LOH) had a 22·6-fold increase in risk (P = 0·002) compared to low-risk lesions (3p & 9p retention). Addition of another two markers (loci on 4q/17p) further improved the risk prediction, with five-year progression rates of 3·1%, 16·3%, and 63·1% for the low-, intermediate-, and high-risk lesions, respectively. Compared to the low-risk group, intermediate- and high-risk groups had 11·6-fold and 52·1-fold increase in risk (P < 0·001). LOH profiles as risk predictors in the refined model were validated in the retrospective cohort. Multi-covariate analysis with clinical features showed LOH models to be the most significant predictors of progression. LOH profiles can reliably differentiate progression risk for OPLs. Potential uses include increasing surveillance for patients with elevated risk, improving target intervention for high-risk patients while sparing a large number of low-risk patients from needless screening and treatment.
PMCID: PMC3793638  PMID: 22911111
11.  Use of tree-based models to identify subgroups and increase power to detect linkage to cardiovascular disease traits 
BMC Genetics  2003;4(Suppl 1):S66.
Our goal was to identify subgroups of sib pairs from the Framingham Heart Study data set that provided higher evidence of linkage to particular candidate regions for cardiovascular disease traits. The focus of this method is not to claim identification of significant linkage to a particular locus but to show that tree models can be used to identify subgroups for use in selected sib-pair sampling schemes.
We report results using a novel recursive partitioning procedure to identify subgroups of sib pairs with increased evidence of linkage to systolic blood pressure and other cardiovascular disease-related quantitative traits, using the Framingham Heart Study data set provided by the Genetic Analysis Workshop 13. This procedure uses a splitting rule based on Haseman-Elston regression that recursively partitions sib-pair data into homogeneous subgroups.
Using this procedure, we identified a subgroup definition for use as a selected sib-pair sampling scheme. Using the characteristics that define the subgroup with higher evidence for linkage, we have identified an area of focus for further study.
PMCID: PMC1866504  PMID: 14975134
12.  Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE 
BMC Bioinformatics  2006;7:543.
In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is being widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, the performance can be easily affected by noise and outliers, when it is applied to noisy, small sample size microarray data.
In this paper, we propose a recursive gene selection method using the discriminant vector of the maximum margin criterion (MMC), which is a variant of classical linear discriminant analysis (LDA). To overcome the computational drawback of classical LDA and the problem of high dimensionality, we present efficient and stable algorithms for MMC-based RFE (MMC-RFE). The MMC-RFE algorithms naturally extend to multi-class cases. The performance of MMC-RFE was extensively compared with that of SVM-RFE using nine cancer microarray datasets, including four multi-class datasets.
Our extensive comparison has demonstrated that for binary-class datasets MMC-RFE tends to show intermediate performance between hard-margin SVM-RFE and SVM-RFE with a properly chosen soft-margin parameter. Notably, MMC-RFE achieves significantly better performance with a smaller number of genes than SVM-RFE for multi-class datasets. The results suggest that MMC-RFE is less sensitive to noise and outliers due to the use of average margin, and thus may be useful for biomarker discovery from noisy data.
PMCID: PMC1790716  PMID: 17187691
13.  Frequent Emergence of N348I in HIV-1 Subtype C Reverse Transcriptase with Failure of Initial Therapy Reduces Susceptibility to Reverse-Transcriptase Inhibitors 
N348I emerges frequently with failure of first-line antiretroviral therapy (ART) in subtype C human immunodeficiency virus type 1 infection and affects susceptibility to nevirapine, efavirenz, etravirine, and zidovudine. This finding has implications for cross-resistance to subsequent ART regimens in resource-limited settings.
Background. It is not known how often mutations in the connection and ribonuclease H domains of reverse transcriptase (RT) emerge with failure of first-line antiretroviral therapy (ART) in subtype C human immunodeficiency virus type 1 (HIV-1) infection and how these mutations affect susceptibility to other antiretrovirals.
Methods. We compared full-length RT sequences in plasma obtained before therapy and at virologic failure of initial ART among 63 participants with subtype C HIV-1 infection enrolled in the Comprehensive International Program of Research on AIDS in South Africa (CIPRA-SA) study. Recombinant viruses containing full-length plasma-derived RT sequences from participants with N348I at virologic failure were assayed for drug susceptibility.
Results. Y181C and M184V mutations in the RT polymerase domain were associated with failure of stavudine-lamivudine-nevirapine (d4T/3TC/NVP; P < .01), and K103N, V106M, and M184V with failure of d4T/3TC/efavirenz (EFV; P < .01). N348I in the RT connection domain emerged in 45% (P = .002) and 12% (P = .06) of participants receiving failing regimens containing NVP or EFV, respectively. Longitudinal analyses revealed that nonnucleoside RT inhibitor resistance mutations in the polymerase domain generally appeared first. N348I emerged at the same time, or after, M184V. N348I in the context of polymerase domain mutations reduced susceptibility to NVP (8.9–13-fold), EFV (4–56-fold), etravirine (ETV; 1.9–4.7-fold) and decreased hypersusceptibility to zidovudine (AZT; 1.4–2.2-fold).
Conclusions. N348I emerges frequently with virologic failure of first-line ART in subtype C HIV-1 infection and reduces susceptibility to NVP, EFV, ETV, and AZT. Additional studies are warranted to characterize the effects of N348I on virologic response to second- and third-line regimens in resource-limited settings where subtype C predominates.
PMCID: PMC3491849  PMID: 22618567
14.  Efavirenz Therapy in Rhesus Macaques Infected with a Chimera of Simian Immunodeficiency Virus Containing Reverse Transcriptase from Human Immunodeficiency Virus Type 1 
The specificity of nonnucleoside reverse transcriptase (RT) inhibitors (NNRTIs) for the RT of human immunodeficiency virus type 1 (HIV-1) has prevented the use of simian immunodeficiency virus (SIV) in the study of NNRTIs and NNRTI-based highly active antiretroviral therapy. However, a SIV-HIV-1 chimera (RT-SHIV), in which the RT from SIVmac239 was replaced with the RT-encoding region from HIV-1, is susceptible to NNRTIs and is infectious to rhesus macaques. We have evaluated the antiviral activity of efavirenz against RT-SHIV and the emergence of efavirenz-resistant mutants in vitro and in vivo. RT-SHIV was susceptible to efavirenz with a mean effective concentration of 5.9 ± 4.5 nM, and RT-SHIV variants selected with efavirenz in cell culture displayed 600-fold-reduced susceptibility. The efavirenz-resistant mutants of RT-SHIV had mutations in RT similar to those of HIV-1 variants that were selected under similar conditions. Efavirenz monotherapy of RT-SHIV-infected macaques produced a 1.82-log-unit decrease in plasma viral-RNA levels after 1 week. The virus load rebounded within 3 weeks in one treated animal and more slowly in a second animal. Virus isolated from these two animals contained the K103N and Y188C or Y188L mutations. The RT-SHIV-rhesus macaque model may prove useful for studies of antiretroviral drug combinations that include efavirenz.
PMCID: PMC514752  PMID: 15328115
15.  Identification of Extremely Premature Infants at High Risk of Rehospitalization 
Pediatrics  2011;128(5):e1216-e1225.
Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization.
Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002–2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables.
A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%–42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay.
The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge.
PMCID: PMC3208965  PMID: 22007016
logistic models; infant; premature; predictive value of tests
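The "automatic selection of optimal cutoff points" in the classification and regression-tree analysis of entry 15 amounts to scanning the candidate thresholds of a variable and keeping the one that leaves the purest child groups. The sketch below uses Gini impurity, a common CART criterion; the abstract does not state which criterion or software the authors used, and the function names and toy numbers are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) outcomes."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_cutoff(values, outcomes):
    """Return (cutoff, impurity) for the threshold on `values` that minimizes
    the size-weighted Gini impurity of the two resulting groups.
    cutoff is None if no split improves on the unsplit data."""
    pairs = sorted(zip(values, outcomes))
    n = len(pairs)
    best_cut, best_imp = None, gini([o for _, o in pairs])
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        left = [o for _, o in pairs[:i]]
        right = [o for _, o in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_imp = imp
    return best_cut, best_imp


# Made-up toy data: a continuous predictor and a binary outcome.
cutoff, impurity = best_cutoff([10, 20, 30, 40, 50, 60], [0, 0, 0, 1, 1, 1])
```

Recursive partitioning repeats this search within each resulting group and across all candidate variables.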
16.  Common Functional Correlates of Head-Strike Behavior in the Pachycephalosaur Stegoceras validum (Ornithischia, Dinosauria) and Combative Artiodactyls 
PLoS ONE  2011;6(6):e21422.
Pachycephalosaurs were bipedal herbivorous dinosaurs with bony domes on their heads, suggestive of head-butting as seen in bighorn sheep and musk oxen. Previous biomechanical studies indicate potential for pachycephalosaur head-butting, but bone histology appears to contradict the behavior in young and old individuals. Comparing pachycephalosaurs with fighting artiodactyls tests for common correlates of head-butting in their cranial structure and mechanics.
Methods/Principal Findings
Computed tomographic (CT) scans and physical sectioning revealed internal cranial structure of ten artiodactyls and pachycephalosaurs Stegoceras validum and Prenocephale prenes. Finite element analyses (FEA), incorporating bone and keratin tissue types, determined cranial stress and strain from simulated head impacts. Recursive partition analysis quantified strengths of correlation between functional morphology and actual or hypothesized behavior. Strong head-strike correlates include a dome-like cephalic morphology, neurovascular canals exiting onto the cranium surface, large neck muscle attachments, and dense cortical bone above a sparse cancellous layer in line with the force of impact. The head-butting duiker Cephalophus leucogaster is the closest morphological analog to Stegoceras, with a smaller yet similarly rounded dome. Crania of the duiker, pachycephalosaurs, and bighorn sheep Ovis canadensis share stratification of thick cortical and cancellous layers. Stegoceras, Cephalophus, and musk ox crania experience lower stress and higher safety factors for a given impact force than giraffe, pronghorn, or the non-combative llama.
Anatomy, biomechanics, and statistical correlation suggest that some pachycephalosaurs were as competent at head-to-head impacts as extant analogs displaying such combat. Large-scale comparisons and recursive partitioning can greatly refine inference of behavioral capability for fossil animals.
PMCID: PMC3125168  PMID: 21738658
17.  Evidence of linkage to chromosome 1 for early age of onset of rheumatoid arthritis and HLA marker DRB1 genotype in NARAC data 
BMC Proceedings  2007;1(Suppl 1):S78.
Focusing on chromosome 1, a recursive partitioning linkage algorithm (RP) was applied to perform linkage analysis on the rheumatoid arthritis NARAC data, incorporating covariates such as HLA-DRB1 genotype, age at onset, severity, anti-cyclic citrullinated peptide (anti-CCP), and life time smoking. All 617 affected sib pairs from the ascertained families were used, and an RP linkage model was used to identify linkage possibly influenced by covariates. This algorithm includes a likelihood ratio (LR)-based splitting rule, a pruning algorithm to identify optimal tree size, and a bootstrap method for final tree selection.
The strength of the linkage signals was evaluated by empirical p-values, obtained by simulating marker data under null hypothesis of no linkage. Two suggestive linkage regions on chromosome 1 were detected by the RP linkage model, with identified associated covariates HLA-DRB1 genotype and age at onset. These results suggest possible gene × gene and gene × environment interactions at chromosome 1 loci and provide directions for further gene mapping.
PMCID: PMC2367509  PMID: 18466580
18.  Recursive Feature Selection with Significant Variables of Support Vectors 
The development of DNA microarrays enables researchers to screen thousands of genes simultaneously and also helps determine high- and low-expression level genes in normal and disease tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold to select genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in a support vector machine. We compared its performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
PMCID: PMC3426197  PMID: 22927888
19.  Heuristic Algorithms for Assigning Hispanic Ethnicity 
PLoS ONE  2013;8(2):e55689.
We compared several techniques for assigning Hispanic ethnicity to records in data systems where this information may be missing, variously making use of country of origin, surname, race, and county of residence. We considered an algorithm in use by the North American Association of Central Cancer Registries (NAACCR), a variation of this developed by the authors, a “fast and frugal” algorithm developed with the aid of recursive partitioning methods, and conventional logistic regression. With the exception of logistic regression, each approach was rule-based: if specific criteria were met, an ethnicity assignment was made; otherwise, the next criterion was considered, until all records were assigned. We evaluated the algorithms on a sample of over 500,000 female clients from the New York State Cancer Services Program for whom self-reported Hispanic ethnicity was known. We found that all approaches yielded similarly high accuracy, sensitivity, and positive predictive value in all parts of the state, from areas with very low to very high Hispanic populations. An advantage of the fast and frugal method is that it consists of a small number of easily remembered steps.
PMCID: PMC3566036  PMID: 23405197
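The rule-based structure described in entry 19 (check one criterion; if it is met, assign and stop; otherwise move to the next) can be sketched as an ordered cascade. The rules below are placeholders only: the actual criteria, their order, and the reference lists used by NAACCR or by the authors are not given in the abstract, so every field name, set, and label here is hypothetical.

```python
# Hypothetical placeholder reference sets; real systems use curated lists.
PLACEHOLDER_COUNTRIES = {"COUNTRY_A", "COUNTRY_B"}
PLACEHOLDER_SURNAMES = {"SURNAME_X", "SURNAME_Y"}

# Each rule: (rule name, predicate on a record, label assigned if it fires).
# The order matters: the first rule that fires decides the assignment.
RULES = [
    ("country_of_origin",
     lambda r: r.get("country") in PLACEHOLDER_COUNTRIES, "Hispanic"),
    ("surname_list",
     lambda r: r.get("surname") in PLACEHOLDER_SURNAMES, "Hispanic"),
]


def assign_ethnicity(record, rules=RULES, default="non-Hispanic"):
    """Apply the rules in order and return (label, name of the deciding rule).
    Records that match no rule receive the default label."""
    for name, predicate, label in rules:
        if predicate(record):
            return label, name
    return default, "default"


def sensitivity_and_ppv(assigned, truth, positive="Hispanic"):
    """Sensitivity and positive predictive value against self-report."""
    tp = sum(a == positive and t == positive for a, t in zip(assigned, truth))
    fn = sum(a != positive and t == positive for a, t in zip(assigned, truth))
    fp = sum(a == positive and t != positive for a, t in zip(assigned, truth))
    return tp / (tp + fn), tp / (tp + fp)


# Toy usage with a made-up record.
label, rule_used = assign_ethnicity({"country": "COUNTRY_C", "surname": "SURNAME_X"})
```

Keeping the cascade short is what makes such an algorithm "fast and frugal": a handful of ordered checks that a person could apply from memory.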
20.  Minor HIV-1 Variants with the K103N Resistance Mutation during Intermittent Efavirenz-Containing Antiretroviral Therapy and Virological Failure 
PLoS ONE  2011;6(6):e21655.
The impact of minor drug-resistant variants of human immunodeficiency virus type 1 (HIV-1) on the failure of antiretroviral therapy remains unclear. We have evaluated the importance of detecting minor populations of viruses resistant to non-nucleoside reverse-transcriptase inhibitors (NNRTI) during intermittent antiretroviral therapy, a high-risk context for the emergence of drug-resistant HIV-1. We carried out a longitudinal study on plasma samples taken from 21 patients given efavirenz and enrolled in the intermittent arm of the ANRS 106 trial. Allele-specific real-time PCR was used to detect and quantify minor K103N mutants during off-therapy periods. The concordance with ultra-deep pyrosequencing was assessed for 11 patients. The pharmacokinetics of efavirenz was assayed to determine whether its variability could influence the emergence of K103N mutants. Allele-specific real-time PCR detected K103N mutants in 15 of the 19 analyzable patients at the end of an off-therapy period while direct sequencing detected mutants in only 6 patients. The frequency of K103N mutants was <0.1% in 7 patients by allele-specific real-time PCR without further selection, and >0.1% in 8. It was 0.1%–10% in 6 of these 8 patients. The mutated virus populations of 4 of these 6 patients underwent further selection and treatment failed for 2 of them. The K103N mutant frequency was >10% in the remaining 2, and treatment failed for one of them. The copy numbers of K103N variants quantified by allele-specific real-time PCR and ultra-deep pyrosequencing agreed closely (ρ = 0.89, P<0.0001). The half-life of efavirenz was longer (50.5 hours) in the 8 patients in whom K103N emerged (>0.1%) than in the 11 patients in whom it did not (32 hours) (P = 0.04). Thus ultrasensitive methods could prove more useful than direct sequencing for predicting treatment failure in some patients. However, the presence of minor NNRTI-resistant viruses need not always result in virological escape.
Trial registration: NCT00122551
PMCID: PMC3124548  PMID: 21738752
21.  Identifying Subgroups of Patients with Depression Who Are at High Risk for Suicide 
The Journal of clinical psychiatry  2009;70(11):1495-1500.
Although prior research has identified a number of separate risk factors for suicide among patients with depression, little is known about how these factors may interact to modify suicide risks. Using an empirically-based decision tree analysis for a large national sample of Veterans Affairs (VA) health system patients treated for depression, we identify subgroups with particularly high or low rates of suicide.
We identified 887,859 VA patients treated for depression between April 1, 1999 and September 30, 2004. Randomly splitting the data into two samples (primary and replication samples), we developed a decision tree for the primary sample using recursive partitioning. We then tested whether the groups developed within the primary sample were associated with increased suicide risk in the replication sample.
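The split-sample design described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the three binary risk factors and their effect sizes are invented, Gini impurity is used as the splitting criterion (the abstract does not name one), and the helper names (gini, best_split, grow, leaf_rate, simulate) are hypothetical. It is not the authors' analysis.

```python
import random

def gini(rows):
    """Gini impurity of the binary outcome stored as the last element of each row."""
    if not rows:
        return 0.0
    p = sum(r[-1] for r in rows) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows, n_features):
    """Return (feature index, weighted impurity) of the best binary split, or None."""
    best = None
    for j in range(n_features):
        left = [r for r in rows if r[j] == 0]
        right = [r for r in rows if r[j] == 1]
        if not left or not right:
            continue  # this feature does not separate the rows
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (j, score)
    return best

def grow(rows, n_features, depth=0, max_depth=2, min_size=200):
    """Recursively partition rows; each leaf stores its observed event rate."""
    split = None
    if depth < max_depth and len(rows) >= min_size:
        split = best_split(rows, n_features)
    if split is None or split[1] >= gini(rows):
        return {"rate": sum(r[-1] for r in rows) / len(rows)}
    j = split[0]
    return {
        "feature": j,
        0: grow([r for r in rows if r[j] == 0], n_features, depth + 1, max_depth, min_size),
        1: grow([r for r in rows if r[j] == 1], n_features, depth + 1, max_depth, min_size),
    }

def leaf_rate(tree, row):
    """Follow the splits for one row and return the event rate of its leaf."""
    while "feature" in tree:
        tree = tree[row[tree["feature"]]]
    return tree["rate"]

rng = random.Random(0)

def simulate(n):
    """Synthetic cohort (assumed): risk is elevated only when factors 0 and 1 co-occur."""
    rows = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(3)]
        risk = 0.30 if x[0] == 1 and x[1] == 1 else 0.02
        rows.append(x + [1 if rng.random() < risk else 0])
    return rows

# Randomly split the cohort into primary and replication samples.
data = simulate(4000)
rng.shuffle(data)
primary, replication = data[:2000], data[2000:]

# Grow the tree on the primary sample only.
tree = grow(primary, n_features=3)

# Check the highest-risk subgroup against the untouched replication sample.
predicted = [leaf_rate(tree, r) for r in replication]
highest = max(predicted)
high_group = [r for r, p in zip(replication, predicted) if p == highest]
rate_high = sum(r[-1] for r in high_group) / len(high_group)
rate_all = sum(r[-1] for r in replication) / len(replication)
print(f"replication event rate: highest-risk leaf {rate_high:.3f}, overall {rate_all:.3f}")
```

Because the tree never sees the replication rows, a subgroup that keeps its elevated event rate there is less likely to be an artifact of the search over splits.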
The exploratory data analysis produced a decision tree with subgroups of patients at differing levels of risk for suicide. These were identified by a combination of factors including a co-occurring substance use disorder diagnosis, male gender, African American race and psychiatric hospitalization in the past year. The groups developed as part of the decision tree accurately discriminated between those with and without suicide in the replication sample. The patients at highest risk for suicide were those with a substance use disorder, who were non-African American and had an inpatient psychiatric stay within the past 12 months.
Study findings suggest that the identification of depressed patients at increased risk for suicide is improved through the examination of higher order interactions between potential risk factors.
PMCID: PMC3057750  PMID: 20031094
suicide; depression; substance use disorders; decision trees; data mining
22.  CATCH: a clinical decision rule for the use of computed tomography in children with minor head injury 
There is controversy about which children with minor head injury need to undergo computed tomography (CT). We aimed to develop a highly sensitive clinical decision rule for the use of CT in children with minor head injury.
For this multicentre cohort study, we enrolled consecutive children with blunt head trauma presenting with a score of 13–15 on the Glasgow Coma Scale and loss of consciousness, amnesia, disorientation, persistent vomiting or irritability. For each child, staff in the emergency department completed a standardized assessment form before any CT. The main outcomes were need for neurologic intervention and presence of brain injury as determined by CT. We developed a decision rule by using recursive partitioning to combine variables that were both reliable and strongly associated with the outcome measures and thus to find the best combinations of predictor variables that were highly sensitive for detecting the outcome measures with maximal specificity.
Among the 3866 patients enrolled (mean age 9.2 years), 95 (2.5%) had a score of 13 on the Glasgow Coma Scale, 282 (7.3%) had a score of 14, and 3489 (90.2%) had a score of 15. CT revealed that 159 (4.1%) had a brain injury, and 24 (0.6%) underwent neurologic intervention. We derived a decision rule for CT of the head consisting of four high-risk factors (failure to reach score of 15 on the Glasgow Coma Scale within two hours, suspicion of open skull fracture, worsening headache and irritability) and three additional medium-risk factors (large, boggy hematoma of the scalp; signs of basal skull fracture; dangerous mechanism of injury). The high-risk factors were 100.0% sensitive (95% CI 86.2%–100.0%) for predicting the need for neurologic intervention and would require that 30.2% of patients undergo CT. The medium-risk factors resulted in 98.1% sensitivity (95% CI 94.6%–99.4%) for the prediction of brain injury by CT and would require that 52.0% of patients undergo CT.
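As a worked check of the sensitivity figures above, the following sketch computes a Wilson score interval for a proportion. The abstract does not say which interval method was used, so the choice of the Wilson interval is an assumption, as is the function name wilson_interval. The count 24 of 24 follows from the 24 neurologic interventions and 100.0% sensitivity; the count 156 of 159 is inferred from 98.1% of 159 brain injuries and is not stated in the source.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# High-risk factors: 24 of 24 neurologic interventions detected.
lo_high, hi_high = wilson_interval(24, 24)

# Medium-risk factors: 156 of 159 brain injuries detected (inferred count).
lo_med, hi_med = wilson_interval(156, 159)

print(f"high-risk:   {lo_high:.1%} to {hi_high:.1%}")
print(f"medium-risk: {lo_med:.1%} to {hi_med:.1%}")
```

Under these assumptions the intervals come out at roughly 86.2% to 100.0% and 94.6% to 99.4%, which agrees with the reported values to one decimal place.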
The decision rule developed in this study identifies children at two levels of risk. Once the decision rule has been prospectively validated, it has the potential to standardize and improve the use of CT for children with minor head injury.
PMCID: PMC2831681  PMID: 20142371
23.  Recursively Imputed Survival Trees 
We propose recursively imputed survival tree (RIST) regression for right-censored data. This new nonparametric regression procedure uses a novel recursive imputation approach combined with extremely randomized trees that allows significantly better use of censored data than previous tree based methods, yielding improved model fit and reduced prediction error. The proposed method can also be viewed as a type of Monte Carlo EM algorithm which generates extra diversity in the tree-based fitting process. Simulation studies and data analyses demonstrate the superior performance of RIST compared to previous methods.
PMCID: PMC3486435  PMID: 23125470
Trees; Ensemble; Random Forests; Censored data; Imputation; Survival Analysis
24.  Interaction Trees with Censored Survival Data 
We propose an interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times. The proposed method recursively partitions the data into two subsets that show the greatest interaction with the treatment, which results in a number of objectively defined subgroups: in some of them the treatment effect is prominent while in others the treatment may have a negligible or even negative effect. The resultant tree structure can be used to explore the overall interaction between treatment and other covariates and help identify and describe possible target populations on which an experimental treatment demonstrates desired efficacy. We follow the standard CART (Breiman et al., 1984) methodology to develop the interaction tree structure. Variable importance information is extracted via random forests of interaction trees. Both simulated experiments and an analysis of the primary biliary cirrhosis (PBC) data are provided for evaluation and illustration of the proposed procedure.
PMCID: PMC2835451  PMID: 20231911
25.  Nonparametric Multi-state Representations of Survival and Longitudinal Data with Measurement Error 
Statistics in medicine  2012;31(21):10.1002/sim.5369.
This paper proposes a nonparametric procedure to describe the progression of longitudinal cohorts over time from a population averaged perspective, leading to multi-state probability curves with the states defined jointly by survival and longitudinal outcomes measured with error. To account for the challenges of informative dropout and nonlinear shapes of the longitudinal trajectories, a bias corrected penalized spline regression is applied to estimate the unobserved longitudinal trajectory for each subject. The multi-state probability curves are then estimated based on the survival data and the estimated longitudinal trajectories. Simulation Extrapolation (SIMEX) is further used to reduce the estimation bias caused by the randomness of the estimated trajectories. A bootstrap test is developed to compare multi-state probability curves between groups. We present theoretical justification of the estimation procedure along with a simulation study to demonstrate finite sample performance. The procedure is illustrated by data from the African American Study of Kidney Disease and Hypertension, and it can be widely applied in longitudinal studies.
PMCID: PMC3845220  PMID: 22535711
Multi-state representations; penalized spline; SIMEX

