author:("Cai, tianqi")
The annals of applied statistics  2010;4(1):520-532.
To investigate whether treating cancer patients with erythropoiesis-stimulating agents (ESAs) would increase the mortality risk, Bennett et al. [Journal of the American Medical Association 299 (2008) 914–924] conducted a meta-analysis with the data from 52 phase III trials comparing ESAs with placebo or standard of care. With a standard parametric random effects modeling approach, the study concluded that ESA administration was significantly associated with increased average mortality risk. In this article we present a simple nonparametric inference procedure for the distribution of the random effects. We re-analyzed the ESA mortality data with the new method. Our results about the center of the random effects distribution were markedly different from those reported by Bennett et al. Moreover, our procedure, which estimates the distribution of the random effects, as opposed to just a simple population average, suggests that the ESA may be beneficial to mortality for approximately a quarter of the study populations. This new meta-analysis technique can be implemented with study-level summary statistics. In contrast to existing methods for parametric random effects models, the validity of our proposal does not require the number of studies involved to be large. From the results of an extensive numerical study, we find that the new procedure performs well even with moderate individual study sample sizes.
PMCID: PMC4321956
Bivariate beta; conditional permutation test; erythropoiesis-stimulating agents; logit-normal; two-level hierarchical model
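For orientation, the kind of "standard parametric random effects" analysis that the abstract contrasts itself with is often carried out with a moment estimator such as DerSimonian-Laird, which, like the proposed method, needs only study-level summary statistics. The sketch below is an assumption about what such a baseline might look like; it is neither the analysis of Bennett et al. nor the nonparametric procedure proposed in the article, and the function name and inputs are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Moment-based random effects meta-analysis (illustrative sketch).

    effects   -- per-study effect estimates (e.g. log odds ratios)
    variances -- per-study sampling variances of those estimates
    Returns (pooled mean, standard error, between-study variance tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]              # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)            # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mu, se, tau2
```

This estimator summarizes the random effects distribution by its mean and variance only, which is the limitation the abstract points to when it estimates the whole distribution instead.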
2.  Landmark Estimation of Survival and Treatment Effect in a Randomized Clinical Trial 
In many studies with a survival outcome, it is often not feasible to fully observe the primary event of interest. This often leads to heavy censoring and thus, difficulty in efficiently estimating survival or comparing survival rates between two groups. In certain diseases, baseline covariates and the event time of non-fatal intermediate events may be associated with overall survival. In these settings, incorporating such additional information may lead to gains in efficiency in estimation of survival and testing for a difference in survival between two treatment groups. If gains in efficiency can be achieved, it may then be possible to decrease the sample size of patients required for a study to achieve a particular power level or decrease the duration of the study. Most existing methods for incorporating intermediate events and covariates to predict survival focus on estimation of relative risk parameters and/or the joint distribution of events under semiparametric models. However, in practice, these model assumptions may not hold and hence may lead to biased estimates of the marginal survival. In this paper, we propose a semi-nonparametric two-stage procedure to estimate and compare t-year survival rates by incorporating intermediate event information observed before some landmark time, which serves as a useful approach to overcome semi-competing risks issues. In a randomized clinical trial setting, we further improve efficiency through an additional calibration step. Simulation studies demonstrate substantial potential gains in efficiency in terms of estimation and power. We illustrate our proposed procedures using an AIDS Clinical Trial Protocol 175 dataset by estimating survival and examining the difference in survival between two treatment groups: zidovudine and zidovudine plus zalcitabine.
PMCID: PMC3960087  PMID: 24659838
Efficiency Augmentation; Kaplan Meier; Landmark Prediction; Semi-competing Risks; Survival Analysis
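The t-year survival rate that the abstract targets is conventionally estimated with the Kaplan-Meier estimator listed in the keywords. The following is a minimal sketch of that plain estimator only, assuming right-censored data given as parallel lists; it does not implement the two-stage landmark procedure or the calibration step described above, and the function name is hypothetical.

```python
from collections import Counter

def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of S(t) = P(T > t) (illustrative sketch).

    times  -- observed follow-up times (event time or censoring time)
    events -- 1 if the event was observed at that time, 0 if censored
    t      -- time point at which survival is evaluated
    """
    # Number of observed events at each distinct event time.
    deaths = Counter(tm for tm, ev in zip(times, events) if ev == 1)
    surv = 1.0
    for tm in sorted(deaths):
        if tm > t:
            break
        # Subjects still under observation just before time tm.
        at_risk = sum(1 for x in times if x >= tm)
        surv *= 1.0 - deaths[tm] / at_risk
    return surv
```

Under heavy censoring few subjects remain at risk near t, so each factor in the product is estimated from little data; that is the inefficiency the landmark approach aims to reduce by using intermediate event information.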
3.  Lipid and lipoprotein levels and trends in rheumatoid arthritis compared to the general population 
Arthritis care & research  2013;65(12):2046-2050.
Differences in lipid levels associated with cardiovascular (CV) risk between rheumatoid arthritis (RA) and the general population remain unclear. Determining these differences is important in understanding the role of lipids in CV risk in RA.
We studied 2,005 RA subjects from two large academic medical centers. We extracted electronic medical record (EMR) data on the first low density lipoprotein (LDL), total cholesterol (TChol) and high density lipoprotein (HDL) within 1 year of the LDL. Subjects with an electronic statin prescription prior to the first LDL were excluded.
We compared lipid levels in RA to levels from the general United States population (Carroll, et al., JAMA 2012), using the t-test and stratifying by published parameters, i.e. 2007–2010, women. We determined lipid trends using separate linear regression models for TChol, LDL and HDL, testing the association between year of measurement (1989–2010) and lipid level, adjusted by age and gender. Lipid trends were qualitatively compared to those reported in Carroll, et al.
Women with RA had a significantly lower TChol (186 vs 200mg/dL, p=0.002) and LDL (105 vs 118mg/dL, p=0.001) compared to the general population (2007–2010). HDL was not significantly different in the two groups. In the RA cohort, TChol and LDL significantly decreased each year, while HDL increased (all with p<0.0001), consistent with overall trends observed in Carroll, et al.
RA patients appear to have an overall lower TChol and LDL than the general population, despite the general overall risk of CVD in RA from observational studies.
PMCID: PMC4060244  PMID: 23925980
4.  Assessment of Biomarkers for Risk Prediction with Nested Case Control Studies 
Clinical trials (London, England)  2013;10(5):677-679.
Accurate risk prediction plays a key role in disease prevention and disease management; the emergence of new biomarkers raises the important question of how much prediction accuracy would improve if the new markers were added to existing risk prediction tools. However, in large prospective cohort studies, the standard full-cohort design, requiring marker measurement on the entire cohort, may be infeasible due to cost and low rate of the clinical condition of interest. To overcome such difficulties, nested case-control (NCC) studies provide cost-effective alternatives but bring about challenges in statistical analyses due to the complex datasets generated. To evaluate prognostic accuracy of a risk model, Cai and Zheng [1] proposed a class of nonparametric inverse probability weighting (IPW) estimators for accuracy measures in the time-dependent receiver operating characteristic curve analysis. To accommodate a three-phase NCC design in the Nurses' Health Study, we extend the double IPW estimators of Cai and Zheng [1] to develop risk prediction models under time-dependent generalized linear models and evaluate the incremental values of new biomarkers and genetic markers. Our results suggest that aggregating the information from both the genetic markers and biomarkers substantially improves the accuracy for predicting 5-year and 10-year risks of rheumatoid arthritis.
PMCID: PMC3800233  PMID: 24013405
5.  Adopting nested case–control quota sampling designs for the evaluation of risk markers 
Lifetime data analysis  2013;19(4):568-588.
Two-phase study methods, in which more detailed or more expensive exposure information is only collected on a sample of individuals with events and a small proportion of other individuals, are expected to play a critical role in biomarker validation research. One major limitation of standard two-phase designs is that they are most conveniently employed with study cohorts in which information on longitudinal follow-up and other potential matching variables is electronically recorded. However, for many practical situations, at the sampling stage such information may not be readily available for every potential candidate. Study eligibility needs to be verified by reviewing information from medical charts one by one. In this manuscript, we study in depth a novel study design commonly undertaken in practice that involves sampling until quotas of eligible cases and controls are identified. We propose semiparametric methods to calculate risk distributions and a wide variety of prediction indices when outcomes are censored failure times and data are collected under the quota sampling design. Consistency and asymptotic normality of our estimators are established using empirical process theory. Simulation results indicate that the proposed procedures perform well in finite samples. Application is made to the evaluation of a new risk model for predicting the onset of cardiovascular disease.
PMCID: PMC3903399  PMID: 23807695
Biomarker; Nested case-control study; Prediction accuracy; Prognosis; Quota sampling; Risk prediction
6.  Landmark Risk Prediction of Residual Life for Breast Cancer Survival 
Statistics in medicine  2013;32(20):3459-3471.
The importance of developing personalized risk prediction estimates has become increasingly evident in recent years. In general, patient populations may be heterogeneous and represent a mixture of different unknown subtypes of disease. When the source of this heterogeneity and resulting subtypes of disease are unknown, accurate prediction of survival may be difficult. However, in certain disease settings the onset time of an observable short term event may be highly associated with these unknown subtypes of disease and thus may be useful in predicting long term survival. One approach to incorporate short term event information along with baseline markers for the prediction of long term survival is through a landmark Cox model, which assumes a proportional hazards model for the residual life at a given landmark point. In this paper, we use this modeling framework to develop procedures to assess how a patient’s long term survival trajectory may change over time given good short term outcome indications along with prognosis based on baseline markers. We first propose time-varying accuracy measures to quantify the predictive performance of landmark prediction rules for residual life and provide resampling-based procedures to make inference about such accuracy measures. Simulation studies show that the proposed procedures perform well in finite samples. Throughout, we illustrate our proposed procedures using a breast cancer dataset with information on time to metastasis and time to death. In addition to baseline clinical markers available for each patient, a chromosome instability genetic score, denoted by CIN25, is also available for each patient and has been shown to be predictive of survival for various types of cancer. We provide procedures to evaluate the incremental value of CIN25 for the prediction of residual life and examine how the residual life profile changes over time.
This allows us to identify an informative landmark point, t0, such that accurate risk predictions of the residual life could be made for patients who survive past t0 without metastasis.
PMCID: PMC3744612  PMID: 23494768
landmark prediction; biomarkers; disease prognosis; predictive accuracy; risk prediction; survival analysis
7.  Normalization of Plasma 25-hydroxy Vitamin D is Associated with Reduced Risk of Surgery in Crohn’s Disease 
Inflammatory bowel diseases  2013;19(9):1921-1927.
Vitamin D may have an immunological role in Crohn’s disease (CD) and ulcerative colitis (UC). Retrospective studies suggested a weak association between vitamin D status and disease activity but have significant limitations.
Using a multi-institution inflammatory bowel disease (IBD) cohort, we identified all CD and UC patients who had at least one measured plasma 25-hydroxy vitamin D [25(OH)D]. Plasma 25(OH)D was considered sufficient at levels ≥ 30ng/mL. Logistic regression models adjusting for potential confounders were used to identify impact of measured plasma 25(OH)D on subsequent risk of IBD-related surgery or hospitalization. In a subset of patients where multiple measures of 25(OH)D were available, we examined impact of normalization of vitamin D status on study outcomes.
Our study included 3,217 patients (55% CD, mean age 49 yrs). The median lowest plasma 25(OH)D was 26ng/ml (IQR 17–35ng/ml). In CD, on multivariable analysis, plasma 25(OH)D < 20ng/ml was associated with an increased risk of surgery (OR 1.76, 95% CI 1.24 – 2.51) and IBD-related hospitalization (OR 2.07, 95% CI 1.59 – 2.68) compared to those with 25(OH)D ≥ 30ng/ml. Similar estimates were also seen for UC. Furthermore, CD patients who had initial levels < 30ng/ml but subsequently normalized their 25(OH)D had a reduced likelihood of surgery (OR 0.56, 95% CI 0.32 – 0.98) compared to those who remained deficient.
Low plasma 25(OH)D is associated with increased risk of surgery and hospitalizations in both CD and UC and normalization of 25(OH)D status is associated with a reduction in the risk of CD-related surgery.
PMCID: PMC3720838  PMID: 23751398
Crohn’s disease; ulcerative colitis; vitamin D; surgery; hospitalization
8.  Evaluating incremental values from new predictors with net reclassification improvement in survival analysis 
Lifetime data analysis  2012;19(3):350-370.
Developing individualized prediction rules for disease risk and prognosis has played a key role in modern medicine. When new genomic or biological markers become available to assist in risk prediction, it is essential to assess the improvement in clinical usefulness of the new markers over existing routine variables. Net reclassification improvement (NRI) has been proposed to assess improvement in risk reclassification in the context of comparing two risk models and the concept has been quickly adopted in medical journals. We propose both nonparametric and semiparametric procedures for calculating NRI as a function of a future prediction time t with a censored failure time outcome. The proposed methods accommodate covariate-dependent censoring, therefore providing more robust and sometimes more efficient procedures compared with the existing nonparametric-based estimators. Simulation results indicate that the proposed procedures perform well in finite samples. We illustrate these procedures by evaluating a new risk model for predicting the onset of cardiovascular disease.
PMCID: PMC3686882  PMID: 23254468
Inverse probability weighted (IPW) estimator; Net reclassification improvement (NRI); Risk prediction; Survival analysis
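To make the quantity being estimated concrete, here is a minimal sketch of the category-free form of NRI for a fully observed binary outcome. It is an illustration under simplifying assumptions: it ignores censoring entirely, so it is not the nonparametric or semiparametric procedure proposed in the abstract, and the function name and inputs are hypothetical.

```python
def category_free_nri(old_risk, new_risk, event):
    """Category-free net reclassification improvement (illustrative sketch).

    old_risk, new_risk -- predicted risks from the existing and the new model
    event              -- 1 if the subject had the event, 0 otherwise
    Assumes every outcome is observed (no censoring) and that both the
    event and non-event groups are non-empty.
    Returns (event NRI, non-event NRI, overall NRI).
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, ev in zip(old_risk, new_risk, event):
        if ev:
            n_e += 1
            up_e += new > old      # risk correctly moved up
            down_e += new < old    # risk incorrectly moved down
        else:
            n_n += 1
            up_n += new > old      # risk incorrectly moved up
            down_n += new < old    # risk correctly moved down
    event_nri = (up_e - down_e) / n_e
    nonevent_nri = (down_n - up_n) / n_n
    return event_nri, nonevent_nri, event_nri + nonevent_nri
```

With censored failure times the event status at prediction time t is unknown for some subjects, which is why the abstract turns to weighting-based estimators rather than the simple counts used here.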
9.  Improving Case Definition of Crohn’s Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach 
Inflammatory bowel diseases  2013;19(7):1411-1420.
Prior studies identifying patients with inflammatory bowel disease (IBD) utilizing administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record (EMR) based model for classification of IBD leveraging the combination of codified data and information from clinical text notes using natural language processing (NLP).
Using the EMR of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥ 1 ICD-9 code for each disease. We utilized codified (i.e. ICD9 codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation was performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables.
We confirmed 399 (67%) CD cases in the CD training set and 378 (63%) UC cases in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve (AUC) for CD 0.95; UC 0.94) than models utilizing only disease ICD-9 codes (AUC 0.89 for CD; 0.86 for UC). Addition of NLP narrative terms to our final model resulted in classification of 6–12% more subjects with the same accuracy.
Inclusion of narrative concepts identified using NLP improves the accuracy of EMR case-definition for CD and UC while simultaneously identifying more subjects compared to models using codified data alone.
PMCID: PMC3665760  PMID: 23567779
Crohn’s disease; ulcerative colitis; disease cohort; natural language processing; informatics
10.  The association between low density lipoprotein (LDL) and RA genetic factors with LDL levels in rheumatoid arthritis and non-RA controls 
Annals of the rheumatic diseases  2013;73(6):1170-1175.
While genetic determinants of LDL cholesterol levels are well characterized in the general population, they are understudied in rheumatoid arthritis (RA). Our objective was to determine the association of established LDL and RA genetic alleles with LDL levels in RA cases compared to non-RA controls.
Using electronic medical records (EMR) data, we linked validated RA cases and non-RA controls to discarded blood samples. For each individual, we extracted data on: 1st LDL measurement, age, gender, and year of LDL measurement. We genotyped subjects for 11 LDL and 44 non-HLA RA alleles, and calculated RA and LDL genetic risk scores (GRS). We tested the association between each GRS and LDL level using multivariate linear regression models adjusted by age, gender, year of LDL measurement, and RA status.
Among 567 RA cases and 979 controls, 80% were female and the mean age at 1st LDL measurement was 55 years. RA cases had significantly lower mean LDL levels than controls (117.2 vs. 125.6mg/dL, respectively, p<0.0001). Each unit increase in LDL GRS was associated with 0.8mg/dL higher LDL levels in both RA cases and controls (p=3.0×10^−7). Each unit increase in RA GRS was associated with 4.3mg/dL lower LDL levels in both groups (p=0.01).
LDL alleles were associated with higher LDL levels in RA. RA alleles were associated with lower LDL levels in both RA cases and controls. Since RA cases carry more RA alleles, these findings suggest a genetic basis for epidemiologic observations of lower LDL levels in RA.
PMCID: PMC3815491  PMID: 23716066
Rheumatoid arthritis; low density lipoprotein; genetics; human leukocyte antigen
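The abstract does not say how the genetic risk scores were constructed. A common convention, assumed here purely for illustration, is a sum of risk-allele counts per variant, optionally weighted by a per-variant effect size. The function name, variant labels and weights in this sketch are hypothetical.

```python
def genetic_risk_score(allele_counts, weights=None):
    """Genetic risk score as a (weighted) sum of risk-allele counts.

    allele_counts -- dict mapping a variant label to the number of risk
                     alleles carried (0, 1 or 2)
    weights       -- optional dict mapping the same labels to effect-size
                     weights; when omitted, every variant has weight 1
    """
    if weights is None:
        weights = {snp: 1.0 for snp in allele_counts}
    return sum(weights[snp] * count for snp, count in allele_counts.items())
```

For example, `genetic_risk_score({"snp_a": 2, "snp_b": 1, "snp_c": 0})` gives an unweighted score of 3.0. A score of this kind could then enter a linear regression as a single covariate, which matches the "each unit increase in GRS" phrasing of the results.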
11.  Improvement in Stroke Risk Prediction: Role of c-reactive protein (CRP) and Lipoprotein-Associated Phospholipase A2 (Lp-PLA2) in the Women’s Health Initiative 
Background and Purpose
Classification of risk of ischemic stroke is important for medical care and public health reasons. Whether addition of biomarkers adds to predictive power of the Framingham Stroke Risk or other traditional risk factors has not been studied in older women.
The Hormones and Biomarkers Predicting Stroke (HaBPS) Study is a case-control study of blood biomarkers assayed in 972 ischemic stroke cases and 972 controls, nested in the Women’s Health Initiative Observational Study of 93,676 postmenopausal women followed for an average of 8 years. We evaluated additive predictive value of two commercially available biomarkers: c-reactive protein (CRP) and Lipoprotein-Associated Phospholipase A2 (Lp-PLA2) to determine if they added to risk prediction by the Framingham Stroke Risk Score (FSRS) or by traditional risk factors (TRF) which included lipids and other variables not included in the FSRS. As measures of additive predictive value, we used the c-statistic, Net Reclassification Improvement (NRI), category-less NRI, and Integrated Discrimination Improvement Index (IDI).
Addition of CRP to Framingham risk models or additional traditional risk factors overall modestly improved prediction of ischemic stroke and resulted in overall NRI of 6.3% (case NRI=3.9%, control NRI=2.4%). In particular, hs-CRP was useful in prediction of cardioembolic strokes (NRI=12.0%; 95%CI: 4.3-19.6%) and in strokes occurring in less than 3 years (NRI=7.9%, 95%CI: 0.8-14.9%). Lp-PLA2 was useful in risk prediction of large artery strokes (NRI=19.8%, 95%CI: 7.4-32.1%) and in early strokes (NRI=5.8%, 95%CI: 0.4-11.2%).
CRP and Lp-PLA2 can improve prediction of certain subtypes of ischemic stroke in older women, over the Framingham stroke risk model and traditional risk factors, and may help to guide surveillance and treatment of women at risk.
PMCID: PMC3556354  PMID: 23088183
12.  Similar risk of Depression and Anxiety following surgery or hospitalization for Crohn’s disease and Ulcerative colitis 
Psychiatric co-morbidity is common in Crohn’s disease (CD) and ulcerative colitis (UC). IBD-related surgery or hospitalizations represent major events in the natural history of disease. Whether there is a difference in risk of psychiatric co-morbidity following surgery in CD and UC has not been examined previously.
We used a multi-institution cohort of IBD patients without a diagnosis code for anxiety or depression preceding their IBD-related surgery or hospitalization. Demographic, disease, and treatment related variables were retrieved. Multivariate logistic regression analysis was performed to individually identify risk factors for depression and anxiety.
Our study included a total of 707 CD and 530 UC patients who underwent bowel resection surgery and did not have depression prior to surgery. The risk of depression 5 years after surgery was 16% and 11% in CD and UC respectively. We found no difference in the risk of depression following surgery in CD and UC patients (adjusted OR 1.11, 95%CI 0.84 – 1.47). Female gender, co-morbidity, immunosuppressant use, perianal disease, stoma surgery, and early surgery within 3 years of care predicted depression after CD-surgery; only female gender and co-morbidity predicted depression in UC. Only 12% of the CD cohort had ≥ 4 risk factors for depression, but among them nearly 44% subsequently received a diagnosis code for depression.
IBD-related surgery or hospitalization is associated with a significant risk for depression and anxiety with a similar magnitude of risk in both diseases.
PMCID: PMC3627544  PMID: 23337479
Crohn’s disease; depression; anxiety; surgery; hospitalization
13.  Subgroup specific incremental value of new markers for risk prediction 
Lifetime data analysis  2012;19(2):142-169.
In many clinical applications, understanding when measurement of new markers is necessary to provide added accuracy to existing prediction tools could lead to more cost effective disease management. Many statistical tools for evaluating the incremental value (IncV) of the novel markers over the routine clinical risk factors have been developed in recent years. However, most existing literature focuses primarily on global assessment. Since the IncVs of new markers often vary across subgroups, it would be of great interest to identify subgroups for which the new markers are most/least useful in improving risk prediction. In this paper we provide novel statistical procedures for systematically identifying potential traditional-marker based subgroups in whom it might be beneficial to apply a new model with measurements of both the novel and traditional markers. We consider various conditional time-dependent accuracy parameters for censored failure time outcome to assess the subgroup-specific IncVs. We provide non-parametric kernel-based estimation procedures to calculate the proposed parameters. Simultaneous interval estimation procedures are provided to account for sampling variation and adjust for multiple testing. Simulation studies suggest that our proposed procedures work well in finite samples. The proposed procedures are applied to the Framingham Offspring Study to examine the added value of an inflammation marker, C-reactive protein, on top of the traditional Framingham risk score for predicting 10-year risk of cardiovascular disease.
PMCID: PMC3633735  PMID: 23263882
Incremental value; Partial area under the ROC curve; Prognostic accuracy; Risk prediction; Subgroup analysis; Time dependent ROC analysis
14.  Autoantibodies, autoimmune risk alleles and clinical associations in rheumatoid arthritis cases and non-RA controls in the electronic medical records 
Arthritis and rheumatism  2013;65(3):571-581.
The significance of non-RA autoantibodies in patients with rheumatoid arthritis (RA) is unclear. We studied associations between autoimmune risk alleles and autoantibodies in RA cases and non-RA controls, and autoantibodies and clinical diagnoses from the electronic medical records (EMR).
We studied 1,290 RA cases and 1,236 non-RA controls of European genetic ancestry from the EMR from two large academic centers. We measured antibodies to citrullinated peptides (ACPA), anti-nuclear antibodies (ANA), antibodies to tissue transglutaminase (anti-tTG), antibodies to thyroid peroxidase (anti-TPO). We genotyped subjects for autoimmune risk alleles, and studied the association between number of autoimmune risk alleles and number of types of autoantibodies present. We conducted a phenome-wide association study (PheWAS) to study potential associations between autoantibodies and clinical diagnoses among RA cases and controls.
Mean age was 60.7 in RA and 64.6 years in controls, and both were 79% female. The prevalence of ACPA and ANA was higher in RA cases compared to controls (p<0.0001, both); we observed no difference in anti-TPO and anti-tTG. Carriage of higher numbers of autoimmune risk alleles was associated with increasing types of autoantibodies in RA cases (p=4.4×10^−6) and controls (p=0.002). From the PheWAS, ANA was significantly associated with Sjogren’s/sicca in RA cases.
The increased frequency of autoantibodies in RA cases and controls was associated with the number of autoimmune risk alleles carried by an individual. PheWAS analyses within the EMR linked to blood samples provide a novel method to test for the clinical significance of biomarkers in disease.
PMCID: PMC3582761  PMID: 23233247
15.  Psychiatric co-morbidity is Associated with Increased risk of Surgery in Crohn’s disease 
Psychiatric co-morbidity, in particular major depression and anxiety, is common in patients with Crohn’s disease (CD) and ulcerative colitis (UC). Prior studies examining this may be confounded by the co-existence of functional bowel symptoms. Limited data exist examining an association between depression or anxiety and disease-specific endpoints such as bowel surgery.
Using a multi-institution cohort of patients with CD and UC, we identified those who also had co-existing psychiatric co-morbidity (major depressive disorder or generalized anxiety). After excluding those diagnosed with such co-morbidity for the first time following surgery, we used multivariate logistic regression to examine the independent effect of psychiatric co-morbidity on IBD-related surgery and hospitalization. To account for confounding by disease severity, we adjusted for a propensity score estimating likelihood of psychiatric co-morbidity influenced by severity of disease in our models.
A total of 5,405 CD and 5,429 UC patients were included in this study; one-fifth had either major depressive disorder or generalized anxiety. In multivariate analysis, adjusting for potential confounders and the propensity score, presence of mood or anxiety co-morbidity was associated with a 28% increase in risk of surgery in CD (OR 1.28, 95% CI 1.03 – 1.57) but not UC (OR 1.01, 95% CI 0.80 – 1.28). Psychiatric co-morbidity was associated with increased healthcare utilization.
Depressive disorder or generalized anxiety is associated with a modestly increased risk of surgery in patients with CD. Interventions addressing this may improve patient outcomes.
PMCID: PMC3552092  PMID: 23289600
Crohn’s disease; ulcerative colitis; depression; surgery; hospitalization
16.  Resampling Procedures for Making Inference under Nested Case-control Studies 
The nested case-control (NCC) design has been widely adopted as a cost-effective solution in many large cohort studies for risk assessment with expensive markers, such as the emerging biologic and genetic markers. To analyze data from NCC studies, conditional logistic regression (Goldstein and Langholz, 1992; Borgan et al., 1995) and maximum likelihood (Scheike and Juul, 2004; Zeng et al., 2006) based methods have been proposed. However, most of these methods either cannot be easily extended beyond the Cox model (Cox, 1972) or require additional modeling assumptions. More generally applicable approaches based on inverse probability weighting (IPW) have been proposed as useful alternatives (Samuelsen, 1997; Chen, 2001; Samuelsen et al., 2007). However, due to the complex correlation structure induced by repeated finite risk set sampling, interval estimation for such IPW estimators remains challenging, especially when the estimation involves non-smooth objective functions or when making simultaneous inferences about functions. Standard resampling procedures such as the bootstrap cannot accommodate the correlation and thus are not directly applicable. In this paper, we propose a resampling procedure that can provide valid estimates for the distribution of a broad class of IPW estimators. Simulation results suggest that the proposed procedures perform well in settings where an analytical variance estimator is infeasible to derive or performs suboptimally. The new procedures are illustrated with data from the Framingham Offspring Study to characterize individual level cardiovascular risks over time based on the Framingham risk score, C-reactive protein (CRP) and a genetic risk score.
PMCID: PMC3891801  PMID: 24436503
Biomarker study; Interval Estimation; Inverse Probability Weighting; Nested case-control study; Resampling methods; Risk Prediction; Simultaneous Confidence Band; Survival Model
When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this paper, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, utilizing the data from a current study involving similar comparator treatments. Specifically, using the existing data, we first create a parametric scoring system as a function of multiple baseline covariates to estimate subject-specific treatment differences. Based on this scoring system, we specify a desired level of treatment difference and obtain a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated threshold-specific treatment difference curve across a range of score values is constructed. The subpopulation of patients satisfying any given level of treatment benefit can then be identified accordingly. To avoid bias due to overoptimism, we utilize a cross-training-evaluation method for implementing the above two-step procedure. We then show how to select the best scoring system among all competing models. Furthermore, for cases in which only a single pre-specified working model is involved, inference procedures are proposed for the average treatment difference over a range of score values using the entire data set, and are justified theoretically and numerically. Lastly, the proposals are illustrated with the data from two clinical trials in treating HIV and cardiovascular diseases.
Note that if we are not interested in designing a new study for comparing similar treatments, the new procedure can also be quite useful for the management of future patients, so that treatment may be targeted towards those who would receive nontrivial benefits to compensate for the risk or cost of the new treatment.
PMCID: PMC3775385  PMID: 24058223
Cross-training-evaluation; Lasso procedure; Personalized medicine; Prediction; Ridge regression; Stratified medicine; Subgroup analysis; Variable selection
18.  Omnibus Risk Assessment via Accelerated Failure Time Kernel Machine Modeling 
Biometrics  2013;69(4):10.1111/biom.12098.
Integrating genomic information with traditional clinical risk factors to improve the prediction of disease outcomes could profoundly change the practice of medicine. However, the large number of potential markers and possible complexity of the relationship between markers and disease make it difficult to construct accurate risk prediction models. Standard approaches for identifying important markers often rely on marginal associations or linearity assumptions and may not capture non-linear or interactive effects. In recent years, much work has been done to group genes into pathways and networks. Integrating such biological knowledge into statistical learning could potentially improve model interpretability and reliability. One effective approach is to employ a kernel machine (KM) framework, which can capture nonlinear effects if nonlinear kernels are used (Scholkopf and Smola, 2002; Liu et al., 2007, 2008). For survival outcomes, KM regression modeling and testing procedures have been derived under a proportional hazards (PH) assumption (Li and Luan, 2003; Cai et al., 2011). In this paper, we derive testing and prediction methods for KM regression under the accelerated failure time model, a useful alternative to the PH model. We approximate the null distribution of our test statistic using resampling procedures. When multiple kernels are of potential interest, it may be unclear in advance which kernel to use for testing and estimation. We propose a robust Omnibus Test that combines information across kernels, and an approach for selecting the best kernel for estimation. The methods are illustrated with an application in breast cancer.
PMCID: PMC3869038  PMID: 24328713
Accelerated Failure Time Model; Kernel Machines; Omnibus Test; Resampling; Risk Prediction; Survival Analysis
19.  Joint Effects of Common Genetic Variants on the Risk for Type 2 Diabetes in U.S. Men and Women of European Ancestry 
Annals of internal medicine  2009;150(8):541-550.
Genome-wide association studies have identified novel type 2 diabetes loci, each of which has a modest impact on risk.
To examine the joint effects of several type 2 diabetes risk variants and their combination with conventional risk factors on type 2 diabetes risk in 2 prospective cohorts.
Nested case–control study.
United States.
2809 patients with type 2 diabetes and 3501 healthy control participants of European ancestry from the Health Professionals Follow-up Study and Nurses’ Health Study.
A genetic risk score (GRS) was calculated on the basis of 10 polymorphisms in 9 loci.
After adjustment for age and body mass index (BMI), the odds ratio for type 2 diabetes with each point of GRS, corresponding to 1 risk allele, was 1.19 (95% CI, 1.14 to 1.24) and 1.16 (CI, 1.12 to 1.20) for men and women, respectively. Persons with a BMI of 30 kg/m² or greater and a GRS in the highest quintile had an odds ratio of 14.06 (CI, 8.90 to 22.18) compared with persons with a BMI less than 25 kg/m² and a GRS in the lowest quintile after adjustment for age and sex. Persons with a positive family history of diabetes and a GRS in the highest quintile had an odds ratio of 9.20 (CI, 5.50 to 15.40) compared with persons without a family history of diabetes and with a GRS in the lowest quintile. The addition of the GRS to a model of conventional risk factors improved discrimination by 1% (P < 0.001).
The study focused only on persons of European ancestry; whether GRS is associated with type 2 diabetes in other ethnic groups remains unknown.
Although its discriminatory value is currently limited, a GRS that combines information from multiple genetic variants might be useful for identifying subgroups with a particularly high risk for type 2 diabetes.
PMCID: PMC3825275  PMID: 19380854
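To make the per-allele odds ratio in the abstract above concrete, the sketch below counts risk alleles into a score and scales a per-allele odds ratio over a difference of several points. It assumes an unweighted allele count and a constant per-allele effect on the log-odds scale, as in a logistic model; the function names and the example genotype are hypothetical, and the extrapolated value is illustrative rather than a result reported by the study.

```python
def genetic_risk_score(risk_allele_counts):
    """Unweighted score: total number of risk alleles across polymorphisms.

    Each entry is the number of risk alleles (0, 1 or 2) at one polymorphism.
    """
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("each count must be 0, 1 or 2")
    return sum(risk_allele_counts)


def odds_ratio_for_difference(per_allele_or, point_difference):
    """Odds ratio implied by a difference of `point_difference` score points,
    assuming the same multiplicative effect for every additional allele."""
    return per_allele_or ** point_difference


# Hypothetical genotype at 10 polymorphisms.
example_counts = [1, 0, 2, 1, 1, 0, 2, 1, 0, 1]
score = genetic_risk_score(example_counts)  # 9 risk alleles

# Implied odds ratio for a 5-point difference at 1.19 per allele.
implied_or = odds_ratio_for_difference(1.19, 5)
```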
20.  Modeling Disease Severity in Multiple Sclerosis Using Electronic Health Records 
PLoS ONE  2013;8(11):e78927.
To optimally leverage the scalability and unique features of the electronic health records (EHR) for research that would ultimately improve patient care, we need to accurately identify patients and extract clinically meaningful measures. Using multiple sclerosis (MS) as a proof of principle, we showcased how to leverage routinely collected EHR data to identify patients with a complex neurological disorder and derive an important surrogate measure of disease severity heretofore only available in research settings.
In a cross-sectional observational study, 5,495 MS patients were identified from the EHR systems of two major referral hospitals using an algorithm that includes codified and narrative information extracted using natural language processing. In the subset of patients who receive neurological care at an MS Center where disease measures have been collected, we used routinely collected EHR data to extract two aggregate indicators of MS severity of clinical relevance: multiple sclerosis severity score (MSSS) and brain parenchymal fraction (BPF, a measure of whole brain volume).
The EHR algorithm that identifies MS patients has an area under the curve of 0.958, 83% sensitivity, 92% positive predictive value, and 89% negative predictive value when a 95% specificity threshold is used. The correlation between EHR-derived and true MSSS has a mean R² = 0.38±0.05, and that between EHR-derived and true BPF has a mean R² = 0.22±0.08. To illustrate its clinical relevance, derived MSSS captures the expected difference in disease severity between relapsing-remitting and progressive MS patients after adjusting for sex, age of symptom onset and disease duration (p = 1.56×10⁻¹²).
Incorporation of sophisticated codified and narrative EHR data accurately identifies MS patients and provides estimation of a well-accepted indicator of MS severity that is widely used in research settings but not part of the routine medical records. Similar approaches could be applied to other complex neurological disorders.
PMCID: PMC3823928  PMID: 24244385
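The accuracy measures quoted in the abstract above (sensitivity, specificity, positive and negative predictive value) all derive from the four cells of a confusion matrix. The sketch below shows those standard definitions; the function name and the counts are hypothetical toy values, not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard accuracy measures from confusion-matrix counts.

    tp/fn: cases flagged / missed by the algorithm;
    tn/fp: non-cases correctly excluded / wrongly flagged.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are excluded
        "ppv": tp / (tp + fp),          # share of flagged subjects who are cases
        "npv": tn / (tn + fn),          # share of excluded subjects who are non-cases
    }


# Toy counts (hypothetical): 50 true cases and 50 non-cases.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Fixing the specificity, as the abstract does, amounts to choosing the algorithm's cutoff so that `tn / (tn + fp)` hits the target, and then reading off the other three measures at that cutoff.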
21.  Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test 
Biostatistics (Oxford, England)  2012;13(4):776-790.
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics 9, 292–2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. 
In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
PMCID: PMC3440238  PMID: 22734045
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
22.  Landmark Prediction of Long Term Survival Incorporating Short Term Event Time Information 
In recent years, a wide range of markers have become available as potential tools to predict risk or progression of disease. In addition to such biological and genetic markers, short term outcome information may be useful in predicting long term disease outcomes. When such information is available, it would be desirable to combine this along with predictive markers to improve the prediction of long term survival. Most existing methods for incorporating censored short term event information in predicting long term survival focus on modeling the disease process and are derived under restrictive parametric models in a multi-state survival setting. When such model assumptions fail to hold, the resulting prediction of long term outcomes may be invalid or inaccurate. When there is only a single discrete baseline covariate, a fully non-parametric estimation procedure to incorporate short term event time information has been previously proposed. However, such an approach is not feasible for settings with one or more continuous covariates due to the curse of dimensionality. In this paper, we propose to incorporate short term event time information along with multiple covariates collected up to a landmark point via a flexible varying-coefficient model. To evaluate and compare the prediction performance of the resulting landmark prediction rule, we use robust non-parametric procedures which do not require the correct specification of the proposed varying coefficient model. Simulation studies suggest that the proposed procedures perform well in finite samples. We illustrate them here using a dataset of post-dialysis patients with end-stage renal disease.
PMCID: PMC3535339  PMID: 23293405
Landmark Prediction; Risk Prediction; Survival Time; Varying Coefficient Model
23.  A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data 
Statistics in medicine  2012;32(14):2430-2442.
Risk prediction procedures can be quite useful for the patient’s treatment selection, prevention strategy, or disease management in evidence-based medicine. Often, potentially important new predictors are available in addition to the conventional markers. The question is how to quantify the improvement from the new markers for prediction of the patient’s risk in order to aid cost–benefit decisions. The standard method, using the area under the receiver operating characteristic curve, to measure the added value may not be sensitive enough to capture incremental improvements from the new markers. Recently, some novel alternatives to area under the receiver operating characteristic curve, such as integrated discrimination improvement and net reclassification improvement, were proposed. In this paper, we consider a class of measures for evaluating the incremental values of new markers, which includes the preceding two as special cases. We present a unified procedure for making inferences about measures in the class with censored event time data. The large sample properties of our procedures are theoretically justified. We illustrate the new proposal with data from a cancer study to evaluate a new gene score for prediction of the patient’s survival.
PMCID: PMC3734387  PMID: 23037800
area under the receiver operating characteristic curve; C-statistic; Cox’s regression; integrated discrimination improvement; net reclassification improvement; risk prediction
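For orientation, the two measures named in the abstract above can be sketched in their simplest form, for a fully observed binary outcome. This deliberately ignores censoring, which is the setting the paper actually addresses, so it is a simplified illustration under that assumption and not the authors' procedure. The function names and toy risks are hypothetical; the net reclassification improvement shown is the category-free version.

```python
from statistics import mean


def integrated_discrimination_improvement(events, old_risk, new_risk):
    """IDI: gain in mean predicted risk among events minus the gain
    among non-events when moving from the old to the new model."""
    change_events = [n - o for e, o, n in zip(events, old_risk, new_risk) if e]
    change_nonevents = [n - o for e, o, n in zip(events, old_risk, new_risk) if not e]
    return mean(change_events) - mean(change_nonevents)


def net_reclassification_improvement(events, old_risk, new_risk):
    """Category-free NRI: net proportion of events whose risk moves up
    plus net proportion of non-events whose risk moves down."""
    def net_up(flag):
        pairs = [(o, n) for e, o, n in zip(events, old_risk, new_risk) if bool(e) == flag]
        up = sum(1 for o, n in pairs if n > o)
        down = sum(1 for o, n in pairs if n < o)
        return (up - down) / len(pairs)

    return net_up(True) - net_up(False)


# Toy data (hypothetical): two events and two non-events.
events = [1, 1, 0, 0]
old_risk = [0.5, 0.6, 0.4, 0.5]
new_risk = [0.7, 0.6, 0.3, 0.5]

idi = integrated_discrimination_improvement(events, old_risk, new_risk)
nri = net_reclassification_improvement(events, old_risk, new_risk)
```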
24.  Pathprinting: An integrative approach to understand the functional basis of disease 
Genome Medicine  2013;5(7):68.
New strategies to combat complex human disease require systems approaches to biology that integrate experiments from cell lines, primary tissues and model organisms. We have developed Pathprint, a functional approach that compares gene expression profiles in a set of pathways, networks and transcriptionally regulated targets. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. We show that pathprints combine mouse and human blood developmental lineage, and can be used to identify new prognostic indicators in acute myeloid leukemia. The code and resources are available at
PMCID: PMC3971351  PMID: 23890051
25.  Evaluating the Predictive Value of Biomarkers with Stratified Case-Cohort Design 
Biometrics  2012;68(4):1219-1227.
Identification of novel biomarkers for risk assessment is important for both effective disease prevention and optimal treatment recommendation. Discovery relies on the precious yet limited resource of stored biological samples from large prospective cohort studies. The case-cohort sampling design provides a cost-effective tool in the context of biomarker evaluation, especially when the clinical condition of interest is rare. Existing statistical methods focus on making efficient inference on relative hazard parameters from the Cox regression model. Drawing on recent theoretical developments on the weighted likelihood for semiparametric models under two-phase studies (Breslow and Wellner, 2007), we propose statistical methods to evaluate the accuracy and predictiveness of a risk prediction biomarker, with a censored time-to-event outcome under stratified case-cohort sampling. We consider nonparametric methods and a semiparametric method. We derive large sample properties of the proposed estimators and evaluate their finite sample performance using numerical studies. We illustrate the new procedures using data from the Framingham Offspring Study to evaluate the accuracy of a recently developed risk score incorporating biomarker information for predicting cardiovascular disease.
PMCID: PMC3718317  PMID: 23173848
Case Cohort Sampling; Negative predictive value; Positive predictive value; Receiver Operating Characteristics Curve (ROC curve); Integrated Discrimination Improvement (IDI); Risk prediction; Survival analysis; Two-phase study
