Results 1-25 (103)

author:("Cai, tianqi")
1.  Large-scale identification of patients with cerebral aneurysms using natural language processing 
Neurology  2017;88(2):164-168.
To use natural language processing (NLP) in conjunction with the electronic medical record (EMR) to accurately identify patients with cerebral aneurysms and their matched controls.
ICD-9 and Current Procedural Terminology codes were used to obtain an initial data mart of potential aneurysm patients from the EMR. NLP was then used to train a classification algorithm with .632 bootstrap cross-validation used for correction of overfitting bias. The classification rule was then applied to the full data mart. Additional validation was performed on 300 patients classified as having aneurysms. Controls were obtained by matching age, sex, race, and healthcare use.
We identified 55,675 patients of 4.2 million patients with ICD-9 and Current Procedural Terminology codes consistent with cerebral aneurysms. Of those, 16,823 patients had the term aneurysm occur near relevant anatomic terms. After training, a final algorithm consisting of 8 coded and 14 NLP variables was selected, yielding an overall area under the receiver-operating characteristic curve of 0.95. After the final algorithm was applied, 5,589 patients were classified as having aneurysms, and 54,952 controls were matched to those patients. The positive predictive value based on a validation cohort of 300 patients was 0.86.
We harnessed the power of the EMR by applying NLP to obtain a large cohort of patients with intracranial aneurysms and their matched controls. Such algorithms can be generalized to other diseases for epidemiologic and genetic studies.
PMCID: PMC5224711  PMID: 27927935
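The abstract above mentions .632 bootstrap correction for overfitting bias but does not spell out the computation. As a minimal sketch, assuming the standard .632 estimator (0.368 times the apparent error plus 0.632 times the out-of-bag error), the idea can be illustrated as follows. The function name, the `fit` callback, and the per-replicate averaging of out-of-bag error are illustrative simplifications, not the authors' implementation.

```python
import random


def bootstrap_632_error(xs, ys, fit, n_boot=200, seed=0):
    """Estimate misclassification error with the .632 bootstrap.

    `fit(xs, ys)` is a hypothetical callback that returns a predictor
    (a function mapping one x to a predicted label).
    """
    rng = random.Random(seed)
    n = len(xs)

    # Apparent error: train and evaluate on the same data (optimistic).
    model = fit(xs, ys)
    apparent = sum(model(x) != y for x, y in zip(xs, ys)) / n

    # Out-of-bag error: evaluate each bootstrap fit on the observations
    # that were not drawn into that bootstrap sample (pessimistic).
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # every observation was drawn; nothing to test on
        m = fit([xs[i] for i in idx], [ys[i] for i in idx])
        oob_errors.append(sum(m(xs[i]) != ys[i] for i in oob) / len(oob))
    oob_error = sum(oob_errors) / len(oob_errors)

    # The .632 estimator blends the two error estimates.
    return 0.368 * apparent + 0.632 * oob_error


# Toy usage with a rule that ignores the training data entirely.
toy_xs = list(range(-10, 10))
toy_ys = [x >= 0 for x in toy_xs]
perfect = bootstrap_632_error(toy_xs, toy_ys, lambda X, Y: (lambda x: x >= 0))
always_wrong = bootstrap_632_error(toy_xs, toy_ys, lambda X, Y: (lambda x: x < 0))
```

A rule that is always right gets an estimated error of 0, and one that is always wrong gets an estimated error of 1; real classifiers fall in between, with the out-of-bag term pulling the estimate above the apparent error.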
2.  Origins of lymphatic and distant metastases in human colorectal cancer 
Science (New York, N.Y.)  2017;357(6346):55-60.
The spread of cancer cells from primary tumors to regional lymph nodes is often associated with reduced survival. One prevailing model to explain this association posits that fatal, distant metastases are seeded by lymph node metastases. This view provides a mechanistic basis for the TNM staging system and is the rationale for surgical resection of tumor-draining lymph nodes. Here, we examine the evolutionary relationship between primary tumor, lymph node and distant metastases in human colorectal cancer. Studying 213 archival biopsy samples from 17 patients, we used somatic variants in hypermutable DNA regions to reconstruct high-confidence phylogenetic trees. We found that in 65% of cases, lymphatic and distant metastases arose from independent subclones in the primary tumor, whereas in 35% of cases they shared common subclonal origin. Therefore, two different lineage relationships between lymphatic and distant metastases exist in colorectal cancer.
PMCID: PMC5536201  PMID: 28684519
3.  A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring 
Biometrics  2017;73(4):1199-1209.
Many statistical methods have recently been developed for identifying subgroups of patients who may benefit from different available treatments. Compared with the traditional outcome-modeling approaches, these methods focus on modeling interactions between the treatments and covariates while bypassing or minimizing the modeling of the main effects of covariates, because the subgroup identification depends only on the sign of the interaction. However, these methods are scattered and often narrow in scope. In this paper, we propose a general framework, by weighting and A-learning, for subgroup identification in both randomized clinical trials and observational studies. Our framework involves minimum modeling for the relationship between the outcome and covariates pertinent to the subgroup identification. Under the proposed framework, we may also estimate the magnitude of the interaction, which leads to the construction of a scoring system measuring the individualized treatment effect. The proposed methods are quite flexible and include many recently proposed estimators as special cases. As a result, some estimators originally proposed for randomized clinical trials can be extended to observational studies, and procedures based on the weighting method can be converted to an A-learning method and vice versa. Our approaches also allow straightforward incorporation of regularization methods for high-dimensional data, as well as possible efficiency augmentation and generalization to multiple treatments. We examine the empirical performance of several procedures belonging to the proposed framework through extensive numerical studies.
PMCID: PMC5561419  PMID: 28211943
A-learning; Individualized treatment rules; Observational studies; Propensity score; Regularization
4.  Augmented Estimation for t-year Survival with Censored Regression Models 
Biometrics  2017;73(4):1169-1178.
Reliable and accurate risk prediction is fundamental for successful management of clinical conditions. Estimating comprehensive risk prediction models precisely, however, is a difficult task, especially when the outcome of interest is time to a rare event and the number of candidate predictors, p, is not very small. Another challenge in developing accurate risk models arises from potential model misspecification. Time-specific generalized linear models estimated with inverse censoring probability weighting are robust to model misspecification, but may be inefficient in the rare event setting. To improve the efficiency of such robust estimation procedures, various augmentation methods have been proposed in the literature. These procedures can also leverage auxiliary variables such as intermediate outcomes that are predictive of event risk. However, most existing methods do not perform well in the rare event setting, especially when p is not small. In this paper, we propose a two-step, imputation-based augmentation procedure that can improve estimation efficiency and that is robust to model misspecification. We also develop regularized augmentation procedures for settings where p is not small, along with procedures to improve the estimation of individualized treatment effect in risk reduction. Numerical studies suggest that our proposed methods substantially outperform existing methods in efficiency gains. The proposed methods are applied to an AIDS clinical trial for treating HIV-infected patients.
PMCID: PMC5592155  PMID: 28294286
Efficiency augmentation; Intermediate outcomes; Model misspecification; Risk prediction; Robustness; Survival
5.  Analysis of Multiple Diverse Phenotypes via Semiparametric Canonical Correlation Analysis 
Biometrics  2017;73(4):1254-1265.
Studying multiple outcomes simultaneously allows researchers to begin to identify underlying factors that affect all of a set of diseases (i.e., shared etiology) and what may give rise to differences in disorders between patients (i.e., disease subtypes). In this work, our goal is to build risk scores that are predictive of multiple phenotypes simultaneously and identify subpopulations at high risk of multiple phenotypes. Such analyses could yield insight into etiology or point to treatment and prevention strategies. The standard canonical correlation analysis (CCA) can be used to relate multiple continuous outcomes to multiple predictors. However, in order to capture the full complexity of a disorder, phenotypes may include a diverse range of data types, including binary, continuous, ordinal, and censored variables. When phenotypes are diverse in this way, standard CCA is not possible and no methods currently exist to model them jointly. In the presence of such complications, we propose a semi-parametric CCA method to develop risk scores that are predictive of multiple phenotypes. To guard against potential model mis-specification, we also propose a nonparametric calibration method to identify subgroups that are at high risk of multiple disorders. A resampling procedure is also developed to account for the variability in these estimates. Our method opens the door to synthesizing a wide array of data sources for the purposes of joint prediction.
PMCID: PMC5640493  PMID: 28407213
Canonical correlation analysis; Multiple phenotypes; Nonparametric calibration; Semi-parametric transformation models
6.  Prioritizing Individual Genetic Variants After Kernel Machine Testing Using Variable Selection 
Genetic epidemiology  2016;40(8):722-731.
Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusions only at the SNP-set level and does not directly indicate which SNPs in an identified set are actually driving the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity By State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and a real data application are used to demonstrate the proposed approach.
PMCID: PMC5118060  PMID: 27488097
Genetic association studies; Kernel machine methods; KNIFE; Set-based; Variable selection
The annals of applied statistics  2017;11(2):638-654.
Cost-effective yet efficient designs are critical to the success of biomarker evaluation research. Two-phase sampling designs, under which expensive markers are only measured on a subsample of cases and non-cases within a prospective cohort, are useful in novel biomarker studies for preserving study samples and minimizing cost of biomarker assaying. Statistical methods for quantifying the predictiveness of biomarkers under two-phase studies have been proposed (Cai and Zheng, 2012; Liu, Cai and Zheng, 2012). These methods are based on a class of inverse probability weighted (IPW) estimators where weights are ‘true’ sampling weights that simply reflect the sampling strategy of the study. While simple to implement, existing IPW estimators are limited by lack of practicality and efficiency. In this manuscript, we investigate a variety of two-phase design options and provide statistical approaches aimed at improving the efficiency of simple IPW estimators by incorporating auxiliary information available for the entire cohort. We consider accuracy summary estimators that accommodate auxiliary information in the context of evaluating the incremental values of novel biomarkers over existing prediction tools. In addition, we evaluate the relative efficiency of a variety of sampling and estimation options under two-phase studies, shedding light on issues pertaining to both the design and analysis of biomarker validation studies. We apply our methods to the evaluation of a novel biomarker for liver cancer risk conducted with a two-phase nested case control design (Lok et al., 2010).
PMCID: PMC5604898  PMID: 28943991
biomarker; prediction accuracy; risk prediction; two-phase study
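The inverse probability weighting idea described in the abstract above can be illustrated with a generic weighted mean, in which each subject measured in phase two is weighted by the reciprocal of its known sampling probability. This is only a sketch of the general IPW principle under assumed inputs; the function name and the toy numbers are hypothetical, and it is not the accuracy estimator proposed in the paper.

```python
def ipw_mean(values, sampling_probs):
    """Weighted mean of phase-two measurements, each weighted by 1 / P(selected)."""
    if len(values) != len(sampling_probs):
        raise ValueError("values and sampling_probs must have the same length")
    weights = [1.0 / p for p in sampling_probs]
    # Normalizing by the sum of weights gives a ratio-type (Hajek) estimator.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical example: one subject sampled with certainty (e.g. a case)
# and two subjects each sampled with probability 0.5 (e.g. non-cases).
example_estimate = ipw_mean([1.0, 0.0, 1.0], [1.0, 0.5, 0.5])
```

Subjects who were unlikely to be sampled stand in for more members of the full cohort, so they receive larger weights. In the toy example the weights are 1, 2 and 2.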
8.  Inference for survival prediction under the regularized Cox model 
Biostatistics (Oxford, England)  2016;17(4):692-707.
When a moderate number of potential predictors are available and a survival model is fit with regularization to achieve variable selection, providing accurate inference on the predicted survival can be challenging. We investigate inference on the predicted survival estimated after fitting a Cox model under regularization guaranteeing the oracle property. We demonstrate that existing asymptotic formulas for the standard errors of the coefficients tend to underestimate the variability for some coefficients, while typical resampling such as the bootstrap tends to overestimate it; these approaches can both lead to inaccurate variance estimation for predicted survival functions. We propose a two-stage adaptation of a resampling approach that brings the estimated error in line with the truth. In stage 1, we estimate the coefficients in the observed data set and in $B$ resampled data sets, and allow the resampled coefficient estimates to vote on whether each coefficient should be 0. For those coefficients voted as zero, we set both the point and interval estimates to $\{0\}$. In stage 2, to make inference about coefficients not voted as zero in stage 1, we refit the penalized model in the observed data and in the $B$ resampled data sets with only variables corresponding to those coefficients. We demonstrate that ensemble voting-based point and interval estimators of the coefficients perform well in finite samples, and prove that the point estimator maintains the oracle property. We extend this approach to derive inference procedures for survival functions and demonstrate that our proposed interval estimation procedures substantially outperform estimators based on asymptotic inference or standard bootstrap. We further illustrate our proposed procedures to predict breast cancer survival in a gene expression study.
PMCID: PMC5031946  PMID: 27107008
Bootstrap; Ensemble methods; Oracle property; Proportional hazards model; Regularized estimation; Resampling; Risk prediction; Simultaneous confidence intervals; Survival functions
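The stage 1 voting step described in the abstract above can be sketched roughly as follows. The abstract does not state the voting rule, so the simple-majority threshold here is an assumption, as are the function name and the toy coefficient values. The penalized Cox fits that would produce the resampled estimates are not shown.

```python
def vote_zero(resampled_coefs, threshold=0.5):
    """Return one boolean per coefficient: True if it is voted to be 0.

    `resampled_coefs` holds one list of coefficient estimates per
    resampled data set. A coefficient is voted zero when the share of
    resamples estimating it as exactly 0 exceeds `threshold`
    (an assumed majority rule).
    """
    n_resamples = len(resampled_coefs)
    n_coefs = len(resampled_coefs[0])
    votes = []
    for j in range(n_coefs):
        zero_count = sum(1 for est in resampled_coefs if est[j] == 0.0)
        votes.append(zero_count / n_resamples > threshold)
    return votes


# Hypothetical estimates from B = 4 resampled data sets, 3 coefficients each.
example_coefs = [
    [0.0, 1.2, 0.0],
    [0.0, 0.9, 0.3],
    [0.0, 1.1, 0.0],
    [0.1, 1.0, 0.0],
]
voted_zero = vote_zero(example_coefs)
```

In this toy input the first and third coefficients are zero in three of four resamples and are voted zero. Only the second would be carried into the stage 2 refit.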
9.  Structured Matrix Completion with Applications to Genomic Data Integration 
Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival.
PMCID: PMC5198844  PMID: 28042188
Constrained minimization; genomic data integration; low-rank matrix; matrix completion; singular value decomposition; structured matrix completion
10.  Statin use is Associated With Reduced Risk of Colorectal Cancer in Patients with Inflammatory Bowel Diseases 
Background & Aims
Inflammatory bowel diseases (IBDs) such as Crohn’s disease and ulcerative colitis are associated with an increased risk of colorectal cancer (CRC). Chemopreventive strategies have produced weak or inconsistent results. Statins have been inversely associated with sporadic CRC. We examined their role as chemopreventive agents in patients with IBD.
We collected data from 11,001 patients with IBD receiving care at hospitals in the Greater Boston metropolitan area from 1998 through 2010. Diagnoses of CRC were determined using validated ICD-9-CM codes. Statin use prior to diagnosis was assessed through analysis of electronic prescriptions. We performed multivariate logistic regression analyses, adjusting for potential confounders including primary sclerosing cholangitis, smoking, increased levels of inflammation markers, and CRC screening practices to identify independent association between statin use and CRC. We performed sensitivity analyses using propensity score adjustment and variation in definition of statin use.
In our cohort 1376 of the patients (12.5%) received 1 or more prescriptions for a statin. Patients using statins were more likely to be older, male, white, smokers, and have greater comorbidity than non-users. Over a follow-up period of 9 years, 2% of statin users developed CRC compared to 3% of non-users (age-adjusted odds ratio, 0.35; 95% confidence interval, 0.24–0.53). On multivariate analysis, statin use remained independently and inversely associated with CRC (odds ratio, 0.42; 95% confidence interval, 0.28–0.62). Our findings were robust on a variety of sensitivity and subgroup analyses.
Statin use is inversely associated with risk of CRC in a large IBD cohort. Prospective studies on the role of statins as chemopreventive agents are warranted.
PMCID: PMC4912917  PMID: 26905907
Crohn’s disease; ulcerative colitis; statin; colon cancer; HMG-CoA reductase inhibitors; lipid-lowering drug
11.  Genomewide Association Studies of Posttraumatic Stress Disorder in Two Cohorts of US Army Soldiers 
JAMA psychiatry  2016;73(7):695-704.
Posttraumatic stress disorder (PTSD) is a prevalent, serious public health concern, particularly in the military. The identification of genetic risk factors for PTSD may provide important insights into the biological basis of vulnerability and comorbidity.
To discover genetic loci associated with lifetime PTSD risk in two cohorts from the Army Study To Assess Risk and Resilience in Servicemembers (Army STARRS).
Design, Setting and Participants
Two coordinated genomewide association studies of mental health in the US military: New Soldier Study (NSS, N=3167 cases and 4607 trauma-exposed controls) and Pre/Post Deployment Study (PPDS, N=947 cases and 4969 trauma-exposed controls). The primary analysis compared lifetime DSM-IV PTSD cases to trauma-exposed controls without lifetime PTSD.
Main Outcomes and Measures
Association analyses were conducted for PTSD using logistic regression models within each of 3 ancestral groups (European, African, Latino) by study and then meta-analyzed. Heritability and genetic correlation and pleiotropy with other psychiatric and immune-related disorders were estimated.
We observed a genomewide significant locus in ANKRD55 on chromosome 5 (rs159572; odds ratio [OR] = 1.62, p-value = 2.43×10⁻⁸; adjusted for cumulative trauma exposure [AOR] = 1.68, p-value = 1.18×10⁻⁸) in the African American samples from NSS. We also observed a genomewide significant locus in or near ZNF626 on chromosome 19 (rs11085374; OR = 0.77, p-value = 4.59×10⁻⁸) in the European American samples from NSS. We did not find similar results for either SNP in the corresponding ancestry group from the PPDS sample, or in other ancestral groups or trans-ancestral meta-analyses. SNP-based heritability was non-significant, and no significant genetic correlations were observed between PTSD and six mental disorders and nine immune-related disorders. Significant evidence of pleiotropy was observed between PTSD and rheumatoid arthritis and, to a lesser extent, psoriasis.
Conclusions and Relevance
In the largest GWAS of PTSD to date, involving a US military sample, we found limited evidence of association for specific loci. Further efforts are needed to replicate the genomewide significant association with ANKRD55 – associated in prior research with several autoimmune and inflammatory disorders – and to clarify the nature of the genetic overlap observed between PTSD and rheumatoid arthritis and psoriasis.
PMCID: PMC4936936  PMID: 27167565
genomewide association; genetic; immune; inflammatory; military; posttraumatic stress disorder; pleiotropy; risk; trauma
12.  Estimation and testing for multiple regulation of multivariate mixed outcomes 
Biometrics  2016;72(4):1194-1205.
Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully-observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
PMCID: PMC5459434  PMID: 26910481
multiple phenotypes; familywise error control; semiparametric models; multiple testing; hierarchical lasso; multiple regulation; resampling; stepdown testing
13.  Suicidal Behavior-Related Hospitalizations among Pregnant Women in the United States, 2006 – 2012 
Archives of women's mental health  2015;19(3):463-472.
Suicide is one of the leading causes of maternal mortality in many countries, but little is known about the epidemiology of suicide and suicidal behavior among pregnant women in the US. We sought to examine trends and provide nationally representative estimates for suicidal behavior (including suicidal ideation, and suicide and self-inflicted injury) among pregnant women from 2006 to 2012 in the US.
Pregnant women aged 12-55 years were identified through pregnancy- and delivery-related hospitalization records from the National (Nationwide) Inpatient Sample. Suicidal behavior was identified by the International Classification of Diseases, Ninth Revision, Clinical Modification codes. Annual, nationwide estimates and trends were determined using discharge and hospital weights.
The prevalence of suicidal ideation more than doubled from 2006 to 2012 (47.5 to 115.0 per 100,000 pregnancy- and delivery-related hospitalizations), whereas the prevalence of suicide and self-inflicted injury remained stable. Nearly 10% of suicidal behavior occurred in the 12-18-year group, showing the highest prevalence per 100,000 pregnancy- and delivery-related hospitalizations (158.8 in 2006 and 308.7 in 2012) over the study period. For suicidal ideation, blacks had higher prevalence than whites; women in the lowest income quartile had the highest prevalence. Although the prevalence of suicidal behavior was higher among hospitalizations with depression diagnoses, more than 30% of hospitalizations were for suicidal behavior without depression diagnoses.
Our findings highlight the increasing burden and racial differences in suicidal ideation among US pregnant women. Targeted suicide prevention efforts are needed for high-risk pregnant women including teens, blacks, and low-income women.
PMCID: PMC4871736  PMID: 26680447
suicidal ideation; suicide and self-inflicted injury; pregnant women; National (Nationwide) Inpatient Sample
14.  Testing Differential Networks with Applications to Detecting Gene-by-Gene Interactions 
Biometrika  2015;102(2):247-266.
Model organisms and human studies have led to increasing empirical evidence that interactions among genes contribute broadly to genetic variation of complex traits. In the presence of gene-by-gene interactions, the dimensionality of the feature space becomes extremely high relative to the sample size. This imposes a significant methodological challenge in identifying gene-by-gene interactions. In the present paper, through a Gaussian graphical model framework, we translate the problem of identifying gene-by-gene interactions associated with a binary trait D into an inference problem on the difference of two high-dimensional precision matrices, which summarize the conditional dependence network structures of the genes. We propose a procedure for testing the differential network globally that is particularly powerful against sparse alternatives. In addition, a multiple testing procedure with false discovery rate control is developed to infer the specific structure of the differential network. Theoretical justification is provided to ensure the validity of the proposed tests and optimality results are derived under sparsity assumptions. A simulation study demonstrates that the proposed tests maintain the desired error rates under the null and have good power under the alternative. The methods are applied to a breast cancer gene expression study.
PMCID: PMC5426514
Differential network; false discovery rate; Gaussian graphical model; gene-by-gene interaction; high-dimensional precision matrix; large scale multiple testing
15.  L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs 
It is known that for a certain class of single index models (SIMs) Y = f(X⊤β0, ε), support recovery is impossible when X ~ 𝒩(0, 𝕀p×p) and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design X comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with L1 penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on f and ε compared to the SIR based algorithms. Furthermore, we show more generally that LASSO succeeds in recovering the signed support of β0 if X ~ 𝒩(0, Σ) and the covariance Σ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model to a more general class of SIMs.
PMCID: PMC5426818
Single index models; Sparsity; Support recovery; High-dimensional statistics; LASSO
16.  Phenome‐Wide Association Study of Autoantibodies to Citrullinated and Noncitrullinated Epitopes in Rheumatoid Arthritis 
Patients with rheumatoid arthritis (RA) develop autoantibodies against a spectrum of antigens, but the clinical significance of these autoantibodies is unclear. Using a phenome‐wide association study (PheWAS) approach, we examined the association between autoantibodies and clinical subphenotypes of RA.
This study was conducted in a cohort of RA patients identified from the electronic medical records (EMRs) of 2 tertiary care centers. Using a published multiplex bead assay, we measured 36 autoantibodies targeting epitopes implicated in RA. We extracted all International Classification of Diseases, Ninth Revision (ICD‐9) codes for each subject and grouped them into disease categories (PheWAS codes), using a published method. We tested for the association of each autoantibody (grouped by the targeted protein) with PheWAS codes. To determine significant associations (at a false discovery rate [FDR] of ≤0.1), we reviewed the medical records of 50 patients with each PheWAS code to determine positive predictive values (PPVs).
We studied 1,006 RA patients; the mean ± SD age of the patients was 61.0 ± 12.9 years, and 79.0% were female. A total of 3,568 unique ICD‐9 codes were grouped into 625 PheWAS codes; the 206 PheWAS codes with a prevalence of ≥3% were studied. Using the PheWAS method, we identified 24 significant associations of autoantibodies to epitopes at an FDR of ≤0.1. The associations that were strongest and had the highest PPV for the PheWAS code were those between autoantibodies against fibronectin and obesity (P = 6.1 × 10⁻⁴, PPV 100%) and between autoantibodies against fibrinogen and pneumonopathy (P = 2.7 × 10⁻⁴, PPV 96%). Pneumonopathy codes included diagnoses for cryptogenic organizing pneumonia and obliterative bronchiolitis.
We demonstrated application of a bioinformatics method, the PheWAS, to screen for the clinical significance of RA‐related autoantibodies. Using the PheWAS approach, we identified potentially significant links between variations in the levels of autoantibodies and comorbidities of interest in RA.
PMCID: PMC5378622  PMID: 27792870
17.  Comparative effectiveness of infliximab and adalimumab in Crohn’s disease and ulcerative colitis 
Inflammatory bowel diseases  2016;22(4):880-885.
The availability of monoclonal antibodies to tumor necrosis factor α (anti-TNF) has revolutionized management of Crohn’s disease (CD) and ulcerative colitis (UC). However, limited data exist regarding the comparative effectiveness of these agents to inform clinical practice.
This study consisted of patients with CD or UC initiating either infliximab (IFX) or adalimumab (ADA) between 1998 and 2010. A validated likelihood of non-response classification score utilizing frequency of narrative mentions of relevant symptoms in the electronic health record (EHR) was applied to assess comparative effectiveness at 1 year. IBD-related surgery, hospitalization, and use of steroids were determined during this period.
Our final cohort included 1,060 new initiations of IFX (68% for CD) and 391 of ADA (79% for CD). In CD, the likelihood of non-response was higher in ADA than IFX (OR 1.62, 95% CI 1.21 – 2.17). Similar differences favoring efficacy of IFX were observed for the individual symptoms of diarrhea, pain, bleeding, and fatigue. However, there was no difference in IBD-related surgery, hospitalizations or prednisone use within 1 year after initiation of IFX or ADA in CD. There was no difference in narrative or codified outcomes between the two agents in UC.
We identified a modestly higher likelihood of symptomatic non-response at 1 year for ADA compared to IFX in patients with CD. However, there were no differences in IBD-related surgery or hospitalizations suggesting these treatments are broadly comparable in effectiveness in routine clinical practice.
PMCID: PMC4792716  PMID: 26933751
Crohn’s disease; ulcerative colitis; treatment response; biologic; infliximab
18.  On Longitudinal Prediction with Time-to-Event Outcome: Comparison of Modeling Options 
Biometrics  2016;73(1):83-93.
Long-term follow-up is common in many medical investigations where the interest lies in predicting patients’ risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time s given their covariate information up to time s. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature. One approach, based on joint modeling (JM) of failure time and longitudinal covariate process (Tsiatis and Davidian, 2004), derives such longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and information accrued by that time. In this paper we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve the prediction accuracy over existing PC models. We provide procedures for making inference regarding future risk for an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS).
PMCID: PMC5250577  PMID: 27438160
joint model; longitudinal data analysis; partly conditional model; risk prediction; survival analysis
19.  Identifying Predictive Markers for Personalized Treatment Selection 
Biometrics  2016;72(4):1017-1025.
It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less focus has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are non-linear. Furthermore, an interaction-based test is scale dependent and may fail to capture markers useful for predicting individualized treatment differences. In this paper, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that our proposed KM-based score test is more powerful than the Wald test when there are non-linear effects among the predictors and when the outcome is binary with non-linear link functions. Furthermore, when there is high correlation among predictors and when the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
PMCID: PMC5352461  PMID: 26999054
Treatment selection; Score test; Kernel machine; Kernel PCA; Perturbation
20.  Identification of non-response to treatment using narrative data in an electronic health record inflammatory bowel disease cohort 
Inflammatory bowel diseases  2016;22(1):151-158.
Electronic health records (EHR), increasingly a part of healthcare, provide a wealth of untapped narrative free text data that has the potential to accurately inform clinical outcomes.
From a validated cohort of patients with Crohn’s disease (CD) or ulcerative colitis (UC), we identified patients with ≥ 1 coded or narrative mention of monoclonal antibodies to tumor necrosis factor α (anti-TNF). Chart review ascertained true use of therapy, time of initiation and cessation of treatment, as well as clinical response stratified as non-response, partial, or complete response at one year. Internal consistency was assessed in an independent validation cohort.
A total of 3,087 patients had a mention of an anti-TNF. Actual therapy initiation was within 60 days of the first coded mention in 74% of patients. In the derivation cohort, 18% of anti-TNF starts were classified as non-response at 1 year, 21% as partial, and 56% as complete response. On multivariate analysis, the numbers of narrative mentions of diarrhea (OR 1.08, 95% CI 1.02 – 1.14) and fatigue (OR 1.16, 95% CI 1.02 – 1.32) were independently associated with non-response at 1 year (AUC 0.82). A likelihood of non-response score comprising a weighted sum of both demonstrated a good dose-response relationship across non-responders (2.18), partial (1.20), and complete (0.50) responders (p < 0.0001) and correlated well with need for surgery or hospitalizations.
Narrative data in an EHR offers considerable potential to define temporally evolving disease outcomes such as non-response to treatment.
PMCID: PMC4772891  PMID: 26332313
Crohn’s disease; ulcerative colitis; treatment response; biologic; Responders; non-response; infliximab; IBD
21.  Robust Risk Prediction with Biomarkers under Two-Phase Stratified Cohort Design 
Biometrics  2016;72(4):1037-1045.
Identification of novel biomarkers for risk prediction is important for disease prevention and optimal treatment selection. However, studies aiming to discover which biomarkers are useful for risk prediction often require the use of stored biological samples from large assembled cohorts, and thus the depletion of a finite and precious resource. To make efficient use of these stored samples, two-phase sampling designs are often adopted as resource-efficient sampling strategies, especially when the outcome of interest is rare. Existing methods for analyzing data from two-phase studies focus primarily on single marker analysis or fitting the Cox regression model to combine information from multiple markers. However, the Cox model may not fit the data well. Under model misspecification, the composite score derived from the Cox model may not perform well in predicting the outcome. Under a general two-phase stratified cohort sampling design, we present a novel approach to combining multiple markers to optimize prediction by fitting a flexible non-parametric transformation model. Using inverse probability weighting to account for the outcome-dependent sampling, we propose to estimate the model parameters by maximizing an objective function which can be interpreted as a weighted C-statistic for survival outcomes. Regardless of model adequacy, the proposed procedure yields a sensible composite risk score for prediction. A major obstacle to making inference in two-phase studies is the correlation induced by the finite population sampling, which prevents standard inference procedures such as the bootstrap from being used for variance estimation. We propose a resampling procedure to derive valid confidence intervals for the model parameters and the C-statistic accuracy measure.
We illustrate the new methods with simulation studies and an analysis of a two-phase study of high-density lipoprotein cholesterol (HDL-C) subtypes for predicting the risk of coronary heart disease.
PMCID: PMC5045782  PMID: 27037494
C-statistic; Finite population sampling; Outcome-dependent sampling; Perturbation resampling; Risk prediction; Robustness; Two-phase stratified cohort sampling
22.  Retrospective Likelihood Based Methods for Analyzing Case-Cohort Genetic Association Studies 
Biometrics  2015;71(4):960-968.
The case-cohort (CCH) design is a cost-effective design for assessing genetic susceptibility with time-to-event data, especially when the event rate is low. In this work, we propose a powerful pseudo score test for assessing the association between a single nucleotide polymorphism (SNP) and the event time under the CCH design. The pseudo score is derived from a pseudo likelihood, which is an estimated retrospective likelihood that treats the SNP genotype as the dependent variable and time-to-event outcome and other covariates as independent variables. It exploits the fact that the genetic variable is often distributed independently of covariates or only related to a low-dimensional subset. Estimates of hazard ratio parameters for association can be obtained by maximizing the pseudo likelihood. A unique advantage of our method is that it allows the censoring distribution to depend on covariates that are only measured for the CCH sample while not requiring the knowledge of follow-up or covariate information on subjects not selected into the CCH sample. In addition to this flexibility, the proposed method has high relative efficiency compared with commonly used alternative approaches. We study large sample properties of this method and assess its finite sample performance using both simulated and real data examples.
PMCID: PMC4751872  PMID: 26177343
Case-cohort design; Cox proportional hazards model; Genetic association; Inverse probability weighting; Pseudo-likelihood; Polytomous regression
23.  Major depressive disorder subtypes to predict long-term course 
Depression and anxiety  2014;31(9):765-777.
Variation in course of major depressive disorder (MDD) is not strongly predicted by existing subtype distinctions. A new subtyping approach is considered here.
Two data mining techniques, ensemble recursive partitioning and Lasso generalized linear models (GLMs) followed by k-means cluster analysis, are used to search for subtypes based on index episode symptoms predicting subsequent MDD course in the World Mental Health (WMH) Surveys. The WMH surveys are community surveys in 16 countries. Lifetime DSM-IV MDD was reported by 8,261 respondents. Retrospectively reported outcomes included measures of persistence (number of years with an episode; number of years with an episode lasting most of the year) and severity (hospitalization for MDD; disability due to MDD).
Recursive partitioning found significant clusters defined by the conjunctions of early onset, suicidality, and anxiety (irritability, panic, nervousness-worry-anxiety) during the index episode. GLMs found additional associations involving a number of individual symptoms. Predicted values of the four outcomes were strongly correlated. Cluster analysis of these predicted values found three clusters having consistently high, intermediate, or low predicted scores across all outcomes. The high-risk cluster (30.0% of respondents) accounted for 52.9-69.7% of high persistence and severity and was most strongly predicted by index episode severe dysphoria, suicidality, anxiety, and early onset. A total symptom count, in comparison, was not a significant predictor.
Despite being based on retrospective reports, results suggest that useful MDD subtyping distinctions can be made using data mining methods. Further studies are needed to test and expand these results with prospective data.
PMCID: PMC5125445  PMID: 24425049
Epidemiology; Depression; Anxiety/Anxiety Disorders; Suicide/Self Harm; Panic Attacks
24.  Common genetic variants influence circulating vitamin D levels in inflammatory bowel diseases 
Inflammatory bowel diseases  2015;21(11):2507-2514.
The accuracy and utility of electronic health record (EHR)-derived phenotypes in replicating genotype-phenotype relationships has been infrequently examined. Low circulating vitamin D levels are associated with severe outcomes in inflammatory bowel disease (IBD); however, the genetic basis for vitamin D insufficiency in this population has not been examined previously.
We compared the accuracy of physician-assigned phenotypes in a large prospective IBD registry to that identified by an EHR-algorithm incorporating codified and structured data. Genotyping for IBD risk alleles was performed on the Immunochip and a genetic risk score calculated and compared between EHR-defined patients and those in the registry. Additionally, four vitamin D risk alleles were genotyped and serum 25-hydroxy vitamin D [25(OH)D] levels compared across genotypes.
A total of 1,131 patients captured by our EHR algorithm were also included in our prospective registry (656 Crohn's disease (CD), 475 ulcerative colitis (UC)). The overall genetic risk score for CD (p=0.13) and UC (p=0.32) was similar between EHR-defined patients and those in the prospective registry. Three of the four vitamin D risk alleles were associated with low vitamin D levels in patients with IBD and contributed an additional 3% of the variance explained. Vitamin D genetic risk score did not predict normalization of vitamin D levels.
EHR cohorts form valuable data sources for examining genotype-phenotype relationships. Vitamin D risk alleles explain 3% of the variance in vitamin D levels in patients with IBD.
PMCID: PMC4615315  PMID: 26241000
Crohn's disease; ulcerative colitis; vitamin D; electronic health records; genetics
25.  A predictive enrichment procedure to identify potential responders to a new therapy for randomized, comparative controlled clinical studies 
Biometrics  2015;72(3):877-887.
To evaluate a new therapy versus a control via a randomized, comparative clinical study or a series of trials, due to heterogeneity of the study patient population, a pre-specified, predictive enrichment procedure may be implemented to identify an “enrichable” subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a “therapy-diagnostic co-development” strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. At the first stage, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models using the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles. A large score indicates that these patients tend to benefit from the new therapy. At the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. At the final stage, we validate such a selection via two-sample inference procedures for assessing the treatment effectiveness statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a “cross-training-evaluation” process. Comprehensive numerical studies are conducted to investigate the operational characteristics of the proposed method. The entire enrichment procedure is illustrated with the data from a cardiovascular trial to evaluate a beta-blocker versus a placebo for treating chronic heart failure patients.
PMCID: PMC4916037  PMID: 26689167
Cox model; Cross-validation; Stratified medicine; Survival analysis; Therapy-diagnostic co-development
