In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. Trends in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9, 292; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
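The adaptive weighting idea above can be sketched in a few lines. This is an illustrative toy only: the marginal Z-score construction (a linear model with no covariates) and the power-weighting scheme are simplified assumptions, not the paper's exact statistic, and the null distribution would in practice be approximated by perturbation.

```python
import math

def z_scores(y, G):
    """Marginal Z-scores for each marker: score statistic divided by its
    null standard deviation (illustrative linear model, no covariates)."""
    n, p = len(y), len(G[0])
    ybar = sum(y) / n
    resid = [yi - ybar for yi in y]
    sigma2 = sum(r * r for r in resid) / n
    zs = []
    for j in range(p):
        g = [row[j] for row in G]
        gbar = sum(g) / n
        u = sum(r * (gj - gbar) for r, gj in zip(resid, g))   # score
        v = sigma2 * sum((gj - gbar) ** 2 for gj in g)        # its variance
        zs.append(u / math.sqrt(v) if v > 0 else 0.0)
    return zs

def adaptive_score_stat(y, G, gamma=2.0):
    """Up-weight markers with large |Z|: T = sum_j |Z_j|^gamma * Z_j^2.
    gamma = 0 recovers the unweighted sum of squared Z-scores."""
    zs = z_scores(y, G)
    return sum(abs(z) ** gamma * z * z for z in zs)
```

Larger `gamma` concentrates the statistic on markers with strong marginal signals, which is where an adaptive test gains power under sparsity and weak inter-marker correlation.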
In recent years, a wide range of markers have become available as potential tools to predict risk or progression of disease. In addition to such biological and genetic markers, short-term outcome information may be useful in predicting long-term disease outcomes. When such information is available, it is desirable to combine it with predictive markers to improve the prediction of long-term survival. Most existing methods for incorporating censored short-term event information in predicting long-term survival focus on modeling the disease process and are derived under restrictive parametric models in a multi-state survival setting. When such model assumptions fail to hold, the resulting prediction of long-term outcomes may be invalid or inaccurate. When there is only a single discrete baseline covariate, a fully non-parametric estimation procedure to incorporate short-term event time information has been previously proposed. However, such an approach is not feasible for settings with one or more continuous covariates due to the curse of dimensionality. In this paper, we propose to incorporate short-term event time information along with multiple covariates collected up to a landmark point via a flexible varying-coefficient model. To evaluate and compare the prediction performance of the resulting landmark prediction rule, we use robust non-parametric procedures which do not require correct specification of the proposed varying-coefficient model. Simulation studies suggest that the proposed procedures perform well in finite samples. We illustrate them here using a dataset of post-dialysis patients with end-stage renal disease.
Landmark Prediction; Risk Prediction; Survival Time; Varying Coefficient Model
Risk prediction procedures can be quite useful for the patient’s treatment selection, prevention strategy, or disease management in evidence-based medicine. Often, potentially important new predictors are available in addition to the conventional markers. The question is how to quantify the improvement from the new markers for prediction of the patient’s risk in order to aid cost–benefit decisions. The standard method of using the area under the receiver operating characteristic curve to measure the added value may not be sensitive enough to capture incremental improvements from the new markers. Recently, some novel alternatives to the area under the receiver operating characteristic curve, such as the integrated discrimination improvement and net reclassification improvement, were proposed. In this paper, we consider a class of measures for evaluating the incremental values of new markers, which includes the preceding two as special cases. We present a unified procedure for making inferences about measures in the class with censored event time data. The large sample properties of our procedures are theoretically justified. We illustrate the new proposal with data from a cancer study to evaluate a new gene score for prediction of the patient’s survival.
area under the receiver operating characteristic curve; C-statistic; Cox’s regression; integrated discrimination improvement; net reclassification improvement; risk prediction
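The two special cases named above, integrated discrimination improvement (IDI) and the category-free net reclassification improvement (NRI), can be sketched for the simple uncensored binary-outcome setting (the paper's contribution is handling censored event times; the function names and setting here are illustrative assumptions):

```python
def idi(p_old, p_new, y):
    """Integrated discrimination improvement: gain in mean predicted
    risk among cases minus gain among controls (uncensored sketch)."""
    cases = [i for i, yi in enumerate(y) if yi == 1]
    ctrls = [i for i, yi in enumerate(y) if yi == 0]
    mean = lambda p, idx: sum(p[i] for i in idx) / len(idx)
    return (mean(p_new, cases) - mean(p_old, cases)) - (
            mean(p_new, ctrls) - mean(p_old, ctrls))

def continuous_nri(p_old, p_new, y):
    """Category-free NRI: net proportion of cases moved up plus net
    proportion of controls moved down by the new model."""
    up_case = dn_case = up_ctrl = dn_ctrl = 0
    n_case = sum(1 for yi in y if yi == 1)
    n_ctrl = len(y) - n_case
    for po, pn, yi in zip(p_old, p_new, y):
        if yi == 1:
            up_case += pn > po
            dn_case += pn < po
        else:
            up_ctrl += pn > po
            dn_ctrl += pn < po
    return (up_case - dn_case) / n_case + (dn_ctrl - up_ctrl) / n_ctrl
```

Both measures reward a new model that shifts cases toward higher predicted risk and controls toward lower predicted risk.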
Identification of novel biomarkers for risk assessment is important for both effective disease prevention and optimal treatment recommendation. Discovery relies on the precious yet limited resource of stored biological samples from large prospective cohort studies. The case-cohort sampling design provides a cost-effective tool in the context of biomarker evaluation, especially when the clinical condition of interest is rare. Existing statistical methods focus on making efficient inference on relative hazard parameters from the Cox regression model. Drawing on recent theoretical development on the weighted likelihood for semiparametric models under two-phase studies (Breslow and Wellner, 2007), we propose statistical methods to evaluate the accuracy and predictiveness of a risk prediction biomarker, with censored time-to-event outcome under stratified case-cohort sampling. We consider nonparametric methods and a semiparametric method. We derive large sample properties of the proposed estimators and evaluate their finite sample performance using numerical studies. We illustrate the new procedures using data from the Framingham Offspring Study to evaluate the accuracy of a recently developed risk score incorporating biomarker information for predicting cardiovascular disease.
Case Cohort Sampling; Negative predictive value; Positive predictive value; Receiver Operating Characteristics Curve (ROC curve); Integrated Discrimination Improvement (IDI); Risk prediction; Survival analysis; Two-phase study
In this paper we develop model checking techniques for assessing functional form specifications of covariates in censored linear regression models. These procedures are based on a censored data analog to taking cumulative sums of “robust” residuals over the space of the covariate under investigation. These cumulative sums are formed by integrating certain Kaplan-Meier estimators and may be viewed as “robust” censored data analogs to the processes considered by Lin, Wei & Ying (2002). The null distributions of these stochastic processes can be approximated by the distributions of certain zero-mean Gaussian processes whose realizations can be generated by computer simulation. Each observed process can then be graphically compared with a few realizations from the Gaussian process. We also develop formal test statistics for numerical comparison. Such comparisons enable one to assess objectively whether an apparent trend seen in a residual plot reflects model misspecification or natural variation. We illustrate the methods with a well-known dataset. In addition, we examine the finite sample performance of the proposed test statistics in simulation experiments. In our simulation experiments, the proposed test statistics have good power for detecting misspecification while at the same time controlling the size of the test.
Censored linear regression; Goodness-of-fit; Partial linear model; Partial residual; Quantile regression; Resampling method; Rank estimation
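The cumulative-sum construction and its simulated null distribution can be sketched as follows. This is a simplified uncensored analog of the processes described above: plain residuals stand in for the censored-data "robust" residuals, and independent N(0,1) multipliers generate realizations from the approximating zero-mean Gaussian process.

```python
import random

def cum_resid_process(x, resid):
    """Observed process W(t) = n^{-1/2} * sum of residuals with x_i <= t,
    evaluated at the ordered covariate values."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    w, s = [], 0.0
    for i in order:
        s += resid[i]
        w.append(s / n ** 0.5)
    return [x[i] for i in order], w

def sup_test_pvalue(x, resid, n_real=1000, seed=0):
    """Compare sup_t |W(t)| against realizations in which the residuals
    are multiplied by independent N(0,1) variables (simulation sketch)."""
    rng = random.Random(seed)
    _, w = cum_resid_process(x, resid)
    s_obs = max(abs(v) for v in w)
    count = 0
    for _ in range(n_real):
        pert = [r * rng.gauss(0.0, 1.0) for r in resid]
        _, wp = cum_resid_process(x, pert)
        if max(abs(v) for v in wp) >= s_obs:
            count += 1
    return (count + 1) / (n_real + 1)
```

A systematic trend in the covariate's functional form inflates the observed supremum relative to the simulated realizations, yielding a small p-value.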
To develop more targeted intervention strategies, an important research goal is to identify markers predictive of clinical events. A crucial step towards this goal is to characterize the clinical performance of a marker for predicting different types of events. In this manuscript, we present statistical methods for evaluating the performance of a prognostic marker in predicting multiple competing events. To capture the potential time-varying predictive performance of the marker and incorporate competing risks, we define time- and cause-specific accuracy summaries by stratifying cases based on causes of failure. Such a definition allows one to evaluate the predictive accuracy of a marker for each type of event and compare its predictiveness across event types. Extending the nonparametric crude cause-specific ROC curve estimators by Saha and Heagerty (2010), we develop inference procedures for a range of cause-specific accuracy summaries. To estimate the accuracy measures and assess how covariates may affect the accuracy of a marker under the competing risk setting, we consider two forms of semiparametric models through the cause-specific hazard framework. These approaches enable flexible modeling of the relationships between the marker and failure times for each cause, while efficiently accommodating additional covariates. We investigate the asymptotic properties of the proposed accuracy estimators and demonstrate the finite sample performance of these estimators through simulation studies. The proposed procedures are illustrated with data from a prostate cancer prognostic study.
Biomarker evaluation; Cause-specific Hazard; Competing risk; Negative predictive value; Positive predictive value; Receiver Operating Characteristics Curve (ROC curve); Survival analysis
To test whether data extracted from full text patient visit notes from an electronic medical record (EMR) would improve the classification of PsA compared to an algorithm based on codified data.
From the > 1,350,000 adults in a large academic EMR, all 2318 patients with a billing code for PsA were extracted and 550 were randomly selected for chart review and algorithm training. Using codified data and phrases extracted from narrative data using natural language processing, 31 predictors were extracted and three random forest algorithms were trained using coded, narrative, and combined predictors. The receiver operating characteristic (ROC) curve was used to identify the optimal algorithm, and a cut point was chosen to achieve the maximum sensitivity possible at a 90% positive predictive value (PPV). The algorithm was then used to classify the remaining 1768 charts and finally validated in a random sample of 300 cases predicted to have PsA.
The PPV of a single PsA code was 57% (95% CI 55%–58%). Using a combination of coded data and NLP, the random forest algorithm reached a PPV of 90% (95% CI 86%–93%) at a sensitivity of 87% (95% CI 83%–91%) in the training data. The PPV was 93% (95% CI 89%–96%) in the validation set. Adding NLP predictors to codified data increased the area under the ROC curve (p < 0.001).
Using NLP with text notes from electronic medical records improved the performance of the prediction algorithm significantly. Random forests were a useful tool to accurately classify psoriatic arthritis cases to enable epidemiological research.
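The cut-point selection step described above (maximum sensitivity subject to a 90% PPV floor) can be sketched as a scan over candidate thresholds; the function name and inputs are illustrative, not taken from the study's code:

```python
def cutoff_for_ppv(scores, labels, target_ppv=0.90):
    """Return (threshold, ppv, sensitivity) for the smallest score
    threshold achieving PPV >= target. Since sensitivity is non-increasing
    in the threshold, the smallest qualifying threshold maximizes
    sensitivity among all thresholds meeting the PPV constraint."""
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        if tp + fp == 0:
            continue
        ppv = tp / (tp + fp)
        if ppv >= target_ppv:
            return t, ppv, tp / sum(labels)
    return None
```

With classifier probabilities as `scores` and chart-review labels as `labels`, this reproduces the "maximum sensitivity at 90% PPV" rule on the training data.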
Electronic health records (EHR) can allow for the generation of large cohorts of individuals with given diseases for clinical and genomic research. A rate-limiting step is the development of electronic phenotype selection algorithms to find such cohorts. This study evaluated the portability of a published phenotype algorithm to identify rheumatoid arthritis (RA) patients from EHR records at three institutions with different EHR systems.
Materials and Methods
Physicians reviewed charts from three institutions to identify patients with RA. Each institution compiled attributes from various sources in the EHR, including codified data and clinical narratives, which were searched using one of two natural language processing (NLP) systems. The performance of the published model was compared with locally retrained models.
Applying the previously published model from Partners Healthcare to datasets from Northwestern and Vanderbilt Universities, the area under the receiver operating characteristic curve was 92% for Northwestern and 95% for Vanderbilt, compared with 97% at Partners. Retraining the model improved the average sensitivity at a specificity of 97% from the original 65% to 72%. Both the original logistic regression models and the locally retrained models were superior to simple billing code count thresholds.
These results show that a previously published algorithm for RA is portable to two external hospitals using different EHR systems, different NLP systems, and different target NLP vocabularies. Retraining the algorithm primarily increased the sensitivity at each site.
Electronic phenotype algorithms allow rapid identification of case populations in multiple sites with little retraining.
Automated learning; biomedical informatics; discovery and text and data mining methods; electronic health record; genetic; improving the education and skills training of health professionals; infection control; knowledge representations; linking the genotype and phenotype; medical informatics; natural language processing; other methods of information extraction; phenotype algorithms; DNA databank; machine learning; phenotype identification; phenotyping; rheumatoid arthritis; rheumatology; translational research – application of biological knowledge to clinical care
To estimate an overall treatment difference with data from a randomized comparative clinical study, baseline covariates are often utilized to increase the estimation precision. Using the standard analysis of covariance technique for making inferences about such an average treatment difference may not be appropriate, especially when the fitted model is nonlinear. On the other hand, the novel augmentation procedure recently studied, for example, by Zhang and others (2008. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64, 707–715) is quite flexible. However, in general, it is not clear how to select covariates for augmentation effectively. An overly adjusted estimator may inflate the variance and in some cases be biased. Furthermore, the results from the standard inference procedure by ignoring the sampling variation from the variable selection process may not be valid. In this paper, we first propose an estimation procedure, which augments the simple treatment contrast estimator directly with covariates. The new proposal is asymptotically equivalent to the aforementioned augmentation method. To select covariates, we utilize the standard lasso procedure. Furthermore, to make valid inference from the resulting lasso-type estimator, a cross validation method is used. The validity of the new proposal is justified theoretically and empirically. We illustrate the procedure extensively with a well-known primary biliary cirrhosis clinical trial data set.
ANCOVA; Cross validation; Efficiency augmentation; Mayo PBC data; Semi-parametric efficiency
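The direct-augmentation idea can be sketched with a single baseline covariate; ordinary least squares stands in here for the paper's lasso selection over many covariates, so this is a simplified illustration rather than the proposed estimator:

```python
def augmented_diff(y, a, x):
    """Simple treatment contrast (mean difference between arms) augmented
    by one baseline covariate: delta_aug = delta - b * (xbar1 - xbar0),
    where b is the OLS slope of y on x. Under randomization this removes
    chance covariate imbalance and reduces variance."""
    n = len(y)
    y1 = [yi for yi, ai in zip(y, a) if ai == 1]
    y0 = [yi for yi, ai in zip(y, a) if ai == 0]
    delta = sum(y1) / len(y1) - sum(y0) / len(y0)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    x1 = [xi for xi, ai in zip(x, a) if ai == 1]
    x0 = [xi for xi, ai in zip(x, a) if ai == 0]
    return delta - b * (sum(x1) / len(x1) - sum(x0) / len(x0))
```

When the outcome is fully explained by the covariate, the augmented contrast correctly shrinks the apparent treatment difference toward zero.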
The nested case–control (NCC) design is frequently used in epidemiological studies as a cost-effective subcohort sampling strategy for biomarker research. The sampling strategy, on the other hand, creates challenges for data analysis because of outcome-dependent missingness in the biomarker measurements. In this paper, we propose inverse probability weighted (IPW) methods for making inference about the prognostic accuracy of a novel biomarker for predicting future events with data from NCC studies. The consistency and asymptotic normality of these estimators are derived using empirical process theory and convergence theorems for sequences of weakly dependent random variables. Simulation and analysis using Framingham Offspring Study data suggest that the proposed methods perform well in finite samples.
Inverse probability weighting; Nested case–control study; Time-dependent accuracy
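Under NCC sampling with m controls drawn per case, a standard form of the control-inclusion probability used to construct IPW weights can be sketched as follows (an illustrative version; the paper's estimators and asymptotic theory are considerably more involved):

```python
def ncc_inclusion_prob(case_times, risk_set_sizes, subj_time, m=1):
    """P(a subject with observed time subj_time is ever sampled as a
    control): one minus the product, over case times occurring before
    subj_time, of (1 - m / (n_k - 1)), where n_k is the size of the
    risk set at the k-th case time (the case itself is excluded)."""
    p_never = 1.0
    for t_k, n_k in zip(case_times, risk_set_sizes):
        if t_k < subj_time and n_k > 1:
            p_never *= max(0.0, 1.0 - m / (n_k - 1))
    return 1.0 - p_never
```

Cases receive weight 1, and each sampled control is weighted by the reciprocal of its inclusion probability, correcting the outcome-dependent missingness of the biomarker.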
To evaluate whether coronary heart disease (CHD)-susceptibility loci identified by genome-wide association studies of the general population also contribute to CHD in type 2 diabetes.
No study has examined the effects of these genetic variants on CHD in diabetic patients.
We genotyped 15 genetic markers of 12 loci in three studies of diabetic patients: the prospective Nurses’ Health Study (309 CHD cases and 544 controls) and Health Professional Follow-up Study (345 CHD cases and 451 controls), and the cross-sectional Joslin Heart Study (422 CHD cases and 435 controls).
Five SNPs, rs4977574 (CDKN2A/2B), rs12526453 (PHACTR1), rs646776 (CELSR2-PSRC1-SORT1), rs2259816 (HNF1A), and rs11206510 (PCSK9), showed directionally consistent associations with CHD in the three studies, with combined odds ratios (ORs) ranging from 1.17 to 1.25 (p=0.03 to 0.0002). None of the other SNPs reached significance in individual or combined analyses. A genetic risk score (GRS) was created by combining the risk alleles of the five significantly associated loci. The OR of CHD per GRS unit was 1.19 (95% confidence interval [CI] 1.13–1.26; p<0.0001). Individuals with GRS ≥8 (19% of diabetic subjects) had almost a two-fold increase in CHD risk (OR=1.94, 95% CI 1.60–2.35) as compared to individuals with GRS ≤5 (30% of diabetic subjects). Prediction of CHD was significantly improved (p<0.001) when the GRS was added to a model including clinical predictors in the combined samples.
Our results illustrate the consistency and differences in the determinants of genetic susceptibility to CHD between diabetic patients and the general population.
genetics; CHD; diabetes
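An unweighted genetic risk score of the kind used above is simply a count of risk alleles carried across the associated loci; a minimal sketch (the allele-pair genotype encoding is an illustrative assumption):

```python
def genetic_risk_score(genotypes, risk_alleles):
    """Unweighted GRS: total number of risk alleles carried across loci.
    genotypes: one allele pair per locus, e.g. ('A', 'G');
    risk_alleles: the risk allele at each corresponding locus."""
    return sum(pair.count(risk) for pair, risk in zip(genotypes, risk_alleles))
```

Subjects can then be stratified by GRS cutoffs (e.g. ≥8 vs ≤5 in the study above) or the score entered per-unit into a logistic model alongside clinical predictors.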
Quantitative procedures for evaluating added values from new markers over a conventional risk scoring system for predicting event rates at specific time points have been extensively studied. However, a single summary statistic, for example, the area under the receiver operating characteristic curve or its derivatives, may not provide a clear picture about the relationship between the conventional and the new risk scoring systems. When there are no censored event time observations in the data, two simple scatterplots with individual conventional and new scores for “cases” and “controls” provide valuable information regarding the overall and the subject-specific level incremental values from the new markers. Unfortunately, in the presence of censoring, it is not clear how to construct such plots. In this paper, we propose a nonparametric estimation procedure for the distributions of the differences between two risk scores conditional on the conventional score. The resulting quantile curves of these differences over the subject-specific conventional score provide extra information about the overall added value from the new marker. They also help us to identify a subgroup of future subjects who need the new predictors, especially when there is no unified utility function available for cost-risk-benefit decision making. The procedure is illustrated with two data sets. The first is from a well-known Mayo Clinic PBC liver study. The second is from a recent breast cancer study on evaluating the added value from a gene score, which is relatively expensive to measure compared with the routinely used clinical biomarkers for predicting the patient's survival after surgery.
Discriminant analysis; Nonparametric function estimation; Prediction; Receiver operating characteristic curve
In this paper, we develop a powerful test for identifying SNP-sets that are predictive of survival with data from genome-wide association studies (GWAS). We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially non-linear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum p-value based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
Cox model; genetic studies; gene-based analysis; kernel machine; multi-locus test; score test; single nucleotide polymorphism
In disease screening and prognosis studies, an important task is to determine useful markers for identifying high-risk subgroups. Once such markers are established, they can be incorporated into public health practice to provide appropriate strategies for treatment or disease monitoring based on each individual’s predicted risk. In recent years, genetic and biological markers have been examined extensively for their potential to signal progression or risk of disease. In addition to these markers, it has often been argued that short-term outcomes may be helpful in making a better prediction of disease outcomes in clinical practice. In this paper we propose model-free non-parametric procedures to incorporate short-term event information to improve the prediction of a long-term terminal event. We include the optional availability of a single discrete marker measurement and assess the additional information gained by including the short-term outcome. We focus on the semi-competing risk setting where the short-term event is an intermediate event that may be censored by the terminal event, while the terminal event is only subject to administrative censoring. Simulation studies suggest that the proposed procedures perform well in finite samples. Our procedures are illustrated using a data set of post-dialysis patients with end-stage renal disease.
Biomarkers; Disease prognosis; Risk prediction; Survival analysis
Accurate risk prediction is an important step in developing optimal strategies for disease prevention and treatment. Based on the predicted risks, patients can be stratified to different risk categories where each category corresponds to a particular clinical intervention. Incorrect or suboptimal interventions are likely to result in unnecessary financial and medical consequences. It is thus essential to account for the costs associated with the clinical interventions when developing and evaluating risk stratification (RS) rules for clinical use. In this article, we propose to quantify the value of an RS rule based on the total expected cost attributed to incorrect assignment of risk groups due to the rule. We have established the relationship between cost parameters and optimal threshold values used in the stratification rule that minimizes the total expected cost over the entire population of interest. Statistical inference procedures are developed for evaluating and comparing given RS rules and examined through simulation studies. The proposed procedures are illustrated with an example from the Cardiovascular Health Study.
Disease prognosis; Optimal risk stratification; Risk prediction
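The cost-based evaluation of a two-group risk stratification rule can be sketched as a grid search for the cutoff minimizing the total expected cost; the cost parameters and the simple binary-event setting are illustrative assumptions:

```python
def expected_cost(risks, events, threshold, cost_over, cost_under):
    """Average cost of a two-group rule: cost_over for a non-case placed
    in the high-risk group (over-treatment), cost_under for a case placed
    in the low-risk group (under-treatment)."""
    cost = 0.0
    for r, e in zip(risks, events):
        if r >= threshold and e == 0:
            cost += cost_over
        elif r < threshold and e == 1:
            cost += cost_under
    return cost / len(risks)

def optimal_threshold(risks, events, cost_over, cost_under):
    """Grid search over observed risk values for the cost-minimizing cutoff."""
    return min(sorted(set(risks)),
               key=lambda t: expected_cost(risks, events, t,
                                           cost_over, cost_under))
```

A classical connection worth noting: for well-calibrated risks, the population-optimal cutoff is cost_over / (cost_over + cost_under), which is the kind of relationship between cost parameters and optimal thresholds the abstract refers to.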
There is growing evidence that genomic and proteomic research holds great potential for changing irrevocably the practice of medicine. The ability to identify important genomic and biological markers for risk assessment can have a great impact on public health, from disease prevention to detection to treatment selection. However, the potentially large number of markers and the complexity of the relationship between the markers and the outcome of interest pose a grand challenge in developing accurate risk prediction models. The standard approach to identifying important markers often assesses the marginal effects of individual markers on a phenotype of interest. When multiple markers relate to the phenotype simultaneously via a complex structure, this type of marginal analysis may not be effective. To overcome such difficulties, we employ a kernel machine Cox regression framework and propose an efficient score test to assess the overall effect of a set of markers, such as genes within a pathway or a network, on survival outcomes. The proposed test has the advantage of capturing potentially non-linear effects without explicitly specifying a particular non-linear functional form. To approximate the null distribution of the score statistic, we propose a simple resampling procedure that can be easily implemented in practice. Numerical studies suggest that the test performs well with respect to both empirical size and power even when the number of variables in a gene set is not small compared to the sample size.
Genetic Association; Gene-set analysis; Genetic Pathways; Kernel Machine; Kernel PCA; Risk Prediction; Score Test; Survival Analysis
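The kernel machine score statistic takes the familiar quadratic form Q = r'Kr, where r collects model residuals and K is a kernel (similarity) matrix over subjects. Below is a toy version with a crude resampling null; using plain residuals and perturbing them directly are simplifying assumptions, whereas the paper works with Cox-model residual terms and a more careful perturbation scheme.

```python
import random

def kernel_score_stat(resid, K):
    """Score statistic Q = r' K r for the global null of no gene-set
    effect, with K an n-by-n kernel matrix over subjects."""
    n = len(resid)
    return sum(resid[i] * K[i][j] * resid[j]
               for i in range(n) for j in range(n))

def perturbation_pvalue(resid, K, n_pert=2000, seed=0):
    """Approximate the null by recomputing Q with the residuals multiplied
    by independent N(0,1) variables (a sketch of the resampling idea)."""
    rng = random.Random(seed)
    q_obs = kernel_score_stat(resid, K)
    count = 0
    for _ in range(n_pert):
        pert = [r * rng.gauss(0.0, 1.0) for r in resid]
        if kernel_score_stat(pert, K) >= q_obs:
            count += 1
    return (count + 1) / (n_pert + 1)
```

The choice of K (linear, IBS, Gaussian, and so on) determines which effect structures the test is sensitive to, without requiring an explicit functional form.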
Tumor tissues generally exhibit aberrations in DNA copy number that are associated with the development and progression of cancer. Genotyping methods such as array-based comparative genomic hybridization (aCGH) provide a means to identify copy number variation across the entire genome. To address some of the shortfalls of existing methods of DNA copy number data analysis, including strong model assumptions, lack of accounting for the sampling variability of estimators, and the assumption that clones are independent, we propose a simple graphical approach based on moving averages to assess population-level genetic alterations over the entire genome. Furthermore, existing methods primarily focus on segmentation and do not examine the association of covariates with genetic instability. In our methods, covariates are incorporated through a possibly mis-specified working model, and the sampling variabilities of estimators are approximated using a resampling method based on perturbing observed processes. Our proposal, which is applicable to partial, entire, or multiple chromosomes, is illustrated through application to aCGH studies of two brain tumor types, meningioma and glioma.
aCGH data; Moving average; Perturbation method; Gaussian process; Genomic data
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. Adaptive LASSO and SCAD, for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation resampling based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association of HIV drug resistance and a large number of genetic mutations.
High dimensional regression; Interval estimation; Oracle property; Regularized estimation; Resampling methods
To evaluate the clinical utility of new risk markers, a crucial step is to measure their predictive accuracy with prospective studies. However, it is often infeasible to obtain marker values for all study participants. The nested case-control (NCC) design is a useful cost-effective strategy for such settings. Under the NCC design, markers are only ascertained for cases and a fraction of controls sampled randomly from the risk sets. The outcome dependent sampling generates a complex data structure and therefore a challenge for analysis. Existing methods for analyzing NCC studies focus primarily on association measures. Here, we propose a class of non-parametric estimators for commonly used accuracy measures. We derived asymptotic expansions for accuracy estimators based on both finite population and Bernoulli sampling and established asymptotic equivalence between the two. Simulation results suggest that the proposed procedures perform well in finite samples. The new procedures were illustrated with data from the Framingham Offspring study.
Biomarker study; Classification Accuracy; Conditional Kaplan Meier; Inverse Probability Weighting; Nested case-control study; Predictive Value; ROC Curve; Time-dependent Accuracy
Recently, meta-analysis has been widely utilized to combine information across multiple studies to evaluate a common effect. Integrating data from similar studies is particularly useful in genomic studies where the individual study sample sizes are not large relative to the number of parameters of interest. In this paper, we are interested in developing robust prognostic rules for the prediction of t-year survival based on multiple studies. We propose to construct a composite score for prediction by fitting a stratified semiparametric transformation model that allows the studies to have related but not identical outcomes. To evaluate the accuracy of the resulting score, we provide point and interval estimators for the commonly used accuracy measures, including the time-specific ROC curves and positive and negative predictive values. We apply the proposed procedures to develop prognostic rules for the 5-year survival of breast cancer patients based on five breast cancer genomic studies.
Biomarker; Classification; Conditional Kaplan-Meier; Meta Analysis; Nonparametric Maximum Likelihood; Predictive Values; Prognosis; ROC; Survival Analysis
For modern evidence-based medicine, a well thought-out risk scoring system for predicting the occurrence of a clinical event plays an important role in selecting prevention and treatment strategies. Such an index system is often established based on the subject’s “baseline” genetic or clinical markers via a working parametric or semi-parametric model. To evaluate the adequacy of such a system, C-statistics are routinely used in the medical literature to quantify the capacity of the estimated risk score in discriminating among subjects with different event times. The C-statistic provides a global assessment of a fitted survival model for the continuous event time rather than focusing on the prediction of t-year survival for a fixed time. When the event time is possibly censored, however, the population parameters corresponding to the commonly used C-statistics may depend on the study-specific censoring distribution. In this article, we present a simple C-statistic without this shortcoming. The new procedure consistently estimates a conventional concordance measure which is free of censoring. We provide a large sample approximation to the distribution of this estimator for making inferences about the concordance measure. Results from numerical studies suggest that the new procedure performs well in finite samples.
AUC; Cox’s proportional hazards model; Framingham risk score; ROC
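A minimal sketch of a censoring-robust concordance estimator in the spirit of this abstract: evaluable pairs are reweighted by the inverse squared Kaplan-Meier estimate of the censoring distribution, which removes the dependence on study-specific censoring. This is an illustrative reconstruction rather than the paper's exact estimator; the truncation time `tau` and the (crude) handling of ties are assumptions.

```python
import numpy as np

def km_censoring_survival(time, event):
    """Kaplan-Meier estimate of the censoring survival function G,
    obtained by treating censoring (event == 0) as the 'event'."""
    order = np.argsort(time)
    t, cens = time[order], 1 - event[order]
    n = len(t)
    at_risk = n - np.arange(n)  # subjects still at risk at each ordered time
    surv = np.cumprod(1.0 - cens / at_risk)

    def G(s):
        # left-continuous evaluation: G just before time s
        idx = np.searchsorted(t, s, side="left") - 1
        return 1.0 if idx < 0 else surv[min(idx, n - 1)]

    return G

def ipcw_cstat(time, event, score, tau):
    """IPCW concordance: each evaluable pair (T_i < T_j, T_i < tau,
    subject i uncensored) is weighted by 1 / G(T_i)^2.
    Score ties are ignored for simplicity."""
    G = km_censoring_survival(time, event)
    num = den = 0.0
    for i in range(len(time)):
        if event[i] == 1 and time[i] < tau:
            w = 1.0 / G(time[i]) ** 2
            comparable = time > time[i]
            den += w * comparable.sum()
            num += w * (comparable & (score[i] > score)).sum()
    return num / den
```

On uncensored data a perfectly ranking score yields a concordance of 1 and a perfectly reversed score yields 0; with censoring, the weights keep the estimand free of the censoring distribution.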
Suppose that, under the conventional randomized clinical trial setting, a new therapy is compared with a standard treatment. In this article, we propose a systematic, 2-stage estimation procedure for subject-level treatment differences to guide future patients’ disease management and treatment selection. To construct this procedure, we first utilize a parametric or semiparametric method to estimate individual-level treatment differences and use these estimates to create an index scoring system for grouping patients. We then consistently estimate the average treatment difference for each subgroup of subjects via a nonparametric function estimation method. Furthermore, pointwise and simultaneous interval estimates are constructed to make inferences about such subgroup-specific treatment differences. The new proposal is illustrated with data from a clinical trial evaluating the efficacy and toxicity of a 3-drug combination versus a standard 2-drug combination for treating HIV-1–infected patients.
Cross-validation; HIV infection; Non-parametric function estimation; Personalized medicine; Subgroup analysis
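The 2-stage idea can be sketched on simulated data: stage 1 fits a working linear model with a treatment-by-covariate interaction to build an index score of individual-level treatment differences, and stage 2 estimates the average treatment difference nonparametrically within score-defined subgroups (quartiles here). The data-generating model is hypothetical, and the sketch omits the cross-validation the authors use to avoid fitting and evaluating on the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)                 # hypothetical baseline covariate
trt = rng.integers(0, 2, size=n)       # randomized treatment indicator
# true treatment effect varies with x: 0.3 + 0.8 * x (hypothetical)
y = 1.0 + 0.5 * x + trt * (0.3 + 0.8 * x) + rng.normal(size=n)

# Stage 1: working linear model with treatment-by-covariate interaction
X = np.column_stack([np.ones(n), x, trt, trt * x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
# index score = estimated individual-level treatment difference
score = beta[2] + beta[3] * x

# Stage 2: nonparametric subgroup estimates within score quartiles
edges = np.quantile(score, [0.25, 0.5, 0.75])
groups = np.digitize(score, edges)     # subgroup labels 0..3
diffs = [y[(groups == g) & (trt == 1)].mean() - y[(groups == g) & (trt == 0)].mean()
         for g in range(4)]
```

The subgroup-specific differences recover the increasing treatment effect across score quartiles; in practice, the quartile means would be replaced by smoother nonparametric function estimates with pointwise and simultaneous confidence bands.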
Meta-analysis provides a useful framework for combining information across related studies and has been widely utilized to combine data from clinical studies in order to evaluate treatment efficacy. More recently, meta-analysis has also been used to assess drug safety. However, because adverse events are typically rare, standard methods may not work well in this setting. The most popular methods use fixed- or random-effects models to combine effect estimates obtained separately for each individual study. In the context of very rare outcomes, effect estimates from individual studies may be unstable or even undefined. We propose alternative approaches based on Poisson random-effects models to make inference about the relative risk between two treatment groups. Simulation studies show that the proposed methods perform well when the underlying event rates are low. The methods are illustrated using data from a recent meta-analysis of 48 comparative trials involving rosiglitazone, a type 2 diabetes drug, with respect to its possible cardiovascular toxicity.
Cardiac Complications; Rosiglitazone; Poisson Random Effects
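For intuition, a fixed-effect version of the rare-event idea can be sketched with a conditional Poisson likelihood: given the total number of events in a trial, the treatment-arm count is binomial with success probability determined by the common relative risk, so trials with zero total events drop out of the likelihood automatically and no continuity corrections are needed. The trial counts below are hypothetical, and the paper's actual proposal adds random effects across trials.

```python
import numpy as np

# hypothetical events and arm sizes for five rare-event trials
# (several trials have zero events in one or both arms)
e1 = np.array([2, 0, 1, 0, 3])            # treatment-arm events
e0 = np.array([1, 0, 0, 1, 1])            # control-arm events
n1 = np.array([500.0, 400.0, 600.0, 300.0, 800.0])
n0 = np.array([500.0, 400.0, 600.0, 300.0, 800.0])

def neg_loglik(log_rr):
    """Conditional binomial negative log-likelihood for a common RR:
    given e1 + e0 events in trial i, e1 ~ Binomial(e1 + e0, p_i) with
    p_i = rr * n1 / (rr * n1 + n0)."""
    rr = np.exp(log_rr)
    p = rr * n1 / (rr * n1 + n0)
    return -np.sum(e1 * np.log(p) + e0 * np.log(1.0 - p))

# crude grid-search MLE on the log relative-risk scale
grid = np.linspace(-3.0, 3.0, 2001)
rr_hat = np.exp(grid[np.argmin([neg_loglik(g) for g in grid])])
```

With equal arm sizes the MLE reduces to the pooled ratio of event counts (6/3 = 2 here); separate per-trial estimates would be undefined for the zero-event trials that this conditional approach handles naturally.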
DNA oxidative stress has been suggested as an important pathogenic mechanism in cognitive impairment and dementia. With baseline data collected from 2004 to 2008, the authors examined whether urinary 8-hydroxy-2-deoxyguanosine (8-OHdG), a biomarker of global DNA oxidation, was associated with cognitive function in a sample of 1,003 Puerto Rican adults, aged 45–75 years, living in Boston, Massachusetts, and the surrounding area. Cognitive function was measured by using a battery of 7 tests: the Mini-Mental State Examination, word list learning, digit span, clock drawing and figure copying, Stroop, and verbal fluency tests. The primary outcome was a global cognitive score, averaging standardized scores across all cognitive tests. A higher 8-OHdG concentration was significantly associated with lower global cognitive scores, after adjustment for age, education, apolipoprotein E (APOE) genotype, and other covariates (P for trend = 0.01). The difference in the global score, comparing participants in the 2 extreme 8-OHdG quartiles, was −0.11 (95% confidence interval: −0.20, −0.02), which was equivalent to accelerating cognitive aging by about 4 years, as observed in this population. Prospective studies are needed to elucidate whether elevated urinary 8-OHdG concentrations can predict the rate of cognitive decline and incident dementia.
biological markers; DNA damage; Hispanic Americans; neurobehavioral manifestations; oxidative stress
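The global cognitive score described above (standardize each test, then average the z-scores across tests) can be sketched as follows; the number of participants, test means, and SDs below are hypothetical stand-ins for the 7-test battery.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical raw scores on 7 cognitive tests for 10 participants
raw = rng.normal(loc=[25, 10, 12, 8, 7, 40, 15],
                 scale=[3, 2, 3, 1, 2, 8, 4],
                 size=(10, 7))

# standardize each test to mean 0 and SD 1 within the sample,
# then average across tests to obtain the global cognitive score
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
global_score = z.mean(axis=1)
```

By construction the global score has sample mean zero, so differences between 8-OHdG quartiles (such as the −0.11 reported above) are expressed in within-sample SD units.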
Electronic medical records (EMRs) are a rich data source for discovery research but are underutilized because of the difficulty of extracting highly accurate clinical data. We assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) classifies subjects with rheumatoid arthritis (RA) more accurately than an algorithm using codified EMR data alone.
Subjects with ≥1 ICD9 RA code (714.xx), or who had anti-CCP checked, in the EMR of two large academic centers were included in an ‘RA Mart’ (n=29,432). For all 29,432 subjects, we extracted narrative (using natural language processing) and codified RA clinical information. In a training set of 96 RA and 404 non-RA cases from the RA Mart classified by medical record review, we used narrative and codified data to develop classification algorithms using logistic regression. These algorithms were applied to the entire RA Mart. We calculated and compared the positive predictive value (PPV) of these algorithms by reviewing the records of an additional 400 subjects classified as RA by the algorithms.
A complete algorithm (narrative and codified data) classified RA subjects with a significantly higher PPV (94%) than an algorithm using codified data alone (PPV 88%). Characteristics of the RA cohort identified by the complete algorithm were comparable to those of existing RA cohorts (80% female, 63% anti-CCP+, 59% erosion+).
We demonstrate the ability to use complete EMR data to define an RA cohort with a PPV of 94%, superior to that of an algorithm using codified data alone.
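The general recipe (logistic regression on codified plus narrative features, then PPV among subjects the algorithm classifies as RA) can be sketched on simulated data. The features, coefficients, and 0.5 threshold below are hypothetical stand-ins for the actual EMR variables and chart-review gold standard.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 500
# hypothetical features: codified (ICD9 RA code count) and
# narrative (NLP-extracted RA mentions from physician notes)
icd9_count = rng.poisson(2, n)
nlp_mentions = rng.poisson(1, n)
# hypothetical chart-review gold standard
true_ra = rng.random(n) < sigmoid(-3.0 + 0.6 * icd9_count + 0.9 * nlp_mentions)

# fit logistic regression (codified + narrative) by Newton's method
X = np.column_stack([np.ones(n), icd9_count, nlp_mentions])
beta = np.zeros(3)
for _ in range(25):
    p = sigmoid(X @ beta)
    W = p * (1.0 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (true_ra - p))

# classify as RA above a probability threshold;
# PPV = proportion of true RA among algorithm-classified RA
pred = sigmoid(X @ beta) >= 0.5
ppv = true_ra[pred].mean()
```

In the study, the analogue of `ppv` was estimated by chart review of 400 algorithm-classified subjects rather than computed against a fully labeled cohort.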