Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.
We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
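The difference between the two resampling schemes mentioned above can be sketched in a few lines of Python. This is only an illustration, not the R implementation referred to in the abstract; the function names and the subsample fraction of 0.632 are assumptions chosen for the example.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement (classical bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def subsample_indices(n, rng, fraction=0.632):
    """Draw round(fraction * n) distinct indices (subsampling without replacement).

    The default fraction is an assumption made for this sketch.
    """
    size = max(1, round(fraction * n))
    return rng.sample(range(n), size)

rng = random.Random(1)
n = 100
boot = bootstrap_indices(n, rng)
sub = subsample_indices(n, rng)

# A bootstrap sample contains repeated observations; a subsample never does.
print(len(set(boot)), "distinct observations among", len(boot), "bootstrap draws")
print(len(set(sub)), "distinct observations among", len(sub), "subsample draws")
```

In a forest, each tree would be grown on one such index set; only the second scheme guarantees that no observation enters a tree more than once.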
The study of the network between transcription factors and their targets is important for understanding the complex regulatory mechanisms in a cell. Unfortunately, with standard microarray experiments it is not possible to measure the transcription factor activities (TFAs) directly, as their own transcription levels are subject to post-translational modifications.
Here we propose a statistical approach based on partial least squares (PLS) regression to infer the true TFAs from a combination of mRNA expression and DNA-protein binding measurements. This method is also statistically sound for small samples and allows the detection of functional interactions among the transcription factors via the notion of "meta"-transcription factors. In addition, it enables false positives to be identified in ChIP data and activation and suppression activities to be distinguished.
The proposed method performs very well both for simulated data and for real expression and ChIP data from yeast and E. coli experiments. It overcomes the limitations of previously used approaches to estimating TFAs. The estimated profiles may also serve as input for further studies, such as tests of periodicity or differential regulation. An R package "plsgenomics" implementing the proposed methods is available for download from the CRAN archive.
In recent years, the importance of independent validation of the prediction ability of a new gene signature has been widely recognized. Recently, with the development of gene signatures which integrate rather than replace the clinical predictors in the prediction rule, the focus has shifted to the validation of the added predictive value of a gene signature, i.e. to the verification that the inclusion of the new gene signature in a prediction model is able to improve its prediction ability.
The high-dimensional nature of the data from which a new signature is derived raises challenging issues and necessitates the modification of classical methods to adapt them to this framework. Here we show how to validate the added predictive value of a signature derived from high-dimensional data and critically discuss the impact of the choice of methods on the results.
The analysis of the added predictive value of two gene signatures developed in two recent studies on the survival of leukemia patients allows us to illustrate and empirically compare different validation techniques in the high-dimensional framework.
The issues related to the high-dimensional nature of the omics predictor space affect the validation process. An analysis procedure based on repeated cross-validation is suggested.
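As a rough illustration of what repeated cross-validation involves, the following Python sketch generates the train/test index splits. The abstract does not specify the number of folds or repetitions, so `k`, `repeats` and the function name are assumptions made for the example.

```python
import random

def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated K-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random partition in every repetition
        folds = [idx[f::k] for f in range(k)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train, sorted(test)

splits = list(repeated_kfold(n=20, k=5, repeats=3))
print(len(splits), "train/test splits")
```

When such splits are used with high-dimensional data, every data-dependent step, including the derivation of the omics score, would have to be repeated on each training part so that the test part stays untouched.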
Added predictive value; Omics score; Prediction model; Time-to-event data; Validation
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context.
Methods: We develop and implement a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation.
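The pairwise evaluation described above can be sketched as follows. This is a minimal Python illustration, not the survHD implementation: the function names, the dictionary layout of a study and the `fit` callback are assumptions, and the C-index shown is the simple Harrell version without any weighting for censoring.

```python
from itertools import permutations

def c_index(times, events, scores):
    """Harrell's C: share of usable pairs in which the higher risk score fails first."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable if subject i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")

def cross_study_matrix(studies, fit):
    """C-index for every ordered (training, validation) pair of distinct studies."""
    result = {}
    for train, valid in permutations(studies, 2):
        model = fit(studies[train])           # hypothetical: returns x -> risk score
        d = studies[valid]
        result[(train, valid)] = c_index(d["time"], d["event"],
                                         [model(x) for x in d["x"]])
    return result

# Toy data with one covariate; the "model" simply uses the covariate as risk score.
studies = {
    "A": {"time": [5, 3, 9, 1], "event": [1, 1, 0, 1], "x": [0.2, 0.5, 0.1, 0.9]},
    "B": {"time": [2, 8, 4, 6], "event": [1, 0, 1, 1], "x": [0.8, 0.1, 0.6, 0.3]},
}
matrix = cross_study_matrix(studies, fit=lambda d: (lambda x: x))
for pair, value in matrix.items():
    print(pair, value)
```

The off-diagonal entries of such a matrix are the cross-study validation statistics; how they are best summarized is the question the abstract addresses.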
Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation.
Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor.
Supplementary data are available at Bioinformatics online.
Aortic homografts are an alternative to mechanical or biological valve prostheses. Homografts are generally not matched for ABO compatibility at transplantation, although this policy is still under debate. The purpose of this study was to investigate whether ABO compatibility affects long-term outcomes.
Between 1992 and 2009, 363 adult patients with a mean age of 52 years received homografts in the aortic position. Donor and recipient blood groups could be obtained for 335 patients. Sixty-three percent received blood group-compatible (n = 212) (Group iso) and 37% non-blood group-compatible allografts (n = 123) (Group non-iso).
The overall event-free survival (freedom from death or reoperation) was 55.5% (n = 186). In the iso group, the event-free survival was 84.1% at 5 years and 63.3% at 10 years. In the non-iso group, the event-free survival was 79.4% at 5 years and 51.8% at 10 years. 28.5% of patients (n = 35) with ABO-incompatible and 25.5% (n = 54) with ABO-compatible grafts required reoperation. The mean time to reoperation in the iso group was 97.3 vs 90 months in the non-iso group.
In 17 years of research, we have not found a statistically significant difference in overall event-free survival between blood group-compatible and blood group-incompatible homografts. In our opinion, there is no need to use ABO-compatible homografts for aortic valve replacement in adults. Histological and immunohistochemical assays are mandatory to confirm our results.
Aortic homografts; Blood group incompatibility; Reoperation
Aim. There is no consensus about the normal fetal heart rate. Current international guidelines recommend different ranges for the normal fetal heart rate (FHR) baseline: 110 to 150 beats per minute (bpm) or 110 to 160 bpm. We started with a precise definition of “normality” and performed a retrospective computerized analysis of electronically recorded FHR tracings.
Methods. We analyzed all recorded cardiotocography tracings of singleton pregnancies in three German medical centers from 2000 to 2007 and identified 78,852 tracings of sufficient quality. For each tracing, the baseline FHR was extracted by eliminating accelerations/decelerations and averaging based on the “delayed moving windows” algorithm. A hypothetical normal baseline range was generated from a “training set” comprising 40% of the dataset from one hospital. Its validity was then evaluated on the other 60% of the data, using data from later years in the same hospital and, externally, data from the two other hospitals.
Results. Based on the training data set, the “best” FHR range was 115 or 120 to 160 bpm. Validation in all three data sets identified 120 to 160 bpm as the correct symmetric “normal range”. FHR decreases slightly during gestation.
Conclusions. Normal ranges for FHR are 120 to 160 bpm. Many international guidelines define ranges of 110 to 160 bpm which seem to be safe in daily practice. However, further studies should confirm that such asymmetric alarm limits are safe, with a particular focus on the lower bound, and should give insights about how to show and further improve the usefulness of the widely used practice of CTG monitoring.
Cardiotocography; Fetal heart rate; Baseline; Computerized analysis; Monitoring; Guidelines
In the computational science literature, including, e.g., bioinformatics, computational statistics or machine learning, most published articles are devoted to the development of “new methods”, while comparison studies are generally appreciated by readers but surprisingly given poor consideration by many journals. This paper stresses the importance of neutral comparison studies for the objective evaluation of existing methods and the establishment of standards by drawing parallels with clinical research. The goal of the paper is twofold. Firstly, we present a survey of recent computational papers on supervised classification published in seven high-ranking computational science journals. The aim is to provide an up-to-date picture of current scientific practice with respect to the comparison of methods in both articles presenting new methods and articles focusing on the comparison study itself. Secondly, based on the results of our survey we critically discuss the necessity, impact and limitations of neutral comparison studies in computational sciences. We define three reasonable criteria a comparison study has to fulfill in order to be considered as neutral, and explicate general considerations on the individual components of a “tidy neutral comparison study”. R code for completely replicating our statistical analyses and figures is available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/plea2013.
In the context of Gaussian Graphical Models (GGMs) with high-dimensional small sample data, we present a simple procedure, called PACOSE (standing for PArtial COrrelation SElection), to estimate partial correlations under the constraint that some of them are strictly zero. This method can also be extended to covariance selection. If the goal is to estimate a GGM, our new procedure can be applied to re-estimate the partial correlations after a first graph has been estimated, in the hope of improving the estimation of non-zero coefficients. This iterated version of PACOSE is called iPACOSE. In a simulation study, we compare PACOSE to existing methods and show that the re-estimated partial correlation coefficients may be closer to the real values in important cases. In addition, we show on simulated and real data that iPACOSE has very interesting properties with regard to sensitivity, positive predictive value and stability.
The random forest (RF) method is a commonly used tool for classification with
high-dimensional data as well as for ranking candidate predictors based on
the so-called random forest variable importance measures (VIMs). However, the
classification performance of RF is known to be suboptimal in the case of
strongly unbalanced data, i.e. data where response class sizes differ
considerably. Suggestions were made to obtain better classification
performance based either on sampling procedures or on cost sensitivity
analyses. However, to our knowledge, the performance of the VIMs has not yet
been examined in the case of unbalanced response classes. In this paper we
explore the performance of the permutation VIM for unbalanced data settings
and introduce an alternative permutation VIM based on the area under the
curve (AUC) that is expected to be more robust towards class imbalance.
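The idea of measuring importance as a loss in AUC after permutation can be sketched in Python as follows. This is a simplified illustration for a single fixed prediction function, not the tree-wise, out-of-bag computation of a random forest or the implementation in the R package party; the function names and the toy data are assumptions.

```python
import random

def auc(labels, scores):
    """AUC via the Mann-Whitney statistic; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def auc_permutation_importance(predict, X, y, j, rng):
    """Drop in AUC after randomly permuting column j of X."""
    base = auc(y, [predict(row) for row in X])
    col = [row[j] for row in X]
    rng.shuffle(col)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return base - auc(y, [predict(row) for row in X_perm])

# Unbalanced toy data: 2 positives, 4 negatives; only column 0 is used for prediction.
X = [[0.9, 5.0], [0.8, 1.0], [0.3, 4.0], [0.2, 2.0], [0.1, 3.0], [0.4, 0.0]]
y = [1, 1, 0, 0, 0, 0]
predict = lambda row: row[0]

rng = random.Random(0)
imp0 = auc_permutation_importance(predict, X, y, 0, rng)
imp1 = auc_permutation_importance(predict, X, y, 1, rng)
print("importance of column 0:", imp0)
print("importance of column 1:", imp1)
```

Because the AUC compares every positive with every negative observation, it does not depend on the class proportions in the way an error rate does, which is the motivation given above for using it with unbalanced data.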
We investigated the performance of the standard permutation VIM and of our
novel AUC-based permutation VIM for different class imbalance levels using
simulated data and real data. The results suggest that the new AUC-based
permutation VIM outperforms the standard permutation VIM for unbalanced data
settings while both permutation VIMs have equal performance for balanced
data settings.
The standard permutation VIM loses its ability to discriminate between
associated predictors and predictors not associated with the response for
increasing class imbalance. It is outperformed by our new AUC-based
permutation VIM for unbalanced data settings, while the performance of both
VIMs is very similar in the case of balanced classes. The new AUC-based VIM
is implemented in the R package party for the unbiased RF variant based on
conditional inference trees. The code implementing our study is available
from the companion website:
Random forest; Conditional inference trees; Variable importance measure; Feature selection; Unbalanced data; Class imbalance; Area under the curve.
Recently it has been shown that radiation induces migration of glioma cells and facilitates a further spread of tumor cells locally and systemically. The aim of this study was to evaluate whether radiotherapy induces migration in head and neck squamous cell carcinoma (HNSCC). A further aim was to investigate the effects of blocking the epidermal growth factor receptor (EGFR) and its downstream pathways (Raf/MEK/ERK, PI3K/Akt) on tumor cell migration in vitro.
Migration of tumor cells was assessed via a wound healing assay and proliferation by an MTT colorimetric assay using 3 HNSCC cell lines (BHY, CAL-27, HN). The cells were treated with increasing doses of irradiation (2 Gy, 5 Gy, 8 Gy) in the presence or absence of EGF, EGFR-antagonist (AG1478) or inhibitors of the downstream pathways PI3K (LY294002), mTOR (rapamycin) and MEK1 (PD98059). Biochemical activation of EGFR and the downstream markers Akt and ERK was examined by Western blot analysis.
In the absence of stimulation or inhibition, increasing doses of irradiation induced a dose-dependent increase in cell migration (p < 0.05 for the 3 HNSCC cell lines) and a decrease in cell proliferation (p < 0.05 for the 3 HNSCC cell lines). The inhibition of EGFR or the downstream pathways reduced cell migration significantly (almost all p < 0.05 for the 3 HNSCC cell lines). Stimulation of HNSCC cells with EGF caused a significant increase in migration (p < 0.05 for the 3 HNSCC cell lines). After irradiation alone, a pronounced activation of EGFR was observed by Western blot analysis.
Our results demonstrate that the EGFR is involved in radiation-induced migration of HNSCC cells. Therefore, EGFR or the downstream pathways might be a target for the treatment of HNSCC to improve the efficacy of radiotherapy.
The number of percutaneous coronary interventions (PCI) prior to coronary artery bypass grafting (CABG) has increased drastically during the last decade. Patients are referred for CABG with more severe coronary pathology, which may influence postoperative outcome. Outcomes of 200 CABG patients, collected consecutively in an observational study, were compared (mean follow-up: 5 years). Group A (n = 100, mean age 63 years, 20 women) had undergone PCI before CABG, and group B (n = 100, mean age 66 years, 20 women) underwent primary CABG. In group A, the mean number of implanted stents was 2. Statistically significant differences were found for the following preoperative criteria: previous myocardial infarction: 54 vs 34 (P = 0.007), distribution of CAD (P < 0.0001), unstable angina: 27 vs 5 (P < 0.0001). For intraoperative data, the total number of established bypasses was 2.43 ± 1.08 vs 2.08 ± 1.08 (P = 0.017), with the number of arterial bypass grafts being 1.26 ± 0.82 vs 1.07 ± 0.54 (P = 0.006). Regarding the postoperative course, significant differences were found for: adrenaline, administered in 67 group A vs 47 group B patients (P = 0.006; dosage 0.83 vs 0.41 mg/h, not significant [ns]), noradrenaline, administered in 46 group A vs 63 group B patients (P = 0.023; dosage 0.82 vs 0.87 mg/h, ns), CK/troponin I (P = 0.002; P < 0.001), postoperative resuscitation (6 vs 0; P = 0.029), intra-aortic balloon pump (12 vs 1; P = 0.003), and 30-day mortality (9% in group A vs 1% in group B; P = 0.018). Clopidogrel was administered in 35% of patients with prior PCI and in 19% of patients without prior PCI (P = 0.016). Patients with prior PCI presented for CABG with more severe CAD. Morbidity, mortality and reoperation rate at mid-term follow-up were significantly higher in patients with prior PCI.
CABG; CABG and PCI; CAD; outcome
While high-dimensional molecular data such as microarray gene expression data have been used for disease outcome prediction or diagnosis purposes for about ten years in biomedical research, the question of the additional predictive value of such data given that classical predictors are already available has long been under-considered in the bioinformatics literature.
We suggest an intuitive permutation-based testing procedure for assessing the additional predictive value of high-dimensional molecular data. Our method combines two well-known statistical tools: logistic regression and boosting regression. We give clear advice for the choice of the only method parameter (the number of boosting iterations). In simulations, our novel approach is found to have very good power in different settings, e.g. few strong predictors or many weak predictors. For illustrative purposes, it is applied to two publicly available cancer data sets.
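The permutation logic behind such a test can be sketched generically in Python. This is not the globalboosttest implementation: the logistic regression and boosting steps are replaced by a user-supplied `statistic` callback, and the function names, the toy statistic and the toy data are assumptions made for the illustration.

```python
import random

def permutation_p_value(statistic, clinical, molecular, y, n_perm=200, seed=0):
    """Permutation p-value for the added value of `molecular` given `clinical`.

    `statistic(clinical, molecular, y)` is a hypothetical callback that returns
    a number which is larger when the molecular data improve prediction.
    """
    rng = random.Random(seed)
    observed = statistic(clinical, molecular, y)
    rows = list(molecular)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling the molecular rows breaks their link to (clinical, y)
        # while leaving the relation between clinical covariates and y intact.
        rng.shuffle(rows)
        if statistic(clinical, rows, y) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

def mean_difference(clinical, molecular, y):
    """Toy statistic (ignores `clinical`): class difference in the first feature."""
    ones = [m[0] for m, t in zip(molecular, y) if t == 1]
    zeros = [m[0] for m, t in zip(molecular, y) if t == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

y = [0] * 10 + [1] * 10
molecular = [[float(t)] for t in y]          # perfectly informative toy feature
clinical = [[0.0] for _ in y]                # uninformative toy covariate
p = permutation_p_value(mean_difference, clinical, molecular, y)
print("permutation p-value:", p)
```

In the procedure described above, the statistic would instead come from a boosting fit on the molecular data that takes the clinical model into account.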
Our simple and computationally efficient approach can be used to globally assess the additional predictive power of a large number of candidate predictors given that a few clinical covariates or a known prognostic index are already available. It is implemented in the R package "globalboosttest", which is publicly available from R-forge and will be submitted to CRAN as soon as possible.
In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.
We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy of presenting only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
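The mechanism behind this optimistic bias can be illustrated with a small Python simulation. It is a deliberately crude sketch under stated assumptions: the 124 "classifiers" are independent random guessers on a single test set, whereas the study uses real, correlated classifier variants within cross-validation, so the numbers produced here are not comparable to those reported above.

```python
import random

def min_error_on_noise(n_samples=50, n_classifiers=124, seed=0):
    """Minimal and mean error rate of purely random 'classifiers' on random labels."""
    rng = random.Random(seed)
    y = [rng.randrange(2) for _ in range(n_samples)]
    errors = []
    for _ in range(n_classifiers):
        # Each 'classifier' guesses at random, so its true error rate is 50%.
        pred = [rng.randrange(2) for _ in range(n_samples)]
        errors.append(sum(p != t for p, t in zip(pred, y)) / n_samples)
    return min(errors), sum(errors) / len(errors)

best, average = min_error_on_noise()
print("mean error over all classifiers:", round(average, 3))
print("error of the classifier selected a posteriori:", best)
```

Reporting only the smallest of many error estimates makes uninformative data look predictive, which is the selection effect quantified in the study.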
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
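The link between the inverse covariance matrix and the partial correlations can be sketched as follows: with the inverse denoted by omega, the partial correlation between variables i and j is -omega[i][j] / sqrt(omega[i][i] * omega[j][j]). The Python code below is a plain, unregularized illustration with assumed function names, so it only works when the matrix is invertible.

```python
import math

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [row[n:] for row in m]

def partial_correlations(cov):
    """Matrix of partial correlations computed from the inverse covariance matrix."""
    omega = invert(cov)
    n = len(cov)
    return [[1.0 if i == j else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(n)] for i in range(n)]

# Three variables with pairwise correlation 0.5.
corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
pcor = partial_correlations(corr)
print(round(pcor[0][1], 4))
```

When the number of variables exceeds the number of samples, the sample covariance matrix is singular and `invert` raises an error, which is one way to see why the regularized estimators discussed above are needed.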
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well-known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression-based approaches when using subsampling.
Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.
We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations, we develop a new, conditional permutation scheme for the computation of the variable importance measure.
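The core idea of a conditional permutation, permuting a predictor only among observations that are similar with respect to a correlated covariate, can be sketched in Python as follows. The way the strata are formed here (a single cutpoint at 0.5 on one covariate) and the function name are assumptions made for the illustration and do not reproduce the scheme proposed in the paper.

```python
import random
from collections import defaultdict

def permute_within_strata(values, strata, rng):
    """Permute `values` only among observations that share the same stratum label."""
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)
    out = list(values)
    for idx in groups.values():
        shuffled = [values[i] for i in idx]
        rng.shuffle(shuffled)
        for i, v in zip(idx, shuffled):
            out[i] = v
    return out

# x is strongly related to z; the strata come from an assumed cutpoint on z.
x = [1, 2, 3, 4, 5, 6]
z = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
strata = [zi > 0.5 for zi in z]

x_perm = permute_within_strata(x, strata, random.Random(0))
print(x_perm)
```

An unconditional permutation would shuffle all six values freely and thereby also destroy the relation between x and z, while the stratified version keeps low values of x with low values of z.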
The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.