In an interesting and quite exhaustive review on Random Forests (RF) methodology in bioinformatics Touw et al. address—among other topics—the problem of the detection of interactions between variables based on RF methodology. We feel that some important statistical concepts, such as ‘interaction’, ‘conditional dependence’ or ‘correlation’, are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of the central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.
random forest; statistics; interaction; correlation; conditional inference trees; conditional variable importance
Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation.
In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI (internal validation followed by MI on the training and test parts separately), MI-Val (MI on the full data set followed by internal validation), and MI(-y)-Val (MI on the full data set omitting the outcome, followed by internal validation). Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data.
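The order of operations that sets Val-MI apart can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: `hot_deck_impute` is a toy stand-in for a real MI procedure, and `val_mi`, `has_observed`, `evaluate` and the single-covariate data layout are hypothetical names chosen for this example.

```python
import random


def hot_deck_impute(rows, rng):
    """Toy stochastic imputer (a stand-in for real MI): fill each missing
    covariate with a randomly drawn observed value from the SAME part."""
    observed = [x for x, _ in rows if x is not None]
    return [(x if x is not None else rng.choice(observed), y) for x, y in rows]


def has_observed(rows):
    """True if at least one covariate value in this part is observed."""
    return any(x is not None for x, _ in rows)


def val_mi(data, n_boot, n_imp, evaluate, seed=0):
    """Val-MI: resample first, then impute training and test parts separately.

    data: list of (x, y) pairs, where x may be None (missing).
    evaluate: hypothetical callable (train_rows, test_rows) -> performance.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        chosen = set(in_bag)
        train_raw = [data[i] for i in in_bag]
        test_raw = [data[i] for i in range(n) if i not in chosen]
        # Skip degenerate resamples that the toy imputer cannot handle.
        if not test_raw or not has_observed(train_raw) or not has_observed(test_raw):
            continue
        for _ in range(n_imp):
            train = hot_deck_impute(train_raw, rng)  # sees training rows only
            test = hot_deck_impute(test_raw, rng)    # sees test rows only
            estimates.append(evaluate(train, test))
    # Pool by averaging over all resamples and imputations.
    return sum(estimates) / len(estimates)
```

In this layout, MI-Val would instead impute `data` once per imputation before any resampling, so imputed values in the test part could borrow information from the training part.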
Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained.
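For reference, the 0.632+ estimator in its usual error-rate form (Efron and Tibshirani) can be written as a short function. This is a sketch under assumptions: the function and argument names are illustrative, `gamma` denotes the no-information error rate, and applying the idea to AUC rather than error rates requires an analogous adaptation that is not shown here.

```python
def bootstrap_632_plus(err_train, err_oob, gamma):
    """0.632+ estimate of prediction error.

    err_train: apparent (resubstitution) error.
    err_oob:   average out-of-bag bootstrap error.
    gamma:     no-information error rate.
    """
    # The out-of-bag error is capped at the no-information rate.
    err_oob_capped = min(err_oob, gamma)
    # Relative overfitting rate, set to 0 when it is not well defined.
    if err_oob_capped > err_train and gamma > err_train:
        r = (err_oob_capped - err_train) / (gamma - err_train)
    else:
        r = 0.0
    # The weight moves from 0.632 (no overfitting) to 1 (maximal overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err_oob_capped
```

With no overfitting the result is the plain 0.632 mix of the two error rates. With maximal overfitting it equals the out-of-bag error.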
When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
Electronic supplementary material
The online version of this article (doi:10.1186/s12874-016-0239-7) contains supplementary material, which is available to authorized users.
Missing values; Incomplete data; Prediction model; Predictive performance; Bootstrap; Internal validation; Resampling; Cross-validation; Multiple imputation; MICE
Interleukin-22 (IL-22) is involved in lung diseases such as pneumonia, asthma and lung cancer. Lavage mirrors the local environment, and may provide insights into the presence and role of IL-22 in patients.
Bronchoscopic lavage (BL) samples (n = 195, including bronchoalveolar lavage and bronchial washings) were analysed for IL-22 using an enzyme-linked immunosorbent assay. Clinical characteristics and parameters from lavage and serum were correlated with lavage IL-22 concentrations.
IL-22 was higher in lavage from patients with lung disease than in controls (38.0 vs 15.3 pg/ml, p < 0.001). Patients with pneumonia and lung cancer had the highest concentrations (48.9 and 33.0 pg/ml, p = 0.009 and p < 0.001, respectively). IL-22 concentration did not correlate with systemic inflammation. IL-22 concentrations were not related to any of the analysed cell types in BL, indicating a potential mixed contribution of different cell populations to IL-22 production.
Lavage IL-22 concentrations are high in patients with lung cancer but do not correlate with systemic inflammation, thus suggesting that lavage IL-22 may be related to the underlying malignancy. Our results suggest that lavage may represent a distinct compartment where the role of IL-22 in thoracic malignancies can be studied.
Bronchoalveolar lavage; Interleukin-22; Biomarker; Lung cancer; Pneumonia
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.
We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
The study of the network between transcription factors and their targets is important for understanding the complex regulatory mechanisms in a cell. Unfortunately, with standard microarray experiments it is not possible to measure the transcription factor activities (TFAs) directly, as their own transcription levels are subject to post-translational modifications.
Here we propose a statistical approach based on partial least squares (PLS) regression to infer the true TFAs from a combination of mRNA expression and DNA-protein binding measurements. This method is also statistically sound for small samples and allows the detection of functional interactions among the transcription factors via the notion of "meta"-transcription factors. In addition, it enables false positives to be identified in ChIP data and activation and suppression activities to be distinguished.
The proposed method performs very well both for simulated data and for real expression and ChIP data from yeast and E. coli experiments. It overcomes the limitations of previously used approaches to estimating TFAs. The estimated profiles may also serve as input for further studies, such as tests of periodicity or differential regulation. An R package "plsgenomics" implementing the proposed methods is available for download from the CRAN archive.
Identification of microorganisms in positive blood cultures still relies on standard techniques such as Gram staining followed by culturing with definite microorganism identification. Alternatively, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry or the analysis of headspace volatile compound (VC) composition produced by cultures can help to differentiate between microorganisms under experimental conditions. This study assessed the efficacy of volatile compound based microorganism differentiation into Gram-negatives and -positives in unselected positive blood culture samples from patients.
Headspace gas samples of positive blood culture samples were transferred to sterilized, sealed, and evacuated 20 ml glass vials and stored at −30 °C until batch analysis. Headspace gas VC content analysis was carried out via an auto sampler connected to an ion–molecule reaction mass spectrometer (IMR-MS). Measurements covered a mass range from 16 to 135 u including CO2, H2, N2, and O2. Prediction rules for microorganism identification based on VC composition were derived using a training data set and evaluated using a validation data set within a random split validation procedure.
One-hundred-fifty-two aerobic samples growing 27 Gram-negatives, 106 Gram-positives, and 19 fungi and 130 anaerobic samples growing 37 Gram-negatives, 91 Gram-positives, and two fungi were analysed. In anaerobic samples, ten discriminators were identified by the random forest method allowing for bacteria differentiation into Gram-negative and -positive (error rate: 16.7 % in validation data set). For aerobic samples the error rate was not better than random.
In anaerobic blood culture samples of patients, IMR-MS based headspace VC composition analysis facilitates bacteria differentiation into Gram-negative and -positive.
The online version of this article (doi:10.1186/s40709-016-0040-0) contains supplementary material, which is available to authorized users.
Mass spectrometry; Chemical ionization; Volatile compound; Blood culture; Prediction rule; Gram identification
Reliable risk assessment of frequent but treatable diseases and disorders has considerable clinical and socio-economic relevance. However, as these conditions usually originate from a complex interplay between genetic and environmental factors, precise prediction remains a considerable challenge. The current progress in genotyping technology has resulted in a substantial increase of knowledge regarding the genetic basis of such diseases and disorders. Consequently, common genetic risk variants are increasingly being included in epidemiological models to improve risk prediction. This work reviews recent high-quality publications targeting the prediction of common complex diseases. To be included in this review, articles had to report both numerical measures of prediction performance based on traditional (non-genetic) risk factors and measures of prediction performance when adding common genetic variants to the model. A systematic PubMed-based search identified 55 eligible studies. These studies were compared with respect to the chosen approach and methodology as well as results and clinical impact. Phenotypes analysed included tumours, diabetes mellitus, and cardiovascular diseases. All studies applied one or more statistical measures reporting on calibration, discrimination, or reclassification to quantify the benefit of including SNPs, but differed substantially regarding the methodological details that were reported. Several examples of improved risk assessment by considering disease-related SNPs were identified. Although the add-on benefit of including SNP genotyping data was mostly moderate, the strategy can be of clinical relevance and may, when being paralleled by an even deeper understanding of disease-related genetics, further explain the development of enhanced predictive and diagnostic strategies for complex diseases.
The online version of this article (doi:10.1007/s00439-016-1636-z) contains supplementary material, which is available to authorized users.
In the context of high-throughput molecular data analysis it is common that the observations included in a dataset form distinct groups; for example, measured at different times, under different conditions or even in different labs. These groups are generally denoted as batches. Systematic differences between these batches not attributable to the biological signal of interest are denoted as batch effects. If ignored when conducting analyses on the combined data, batch effects can lead to distortions in the results. In this paper we present FAbatch, a general, model-based method for correcting for such batch effects in the case of an analysis involving a binary target variable. It is a combination of two commonly used approaches: location-and-scale adjustment and data cleaning by adjustment for distortions due to latent factors. We compare FAbatch extensively to the most commonly applied competitors on the basis of several performance metrics. FAbatch can also be used in the context of prediction modelling to eliminate batch effects from new test data. This important application is illustrated using real and simulated data. We implemented FAbatch and various other functionalities in the R package bapred available online from CRAN.
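The location-and-scale component mentioned above can be illustrated in its simplest form, batch-wise standardization of a single variable. This sketch is not FAbatch: it ignores the binary target variable and the latent-factor adjustment, and the function name `standardize_within_batches` is hypothetical.

```python
from statistics import mean, pstdev


def standardize_within_batches(values, batches):
    """Center and scale one variable separately within each batch.

    values:  measurements of a single variable, one per observation.
    batches: batch label of each observation, in the same order.
    """
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    # A batch without variation keeps scale 1.0 to avoid division by zero.
    stats = {b: (mean(vs), pstdev(vs) or 1.0) for b, vs in by_batch.items()}
    return [(v - stats[b][0]) / stats[b][1] for v, b in zip(values, batches)]
```

A naive adjustment of this kind removes any mean difference between batches, including differences that stem from unequal class proportions. That is one reason why model-based methods that protect the biological signal are of interest.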
FAbatch is seen to be competitive in many cases and above average in others. In our analyses, the only cases where it failed to adequately preserve the biological signal were when there were extremely outlying batches and when the batch effects were very weak compared to the biological signal.
As seen in this paper, batch effect structures found in real datasets are diverse. Current batch effect adjustment methods are often either too simplistic or make restrictive assumptions, which can be violated in real datasets. Due to the generality of its underlying model and its ability to perform well, FAbatch represents a reliable tool for batch effect adjustment for most situations found in practice.
The online version of this article (doi:10.1186/s12859-015-0870-z) contains supplementary material, which is available to authorized users.
Batch effects; High-dimensional data; Data preparation; Prediction; Latent factors
In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset—in its entirety—before training/test set based prediction error estimation by cross-validation (CV)—an approach referred to as “incomplete CV”. Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values.
We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.
Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.
While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
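The difference between "full" and "incomplete" CV comes down to where the data preparation step is fitted. The sketch below uses column-wise standardization as a placeholder preparation step and a nearest-centroid rule as a placeholder classifier. All function names are hypothetical, and the code illustrates the procedure only, not the CVIIM measure itself.

```python
import random
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the given rows only."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def apply_scaler(params, rows):
    """Standardize rows with previously learned parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


def centroid_predict(train_x, train_y, test_x):
    """Nearest-centroid classifier, used here only as a simple placeholder."""
    centroids = {}
    for label in set(train_y):
        members = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(sorted(centroids), key=lambda c: dist(row, centroids[c]))
            for row in test_x]


def cv_error(x, y, k, full, seed=0):
    """K-fold CV error. full=True refits the preparation step on each
    training fold; full=False reuses one fit on the entire dataset."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    global_params = fit_scaler(x)  # has seen the test rows: "incomplete" CV
    wrong = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        raw_train = [x[i] for i in train]
        params = fit_scaler(raw_train) if full else global_params
        pred = centroid_predict(
            apply_scaler(params, raw_train),
            [y[i] for i in train],
            apply_scaler(params, [x[i] for i in fold]),
        )
        wrong += sum(p != y[i] for p, i in zip(pred, fold))
    return wrong / len(x)
```

Comparing `cv_error(x, y, k, full=True)` with `cv_error(x, y, k, full=False)` on the same data gives a rough impression of how much a given preparation step matters. For a step such as PCA, only the `fit_scaler` and `apply_scaler` pair would need to be replaced.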
The online version of this article (doi:10.1186/s12874-015-0088-9) contains supplementary material, which is available to authorized users.
Cross-validation; Error estimation; Over-optimism; Practical guidelines; Supervised learning
The problem of publication bias has long been discussed in research fields such as medicine. There is a consensus that publication bias is a reality and that solutions should be found to reduce it. In methodological computational research, including cancer informatics, publication bias may also be at work. The publication of negative research findings is certainly also a relevant issue, but has attracted very little attention to date. The present paper aims at providing a new formal framework to describe the notion of publication bias in the context of methodological computational research, facilitate and stimulate discussions on this topic, and increase awareness in the scientific community. We report an exemplary pilot study that aims at gaining experiences with the collection and analysis of information on unpublished research efforts with respect to publication bias, and we outline the encountered problems. Based on these experiences, we try to formalize the notion of publication bias.
epistemology; publication practice; false research findings; overoptimism
In recent years, the importance of independent validation of the prediction ability of a new gene signature has been widely recognized. Recently, with the development of gene signatures which integrate rather than replace the clinical predictors in the prediction rule, the focus has moved to the validation of the added predictive value of a gene signature, i.e. to the verification that the inclusion of the new gene signature in a prediction model is able to improve its prediction ability.
The high-dimensional nature of the data from which a new signature is derived raises challenging issues and necessitates the modification of classical methods to adapt them to this framework. Here we show how to validate the added predictive value of a signature derived from high-dimensional data and critically discuss the impact of the choice of methods on the results.
The analysis of the added predictive value of two gene signatures developed in two recent studies on the survival of leukemia patients allows us to illustrate and empirically compare different validation techniques in the high-dimensional framework.
The issues related to the high-dimensional nature of the omics predictors space affect the validation process. An analysis procedure based on repeated cross-validation is suggested.
Added predictive value; Omics score; Prediction model; Time-to-event data; Validation
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context.
Methods: We develop and implement a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation.
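The pairwise evaluation described above can be sketched with a plain implementation of Harrell's C-index and a loop over all ordered pairs of datasets. This is an illustrative sketch, not the survHD implementation: `cross_study_matrix`, the `fit` callable and the data layout are assumptions made for this example.

```python
def c_index(times, events, risks):
    """Harrell's concordance index for right-censored data.

    A pair is comparable when the subject with the shorter time had an event.
    It is concordant when that subject also has the higher predicted risk.
    Ties in predicted risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def cross_study_matrix(datasets, fit):
    """z[i][j] = C-index of the model trained on dataset i, validated on j.

    datasets: list of (x_rows, times, events) triples.
    fit: hypothetical callable (x_rows, times, events) -> function that maps
         one row to a risk score.
    The diagonal is left as None here, since training and validating on the
    same dataset would give a resubstitution estimate.
    """
    k = len(datasets)
    z = [[None] * k for _ in range(k)]
    for i, (x_i, t_i, e_i) in enumerate(datasets):
        risk = fit(x_i, t_i, e_i)
        for j, (x_j, t_j, e_j) in enumerate(datasets):
            if i != j:
                z[i][j] = c_index(t_j, e_j, [risk(row) for row in x_j])
    return z
```

The off-diagonal entries can then be summarized, for example by their mean or median, and compared with a conventional cross-validation estimate obtained within each dataset.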
Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation.
Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor.
Supplementary data are available at Bioinformatics online.
Aortic homografts are an alternative to mechanical or biological valve prostheses. Homografts are generally transplanted without regard to ABO compatibility, although this policy is still under debate. The purpose of this study was to investigate whether ABO compatibility affects long-term outcomes.
Between 1992 and 2009, 363 adult patients with a mean age of 52 years received homografts in aortic position. Donor and acceptor blood groups could be obtained for 335 patients. Sixty-three percent received blood group-compatible (n = 212) (Group iso) and 37% non-blood group-compatible allografts (n = 123) (Group non-iso).
The overall event-free survival (freedom from death or reoperation) was 55.5% (n = 186). In the iso group, the event-free survival was 84.1% at 5 years and 63.3% at 10 years. In the non-iso group, the event-free survival was 79.4% at 5 years and 51.8% at 10 years. 28.5% of patients (n = 35) with ABO-incompatible and 25.5% (n = 54) with ABO-compatible grafts required reoperation. The mean time to reoperation in the iso group was 97.3 vs 90 months in the non-iso group.
In 17 years of research, we have not yet found a statistically significant effect of blood group incompatibility on overall event-free survival. In our opinion, there is no need to use ABO-compatible homografts for aortic valve replacement in adults. Histological and immunohistochemical assays are mandatory to confirm our results.
Aortic homografts; Blood group incompatibility; Reoperation
Aim. There is no consensus about the normal fetal heart rate. Current international guidelines recommend different ranges for the normal fetal heart rate (FHR) baseline: 110 to 150 beats per minute (bpm) or 110 to 160 bpm. We started with a precise definition of “normality” and performed a retrospective computerized analysis of electronically recorded FHR tracings.
Methods. We analyzed all recorded cardiotocography tracings of singleton pregnancies in three German medical centers from 2000 to 2007 and identified 78,852 tracings of sufficient quality. For each tracing, the baseline FHR was extracted by eliminating accelerations/decelerations and averaging based on the “delayed moving windows” algorithm. A hypothetical normal baseline range was derived from 40% of the dataset, used as a “training set” from one hospital. Its validity was then evaluated on the remaining 60% of the data, internally using data from later years in the same hospital and externally using data from the two other hospitals.
Results. Based on the training data set, the “best” FHR range was 115 or 120 to 160 bpm. Validation in all three data sets identified 120 to 160 bpm as the correct symmetric “normal range”. FHR decreases slightly during gestation.
Conclusions. Normal ranges for FHR are 120 to 160 bpm. Many international guidelines define ranges of 110 to 160 bpm which seem to be safe in daily practice. However, further studies should confirm that such asymmetric alarm limits are safe, with a particular focus on the lower bound, and should give insights about how to show and further improve the usefulness of the widely used practice of CTG monitoring.
Cardiotocography; Fetal heart rate; Baseline; Computerized analysis; Monitoring; Guidelines
In the computational science literature, including, e.g., bioinformatics, computational statistics or machine learning, most published articles are devoted to the development of “new methods”, while comparison studies are generally appreciated by readers but surprisingly given poor consideration by many journals. This paper stresses the importance of neutral comparison studies for the objective evaluation of existing methods and the establishment of standards by drawing parallels with clinical research. The goal of the paper is twofold. Firstly, we present a survey of recent computational papers on supervised classification published in seven high-ranking computational science journals. The aim is to provide an up-to-date picture of current scientific practice with respect to the comparison of methods in both articles presenting new methods and articles focusing on the comparison study itself. Secondly, based on the results of our survey we critically discuss the necessity, impact and limitations of neutral comparison studies in computational sciences. We define three reasonable criteria a comparison study has to fulfill in order to be considered as neutral, and explicate general considerations on the individual components of a “tidy neutral comparison study”. R code for completely replicating our statistical analyses and figures is available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/plea2013.
In the context of Gaussian Graphical Models (GGMs) with high-dimensional small sample data, we present a simple procedure, called PACOSE (standing for PArtial COrrelation SElection), to estimate partial correlations under the constraint that some of them are strictly zero. This method can also be extended to covariance selection. If the goal is to estimate a GGM, our new procedure can be applied to re-estimate the partial correlations after a first graph has been estimated, in the hope of improving the estimation of non-zero coefficients. This iterated version of PACOSE is called iPACOSE. In a simulation study, we compare PACOSE to existing methods and show that the re-estimated partial correlation coefficients may be closer to the real values in important cases. In addition, we show on simulated and real data that iPACOSE has very interesting properties with regard to sensitivity, positive predictive value and stability.
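As background, partial correlations in a GGM are tied to the precision matrix (the inverse covariance matrix) by the standard relation rho_ij = -omega_ij / sqrt(omega_ii * omega_jj), so a zero partial correlation corresponds to a zero off-diagonal precision entry. The sketch below shows only this relation, not the PACOSE estimation procedure, and the function name is illustrative.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) into partial correlations.

    Off-diagonal entries follow rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    The diagonal is set to 1 by convention.
    """
    p = len(precision)
    rho = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho
```

For a tridiagonal precision matrix, for example, only neighbouring variables receive a non-zero partial correlation, which corresponds to a chain-shaped graph.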
The random forest (RF) method is a commonly used tool for classification with
high dimensional data as well as for ranking candidate predictors based on
the so-called random forest variable importance measures (VIMs). However the
classification performance of RF is known to be suboptimal in case of
strongly unbalanced data, i.e. data where response class sizes differ
considerably. Suggestions were made to obtain better classification
performance based either on sampling procedures or on cost sensitivity
analyses. However to our knowledge the performance of the VIMs has not yet
been examined in the case of unbalanced response classes. In this paper we
explore the performance of the permutation VIM for unbalanced data settings
and introduce an alternative permutation VIM based on the area under the
curve (AUC) that is expected to be more robust towards class imbalance.
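The idea of an AUC-based permutation importance can be sketched independently of any forest implementation. In the sketch below, a fixed scoring function stands in for the fitted model. In a random forest, the analogous computation would be carried out per tree on out-of-bag observations, which is not shown. The names `auc_permutation_importance` and `score_fn` are hypothetical, and this is not the implementation in the party package.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random positive scores above a random
    negative (Mann-Whitney form). Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_permutation_importance(score_fn, rows, labels, column,
                               n_repeats=20, seed=0):
    """Mean drop in AUC after randomly permuting one predictor column.

    score_fn: hypothetical callable mapping one row (a list) to a class-1 score.
    """
    rng = random.Random(seed)
    baseline = auc([score_fn(r) for r in rows], labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - auc([score_fn(r) for r in permuted], labels))
    return sum(drops) / n_repeats
```

Because the AUC weighs both classes through positive-negative pairs, it does not reward a model for simply predicting the majority class. This is the intuition behind the robustness expected under class imbalance.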
We investigated the performance of the standard permutation VIM and of our
novel AUC-based permutation VIM for different class imbalance levels using
simulated data and real data. The results suggest that the new AUC-based
permutation VIM outperforms the standard permutation VIM for unbalanced data
settings while both permutation VIMs have equal performance for balanced settings.
The standard permutation VIM loses its ability to discriminate between
associated predictors and predictors not associated with the response for
increasing class imbalance. It is outperformed by our new AUC-based
permutation VIM for unbalanced data settings, while the performance of both
VIMs is very similar in the case of balanced classes. The new AUC-based VIM
is implemented in the R package party for the unbiased RF variant based on
conditional inference trees. The code implementing our study is available
from the companion website:
Random forest; Conditional inference trees; Variable importance measure; Feature selection; Unbalanced data; Class imbalance; Area under the curve.
Recently it has been shown that radiation induces migration of glioma cells and facilitates a further spread of tumor cells locally and systemically. The aim of this study was to evaluate whether radiotherapy induces migration in head and neck squamous cell carcinoma (HNSCC). A further aim was to investigate the effects of blocking the epidermal growth factor receptor (EGFR) and its downstream pathways (Raf/MEK/ERK, PI3K/Akt) on tumor cell migration in vitro.
Migration of tumor cells was assessed via a wound healing assay and proliferation by an MTT colorimetric assay using 3 HNSCC cell lines (BHY, CAL-27, HN). The cells were treated with increasing doses of irradiation (2 Gy, 5 Gy, 8 Gy) in the presence or absence of EGF, EGFR-antagonist (AG1478) or inhibitors of the downstream pathways PI3K (LY294002), mTOR (rapamycin) and MEK1 (PD98059). Biochemical activation of EGFR and the downstream markers Akt and ERK was examined by Western blot analysis.
In absence of stimulation or inhibition, increasing doses of irradiation induced a dose-dependent enhancement of migrating cells (p < 0.05 for the 3 HNSCC cell lines) and a decrease of cell proliferation (p < 0.05 for the 3 HNSCC cell lines). The inhibition of EGFR or the downstream pathways reduced cell migration significantly (almost all p < 0.05 for the 3 HNSCC cell lines). Stimulation of HNSCC cells with EGF caused a significant increase in migration (p < 0.05 for the 3 HNSCC cell lines). After irradiation alone a pronounced activation of EGFR was observed by Western blot analysis.
Our results demonstrate that the EGFR is involved in radiation induced migration of HNSCC cells. Therefore EGFR or the downstream pathways might be a target for the treatment of HNSCC to improve the efficacy of radiotherapy.
The number of percutaneous coronary interventions (PCI) prior to coronary artery bypass grafting (CABG) increased drastically during the last decade. Patients are referred for CABG with more severe coronary pathology, which may influence postoperative outcome. Outcomes of 200 CABG patients, collected consecutively in an observational study, were compared (mean follow-up: 5 years). Group A (n = 100, mean age 63 years, 20 women) had prior PCI before CABG, and group B (n = 100, mean age 66 years, 20 women) underwent primary CABG. In group A, the mean number of administered stents was 2. Statistically significant results were obtained for the following preoperative criteria: previous myocardial infarction: 54 vs 34 (P = 0.007), distribution of CAD (P < 0.0001), unstable angina: 27 vs 5 (P < 0.0001). For intraoperative data, the total number of established bypasses was 2.43 ± 1.08 vs 2.08 ± 1.08 (P = 0.017), with the number of arterial bypass grafts being: 1.26 ± 0.82 vs 1.07 ± 0.54 (P = 0.006). Regarding the postoperative course, significant results could be demonstrated for: adrenaline dosage (0.83 vs 0.41 mg/h; [p is not significant (ns)]) administered in 67 group A vs 47 group B patients (P = 0.006), and noradrenaline dosage (0.82 vs 0.87 mg/h; ns) administered in 46 group A vs 63 group B patients (P = 0.023), CK/troponin I (P = 0.002; P < 0.001), postoperative resuscitation (6 vs 0; P = 0.029), intra-aortic balloon pump: 12 vs 1 (P = 0.003), and 30-day mortality (9% in group A vs 1% in group B; P = 0.018). Clopidogrel was administered in 35% of patients with prior PCI and in 19% of patients without prior PCI (P = 0.016). Patients with prior PCI presented for CABG with more severe CAD. Morbidity, mortality and reoperation rate during mid-term follow-up were significantly higher in patients with prior PCI.
CABG; CABG and PCI; CAD; outcome
While high-dimensional molecular data such as microarray gene expression data have been used for disease outcome prediction or diagnosis purposes for about ten years in biomedical research, the question of the additional predictive value of such data given that classical predictors are already available has long been under-considered in the bioinformatics literature.
We suggest an intuitive permutation-based testing procedure for assessing the additional predictive value of high-dimensional molecular data. Our method combines two well-known statistical tools: logistic regression and boosting regression. We give clear advice for the choice of the only method parameter (the number of boosting iterations). In simulations, our novel approach is found to have very good power in different settings, e.g. few strong predictors or many weak predictors. For illustrative purposes, it is applied to two publicly available cancer data sets.
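The general idea of such a permutation test can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation in "globalboosttest": the function names, the step size `nu`, the default `mstop`, and the use of the training negative log-likelihood as test statistic are all choices made here for illustration. The clinical covariates `Z` are fitted by logistic regression, the resulting linear predictor is kept as a fixed offset, and componentwise boosting is then run on the molecular predictors `X`. Permuting the rows of `X` breaks their link to the outcome while leaving the clinical model untouched.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_offset(Z, y, n_iter=25):
    """Logistic regression on the clinical covariates via Newton-Raphson.
    Returns the fitted linear predictor, used as a fixed offset later."""
    Zi = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(Zi.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Zi @ beta)
        w = p * (1 - p) + 1e-9
        grad = Zi.T @ (y - p)
        hess = (Zi * w[:, None]).T @ Zi
        beta += np.linalg.solve(hess, grad)
    return Zi @ beta

def boosted_loss(X, y, offset, mstop=50, nu=0.1):
    """Componentwise gradient boosting on X starting from the offset.
    Returns the mean negative binomial log-likelihood after mstop steps."""
    f = offset.copy()
    Xc = X - X.mean(axis=0)
    ss = (Xc ** 2).sum(axis=0)
    for _ in range(mstop):
        r = y - sigmoid(f)               # negative gradient of the logistic loss
        coef = Xc.T @ r / ss             # least-squares fit of r on each column
        j = np.argmax(coef ** 2 * ss)    # column with the best fit
        f += nu * coef[j] * Xc[:, j]
    p = np.clip(sigmoid(f), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def permutation_pvalue(X, Z, y, n_perm=200, mstop=50, seed=0):
    """Compare the loss on the observed X with losses on row-permuted X."""
    rng = np.random.default_rng(seed)
    offset = fit_logistic_offset(Z, y)
    observed = boosted_loss(X, y, offset, mstop)
    perm = np.array([
        boosted_loss(X[rng.permutation(len(y))], y, offset, mstop)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(perm <= observed)) / (1 + n_perm)

# Toy data (simulated, for illustration only): two informative molecular predictors.
rng = np.random.default_rng(42)
n = 120
Z = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 30))
eta = 0.5 * Z[:, 0] + 2.0 * X[:, 0] + 2.0 * X[:, 1]
y = (rng.uniform(size=n) < sigmoid(eta)).astype(float)
print(permutation_pvalue(X, Z, y, n_perm=99))
```

A small p-value indicates that the molecular predictors reduce the loss more than expected by chance once the clinical covariates are accounted for.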
Our simple and computationally efficient approach can be used to globally assess the additional predictive power of a large number of candidate predictors given that a few clinical covariates or a known prognostic index are already available. It is implemented in the R package "globalboosttest", which is publicly available from R-forge and will be submitted to CRAN as soon as possible.
In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.
We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy of presenting only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and we suggest alternative approaches for properly reporting classification accuracy.
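The selection effect described above can be reproduced on a much smaller scale. The following sketch is a toy illustration and not the study design: it uses simulated Gaussian noise instead of real microarray data, a single nearest-centroid classifier with eight choices for the number of selected features instead of 124 classifier variants, and hypothetical function names. The class labels are permuted so that no predictor is informative, each variant is evaluated by 5-fold cross-validation, and the minimal error over the variants is compared with the average error.

```python
import numpy as np

def cv_error(X, y, n_features, folds):
    """Cross-validated error of a nearest-centroid classifier.
    Feature selection is repeated inside every training fold."""
    errors = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        Xtr, ytr = X[train], y[train]
        m0 = Xtr[ytr == 0].mean(axis=0)
        m1 = Xtr[ytr == 1].mean(axis=0)
        score = np.abs(m1 - m0) / (Xtr.std(axis=0) + 1e-12)
        top = np.argsort(score)[-n_features:]      # highest-scoring features
        Xte = X[test_idx][:, top]
        d0 = ((Xte - m0[top]) ** 2).sum(axis=1)
        d1 = ((Xte - m1[top]) ** 2).sum(axis=1)
        pred = (d1 < d0).astype(int)
        errors += np.sum(pred != y[test_idx])
    return errors / len(y)

def minimal_error_study(n=60, p=500, grid=(1, 2, 5, 10, 20, 50, 100, 200),
                        n_rep=20, seed=1):
    """Median minimal and median average CV error over label permutations."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y0 = np.repeat([0, 1], n // 2)
    min_errs, mean_errs = [], []
    for _ in range(n_rep):
        y = rng.permutation(y0)                    # non-informative labels
        folds = np.array_split(rng.permutation(n), 5)
        errs = np.array([cv_error(X, y, m, folds) for m in grid])
        min_errs.append(errs.min())
        mean_errs.append(errs.mean())
    return float(np.median(min_errs)), float(np.median(mean_errs))

med_min, med_mean = minimal_error_study()
print("median minimal error:", med_min)
print("median average error:", med_mean)
```

Because the labels carry no information, any gap between the two printed numbers is purely the result of picking the best variant after the fact.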
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
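For reference, the standard (unregularized) estimate mentioned above can be written down directly: partial correlations are obtained by rescaling the negated off-diagonal entries of the inverse covariance matrix. The sketch below uses NumPy; the function name is chosen here for illustration. It is meant for the well-behaved case with many more samples than variables, which is precisely the setting where this estimate works and regularization is not needed.

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix from the (Moore-Penrose) inverse
    of the sample covariance matrix. Rows of X are samples."""
    S = np.cov(X, rowvar=False)
    omega = np.linalg.pinv(S)          # estimated precision matrix
    d = np.sqrt(np.diag(omega))
    P = -omega / np.outer(d, d)        # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(P, 1.0)
    return P

# Simulated chain x1 -> x2 -> x3: x1 and x3 are correlated,
# but conditionally independent given x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
x3 = x2 + rng.normal(size=5000)
print(np.round(partial_correlations(np.column_stack([x1, x2, x3])), 2))
```

In the chain example the estimated partial correlation between the first and third variable is close to zero even though their marginal correlation is not, which is what makes partial correlations useful for network reconstruction.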
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
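One way to read the regression-based framework is sketched below for ridge regression. This is an illustration under assumptions and not the code of "parcor": each variable is regressed on all others, and the two coefficients linking a pair of variables are combined into a partial correlation as the signed geometric mean, set to zero when their signs disagree. The function name and the fixed penalty `lam` are hypothetical; in practice the penalty would have to be chosen from the data, for example by cross-validation.

```python
import numpy as np

def ridge_partial_correlations(X, lam=1.0):
    """Partial correlations from p separate ridge regressions.
    lam is a fixed penalty chosen here only for illustration."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    n, p = Xs.shape
    B = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        A = Xs[:, others]
        # Ridge solution for regressing variable i on all other variables
        coef = np.linalg.solve(A.T @ A + lam * np.eye(p - 1), A.T @ Xs[:, i])
        B[i, others] = coef
    prod = B * B.T                                   # beta_ij * beta_ji
    P = np.sign(B) * np.sqrt(np.clip(prod, 0.0, None))  # zero if signs disagree
    np.fill_diagonal(P, 1.0)
    return P

# Simulated chain x1 -> x2 -> x3 (illustrative data only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
x3 = x2 + rng.normal(size=5000)
print(np.round(ridge_partial_correlations(np.column_stack([x1, x2, x3])), 2))
```

Unlike the inverse of the sample covariance matrix, the ridge systems remain solvable when the number of variables exceeds the number of samples, since the penalty makes each linear system well-conditioned.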
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well-known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression-based approaches when using subsampling.
Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.
We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations, we develop a new, conditional permutation scheme for the computation of the variable importance measure.
The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.
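The difference between unconditional and conditional permutation can be illustrated with a short sketch. It is a simplified stand-in and not the scheme developed in the paper: here the permutation grid is built from quantile bins of a single correlated covariate, the prediction rule is a fixed function rather than a fitted forest, and all names (`conditional_permute`, `permutation_importance`, `n_bins`) are hypothetical. The point is only that permuting a predictor within strata of a correlated covariate preserves their association, so the resulting importance is not inflated by that correlation.

```python
import numpy as np

def conditional_permute(xj, z, n_bins=4, rng=None):
    """Permute xj only within strata defined by quantile bins of z."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(z, edges)
    out = xj.copy()
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        out[idx] = xj[rng.permutation(idx)]
    return out

def permutation_importance(predict, X, y, j, cond=None, n_rep=20, seed=0):
    """Mean increase in squared error after permuting column j.
    cond=None gives the unconditional (marginal) scheme; otherwise
    column j is permuted within strata of column cond."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    increases = []
    for _ in range(n_rep):
        Xp = X.copy()
        if cond is None:
            Xp[:, j] = rng.permutation(X[:, j])
        else:
            Xp[:, j] = conditional_permute(X[:, j], X[:, cond], rng=rng)
        increases.append(np.mean((predict(Xp) - y) ** 2) - base)
    return float(np.mean(increases))

# Simulated example: only x1 influences y, x2 is merely correlated with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = 0.9 * x1 + np.sqrt(0.19) * rng.normal(size=2000)
X = np.column_stack([x1, x2])
y = x1 + 0.5 * rng.normal(size=2000)

def predict(A):
    # Assumed prediction rule that spreads the effect over both predictors
    return 0.5 * A[:, 0] + 0.5 * A[:, 1]

print("marginal importance of x2:   ", permutation_importance(predict, X, y, j=1))
print("conditional importance of x2:", permutation_importance(predict, X, y, j=1, cond=0))
```

In this toy setting the marginal importance of the second predictor is driven largely by its correlation with the first, and conditioning on the first predictor reduces it.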