Machine learning is increasingly applied to microarray gene expression data to develop classifiers using a variety of methods. However, method comparisons in cross-study datasets are very scarce. This study compares the performance of seven classification methods and the effect of voting for predicting metastasis outcome in breast cancer patients, in three situations: within the same dataset or across datasets on similar or dissimilar microarray platforms. Combining classification results from seven classifiers into one voting decision performed significantly better during internal validation as well as external validation on similar microarray platforms than the underlying classification methods. When validating between different microarray platforms, random forest, another voting-based method, proved to be the best performing method. We conclude that voting-based classifiers provided an advantage with respect to classifying metastasis outcome in breast cancer patients.
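The voting step described above can be illustrated with a minimal sketch. It assumes each of the seven classifiers emits one label per patient and that the consensus is a simple majority; the labels, the example votes and the tie-break rule are hypothetical and not taken from the study.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one sample.

    Ties are broken in favour of the label that appears first in
    `predictions` (a hypothetical rule, not taken from the study).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical outputs of seven classifiers for three patients.
votes_per_patient = [
    ["met", "met", "no", "met", "no", "met", "no"],
    ["no", "no", "no", "met", "no", "no", "met"],
    ["met", "no", "met", "no", "met", "no", "met"],
]
consensus = [majority_vote(votes) for votes in votes_per_patient]
```

With an odd number of voters and two classes, as here, a tie cannot occur; the tie-break only matters for other configurations.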
A consensus prognostic classifier for estrogen receptor positive breast tumors has been developed and shown to be valid in nearly 900 samples across different microarray platforms.
A consensus prognostic gene expression classifier is still elusive in heterogeneous diseases such as breast cancer.
Here we perform a combined analysis of three major breast cancer microarray data sets to home in on a universally valid prognostic molecular classifier in estrogen receptor (ER) positive tumors. Using a recently developed robust measure of prognostic separation, we further validate the prognostic classifier in three external independent cohorts, confirming the validity of our molecular classifier in a total of 877 ER positive samples. Furthermore, we find that molecular classifiers may not outperform classical prognostic indices but that they can be used in hybrid molecular-pathological classification schemes to improve prognostic separation.
The prognostic molecular classifier presented here is the first to be valid in over 877 ER positive breast cancer samples and across three different microarray platforms. Larger multi-institutional studies will be needed to fully determine the added prognostic value of molecular classifiers when combined with standard prognostic factors.
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces on the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study, using publicly-available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these classifiers significantly. Representative experimental results are presented and discussed in this work.
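As a rough illustration of the bagging scheme discussed above, the sketch below trains a base classifier on bootstrap resamples and lets the resulting models vote. The base learner is a one-feature threshold rule chosen only to keep the example short (it is not one of the classifiers from the study), and the data, the number of models and the seed are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature threshold rule: predict 1 if x > threshold."""
    best_thr, best_err = None, None
    for thr in sorted(set(xs)):
        err = sum((x > thr) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    return best_thr

def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Majority vote of stumps, each trained on its own bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    votes = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n cases with replacement
        thr = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes.append(int(x_new > thr))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-feature data set with two well-separated classes.
xs = [0.1, 0.2, 0.3, 0.4, 1.6, 1.7, 1.8, 1.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
```

Swapping `fit_stump` for an unstable learner such as a deep decision tree is where the abstract reports that bagging helps most.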
We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process).
With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles.
Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
We consider both univariate- and multivariate-based feature selection for the problem of binary classification with microarray data. The idea is to determine whether the more sophisticated multivariate approach leads to better misclassification error rates because of the potential to consider jointly significant subsets of genes (but without overfitting the data).
We present an empirical study in which 10-fold cross-validation is applied externally to both a univariate-based and two multivariate- (genetic algorithm (GA)-) based feature selection processes. These procedures are applied with respect to three supervised learning algorithms and six published two-class microarray datasets.
Considering all datasets, and learning algorithms, the average 10-fold external cross-validation error rates for the univariate-, single-stage GA-, and two-stage GA-based processes are 14.2%, 14.6%, and 14.2%, respectively. We also find that the optimism bias estimates from the GA analyses were half that of the univariate approach, but the selection bias estimates from the GA analyses were 2.5 times that of the univariate results.
We find that the 10-fold external cross-validation misclassification error rates were very comparable. Further, we find that a two-stage GA approach did not demonstrate a significant advantage over a single-stage approach. We also find that the univariate approach had higher optimism bias and lower selection bias compared to both GA approaches.
cross-validation; feature selection; supervised-learning; genetic algorithm
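The idea of applying cross-validation externally to the feature selection process can be sketched as follows: genes are re-selected inside every training fold, so the held-out samples never influence the selection. This is a minimal sketch under stated assumptions; the univariate filter (difference of class means), the nearest-centroid classifier, the interleaved fold assignment and the toy data are all hypothetical stand-ins for the procedures used in the study.

```python
def mean(values):
    return sum(values) / len(values)

def select_features(X, y, n_keep):
    """Rank features by absolute difference of class means (univariate filter)."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean([row[j] for row, lab in zip(X, y) if lab == 0])
        m1 = mean([row[j] for row, lab in zip(X, y) if lab == 1])
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]

def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid is closest on the selected features."""
    best_label, best_dist = None, None
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = sum((x[j] - mean([r[j] for r in rows])) ** 2 for j in features)
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def external_cv_error(X, y, k, n_keep):
    """k-fold error with feature selection repeated inside each training fold.

    Assumes every training fold contains both classes.
    """
    errors = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        features = select_features(X_tr, y_tr, n_keep)  # sees training data only
        for i in test:
            errors += nearest_centroid_predict(X_tr, y_tr, X[i], features) != y[i]
    return errors / len(X)

# Hypothetical data: feature 0 is informative, features 1 and 2 are noise.
X = [[0.0, 0.5, 0.4], [0.1, 0.4, 0.6], [0.2, 0.6, 0.5], [0.1, 0.5, 0.5],
     [1.0, 0.5, 0.6], [1.1, 0.6, 0.4], [0.9, 0.4, 0.5], [1.2, 0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
cv_error = external_cv_error(X, y, k=4, n_keep=1)
```

Selecting features once on the full data set and cross-validating only the classifier would be the internal variant whose error estimate is prone to selection bias.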
Controlled clinical trials of health care interventions are either explanatory or pragmatic. Explanatory trials test whether an intervention is efficacious; that is, whether it can have a beneficial effect in an ideal situation. Pragmatic trials measure effectiveness; they measure the degree of beneficial effect in real clinical practice. In pragmatic trials, a balance between external validity (generalizability of the results) and internal validity (reliability or accuracy of the results) needs to be achieved. The explanatory trial seeks to maximize the internal validity by assuring rigorous control of all variables other than the intervention. The pragmatic trial seeks to maximize external validity to ensure that the results can be generalized. However the danger of pragmatic trials is that internal validity may be overly compromised in the effort to ensure generalizability. We are conducting two pragmatic randomized controlled trials on interventions in the management of hypertension in primary care. We describe the design of the trials and the steps taken to deal with the competing demands of external and internal validity.
External validity is maximized by having few exclusion criteria and by allowing flexibility in the interpretation of the intervention and in management decisions. Internal validity is maximized by decreasing contamination bias through cluster randomization, and decreasing observer and assessment bias, in these non-blinded trials, through baseline data collection prior to randomization, automating the outcomes assessment with 24 hour ambulatory blood pressure monitors, and blinding the data analysis.
Clinical trials conducted in community practices present investigators with difficult methodological choices related to maintaining a balance between internal validity (reliability of the results) and external validity (generalizability). The attempt to achieve methodological purity can result in clinically meaningless results, while attempting to achieve full generalizability can result in invalid and unreliable results. Achieving a creative tension between the two is crucial.
Clinical and epidemiologic investigations are paying increasing attention to the critical constructs of “representativeness” of study samples and “generalizability” of study results. This is a laudable trend and yet, these key concepts are often misconstrued and conflated, masking the central issues of internal and external validity. The authors define these issues and demonstrate how they are related to one another and to generalizability. Providing examples, they identify threats to validity from different forms of bias and confounding. They also lay out relevant practical issues in study design, from sample selection to assessment of exposures, in both clinic-based and population-based settings.
It is often stated that external validity is not sufficiently considered in the assessment of clinical studies. Although tools for its evaluation have been established, there is a lack of awareness of their significance and application. In this article, a comprehensive checklist is presented addressing these relevant criteria.
The checklist was developed by listing the most commonly used assessment criteria for clinical studies. Additionally, specific lists for individual applications were included. The categories of biases of internal validity (selection, performance, attrition and detection bias) correspond to structural, treatment-related and observational differences between the test and control groups. Analogously, we have extended these categories to address external validity and model validity, regarding similarity between the study population/conditions and the general population/conditions related to structure, treatment and observation.
A checklist is presented, in which the evaluation criteria concerning external validity and model validity are systemised and transformed into a questionnaire format.
The checklist presented in this article can be applied to both planning and evaluating of clinical studies. We encourage the prospective user to modify the checklists according to the respective application and research question. The higher expenditure needed for the evaluation of clinical studies in systematic reviews is justified, particularly in the light of the influential nature of their conclusions on therapeutic decisions and the creation of clinical guidelines.
Accurate evaluation of glomerular filtration rates (GFRs) is of critical importance in clinical practice. A previous study showed that models based on artificial neural networks (ANNs) could achieve a better performance than traditional equations. However, large-sample cross-sectional surveys have not resolved questions about ANN performance.
A total of 1,180 patients with chronic kidney disease (CKD) were enrolled across the development, internal validation and external validation data sets. An additional 222 patients admitted to two independent institutions formed a second external validation data set. Several ANNs were constructed and finally a Back Propagation network optimized by a genetic algorithm (GABP network) was chosen as a superior model, which included six input variables; i.e., serum creatinine, serum urea nitrogen, age, height, weight and gender, and estimated GFR as the one output variable. Performance was then compared with the Cockcroft-Gault equation, the MDRD equations and the CKD-EPI equation.
In the external validation data set, Bland-Altman analysis demonstrated that the precision of the six-variable GABP network was the highest among all of the estimation models; i.e., 46.7 ml/min/1.73 m² vs. a range from 71.3 to 101.7 ml/min/1.73 m², allowing improvement in accuracy (15% accuracy, 49.0%; 30% accuracy, 75.1%; 50% accuracy, 90.5% [P<0.001 for all]) and CKD stage classification (misclassification rate of CKD stage, 32.4% vs. a range from 47.3% to 53.3% [P<0.001 for all]). Furthermore, in the additional external validation data set, precision and accuracy were improved by the six-variable GABP network.
A new ANN model (the six-variable GABP network) for CKD patients was developed that could provide a simple, more accurate and reliable means for the estimation of GFR and stage of CKD than traditional equations. Further validations are needed to assess the ability of the ANN model in diverse populations.
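The accuracy and agreement figures reported above can be computed along the following lines. This is a sketch under assumptions: "15%/30%/50% accuracy" is read as the percentage of estimates falling within that fraction of the measured GFR, and the Bland-Altman summary uses the conventional bias plus or minus 1.96 standard deviations; the abstract does not spell out either definition, and the example values are hypothetical.

```python
import statistics

def accuracy_within(estimated, measured, tolerance):
    """Percentage of estimates within `tolerance` (a fraction) of the measured value."""
    within = sum(abs(e - m) <= tolerance * m for e, m in zip(estimated, measured))
    return 100.0 * within / len(measured)

def bland_altman_limits(estimated, measured):
    """Return (bias, lower limit, upper limit) of agreement."""
    diffs = [e - m for e, m in zip(estimated, measured)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical measured and estimated GFR values (ml/min/1.73 m^2).
measured = [100.0, 60.0, 30.0, 90.0]
estimated = [110.0, 40.0, 33.0, 95.0]
p30 = accuracy_within(estimated, measured, 0.30)
bias, lower, upper = bland_altman_limits(estimated, measured)
```

The width `upper - lower` is one common way to express precision in a Bland-Altman analysis; whether the study used exactly this quantity is an assumption.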
Oncology is a highly researched therapeutic area with an ever expanding armamentarium of drugs entering the market. It is unique in how the heterogeneity of tumor, patient and treatment factors is critical in determining outcomes of interventions. When it comes to decision making in the clinic, the practicing physician often seeks answers in populations with obvious deviations from the ideal selected populations included in the pivotal phase III randomized controlled trials (RCTs). While the randomized nature of the RCT ensures its high internal validity by removing bias, its ‘controlled’ nature casts doubt on its generalizability to the real world population. It is for this reason that trials done in a naturalistic setting after the marketing authorization of a drug are increasingly required. This article discusses the importance of non-interventional drug studies in oncology as an important tool in testing the external validity of controlled trial results and their value in the generation of new hypotheses. It also discusses the limitations of such studies while outlining the steps in their effective conduct.
Good clinical practice; non-interventional studies; standard operating procedures; study plan
In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.
We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
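The optimistic bias of reporting only the best of many classifier variants can be illustrated with a deliberately crude simulation. Here each "variant" is replaced by random guessing on balanced labels, which is a stand-in for classifiers trained on permuted, non-informative data and not the procedure used in the study; the figure of 124 variants comes from the text, while the sample size and seed are hypothetical.

```python
import random
import statistics

def simulate_min_error(n_samples=40, n_variants=124, seed=0):
    """Error rates of uninformative 'classifiers'; return (minimum, median)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # balanced two-class labels
    errors = []
    for _ in range(n_variants):
        guesses = [rng.randint(0, 1) for _ in range(n_samples)]
        wrong = sum(g != t for g, t in zip(guesses, labels))
        errors.append(wrong / n_samples)
    return min(errors), statistics.median(errors)

min_err, median_err = simulate_min_error()
```

Every variant has a true error rate of 50%, so any gap between `min_err` and `median_err` is purely the effect of picking the best result after the fact.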
Objectives To evaluate the risk of bias tool, introduced by the Cochrane Collaboration for assessing the internal validity of randomised trials, for inter-rater agreement, concurrent validity compared with the Jadad scale and Schulz approach to allocation concealment, and the relation between risk of bias and effect estimates.
Design Cross sectional study.
Study sample 163 trials in children.
Main outcome measures Inter-rater agreement between reviewers assessing trials using the risk of bias tool (weighted κ), time to apply the risk of bias tool compared with other approaches to quality assessment (paired t test), degree of correlation for overall risk compared with overall quality scores (Kendall’s τ statistic), and magnitude of effect estimates for studies classified as being at high, unclear, or low risk of bias (metaregression).
Results Inter-rater agreement on individual domains of the risk of bias tool ranged from slight (κ=0.13) to substantial (κ=0.74). The mean time to complete the risk of bias tool was significantly longer than for the Jadad scale and Schulz approach, individually or combined (8.8 minutes (SD 2.2) per study v 2.0 (SD 0.8), P<0.001). There was low correlation between risk of bias overall compared with the Jadad scores (P=0.395) and Schulz approach (P=0.064). Effect sizes differed between studies assessed as being at high or unclear risk of bias (0.52) compared with those at low risk (0.23).
Conclusions Inter-rater agreement varied across domains of the risk of bias tool. Generally, agreement was poorer for those items that required more judgment. There was low correlation between assessments of overall risk of bias and two common approaches to quality assessment: the Jadad scale and Schulz approach to allocation concealment. Overall risk of bias as assessed by the risk of bias tool differentiated effect estimates, with more conservative estimates for studies at low risk.
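For readers unfamiliar with the weighted κ statistic used for inter-rater agreement, a minimal sketch follows. It assumes linear disagreement weights and an ordinal coding of the risk of bias judgments (0 = low, 1 = unclear, 2 = high); the abstract does not state which weighting scheme was used, and the example ratings are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's weighted kappa with linear disagreement weights.

    Ratings are integers in range(n_categories). Undefined (division by
    zero) when both raters place every item in one and the same category.
    """
    n = len(rater_a)
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(n_categories))
              for j in range(n_categories)]
    obs_dis, exp_dis = 0.0, 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = abs(i - j) / (n_categories - 1)  # 0 on the diagonal
            obs_dis += weight * observed[i][j]
            exp_dis += weight * marg_a[i] * marg_b[j]
    return 1.0 - obs_dis / exp_dis

# Hypothetical ratings of six trials by two reviewers (0=low, 1=unclear, 2=high).
reviewer_1 = [0, 1, 2, 0, 1, 2]
reviewer_2 = [0, 1, 2, 0, 1, 2]
kappa = weighted_kappa(reviewer_1, reviewer_2, 3)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.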
In a practical classifier design problem, the sample size is limited, and the available finite sample needs to be used both to design a classifier and to predict the classifier’s performance for the true population. Since a larger sample is more representative of the population, it is advantageous to design the classifier with all the available cases, and to use a resampling technique for performance prediction. We conducted a Monte-Carlo simulation study to compare the ability of different resampling techniques in predicting the performance of a neural network (NN) classifier designed with the available sample. We used the area under the receiver operating characteristic curve as the performance index for the NN classifier. We investigated resampling techniques based on the cross-validation, the leave-one-out method, and three different types of bootstrapping, namely, the ordinary, .632, and .632+ bootstrap. Our results indicated that, under the study conditions, there can be a large difference in the accuracy of the prediction obtained from different resampling methods, especially when the feature space dimensionality is relatively large and the sample size is small. Although this investigation is performed under some specific conditions, it reveals important trends for the problem of classifier performance prediction under the constraint of a limited data set.
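Of the resampling techniques listed above, the .632 bootstrap can be sketched compactly: it blends the apparent (resubstitution) error with the error on cases left out of each bootstrap sample, using weights 0.368 and 0.632. The sketch below makes several simplifying assumptions that differ from the study: it uses a 1-nearest-neighbour rule instead of a neural network, error rate instead of the area under the ROC curve, and a per-replicate average of out-of-bag error; the data and seed are hypothetical.

```python
import random

def predict_1nn(train_x, train_y, x):
    """Label of the training point closest to x (1-nearest neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def error_rate(train_x, train_y, test_x, test_y):
    wrong = sum(predict_1nn(train_x, train_y, x) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_x)

def bootstrap_632(xs, ys, n_boot=50, seed=0):
    """Estimate 0.368 * apparent error + 0.632 * mean out-of-bag error."""
    rng = random.Random(seed)
    n = len(xs)
    resub = error_rate(xs, ys, xs, ys)  # train and test on all cases
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:  # rare: the resample happened to contain every case
            continue
        oob_errors.append(error_rate([xs[i] for i in idx], [ys[i] for i in idx],
                                     [xs[i] for i in oob], [ys[i] for i in oob]))
    if not oob_errors:
        return resub
    return 0.368 * resub + 0.632 * sum(oob_errors) / len(oob_errors)

# Hypothetical one-feature sample with two classes.
xs = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
estimate = bootstrap_632(xs, ys)
```

The .632+ variant additionally adjusts the weights according to the degree of overfitting and is not shown here.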
Risk of bias in translational medicine may take one of three forms: A. a systematic error of methodology as it pertains to measurement or sampling (e.g., selection bias), B. a systematic defect of design that leads to estimates of experimental and control groups, and of effect sizes that substantially deviate from true values (e.g., information bias), and C. a systematic distortion of the analytical process, which results in a misrepresentation of the data with consequential errors of inference (e.g., inferential bias). Risk of bias can seriously adulterate the internal and the external validity of a clinical study, and, unless it is identified and systematically evaluated, can seriously hamper the process of comparative effectiveness and efficacy research and analysis for practice. The Cochrane Group and the Agency for Healthcare Research and Quality have independently developed instruments for assessing the meta-construct of risk of bias. The present article begins to discuss this dialectic.
Motivation: Automatic classification of high-resolution mass spectrometry proteomic data has increasing potential in the early diagnosis of cancer. We propose a new procedure of biomarker discovery in serum protein profiles based on: (i) discrete wavelet transformation of the spectra; (ii) selection of discriminative wavelet coefficients by a statistical test and (iii) building and evaluating a support vector machine classifier by double cross-validation with attention to the generalizability of the results. In addition to the evaluation results (total recognition rate, sensitivity and specificity), the procedure provides the biomarker patterns, i.e. the parts of spectra which discriminate cancer and control individuals. The evaluation was performed on matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) serum protein profiles of 66 colorectal cancer patients and 50 controls.
Results: Our procedure provided a high recognition rate (97.3%), sensitivity (98.4%) and specificity (95.8%). The extracted biomarker patterns mostly represent the peaks expressing mean differences between the cancer and control spectra. However, we showed that the discriminative power of a peak is not simply expressed by its mean height and cannot be derived by comparison of the mean spectra. The obtained classifiers have high generalization power as measured by the number of support vectors. This prevents overfitting and contributes to the reproducibility of the results, which is required to find biomarkers differentiating cancer patients from healthy individuals.
Availability: The data and scripts used in this study are available at http://www.math.uni-bremen.de/~theodore/MALDIDWT.
Supplementary information: Supplementary data are available at Bioinformatics online.
Identification of molecular classifiers from genome-wide gene expression analysis is an important practice for the investigation of biological systems in the post-genomic era - and one with great potential for near-term clinical impact. The 'Top-Scoring Pair' (TSP) classification method identifies pairs of genes whose relative expression correlates strongly with phenotype. In this study, we sought to assess the effectiveness of the TSP approach in the identification of diagnostic classifiers for a number of human diseases including bacterial and viral infection, cardiomyopathy, diabetes, Crohn's disease, and transformed ulcerative colitis. We examined transcriptional profiles from both solid tissues and blood-borne leukocytes.
The algorithm identified multiple predictive gene pairs for each phenotype, with cross-validation accuracy ranging from 70 to nearly 100 percent, and high sensitivity and specificity observed in most classification tasks. Performance compared favourably with that of pre-existing transcription-based classifiers, and in some cases was comparable to the accuracy of current clinical diagnostic procedures. Several diseases of solid tissues could be reliably diagnosed through classifiers based on the blood-borne leukocyte transcriptome. The TSP classifier thus represents a simple yet robust method to differentiate between diverse phenotypic states based on gene expression profiles.
Two-transcript classifiers have the potential to reliably classify diverse human diseases, through analysis of both local diseased tissue and the immunological response assayed through blood-borne leukocytes. The experimental simplicity of this method results in measurements that can be easily translated to clinical practice.
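The pair search at the heart of the TSP method can be sketched as follows. The score used here, the absolute difference between classes in the frequency with which gene i is expressed below gene j, is the usual formulation of the top-scoring pair criterion; the tiny expression matrix is hypothetical, and the tie-breaking and prediction steps of the full method are omitted.

```python
from itertools import combinations

def top_scoring_pair(X, y):
    """Return (score, (i, j)) for the gene pair whose ordering best separates classes.

    X is a list of samples (each a list of expression values), y holds 0/1 labels.
    score = |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|
    """
    best_score, best_pair = -1.0, None
    for i, j in combinations(range(len(X[0])), 2):
        freq = []
        for label in (0, 1):
            rows = [r for r, lab in zip(X, y) if lab == label]
            freq.append(sum(r[i] < r[j] for r in rows) / len(rows))
        score = abs(freq[0] - freq[1])
        if score > best_score:  # first pair wins on ties
            best_score, best_pair = score, (i, j)
    return best_score, best_pair

# Hypothetical expression values: 4 samples x 3 genes.
X = [[1, 2, 5], [0, 3, 1], [2, 1, 5], [4, 0, 1]]
y = [0, 0, 1, 1]
score, pair = top_scoring_pair(X, y)
```

Because only the relative order of two transcripts matters, the rule is unaffected by monotone normalisation of each sample, which is part of the method's appeal for cross-platform use.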
OBJECTIVE: To test the feasibility of creating a valid and reliable checklist with the following features: appropriate for assessing both randomised and non-randomised studies; provision of both an overall score for study quality and a profile of scores not only for the quality of reporting, internal validity (bias and confounding) and power, but also for external validity. DESIGN: A pilot version was first developed, based on epidemiological principles, reviews, and existing checklists for randomised studies. Face and content validity were assessed by three experienced reviewers and reliability was determined using two raters assessing 10 randomised and 10 non-randomised studies. Using different raters, the checklist was revised and tested for internal consistency (Kuder-Richardson 20), test-retest and inter-rater reliability (Spearman correlation coefficient and sign rank test; kappa statistics), criterion validity, and respondent burden. MAIN RESULTS: The performance of the checklist improved considerably after revision of a pilot version. The Quality Index had high internal consistency (KR-20: 0.89) as did the subscales apart from external validity (KR-20: 0.54). Test-retest (r 0.88) and inter-rater (r 0.75) reliability of the Quality Index were good. Reliability of the subscales varied from good (bias) to poor (external validity). The Quality Index correlated highly with an existing, established instrument for assessing randomised studies (r 0.90). There was little difference between its performance with non-randomised and with randomised studies. Raters took about 20 minutes to assess each paper (range 10 to 45 minutes). CONCLUSIONS: This study has shown that it is feasible to develop a checklist that can be used to assess the methodological quality not only of randomised controlled trials but also non-randomised studies. It has also shown that it is possible to produce a checklist that provides a profile of the paper, alerting reviewers to its particular methodological strengths and weaknesses. Further work is required to improve the checklist and the training of raters in the assessment of external validity.
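The internal consistency statistic quoted above, Kuder-Richardson 20, applies to checklists with yes/no items and can be computed as sketched below. The sketch assumes the population variance of the total scores in the denominator (some texts use the sample variance), and the response matrix is hypothetical.

```python
import statistics

def kr20(responses):
    """Kuder-Richardson 20 for a matrix of 0/1 item scores.

    responses[r][i] is the score of assessed paper r on checklist item i.
    KR-20 = k / (k - 1) * (1 - sum(p_i * (1 - p_i)) / variance of total scores)
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_var = statistics.pvariance(totals)  # population variance (assumption)
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / len(responses)
        pq_sum += p * (1.0 - p)
    return n_items / (n_items - 1) * (1.0 - pq_sum / total_var)

# Hypothetical scores of four papers on a three-item checklist.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
consistency = kr20(responses)
```

Values near 1 indicate that the items tend to rise and fall together across papers.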
Male rats were treated with various model compounds or the appropriate vehicle controls. Most substances were either well-known hepatotoxicants or showed hepatotoxicity during preclinical testing. The aim of the present study was to determine if biological samples from rats treated with various compounds can be classified based on gene expression profiles. In addition to gene expression analysis using microarrays, a complete serum chemistry profile and liver and kidney histopathology were performed. We analyzed hepatic gene expression profiles using a supervised learning method (support vector machines; SVMs) to generate classification rules and combined this with recursive feature elimination to improve classification performance and to identify a compact subset of probe sets with potential use as biomarkers. Two different SVM algorithms were tested, and the models obtained were validated with a compound-based external cross-validation approach. Our predictive models were able to discriminate between hepatotoxic and nonhepatotoxic compounds. Furthermore, they predicted the correct class of hepatotoxicant in most cases. We provide an example showing that a predictive model built on transcript profiles from one rat strain can successfully classify profiles from another rat strain. In addition, we demonstrate that the predictive models identify nonresponders and are able to discriminate between gene changes related to pharmacology and toxicity. This work confirms the hypothesis that compound classification based on gene expression data is feasible.
liver; microarray; predictive toxicology; rat; support vector machines; toxicogenomics
Germ cell tumor (GCT) is the most common malignancy in young adult men. Currently, patients are risk-stratified on the basis of clinical presentation and serum tumor markers. The introduction of molecular markers could improve outcome prediction.
Patients and Methods
Expression profiling was performed on 74 nonseminomatous GCTs (NSGCTs) from cisplatin-treated patients (ie, training set) and on 34 similarly treated patients with NSGCTs (ie, validation set). A gene classifier was developed by using prediction analysis for microarrays (PAM) for the binary end point of 5-year overall survival (OS). A predictive score was developed for OS by using the univariate Cox model.
In the training set, PAM identified 140 genes that predicted 5-year OS (cross-validated classification rate, 60%). The PAM model correctly classified 90% of patients in the validation set. Patients predicted to have good outcome had significantly longer survival than those with poor predicted outcome (P < .001). For the OS end point, a 10-gene model had a predictive accuracy (ie, concordance index) of 0.66 in the training set and a concordance index of 0.83 in the validation set. Dichotomization of the samples on the basis of the median score resulted in significant differences in survival (P = .002). For both end points, the gene-based predictor was an independent prognostic factor in a multivariate model that included clinical risk stratification (P < .01 for both).
We have identified gene expression signatures that accurately predict outcome in patients with GCTs. These predictive genes should be useful for the prediction of patient outcome and could provide novel targets for therapeutic intervention.
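The concordance index reported above measures how often the predicted risk ordering agrees with the observed survival ordering. A minimal sketch of Harrell's version follows, assuming that a higher score means higher predicted risk and that a pair is usable only when the shorter follow-up time ends in an observed death; the survival data are hypothetical.

```python
def concordance_index(times, events, scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up times; events: 1 if death observed, 0 if censored;
    scores: predicted risk (higher = expected to die sooner).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: subject i has the shorter time and an observed event.
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Hypothetical follow-up data for three patients.
times = [5, 10, 15]
events = [1, 1, 0]
risk_scores = [3.0, 2.0, 1.0]
c_index = concordance_index(times, events, risk_scores)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported 0.66 and 0.83 in context.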
In childhood acute lymphoblastic leukemia (ALL), genetic subtypes are recognized that determine the risk group for further treatment. However, 25% of precursor B-ALL cases are currently genetically unclassified and have an intermediate prognosis. The present study used genome-wide strategies to reveal new biological insights and advance the prognostic classification of childhood ALL.
A classifier based on gene expression in ALL cells from 190 newly diagnosed pediatric cases was constructed using a double-loop cross-validation method and next, was validated on an independent cohort of 107 newly diagnosed pediatric ALL cases. Hierarchical cluster analysis using classifying gene probe sets then revealed a novel ALL subtype for which underlying genetic abnormalities were characterized by comparative genomic hybridization-arrays and molecular cytogenetics.
The median prediction accuracy of the classifier was 90% in the discovery cohort and 87.9% in the independent validation cohort. A significant part of the currently genetically unclassified cases clustered with BCR-ABL-positive cases in both the discovery and validation cohort. These BCR-ABL-like cases represent 15–20% of ALL cases and have a highly unfavorable outcome (5-year disease-free survival 59.5%, 95%CI: 37.1%–81.9%) compared to other precursor B-ALL cases (84.4%, 95%CI: 76.8%–92.1%; P=0.012), similar to the poor prognosis of BCR-ABL-positive ALL (51.9%, 95%CI: 23.1%–80.6%), as was confirmed in the validation cohort. Further genetic studies revealed that the BCR-ABL-like subtype is characterized by a high frequency of deletions in genes involved in B-cell development (82%), including IKAROS, E2A, EBF1, PAX5 and VPREB1, compared to other ALL cases (36%, p=0.0002). BCR-ABL-like leukemic cells were a median of >70 times more resistant to L-asparaginase (p=0.001) and 1.6 times more resistant to daunorubicin (p=0.017) compared to other precursor B-ALL cases, whereas the toxicity of prednisolone and vincristine did not significantly differ.
Classification by gene expression profiling identified a novel subtype of ALL not detected by current diagnostic procedures but which comprises the largest group of patients with a high-risk of treatment failure. New treatment strategies are needed to improve outcome for this novel high-risk subtype of ALL.
Dutch Cancer Society, Sophia Foundation for Medical Research, Pediatric Oncology Foundation Rotterdam, Center of Medical Systems Biology of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research, American National Institute of Health, American National Cancer Institute and American Lebanese Syrian Associated Charities.
microarray; gene expression profiling; classification; genotype; novel subtype; class discovery; ALL
This study evaluated the use of machine learning techniques in the classification of sentence type. 7253 structured abstracts and 204 unstructured abstracts of Randomized Controlled Trials from MEDLINE were parsed into sentences and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature of relative sentence location improved performance markedly for some models and increased the overall average ROC area to 0.95. Linear classifier performance was significantly worse than that of the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCTs. Identification of sentence types may be helpful for providing context to end users or to other text summarization techniques.
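The two feature types described above, a bag of words and a relative sentence location, can be sketched as below. The tokenizer, the `w=` feature prefix, the function name `sentence_features`, and the definition of relative location as position divided by the index of the last sentence are all assumptions for illustration; the abstract does not specify how the study encoded them.

```python
import re
from collections import Counter

def sentence_features(sentences):
    """Return one feature dict per sentence of a single abstract.

    Each dict holds bag-of-words counts (word order is discarded) plus
    'rel_pos', the sentence's relative location in the abstract:
    0.0 for the first sentence, 1.0 for the last (assumed definition).
    """
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        bag = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
        feats = {f"w={word}": count for word, count in bag.items()}
        feats["rel_pos"] = i / (n - 1) if n > 1 else 0.0
        features.append(feats)
    return features

abstract = [
    "Aspirin is widely used for pain.",
    "Patients were randomized to aspirin or placebo.",
    "Aspirin reduced pain scores.",
]
for feats in sentence_features(abstract):
    print(feats)
```

Feature dicts of this kind could then be vectorized and passed to any classifier; the location feature gives the model a cue that, for example, late sentences are more likely to be conclusions.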
Randomized clinical trials are currently considered the gold standard of evidence-based medicine. However, some limitations of randomized clinical trials of surgical interventions need to be pointed out. These limitations affect the external and internal validity of many surgical study designs. Some can be bypassed, although doing so can make the study more difficult to carry out; others cannot be bypassed. Extrapolating the result of a randomized clinical trial rests on the premise that the intervention will be similar wherever it is applied and whichever doctor performs it. However, no matter how standardized the technique may be, the results are not similar for all surgeons, which implies a significant limitation to the external validity of surgical randomized clinical trials. Given the various limitations on performing surgical trials capable of generating scientific evidence within the standards currently proposed in the evidence level classifications of medical publications, it is necessary to rethink whether those evidence levels apply equally to surgical and nonsurgical trials. We currently live in a time of supposed “inferiority” of surgical scientific works when viewed through the current quality criteria for a “suitable” clinical trial.
Clinical Trial; Surgery; Randomization; Blinding; Evidence Level
Supervised classification is fundamental in bioinformatics. Machine learning models, such as neural networks, have been applied to discover genes and expression patterns. This process is achieved by implementing training and test phases. In the training phase, a set of cases and their respective labels are used to build a classifier. During testing, the classifier is used to predict new cases. One approach to assessing its predictive quality is to estimate its accuracy during the test phase. Key limitations appear when dealing with small-data samples. This paper investigates the effect of data sampling techniques on the assessment of neural network classifiers.
Three data sampling techniques were studied: cross-validation, leave-one-out, and bootstrap. These methods are designed to reduce the bias and variance of small-sample estimations. Two prediction problems based on small-sample sets were considered: classification of microarray data originating from a leukemia study and from small, round blue-cell tumours. A third problem, the prediction of splice-junctions, was analysed for comparison. Different accuracy estimations were produced for each problem. The variations are accentuated in the small-data samples. The quality of the estimates depends on the number of train-test experiments and the amount of data used for training the networks.
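The three sampling techniques differ only in how they split the available cases into training and test sets, which the following sketch makes concrete. It is illustrative only: the function names are hypothetical, and the bootstrap variant shown (train on a resample drawn with replacement, test on the cases left out of it) is one common formulation that the study may or may not have used.

```python
import random

def k_fold_splits(n, k, rng):
    """Yield (train, test) index lists; each case is tested exactly once."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for fold in folds:
        held_out = set(fold)
        yield [i for i in idx if i not in held_out], fold

def leave_one_out_splits(n):
    """Yield n splits, each testing on a single case."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def bootstrap_splits(n, repeats, rng):
    """Yield splits that train on n cases drawn with replacement and
    test on the cases that were never drawn (assumed variant)."""
    for _ in range(repeats):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        yield train, [i for i in range(n) if i not in drawn]

rng = random.Random(0)
for name, splits in [
    ("10-fold", k_fold_splits(20, 10, rng)),
    ("leave-one-out", leave_one_out_splits(20)),
    ("bootstrap", bootstrap_splits(20, 100, rng)),
]:
    sizes = [(len(train), len(test)) for train, test in splits]
    print(name, "experiments:", len(sizes), "first (train, test) sizes:", sizes[0])
```

Printing the split sizes shows the trade-off the abstract refers to: the methods differ in the number of train-test experiments and in how much data each experiment leaves for training.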
The predictive quality assessment of biomolecular data classifiers depends on the data size, sampling techniques and the number of train-test experiments. Conservative and optimistic accuracy estimations can be obtained by applying different methods. Guidelines are suggested to select a sampling technique according to the complexity of the prediction problem under consideration.
Discourse connectives are words or phrases that connect or relate two coherent sentences or phrases and indicate the presence of discourse relations. Automatic recognition of discourse connectives may benefit many natural language processing applications. In this pilot study, we report the development of supervised machine-learning classifiers based on conditional random fields (CRFs) for automatically identifying discourse connectives in full-text biomedical articles. Our first classifier was trained on the open-domain, 1-million-token Penn Discourse Treebank (PDTB). We performed cross-validation on biomedical articles (approximately 100K word tokens) that we annotated. The results show that the classifier trained on PDTB data attained a 0.55 F1-score for identifying discourse connectives in biomedical text, while cross-validation on the biomedical text attained a 0.69 F1-score, a much better performance despite a much smaller training set. Our preliminary analysis suggests the existence of domain-specific features, and we speculate that domain-adaptation approaches may further improve performance.
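For readers unfamiliar with the F1-score used above, here is a minimal sketch of how it could be computed for connective identification. It assumes exact-match scoring over (sentence index, start token, end token) spans, which the abstract does not confirm; the function name and the example spans are hypothetical.

```python
def f1_score(gold_spans, predicted_spans):
    """Harmonic mean of precision and recall over exactly matching spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: (sentence index, first token, last token).
gold = [(0, 3, 5), (0, 10, 11), (1, 2, 3)]
predicted = [(0, 3, 5), (1, 2, 3), (1, 7, 8), (2, 0, 1)]
print(round(f1_score(gold, predicted), 3))
```

Because F1 ignores true negatives, it suits this task, where the vast majority of tokens are not connectives.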
PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.
We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or class-imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means).
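The difference between the two selection criteria can be seen in a small sketch. It is illustrative only: the shrinkage values, the cross-validated predictions, and the helper names (`error_rate`, `g_mean`, `select_shrinkage`) are made up for the example and do not come from the study or from any NSC implementation.

```python
from math import prod

def error_rate(y_true, y_pred):
    """Overall fraction of misclassified samples."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def g_mean(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies."""
    accuracies = []
    for c in sorted(set(y_true)):
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        accuracies.append(sum(p == c for p in preds_for_c) / len(preds_for_c))
    return prod(accuracies) ** (1 / len(accuracies))

def select_shrinkage(y_true, cv_predictions, criterion):
    """cv_predictions maps a shrinkage value to its CV predicted labels."""
    if criterion == "error":
        return min(cv_predictions, key=lambda d: error_rate(y_true, cv_predictions[d]))
    return max(cv_predictions, key=lambda d: g_mean(y_true, cv_predictions[d]))

# Hypothetical imbalanced problem: 90 majority (0) and 10 minority (1) samples.
y_true = [0] * 90 + [1] * 10
cv_predictions = {
    2.0: [0] * 100,                                # everything -> majority class
    0.5: [0] * 80 + [1] * 10 + [1] * 8 + [0] * 2,  # 80/90 and 8/10 correct
}
for value, preds in cv_predictions.items():
    print(value, error_rate(y_true, preds), round(g_mean(y_true, preds), 3))
print("min CV error picks:", select_shrinkage(y_true, cv_predictions, "error"))
print("max CV g-means picks:", select_shrinkage(y_true, cv_predictions, "g-means"))
```

In this toy case the overall error rate rewards the rule that never predicts the minority class, whereas g-means drops to zero as soon as any one class is always misclassified, so it favours the setting that serves both classes.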
The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy, based on minimizing the overall error rate, when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers is also much smaller with our approach than with the original approach, a finding supported by both the simulated and the real data.