Related Articles

1.  Cross-study validation for the assessment of prediction algorithms 
Bioinformatics  2014;30(12):i105-i112.
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context.
Methods: We develop and implement a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation.
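The pairwise scheme described in the Methods above can be illustrated with a minimal Python sketch. It is not the survHD implementation: the function names, the toy single-feature learner, the toy data, and the choice of the plain mean as the summary are all assumptions made for illustration. Each dataset is used once for training, and the resulting model is scored by the C-index on every other dataset.

```python
from itertools import permutations

def c_index(times, events, scores):
    """Harrell-style C-index: share of comparable pairs in which the
    subject with the shorter observed survival has the higher risk score."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def fit_single_feature(X, times, events):
    """Hypothetical toy learner: uses feature 0 as the risk score and
    orients its sign from the training data (censoring is ignored here)."""
    xs = [row[0] for row in X]
    mx, mt = sum(xs) / len(xs), sum(times) / len(times)
    cov = sum((x - mx) * (t - mt) for x, t in zip(xs, times))
    sign = -1.0 if cov > 0 else 1.0  # higher feature, longer survival: flip
    return lambda row: sign * row[0]

def cross_study_matrix(datasets, fit):
    """Train on each dataset, validate on every other one."""
    results = {}
    for train, test in permutations(datasets, 2):
        X_tr, t_tr, e_tr = datasets[train]
        model = fit(X_tr, t_tr, e_tr)
        X_te, t_te, e_te = datasets[test]
        results[(train, test)] = c_index(t_te, e_te, [model(x) for x in X_te])
    return results

# Toy data (assumed): name -> (features, survival times, event indicators)
toy_datasets = {
    "A": ([[3.0], [2.0], [1.0]], [1, 2, 3], [1, 1, 0]),
    "B": ([[5.0], [4.0], [1.0]], [2, 4, 6], [1, 1, 1]),
}
matrix = cross_study_matrix(toy_datasets, fit_single_feature)
summary = sum(matrix.values()) / len(matrix)  # one possible summary: the mean
```

With eight datasets the same loop would yield 8 × 7 = 56 train/validate pairs; how best to summarise them is one of the questions the abstract says the authors evaluate.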
Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation.
Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor.
Contact: levi.waldron@hunter.cuny.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu279
PMCID: PMC4058929  PMID: 24931973
2.  New bandwidth selection criterion for Kernel PCA: Approach to dimensionality reduction and classification problems 
BMC Bioinformatics  2014;15:137.
Background
DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection, and prognostic assessment. The use of this technology to predict cancer outcome has a history of almost a decade. Disease class predictors can be designed for known disease cases and provide diagnostic confirmation or clarify abnormal cases. The main inputs to these class predictors are high-dimensional data with many variables and few observations. Dimensionality reduction of this feature set significantly speeds up the prediction task. Feature selection and feature transformation methods are well-known preprocessing steps in the field of bioinformatics. Several prediction tools are available based on these techniques.
Results
Studies show that a well-tuned Kernel PCA (KPCA) is an efficient preprocessing step for dimensionality reduction, but the available bandwidth selection method for KPCA was computationally expensive. In this paper, we propose a new data-driven bandwidth selection criterion for KPCA, which is related to least squares cross-validation for kernel density estimation. We propose a new prediction model with a well-tuned KPCA and Least Squares Support Vector Machine (LS-SVM). We estimate the accuracy of the newly proposed model based on 9 case studies. Then, we compare its performance (in terms of test set Area Under the ROC Curve (AUC) and computational time) with other well-known techniques such as whole data set + LS-SVM, PCA + LS-SVM, t-test + LS-SVM, Prediction Analysis of Microarrays (PAM) and Least Absolute Shrinkage and Selection Operator (Lasso). Finally, we assess the performance of the proposed strategy with an existing KPCA parameter tuning algorithm by means of two additional case studies.
Conclusion
We propose, evaluate, and compare several mathematical/statistical techniques, which apply feature transformation/selection for subsequent classification, and consider their application in medical diagnostics. Both feature selection and feature transformation perform well on classification tasks. Due to the dynamic selection property of feature selection, it is hard to define significant features for the classifier, which predicts classes of future samples. Moreover, the proposed strategy enjoys a distinctive advantage in its relatively lower time complexity.
doi:10.1186/1471-2105-15-137
PMCID: PMC4025604  PMID: 24886083
3.  Predictive Modeling Using a Somatic Mutational Profile in Ovarian High Grade Serous Carcinoma 
PLoS ONE  2013;8(1):e54089.
Purpose
Recent high-throughput sequencing technology has identified numerous somatic mutations across the whole exome in a variety of cancers. In this study, we generate a predictive model employing the whole exome somatic mutational profile of ovarian high-grade serous carcinomas (Ov-HGSCs) obtained from The Cancer Genome Atlas data portal.
Methods
A total of 311 patients were included for modeling overall survival (OS) and 259 patients were included for modeling progression free survival (PFS) in an analysis of 509 genes. The model was validated with complete leave-one-out cross-validation involving re-selecting genes for each iteration of the cross-validation procedure. Cross-validated Kaplan-Meier curves were generated. Cross-validated time dependent receiver operating characteristic (ROC) curves were computed and the area under the curve (AUC) values were calculated from the ROC curves to estimate the predictive accuracy of the survival risk models.
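The key point of "complete" leave-one-out cross-validation, as described above, is that gene selection is repeated inside every iteration, so the held-out patient never influences which genes are chosen. The sketch below shows that structure only. For brevity it assumes a binary outcome rather than survival times, and the selector, the threshold model, and the data are hypothetical stand-ins, not the study's method.

```python
def class_means(X, y, j):
    """Mean of feature j among positives and among negatives."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def select_top_feature(X, y):
    """Toy selector: the feature with the largest gap between class means."""
    def gap(j):
        mean_pos, mean_neg = class_means(X, y, j)
        return abs(mean_pos - mean_neg)
    return max(range(len(X[0])), key=gap)

def fit_threshold(X, y, j):
    """Toy model: threshold feature j at the midpoint of the class means."""
    mean_pos, mean_neg = class_means(X, y, j)
    mid = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda row: 1 if sign * (row[j] - mid) > 0 else 0

def complete_loocv(X, y):
    """Leave-one-out CV with feature selection redone in every fold."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        j = select_top_feature(X_train, y_train)  # uses training rows only
        model = fit_threshold(X_train, y_train, j)
        predictions.append(model(X[i]))
    return predictions

# Toy data (assumed): feature 0 is informative, feature 1 is noise.
X = [[0.0, 0.5], [0.2, 0.1], [0.1, 0.3], [1.0, 0.2], [0.9, 0.4], [1.1, 0.0]]
y = [0, 0, 0, 1, 1, 1]
cv_predictions = complete_loocv(X, y)
```

Selecting features once on the full data and then cross-validating only the model fit would leak information from the held-out sample into the selection step, which is the bias this design avoids.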
Results
There was a significant difference in OS between the high-risk group (median, 28.1 months) and the low-risk group (median, 61.5 months) (permutated p-value <0.001). For PFS, there was also a significant difference between the high-risk group (10.9 months) and the low-risk group (22.3 months) (permutated p-value <0.001). Cross-validated AUC values were 0.807 for the OS and 0.747 for the PFS based on a defined landmark time t = 36 months. In comparisons between a predictive model containing only gene variables and a combined model containing both gene variables and clinical covariates, the predictive model containing gene variables without clinical covariates was effective, and high AUC values for both OS and PFS were observed.
Conclusions
We designed a predictive model using a somatic mutation profile obtained from high-throughput genomic sequencing data in Ov-HGSC samples that may represent a new strategy for applying high-throughput sequencing data to clinical practice.
doi:10.1371/journal.pone.0054089
PMCID: PMC3542368  PMID: 23326577
4.  A Predictive Risk Probability Approach for Microarray Data with Survival as an Endpoint 
Gene expression profiling has played an important role in cancer risk classification and has shown promising results. Since gene expression profiling often involves determination of a set of top ranked genes for analysis, it is important to evaluate how modeling performance varies with the number of selected top ranked genes incorporated in the model. We used a colon data set collected at Moffitt Cancer Center as an example, and ranked genes based on the univariate Cox proportional hazards model. A set of top ranked genes was selected for evaluation. The selection was done by choosing the top k ranked genes for k = 1 to 12,500. An analysis indicated a considerable variation of classification outcomes when the number of top ranked genes was changed. We developed a predictive risk probability approach to accommodate this variation by identifying a range for the number of top ranked genes. For each number of top ranked genes, the procedure classifies each patient as having high risk (score = 1) or low risk (score = 0). The categorizations are then averaged, giving a risk score between 0 and 1, thus providing a ranking for the patient’s need for further treatment. This approach was applied to the colon data set, and its strength was demonstrated by three criteria: First, a univariate Cox proportional hazards model showed a highly statistically significant level (log-rank χ² statistic = 110 with p value < 10^−16) for the predictive risk probability classification. Second, the survival tree model used the risk probability to partition patients into five risk groups showing a good separation of survival curves (log-rank χ² statistic = 215). In addition, utilization of the risk group status identified a small set of risk genes which may be practical for biological validation. Third, analysis of re-sampling the risk probability suggested the variation pattern of the log-rank χ² in the colon cancer dataset was unlikely to have been caused by chance.
doi:10.1080/10543400802277967
PMCID: PMC2717790  PMID: 18781520
Cancer risk classification; dimension reduction; log-rank test; survival tree model; top ranked genes
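The averaging step of the predictive risk probability approach described in this entry can be sketched in a few lines of Python. Only the averaging of the 0/1 calls over the chosen values of k comes from the abstract. The per-k classifier below (high risk when the mean of the top k ranked values is positive) is a hypothetical placeholder, since the abstract does not specify the classification rule.

```python
def classify_top_k(ranked_values, k):
    """Hypothetical per-k rule: call a patient high risk (1) when the mean
    of their top-k ranked gene values is positive, otherwise low risk (0)."""
    top = ranked_values[:k]
    return 1 if sum(top) / len(top) > 0 else 0

def risk_probability(ranked_values, k_values):
    """Average the 0/1 risk calls over every k, giving a score in [0, 1]."""
    calls = [classify_top_k(ranked_values, k) for k in k_values]
    return sum(calls) / len(calls)

# Toy patient (assumed): gene values already ordered by gene rank.
patient = [1.0, -3.0, 1.0, 2.0]
score = risk_probability(patient, k_values=[1, 2, 3, 4])
```

In this toy case the patient is called high risk for k = 1 and k = 4 but low risk for k = 2 and k = 3, so the averaged score is 0.5, which illustrates how the score absorbs the variation in classification across k.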
5.  The Associations between Immunity-Related Genes and Breast Cancer Prognosis in Korean Women 
PLoS ONE  2014;9(7):e103593.
We investigated the role of common genetic variation in immune-related genes on breast cancer disease-free survival (DFS) in Korean women. 107 breast cancer patients of the Seoul Breast Cancer Study (SEBCS) were selected for this study. A total of 2,432 tag single nucleotide polymorphisms (SNPs) in 283 immune-related genes were genotyped with the GoldenGate Oligonucleotide pool assay (OPA). A multivariate Cox-proportional hazard model and polygenic risk score model were used to estimate the effects of SNPs on breast cancer prognosis. Harrell’s C index was calculated to estimate the predictive accuracy of the polygenic risk score model. Subsequently, an extended gene set enrichment analysis (GSEA-SNP) was conducted to approximate the biological pathway. In addition, to confirm our results with current evidence, previous studies were systematically reviewed. Sixty-two SNPs were statistically significant at p-value less than 0.05. The most significant SNPs were rs1952438 in SOCS4 gene (hazard ratio (HR) = 11.99, 95% CI = 3.62–39.72, P = 4.84E-05), rs2289278 in TSLP gene (HR = 4.25, 95% CI = 2.10–8.62, P = 5.99E-05) and rs2074724 in HGF gene (HR = 4.63, 95% CI = 2.18–9.87, P = 7.04E-05). In the polygenic risk score model, the HR of women in the 3rd tertile was 6.78 (95% CI = 1.48–31.06) compared to patients in the 1st tertile of polygenic risk score. Harrell’s C index was 0.813 with total patients and 0.924 in 4-fold cross validation. In the pathway analysis, 18 pathways were significantly associated with breast cancer prognosis (P<0.1). IL-6R, IL-8, IL-10RB, IL-12A, and IL-12B were associated with the prognosis of cancer in the data of both our study and a previous study. Therefore, our results suggest that genetic polymorphisms in immune-related genes have relevance to breast cancer prognosis among Korean women.
doi:10.1371/journal.pone.0103593
PMCID: PMC4116221  PMID: 25075970
6.  Risk Models to Predict Chronic Kidney Disease and Its Progression: A Systematic Review 
PLoS Medicine  2012;9(11):e1001344.
A systematic review of risk prediction models conducted by Justin Echouffo-Tcheugui and Andre Kengne examines the evidence base for prediction of chronic kidney disease risk and its progression, and suitability of such models for clinical use.
Background
Chronic kidney disease (CKD) is common, and associated with increased risk of cardiovascular disease and end-stage renal disease, which are potentially preventable through early identification and treatment of individuals at risk. Although risk factors for occurrence and progression of CKD have been identified, their utility for CKD risk stratification through prediction models remains unclear. We critically assessed risk models to predict CKD and its progression, and evaluated their suitability for clinical use.
Methods and Findings
We systematically searched MEDLINE and Embase (1 January 1980 to 20 June 2012). Dual review was conducted to identify studies that reported on the development, validation, or impact assessment of a model constructed to predict the occurrence/presence of CKD or progression to advanced stages. Data were extracted on study characteristics, risk predictors, discrimination, calibration, and reclassification performance of models, as well as validation and impact analyses. We included 26 publications reporting on 30 CKD occurrence prediction risk scores and 17 CKD progression prediction risk scores. The vast majority of CKD risk models had acceptable-to-good discriminatory performance (area under the receiver operating characteristic curve>0.70) in the derivation sample. Calibration was less commonly assessed, but overall was found to be acceptable. Only eight CKD occurrence and five CKD progression risk models have been externally validated, displaying modest-to-acceptable discrimination. Whether novel biomarkers of CKD (circulatory or genetic) can improve prediction largely remains unclear, and impact studies of CKD prediction models have not yet been conducted. Limitations of risk models include the lack of ethnic diversity in derivation samples, and the scarcity of validation studies. The review is limited by the lack of an agreed-on system for rating prediction models, and the difficulty of assessing publication bias.
Conclusions
The development and clinical application of renal risk scores is in its infancy; however, the discriminatory performance of existing tools is acceptable. The effect of using these models in practice is still to be explored.
Please see later in the article for the Editors' Summary
Editors' Summary
Background
Chronic kidney disease (CKD)—the gradual loss of kidney function—is increasingly common worldwide. In the US, for example, about 26 million adults have CKD, and millions more are at risk of developing the condition. Throughout life, small structures called nephrons inside the kidneys filter waste products and excess water from the blood to make urine. If the nephrons stop working because of injury or disease, the rate of blood filtration decreases, and dangerous amounts of waste products such as creatinine build up in the blood. Symptoms of CKD, which rarely occur until the disease is very advanced, include tiredness, swollen feet and ankles, puffiness around the eyes, and frequent urination, especially at night. There is no cure for CKD, but progression of the disease can be slowed by controlling high blood pressure and diabetes, both of which cause CKD, and by adopting a healthy lifestyle. The same interventions also reduce the chances of CKD developing in the first place.
Why Was This Study Done?
CKD is associated with an increased risk of end-stage renal disease, which is treated with dialysis or by kidney transplantation (renal replacement therapies), and of cardiovascular disease. These life-threatening complications are potentially preventable through early identification and treatment of CKD, but most people present with advanced disease. Early identification would be particularly useful in developing countries, where renal replacement therapies are not readily available and resources for treating cardiovascular problems are limited. One way to identify people at risk of a disease is to use a “risk model.” Risk models are constructed by testing the ability of different combinations of risk factors that are associated with a specific disease to identify those individuals in a “derivation sample” who have the disease. The model is then validated on an independent group of people. In this systematic review (a study that uses predefined criteria to identify all the research on a given topic), the researchers critically assess the ability of existing CKD risk models to predict the occurrence of CKD and its progression, and evaluate their suitability for clinical use.
What Did the Researchers Do and Find?
The researchers identified 26 publications reporting on 30 risk models for CKD occurrence and 17 risk models for CKD progression that met their predefined criteria. The risk factors most commonly included in these models were age, sex, body mass index, diabetes status, systolic blood pressure, serum creatinine, protein in the urine, and serum albumin or total protein. Nearly all the models had acceptable-to-good discriminatory performance (a measure of how well a model separates people who have a disease from people who do not have the disease) in the derivation sample. Not all the models had been calibrated (assessed for whether the average predicted risk within a group matched the proportion that actually developed the disease), but in those that had been assessed calibration was good. Only eight CKD occurrence and five CKD progression risk models had been externally validated; discrimination in the validation samples was modest-to-acceptable. Finally, very few studies had assessed whether adding extra variables to CKD risk models (for example, genetic markers) improved prediction, and none had assessed the impact of adopting CKD risk models on the clinical care and outcomes of patients.
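The two performance notions defined in the paragraph above, discrimination and calibration, can be made concrete with a small Python sketch. The data are invented for illustration, and the calibration measure shown (mean predicted risk minus observed proportion, sometimes called calibration-in-the-large) is only one simple way to check calibration; the review does not say which measures the individual studies used.

```python
def auc(labels, risks):
    """Discrimination: probability that a randomly chosen person with the
    disease has a higher predicted risk than one without (ties count half)."""
    cases = [r for y, r in zip(labels, risks) if y == 1]
    controls = [r for y, r in zip(labels, risks) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def calibration_gap(labels, risks):
    """Calibration in the large: mean predicted risk minus the observed
    proportion with the disease (0 means calibrated on average)."""
    return sum(risks) / len(risks) - sum(labels) / len(labels)

# Toy group (assumed): 1 = developed CKD, 0 = did not.
labels = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]
discrimination = auc(labels, risks)
gap = calibration_gap(labels, risks)
```

Here three of the four case/non-case pairs are ordered correctly, so the AUC is 0.75, above the 0.70 level the review treats as acceptable, while the negative gap shows that the model underestimates average risk in this toy group. A model can therefore discriminate acceptably and still be miscalibrated.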
What Do These Findings Mean?
These findings suggest that the development and clinical application of CKD risk models is still in its infancy. Specifically, these findings indicate that the existing models need to be better calibrated and need to be externally validated in different populations (most of the models were tested only in predominantly white populations) before they are incorporated into guidelines. The impact of their use on clinical outcomes also needs to be assessed before their widespread use is recommended. Such research is worthwhile, however, because of the potential public health and clinical applications of well-designed risk models for CKD. Such models could be used to identify segments of the population that would benefit most from screening for CKD, for example. Moreover, risk communication to patients could motivate them to adopt a healthy lifestyle and to adhere to prescribed medications, and the use of models for predicting CKD progression could help clinicians tailor disease-modifying therapies to individual patient needs.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001344.
This study is further discussed in a PLOS Medicine Perspective by Maarten Taal
The US National Kidney and Urologic Diseases Information Clearinghouse provides information about all aspects of kidney disease; the US National Kidney Disease Education Program provides resources to help improve the understanding, detection, and management of kidney disease (in English and Spanish)
The UK National Health Service Choices website provides information for patients on chronic kidney disease, including some personal stories
The US National Kidney Foundation, a not-for-profit organization, provides information about chronic kidney disease (in English and Spanish)
The not-for-profit UK National Kidney Federation provides support and information for patients with kidney disease and for their carers, including a selection of patient experiences of kidney disease
World Kidney Day, a joint initiative between the International Society of Nephrology and the International Federation of Kidney Foundations, aims to raise awareness about kidneys and kidney disease
doi:10.1371/journal.pmed.1001344
PMCID: PMC3502517  PMID: 23185136
7.  Mass spectrometry protein expression profiles in colorectal cancer tissue associated with clinico-pathological features of disease 
BMC Cancer  2010;10:410.
Background
Studies of several tumour types have shown that expression profiling of cellular protein extracted from surgical tissue specimens by direct mass spectrometry analysis can accurately discriminate tumour from normal tissue and in some cases can sub-classify disease. We have evaluated the potential value of this approach to classify various clinico-pathological features in colorectal cancer by employing matrix-assisted laser desorption ionisation time-of-flight mass spectrometry (MALDI-TOF MS).
Methods
Protein extracts from 31 tumour and 33 normal mucosa specimens were purified, subjected to MALDI-TOF MS and then analysed using the 'GenePattern' suite of computational tools (Broad Institute, MIT, USA). Comparative Gene Marker Selection with either a t-test or a signal-to-noise ratio (SNR) test statistic was used to identify and rank differentially expressed marker peaks. The k-nearest neighbours algorithm was used to build classification models either using separate training and test datasets or else by using an iterative, 'leave-one-out' cross-validation method.
Results
73 protein peaks in the mass range 1800-16000 Da were differentially expressed in tumour versus adjacent normal mucosa tissue (P ≤ 0.01, false discovery rate ≤ 0.05). Unsupervised hierarchical cluster analysis classified most tumour and normal mucosa into distinct cluster groups. Supervised prediction correctly classified the tumour/normal mucosa status of specimens in an independent test spectra dataset with 100% sensitivity and specificity (95% confidence interval: 67.9-99.2%). Supervised prediction using 'leave-one-out' cross validation algorithms for tumour spectra correctly classified 10/13 poorly differentiated and 16/18 well/moderately differentiated tumours (P < 0.001; receiver-operator characteristics - ROC - error, 0.171); disease recurrence was correctly predicted in 5/6 cases and disease-free survival (median follow-up time, 25 months) was correctly predicted in 22/23 cases (P < 0.001; ROC error, 0.105). A similar analysis of normal mucosa spectra correctly predicted 11/14 patients with, and 15/19 patients without lymph node involvement (P = 0.001; ROC error, 0.212).
Conclusions
Protein expression profiling of surgically resected CRC tissue extracts by MALDI-TOF MS has potential value in studies aimed at improved molecular classification of this disease. Further studies, with longer follow-up times and larger patient cohorts, that would permit independent validation of supervised classification models, would be required to confirm the predictive value of tumour spectra for disease recurrence/patient survival.
doi:10.1186/1471-2407-10-410
PMCID: PMC2927547  PMID: 20691062
8.  Estimating Survival in Patients with Operable Skeletal Metastases: An Application of a Bayesian Belief Network 
PLoS ONE  2011;6(5):e19956.
Background
Accurate estimations of life expectancy are important in the management of patients with metastatic cancer affecting the extremities, and help set patient, family, and physician expectations. Clinically, the decision whether to operate on patients with skeletal metastases, as well as the choice of surgical procedure, are predicated on an individual patient's estimated survival. Currently, there are no reliable methods for estimating survival in this patient population. Bayesian classification, which includes Bayesian belief network (BBN) modeling, is a statistical method that explores conditional, probabilistic relationships between variables to estimate the likelihood of an outcome using observed data. Thus, BBN models are being used with increasing frequency in a variety of diagnoses to codify complex clinical data into prognostic models. The purpose of this study was to determine the feasibility of developing Bayesian classifiers to estimate survival in patients undergoing surgery for metastases of the axial and appendicular skeleton.
Methods
We searched an institution-owned patient management database for all patients who underwent surgery for skeletal metastases between 1999 and 2003. We then developed and trained a machine-learned BBN model to estimate survival in months using candidate features based on historical data. Ten-fold cross-validation and receiver operating characteristic (ROC) curve analysis were performed to evaluate the BBN model's accuracy and robustness.
Results
A total of 189 consecutive patients were included. First-degree predictors of survival differed between the 3-month and 12-month models. Following cross validation, the area under the ROC curve was 0.85 (95% CI: 0.80–0.93) for 3-month probability of survival and 0.83 (95% CI: 0.77–0.90) for 12-month probability of survival.
Conclusions
A robust, accurate, probabilistic naïve BBN model was successfully developed using observed clinical data to estimate individualized survival in patients with operable skeletal metastases. This method warrants further development and must be externally validated in other patient populations.
doi:10.1371/journal.pone.0019956
PMCID: PMC3094405  PMID: 21603644
9.  Accurate multimodal probabilistic prediction of conversion to Alzheimer's disease in patients with mild cognitive impairment
NeuroImage : Clinical  2013;2:735-745.
Accurately identifying the patients that have mild cognitive impairment (MCI) who will go on to develop Alzheimer's disease (AD) will become essential as new treatments will require identification of AD patients at earlier stages in the disease process. Most previous work in this area has centred around the same automated techniques used to diagnose AD patients from healthy controls, by coupling high dimensional brain image data or other relevant biomarker data to modern machine learning techniques. Such studies can now distinguish between AD patients and controls as accurately as an experienced clinician. Models trained on patients with AD and control subjects can also distinguish between MCI patients that will convert to AD within a given timeframe (MCI-c) and those that remain stable (MCI-s), although differences between these groups are smaller and thus, the corresponding accuracy is lower. The most common type of classifier used in these studies is the support vector machine, which gives categorical class decisions. In this paper, we introduce Gaussian process (GP) classification to the problem. This fully Bayesian method produces naturally probabilistic predictions, which we show correlate well with the actual chances of converting to AD within 3 years in a population of 96 MCI-s and 47 MCI-c subjects. Furthermore, we show that GPs can integrate multimodal data (in this study volumetric MRI, FDG-PET, cerebrospinal fluid, and APOE genotype) with the classification process through the use of a mixed kernel. The GP approach aids combination of different data sources by learning parameters automatically from training data via type-II maximum likelihood, which we compare to a more conventional method based on cross validation and an SVM classifier.
When the resulting probabilities from the GP are dichotomised to produce a binary classification, the results for predicting MCI conversion based on the combination of all three types of data show a balanced accuracy of 74%. This is a substantially higher accuracy than could be obtained using any individual modality or using a multikernel SVM, and is competitive with the highest accuracy yet achieved for predicting conversion within three years on the widely used ADNI dataset.
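The dichotomisation step mentioned above, turning predicted probabilities into a binary call and scoring it by balanced accuracy, can be sketched as follows. The 0.5 cut-off and the toy data are assumptions, since the abstract does not state the threshold used. Balanced accuracy is the mean of sensitivity and specificity, which matters when the classes are unequal in size, as with the 96 MCI-s and 47 MCI-c subjects.

```python
def balanced_accuracy(labels, probabilities, threshold=0.5):
    """Dichotomise probabilities at `threshold` (assumed value) and return
    the mean of sensitivity and specificity."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    sensitivity = true_pos / n_pos  # converters correctly identified
    specificity = true_neg / n_neg  # stable subjects correctly identified
    return (sensitivity + specificity) / 2

# Toy data (assumed): 1 = converted to AD (MCI-c), 0 = remained stable (MCI-s).
labels = [1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.3, 0.2, 0.1, 0.4, 0.6]
score = balanced_accuracy(labels, probabilities)
```

In this toy example plain accuracy would be 4/6, while balanced accuracy is (0.5 + 0.75) / 2 = 0.625, because the missed converter weighs more heavily when each class contributes equally.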
Highlights
• Prediction of MCI to AD conversion using ADNI data and Gaussian processes.
• 74% accuracy, 0.795 area under ROC curve for predicting conversion within 3 years.
• Gaussian processes allow automatic parameter tuning including multimodal weights.
• Statistically significant improvement for multimodal vs best unimodal prediction.
• Probabilistic interpretation of results to better reflect continuum of disease.
doi:10.1016/j.nicl.2013.05.004
PMCID: PMC3777690  PMID: 24179825
Alzheimer's disease; Mild cognitive impairment; Gaussian process; Support vector machine; Multimodality; Probabilistic classification; Risk scores
10.  A Risk Prediction Model for the Assessment and Triage of Women with Hypertensive Disorders of Pregnancy in Low-Resourced Settings: The miniPIERS (Pre-eclampsia Integrated Estimate of RiSk) Multi-country Prospective Cohort Study 
PLoS Medicine  2014;11(1):e1001589.
Beth Payne and colleagues use a risk prediction model, the Pre-eclampsia Integrated Estimate of RiSk (miniPIERS) to help inform the clinical assessment and triage of women with hypertensive disorders of pregnancy in low-resourced settings.
Please see later in the article for the Editors' Summary
Background
Pre-eclampsia/eclampsia are leading causes of maternal mortality and morbidity, particularly in low- and middle- income countries (LMICs). We developed the miniPIERS risk prediction model to provide a simple, evidence-based tool to identify pregnant women in LMICs at increased risk of death or major hypertensive-related complications.
Methods and Findings
From 1 July 2008 to 31 March 2012, in five LMICs, data were collected prospectively on 2,081 women with any hypertensive disorder of pregnancy admitted to a participating centre. Candidate predictors collected within 24 hours of admission were entered into a step-wise backward elimination logistic regression model to predict a composite adverse maternal outcome within 48 hours of admission. Model internal validation was accomplished by bootstrapping and external validation was completed using data from 1,300 women in the Pre-eclampsia Integrated Estimate of RiSk (fullPIERS) dataset. Predictive performance was assessed for calibration, discrimination, and stratification capacity. The final miniPIERS model included: parity (nulliparous versus multiparous); gestational age on admission; headache/visual disturbances; chest pain/dyspnoea; vaginal bleeding with abdominal pain; systolic blood pressure; and dipstick proteinuria. The miniPIERS model was well-calibrated and had an area under the receiver operating characteristic curve (AUC ROC) of 0.768 (95% CI 0.735–0.801) with an average optimism of 0.037. External validation AUC ROC was 0.713 (95% CI 0.658–0.768). A predicted probability ≥25% to define a positive test classified women with 85.5% accuracy. Limitations of this study include the composite outcome and the broad inclusion criteria of any hypertensive disorder of pregnancy. This broad approach was used to optimize model generalizability.
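The abstract reports internal validation by bootstrapping and an "average optimism", but does not spell out the procedure. A common reading, assumed here, is the optimism bootstrap: refit the model on each bootstrap resample, compare its AUC on the resample with its AUC on the original data, and average the differences. The Python sketch below follows that reading; the function names, the fixed single-feature "model", and the toy data are hypothetical, and the study's own stepwise logistic regression is not reproduced.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def bootstrap_optimism(X, y, fit, n_boot=200, seed=0):
    """Average of (AUC on the bootstrap sample) minus (AUC on the original
    data) for models refitted on each bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    differences = []
    while len(differences) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        y_boot = [y[i] for i in idx]
        if len(set(y_boot)) < 2:
            continue  # AUC is undefined unless both outcomes are present
        X_boot = [X[i] for i in idx]
        model = fit(X_boot, y_boot)
        auc_boot = auc(y_boot, [model(x) for x in X_boot])
        auc_orig = auc(y, [model(x) for x in X])
        differences.append(auc_boot - auc_orig)
    return sum(differences) / len(differences)

def fit_first_feature(X, y):
    """Hypothetical stand-in for model fitting: always scores by feature 0."""
    return lambda row: row[0]

# Toy data (assumed): perfectly separable, so no optimism is expected.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
optimism = bootstrap_optimism(X, y, fit_first_feature, n_boot=50)
```

Under this reading, the optimism estimate is subtracted from the apparent AUC to obtain an optimism-corrected value. A fitting procedure that selects variables from the data, such as backward elimination, would typically show positive optimism, unlike the fixed toy model used here.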
Conclusions
The miniPIERS model shows reasonable ability to identify women at increased risk of adverse maternal outcomes associated with the hypertensive disorders of pregnancy. It could be used in LMICs to identify women who would benefit most from interventions such as magnesium sulphate, antihypertensives, or transportation to a higher level of care.
Editors' Summary
Background
Each year, ten million women develop pre-eclampsia or a related hypertensive (high blood pressure) disorder of pregnancy and 76,000 women die as a result. Globally, hypertensive disorders of pregnancy cause around 12% of maternal deaths—deaths of women during or shortly after pregnancy. The mildest of these disorders is gestational hypertension, high blood pressure that develops after 20 weeks of pregnancy. Gestational hypertension does not usually harm the mother or her unborn child and resolves after delivery but up to a quarter of women with this condition develop pre-eclampsia, a combination of hypertension and protein in the urine (proteinuria). Women with mild pre-eclampsia may not have any symptoms—the condition is detected during antenatal checks—but more severe pre-eclampsia can cause headaches, blurred vision, and other symptoms, and can lead to eclampsia (fits), multiple organ failure, and death of the mother and/or her baby. The only “cure” for pre-eclampsia is to deliver the baby as soon as possible but women are sometimes given antihypertensive drugs to lower their blood pressure or magnesium sulfate to prevent seizures.
Why Was This Study Done?
Women in low- and middle-income countries (LMICs) are more likely to develop complications of pre-eclampsia than women in high-income countries and most of the deaths associated with hypertensive disorders of pregnancy occur in LMICs. The high burden of illness and death in LMICs is thought to be primarily due to delays in triage (the identification of women who are or may become severely ill and who need specialist care) and delays in transporting these women to facilities where they can receive appropriate care. Because there is a shortage of health care workers who are adequately trained in the triage of suspected cases of hypertensive disorders of pregnancy in many LMICs, one way to improve the situation might be to design a simple tool to identify women at increased risk of complications or death from hypertensive disorders of pregnancy. Here, the researchers develop miniPIERS (Pre-eclampsia Integrated Estimate of RiSk), a clinical risk prediction model for adverse outcomes among women with hypertensive disorders of pregnancy suitable for use in community and primary health care facilities in LMICs.
What Did the Researchers Do and Find?
The researchers used data on candidate predictors of outcome that are easy to collect and/or measure in all health care settings and that are associated with pre-eclampsia from women admitted with any hypertensive disorder of pregnancy to participating centers in five LMICs to build a model to predict death or a serious complication such as organ damage within 48 hours of admission. The miniPIERS model included parity (whether the woman had been pregnant before), gestational age (length of pregnancy), headache/visual disturbances, chest pain/shortness of breath, vaginal bleeding with abdominal pain, systolic blood pressure, and proteinuria detected using a dipstick. The model was well-calibrated (the predicted risk of adverse outcomes agreed with the observed risk of adverse outcomes among the study participants), it had a good discriminatory ability (it could separate women who had an adverse outcome from those who did not), and it designated women as being at high risk (25% or greater probability of an adverse outcome) with an accuracy of 85.5%. Importantly, external validation using data collected in fullPIERS, a study that developed a more complex clinical prediction model based on data from women attending tertiary hospitals in high-income countries, confirmed the predictive performance of miniPIERS.
What Do These Findings Mean?
These findings indicate that the miniPIERS model performs reasonably well as a tool to identify women at increased risk of adverse maternal outcomes associated with hypertensive disorders of pregnancy. Because miniPIERS only includes simple-to-measure personal characteristics, symptoms, and signs, it could potentially be used in resource-constrained settings to identify the women who would benefit most from interventions such as transportation to a higher level of care. However, further external validation of miniPIERS is needed using data collected from women living in LMICs before the model can be used during routine antenatal care. Moreover, the value of miniPIERS needs to be confirmed in implementation projects that examine whether its potential translates into clinical improvements. For now, though, the model could provide the basis for an education program to increase the knowledge of women, families, and community health care workers in LMICs about the signs and symptoms of hypertensive disorders of pregnancy.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001589.
The World Health Organization provides guidelines for the management of hypertensive disorders of pregnancy in low-resourced settings
The Maternal and Child Health Integrated Program provides information on pre-eclampsia and eclampsia targeted to low-resourced settings along with a tool-kit for LMIC providers
The US National Heart, Lung, and Blood Institute provides information about high blood pressure in pregnancy and a guide to lowering blood pressure in pregnancy
The UK National Health Service Choices website provides information about pre-eclampsia
The US not-for-profit organization Preeclampsia Foundation provides information about all aspects of pre-eclampsia; its website includes some personal stories
The UK charity Healthtalkonline also provides personal stories about hypertensive disorders of pregnancy
MedlinePlus provides links to further information about high blood pressure and pregnancy (in English and Spanish); the MedlinePlus Encyclopedia has a video about pre-eclampsia (also in English and Spanish)
More information about miniPIERS and about fullPIERS is available
doi:10.1371/journal.pmed.1001589
PMCID: PMC3897359  PMID: 24465185
11.  Prognostic Model for Predicting Survival of Patients With Metastatic Urothelial Cancer Treated With Cisplatin-Based Chemotherapy 
A prognostic model that predicts overall survival (OS) for metastatic urothelial cancer (MetUC) patients treated with cisplatin-based chemotherapy was developed, validated, and compared with a commonly used Memorial Sloan-Kettering Cancer Center (MSKCC) risk-score model. Data from 7 protocols that enrolled 308 patients with MetUC were pooled. An external multi-institutional dataset was used to validate the model. The primary measurement of predictive discrimination was Harrell’s c-index, computed with 95% confidence interval (CI). The final model included four pretreatment variables to predict OS: visceral metastases, albumin, performance status, and hemoglobin. The Harrell’s c-index was 0.67 for the four-variable model and 0.64 for the MSKCC risk-score model, with a prediction improvement for OS (the U statistic and its standard deviation were used to calculate the two-sided P = .002). In the validation cohort, the c-indices for the four-variable and the MSKCC risk-score models were 0.63 (95% CI = 0.56 to 0.69) and 0.58 (95% CI = 0.52 to 0.65), respectively, with superiority of the four-variable model compared with the MSKCC risk-score model for OS (the U statistic and its standard deviation were used to calculate the two-sided P = .02).
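Harrell's c-index, the discrimination measure used above, is the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. The following is a minimal sketch of that definition, not the authors' implementation; the function name `harrell_c`, the toy inputs, and the simplified handling of tied times are assumptions made for illustration.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- observed follow-up time per patient
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data (not from the study).
toy_times = [5, 8, 12, 3]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.4, 0.2, 0.7]
toy_c = harrell_c(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the reported values of 0.58 to 0.67 indicate modest discrimination.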
doi:10.1093/jnci/djt015
PMCID: PMC3691944  PMID: 23411591
12.  Prognostic survival model for people diagnosed with invasive cutaneous melanoma 
BMC Cancer  2015;15:27.
Background
The ability of medical practitioners to communicate risk estimates effectively to patients diagnosed with melanoma relies on accurate information about prognostic factors and their impact on survival. This study reports the development of one of the few melanoma prognostic models, called the Melanoma Severity Index (MSI), based on population-based cancer registry data.
Methods
Data were obtained from the Queensland Cancer Registry for people (20–89 years) diagnosed with a single invasive melanoma between 1995 and 2008 (n = 28,654; 1,700 melanoma deaths). Additional clinical information about metastasis, ulceration and positive lymph nodes was manually extracted from pathology forms. Flexible parametric survival models were combined with multivariable fractional polynomials for selecting variables and transformations of continuous variables. Multiple imputation was used for missing covariate values.
Results
The MSI contained the variables thickness (transformed, explained 40.6% of variation in survival), body site (additional 1.9% in variation), metastasis (1.8%), positive nodes (0.7%), ulceration (1.3%), and age (1.1%). Royston and Sauerbrei’s D statistic (measure of discrimination) was 1.50 (95% CI = 1.44, 1.56) and the corresponding R²_D (measure of explained variation) was 0.47 (0.45, 0.49), demonstrating strong explanatory performance. The Harrell-C statistic was 0.88 (0.88, 0.89). Lacking an external validation dataset, we applied internal-external cross validation to demonstrate the consistency of the prognostic information across geographically-defined subsets of the cohort.
Conclusions
The MSI provides good ability to predict survival for melanoma patients. Beyond the immediate clinical use, the MSI may have important public health and research applications for evaluations of public health interventions aimed at reducing deaths from melanoma.
doi:10.1186/s12885-015-1024-4
PMCID: PMC4328047  PMID: 25637143
Melanoma; Survival; Prognostic model; Thickness; Population-based; Risk
13.  Regularized binormal ROC method in disease classification using microarray data 
BMC Bioinformatics  2006;7:253.
Background
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
Results
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
Conclusion
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
doi:10.1186/1471-2105-7-253
PMCID: PMC1513612  PMID: 16684357
14.  Assessment of performance of survival prediction models for cancer prognosis 
Background
Cancer survival studies are commonly analyzed using survival-time prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survival-time threshold, and binary classifiers are applied to predict each patient’s class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments.
Methods
We compare the survival-time prediction and survival-time threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and cross-validation methods to enable unambiguous statistical inferences for several performance metrics used with the survival-time prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models from combinations of clinical and gene expression models.
Results
A public breast cancer dataset was used to compare several performance metrics using five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the two-group comparison was insignificant, and vice versa. 2) The randomization test and cross-validation were generally consistent with the p-values obtained from the standard performance metrics. 3) Binary classifiers highly depended on how the risk groups were defined; a slight change of the survival threshold for assignment of classes led to very different prediction results.
Conclusions
1) Different performance metrics for evaluation of a survival prediction model may give different conclusions in its discriminatory ability. 2) Evaluation using a high-risk versus low-risk group comparison depends on the selected risk-score threshold; a plot of p-values from all possible thresholds can show the sensitivity of the threshold selection. 3) A randomization test of the significance of Somers’ rank correlation can be used for further evaluation of performance of a prediction model. 4) The cross-validated power of survival prediction models decreases as the training and test sets become less balanced.
doi:10.1186/1471-2288-12-102
PMCID: PMC3410808  PMID: 22824262
15.  Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests 
BMC Research Notes  2011;4:299.
Background
Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but presently has limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning methods like Neural Networks, Support Vector Machines and Random Forests can improve accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptrons Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test.
Results
Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (Median (Me) = 0.76) and area under the ROC (Me = 0.90). However, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forest ranked second in overall accuracy (Me = 0.73), with high area under the ROC (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with acceptable area under the ROC (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64). The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most sensitivity was around or even lower than a median value of 0.5.
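The Support Vector Machine result above (high accuracy and specificity with low sensitivity) is the pattern expected when a classifier labels most subjects as negative and negatives are the majority class. The sketch below shows how the three metrics are computed from a confusion matrix; the counts are hypothetical and chosen only to reproduce that pattern, they are not data from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positives among all actual positives
    specificity = tn / (tn + fp)          # true negatives among all actual negatives
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts (NOT study data): 10 actual positives, 20 actual negatives,
# and a classifier that flags only 3 of the positives and none of the negatives.
sens, spec, acc = classification_metrics(tp=3, fn=7, tn=20, fp=0)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

With these assumed counts, accuracy stays high even though most positives are missed, which is why the abstract weighs sensitivity and specificity alongside overall accuracy when ranking classifiers.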
Conclusions
When taking into account sensitivity, specificity and overall classification accuracy, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested in prediction of dementia using several neuropsychological tests. These methods may be used to improve accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing.
doi:10.1186/1756-0500-4-299
PMCID: PMC3180705  PMID: 21849043
16.  Simplified Prognostic Model in Patients with Oxaliplatin-Based or Irinotecan-Based First-Line Chemotherapy for Metastatic Colorectal Cancer: A GERCOR Study 
The Oncologist  2011;16(9):1228-1238.
The present study was done to establish a prognostic model for patients and trials using an oxaliplatin-based or irinotecan-based first-line chemotherapy in metastatic colorectal cancer. Serum lactate dehydrogenase level was the main prognostic factor in predicting survival, followed by World Health Organization performance status. Three risk groups for death depending on these two baseline parameters were identified.
Learning Objectives
After completing this course, the reader will be able to: Describe prognostic factors in metastatic colorectal cancer. Estimate prognostic score with a simple model using only PS and LDH as parameters.
This article is available for continuing medical education credit at CME.TheOncologist.com
Background.
The present study was done to establish a prognostic model for patients and trials using an oxaliplatin-based or irinotecan-based first-line chemotherapy in metastatic colorectal cancer.
Patients and Methods.
Eight hundred three patients treated with FOLFOX or FOLFIRI in three prospective trials were randomly separated into learning (n = 535) and validation (n = 268) samples. Eleven baseline variables were evaluated in univariate and multivariate analysis as prognostic factors for overall survival, and a prognostic score was developed.
Results.
Independent prognostic factors identified in multivariate analysis for overall survival were performance status (PS) (p < .001), serum lactate dehydrogenase (LDH) (p < .001), and number of metastatic sites (p = .005). A prognostic score based on these three variables was found to be efficient (Harrell's C index 0.61). This new model was improved by selecting only PS and LDH (Harrell's C index 0.64). Three risk groups for death could be identified: a low-risk group (n = 184; median overall survival [OS] 29.8 months), an intermediate-risk group (n = 223; median OS 19.5 months), and a high-risk group (n = 128; median OS 13.9 months). Median survival for the low-, intermediate-, and high-risk groups was 26.8, 21.1, and 16.5 months, respectively, in the validation sample (Harrell's C index 0.63).
Conclusions.
Serum LDH level was the main prognostic factor in predicting survival, followed by WHO PS. We identified three risk groups for death depending on these two baseline parameters. This simple prognostic model can be useful for clinician's use and patient stratification in future clinical trials.
doi:10.1634/theoncologist.2011-0039
PMCID: PMC3228179  PMID: 21859820
Colorectal cancer; Prognostic model; Chemotherapy
17.  A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study 
PLoS ONE  2014;9(9):e106765.
Background
Bacteraemia is a frequent and severe condition with a high mortality rate. Despite profound knowledge about the pre-test probability of bacteraemia, blood culture analysis often results in low rates of pathogen detection and therefore increased diagnostic costs. To improve the cost-effectiveness of blood culture sampling, we computed a risk prediction model based on highly standardizable variables, with the ultimate goal of identifying patients at very low risk for bacteraemia via an automated decision support tool.
Methods
In this retrospective hospital-wide cohort study evaluating 15,985 patients with suspected bacteraemia, 51 variables were assessed for their diagnostic potency. A derivation cohort (n = 14,699) was used for feature and model selection as well as for cut-off specification. Models were established using the A2DE classifier, a supervised Bayesian classifier. Two internally validated models were further evaluated by a validation cohort (n = 1,286).
Results
The proportion of neutrophile leukocytes in differential blood count was the best individual variable to predict bacteraemia (ROC-AUC: 0.694). Applying the A2DE classifier, two models, model 1 (20 variables) and model 2 (10 variables) were established with an area under the receiver operating characteristic curve (ROC-AUC) of 0.767 and 0.759, respectively. In the validation cohort, ROC-AUCs of 0.800 and 0.786 were achieved. Using predefined cut-off points, 16% and 12% of patients were allocated to the low risk group with a negative predictive value of more than 98.8%.
Conclusion
Applying the proposed models, more than ten percent of patients with suspected blood stream infection were identified as having minimal risk for bacteraemia. Based on these data, the application of this model as an automated decision support tool for physicians is conceivable, leading to a potential increase in the cost-effectiveness of blood culture sampling. External prospective validation of the model's generalizability is needed for further appreciation of the usefulness of this tool.
doi:10.1371/journal.pone.0106765
PMCID: PMC4153716  PMID: 25184209
18.  A Unifying Framework for Evaluating the Predictive Power of Genetic Variants Based on the Level of Heritability Explained 
PLoS Genetics  2010;6(12):e1001230.
An increasing number of genetic variants have been identified for many complex diseases. However, it is controversial whether risk prediction based on genomic profiles will be useful clinically. Appropriate statistical measures to evaluate the performance of genetic risk prediction models are required. Previous studies have mainly focused on the use of the area under the receiver operating characteristic (ROC) curve, or AUC, to judge the predictive value of genetic tests. However, AUC has its limitations and should be complemented by other measures. In this study, we develop a novel unifying statistical framework that connects a large variety of predictive indices together. We showed that, given the overall disease probability and the level of variance in total liability (or heritability) explained by the genetic variants, we can estimate analytically a large variety of prediction metrics, for example the AUC, the mean risk difference between cases and non-cases, the net reclassification improvement (ability to reclassify people into high- and low-risk categories), the proportion of cases explained by a specific percentile of population at the highest risk, the variance of predicted risks, and the risk at any percentile. We also demonstrate how to construct graphs to visualize the performance of risk models, such as the ROC curve, the density of risks, and the predictiveness curve (disease risk plotted against risk percentile). The results from simulations match very well with our theoretical estimates. Finally we apply the methodology to nine complex diseases, evaluating the predictive power of genetic tests based on known susceptibility variants for each trait.
Author Summary
Recently many genetic variants have been established for diseases, and the findings have raised hope for risk prediction based on genomic profiles. However, we need to have proper statistical measures to assess the usefulness of such tests. In this study, we developed a statistical framework which enables us to evaluate many predictive indices analytically. It is based on the liability threshold model, which postulates a latent liability that is normally distributed. Affected individuals are assumed to have a liability exceeding a certain threshold. We demonstrated that, given the overall disease probability and variance in liability explained by the genetic markers, we can compute a variety of predictive indices. An example is the area under the receiver operating characteristic (ROC) curve, or AUC, which is very commonly employed. However, the limitations of AUC are often ignored, and we proposed complementing it with other indices. We have therefore also computed other metrics like the average difference in risks between cases and non-cases, the ability of reclassification into high- and low-risk categories, and the proportion of cases accounted for by a certain percentile of population at the highest risk. We also derived how to construct graphs showing the risk distribution in population.
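The liability threshold model described above can be illustrated with a small Monte Carlo simulation: total liability is standard normal, the genetic score explains a fraction of its variance, and individuals whose liability exceeds the threshold implied by the disease prevalence are affected. This sketch is a simulation under those stated assumptions, not the authors' analytic derivation; the function name `simulate_auc` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_auc(prevalence, h2_explained, n=20000, seed=0):
    """Estimate the AUC of a genetic score under the liability threshold model.

    prevalence   -- overall disease probability K, with 0 < K < 1
    h2_explained -- variance in liability explained by the score, in (0, 1]
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if not 0.0 < h2_explained <= 1.0:
        raise ValueError("h2_explained must be in (0, 1]")
    rng = random.Random(seed)
    # Liability is N(0, 1); affected individuals exceed this threshold.
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    sd_genetic = h2_explained ** 0.5
    sd_residual = (1.0 - h2_explained) ** 0.5
    scored = []
    for _ in range(n):
        g = rng.gauss(0.0, sd_genetic)    # part captured by the genetic test
        e = rng.gauss(0.0, sd_residual)   # everything else
        scored.append((g, g + e > threshold))
    n_cases = sum(1 for _, is_case in scored if is_case)
    n_controls = n - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("simulation produced only one class; increase n")
    # AUC from the Mann-Whitney rank-sum of cases (scores are continuous,
    # so ties are not handled explicitly in this sketch).
    scored.sort(key=lambda pair: pair[0])
    rank_sum = sum(rank for rank, (_, is_case) in enumerate(scored, start=1) if is_case)
    u_statistic = rank_sum - n_cases * (n_cases + 1) / 2
    return u_statistic / (n_cases * n_controls)


auc_low = simulate_auc(prevalence=0.1, h2_explained=0.05)
auc_high = simulate_auc(prevalence=0.1, h2_explained=0.5)
```

Comparing the two calls shows the qualitative point of the framework: for a fixed prevalence, the achievable AUC rises as the variants explain more of the variance in liability.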
doi:10.1371/journal.pgen.1001230
PMCID: PMC2996330  PMID: 21151957
19.  An assessment of existing models for individualized breast cancer risk estimation in a screening program in Spain 
BMC Cancer  2013;13:587.
Background
The aim of this study was to evaluate the calibration and discriminatory power of three predictive models of breast cancer risk.
Methods
We included 13,760 women who were first-time participants in the Sabadell-Cerdanyola Breast Cancer Screening Program, in Catalonia, Spain. Projections of risk were obtained at three and five years for invasive cancer using the Gail, Chen and Barlow models. Incidence and mortality data were obtained from the Catalan registries. The calibration and discrimination of the models were assessed using the Hosmer-Lemeshow C statistic, the area under the receiver operating characteristic curve (AUC) and the Harrell’s C statistic.
Results
The Gail and Chen models showed good calibration while the Barlow model overestimated the number of cases: the ratio between estimated and observed values at 5 years ranged from 0.86 to 1.55 for the first two models and from 1.82 to 3.44 for the Barlow model. The 5-year projection for the Chen and Barlow models had the highest discrimination, with an AUC around 0.58. The Harrell’s C statistic showed very similar values in the 5-year projection for each of the models. Although they passed the calibration test, the Gail and Chen models overestimated the number of cases in some breast density categories.
Conclusions
These models cannot be used as a measure of individual risk in early detection programs to customize screening strategies. The inclusion of longitudinal measures of breast density or other risk factors in joint models of survival and longitudinal data may be a step towards personalized early detection of BC.
doi:10.1186/1471-2407-13-587
PMCID: PMC4029404  PMID: 24321553
Breast cancer; Screening; Risk models; Individual risk; Breast density
20.  Validation of the prognostic relevance of plasma C-reactive protein levels in soft-tissue sarcoma patients 
British Journal of Cancer  2013;109(9):2316-2322.
Background:
The concept of the involvement of systemic inflammation in cancer progression and metastases has gained traction within the past decade. C-reactive protein (CRP), a non-specific blood-based marker of the systemic inflammatory response, has been associated with decreased survival in several cancer types. The aim of the present study was to validate the prognostic value of pre-operative plasma CRP levels on clinical outcome in a large cohort of soft-tissue sarcoma (STS) patients.
Methods:
Three hundred and four STS patients, operated between 1998 and 2010, were retrospectively evaluated. CRP levels and the impact on cancer-specific survival (CSS), disease-free survival (DFS) and overall survival (OS) were assessed using Kaplan–Meier curves and univariate as well as multivariate Cox proportional models. Additionally, we developed a nomogram by supplementing the plasma CRP level to the well-established Kattan nomogram and evaluated the improvement of predictive accuracy of this novel nomogram by applying calibration and Harrell's concordance index (c-index).
Results:
An elevated plasma CRP level was significantly associated with established prognostic factors, including age, tumour grade, size and depth (P<0.05). In multivariate analysis, increased CRP levels were significantly associated with a poor outcome for CSS (HR=2.05; 95% CI=1.13–3.74; P=0.019) and DFS (HR=1.88; 95% CI=1.07–3.34; P=0.029). The estimated c-index was 0.74 using the original Kattan nomogram and 0.77 when the plasma CRP level was added.
Conclusion:
An elevated pre-operative CRP level represents an independent prognostic factor that predicts poor prognosis and improves the predictive ability of the Kattan nomogram in STS patients. Our data support further prospective validation of its potential utility for individual risk stratification and clinical management of STS patients.
doi:10.1038/bjc.2013.595
PMCID: PMC3817333  PMID: 24084772
soft-tissue sarcoma; prognosis; inflammation; C-reactive protein
21.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins render this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
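Steps 3 to 5 above can be sketched as follows: each compound–protein pair becomes one vector by concatenating the two descriptors, labelled vectors form the training set, and a classifier scores unseen pairs. The descriptor values, the names `cpd_a` and `prot_x`, and the helper functions are all hypothetical, and a 1-nearest-neighbour rule is used here purely as a simple stand-in for the established machine-learning method the text refers to, which is not specified in this passage.

```python
import math


def interaction_vector(compound_desc, protein_desc):
    """Step 3: represent a compound-protein pair by concatenating descriptors."""
    return list(compound_desc) + list(protein_desc)


def euclidean(u, v):
    """Distance between two interaction vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_interaction(train_vectors, train_labels, query):
    """Steps 4-5 stand-in: label of the nearest training vector (1-NN)."""
    nearest = min(range(len(train_vectors)),
                  key=lambda k: euclidean(train_vectors[k], query))
    return train_labels[nearest]


# Hypothetical toy descriptors (step 2 would compute these from real
# chemical structures and protein sequences).
compounds = {"cpd_a": [1.0, 0.0], "cpd_b": [0.0, 1.0]}
proteins = {"prot_x": [0.9, 0.1, 0.0], "prot_y": [0.0, 0.2, 0.8]}

# Known pairs: label 1 = interacting (positive), 0 = non-interacting (negative).
known_pairs = [("cpd_a", "prot_x", 1), ("cpd_b", "prot_y", 1),
               ("cpd_a", "prot_y", 0), ("cpd_b", "prot_x", 0)]
train_vectors = [interaction_vector(compounds[c], proteins[p])
                 for c, p, _ in known_pairs]
train_labels = [label for _, _, label in known_pairs]

# A new compound and a new protein, each similar to one half of a known
# positive pair.
query = interaction_vector([0.9, 0.1], [0.8, 0.2, 0.0])
prediction = predict_interaction(train_vectors, train_labels, query)
```

Because similarity is measured on the concatenated vector, a query pair is matched on the compound and the protein jointly, which is the property the text credits for finding ligands that single-target similarity searches miss.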
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed to target a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
doi:10.1038/msb.2011.5
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
22.  Risk Prediction for Breast, Endometrial, and Ovarian Cancer in White Women Aged 50 y or Older: Derivation and Validation from Population-Based Cohort Studies 
PLoS Medicine  2013;10(7):e1001492.
Ruth Pfeiffer and colleagues describe models to calculate absolute risks for breast, endometrial, and ovarian cancers for white, non-Hispanic women over 50 years old using easily obtainable risk factors.
Please see later in the article for the Editors' Summary
Background
Breast, endometrial, and ovarian cancers share some hormonal and epidemiologic risk factors. While several models predict absolute risk of breast cancer, there are few models for ovarian cancer in the general population, and none for endometrial cancer.
Methods and Findings
Using data on white, non-Hispanic women aged 50+ y from two large population-based cohorts (the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial [PLCO] and the National Institutes of Health–AARP Diet and Health Study [NIH-AARP]), we estimated relative and attributable risks and combined them with age-specific US-population incidence and competing mortality rates. All models included parity. The breast cancer model additionally included estrogen and progestin menopausal hormone therapy (MHT) use, other MHT use, age at first live birth, menopausal status, age at menopause, family history of breast or ovarian cancer, benign breast disease/biopsies, alcohol consumption, and body mass index (BMI); the endometrial model included menopausal status, age at menopause, BMI, smoking, oral contraceptive use, MHT use, and an interaction term between BMI and MHT use; the ovarian model included oral contraceptive use, MHT use, and family history of breast or ovarian cancer. In independent validation data (Nurses' Health Study cohort) the breast and ovarian cancer models were well calibrated; expected-to-observed cancer ratios were 1.00 (95% confidence interval [CI]: 0.96–1.04) for breast cancer and 1.08 (95% CI: 0.97–1.19) for ovarian cancer. The number of endometrial cancers was significantly overestimated, expected/observed = 1.20 (95% CI: 1.11–1.29). The areas under the receiver operating characteristic curves (AUCs; discriminatory power) were 0.58 (95% CI: 0.57–0.59), 0.59 (95% CI: 0.56–0.63), and 0.68 (95% CI: 0.66–0.70) for the breast, ovarian, and endometrial models, respectively.
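As an illustration of the calibration measure reported above, the following sketch computes an expected-to-observed (E/O) ratio with an approximate 95% confidence interval. The abstract does not state how the intervals were derived; the formula here assumes the observed count is Poisson distributed and uses a log-scale interval, which is one common choice. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def expected_observed_ratio(expected, observed, z=1.96):
    """Return (E/O, lower, upper) using a log-scale interval.

    Assumption: the observed count is Poisson, so the standard error of
    log(E/O) is approximately sqrt(1 / observed).
    """
    if observed <= 0:
        raise ValueError("observed count must be positive")
    ratio = expected / observed
    half_width = z * math.sqrt(1.0 / observed)
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)


# Hypothetical counts: the model expects 120 cases and 100 are observed.
ratio, lower, upper = expected_observed_ratio(120.0, 100)
```

A ratio above 1 means the model predicts more cases than occurred. An interval that excludes 1 would indicate significant miscalibration; in this toy example the interval still includes 1 because the counts are small.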
Conclusions
These models predict absolute risks for breast, endometrial, and ovarian cancers from easily obtainable risk factors and may assist in clinical decision-making. Limitations are the modest discriminatory ability of the breast and ovarian models and that these models may not generalize to women of other races.
Editors' Summary
Background
In 2008, just three types of cancer accounted for 10% of global cancer-related deaths. That year, about 460,000 women died from breast cancer (the most frequently diagnosed cancer among women and the fifth most common cause of cancer-related death). Another 140,000 women died from ovarian cancer, and 74,000 died from endometrial (womb) cancer (the 14th and 20th most common causes of cancer-related death, respectively). Although these three cancers originate in different tissues, they nevertheless share many risk factors. For example, current age, age at menarche (first period), and parity (the number of children a woman has had) are all strongly associated with breast, ovarian, and endometrial cancer risk. Because these cancers share many hormonal and epidemiological risk factors, a woman with a high breast cancer risk is also likely to have an above-average risk of developing ovarian or endometrial cancer.
Why Was This Study Done?
Several statistical models (for example, the Breast Cancer Risk Assessment Tool) have been developed that estimate a woman's absolute risk (probability) of developing breast cancer over the next few years or over her lifetime. Absolute risk prediction models are useful in the design of cancer prevention trials and can also help women make informed decisions about cancer prevention and treatment options. For example, a woman at high risk of breast cancer might decide to take tamoxifen for breast cancer prevention, but ideally she needs to know her absolute endometrial cancer risk before doing so because tamoxifen increases the risk of this cancer. Similarly, knowledge of her ovarian cancer risk might influence a woman's decision regarding prophylactic removal of her ovaries to reduce her breast cancer risk. There are few absolute risk prediction models for ovarian cancer, and none for endometrial cancer, so here the researchers develop models to predict the risk of these cancers and of breast cancer.
What Did the Researchers Do and Find?
Absolute risk prediction models are constructed by combining estimates for risk factors from cohorts with population-based incidence rates from cancer registries. Models are validated in an independent cohort by testing their ability to identify people with the disease and their ability to predict the observed numbers of incident cases. The researchers used data on white, non-Hispanic women aged 50 years or older that were collected during two large prospective US cohort studies of cancer screening and of diet and health, and US cancer incidence and mortality rates provided by the Surveillance, Epidemiology, and End Results Program to build their models. The models all included parity as a risk factor, as well as other factors. The model for endometrial cancer, for example, also included menopausal status, age at menopause, body mass index (an indicator of the amount of body fat), oral contraceptive use, menopausal hormone therapy use, and an interaction term between menopausal hormone therapy use and body mass index. Individual women's risk for endometrial cancer calculated using this model ranged from 1.22% to 17.8% over the next 20 years depending on their exposure to various risk factors. Validation of the models using data from the US Nurses' Health Study indicated that the endometrial cancer model overestimated the risk of endometrial cancer but that the breast and ovarian cancer models were well calibrated; the predicted and observed risks for these cancers in the validation cohort agreed closely. Finally, the discriminatory power of the models (a measure of how well a model separates people who have a disease from people who do not have the disease) was modest for the breast and ovarian cancer models but somewhat better for the endometrial cancer model.
What Do These Findings Mean?
These findings show that breast, ovarian, and endometrial cancer can all be predicted using information on known risk factors for these cancers that is easily obtainable. Because these models were constructed and validated using data from white, non-Hispanic women aged 50 years or older, they may not accurately predict absolute risk for these cancers for women of other races or ethnicities. Moreover, the modest discriminatory power of the breast and ovarian cancer models means they cannot be used to decide which women should be routinely screened for these cancers. Importantly, however, these well-calibrated models should provide realistic information about an individual's risk of developing breast, ovarian, or endometrial cancer that can be used in clinical decision-making and that may assist in the identification of potential participants for research studies.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001492.
This study is further discussed in a PLOS Medicine Perspective by Lars Holmberg and Andrew Vickers
The US National Cancer Institute provides comprehensive information about cancer (in English and Spanish), including detailed information about breast cancer, ovarian cancer, and endometrial cancer;
Information on the Breast Cancer Risk Assessment Tool, the Surveillance, Epidemiology, and End Results Program, and on the prospective cohort study of screening and the diet and health study that provided the data used to build the models is also available on the NCI site
Cancer Research UK, a not-for-profit organization, provides information about cancer, including detailed information on breast cancer, ovarian cancer, and endometrial cancer
The UK National Health Service Choices website has information and personal stories about breast cancer, ovarian cancer, and endometrial cancer; the not-for-profit organization Healthtalkonline also provides personal stories about dealing with breast cancer and ovarian cancer
doi:10.1371/journal.pmed.1001492
PMCID: PMC3728034  PMID: 23935463
23.  An evidential reasoning based model for diagnosis of lymph node metastasis in gastric cancer 
Background
Lymph node metastasis (LNM) in gastric cancer is a very important prognostic factor affecting long-term survival. Currently, several common imaging techniques are used to evaluate the lymph node status. However, they are incapable of achieving both high sensitivity and specificity simultaneously. In order to deal with this complex issue, a new evidential reasoning (ER) based model is proposed to support diagnosis of LNM in gastric cancer.
Methods
A total of 175 consecutive patients underwent multidetector computed tomography (MDCT) before surgery. Eight indicators (serosal invasion, tumor classification, tumor enhancement pattern, tumor thickness, number of lymph nodes, maximum lymph node size, lymph node station, and lymph node enhancement) are used to evaluate the tumor and lymph nodes on CT images. All of these indicators reflect the biological behavior of gastric cancer. An ER based model is constructed with these indicators as inputs. The output indicates whether LNM is present, with the reference diagnosis established by surgery and histopathology. k-fold cross-validation is used to train and test the model. Diagnostic performance for LNM is evaluated with receiver operating characteristic (ROC) curves. For comparison, a radiologist classifies LNM on the basis of lymph node size.
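The evaluation measures named here can be sketched as follows. This is a generic illustration, not the authors' code: the labels, scores, and the 0.5 threshold are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than by tracing the ROC curve.

```python
def sensitivity_specificity(labels, predictions):
    """Sensitivity and specificity for binary labels and predictions (1 = LNM)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a positive case scores higher than a negative case.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-fold data: 1 = LNM confirmed by histopathology.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]

# Turn scores into binary calls with an assumed threshold of 0.5.
predictions = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = sensitivity_specificity(labels, predictions)
auc_value = auc(labels, scores)
```

In a k-fold setting these measures would be computed on each held-out fold, or on the pooled held-out predictions, so that no case is scored by a model that was trained on it.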
Results
Of the 175 cases, 134 are positive for LNM and the remainder are negative. The eight indicators differ significantly between the positive and negative groups. The sensitivity, specificity and AUC of the ER based model are 88.41%, 77.57% and 0.813, respectively. However, for the radiologist evaluating LNM by maximum lymph node size, the corresponding values are only 63.4%, 75.6% and 0.757. Therefore, the proposed model performs better than the radiologist. The proposed model also outperforms other machine learning methods.
Conclusions
Using indicators that reflect the biological behavior of gastric cancer, the ER based model can diagnose LNM effectively before surgery.
doi:10.1186/1472-6947-13-123
PMCID: PMC3827004  PMID: 24195733
Gastric cancer; Lymph node metastasis; Evidential reasoning
24.  Genetic Variants and Their Interactions in the Prediction of Increased Pre-Clinical Carotid Atherosclerosis: The Cardiovascular Risk in Young Finns Study 
PLoS Genetics  2010;6(9):e1001146.
The relative contribution of genetic risk factors to the progression of subclinical atherosclerosis is poorly understood. It is likely that multiple variants are implicated in the development of atherosclerosis, but the subtle genotypic and phenotypic differences are beyond the reach of the conventional case-control designs and the statistical significance testing procedures being used in most association studies. Our objective here was to investigate whether an alternative approach (in which common disorders are treated as quantitative phenotypes that are continuously distributed over a population) can reveal predictive insights into early atherosclerosis, as assessed using ultrasound imaging-based quantitative measurement of carotid artery intima-media thickness (IMT). Using our population-based follow-up study of atherosclerosis precursors as a basis for sampling subjects with gradually increasing IMT levels, we searched for the subsets of genetic variants and their interactions that are the most predictive of the various risk classes, rather than using exclusively those variants meeting a stringent level of statistical significance. The area under the receiver operating characteristic curve (AUC) was used to evaluate the predictive value of the variants, and cross-validation was used to assess how well the predictive models will generalize to other subsets of subjects. By means of our predictive modeling framework with machine learning-based SNP selection, we could improve the prediction of the extreme classes of atherosclerosis risk and progression over a 6-year period (average AUC 0.844 and 0.761), compared to that of using conventional cardiovascular risk factors alone (average AUC 0.741 and 0.629), or when combined with the statistically significant variants (average AUC 0.762 and 0.651). The predictive accuracy remained relatively high in an independent validation set of subjects (average decrease of 0.043).
These results demonstrate that the modeling framework can utilize the “gray zone” of genetic variation in the classification of subjects with different degrees of risk of developing atherosclerosis.
Author Summary
Although cardiovascular events, such as myocardial infarction and stroke, usually occur at later ages, it is known that the atherogenic process begins much earlier in life. Detection of subclinical atherosclerosis would therefore offer the means to identify individuals who are at increased risk of developing cardiovascular events. What remains unclear is the relative contribution of genetic variation to the development of the early stages of atherosclerosis. To address this question, we searched for combinations of both genetic and clinical determinants that are the most predictive of the progression of subclinical carotid atherosclerosis in a sample of 1,027 young adults, aged 24–39 years, from the Finnish general population (The Cardiovascular Risk in Young Finns Study). We demonstrate here, for the first time in a population-based follow-up study, a predictive relationship between an individual's genotypic variation and early signs of atherosclerosis, which cannot be explained by conventional cardiovascular risk factors, such as obesity and elevated blood pressure levels. The predictive modeling framework facilitates the usability of genetic information by identifying informative panels of variants, along with conventional risk factors, which may prove to be useful in early detection and management of atherosclerosis. The clinical implications of these findings remain to be studied.
doi:10.1371/journal.pgen.1001146
PMCID: PMC2947986  PMID: 20941391
25.  Can the Tumor Deposits Be Counted as Metastatic Lymph Nodes in the UICC TNM Staging System for Colorectal Cancer? 
PLoS ONE  2012;7(3):e34087.
Objective
The 7th edition of the AJCC staging manual implicitly states that only T1 and T2 lesions that lack regional lymph node metastasis but have tumor deposit(s) will be classified in addition as N1c, though it is not consistent in that pN1c is also an option for pT3/T4a tumors in the staging table. Nevertheless, in this TNM classification, it remains unclear how to classify tumor deposits (TDs) in colorectal cancer patients who have both lymph node metastasis (LNM) and TDs. The aim of this study is to investigate the possibility of counting TDs as metastatic lymph nodes in TNM classification and to identify its prognostic value for colorectal cancer patients.
Methods and Results
In this retrospective study, 513 cases of colorectal cancer with LNM were reviewed. We proposed a novel pN (npN) category in which TDs were counted as metastatic lymph nodes in the TNM classification. Cancer-specific survival according to the npN or pN category was analyzed using Kaplan-Meier survival curves. Univariate and multivariate analyses were performed to identify significant prognostic factors. Harrell's C statistic was used to test the predictive capacity of the prognostic models. The results revealed that the TD was a significant prognostic factor in colorectal cancer. Univariate and multivariate analyses uniformly indicated that the npN category was significantly correlated with prognosis. The results of Harrell's C statistical analysis demonstrated that the npN category exhibited a superior predictive capacity compared to the pN category of the 7th edition TNM classification. Moreover, we also found no significant prognostic differences in patients with or without TD in the same npN categories.
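For readers unfamiliar with Harrell's C statistic, a minimal sketch of its usual definition follows. It is a simplified illustration under stated assumptions: pairs with tied survival times are skipped, ties in risk score count as half, and the follow-up times, event indicators, and risk scores are hypothetical and not drawn from the study.

```python
def harrell_c(times, events, risk_scores):
    """Concordance between risk scores and survival ordering.

    A pair (i, j) is comparable when subject i had the event (events[i] == 1)
    and a strictly shorter follow-up time than subject j. The pair is
    concordant when subject i also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: follow-up in years, 1 = cancer-specific death, 0 = censored,
# and an ordinal risk score standing in for a nodal category.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [3, 2, 2, 1]

c_index = harrell_c(times, events, risk_scores)
```

A value of 0.5 corresponds to chance-level ordering and 1.0 to perfect ordering, so comparing the C statistic of two staging categories on the same patients indicates which one ranks survival more accurately.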
Conclusions
Counting TDs as metastatic lymph nodes in the TNM classification system is potentially superior to the classification in the 7th edition of the TNM staging system for assessing prognosis and survival in colorectal cancer patients.
doi:10.1371/journal.pone.0034087
PMCID: PMC3312887  PMID: 22461900
