Search tips
Search criteria

Results 1-7 (7)

Clipboard (0)

Select a Filter Below

more »
Year of Publication
Document Types
1.  A Composite Model for Subgroup Identification and Prediction via Bicluster Analysis 
PLoS ONE  2014;9(10):e111318.
A major challenges in the analysis of large and complex biomedical data is to develop an approach for 1) identifying distinct subgroups in the sampled populations, 2) characterizing their relationships among subgroups, and 3) developing a prediction model to classify subgroup memberships of new samples by finding a set of predictors. Each subgroup can represent different pathogen serotypes of microorganisms, different tumor subtypes in cancer patients, or different genetic makeups of patients related to treatment response.
This paper proposes a composite model for subgroup identification and prediction using biclusters. A biclustering technique is first used to identify a set of biclusters from the sampled data. For each bicluster, a subgroup-specific binary classifier is built to determine if a particular sample is either inside or outside the bicluster. A composite model, which consists of all binary classifiers, is constructed to classify samples into several disjoint subgroups. The proposed composite model neither depends on any specific biclustering algorithm or patterns of biclusters, nor on any classification algorithms.
The composite model was shown to have an overall accuracy of 97.4% for a synthetic dataset consisting of four subgroups. The model was applied to two datasets where the sample’s subgroup memberships were known. The procedure showed 83.7% accuracy in discriminating lung cancer adenocarcinoma and squamous carcinoma subtypes, and was able to identify 5 serotypes and several subtypes with about 94% accuracy in a pathogen dataset.
The composite model presents a novel approach to developing a biclustering-based classification model from unlabeled sampled data. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identification of unknown species or new biomarkers for targeted therapy.
PMCID: PMC4210136  PMID: 25347824
2.  Assessment of performance of survival prediction models for cancer prognosis 
Cancer survival studies are commonly analyzed using survival-time prediction models for cancer prognosis. A number of different performance metrics are used to ascertain the concordance between the predicted risk score of each patient and the actual survival time, but these metrics can sometimes conflict. Alternatively, patients are sometimes divided into two classes according to a survival-time threshold, and binary classifiers are applied to predict each patient’s class. Although this approach has several drawbacks, it does provide natural performance metrics such as positive and negative predictive values to enable unambiguous assessments.
We compare the survival-time prediction and survival-time threshold approaches to analyzing cancer survival studies. We review and compare common performance metrics for the two approaches. We present new randomization tests and cross-validation methods to enable unambiguous statistical inferences for several performance metrics used with the survival-time prediction approach. We consider five survival prediction models consisting of one clinical model, two gene expression models, and two models from combinations of clinical and gene expression models.
A public breast cancer dataset was used to compare several performance metrics using five prediction models. 1) For some prediction models, the hazard ratio from fitting a Cox proportional hazards model was significant, but the two-group comparison was insignificant, and vice versa. 2) The randomization test and cross-validation were generally consistent with the p-values obtained from the standard performance metrics. 3) Binary classifiers highly depended on how the risk groups were defined; a slight change of the survival threshold for assignment of classes led to very different prediction results.
1) Different performance metrics for evaluation of a survival prediction model may give different conclusions in its discriminatory ability. 2) Evaluation using a high-risk versus low-risk group comparison depends on the selected risk-score threshold; a plot of p-values from all possible thresholds can show the sensitivity of the threshold selection. 3) A randomization test of the significance of Somers’ rank correlation can be used for further evaluation of performance of a prediction model. 4) The cross-validated power of survival prediction models decreases as the training and test sets become less balanced.
PMCID: PMC3410808  PMID: 22824262
3.  Data mining tools for Salmonella characterization: application to gel-based fingerprinting analysis 
BMC Bioinformatics  2013;14(Suppl 14):S15.
Pulsed field gel electrophoresis (PFGE) is currently the most widely and routinely used method by the Centers for Disease Control and Prevention (CDC) and state health labs in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs have been their difficulty in dealing with large datasets and the limited availability of analysis tools. There exists a need to develop new analytical tools for PFGE data mining in order to make full use of valuable data in large surveillance databases.
In this study, a software package was developed consisting of five types of bioinformatics approaches exploring and implementing for the analysis and visualization of PFGE fingerprinting. The approaches include PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis and two-way hierarchical cluster analysis. PFGE band standardization makes it possible for cross-group large dataset analysis. The Salmonella serotype prediction approach allows users to predict serotypes of Salmonella isolates based on their PFGE patterns. The hierarchical cluster analysis approach could be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities/dissimilarities of any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes, and provide a summary of the overall relationships between user-selected serotypes as well as the distinguishable band markers of these serotypes. The functionalities of these tools were illustrated on PFGE fingerprinting data from PulseNet of CDC.
The bioinformatics approaches included in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints. Fast and accurate prediction makes it possible to elucidate Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools to distinguish the PFGE markers and serotype specific patterns will enhance PFGE data retrieval, interpretation and serotype identification and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases.
PMCID: PMC3851133  PMID: 24267777
Data mining; Salmonella; PFGE; bioinformatics tools; data analysis.
4.  Identification of Bicluster Regions in a Binary Matrix and Its Applications 
PLoS ONE  2013;8(8):e71680.
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, algorithms are developed to detect regions of “coherence” patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant level biclusters for binary or quantitative data. This algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. The proposed method has much higher sensitivity and slightly lower specificity than several prominent biclustering methods from the analysis of two synthetic datasets. It was further compared with the Bimax method for two real datasets. The proposed method was shown to perform the most robust in terms of sensitivity, number of biclusters and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
PMCID: PMC3733970  PMID: 23940779
5.  Meta-Analysis of Pulsed-Field Gel Electrophoresis Fingerprints Based on a Constructed Salmonella Database 
PLoS ONE  2013;8(3):e59224.
A database was constructed consisting of 45,923 Salmonella pulsed-field gel electrophoresis (PFGE) patterns. The patterns, randomly selected from all submissions to CDC PulseNet during 2005 to 2010, included the 20 most frequent serotypes and 12 less frequent serotypes. Meta-analysis was applied to all of the PFGE patterns in the database. In the range of 20 to 1100 kb, serotype Enteritidis averaged the fewest bands at 12 bands and Paratyphi A the most with 19, with most serotypes in the 13−15 range among the 32 serptypes. The 10 most frequent bands for each of the 32 serotypes were sorted and distinguished, and the results were in concordance with those from distance matrix and two-way hierarchical cluster analyses of the patterns in the database. The hierarchical cluster analysis divided the 32 serotypes into three major groups according to dissimilarity measures, and revealed for the first time the similarities among the PFGE patterns of serotype Saintpaul to serotypes Typhimurium, Typhimurium var. 5-, and I 4,[5],12:i:-; of serotype Hadar to serotype Infantis; and of serotype Muenchen to serotype Newport. The results of the meta-analysis indicated that the pattern similarities/dissimilarities determined the serotype discrimination of PFGE method, and that the possible PFGE markers may have utility for serotype identification. The presence of distinct, serotype specific patterns may provide useful information to aid in the distribution of serotypes in the population and potentially reduce the need for laborious analyses, such as traditional serotyping.
PMCID: PMC3597626  PMID: 23516614
6.  Assessment of reproducibility of cancer survival risk predictions across medical centers 
Two most important considerations in evaluation of survival prediction models are 1) predictability - ability to predict survival risks accurately and 2) reproducibility - ability to generalize to predict samples generated from different studies. We present approaches for assessment of reproducibility of survival risk score predictions across medical centers.
Reproducibility was evaluated in terms of consistency and transferability. Consistency is the agreement of risk scores predicted between two centers. Transferability from one center to another center is the agreement of the risk scores of the second center predicted by each of the two centers. The transferability can be: 1) model transferability - whether a predictive model developed from one center can be applied to predict the samples generated from other centers and 2) signature transferability - whether signature markers of a predictive model developed from one center can be applied to predict the samples from other centers. We considered eight prediction models, including two clinical models, two gene expression models, and their combinations. Predictive performance of the eight models was evaluated by several common measures. Correlation coefficients between predicted risk scores of different centers were computed to assess reproducibility - consistency and transferability.
Two public datasets, the lung cancer data generated from four medical centers and colon cancer data generated from two medical centers, were analyzed. The risk score estimates for lung cancer patients predicted by three of four centers agree reasonably well. In general, a good prediction model showed better cross-center consistency and transferability. The risk scores for the colon cancer patients from one (Moffitt) medical center that were predicted by the clinical models developed from the another (Vanderbilt) medical center were shown to have excellent model transferability and signature transferability.
This study illustrates an analytical approach to assessing reproducibility of predictive models and signatures. Based on the analyses of the two cancer datasets, we conclude that the models with clinical variables appear to perform reasonable well with high degree of consistency and transferability. There should have more investigations on the reproducibility of prediction models including gene expression data across studies.
PMCID: PMC3598915  PMID: 23425000
7.  Prediction System for Rapid Identification of Salmonella Serotypes Based on Pulsed-Field Gel Electrophoresis Fingerprints 
Journal of Clinical Microbiology  2012;50(5):1524-1532.
A classification model is presented for rapid identification of Salmonella serotypes based on pulsed-field gel electrophoresis (PFGE) fingerprints. The classification model was developed using random forest and support vector machine algorithms and was then applied to a database of 45,923 PFGE patterns, randomly selected from all submissions to CDC PulseNet from 2005 to 2010. The patterns selected included the top 20 most frequent serotypes and 12 less frequent serotypes from various sources. The prediction accuracies for the 32 serotypes ranged from 68.8% to 99.9%, with an overall accuracy of 96.0% for the random forest classification, and ranged from 67.8% to 100.0%, with an overall accuracy of 96.1% for the support vector machine classification. The prediction system improves reliability and accuracy and provides a new tool for early and fast screening and source tracking of outbreak isolates. It is especially useful to get serotype information before the conventional methods are done. Additionally, this system also works well for isolates that are serotyped as “unknown” by conventional methods, and it is useful for a laboratory where standard serotyping is not available.
PMCID: PMC3347130  PMID: 22378901

Results 1-7 (7)