This paper concerns a study indicating that the expression levels of genes in signaling pathways can be modeled using a causal Bayesian network (BN) that is altered in tumorous tissue. These results open up promising areas of future research that can help identify driver genes and therapeutic targets. So, it is most appropriate for the cancer informatics community.
Our central hypothesis is that the expression levels of genes that code for proteins on a signal transduction pathway (STP) are causally related and that this causal structure is altered when the STP is involved in cancer. To test this hypothesis, we analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways, using a breast cancer gene expression level dataset containing 529 cases and 61 controls. We identified all the genes related to each of the 22 pathways and developed separate gene expression datasets for each pathway. We obtained significant results indicating that the causal structure of the expression levels of genes coding for proteins on STPs, which are believed to be implicated in both breast cancer and in all cancers, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.
signal transduction pathway; Bayesian network; gene expression level; breast cancer; causal structure
This work reviews the most relevant present-day processing methods used to improve the accuracy of multimodal nonlinear images in the detection of epithelial cancer and the supporting stroma. Special emphasis has been placed on methods of nonlinear optical (NLO) microscopy image processing such as: second harmonic to autofluorescence ageing index of dermis (SAAID), tumor-associated collagen signatures (TACS), fast Fourier transform (FFT) analysis, and gray level co-occurrence matrix (GLCM)-based methods. These strategies are presented as a set of potentially valuable diagnostic tools for early cancer detection. It may be proposed that the combination of NLO microscopy and the informatics-based image analysis approaches described in this review (all carried out on free software) may represent a powerful tool to investigate collagen organization and remodeling of extracellular matrix in carcinogenesis processes.
nonlinear signal; nonlinear microscopy; anisotropy; gray level co-occurrence matrix; tumor-associated collagen signatures
The emergence of transcriptomics, fuelled by high-throughput sequencing technologies, has changed the nature of cancer research and resulted in a massive accumulation of data. Computational analysis, integration, and data visualization are now major bottlenecks in cancer biology and translational research. Although many tools have been brought to bear on these problems, their use remains unnecessarily restricted to computational biologists, as many tools require scripting skills, data infrastructure, and powerful computational facilities. New user-friendly, integrative, and automated analytical approaches are required to make computational methods more generally useful to the research community. Here we present INsPeCT (INtegrative Platform for Cancer Transcriptomics), which allows users with basic computer skills to perform comprehensive in-silico analyses of microarray, ChIP-seq, and RNA-seq data. INsPeCT supports the selection of interesting genes for advanced functional analysis. Included in its automated workflows are (i) a novel analytical framework, RMaNI (regulatory module network inference), which supports the inference of cancer subtype-specific transcriptional module networks and the analysis of modules; and (ii) WGCNA (weighted gene co-expression network analysis), which infers modules of highly correlated genes across microarray samples, associated with sample traits, eg survival time. INsPeCT is available free of cost from Bioinformatics Resource Australia-EMBL and can be accessed at http://inspect.braembl.org.au.
cancer; systems biology; transcriptomics; transcriptional module networks; microarray; ChIP-seq; RNA-seq
The purpose of this investigation is to develop and evaluate a new Bayesian network (BN)-based patient survivorship prediction method. The central hypothesis is that the method predicts patient survivorship well, while having the capability to handle high-dimensional data and be incorporated into a clinical decision support system (CDSS). We have developed EBMC_Survivorship (EBMC_S), which predicts survivorship for each year individually. EBMC_S is based on the EBMC BN algorithm, which has been shown to handle high-dimensional data. BNs have excellent architecture for decision support systems. In this study, we evaluate EBMC_S using the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which concerns breast tumors. A 5-fold cross-validation study indicates that EBMC_S performs better than the Cox proportional hazard model and is comparable to the random survival forest method. We show that EBMC_S provides additional information such as sensitivity analyses, which covariates predict each year, and yearly areas under the ROC curve (AUROCs). We conclude that our investigation supports the central hypothesis.
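The yearly AUROCs mentioned above can be illustrated with a minimal sketch. It is not EBMC_S: the function names `auroc` and `yearly_auroc`, the input layout (one predicted risk per patient per year), and the decision to ignore censoring are all assumptions made for illustration.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def yearly_auroc(risk_by_year, event_year, years):
    """One AUROC per year. Hypothetical layout: risk_by_year[i, k] is the predicted
    risk that patient i has died by years[k]; censoring is ignored here."""
    risk_by_year = np.asarray(risk_by_year, dtype=float)
    event_year = np.asarray(event_year)
    result = {}
    for k, year in enumerate(years):
        died_by_year = event_year <= year  # binary outcome for this year
        result[year] = auroc(risk_by_year[:, k], died_by_year)
    return result
```

Treating each year as its own binary outcome is what makes a per-year AUROC possible; a real evaluation would also need to handle patients censored before the year in question.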
Bayesian network; survivorship prediction; Cox proportional hazard model; random survival forest; breast cancer
Gene set enrichment analysis (GSA) methods have been widely adopted by biological labs to analyze data and generate hypotheses for validation. Most of the existing comparison studies focus on whether the existing GSA methods can produce accurate P-values; however, practitioners are often more concerned with the correct gene-set ranking generated by the methods. The ranking performance is closely related to two critical goals associated with GSA methods: the ability to reveal biological themes and ensuring reproducibility, especially for small-sample studies. We have conducted a comprehensive simulation study focusing on the ranking performance of seven representative GSA methods. We overcome the limitation on the availability of real data sets by creating hybrid data models from existing large data sets. To build the data model, we pick a master gene from the data set to form the ground truth and artificially generate the phenotype labels. Multiple hybrid data models can be constructed from one data set and multiple data sets of smaller sizes can be generated by resampling the original data set. This approach enables us to generate a large batch of data sets to check the ranking performance of GSA methods. Our simulation study reveals that for the proposed data model, the Q2 type GSA methods have in general better performance than other GSA methods and the global test has the most robust results. The properties of a data set play a critical role in the performance. For the data sets with highly connected genes, all GSA methods suffer significantly in performance.
gene set enrichment analysis; feature ranking; data model; simulation study
The aberrantly expressed signal transducer and activator of transcription 3 (STAT3) predicts poor prognosis, primarily in estrogen receptor positive (ER(+)) breast cancers. Activated STAT3 is overexpressed in luminal A subtype cells. The mechanisms contributing to the prognosis and/or subtype relevant features of STAT3 in ER(+) breast cancers are through multiple interacting regulatory pathways, including STAT3-MYC, STAT3-ERα, and STAT3-MYC-ERα interactions, as well as the direct action of activated STAT3. These data predict malignant events, treatment responses and a novel enhancer of tamoxifen resistance. The inferred crosstalk between ERα and STAT3 in regulating their shared target gene, METAP2, is partially validated in the luminal B breast cancer cell line MCF7. Taken together, we identify a poor prognosis relevant gene set within the STAT3 network and a robust one in a subset of patients. VEGFA, ABL1, LYN, IGF2R and STAT3 are suggested therapeutic targets for further study based upon the degree of differential expression in our model.
STAT3 transcriptional regulatory network; prognosis; TAM resistance; tumorigenesis; breast cancer
OmicCircos is an R software package used to generate high-quality circular plots for visualizing genomic variations, including mutation patterns, copy number variations (CNVs), expression patterns, and methylation patterns. Such variations can be displayed as scatterplot, line, or text-label figures. Relationships among genomic features in different chromosome positions can be represented in the forms of polygons or curves. Utilizing the statistical and graphic functions in an R/Bioconductor environment, OmicCircos performs statistical analyses and displays results using cluster, boxplot, histogram, and heatmap formats. In addition, OmicCircos offers a number of unique capabilities, including independent track drawing for easy modification and integration, zoom functions, link-polygons, and position-independent heatmaps supporting detailed visualization.
AVAILABILITY AND IMPLEMENTATION
OmicCircos is available through Bioconductor at http://www.bioconductor.org/packages/devel/bioc/html/OmicCircos.html. An extensive vignette in the package describes installation, data formatting, and workflow procedures. The software is open source under the Artistic-2.0 license.
R package; circular plot; genomic variation
This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples using two heterogeneous data sets in an efficient way. The algorithm is able to capture the distinct classification power of two given data types without actually combining them. This package is suited to classification problems where two different types of data sets on the same samples are available. One of these data types has measurements on all samples and the other has measurements on only some samples. The former is easy to collect and/or relatively cheap (eg, clinical covariates) compared to the latter (high-dimensional data, eg, gene expression). One additional application for which stepwiseCM has also proven useful is the combination of two high-dimensional data types, eg, DNA copy number and mRNA expression. The package includes functions to project the neighborhood information in one data space to the other to determine a potential group of samples that are likely to benefit most by measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and the existing packages is that our approach aims to be cost-efficient by avoiding measuring additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, in diagnosis, test results for these individuals would be quickly available, which may lead to reduced waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages, and facilitates the use of the stepwiseCM package in diverse applications.
classification; data integration; high-dimensional data; R package
Two drastically different approaches to understanding the forces driving carcinogenesis have crystallized through years of research. These are the somatic mutation theory (SMT) and the tissue organization field theory (TOFT). The essence of SMT is that cancer is derived from a single somatic cell that has successively accumulated multiple DNA mutations, and that those mutations occur on genes which control cell proliferation and cell cycle. Thus, according to SMT, neoplastic lesions are the results of DNA-level events. Conversely, according to TOFT, carcinogenesis is primarily a problem of tissue organization: carcinogenic agents destroy the normal tissue architecture, thus disrupting cell-to-cell signaling and compromising genomic integrity. Hence, in TOFT the DNA mutations are the effect, and not the cause, of the tissue-level events. The cardinal importance of resolving the TOFT versus SMT controversy lies in the fact that, according to SMT, cancer is a unidirectional and mostly irreversible disease, whereas, according to TOFT, it is curable and reversible. In this paper, our goal is to outline a plausible scenario in which TOFT and SMT can be reconciled using the framework and concepts of self-organized criticality (SOC), a principle proven to be extremely fruitful in a wide range of disciplines pertaining to natural phenomena, to biological communities, to large-scale social developments, to technological networks, and to many other subjects of research.
self-organized criticality; somatic mutations; carcinogenesis; avalanches; quorum sensing; swarm intelligence
Breast tumors have been described by molecular subtypes characterized by pervasively different gene expression profiles. The subtypes are associated with different clinical parameters and origin of precursor cells. However, the biological pathways and chromosomal aberrations that differ between the subgroups are less well characterized. The molecular subtypes are associated with different risk of metastatic recurrence of the disease. Nevertheless, the performance of these overall patterns to predict outcome is far from optimal, suggesting that biological mechanisms that extend beyond the subgroups impact metastasis.
We have scrutinized publicly available gene expression datasets and identified molecular subtypes in 1,394 breast tumors with outcome data. By analysis of chromosomal regions and pathways using “Gene set enrichment analysis” followed by a meta-analysis, we identified comprehensive mechanistic differences between the subgroups. Furthermore, the same approach was used to investigate mechanisms related to metastasis within the subgroups. A striking finding is that the molecular subtypes account for the majority of biological mechanisms associated with metastasis. However, some mechanisms, aside from the subtypes, were identified in a training set of 1,239 tumors and confirmed by survival analysis in two independent validation datasets from the same type of platform and consisting of very comparable node-negative patients that did not receive adjuvant medical therapy. The results show that high expression of 5q14 genes and low levels of TNFR2 pathway genes were associated with poor survival in basal-like cancers. Furthermore, low expression of 5q33 genes and interleukin-12 pathway genes were associated with poor outcome exclusively in ERBB2-like tumors.
The identified regions, genes, and pathways may be potential drug targets in future individualized treatment strategies.
breast cancer; metastasis; gene expression; microarray; pathway analysis; molecular subtypes
High-dimensional datasets can be confounded by variation from technical sources, such as batches. Undetected batch effects can have severe consequences for the validity of a study’s conclusion(s). We evaluate high-throughput RNAseq and miRNAseq as well as DNA methylation and gene expression microarray datasets, mainly from the Cancer Genome Atlas (TCGA) project, with respect to technical and biological annotations. We observe technical bias in these datasets and discuss corrective interventions. We then suggest a general procedure to control study design, detect technical bias using linear regression of principal components, correct for batch effects, and re-evaluate principal components. This procedure is implemented in the R package swamp, and as graphical user interface software. In conclusion, high-throughput platforms that generate continuous measurements are sensitive to various forms of technical bias. For such data, monitoring of technical variation is an important analysis step.
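The detection step described above, linear regression of principal components on a technical annotation, might look roughly like the following sketch. It does not reproduce the swamp package or its API; the function name `pc_batch_r2`, the samples-by-features layout, and the use of R² as the summary statistic are assumptions for illustration.

```python
import numpy as np

def pc_batch_r2(data, batch, n_components=3):
    """R^2 of each leading principal component regressed on a batch annotation.

    data  : samples x features matrix (assumed layout)
    batch : one categorical batch label per sample
    """
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    centered = data - data.mean(axis=0)
    # Principal component scores from the SVD of the centered matrix
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # One indicator column per batch level (spans the intercept)
    levels = np.unique(batch)
    design = (batch[:, None] == levels[None, :]).astype(float)
    r2 = np.full(scores.shape[1], np.nan)
    for k in range(scores.shape[1]):
        y = scores[:, k]
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        ss_res = ((y - design @ coef) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        if ss_tot > 0:
            r2[k] = 1.0 - ss_res / ss_tot
    return r2
```

A leading component with R² close to 1 against the batch label suggests that technical variation dominates the data; after a correction step, the same check can be repeated on the re-computed components.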
data adjustment; batch effect; bias; sample annotation; RNAseq; high-throughput analysis
Recent advances in high-throughput genotyping have made it possible to identify genetic variants associated with increased risk of developing prostate cancer using genome-wide association studies (GWAS). However, the broader context in which the identified genetic variants operate is poorly understood. Here we present a comprehensive assessment, network, and pathway analysis of the emerging genetic susceptibility landscape of prostate cancer.
We created a comprehensive catalog of genetic variants and associated genes by mining published reports and accompanying websites hosting supplementary data on GWAS. We then performed network and pathway analysis using single nucleotide polymorphism (SNP)-containing genes to identify gene regulatory networks and pathways enriched for genetic variants.
We identified multiple gene networks and pathways enriched for genetic variants including IGF-1, androgen biosynthesis and androgen signaling pathways, and the molecular mechanisms of cancer. The results provide putative functional bridges between GWAS findings and gene regulatory networks and biological pathways.
prostate cancer; GWAS; network; pathway analysis
B-precursor acute lymphoblastic leukemia (B-ALL) is the most common childhood cancer. Although 80% of B-ALL patients can be cured, significant challenges persist. Significant disparities in clinical outcomes and mortality rates exist between racial/ethnic populations. The objective of this study was to determine whether gene expression levels significantly differ between ethnic populations. We compared gene expression levels between four ethnic populations (Whites, Blacks, Hispanics, and Asians) in the United States. Additionally, we performed network and pathway analysis to identify gene networks and pathways. Gene expression data involved 198 samples distributed as follows: 126 Whites, 51 Hispanics, 13 Blacks, and 8 Asians. We identified 300 highly significantly (P < 0.001) differentially expressed genes between the four ethnic populations. The identified genes included PHF6, BRD3, CRLF2, and RNF135, which have been implicated in pediatric B-ALL. We identified key pathways implicated in B-ALL including the PDGF, PI3/AKT, ERBB2-ERBB3, and IL-15 signaling pathways.
leukemia; gene expression variation; pediatric B-ALL
Microarray techniques provide promising tools for cancer diagnosis using gene expression profiles. However, molecular diagnosis based on high-throughput platforms presents great challenges due to the overwhelming number of variables versus the small sample size and the complex nature of multi-type tumors. Support vector machines (SVMs) have shown superior performance in cancer classification due to their ability to handle high dimensional low sample size data. The multi-class SVM algorithm of Crammer and Singer provides a natural framework for multi-class learning. Despite its effective performance, the procedure utilizes all variables without selection. In this paper, we propose to improve the procedure by imposing shrinkage penalties in learning to enforce solution sparsity.
The original multi-class SVM of Crammer and Singer is effective for multi-class classification but does not conduct variable selection. We improved the method by introducing soft-thresholding type penalties to incorporate variable selection into multi-class classification for high dimensional data. The new methods were applied to simulated data and two cancer gene expression data sets. The results demonstrate that the new methods can select a small number of genes for building accurate multi-class classification rules. Furthermore, the important genes selected by the methods overlap significantly, suggesting general agreement among different variable selection schemes.
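To make the idea of a soft-thresholding penalty concrete, here is a minimal sketch of the generic soft-thresholding operator applied to a class-by-gene weight matrix. The abstract does not give the exact penalties or the optimization procedure, so this is not the authors' method; the names `soft_threshold` and `selected_features` and the classes-by-genes layout are assumptions.

```python
import numpy as np

def soft_threshold(weights, lam):
    """Shrink every weight toward zero by lam and zero out those within lam of zero."""
    weights = np.asarray(weights, dtype=float)
    return np.sign(weights) * np.maximum(np.abs(weights) - lam, 0.0)

def selected_features(weights):
    """Indices of genes (columns) that keep a nonzero weight for at least one class."""
    weights = np.asarray(weights, dtype=float)
    return np.flatnonzero(np.any(weights != 0.0, axis=0))
```

Because small weights are set exactly to zero, genes whose weights vanish for every class drop out of the classifier, which is how shrinkage yields variable selection.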
High accuracy and sparsity make the new methods attractive for cancer diagnostics with gene expression data and defining targets of therapeutic intervention.
Availability: The MATLAB source code is available from http://math.arizona.edu/~hzhang/software.html.
support vector machine (SVM); multi-class SVM; variable selection; shrinkage methods; classification; microarray; cancer classification
Genome-wide association studies (GWAS) have achieved great success in identifying common variants associated with increased risk of developing breast cancer. However, GWAS do not typically provide information about the broader context in which genetic variants operate in different subtypes of breast cancer. The objective of this study was to determine whether genes containing single nucleotide polymorphisms (SNPs, herein called genetic variants) are associated with different subtypes of breast cancer. Additionally, we sought to identify gene regulatory networks and biological pathways enriched for these genetic variants. Using supervised analysis, we identified 201 genes that were significantly associated with the six intrinsic subtypes of breast cancer. The results demonstrate that integrative genomics analysis is a powerful approach for linking GWAS information to distinct disease states and provides insights about the broader context in which genetic variants operate in different subtypes of breast cancer.
GWAS; subtypes; breast cancer
This paper discusses the need for interconnecting computational cancer models from different sources and scales within clinically relevant scenarios to increase the accuracy of the models and speed up their clinical adaptation, validation, and eventual translation. We briefly review current interoperability efforts drawing upon our experiences with the development of in silico models for predictive oncology within a number of European Commission Virtual Physiological Human initiative projects on cancer. A clinically relevant scenario, addressing brain tumor modeling that illustrates the need for coupling models from different sources and levels of complexity, is described. General approaches to enabling interoperability using XML-based markup languages for biological modeling are reviewed, concluding with a discussion on efforts towards developing cancer-specific XML markup to couple multiple component models for predictive in silico oncology.
multi-scale computational tumor modeling; in silico oncology; model interoperability; XML markup languages
Methods for array normalization, such as median and quantile normalization, were developed for mRNA expression arrays. These methods assume few or symmetric differential expression of genes on the array. However, these assumptions are not necessarily appropriate for microRNA expression arrays because they consist of only a few hundred genes and a reasonable fraction of them are anticipated to have disease relevance.
We collected microRNA expression profiles for human tissue samples from a liposarcoma study using the Agilent microRNA arrays. For a subset of the samples, we also profiled their microRNA expression using deep sequencing. We empirically evaluated methods for normalization of microRNA arrays using deep sequencing data derived from the same tissue samples as the benchmark.
In this study, we demonstrated array effects in microRNA arrays using data from a liposarcoma study. We found moderately high correlation between Agilent data and sequence data on the same tumors, with the Pearson correlation coefficients ranging from 0.6 to 0.9. Array normalization resulted in some improvement in the accuracy of the differential expression analysis. However, even with normalization, there is still a significant number of false positive and false negative microRNAs, many of which are expressed at moderate to high levels.
Our study demonstrated the need to develop more efficient normalization methods for microRNA arrays to further improve the detection of genes with disease relevance. Until better methods are developed, an existing normalization method such as quantile normalization should be applied when analyzing microRNA array data.
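For readers unfamiliar with quantile normalization, the following is a minimal sketch of the standard procedure, not the pipeline used in the study. The function name `quantile_normalize`, the probes-by-arrays layout, and the tie handling (ties broken by position rather than averaged) are assumptions.

```python
import numpy as np

def quantile_normalize(x):
    """Force every array (column) to share the same empirical distribution.

    x : probes x arrays matrix (assumed layout). The reference distribution is
    the mean of the column-wise sorted values; ties are broken by position.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="mergesort")  # stable rank order per array
    sorted_x = np.take_along_axis(x, order, axis=0)
    reference = sorted_x.mean(axis=1)                # mean of the k-th smallest values
    out = np.empty_like(x)
    # Write the k-th reference value back to the probe holding rank k in each array
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

The procedure assumes the arrays share a common underlying distribution, which is the assumption the abstract questions for microRNA arrays with only a few hundred features.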
microRNA; microarray; normalization; differential expression; cancer; sarcoma
The 18,352 pancreatic ductal adenocarcinoma (PDAC) cases from the Surveillance, Epidemiology, and End Results (SEER) database were analyzed using the Kaplan-Meier method for the following variables: race, gender, marital status, year of diagnosis, age at diagnosis, pancreatic subsite, T-stage, N-stage, M-stage, tumor size, tumor grade, performed surgery, and radiation therapy. Because the T-stage variable did not satisfy the proportional hazards assumption, the cases were divided into cases with T1- and T2-stages (localized tumor) and cases with T3- and T4-stages (extended tumor). For estimating survival and conditional survival probabilities in each group, a multivariate Cox regression model adjusted for the remaining covariates was developed. Testing the reproducibility of model parameters and generalizability of these models showed that the models are well calibrated and have concordance indexes equal to 0.702 and 0.712, respectively. Based on these models, a prognostic estimator of survival for patients diagnosed with PDAC was developed and implemented as a computerized web-based tool.
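As an illustration of the two quantities named above, the sketch below computes a Kaplan-Meier survival curve and a conditional survival probability, P(T > s + t | T > s) = S(s + t) / S(s). It uses the unadjusted Kaplan-Meier estimate rather than the covariate-adjusted Cox models of the study, and the function names are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate: returns distinct event times and S(t) just after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)  # True = death observed, False = censored
    event_times = np.unique(times[events])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = (times >= t).sum()
        deaths = ((times == t) & events).sum()
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def survival_at(t, event_times, surv):
    """Step-function lookup of S(t); equals 1 before the first event time."""
    idx = np.searchsorted(event_times, t, side="right") - 1
    return 1.0 if idx < 0 else float(surv[idx])

def conditional_survival(s, t, event_times, surv):
    """Probability of surviving t more time units given survival to time s."""
    s_now = survival_at(s, event_times, surv)
    if s_now == 0.0:
        raise ValueError("conditional survival is undefined when S(s) = 0")
    return survival_at(s + t, event_times, surv) / s_now
```

Conditional survival answers a different question from overall survival: for a patient who has already lived s years after diagnosis, it gives the chance of living t more.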
cancer; survival; Cox model; SEER; pancreatic cancer
Modeling of cancer hazards at age t deals with a dichotomous population, a small part of which (the fraction at risk) will get cancer, while the other part will not. Therefore, we conditioned the hazard function, h(t), the probability density function (pdf), f(t), and the survival function, S(t), on frailty α in individuals. Assuming α has the Bernoulli distribution, we obtained equations relating the unconditional (population level) hazard function, hU(t), cumulative hazard function, HU(t), and overall cumulative hazard, H0, with the h(t), f(t), and S(t) for individuals from the fraction at risk. Computing procedures for estimating h(t), f(t), and S(t) were developed and used to fit the pancreatic cancer data collected by SEER9 registries from 1975 through 2004 with the Weibull pdf suggested by the Armitage-Doll model. The parameters of the resulting excellent fit suggest that the age of pancreatic cancer presentation has a time shift of about 17 years and that five mutations are needed for pancreatic cells to become malignant.
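The relations between population-level and at-risk quantities follow from the standard mixture argument for a Bernoulli frailty. The sketch below is a derivation under stated assumptions, not the paper's own equations: p denotes the fraction at risk, P(α = 1), and S_U(t) the population survival function, both symbols introduced here for illustration.

```latex
\begin{align}
S_U(t) &= (1 - p) + p\,S(t), \\
f_U(t) &= -\frac{dS_U(t)}{dt} = p\,f(t), \\
h_U(t) &= \frac{f_U(t)}{S_U(t)} = \frac{p\,f(t)}{1 - p + p\,S(t)}, \\
H_U(t) &= -\ln S_U(t) = -\ln\bigl(1 - p + p\,S(t)\bigr), \\
H_0 &= \lim_{t \to \infty} H_U(t) = -\ln(1 - p) \quad \text{if } S(t) \to 0 .
\end{align}
```

Because the individuals with α = 0 never develop the disease, the population hazard h_U(t) eventually falls even when the individual hazard h(t) = f(t)/S(t) keeps rising, and the overall cumulative hazard H_0 stays finite.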
cancer incidence; cancer hazard; frailty; Weibull distribution; pancreatic cancer
We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as the current state of the art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and their correspondence with human decision-making and clinical testing practices.
classification; machine learning; gene expression; biomarkers
The aim of this study was to perform comparative analysis of multiple public datasets of gene expression in order to identify common genes as potential prognostic biomarkers. Additionally, the study sought to identify biological processes and pathways that are most significantly associated with early distant metastases (<5 years) in women with estrogen receptor-positive (ER+) breast tumors. Datasets from three published studies were selected for in silico analysis of gene expression profiles of ER+ breast cancer, using time to distant metastasis as the clinical endpoint. A subset of 44 differentially expressed genes (DEGs) was found common to all three studies and characterized by mitotic checkpoint genes and pathways that regulate mitotic spindle and chromosome dynamics. DEG promoter regions were enriched with NFY binding sites. Analysis of miRNA target sites identified significant enrichment of miR-192, miR-193B, and miR-16-1 targets. Aberrant mitotic regulation could drive increased genomic instability leading to a progression towards an early onset metastatic phenotype. The relative importance of mitotic instability may reflect the clinical utility of mitotic poisons in metastatic breast cancer, including poisons such as the taxanes, epothilones, and vinca alkaloids.
estrogen receptor alpha-positive; mitotic checkpoint signaling; mitotic regulation network; microRNA targets; early distant metastasis
Cancer risk management involves obliterating excess concentration of cancer causing trace elements by the natural immune system and hence intake of a nutritious diet is of paramount importance. Human diet should consist of essential macronutrients that have to be consumed in large quantities, while trace elements are to be consumed in very small amounts. As some of these trace elements are causative factors for various types of cancer and build up at the expense of macronutrients, cancer risk management of these trace elements should be based on their initial concentration in the blood of each individual and not on their tolerable upper intake level. We propose an information theory based Expert System (ES) for estimating the lowest limit of toxicity association between the trace elements and the macronutrients. Such an estimate would enable the physician to prescribe required medication containing the macronutrients to annul the toxicity of cancer risk trace elements. The lowest limit of toxicity association is achieved by minimizing the correlated information of the concentration correlation matrix using the concept of Mutual Information (MI) and an algorithm based on a Technique of Determinant Inequalities (TDI) developed by the authors. The novelty of our ES is that it provides the lowest limit of toxicity profile for all trace elements in the blood not restricted to a group of compounds having similar structure. We demonstrate the superiority of our algorithm over Principal Component Analysis in mitigating trace element toxicity in blood samples.
carcinogenic trace elements; high correlation coefficient; cancer screening; expert system; mutual information
Genome-wide association studies (GWAS) have identified genetic variants associated with an increased risk of developing breast cancer. However, the association of genetic variants and their associated genes with the most aggressive subset of breast cancer, the triple-negative breast cancer (TNBC), remains a central puzzle in molecular epidemiology. The objective of this study was to determine whether genes containing single nucleotide polymorphisms (SNPs) associated with an increased risk of developing breast cancer are connected to and could stratify different subtypes of TNBC. Additionally, we sought to identify molecular pathways and networks involved in TNBC. We performed integrative genomics analysis, combining information from GWAS involving over 400,000 cases and over 400,000 controls, with gene expression data derived from 124 breast cancer patients classified as TNBC (at the time of diagnosis) and 142 cancer-free controls. Analysis of GWAS reports produced 500 SNPs mapped to 188 genes. We identified a signature of 159 functionally related SNP-containing genes which were significantly (P < 10^-5) associated with and stratified TNBC. Additionally, we identified 97 genes which were functionally related to, and had expression patterns similar to, the SNP-containing genes. Network modeling and pathway prediction revealed multi-gene pathways including p53, NFkB, BRCA, apoptosis, DNA repair, DNA mismatch, and excision repair pathways enriched for SNPs mapped to genes significantly associated with TNBC. The results provide convincing evidence that integrating GWAS information with gene expression data provides a unified and powerful approach for biomarker discovery in TNBC.
triple-negative breast cancer; GWAS; gene expression
In cancer research, the popularity of a large number of microarray applications has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made the full validation of such profiles and their related gene lists across studies difficult, and their classification accuracies have rarely been validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. There is therefore a need to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features.
In this study we compared the performance of either metagene- or single gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach.
MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets. The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms.
Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
microarray; classification; metagenes; breast cancer