Finding reliable gene markers for accurate disease classification is very challenging due to a number of reasons, including the small sample size of typical clinical data, high noise in gene expression measurements, and the heterogeneity across patients. In fact, gene markers identified in independent studies often do not coincide with each other, suggesting that many of the predicted markers may have no biological significance and may be simply artifacts of the analyzed dataset. To find more reliable and reproducible diagnostic markers, several studies proposed to analyze the gene expression data at the level of groups of functionally related genes, such as pathways. Studies have shown that pathway markers tend to be more robust and yield more accurate classification results. One practical problem of the pathway-based approach is the limited coverage of genes by currently known pathways. As a result, potentially important genes that play critical roles in cancer development may be excluded. To overcome this problem, we propose a novel method for identifying reliable subnetwork markers in a human protein-protein interaction (PPI) network.
In this method, we overlay the gene expression data with the PPI network and look for the most discriminative linear paths that consist of discriminative genes that are highly correlated to each other. The overlapping linear paths are then optimally combined into subnetworks that can potentially serve as effective diagnostic markers. We tested our method on two independent large-scale breast cancer datasets and compared the effectiveness and reproducibility of the identified subnetwork markers with gene-based and pathway-based markers. We also compared the proposed method with an existing subnetwork-based method.
The proposed method can efficiently find reliable subnetwork markers that outperform the gene-based and pathway-based markers in terms of discriminative power, reproducibility and classification performance. Subnetwork markers found by our method are highly enriched in common GO terms, and they can more accurately classify breast cancer metastasis compared to markers found by a previous method.
One important problem in translational genomics is the identification of reliable and reproducible markers that can be used to discriminate between different classes of a complex disease, such as cancer. The typical small sample setting makes the prediction of such markers very challenging, and various approaches have been proposed to address this problem. For example, it has been shown that pathway markers, which aggregate the gene activities in the same pathway, tend to be more robust than gene markers. Furthermore, the use of gene expression ranking has been shown to be robust to batch effects and can lead to more interpretable results. In this paper, we propose an enhanced pathway activity inference method that uses gene ranking to predict the pathway activity in a probabilistic manner. The main focus of this work is on identifying robust pathway markers that can ultimately lead to robust classifiers with reproducible performance across datasets. Simulation results based on multiple breast cancer datasets show that the proposed inference method identifies better pathway markers that can predict breast cancer metastasis with higher accuracy. Moreover, the identified pathway markers can lead to better classifiers with more consistent classification performance across independent datasets.
We developed PathAct, a novel method for pathway analysis to investigate the biological and clinical implications of the gene
expression profiles. The advantage of PathAct in comparison with the conventional pathway analysis methods is that it can
estimate pathway activity levels for individual patients quantitatively in the form of a pathway-by-sample matrix. This matrix can
be used for further analysis such as hierarchical clustering and other analysis methods. To evaluate the feasibility of PathAct,
we compared it with frequently used gene-enrichment analysis methods on two public microarray datasets.
Dataset #1 comprised breast cancer patients; we used it to investigate pathways associated with triple-negative breast cancer with
PathAct and compared them with those obtained by gene set enrichment analysis (GSEA). Dataset #2 was another breast cancer dataset
that included disease-free survival (DFS) for each patient. The contribution of each pathway to prognosis was investigated by our method as
well as by Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis. In dataset #1, four of the six
pathways that satisfied p < 0.05 and FDR < 0.30 by GSEA were also among those obtained by PathAct. For
dataset #2, two (“Cell Cycle” and “DNA replication”) of the four pathways identified by PathAct were also identified by
DAVID analysis. Thus, we confirmed a good degree of agreement between PathAct and conventional methods. Moreover, several
applications of further statistical analyses such as hierarchical cluster analysis by pathway activity, correlation analysis and
survival analysis between pathways were conducted.
Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.
breast cancer metastasis; classification; protein networks; pathways; microarrays
An estimated 12% of females in the United States will develop breast cancer in their lifetime. Although there have been advances in treatment options, including surgery and chemotherapy, breast cancer is still the second most lethal cancer in women. Thus, there is a clear need for better methods to predict prognosis for each breast cancer patient. With the advent of large genetic databases and the reduction in the cost of experiments, researchers are faced with choosing from a large pool of potential prognostic markers from numerous breast cancer gene expression profile studies.
Five microarray datasets related to breast cancer were examined using gene set analysis and the cancers were categorized into different subtypes using a scoring system based on genetic pathway activity.
We have observed that significant genes in the individual studies show little reproducibility across the datasets. From our comparative analysis, using gene pathways with clinical variables is more reliable across studies and shows promise in assessing a patient's prognosis.
This study concludes that, in light of clinical variables, there are significant gene pathways in common across the datasets. Specifically, several pathways can further significantly stratify patients for survival. These candidate pathways should help to develop a panel of significant biomarkers for the prognosis of breast cancer patients in a clinical setting.
Classification of cancers based on gene expressions produces better accuracy
when compared to that of the clinical markers. Feature selection improves
the accuracy of these classification algorithms by reducing the chance
of overfitting that arises from the large number of features. We develop a
new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods,
our method integrates signaling and gene regulatory pathways with gene
expression data to minimize the chance of overfitting of the method and to
improve the test accuracy. Thus, BPFS selects a biologically meaningful feature
set that is minimally redundant. Our experiments on published breast
cancer datasets demonstrate that all of the top 20 genes found by our method
are associated with cancer. Furthermore, the classification accuracy of our
signature is up to 18% better than that of van't Veer's 70-gene signature,
and up to 8% better than that of the best published feature selection
Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equally. It is well known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach to incorporate gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and the logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer related gene sets achieved higher NES in GSEA-AF as well. The same datasets were also analyzed by LRpath and LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were also analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced with the integration of appearance frequency of genes. We conclude that, generally, gene set analysis methods that integrate information from KEGG PATHWAY perform better both statistically and biologically.
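The appearance-frequency idea can be illustrated with a short sketch. The abstract does not give the weighting formula, so the inverse-document-frequency form used here (the logarithm of the number of pathways divided by the number of pathways containing the gene) is an assumption borrowed from information retrieval, and the pathway and gene names are hypothetical placeholders.

```python
import math
from collections import Counter

def appearance_frequency_weights(pathways):
    """Return an IDF-style weight per gene (assumed form, not the paper's formula).

    pathways: dict mapping pathway name -> list of member genes.
    Genes that occur in many pathways receive weights near zero;
    genes specific to one pathway receive the largest weight.
    """
    n_pathways = len(pathways)
    # Count each gene once per pathway, even if it is listed twice.
    counts = Counter(g for genes in pathways.values() for g in set(genes))
    return {g: math.log(n_pathways / c) for g, c in counts.items()}

# Hypothetical toy pathway collection.
toy_pathways = {
    "pathway_1": ["gene_a", "gene_b"],
    "pathway_2": ["gene_a", "gene_c"],
    "pathway_3": ["gene_a", "gene_b", "gene_d"],
}
weights = appearance_frequency_weights(toy_pathways)
```

In this toy collection, `gene_a` appears in every pathway and so contributes nothing, while the pathway-specific genes carry the most weight. How such weights enter the GSEA or LRpath statistics is not covered by this sketch.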
With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods.
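The inference step described above can be sketched as follows. This is a minimal illustration, assuming that each gene's expression under each phenotype is modeled by a Gaussian whose mean and standard deviation are estimated from training samples. The Gaussian model, the function names, and the toy values are assumptions for illustration and are not taken from the paper.

```python
import math
from statistics import mean, stdev

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_gene_models(expr_by_class):
    """expr_by_class: {phenotype: {gene: [training expression values]}}.

    Returns {phenotype: {gene: (mean, std)}}.
    """
    return {
        label: {g: (mean(v), stdev(v)) for g, v in genes.items()}
        for label, genes in expr_by_class.items()
    }

def gene_llr(x, gene, models, pos, neg):
    """Log-likelihood ratio of phenotype pos versus neg for one gene."""
    mu_p, sd_p = models[pos][gene]
    mu_n, sd_n = models[neg][gene]
    return gaussian_logpdf(x, mu_p, sd_p) - gaussian_logpdf(x, mu_n, sd_n)

def pathway_activity(sample, pathway_genes, models, pos, neg):
    """Sum the per-gene log-likelihood ratios over the pathway's member genes."""
    return sum(
        gene_llr(sample[g], g, models, pos, neg)
        for g in pathway_genes
        if g in sample
    )

# Hypothetical toy training data for two genes and two phenotypes.
train = {
    "metastatic": {"G1": [2.0, 2.2, 1.8], "G2": [0.9, 1.1, 1.0]},
    "non_metastatic": {"G1": [1.0, 1.2, 0.8], "G2": [1.0, 0.9, 1.1]},
}
models = fit_gene_models(train)
sample = {"G1": 2.1, "G2": 1.0}
activity = pathway_activity(sample, ["G1", "G2"], models, "metastatic", "non_metastatic")
```

A positive activity indicates that the pathway's genes, taken together, look more like the first phenotype than the second. The resulting pathway activities can then be used as features for a classifier.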
Breast cancers lacking the estrogen receptor (ER) can be distinguished from other breast cancers on the basis of poor prognosis, high grade, distinctive histopathology and unique molecular signatures. These features further distinguish estrogen receptor negative (ER−) tumor subtypes, but targeted therapy is currently limited to tumors over-expressing the ErbB2 receptor.
To uncover the pathways against which future therapies could be developed we undertook a meta-analysis of gene expression from five large microarray datasets relative to ER status. A measure of association with ER status was calculated for every Affymetrix HG-U133A probe set and the pathways that distinguished ER− tumors were defined by testing for enrichment of biologically defined gene sets using Gene Set Enrichment Analysis (GSEA). As expected, the expression of the direct transcriptional targets of the ER was muted in ER− tumors, but the expression of genes indirectly regulated by estrogen was enhanced. We also observed enrichment of independent MYC- and E2F-driven transcriptional programs. We used a cell model of estrogen and MYC action to define the interaction between estrogen and MYC transcriptional activity in breast cancer. We found that the basal subgroup of ER− breast cancer showed a strong MYC transcriptional response that reproduced the indirect estrogen response seen in estrogen receptor positive (ER+) breast cancer cells.
Increased transcriptional activity of MYC is a characteristic of basal breast cancers where it mimics a large part of an estrogen response in the absence of the ER, suggesting a mechanism by which these cancers achieve estrogen-independence and providing a potential therapeutic target for this poor prognosis sub group of breast cancer.
Gene expression-based prostate cancer gene signatures of poor prognosis are hampered by lack of gene feature reproducibility and a lack of understandability of their function. Molecular pathway-level mechanisms are intrinsically more stable and more robust than an individual gene. The Functional Analysis of Individual Microarray Expression (FAIME) we developed allows distinctive sample-level pathway measurements with utility for correlation with continuous phenotypes (e.g. survival). Further, we and others have previously demonstrated that pathway-level classifiers can be as accurate as gene-level classifiers using curated genesets that may implicitly comprise ascertainment biases (e.g. KEGG, GO). Here, we hypothesized that transformation of individual prostate cancer patient gene expression to pathway-level mechanisms derived from automated high throughput analyses of genomic datasets may also permit personalized pathway analysis and improve prognosis of recurrent disease.
Via FAIME, three independent prostate gene expression arrays with both normal and tumor samples were transformed into two distinct types of molecular pathway mechanisms: (i) the curated Gene Ontology (GO) and (ii) dynamic expression activity networks of cancer (Cancer Modules). FAIME-derived mechanisms for tumorigenesis were then identified and compared. Curated GO and computationally generated "Cancer Module" mechanisms overlap significantly and are enriched for known oncogenic deregulations and highlight potential areas of investigation. We further show in two independent datasets that these pathway-level tumorigenesis mechanisms can identify men who are more likely to develop recurrent prostate cancer (log-rank p = 0.019).
Curation-free biomodules classification derived from congruent gene expression activation breaks from the paradigm of recapitulating the known curated pathway mechanism universe.
Morphologic features of tumour cells have long been validated for the clinical classification of breast cancers and are regularly used as a “gold standard” to ascertain prognostic outcome in patients. Identification of molecular markers such as expression of the receptors for estrogen (ER) and progesterone (PgR) and the human epidermal growth factor receptor 2 (HER2) has played an important role in determining targets for the development of efficacious drugs for treatment and has also offered additional predictive value for the therapeutic assessment of patients with breast cancer. More recent technical advancements in identifying several cancer-related genes have provided further opportunities to identify specific subtypes of breast cancer. Among the subtypes, tumours with triple-negative cells are identified using specific staining procedures for basal markers such as cytokeratin 5 and 6 and the absence of ER, PgR, and HER2 expression. Patients with triple-negative breast cancers therefore have the disadvantage of not benefiting from currently available receptor-targeted systemic therapy. Optimal conditions for the therapeutic assessment of women with triple-negative breast tumours and for the management of their disease have yet to be validated in prospective investigations. The present review discusses the differences between triple-negative breast tumours and basal-like breast tumours and also the role of mutations in the BRCA genes. Attention is also paid to treatment options available to patients with triple-negative breast tumours.
Triple-negative breast tumours; epidermal growth factor receptor; chemotherapy
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. By leveraging this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (P_GWAS = 0.003, P_expr < 0.001, and P_combined = 6.18 × 10⁻⁸), "regulation of actin cytoskeleton" (P_GWAS = 0.003, P_expr = 0.009, and P_combined = 3.34 × 10⁻⁴), and "Jak-STAT signaling pathway" (P_GWAS = 0.001, P_expr = 0.084, and P_combined = 8.79 × 10⁻⁴).
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
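As a hedged illustration of how P values from two platforms can be combined into a single value, the sketch below implements Fisher's method. The abstract does not state which combination rule was used, so Fisher's method is an assumption chosen because it is a common default, and the sketch is not claimed to reproduce the reported combined values exactly.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= half_stat / (i + 1)
    return math.exp(-half_stat) * total

# Example call with two per-platform P values.
combined = fisher_combined_p([0.003, 0.009])
```

Fisher's method assumes the input P values are independent, which may not hold when both platforms profile related samples.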
Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These have produced gene lists that have little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process.
We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of genes in the sets are aggregated, using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings via bootstrap replications within datasets, produced more stable classifiers across different datasets, and are potentially more interpretable in the biological context since they examine gene expression in the context of their neighbouring genes in the pathway. In addition, we performed this analysis in each breast cancer molecular subtype, based on ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology based on previous analysis of individual genes.
To date, most analyses of gene expression data have focused at the level of the individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce the noise and could provide increased insight into the underlying biological pathways.
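The aggregation of gene expression into set-level features can be sketched as below. The abstract does not define the set statistic, so the mean and median used here are assumed choices, and the sample, gene, and set names are hypothetical.

```python
from statistics import mean, median

def set_statistic(expression, gene_sets, aggregate=mean):
    """Aggregate gene-level expression into one value per gene set and sample.

    expression: {sample: {gene: value}}
    gene_sets:  {set_name: [member genes]}
    aggregate:  function reducing a list of values to one number
                (mean by default; an assumption, not the paper's definition).
    Genes missing from a sample's profile are skipped.
    """
    features = {}
    for sample, profile in expression.items():
        features[sample] = {}
        for name, genes in gene_sets.items():
            values = [profile[g] for g in genes if g in profile]
            if values:
                features[sample][name] = aggregate(values)
    return features

# Hypothetical toy data.
expression = {
    "s1": {"A": 1.0, "B": 3.0, "C": 5.0},
    "s2": {"A": 2.0, "B": 2.0, "C": 8.0},
}
gene_sets = {"set1": ["A", "B"], "set2": ["B", "C", "D"]}
mean_features = set_statistic(expression, gene_sets)
median_features = set_statistic(expression, gene_sets, aggregate=median)
```

The resulting set-by-sample values replace individual genes as classifier inputs. Averaging over several genes is one way the noise in any single measurement can be reduced.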
Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen κ and Cramer V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided.
SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55), and yielded similar strong prognostic value.
Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
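The two agreement measures named in the methods have standard definitions, sketched here in plain Python. Cohen's κ compares observed agreement between two classifiers with the agreement expected by chance, and Cramér's V rescales the chi-square statistic of a contingency table to the range 0 to 1. The label vectors are hypothetical toy data, not results from the study.

```python
import math
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

def cramers_v(x, y):
    """Cramér's V between two categorical variables.

    Requires at least two distinct categories in both x and y.
    """
    n = len(x)
    rows, cols = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

# Hypothetical subtype calls from two classifiers on six tumors.
calls_1 = ["basal", "basal", "lumA", "lumA", "her2", "lumB"]
calls_2 = ["basal", "basal", "lumA", "lumB", "her2", "lumB"]
kappa = cohen_kappa(calls_1, calls_2)

# Hypothetical subtype calls versus a binary clinical variable.
v = cramers_v(["pos", "pos", "neg", "neg"], ["a", "a", "b", "b"])
```

A κ near 1 indicates near-perfect concordance and a value near 0 indicates agreement no better than chance, which gives a scale for reading the κ ranges reported above.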
Colon cancer patients with the same stage show diverse clinical behavior due to tumor heterogeneity. We aimed to discover distinct classes of tumors based on microarray expression patterns, to analyze whether the molecular classification correlated with the histopathological stages or other clinical parameters and to study differences in the survival.
Hierarchical clustering was performed for class discovery in 88 colon tumors (stages I to IV). Pathway analysis was performed, and correlations between clinical parameters and our classification were analyzed. Tumor subtypes were validated using an external set of 78 patients. A 167-gene signature associated with the main subtype was generated using the 3-Nearest-Neighbor method. Coincidences with other prognostic predictors were assessed.
Hierarchical clustering identified four robust tumor subtypes with biologically and clinically distinct behavior. Stromal components (p < 0.001), nuclear β-catenin (p = 0.021), mucinous histology (p = 0.001), microsatellite-instability (p = 0.039) and BRAF mutations (p < 0.001) were associated with this classification, but it was independent of Dukes stages (p = 0.646). Molecular subtypes were established from stage I. High-stroma-subtype showed increased levels of genes and altered pathways distinctive of tumour-associated-stroma and components of the extracellular matrix in contrast to Low-stroma-subtype. Mucinous-subtype was reflected by the increased expression of trefoil factors and mucins as well as by a higher proportion of MSI and BRAF mutations. Tumor subtypes were validated using an external set of 78 patients. A 167-gene signature associated with the Low-stroma-subtype distinguished low risk patients from high risk patients in the external cohort (Dukes B and C: HR = 8.56 (2.53-29.01); Dukes B, C and D: HR = 1.87 (1.07-3.25)). Eight different reported survival gene signatures segregated our tumors into two groups: the Low-stroma-subtype and the other tumor subtypes.
We have identified novel molecular subtypes in colon cancer with distinct biological and clinical behavior that are established from the initiation of the tumor. Tumor microenvironment is important for the classification and for the malignant power of the tumor. Differential gene sets and biological pathways characterize each tumor subtype reflecting underlying mechanisms of carcinogenesis that may be used for the selection of targeted therapeutic procedures. This classification may contribute to an improvement in the management of the patients with CRC and to a more comprehensive prognosis.
Colon cancer; Microarray gene expression; Molecular classification; Stroma; Survival
Human breast tumors are heterogeneous and consist of phenotypically diverse cells. Breast cancer cells with a CD44+/CD24- phenotype have been suggested to have tumor-initiating properties with stem cell-like and invasive features, although it is unclear whether their presence within a tumor has clinical implications. There is also a large heterogeneity between tumors, illustrated by reproducible stratification into various subtypes based on gene expression profiles or histopathological features. We have explored the prevalence of cells with different CD44/CD24 phenotypes within breast cancer subtypes.
Double-staining immunohistochemistry was used to quantify CD44 and CD24 expression in 240 human breast tumors for which information on other tumor markers and clinical characteristics was available. Gene expression data were also accessible for a cohort of the material.
A considerable heterogeneity in CD44 and CD24 expression was seen both between and within tumors. A complete lack of both proteins was evident in 35% of the tumors, while 13% contained cells of more than one of the CD44+/CD24-, CD44-/CD24+ and CD44+/CD24+ phenotypes. CD44+/CD24- cells were detected in 31% of the tumors, ranging in proportion from only a few to close to 100% of tumor cells. The CD44+/CD24- phenotype was most common in the basal-like subgroup – characterized as negative for the estrogen and progesterone receptors as well as for HER2, and as positive for cytokeratin 5/14 and/or epidermal growth factor receptor, and particularly common in BRCA1 hereditary tumors, of which 94% contained CD44+/CD24- cells. The CD44+/CD24- phenotype was surprisingly scarce in HER2+ tumors, which had a predominantly CD24+ status. A CD44+/CD24- gene expression signature was generated, which included CD44 and α6-integrin (CD49f) among the top-ranked overexpressed genes.
We demonstrate an association between basal-like and particularly BRCA1 hereditary breast cancer and the presence of CD44+/CD24- cells. Not all basal-like tumors and very few HER2+ tumors, however, contain CD44+/CD24- cells, emphasizing that a putative tumorigenic ability may not be confined to cells of this phenotype and that other breast cancer stem cell markers remain to be identified.
Primary tumor gene expression has been shown to be useful for finding metastasis-driving gene markers for the prediction of breast cancer recurrence (BCR). However, difficulties associated with the analysis of microarray data have led to poor predictive power and inconsistency in previously introduced gene signatures. In this study, a hybrid method was proposed for identifying more predictive gene signatures from microarray datasets. Initially, the parameters of a Rough-Set (RS) theory based feature selection method were tuned to construct a customized gene extraction algorithm. Afterward, the most informative genes were selected from six independent breast cancer datasets using the RS gene selection method. Then, the combined set of these six signature sets, containing 114 genes, was evaluated for prediction of BCR. Finally, a meta-signature containing 18 genes was selected from the combination of datasets, and its prediction accuracy was compared to that of the combined signature. The results of a 10-fold cross-validation test showed an acceptable misclassification error rate (MCR) over 1338 breast cancer patients. In comparison to a recent similar work, our approach achieved a reduction of more than 5% in MCR using fewer genes for prediction. The results also demonstrated a 7% improvement in average accuracy across the six datasets when using the combined set of 114 genes in comparison with the 18-gene meta-signature. In this study, a more informative gene signature was selected for prediction of BCR using an RS-based gene extraction algorithm. To conclude, combining different signatures yielded more stable prediction over independent datasets.
Breast cancer recurrence prediction; gene expression signature; meta-signature; rough-set theory
Clustering analysis of microarray data is often criticized for giving ambiguous results because of sensitivity to data perturbation or clustering techniques used. In this paper, we describe a new method based on principal component analysis and ensemble consensus clustering that avoids these problems.
We illustrate the method on a public microarray dataset from 36 breast cancer patients of whom 31 were diagnosed with at least two of three pathological stages of disease (atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC)). Our method identifies an optimum set of genes and divides the samples into stable clusters which correlate with clinical classification into Luminal, Basal-like and Her2+ subtypes. Our analysis reveals a hierarchical portrait of breast cancer progression and identifies genes and pathways for each stage, grade and subtype. An intriguing observation is that the disease phenotype is distinguishable in ADH and progresses along distinct pathways for each subtype. The genetic signature for disease heterogeneity across subtypes is greater than the heterogeneity of progression from DCIS to IDC within a subtype, suggesting that the disease subtypes have distinct progression pathways.
Our method identifies six disease subtype clusters and one normal cluster. The first split separates the normal samples from the cancer samples. Next, the cancer cluster splits into low grade (pathological grades 1 and 2) and high grade (pathological grades 2 and 3) while the normal cluster is unchanged. Further, the low grade cluster splits into two subclusters and the high grade cluster into four. The final six disease clusters are mapped into one Luminal A, three Luminal B, one Basal-like and one Her2+.
We confirm that the cancer phenotype can be identified in early stage because the genes altered in this stage progressively alter further as the disease progresses through DCIS into IDC. We identify six subtypes of disease which have distinct genetic signatures and remain separated in the clustering hierarchy. Our findings suggest that the heterogeneity of disease across subtypes is higher than the heterogeneity of the disease progression within a subtype, indicating that the subtypes are in fact distinct diseases.
Selection of novel molecular markers is an important goal of cancer genomics studies. The aim of our analysis was to apply multivariate bioinformatics tools to rank genes that are potential markers of papillary thyroid cancer (PTC) according to their diagnostic usefulness. We also assessed the accuracy of benign/malignant classification, based on gene expression profiling, for PTC. We analyzed a 180-array dataset (90 HG-U95A and 90 HG-U133A oligonucleotide arrays), which included a collection of 57 PTCs, 61 benign thyroid tumors, and 62 apparently normal tissues. Gene selection was carried out by the support vector machines method with bootstrapping, which allowed us to 1) rank the genes that were most important for classification quality and appeared most frequently in the classifiers (bootstrap-based feature ranking, BBFR) and 2) rank the samples, and thus detect cases that were most difficult to classify (bootstrap-based outlier detection). The accuracy of PTC diagnosis was 98.5% for a 20-gene classifier, with a 95% confidence interval (CI) of 95.9–100%; the lower limit of the CI already exceeded 95% for five genes. Only 5 of 180 samples (2.8%) were misclassified in more than 10% of bootstrap iterations. We specified 43 genes which are most suitable as molecular markers of PTC, among them some well-known PTC markers (MET, fibronectin 1, dipeptidylpeptidase 4, and adenosine A1 receptor) and potential new ones (UDP-galactose-4-epimerase, cadherin 16, gap junction protein 3, sushi, nidogen, and EGF-like domains 1, inhibitor of DNA binding 3, RUNX1, leiomodin 1, F-box protein 9, and tripartite motif-containing 58). The highest ranking gene, metallophosphoesterase domain-containing protein 2, achieved 96.7% of the maximum BBFR score.
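The idea of ranking genes by how often they are selected across bootstrap resamples can be sketched as follows. This is an illustration, not the authors' BBFR procedure: the abstract uses support vector machines, whereas this sketch swaps in a much simpler selector (absolute difference of class means) to stay within the standard library. The exact BBFR score is not defined in the abstract, so plain selection frequency is an assumption, as are the function names, the stratified resampling, and the toy data.

```python
import random
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def top_k_genes(samples, labels, k):
    """Stand-in selector (NOT an SVM): rank genes by the absolute
    difference between the two class means and keep the top k."""
    n_genes = len(samples[0])
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    scores = []
    for g in range(n_genes):
        diff = abs(mean([s[g] for s in pos]) - mean([s[g] for s in neg]))
        scores.append((diff, g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]


def bootstrap_gene_frequency(samples, labels, k, iterations, seed=0):
    """Fraction of bootstrap iterations in which each gene is selected."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    counts = Counter()
    for _ in range(iterations):
        boot_samples, boot_labels = [], []
        for y, members in by_class.items():
            # Assumption: resample within each class (with replacement)
            # so that both classes are present in every iteration.
            drawn = rng.choices(members, k=len(members))
            boot_samples.extend(drawn)
            boot_labels.extend([y] * len(drawn))
        counts.update(top_k_genes(boot_samples, boot_labels, k))
    n_genes = len(samples[0])
    return [counts[g] / iterations for g in range(n_genes)]


# Toy data: gene 0 separates the classes perfectly, genes 1 and 2 are noise.
samples = [
    [10.0, 0.2, 0.9],
    [10.0, 0.8, 0.1],
    [10.0, 0.5, 0.4],
    [0.0, 0.3, 0.6],
    [0.0, 0.7, 0.2],
    [0.0, 0.4, 0.5],
]
labels = [1, 1, 1, 0, 0, 0]
freq = bootstrap_gene_frequency(samples, labels, k=1, iterations=50)
print(freq)
```

The same loop could also record which held-out samples are misclassified in each iteration, which is the intuition behind the bootstrap-based outlier detection described above.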
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms for three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and may have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
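To make the notion of non-dominated solutions concrete, the sketch below filters a set of candidate clusterings down to its Pareto front. It rests on assumptions not stated in the abstract: each candidate is summarised by a hypothetical `(compactness, separation)` pair, a lower compactness value is taken to be better, and a higher separation value is taken to be better. The genetic search and the SVM-based combination step are not shown.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on both objectives and
    strictly better on one. Each solution is (compactness, separation);
    assumed here: lower compactness and higher separation are better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical objective values for five candidate clusterings.
candidates = [(1.0, 5.0), (2.0, 6.0), (1.5, 4.0), (3.0, 6.0), (0.8, 3.0)]
front = non_dominated(candidates)
print(front)
```

No member of the resulting front is better than another on both objectives at once, which is why a further step is needed to turn the set into one final clustering.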
Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains a formidable challenge. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses an evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
Molecular diagnostics; biomarkers; prostate cancer; evolutionary algorithm; microarray profiling
Many gene-set analysis methods have been previously proposed and compared through simulation studies and analysis of real datasets for binary phenotypes. We focused on the survival phenotype and compared the performances of Gene Set Enrichment Analysis (GSEA), Global Test (GT), Wald-type Test (WT) and Global Boost Test (GBST) methods in a simulation study and on two ovarian cancer data sets. We considered two versions of GSEA by allowing different weights: GSEA1 uses equal weights, yielding results similar to the Kolmogorov-Smirnov test; while GSEA2's weights are based on the correlation between genes and the phenotype.
We compared GSEA1, GSEA2, GT, WT and GBST in a simulation study with various settings for the correlation structure of the genes and the association parameter between the survival outcome and the genes. Simulation results indicated that GT, WT and GBST consistently have higher power than GSEA1 and GSEA2 across all scenarios. However, the power of the five tests depends on the combination of correlation structure and association parameter. For the ovarian cancer data set, using the FDR threshold of q < 0.1, the GT, WT and GBST detected 12, 6 and 8 significant pathways, respectively, whereas neither GSEA1 nor GSEA2 detected any significant pathways. In addition, among the pathways found significant by GT, WT, and GBST, three pathways - Purine metabolism, Leukocyte transendothelial migration and Jak-STAT signaling pathway - overlapped with those reported in previous ovarian cancer microarray studies.
Simulation studies and a real data example indicate that GT, WT and GBST tend to have high power, whereas GSEA1 and GSEA2 have lower power. We also found that the power of the five tests is much higher when genes are correlated than when genes are independent, when survival is positively associated with genes. It seems that there is a synergistic effect in detecting significant gene sets when significant genes have within-class correlation and the association between survival and genes is positive or negative (i.e., one-direction correlation).
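The difference between the two GSEA weightings can be illustrated with the usual running-sum enrichment score. This is a sketch under assumptions: it follows the standard GSEA formulation, in which an exponent `p` controls the weights (`p = 0` gives equal weights, `p = 1` weights each hit by the absolute gene-phenotype correlation). The abstract does not state that GSEA1 and GSEA2 were implemented exactly this way for the survival phenotype. The gene names and correlation values are invented for illustration.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes : gene identifiers sorted by correlation with the phenotype
    correlations : the matching correlation values, in the same order
    gene_set     : identifiers belonging to the set under test
    p            : weight exponent (0 = equal weights, 1 = |correlation|)
    """
    members = set(gene_set)
    hit_weights = [abs(r) ** p if g in members else 0.0
                   for g, r in zip(ranked_genes, correlations)]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g in ranked_genes if g not in members)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one gene outside the set")

    running, best = 0.0, 0.0
    for g, w in zip(ranked_genes, hit_weights):
        if g in members:
            running += w / total_hit   # step up on a hit
        else:
            running -= 1.0 / n_miss    # step down on a miss
        if abs(running) > abs(best):
            best = running             # maximum deviation from zero
    return best


# Hypothetical ranked list and gene set.
ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
correlations = [0.9, 0.6, 0.3, -0.1, -0.4, -0.8]
gene_set = {"g1", "g2", "g5"}

es_equal = enrichment_score(ranked_genes, correlations, gene_set, p=0.0)
es_weighted = enrichment_score(ranked_genes, correlations, gene_set, p=1.0)
print(round(es_equal, 4), round(es_weighted, 4))
```

With equal weights every hit moves the running sum by the same amount, as in a Kolmogorov-Smirnov-type statistic; with correlation weights, strongly associated genes near the top of the list contribute more, so the two scores differ for the same gene set.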
Tumor-associated macrophages (TAMs) may play an important role in tumor immunity. We studied the activation state of TAMs in cutaneous squamous cell carcinoma (SCC), the second most common human cancer. CD163 was identified as a more abundant, sensitive, and accurate marker of TAMs, compared to CD68. CD163+ TAMs produced pro-tumoral factors MMP9 and MMP11, at the gene and protein levels. Gene set enrichment analysis (GSEA) was used to evaluate M1 and M2 macrophage gene sets in the SCC genes and to identify candidate genes in order to phenotypically characterize TAMs. There was co-expression of CD163 and alternatively activated “M2” markers, CD209 and CCL18. There was enrichment for classically activated “M1” genes in SCC, which was confirmed in situ by co-localization of CD163 and phosphorylated STAT1, IL-23p19, IL-12/IL-23p40, and CD127. Also, a subset of TAMs in SCC was bi-activated as CD163+ cells expressed markers for both M1 and M2, shown by triple-label immunofluorescence. These data support heterogeneous activation states of TAMs in SCC, and suggest that a dynamic model of macrophage activation would be more useful to characterize TAMs.
cutaneous squamous cell carcinoma; CD163; macrophages; skin
Tumour-initiating cells (TICs) or cancer stem cells can exist as a small population in malignant tissues. The signalling pathways activated in TICs that contribute to tumourigenesis are not fully understood.
Several breast cancer cell lines were sorted with CD24 and CD44, known markers for enrichment of breast cancer TICs. Tumourigenesis was analysed using sorted cells and total RNA was subjected to gene expression profiling and gene set enrichment analysis (GSEA).
We showed that several breast cancer cell lines have a small population of CD24−/low/CD44+ cells in which TICs may be enriched, and confirmed the properties of TICs in a xenograft model. GSEA revealed that CD24−/low/CD44+ cell populations are enriched for genes involved in transforming growth factor-β, tumour necrosis factor, and interferon response pathways. Moreover, we found the presence of nuclear factor-κB (NF-κB) activity in CD24−/low/CD44+ cells, which was previously unrecognised. In addition, NF-κB inhibitor dehydroxymethylepoxyquinomicin (DHMEQ) prevented tumourigenesis of CD24−/low/CD44+ cells in vivo.
Our findings suggest that signalling pathways identified using GSEA help to identify molecular targets and biomarkers for TIC-like cells.
tumour-initiating cells; NF-κB; CD24; CD44; gene expression profiling; DHMEQ
Breast cancers can be classified by hierarchical clustering using an “intrinsic” gene list into one of at least five molecular subtypes: basal-like, HER2, luminal A, luminal B, and normal breast-like. Five different intrinsic gene lists composed of varying numbers of genes have been used for molecular subtype identification and classification of breast cancers. The aim of this study was to determine the objectivity and interobserver reproducibility of the assignment of molecular subtype classes by hierarchical cluster analysis.
Three publicly available breast cancer datasets (n = 779) were subjected to two-way average-linkage hierarchical cluster analysis using five distinct intrinsic gene lists. We used free-marginal Kappa statistics to analyze interobserver agreement among five breast cancer researchers for the whole classification and for each molecular subtype separately according to each intrinsic gene list for each breast cancer dataset.
None of the classification systems tested produced almost perfect agreement (Kappa ≥ 0.81) among observers. However, substantial interobserver agreement (70.8% to 76.1% of the samples and free-marginal Kappa scores from 0.635 to 0.701) was consistently observed in all datasets for four molecular subtypes (luminal, basal-like, HER2, and normal breast-like). When luminal cancers were subdivided (luminal A, B, and C), none of the classification systems produced substantial agreement (Kappa ≥ 0.61) in all the datasets analyzed. Analysis of each subtype separately revealed that only two (basal-like and HER2) could be reproducibly identified by independent observers (Kappa ≥ 0.81).
Assignment of molecular subtype classes of breast cancer based on the analysis of dendrograms obtained with hierarchical cluster analysis is subjective and shows modest interobserver reproducibility. For the development of a molecular taxonomy, objective definitions for each molecular subtype and standardized methods for their identification are required.
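The free-marginal Kappa statistic used in this study can be sketched as follows, assuming the multirater form in which chance agreement is fixed at 1/k for k categories: kappa = (P_o - 1/k) / (1 - 1/k), where P_o is the mean proportion of agreeing observer pairs per sample. The function name and the example ratings are hypothetical and are not data from the study.

```python
def free_marginal_kappa(ratings, n_categories):
    """Free-marginal multirater kappa.

    ratings      : one list per sample, holding one label per observer
    n_categories : number of categories observers could choose from
    """
    agreement_sum = 0.0
    for labels in ratings:
        n = len(labels)
        if n < 2:
            raise ValueError("each sample needs at least two observers")
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        # Ordered pairs of observers that chose the same category.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n * (n - 1))
    observed = agreement_sum / len(ratings)
    expected = 1.0 / n_categories  # chance agreement with free marginals
    return (observed - expected) / (1.0 - expected)


# Hypothetical subtype calls by five observers for four samples.
ratings = [
    ["basal"] * 5,
    ["HER2"] * 5,
    ["luminal"] * 4 + ["normal"],
    ["luminal", "luminal", "luminal", "normal", "normal"],
]
kappa = free_marginal_kappa(ratings, n_categories=4)
print(round(kappa, 3))
```

In this toy example the two samples with unanimous calls contribute full agreement, while the two split samples pull the observed agreement down to 0.75 and the kappa to about 0.667, which would fall in the "substantial" band (Kappa ≥ 0.61) but below the "almost perfect" band (Kappa ≥ 0.81) quoted in the abstract.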