Finding reliable gene markers for accurate disease classification is very challenging due to a number of reasons, including the small sample size of typical clinical data, high noise in gene expression measurements, and the heterogeneity across patients. In fact, gene markers identified in independent studies often do not coincide with each other, suggesting that many of the predicted markers may have no biological significance and may be simply artifacts of the analyzed dataset. To find more reliable and reproducible diagnostic markers, several studies proposed to analyze the gene expression data at the level of groups of functionally related genes, such as pathways. Studies have shown that pathway markers tend to be more robust and yield more accurate classification results. One practical problem of the pathway-based approach is the limited coverage of genes by currently known pathways. As a result, potentially important genes that play critical roles in cancer development may be excluded. To overcome this problem, we propose a novel method for identifying reliable subnetwork markers in a human protein-protein interaction (PPI) network.
In this method, we overlay the gene expression data with the PPI network and look for the most discriminative linear paths that consist of discriminative genes that are highly correlated to each other. The overlapping linear paths are then optimally combined into subnetworks that can potentially serve as effective diagnostic markers. We tested our method on two independent large-scale breast cancer datasets and compared the effectiveness and reproducibility of the identified subnetwork markers with gene-based and pathway-based markers. We also compared the proposed method with an existing subnetwork-based method.
The proposed method can efficiently find reliable subnetwork markers that outperform the gene-based and pathway-based markers in terms of discriminative power, reproducibility and classification performance. Subnetwork markers found by our method are highly enriched in common GO terms, and they can more accurately classify breast cancer metastasis compared to markers found by a previous method.
One important problem in translational genomics is the identification of reliable and reproducible markers that can be used to discriminate between different classes of a complex disease, such as cancer. The typical small sample setting makes the prediction of such markers very challenging, and various approaches have been proposed to address this problem. For example, it has been shown that pathway markers, which aggregate the gene activities in the same pathway, tend to be more robust than gene markers. Furthermore, the use of gene expression ranking has been shown to be robust to batch effects and to lead to more interpretable results. In this paper, we propose an enhanced pathway activity inference method that uses gene ranking to predict the pathway activity in a probabilistic manner. The main focus of this work is on identifying robust pathway markers that can ultimately lead to robust classifiers with reproducible performance across datasets. Simulation results based on multiple breast cancer datasets show that the proposed inference method identifies better pathway markers that can predict breast cancer metastasis with higher accuracy. Moreover, the identified pathway markers can lead to better classifiers with more consistent classification performance across independent datasets.
We developed PathAct, a novel method for pathway analysis to investigate the biological and clinical implications of the gene
expression profiles. The advantage of PathAct in comparison with the conventional pathway analysis methods is that it can
estimate pathway activity levels for individual patients quantitatively in the form of a pathway-by-sample matrix. This matrix can
be used for further analysis such as hierarchical clustering and other analysis methods. To evaluate the feasibility of PathAct,
comparison with frequently used gene-enrichment analysis methods was conducted using two public microarray datasets. The
dataset #1 was that of breast cancer patients, and we investigated pathways associated with triple-negative breast cancer by
PathAct, compared with those obtained by gene set enrichment analysis (GSEA). The dataset #2 was another breast cancer dataset
with disease-free survival (DFS) of each patient. Contribution by each pathway to prognosis was investigated by our method as
well as the Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis. In the dataset #1, four out of the six
pathways that satisfied p < 0.05 and FDR < 0.30 by GSEA were also included in those obtained by the PathAct method. For the
dataset #2, two pathways (“Cell Cycle” and “DNA replication”) out of four pathways by PathAct were commonly identified by
DAVID analysis. Thus, we confirmed a good degree of agreement among PathAct and conventional methods. Moreover, several
applications of further statistical analyses such as hierarchical cluster analysis by pathway activity, correlation analysis and
survival analysis between pathways were conducted.
Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.
breast cancer metastasis; classification; protein networks; pathways; microarrays
An estimated 12% of females in the United States will develop breast cancer in their lifetime. Although there have been advances in treatment options, including surgery and chemotherapy, breast cancer is still the second most lethal cancer in women. Thus, there is a clear need for better methods to predict prognosis for each breast cancer patient. With the advent of large genetic databases and the reduction in cost for the experiments, researchers are faced with choosing from a large pool of potential prognostic markers from numerous breast cancer gene expression profile studies.
Five microarray datasets related to breast cancer were examined using gene set analysis and the cancers were categorized into different subtypes using a scoring system based on genetic pathway activity.
We have observed that significant genes in the individual studies show little reproducibility across the datasets. From our comparative analysis, using gene pathways with clinical variables is more reliable across studies and shows promise in assessing a patient's prognosis.
This study concludes that, in light of clinical variables, there are significant gene pathways in common across the datasets. Specifically, several pathways can further significantly stratify patients for survival. These candidate pathways should help to develop a panel of significant biomarkers for the prognosis of breast cancer patients in a clinical setting.
Classification of cancers based on gene expressions produces better accuracy
when compared to that of the clinical markers. Feature selection improves
the accuracy of these classification algorithms by reducing the chance
of overfitting that happens due to the large number of features. We develop a
new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods,
our method integrates signaling and gene regulatory pathways with gene
expression data to minimize the chance of overfitting of the method and to
improve the test accuracy. Thus, BPFS selects a biologically meaningful feature
set that is minimally redundant. Our experiments on published breast
cancer datasets demonstrate that all of the top 20 genes found by our method
are associated with cancer. Furthermore, the classification accuracy of our
signature is up to 18% better than that of van 't Veer's 70-gene signature,
and it is up to 8% better than that of the best published feature selection
With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods.
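The log-likelihood-ratio idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each gene's expression is Gaussian within each phenotype, that the per-gene ratios are combined by simple summation, and that there are exactly two phenotypes labeled 0 and 1. The function names (`fit_gene_models`, `pathway_activity`) are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_gene_models(train, labels):
    """Estimate a (mean, sd) pair per gene and phenotype.

    train:  dict mapping gene -> list of expression values, one per sample
    labels: list of phenotype labels (0 or 1), one per sample
    Assumption: at least two samples per phenotype, Gaussian class-conditionals.
    """
    models = {}
    for gene, values in train.items():
        params = []
        for phenotype in (0, 1):
            xs = [v for v, y in zip(values, labels) if y == phenotype]
            # Floor the standard deviation to avoid division by zero.
            params.append((mean(xs), max(stdev(xs), 1e-6)))
        models[gene] = params
    return models


def pathway_activity(sample, pathway_genes, models):
    """Sum of per-gene log-likelihood ratios (phenotype 1 vs. phenotype 0).

    sample: dict mapping gene -> expression value for one sample
    Summation is an assumed way of 'combining' the ratios.
    """
    total = 0.0
    for gene in pathway_genes:
        (mu0, sd0), (mu1, sd1) = models[gene]
        x = sample[gene]
        total += gaussian_logpdf(x, mu1, sd1) - gaussian_logpdf(x, mu0, sd0)
    return total
```

A positive activity suggests the pathway's member genes look more like phenotype 1 in that sample, a negative one more like phenotype 0; the resulting pathway activities could then be fed to any standard classifier.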
Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equally. It is well-known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach to incorporate gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and the logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
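To make the notion of appearance frequency concrete, the sketch below counts how many gene sets each gene appears in and turns the count into a weight that down-weights ubiquitous genes. The weighting formula is an assumption borrowed from the inverse document frequency used in information retrieval; the abstract does not state the formula actually used in GSEA-AF or LRpath-AF, and the function names and gene labels here are hypothetical.

```python
import math
from collections import Counter


def appearance_frequency(gene_sets):
    """Count the number of gene sets in which each gene appears.

    gene_sets: dict mapping set name -> iterable of gene identifiers
    """
    counts = Counter()
    for genes in gene_sets.values():
        counts.update(set(genes))  # count each gene at most once per set
    return counts


def idf_weights(gene_sets):
    """Assumed IDF-style weight: 1 + ln(number of sets / appearance count).

    A gene present in every set gets weight 1; rarer genes get larger weights.
    """
    n_sets = len(gene_sets)
    counts = appearance_frequency(gene_sets)
    return {gene: 1.0 + math.log(n_sets / count) for gene, count in counts.items()}
```

Weights of this kind could then multiply the per-gene statistics before an enrichment score is computed, so that a specialized gene contributes more than a gene shared by many pathways.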
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer related gene sets achieved higher NES in GSEA-AF as well. The same datasets were also analyzed by LRpath and LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were also analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced with the integration of appearance frequency of genes. We conclude that, generally, gene set analysis methods with the integration of information from KEGG PATHWAY perform better both statistically and biologically.
Gene set enrichment analysis (GSEA) associates gene sets and phenotypes; its use is predicated on the choice of a pre-defined collection of sets. The de facto standard implementation of GSEA provides seven collections, yet there are no guidelines for the choice of collections and the impact of such choice, if any, is unknown. Here we compare each of the standard gene set collections in the context of a large dataset of drug response in human cancer cell lines. We define and test a new collection based on gene co-expression in cancer cell lines to compare the performance of the standard collections to an externally derived cell line based collection. The results show that GSEA findings vary significantly depending on the collection chosen for analysis. Henceforth, collections should be carefully selected and reported in studies that leverage GSEA.
Renal cell carcinoma (RCC) is the most lethal type of cancer in the urinary system and often presents as a metastatic disease. Furthermore, there are no effective treatments for the disease. Several studies based on gene expression profiling have been performed with the aim of gaining insights into the pathogenesis of RCC; however, few studies have investigated RCC at the pathway level to search for the possible pathways involved in clear cell RCC (CCRCC). In this study, gene set enrichment analysis (GSEA) was conducted on microarray datasets from CCRCC tissue. DAVID functional enrichment analysis was performed based on the dysregulated genes that were identified in a meta-analysis performed on the microarray datasets from CCRCC tissue. In GSEA, 17 down- and 12 upregulated pathways coexisted in six datasets. The majority of the upregulated pathways were associated with the immune system. In addition, 32 dysregulated pathways were obtained from DAVID functional enrichment analysis, based on the abnormal genes identified by meta-analysis. This study demonstrated that cross-GSEA is a useful method for exploring the critical pathways involved in CCRCC; however, an individual dataset with a small sample may introduce bias. A cross-GSEA based on certain well-designed datasets may be required to further the progress made in this study, following the analysis of its results.
clear cell renal cell carcinoma; pathway analysis; meta-analysis; gene expression
The reliability and reproducibility of gene biomarkers for classification of cancer patients have been challenged due to measurement noise and biological heterogeneity among patients. In this paper, we propose a novel module-based feature selection framework, which integrates biological network information and gene expression data to identify biomarkers not as individual genes but as functional modules. Results from four breast cancer studies demonstrate that the identified module biomarkers i) achieve higher classification accuracy in independent validation datasets; ii) are more reproducible than individual gene markers; iii) improve the biological interpretability of results; and iv) are enriched in cancer “disease drivers”.
Cancer biomarkers; systems biology; feature selection; disease classification
Breast cancers lacking the estrogen receptor (ER) can be distinguished from other breast cancers on the basis of poor prognosis, high grade, distinctive histopathology and unique molecular signatures. These features further distinguish estrogen receptor negative (ER−) tumor subtypes, but targeted therapy is currently limited to tumors over-expressing the ErbB2 receptor.
To uncover the pathways against which future therapies could be developed we undertook a meta-analysis of gene expression from five large microarray datasets relative to ER status. A measure of association with ER status was calculated for every Affymetrix HG-U133A probe set and the pathways that distinguished ER− tumors were defined by testing for enrichment of biologically defined gene sets using Gene Set Enrichment Analysis (GSEA). As expected, the expression of the direct transcriptional targets of the ER was muted in ER− tumors, but the expression of genes indirectly regulated by estrogen was enhanced. We also observed enrichment of independent MYC- and E2F-driven transcriptional programs. We used a cell model of estrogen and MYC action to define the interaction between estrogen and MYC transcriptional activity in breast cancer. We found that the basal subgroup of ER− breast cancer showed a strong MYC transcriptional response that reproduced the indirect estrogen response seen in estrogen receptor positive (ER+) breast cancer cells.
Increased transcriptional activity of MYC is a characteristic of basal breast cancers where it mimics a large part of an estrogen response in the absence of the ER, suggesting a mechanism by which these cancers achieve estrogen-independence and providing a potential therapeutic target for this poor prognosis sub group of breast cancer.
Gene expression-based prostate cancer gene signatures of poor prognosis are hampered by lack of gene feature reproducibility and a lack of understandability of their function. Molecular pathway-level mechanisms are intrinsically more stable and more robust than an individual gene. The Functional Analysis of Individual Microarray Expression (FAIME) we developed allows distinctive sample-level pathway measurements with utility for correlation with continuous phenotypes (e.g. survival). Further, we and others have previously demonstrated that pathway-level classifiers can be as accurate as gene-level classifiers using curated genesets that may implicitly comprise ascertainment biases (e.g. KEGG, GO). Here, we hypothesized that transformation of individual prostate cancer patient gene expression to pathway-level mechanisms derived from automated high throughput analyses of genomic datasets may also permit personalized pathway analysis and improve prognosis of recurrent disease.
Via FAIME, three independent prostate gene expression arrays with both normal and tumor samples were transformed into two distinct types of molecular pathway mechanisms: (i) the curated Gene Ontology (GO) and (ii) dynamic expression activity networks of cancer (Cancer Modules). FAIME-derived mechanisms for tumorigenesis were then identified and compared. Curated GO and computationally generated "Cancer Module" mechanisms overlap significantly and are enriched for known oncogenic deregulations and highlight potential areas of investigation. We further show in two independent datasets that these pathway-level tumorigenesis mechanisms can identify men who are more likely to develop recurrent prostate cancer (log-rank_p = 0.019).
Curation-free biomodules classification derived from congruent gene expression activation breaks from the paradigm of recapitulating the known curated pathway mechanism universe.
Pathway analysis of large-scale omics data assists us with the examination of the cumulative effects of multiple functionally related genes, which are difficult to detect using the traditional single gene/marker analysis. So far, most of the genomic studies have been conducted in a single domain, e.g., by genome-wide association studies (GWAS) or microarray gene expression investigation. A combined analysis of disease susceptibility genes across multiple platforms at the pathway level is an urgent need because it can reveal more reliable and more biologically important information.
We performed an integrative pathway analysis of a GWAS dataset and a microarray gene expression dataset in prostate cancer. We obtained a comprehensive pathway annotation set from knowledge-based public resources, including KEGG pathways and the prostate cancer candidate gene set, and gene sets specifically defined based on cross-platform information. By leveraging this pathway collection, we first searched for significant pathways in the GWAS dataset using four methods, which represent two broad groups of pathway analysis approaches. The significant pathways identified by each method varied greatly, but the results were more consistent within each method group than between groups. Next, we conducted a gene set enrichment analysis of the microarray gene expression data and found 13 pathways with cross-platform evidence, including "Fc gamma R-mediated phagocytosis" (PGWAS = 0.003, Pexpr < 0.001, and Pcombined = 6.18 × 10⁻⁸), "regulation of actin cytoskeleton" (PGWAS = 0.003, Pexpr = 0.009, and Pcombined = 3.34 × 10⁻⁴), and "Jak-STAT signaling pathway" (PGWAS = 0.001, Pexpr = 0.084, and Pcombined = 8.79 × 10⁻⁴).
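The abstract does not say how the GWAS and expression P values were combined. One common choice for combining independent P values is Fisher's method, and the sketch below assumes it purely for illustration; the function name `fisher_combined_p` is hypothetical. Under Fisher's method the statistic X = -2 Σ ln p follows a chi-square distribution with 2k degrees of freedom for k independent tests, and for even degrees of freedom the tail probability has a closed form that needs only the standard library.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (an assumed choice).

    Uses the closed-form chi-square survival function for 2k degrees of freedom:
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

For the rounded inputs 0.003 and 0.009 this sketch gives roughly 3.1 × 10⁻⁴, which is close to but not identical with the reported combined value, so the authors may have used unrounded inputs or a different combination rule.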
Our results provide evidence at both the genetic variation and expression levels that several key pathways might have been involved in the pathological development of prostate cancer. Our framework that employs gene expression data to facilitate pathway analysis of GWAS data is not only feasible but also much needed in studying complex disease.
Morphologic features of tumour cells have long been validated for the clinical classification of breast cancers and are regularly used as a “gold standard” to ascertain prognostic outcome in patients. Identification of molecular markers such as expression of the receptors for estrogen (er) and progesterone (pgr) and the human epidermal growth factor receptor 2 (her2) has played an important role in determining targets for the development of efficacious drugs for treatment and has also offered additional predictive value for the therapeutic assessment of patients with breast cancer. More recent technical advancements in identifying several cancer-related genes have provided further opportunities to identify specific subtypes of breast cancer. Among the subtypes, tumours with triple-negative cells are identified using specific staining procedures for basal markers such as cytokeratin 5 and 6 and the absence of er, pgr, and her2 expression. Patients with triple-negative breast cancers therefore have the disadvantage of not benefiting from currently available receptor-targeted systemic therapy. Optimal conditions for the therapeutic assessment of women with triple-negative breast tumours and for the management of their disease have yet to be validated in prospective investigations. The present review discusses the differences between triple-negative breast tumours and basal-like breast tumours and also the role of mutations in the BRCA genes. Attention is also paid to treatment options available to patients with triple-negative breast tumours.
Triple-negative breast tumours; epidermal growth factor receptor; chemotherapy
Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These have produced gene lists that have little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process.
We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of genes in the sets are aggregated, using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings via bootstrap replications within datasets, produced more stable classifiers across different datasets, and are potentially more interpretable in the biological context since they examine gene expression in the context of their neighbouring genes in the pathway. In addition, we performed this analysis in each breast cancer molecular subtype, based on ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology based on previous analysis of individual genes.
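As an illustration of what a set statistic can look like, the sketch below collapses a gene-by-sample expression table into a set-by-sample feature table. The choice of the mean (or median) as the aggregate is an assumption, since the abstract does not name the statistic used, and the function name `set_statistic` is hypothetical.

```python
from statistics import mean, median


def set_statistic(expression, gene_sets, aggregate=mean):
    """Aggregate gene-level expression into one feature per gene set and sample.

    expression: dict mapping gene -> list of expression values, one per sample
    gene_sets:  dict mapping set name -> list of gene identifiers
    aggregate:  function reducing a list of numbers to one number (assumed: mean)
    Genes missing from the expression table are ignored; sets with no
    measured genes are dropped.
    """
    n_samples = len(next(iter(expression.values())))
    features = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in expression]
        if not present:
            continue
        features[name] = [
            aggregate([expression[g][j] for g in present]) for j in range(n_samples)
        ]
    return features
```

Passing `aggregate=median` gives a statistic that is less sensitive to a single outlying gene; either way, the resulting set-level features can replace individual genes as classifier inputs.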
To date, most analyses of gene expression data have focused at the level of the individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce the noise and could provide increased insight into the underlying biological pathways.
Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen κ and Cramer V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided.
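The two agreement measures named above have simple textbook definitions, sketched here for reference. Cohen's κ compares the observed agreement between two classifiers with the agreement expected by chance, and Cramér's V rescales the chi-square statistic of a contingency table to the range 0 to 1. This is a plain-Python sketch with hypothetical function names, not the code used in the study, and it omits the bias corrections and significance tests a real analysis would add.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


def cramers_v(xs, ys):
    """Cramer's V between two categorical variables (uncorrected).

    Requires each variable to take at least two distinct values.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k))
```

A κ of 1 means the two classifiers assign identical subtypes and 0 means agreement no better than chance; V behaves similarly for the association between a subtype call and a clinical variable.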
SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55), and yielded similar strong prognostic value.
Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
Primary tumor gene expression has been shown to be able to reveal metastasis-driving gene markers for the prediction of breast cancer recurrence (BCR). However, difficulties associated with the analysis of microarray data have led to poor predictive power and inconsistency in previously introduced gene signatures. In this study, a hybrid method was proposed for identifying more predictive gene signatures from microarray datasets. Initially, the parameters of a Rough-Set (RS) theory based feature selection method were tuned to construct a customized gene extraction algorithm. Afterward, using the RS gene selection method, the most informative genes were selected from six independent breast cancer datasets. Then, the combined set of these six signature sets, containing 114 genes, was evaluated for prediction of BCR. Finally, a meta-signature containing 18 genes was selected from the combination of datasets, and its prediction accuracy was compared to that of the combined signature. The results of a 10-fold cross-validation test showed an acceptable misclassification error rate (MCR) over 1338 breast cancer cases. In comparison to a recent similar work, our approach reached more than 5% reduction in MCR using fewer genes for prediction. The results also demonstrated a 7% improvement in average accuracy across the six datasets used, using the combined set of 114 genes in comparison with the 18-gene meta-signature. In this study, a more informative gene signature was selected for prediction of BCR using an RS-based gene extraction algorithm. To conclude, combining different signatures demonstrated more stable prediction over independent datasets.
Breast cancer recurrence prediction; gene expression signature; meta-signature; rough-set theory
As the premalignant lesion of human esophageal adenocarcinoma (EAC), Barrett’s esophagus (BE) is characterized by intestinal metaplasia in the normal esophagus (NE). Gene expression profiling may help us understand the potential molecular mechanism of human BE.
We analyzed three microarray datasets (two cDNA arrays and one oligonucleotide array) and one SAGE dataset with SAM and SAGE(Poisson) to identify individual genes differentially expressed in BE. GSEA was used to identify a priori defined sets of genes that were differentially expressed. These gene sets were either grouped according to certain signaling pathways (GSEA curated), or the presence of consensus binding sequences of known transcription factors (GSEA motif). Immunohistochemical staining (IHC) was used to validate differential gene expression.
Both SAM and SAGE(Poisson) identified 68 differentially expressed genes (55 BE genes and 13 NE genes) with an arbitrary cutoff ratio (≥4 fold). With IHC on matched pairs of NE and BE tissues from 6 patients, these genes were grouped into 6 categories: Category I (25 genes only expressed in BE), Category II (5 genes only expressed in NE), Category III (8 genes expressed more in BE than in NE), and Category IV (2 genes expressed more in NE than in BE). Differential expression of the remaining genes was not confirmed by IHC either due to false discovery (Category V), or lack of proper antibodies (Category VI). Besides individual genes, the TGFβ pathway and several transcription factors (CDX2, HNF1, and HNF4) were identified by GSEA as enriched pathways and motifs in BE. Apart from 9 target genes known to be up-regulated in BE, IHC staining confirmed up-regulation of 19 additional CDX1 and CDX2 target genes in BE.
Our data suggested an important role of CDX1 and CDX2 in the development of BE. The IHC-confirmed gene list may lead to future studies on the molecular mechanism of BE.
Barrett’s esophagus; intestinal metaplasia; expression profile; SAM; GSEA
Colon cancer patients with the same stage show diverse clinical behavior due to tumor heterogeneity. We aimed to discover distinct classes of tumors based on microarray expression patterns, to analyze whether the molecular classification correlated with the histopathological stages or other clinical parameters and to study differences in the survival.
Hierarchical clustering was performed for class discovery in 88 colon tumors (stages I to IV). Pathway analysis and correlations between clinical parameters and our classification were analyzed. Tumor subtypes were validated using an external set of 78 patients. A 167-gene signature associated with the main subtype was generated using the 3-Nearest-Neighbor method. Coincidences with other prognostic predictors were assessed.
Hierarchical clustering identified four robust tumor subtypes with biologically and clinically distinct behavior. Stromal components (p < 0.001), nuclear β-catenin (p = 0.021), mucinous histology (p = 0.001), microsatellite-instability (p = 0.039) and BRAF mutations (p < 0.001) were associated with this classification, but it was independent of Dukes stages (p = 0.646). Molecular subtypes were established from stage I. High-stroma-subtype showed increased levels of genes and altered pathways distinctive of tumour-associated-stroma and components of the extracellular matrix in contrast to Low-stroma-subtype. Mucinous-subtype was reflected by the increased expression of trefoil factors and mucins as well as by a higher proportion of MSI and BRAF mutations. Tumor subtypes were validated using an external set of 78 patients. A 167-gene signature associated with the Low-stroma-subtype distinguished low risk patients from high risk patients in the external cohort (Dukes B and C: HR = 8.56 (2.53-29.01); Dukes B, C and D: HR = 1.87 (1.07-3.25)). Eight different reported survival gene signatures segregated our tumors into two groups: the Low-stroma-subtype and the other tumor subtypes.
We have identified novel molecular subtypes in colon cancer with distinct biological and clinical behavior that are established from the initiation of the tumor. The tumor microenvironment is important for the classification and for the malignant potential of the tumor. Differential gene sets and biological pathways characterize each tumor subtype, reflecting underlying mechanisms of carcinogenesis that may be used for the selection of targeted therapeutic procedures. This classification may contribute to improved management of patients with CRC and to a more comprehensive prognosis.
Colon cancer; Microarray gene expression; Molecular classification; Stroma; Survival
Human breast tumors are heterogeneous and consist of phenotypically diverse cells. Breast cancer cells with a CD44+/CD24- phenotype have been suggested to have tumor-initiating properties with stem cell-like and invasive features, although it is unclear whether their presence within a tumor has clinical implications. There is also a large heterogeneity between tumors, illustrated by reproducible stratification into various subtypes based on gene expression profiles or histopathological features. We have explored the prevalence of cells with different CD44/CD24 phenotypes within breast cancer subtypes.
Double-staining immunohistochemistry was used to quantify CD44 and CD24 expression in 240 human breast tumors for which information on other tumor markers and clinical characteristics was available. Gene expression data were also accessible for a cohort of the material.
Considerable heterogeneity in CD44 and CD24 expression was seen both between and within tumors. A complete lack of both proteins was evident in 35% of the tumors, while 13% contained cells of more than one of the CD44+/CD24-, CD44-/CD24+ and CD44+/CD24+ phenotypes. CD44+/CD24- cells were detected in 31% of the tumors, ranging in proportion from only a few to close to 100% of tumor cells. The CD44+/CD24- phenotype was most common in the basal-like subgroup (characterized as negative for the estrogen and progesterone receptors as well as for HER2, and as positive for cytokeratin 5/14 and/or epidermal growth factor receptor) and was particularly common in BRCA1 hereditary tumors, of which 94% contained CD44+/CD24- cells. The CD44+/CD24- phenotype was surprisingly scarce in HER2+ tumors, which had a predominantly CD24+ status. A CD44+/CD24- gene expression signature was generated, which included CD44 and α6-integrin (CD49f) among the top-ranked overexpressed genes.
We demonstrate an association between basal-like and particularly BRCA1 hereditary breast cancer and the presence of CD44+/CD24- cells. Not all basal-like tumors and very few HER2+ tumors, however, contain CD44+/CD24- cells, emphasizing that a putative tumorigenic ability may not be confined to cells of this phenotype and that other breast cancer stem cell markers remain to be identified.
Clustering analysis of microarray data is often criticized for giving ambiguous results because of sensitivity to data perturbation or clustering techniques used. In this paper, we describe a new method based on principal component analysis and ensemble consensus clustering that avoids these problems.
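The idea behind ensemble consensus clustering can be sketched as follows. This is only an illustration under stated assumptions: the principal component analysis step is omitted, the input partitions are hypothetical label vectors from repeated clustering runs, and grouping samples by thresholded connected components is a simple stand-in rather than the procedure used in the paper. The names `consensus_matrix` and `stable_clusters` are invented for this example.

```python
def consensus_matrix(partitions):
    """Fraction of clustering runs in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[count / runs for count in row] for row in counts]


def stable_clusters(matrix, threshold=0.5):
    """Group samples linked by pairwise consensus above `threshold` (assumed rule)."""
    n = len(matrix)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        component, stack = [], [start]
        # Depth-first search over the thresholded consensus graph.
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not seen[j] and matrix[i][j] > threshold:
                    seen[j] = True
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


# Hypothetical label vectors from three perturbed clustering runs of five samples.
# Cluster ids are arbitrary per run; only co-membership matters.
partitions = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
matrix = consensus_matrix(partitions)
clusters = stable_clusters(matrix, threshold=0.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Sample 2 switches clusters in one of the three runs, so its consensus with samples 0 and 1 is 2/3 rather than 1; averaging over runs is what damps this kind of sensitivity to data perturbation.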
We illustrate the method on a public microarray dataset from 36 breast cancer patients, of whom 31 were diagnosed with at least two of three pathological stages of disease (atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC)). Our method identifies an optimum set of genes and divides the samples into stable clusters which correlate with clinical classification into Luminal, Basal-like and Her2+ subtypes. Our analysis reveals a hierarchical portrait of breast cancer progression and identifies genes and pathways for each stage, grade and subtype. An intriguing observation is that the disease phenotype is distinguishable in ADH and progresses along distinct pathways for each subtype. The genetic signature for disease heterogeneity across subtypes is greater than the heterogeneity of progression from DCIS to IDC within a subtype, suggesting that the disease subtypes have distinct progression pathways.
Our method identifies six disease subtype clusters and one normal cluster. The first split separates the normal samples from the cancer samples. Next, the cancer cluster splits into low grade (pathological grades 1 and 2) and high grade (pathological grades 2 and 3) while the normal cluster is unchanged. Further, the low grade cluster splits into two subclusters and the high grade cluster into four. The final six disease clusters are mapped into one Luminal A, three Luminal B, one Basal-like and one Her2+.
We confirm that the cancer phenotype can be identified in early stage because the genes altered in this stage progressively alter further as the disease progresses through DCIS into IDC. We identify six subtypes of disease which have distinct genetic signatures and remain separated in the clustering hierarchy. Our findings suggest that the heterogeneity of disease across subtypes is higher than the heterogeneity of the disease progression within a subtype, indicating that the subtypes are in fact distinct diseases.
Selection of novel molecular markers is an important goal of cancer genomics studies. The aim of our analysis was to apply multivariate bioinformatic tools to rank genes that are potential markers of papillary thyroid cancer (PTC) according to their diagnostic usefulness. We also assessed the accuracy of benign/malignant classification, based on gene expression profiling, for PTC. We analyzed a 180-array dataset (90 HG-U95A and 90 HG-U133A oligonucleotide arrays), which included a collection of 57 PTCs, 61 benign thyroid tumors, and 62 apparently normal tissues. Gene selection was carried out by the support vector machines method with bootstrapping, which allowed us to 1) rank the genes that were most important for classification quality and appeared most frequently in the classifiers (bootstrap-based feature ranking, BBFR); and 2) rank the samples, and thus detect cases that were most difficult to classify (bootstrap-based outlier detection). The accuracy of PTC diagnosis was 98.5% for a 20-gene classifier; its 95% confidence interval (CI) was 95.9–100%, with the lower limit of the CI already exceeding 95% for five genes. Only 5 of 180 samples (2.8%) were misclassified in more than 10% of bootstrap iterations. We specified 43 genes which are most suitable as molecular markers of PTC, among them some well-known PTC markers (MET, fibronectin 1, dipeptidylpeptidase 4, and adenosine A1 receptor) and potential new ones (UDP-galactose-4-epimerase, cadherin 16, gap junction protein 3, sushi, nidogen, and EGF-like domains 1, inhibitor of DNA binding 3, RUNX1, leiomodin 1, F-box protein 9, and tripartite motif-containing 58). The highest ranking gene, metallophosphoesterase domain-containing protein 2, achieved 96.7% of the maximum BBFR score.
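The bootstrap-based feature ranking idea can be sketched in a few lines. Note the substitution: to stay self-contained, this example scores genes by the absolute difference between class means instead of training support vector machines, so it illustrates only the resampling-and-counting logic, not the study's classifier. The data, the function names, and the parameters `top_k`, `iterations`, and `seed` are hypothetical.

```python
import random
from statistics import mean


def separation_score(values, labels):
    """Absolute difference between class means (stand-in for an SVM-derived weight)."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(positives) - mean(negatives))


def bootstrap_feature_ranking(samples, labels, top_k=1, iterations=200, seed=0):
    """Return, per gene, the fraction of bootstrap resamples in which it ranks in the top_k."""
    rng = random.Random(seed)
    n_samples, n_genes = len(samples), len(samples[0])
    selection_counts = [0] * n_genes
    completed = 0
    while completed < iterations:
        # Draw samples with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        boot_labels = [labels[i] for i in idx]
        if len(set(boot_labels)) < 2:
            continue  # resample lacks one class; draw again
        scores = [
            separation_score([samples[i][g] for i in idx], boot_labels)
            for g in range(n_genes)
        ]
        ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
        for g in ranked[:top_k]:
            selection_counts[g] += 1
        completed += 1
    return [count / iterations for count in selection_counts]


# Hypothetical toy data: gene 0 separates the classes, genes 1 and 2 do not.
samples = [
    (5.0, 1.0, 2.0), (5.2, 1.1, 2.1), (4.9, 0.9, 1.9),  # label 1
    (1.0, 1.0, 2.0), (1.1, 1.2, 2.1), (0.9, 0.8, 2.2),  # label 0
]
labels = [1, 1, 1, 0, 0, 0]

frequencies = bootstrap_feature_ranking(samples, labels)
print(frequencies)  # [1.0, 0.0, 0.0]
```

Counting how often each sample is misclassified across the same resamples would give the companion outlier ranking described in the abstract.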
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we present an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used, and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms for three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and may have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
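The notion of non-dominated solutions can be made concrete with a small sketch. It assumes each candidate clustering is summarized by two numbers, a compactness score where lower is better and a separation score where higher is better; these directions, the toy values, and the function names are assumptions for illustration and not the paper's exact objective definitions.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one.

    Each solution is (compactness, separation); lower compactness and
    higher separation are treated as better (assumed convention).
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated(solutions):
    """Keep the solutions that no other solution dominates (the Pareto front)."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical (compactness, separation) scores for five candidate clusterings.
candidates = [(1.0, 2.0), (2.0, 3.0), (2.5, 2.5), (3.0, 4.0), (1.5, 1.5)]
front = non_dominated(candidates)
print(front)  # [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
```

No member of the front is better than another on both objectives, which is why a further step, such as the SVM-based combination described above, is needed to arrive at a single final clustering.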
Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains a formidable challenge. This is in part due to the limitations of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses an evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
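To illustrate what a classifier built by "mathematical integration" of a few genes might look like, the sketch below evaluates a small expression tree over one sample's expression values. It shows only the classifier representation, not the evolutionary search that produces it. The gene names, the rule, the zero decision threshold, and the class labels are all hypothetical.

```python
import operator

# Arithmetic operators allowed at internal nodes of the tree.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, expression):
    """Recursively evaluate an expression tree for one sample.

    A tree is a gene name (str), a numeric constant, or a tuple
    (operator, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return expression[tree]  # leaf: look up the gene's expression value
    if isinstance(tree, (int, float)):
        return float(tree)       # leaf: numeric constant
    op, left, right = tree
    return OPS[op](evaluate(left, expression), evaluate(right, expression))


def classify(tree, expression):
    """Assign a class from the sign of the rule's output (assumed threshold of 0)."""
    return "cancer" if evaluate(tree, expression) > 0 else "normal"


# Hypothetical three-gene rule: (GENE_A - GENE_B) * GENE_C - 1.0
rule = ("-", ("*", ("-", "GENE_A", "GENE_B"), "GENE_C"), 1.0)

sample_1 = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 2.0}
sample_2 = {"GENE_A": 1.0, "GENE_B": 1.5, "GENE_C": 2.0}

print(classify(rule, sample_1))  # cancer
print(classify(rule, sample_2))  # normal
```

Because the rule is an explicit formula, the relationship between its genes can be read directly from the tree, which is the interpretability benefit the abstract attributes to GP classifiers.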
Molecular diagnostics; biomarkers; prostate cancer; evolutionary algorithm; microarray profiling