Numerous studies have used microarray gene expression data to extract metastasis-driving gene signatures for the prediction of breast cancer relapse. However, the accuracy and generality of previously introduced biomarkers are not sufficient for reliable use in independent datasets. This inadequacy is attributed to simple feature selection methods that ignore gene interactions because accounting for them is computationally expensive. In this study, an integrated approach with low computational cost was proposed for identifying a more predictive gene signature for the prediction of breast cancer recurrence. First, a small set of genes was selected as the primary signature by an appropriate filter feature selection (FFS) method. Then, a binary sub-class of the protein-protein interaction (PPI) network was used to expand the primary set by adding the proteins adjacent to each signature gene in the PPI network. Subsequently, the support vector machine-based recursive feature elimination (SVMRFE) method was applied to the expression levels of all the genes in the expanded set. Finally, the genes with the highest SVMRFE scores were selected as the new biomarkers. The accuracy of the final biomarkers was evaluated by classifying four breast cancer datasets, comprising 800 cases, into two cohorts of poor and good prognosis. The results of five-fold cross validation, using the support vector machine as a classifier, showed more than 13% improvement in average accuracy after modifying the primary selected signatures. Moreover, the method used in this study showed a lower computational cost than other PPI-based methods. The proposed method yielded more robust and accurate biomarkers using the PPI network, at a low computational cost. This approach could be used as a supplementary procedure in microarray studies after applying various gene selection methods.
Breast cancer; feature selection method; protein–protein interaction; recurrence prediction; support vector machine
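The network expansion step described in the abstract above (adding the PPI neighbours of each primary signature gene) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_signature`, the toy edge list, and the seed genes are all hypothetical, and the filter selection and SVMRFE steps are not shown.

```python
def expand_signature(seed_genes, ppi_edges):
    """Return the seed genes plus their direct neighbours in an undirected PPI network."""
    neighbours = {}
    for a, b in ppi_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    expanded = set(seed_genes)
    for gene in seed_genes:
        # Seeds missing from the network simply contribute no neighbours.
        expanded |= neighbours.get(gene, set())
    return expanded


# Hypothetical toy network and primary signature, for illustration only.
ppi_edges = [("ESR1", "SRC"), ("SRC", "CAV1"), ("TP53", "MDM2"), ("EGFR", "GRB2")]
seed = ["ESR1", "TP53"]
expanded = expand_signature(seed, ppi_edges)
print(sorted(expanded))
```

Only first-order neighbours are added here; the expanded set would then be passed to a wrapper method such as SVMRFE for ranking.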
Motivation: With the exponential growth of expression and protein–protein interaction (PPI) data, the frontier of research in systems biology shifts more and more to the integrated analysis of these large datasets. Of particular interest is the identification of functional modules in PPI networks, sharing common cellular function beyond the scope of classical pathways, by means of detecting differentially expressed regions in PPI networks. This requires on the one hand an adequate scoring of the nodes in the network to be identified and on the other hand the availability of an effective algorithm to find the maximally scoring network regions. Various heuristic approaches have been proposed in the literature.
Results: Here we present the first exact solution for this problem, which is based on integer-linear programming and its connection to the well-known prize-collecting Steiner tree problem from Operations Research. Despite the NP-hardness of the underlying combinatorial problem, our method typically computes provably optimal subnetworks in large PPI networks in a few minutes. An essential ingredient of our approach is a scoring function defined on network nodes. We propose a new additive score with two desirable properties: (i) it is scalable by a statistically interpretable parameter and (ii) it allows a smooth integration of data from various sources.
We apply our method to a well-established lymphoma microarray dataset in combination with associated survival data and the large interaction network of HPRD to identify functional modules by computing optimal-scoring subnetworks. In particular, we find a functional interaction module associated with proliferation over-expressed in the aggressive ABC subtype as well as modules derived from non-malignant by-stander cells.
Availability: Our software is available freely for non-commercial purposes at http://www.planet-lisa.net.
Identification of differentially expressed subnetworks from protein–protein interaction (PPI) networks has become increasingly important to our global understanding of the molecular mechanisms that drive cancer. Several methods have been proposed for PPI subnetwork identification, but the dependency among network member genes is not explicitly considered, leaving many important hub genes largely unidentified. We present a new method, based on a bagging Markov random field (BMRF) framework, to improve subnetwork identification for mechanistic studies of breast cancer. The method follows a maximum a posteriori principle to form a novel network score that explicitly considers pairwise gene interactions in PPI networks, and it searches for subnetworks with maximal network scores. To improve their robustness across data sets, a bagging scheme based on bootstrapping samples is implemented to statistically select high confidence subnetworks. We first compared the BMRF-based method with existing methods on simulation data to demonstrate its improved performance. We then applied our method to breast cancer data to identify PPI subnetworks associated with breast cancer progression and/or tamoxifen resistance. The experimental results show that not only an improved prediction performance can be achieved by the BMRF approach when tested on independent data sets, but biologically meaningful subnetworks can also be revealed that are relevant to breast cancer and tamoxifen resistance.
Expression profiling and biomarker discovery aim to provide means for tumour diagnosis, classification, prediction of therapy response and prognosis. The identification of novel markers could potentially lead to robust early detection strategies and personalized, effective breast cancer therapies that would improve patient outcome. Recent evidence supports the hypothesis that genomic expression profiling using microarray analysis is a reliable method for breast cancer classification and prognostication. However, genes do not act by themselves, nor do they have catalytic or signalling capabilities. Hence, genetic biomarker information alone cannot perfectly predict cancer and its response to treatment. Genes exert their effect after transcription, through translation into active proteins. Consequently, postgenomic projects correlating protein expression profiles with tumour classification have led to some established biomarkers. These biomarkers are associated with disease prediction and can be associated with treatment response. Recently, Brozokova and colleagues demonstrated that surface-enhanced laser desorption ionization time of flight mass spectrometry (SELDI-TOF MS) profiling of breast cancer tissue proteomes can potentially expand the biomarker repertoire and our knowledge of breast cancer behaviour.
One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers.
We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis.
We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.
Breast cancer metastasis is a complex, multi-step biological process. Genetic mutations along with epigenetic alterations in the form of DNA methylation patterns and histone modifications contribute to metastasis-related gene expression changes and genomic instability. So far, these epigenetic contributions to breast cancer metastasis have not been well characterized, and there is only a limited understanding of the functional mechanisms affected by such epigenetic alterations. Furthermore, no genome-wide assessments have been undertaken to identify altered DNA methylation patterns in the context of metastasis and their effects on specific functional pathways or gene networks.
We have used a human gene promoter tiling microarray platform to analyze a cell line model of metastasis to lymph nodes composed of a poorly metastatic MDA-MB-468GFP human breast adenocarcinoma cell line and its highly metastatic variant (468LN). Gene networks and pathways associated with metastasis were identified, and target genes associated with epithelial–mesenchymal transition were validated with respect to DNA methylation effects on gene expression.
We integrated data from the tiling microarrays with targets identified by Ingenuity Pathways Analysis software and observed epigenetic variations in genes implicated in epithelial–mesenchymal transition and with tumor cell migration. We identified widespread genomic hypermethylation and hypomethylation events in these cells and we confirmed functional associations between methylation status and expression of the CDH1, CST6, EGFR, SNAI2 and ZEB2 genes by quantitative real-time PCR. Our data also suggest that the complex genomic reorganization present in cancer cells may be superimposed over promoter-specific methylation events that are responsible for gene-specific expression changes.
This is the first whole-genome approach to identify genome-wide and gene-specific epigenetic alterations, and the functional consequences of these changes, in the context of breast cancer metastasis to lymph nodes. This approach allows the development of epigenetic signatures of metastasis to be used concurrently with genomic signatures to improve mapping of the evolving molecular landscape of metastasis and to permit translational approaches to target epigenetically regulated molecular pathways related to metastatic progression.
Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.
breast cancer metastasis; classification; protein networks; pathways; microarrays
Although omic-based discovery approaches can provide powerful tools for biomarker identification, several reservations have been raised regarding the clinical applicability of gene expression studies, such as their prohibitive cost. However, the limited availability of antibodies is a key barrier to the development of a lower cost alternative, namely a discrete collection of immunohistochemistry (IHC)-based biomarkers. The aim of this study was to use a systematic approach to generate and screen affinity-purified, mono-specific antibodies targeting progression-related biomarkers, with a view towards developing a clinically applicable IHC-based prognostic biomarker panel for breast cancer.
We examined both in-house and publicly available breast cancer DNA microarray datasets relating to invasion and metastasis, thus identifying a cohort of candidate progression-associated biomarkers. Of these, 18 antibodies were released for extended analysis. Validated antibodies were screened against a tissue microarray (TMA) constructed from a cohort of consecutive breast cancer cases (n = 512) to test the immunohistochemical surrogate signature.
Antibody screening revealed 3 candidate prognostic markers: the cell cycle regulator, Anillin (ANLN); the mitogen-activated protein kinase, PDZ-Binding Kinase (PBK); and the estrogen response gene, PDZ-Domain Containing 1 (PDZK1). Increased expression of ANLN and PBK was associated with poor prognosis, whilst increased expression of PDZK1 was associated with good prognosis. A 3-marker signature comprised of high PBK, high ANLN and low PDZK1 expression was associated with decreased recurrence-free survival (p < 0.001) and breast cancer-specific survival (BCSS) (p < 0.001). This novel signature was associated with high tumour grade (p < 0.001), positive nodal status (p = 0.029), ER-negativity (p = 0.006), Her2-positivity (p = 0.036) and high Ki67 status (p < 0.001). However, multivariate Cox regression demonstrated that the signature was not a significant predictor of BCSS (HR = 6.38; 95% CI = 0.79-51.26, p = 0.082).
We have developed a comprehensive biomarker pathway that extends from discovery through to validation on a TMA platform. This proof-of-concept study has resulted in the identification of a novel 3-protein prognostic panel. Additional biochemical markers, interrogated using this high-throughput platform, may further augment the prognostic accuracy of this panel to a point that may allow implementation into routine clinical practice.
Prognostic biomarkers; Tissue microarray; Breast cancer; Antibody screening; Antibody validation
In cancer biology, it is very important to understand the phenotypic changes of patients and to discover new cancer subtypes. Recently, microarray-based technologies have shed light on this problem through gene expression profiles, which may contain outliers due to either chemical or electrical causes. These undiscovered subtypes may be heterogeneous with respect to underlying networks or pathways, and may be related to only a few interdependent biomarkers. This motivates the need for robust gene expression-based methods capable of discovering such subtypes, elucidating the corresponding network structures and identifying cancer-related biomarkers. This study proposes a penalized model-based Student's t clustering with unconstrained covariance (PMT-UC) to discover cancer subtypes with cluster-specific networks, taking gene dependencies into account and remaining robust against outliers. Biomarker identification and network reconstruction are achieved by imposing an adaptive penalty on the means and the inverse scale matrices. The model is fitted via the expectation maximization algorithm utilizing the graphical lasso. A network-based gene selection criterion that identifies biomarkers not as individual genes but as subnetworks is applied. This allows us to implicate weakly discriminative biomarkers that play a central role in the subnetwork by interconnecting many differentially expressed genes, or that have cluster-specific underlying network structures. Experimental results on simulated datasets and one available cancer dataset attest to the effectiveness and robustness of PMT-UC in cancer subtype discovery. Moreover, PMT-UC is able to select cancer-related biomarkers that have been verified in biochemical or biomedical research, and to learn biologically significant correlations among genes.
Breast cancer is one of the most common cancer types. Due to the complexity of this disease, it is important to face its study with an integrated and multilevel approach, from genes, transcripts and proteins to molecular networks, cell populations and tissues. According to the systems biology perspective, the biological functions arise from complex networks: in this context, concepts like molecular pathways, protein-protein interactions (PPIs), mathematical models and ontologies play an important role for dissecting such complexity.
In this work we present the Genes-to-Systems Breast Cancer (G2SBC) Database, a resource which integrates data about genes, transcripts and proteins reported in literature as altered in breast cancer cells. Beside the data integration, we provide an ontology based query system and analysis tools related to intracellular pathways, PPIs, protein structure and systems modelling, in order to facilitate the study of breast cancer using a multilevel perspective. The resource is available at the URL http://www.itb.cnr.it/breastcancer.
The G2SBC Database represents a systems biology oriented data integration approach devoted to breast cancer. By means of the analysis capabilities provided by the web interface, it is possible to overcome the limits of reductionist resources, enabling predictions that can lead to new experiments.
Since high-throughput protein-protein interaction (PPI) data has recently become available for humans, there has been a growing interest in combining PPI data with other genome-wide data. In particular, the identification of phenotype-related PPI subnetworks using gene expression data has been of great concern. Successful integration for the identification of significant subnetworks requires the use of a search algorithm with a proper scoring method. Here we propose a multivariate analysis of variance (MANOVA)-based scoring method with a greedy search for identifying differentially expressed PPI subnetworks.
Given the MANOVA-based scoring method, we performed a greedy search to identify the subnetworks with the maximum scores in the PPI network. Our approach was successfully applied to human microarray datasets. Each identified subnetwork was annotated with the Gene Ontology (GO) term, resulting in the phenotype-related functional pathway or complex. We also compared these results with those of other scoring methods such as t statistic- and mutual information-based scoring methods. The MANOVA-based method produced subnetworks with a larger number of proteins than the other methods. Furthermore, the subnetworks identified by the MANOVA-based method tended to consist of highly correlated proteins.
This article proposes a MANOVA-based scoring method to combine PPI data with expression data using a greedy search. This method is recommended for the highly sensitive detection of large subnetworks.
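The greedy search described in this abstract can be sketched with a pluggable scoring function. This is an illustrative sketch only: the MANOVA-based score is not reproduced here. As a stand-in, `aggregate_score` uses a simple sum of per-gene scores divided by the square root of the subnetwork size, and all names and toy values are hypothetical.

```python
import math


def aggregate_score(nodes, node_score):
    """Stand-in subnetwork score (NOT the MANOVA score): sum of node scores / sqrt(size)."""
    return sum(node_score[n] for n in nodes) / math.sqrt(len(nodes))


def greedy_subnetwork(seed, adjacency, node_score, score_fn=aggregate_score):
    """Grow a connected subnetwork from `seed`, adding the best neighbour while the score improves."""
    current = {seed}
    best = score_fn(current, node_score)
    while True:
        frontier = set()
        for node in current:
            frontier |= adjacency.get(node, set())
        frontier -= current
        candidates = [(score_fn(current | {c}, node_score), c) for c in sorted(frontier)]
        if not candidates:
            break
        top_score, top_node = max(candidates)
        if top_score <= best:
            break  # no neighbour improves the score, so stop
        current.add(top_node)
        best = top_score
    return current, best


# Hypothetical toy network and per-gene scores.
adjacency = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B"}, "D": {"A"}}
node_score = {"A": 2.0, "B": 1.8, "C": 0.1, "D": -1.0}
subnet, score = greedy_subnetwork("A", adjacency, node_score)
print(sorted(subnet), round(score, 3))
```

In practice the search would be repeated from every seed protein, and `score_fn` would be replaced by the multivariate statistic of interest.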
Recent years have seen the development of various pathway-based methods for the analysis of microarray gene expression data. These approaches have the potential to bring biological insights into microarray studies. A variety of methods have been proposed to construct networks using gene expression data. Because individual pathways do not act in isolation, it is important to understand how different pathways coordinate to perform cellular functions. However, there are no published methods describing how to build pathway clusters that are closely related to traits of interest.
We propose to build pathway clusters from pathway-based classification methods. The proposed methods allow researchers to identify clusters of pathways sharing similar functions. These pathways may or may not share genes. As an illustration, our approach is applied to three human breast cancer microarray data sets. We found that our methods yielded consistent and interpretable results for these three data sets. We further investigated one of the pathway clusters found using PubMatrix. We found that informative genes in the pathway clusters do have more publications with keywords, like estrogen receptor, compared with informative genes in other top pathways. In addition, using the shortest path analysis in GeneGo's MetaCore and Human Protein Reference Database, we were able to identify the links which connect the pathways without shared genes within the pathway cluster.
Our proposed pathway clustering methods allow bioinformaticians and biologists to investigate how informative genes within pathways are related to each other and understand possible crosstalk between pathways in a cluster. Therefore, building pathway clusters may lead to a better understanding of molecular mechanisms affecting a trait of interest, and help generate further biological hypotheses from gene expression data.
Lung cancer is the leading cause of cancer deaths worldwide. Many studies have investigated the carcinogenic process and identified the biomarkers for signature classification. However, based on the research dedicated to this field, there is no highly sensitive network-based method for carcinogenesis characterization and diagnosis from the systems perspective.
In this study, a systems biology approach integrating microarray gene expression profiles and protein-protein interaction information was proposed to develop a network-based biomarker for molecular investigation into the network mechanism of lung carcinogenesis and diagnosis of lung cancer. The network-based biomarker consists of two protein association networks constructed for cancer samples and non-cancer samples.
Based on the network-based biomarker, a total of 40 significant proteins in lung carcinogenesis were identified with carcinogenesis relevance values (CRVs). In addition, the network-based biomarker, acting as the screening test, proved to be effective in diagnosing smokers with signs of lung cancer.
A network-based biomarker using constructed protein association networks is a useful tool to highlight the pathways and mechanisms of the lung carcinogenic process and, more importantly, provides potential therapeutic targets to combat cancer.
Cancer is a disease associated with the deregulation of multiple gene networks. Microarray data has permitted researchers to identify gene panel markers for diagnosis or prognosis of cancer but these are not sufficient to make specific mechanistic assertions about phenotype switches. We propose a strategy to identify putative mechanisms of cancer phenotypes by protein-protein interactions (PPI). We first extracted the logic status of a PPI via the relative expression of the corresponding gene pair. The joint association of a gene pair on a cancer phenotype was calculated by entropy minimization and assessed using a support vector machine. A typical predictor is “If Src high-expression, and Cav-1 low-expression, then cancer.” We achieved 90% accuracy on test data with a majority of predictions associated with the MAPK pathway, focal adhesion, apoptosis and cell cycle. Our results can aid in the development of phenotype discrimination biomarkers and identification of putative therapeutic interference targets for drug development.
cancer; biomarker; phenotype discrimination; protein-protein interaction
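The idea of deriving a logic status for an interaction from the relative expression of its gene pair, and then measuring how well that status separates phenotypes, can be sketched as follows. This is a minimal illustration under assumptions: the status is taken to be a simple "gene A above gene B" comparison, the association is measured with conditional entropy, the expression values and the `GENE_X`/`GENE_Y` pair are invented, and the support vector machine assessment is not shown.

```python
import math
from collections import Counter


def pair_status(sample, gene_a, gene_b):
    """Logic status of an interaction: True when gene_a is expressed above gene_b."""
    return sample[gene_a] > sample[gene_b]


def conditional_entropy(statuses, labels):
    """H(label | status) in bits; 0 means the status fully determines the label."""
    total = len(labels)
    entropy = 0.0
    for status in set(statuses):
        group = [lab for st, lab in zip(statuses, labels) if st == status]
        counts = Counter(group)
        h_group = -sum((c / len(group)) * math.log2(c / len(group)) for c in counts.values())
        entropy += (len(group) / total) * h_group
    return entropy


# Hypothetical expression values; SRC/CAV1 mirrors the example rule in the abstract.
samples = [
    {"SRC": 8.1, "CAV1": 3.2, "GENE_X": 5.0, "GENE_Y": 4.0},
    {"SRC": 7.5, "CAV1": 2.9, "GENE_X": 3.0, "GENE_Y": 6.0},
    {"SRC": 3.0, "CAV1": 6.4, "GENE_X": 5.5, "GENE_Y": 4.5},
    {"SRC": 2.8, "CAV1": 7.0, "GENE_X": 2.0, "GENE_Y": 6.5},
]
labels = ["cancer", "cancer", "normal", "normal"]

src_cav1 = [pair_status(s, "SRC", "CAV1") for s in samples]
x_y = [pair_status(s, "GENE_X", "GENE_Y") for s in samples]
h_informative = conditional_entropy(src_cav1, labels)
h_uninformative = conditional_entropy(x_y, labels)
print(h_informative, h_uninformative)
```

Pairs with low conditional entropy are candidates for rules of the form "if A high and B low, then cancer"; pairs near 1 bit carry no information about a balanced two-class phenotype.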
Lung cancer is one of the leading causes of cancer mortality worldwide. The main types of lung cancer are small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). In this work, a computational method was proposed for identifying lung-cancer-related genes with a shortest path approach in a protein-protein interaction (PPI) network. Based on the PPI data from STRING, a weighted PPI network was constructed. 54 NSCLC- and 84 SCLC-related genes were retrieved from the associated KEGG pathways. Then the shortest paths between each pair of the 54 NSCLC genes, and between each pair of the 84 SCLC genes, were obtained with Dijkstra's algorithm. Finally, all the genes on the shortest paths were extracted, and 25 and 38 shortest path genes with a permutation P value less than 0.05 were selected for NSCLC and SCLC, respectively, for further analysis. Some of the shortest path genes have been reported to be related to lung cancer. Intriguingly, the candidate genes identified from the PPI network contained more cancer genes than those identified from gene expression profiles. Furthermore, these genes showed more functional similarity to known cancer genes than those identified from gene expression profiles. This study demonstrated the efficiency of the proposed method and showed promising results.
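The shortest path step can be illustrated with a standard Dijkstra implementation over a weighted, undirected network. This is a sketch under assumptions: the edge weights below are invented, the way interaction confidence is converted to a weight is not specified in the abstract, and the permutation test is not shown.

```python
import heapq


def build_graph(weighted_edges):
    """Build an undirected adjacency map {node: {neighbour: weight}}."""
    graph = {}
    for a, b, w in weighted_edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, cost) or (None, inf) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical weighted network; lower weight means a stronger interaction.
graph = build_graph([("A", "B", 0.1), ("B", "C", 0.2), ("A", "C", 0.9), ("C", "D", 0.1)])
path, cost = shortest_path(graph, "A", "D")
interior_genes = path[1:-1]  # genes lying on the path between the two known genes
print(path, interior_genes)
```

Running this over every pair of known disease genes and collecting the interior genes gives the candidate set that would then be filtered by a permutation test.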
Breast cancer is the most common malignancy among women worldwide in terms of incidence and mortality. About 10% of North American women will be diagnosed with breast cancer during their lifetime and 20% of those will die of the disease. Breast cancer is a heterogeneous disease and biomarkers able to correctly classify patients into prognostic groups are needed to better tailor treatment options and improve outcomes. One powerful method used for biomarker discovery is sample screening with mass spectrometry, as it allows direct comparison of protein expression between normal and pathological states. The purpose of this study was to use a systematic and objective method to identify biomarkers with possible prognostic value in breast cancer patients, particularly in identifying cases most likely to have lymph node metastasis and to validate their prognostic ability using breast cancer tissue microarrays.
Methods and Findings
Differential proteomic analyses were employed to identify candidate biomarkers in primary breast cancer patients. These analyses identified decorin (DCN) and endoplasmin (HSP90B1), which play important roles in regulating the tumour microenvironment and in pathways related to tumorigenesis. This study indicates that high expression of DCN is associated with lymph node metastasis (p<0.001), a higher number of positive lymph nodes (p<0.0001) and worse overall survival (p = 0.01). High expression of HSP90B1 is associated with distant metastasis (p<0.0001) and decreased overall survival (p<0.0001); these patients also appear to benefit significantly from hormonal treatment.
Using quantitative proteomic profiling of primary breast cancers, two new promising prognostic and predictive markers were found to identify patients with worse survival. In addition HSP90B1 appears to identify a group of patients with distant metastasis with otherwise good prognostic features.
Although survival in breast cancer has increased owing to early screening programs and new therapeutic strategies, many patients are still lost to metastatic relapse. For this reason, new approaches such as proteomic techniques have become a prime objective of breast cancer research. Various omic-based techniques have been applied with increasing success to the molecular characterisation of breast tumours, resulting in a more detailed classification scheme and in clinical diagnostic tests that have been applied to both prognosis and the prediction of treatment outcome. Implementation of proteomics-based techniques is also seen as crucial if we are to develop a systems biology approach to the discovery of biomarkers for early diagnosis, prognosis and prediction of the outcome of breast cancer therapies. In this review, we discuss the studies conducted thus far for the discovery of diagnostic, prognostic and predictive biomarkers, and evaluate the potential of the discriminating proteins identified in this research for clinical use as breast cancer biomarkers.
Breast cancer; early diagnosis; prognostic markers; proteomic techniques; micro array techniques; mass spectrometry; surface enhanced laser desorption ionisation; matrix-assisted laser desorption ionisation.
Identifying breast cancer patients is crucial to the clinical diagnosis and therapy for this disease. Conventional gene-based methods for breast cancer diagnosis ignore gene-gene interactions and thus may lead to loss of power. In this study, we proposed a novel method to select classification features, called “Selection of Significant Expression-Correlation Differential Motifs” (SSECDM). This method applied a network motif-based approach, combining a human signaling network and high-throughput gene expression data to distinguish breast cancer samples from normal samples. Our method has higher classification performance and better classification accuracy stability than the mutual information (MI) method or the individual gene sets method. It may become a useful tool for identifying and treating patients with breast cancer and other cancers, thus contributing to clinical diagnosis and therapy for these diseases.
Microarrays have become an increasingly popular biotechnology in biological and medical research, and have been widely applied to the classification of treatment subtypes using expression patterns of biomarkers. We developed a statistical procedure to identify expression biomarkers for treatment subtype classification by constructing an F-statistic based on Henderson method III. Monte Carlo simulations were conducted to examine the robustness and efficiency of the proposed method. Simulation results showed that our method could provide satisfactory power for identifying differentially expressed genes (DEGs) with a false discovery rate (FDR) lower than the given type I error rate. In addition, we analyzed a leukemia dataset collected from 38 leukemia patients, with 27 samples diagnosed as acute lymphoblastic leukemia (ALL) and 11 samples as acute myeloid leukemia (AML). We compared our results with those from significance analysis of microarray (SAM) and microarray analysis of variance (MAANOVA). Among these three methods, only the expression biomarkers identified by our method can precisely identify the three human acute leukemia subtypes.
Microarray; Biomarker; Henderson method III; Gene expression pattern; Mixed linear model
Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate the robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen's κ and Cramér's V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare the prognostic value of the resulting classifications. All statistical tests were two-sided.
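The two agreement statistics named above can be illustrated with a minimal Python sketch. It assumes two equal-length lists of categorical labels for the same tumors; the function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists.

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def cramers_v(labels_a, labels_b):
    """Cramér's V: association strength derived from the chi-square statistic.

    Undefined when either variable has a single category.
    """
    n = len(labels_a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    chi2 = 0.0
    for a, count_a in freq_a.items():
        for b, count_b in freq_b.items():
            expected = count_a * count_b / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(freq_a), len(freq_b))
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical subtype calls from two classifiers on the same eight tumors.
classifier_x = ["basal", "basal", "her2", "lumA", "lumA", "lumB", "lumB", "lumA"]
classifier_y = ["basal", "basal", "her2", "lumA", "lumB", "lumB", "lumB", "lumA"]
print(cohen_kappa(classifier_x, classifier_y))
print(cramers_v(classifier_x, classifier_y))
```

In this sketch κ compares two classifiers against each other, while V can equally be applied to a classifier's calls and a clinical variable such as ER status.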
SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64) and HER2 (V = 0.52) status and with histological grade (V = 0.55), and yielded similarly strong prognostic value.
Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes.
Combining multiple microarray datasets increases sample size and leads to improved reproducibility in the identification of informative genes and in subsequent clinical prediction. Although microarrays have increased the rate of genomic data collection, sample size is still a major issue when identifying informative genetic biomarkers. Because of this, feature selection methods often suffer from false discoveries, resulting in poorly performing predictive models. We develop a simple meta-analysis-based feature selection method that captures the knowledge in each individual dataset and combines the results using a simple rank average. In a comprehensive study that measures robustness in terms of clinical application (i.e., breast, renal, and pancreatic cancer), microarray platform heterogeneity, and classifier (i.e., logistic regression, diagonal LDA, and linear SVM), we compare the rank average meta-analysis method to five other meta-analysis methods. Results indicate that rank average meta-analysis consistently performs well compared to these five methods.
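The rank-averaging idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each dataset has already been reduced to a per-gene score where larger means more informative, it keeps only genes present in every dataset, and it breaks ties alphabetically. The function names and toy scores are hypothetical.

```python
def rank_genes(scores):
    """Map each gene to its rank (1 = most informative) by descending score."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: position for position, gene in enumerate(ordered, start=1)}


def rank_average(datasets):
    """Average per-dataset ranks over the genes shared by all datasets.

    Returns (gene, mean_rank) pairs, best (lowest mean rank) first.
    """
    common = set.intersection(*(set(d) for d in datasets))
    # Rank within each dataset separately so that score scales never mix.
    per_dataset = [rank_genes({g: d[g] for g in common}) for d in datasets]
    mean_rank = {
        g: sum(ranks[g] for ranks in per_dataset) / len(per_dataset)
        for g in common
    }
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Hypothetical per-gene scores (e.g. absolute test statistics) from two studies.
study_a = {"G1": 3.2, "G2": 1.1, "G3": 2.5}
study_b = {"G1": 0.9, "G2": 0.2, "G3": 1.4}
for gene, mean in rank_average([study_a, study_b]):
    print(gene, mean)
```

Because only ranks are combined, datasets measured on different platforms or scored with different statistics can be merged without rescaling.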
To construct biologically interpretable gene sets for muscular dystrophy (MD) sub-type classification, we propose a novel computational scheme that integrates protein-protein interaction (PPI) network data, functional gene set information, and mRNA profiling data. The workflow of the proposed scheme includes three major steps. First, we apply an affinity propagation clustering (APC) approach to identify gene sub-networks associated with each MD sub-type, for which we propose a new distance metric that combines PPI network information and gene-gene co-expression relationships. Second, we incorporate functional gene set knowledge, which complements the physical PPI information, into our scheme for biomarker identification. Finally, based on the constructed sub-networks and gene set features, we apply multi-class support vector machines (MSVMs) for MD sub-type classification and use them to highlight the biomarkers that contribute to sub-type prediction. The experimental results show that our scheme can help identify sub-networks and gene sets that are more relevant to MD than those constructed by other conventional approaches. Moreover, our integrative strategy improves the prediction accuracy substantially, especially for the 'hard-to-classify' sub-types.
Gene expression; Classification; Muscular dystrophy; Affinity propagation clustering; Biomarker discovery
High-throughput array technology is ubiquitous in diverse areas, notably the early diagnosis of disease, the discovery of infectious agents, the search for biological markers, and the screening of potential drug candidates. Here, we integrated gene expression data with a network-based approach to identify novel genes that play a central role in the network by interconnecting a number of differentially expressed breast cancer genes. The 62 cancer genes retrieved from the Breast Cancer Gene Database (BCGD) were mapped onto normalized data from the Stanford Microarray Database (SMD) to analyze their expression patterns. Interaction networks for each gene were constructed to understand the biology of metastasis at the systems level. The individual networks were merged to detect interacting hubs, and 38 novel genes were found to be closely connected with the central hub node. Gene Ontology analysis was performed to characterize the biology of the hub nodes, not through gene ranking alone but by applying the hypergeometric test with the Benjamini-Hochberg false discovery rate (FDR) correction at a significance level of 0.05. The p-values from this test indicated that most of the novel genes were involved in the same biological functions as the disordered genes, such as signal transducer, transcription regulator, enzyme binding, molecular transducer, and receptor signaling protein activity, and in the same pathways, such as MAPK signaling, apoptosis, Wnt signaling, ErbB signaling, and the cell cycle. Lastly, we identified three novel genes, CHUK, INSR, and CREBBP, that show a high number of connections with the 12 novel genes reported in the literature as well as with the perturbed genes. As a result, these genes can be considered significant findings in revealing the basis of, and the pathways responsible for, breast cancer.
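The enrichment step described above, a hypergeometric test followed by Benjamini-Hochberg correction, can be sketched as follows. This is a generic illustration using only the Python standard library. The function names, argument names, and gene counts are hypothetical and do not reproduce the study's data.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Upper-tail probability P(X >= hits) for draws without replacement.

    population: genes in the background; annotated: background genes carrying
    the term; sample: genes in the list tested; hits: annotated genes in it.
    """
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts for three terms: (annotated in background, hits in list).
background, gene_list = 20000, 38
terms = {"term_1": (300, 6), "term_2": (250, 4), "term_3": (4000, 8)}
raw = [hypergeom_pvalue(background, n, gene_list, k) for n, k in terms.values()]
for name, q in zip(terms, benjamini_hochberg(raw)):
    print(name, q, "significant" if q < 0.05 else "not significant")
```

Terms whose adjusted p-value falls below 0.05 would be reported as enriched under the significance level quoted in the text.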
Microarray; Breast cancer gene database (BCGD); Estrogen receptors (ER); Tamoxifen; Expression pattern; Molecular interaction networks; Novel genes
Knowledge-driven text mining is becoming an important research area for identifying pharmacogenomics target genes. However, few such studies have focused on the pharmacogenomics targets of adverse drug events (ADEs). The objective of the present study is to build a framework for knowledge integration and discovery that supports pharmacogenomics target prediction for ADEs. We integrate a semantically annotated literature corpus, Semantic MEDLINE, with a semantically coded ADE knowledgebase known as ADEpedia using a semantic web-based framework. We developed a knowledge discovery approach that combines a network analysis of a protein-protein interaction (PPI) network with a gene functional classification approach. We performed a case study of drug-induced long QT syndrome to demonstrate the usefulness of the framework in predicting potential pharmacogenomics targets of ADEs.
Modern biology aims to unravel complex biological structures such as protein-protein interaction (PPI) networks. A key concept in the organization of PPI networks is the existence of dense subnetworks (functional modules) within them. In recent approaches, clustering algorithms were applied to these networks, and the resulting subnetworks were evaluated by estimating their coverage of well-established protein complexes. However, most of these algorithms operate on an unweighted graph structure, which fails to emphasize the interactions that would contribute to biologically more valid and coherent functional modules.
In the current study, we present a method that integrates protein interaction and microarray data to discover biologically valid functional modules. Initially, the gene expression information is overlaid as weights onto the PPI network. The enriched PPI graph allows us to exploit its topological aspects while simultaneously highlighting enhanced functional association in specific pairs of proteins. We then present an algorithm that unveils the functional modules of the weighted graph by expanding a kernel protein set, which originates from a given 'seed' protein used as a starting point.
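The abstract does not specify the weighting scheme or the expansion rule, so the following Python sketch is only one plausible reading. It assumes edge weights equal to the absolute Pearson correlation of the two proteins' expression profiles and a greedy expansion that repeatedly adds the neighbouring protein with the highest mean edge weight to the current module, stopping below a threshold. The function names, the `min_weight` parameter, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined for constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weight_edges(edges, expression):
    """Overlay expression on PPI edges as |Pearson r| weights (an assumption)."""
    graph = {}
    for u, v in edges:
        w = abs(pearson(expression[u], expression[v]))
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def expand_from_seed(graph, seed, min_weight=0.7):
    """Greedily grow a module from `seed` over a weighted adjacency dict."""
    module = {seed}
    while True:
        candidates = {n for m in module for n in graph.get(m, {})} - module
        scored = []
        for candidate in sorted(candidates):
            links = [w for m, w in graph[candidate].items() if m in module]
            scored.append((sum(links) / len(links), candidate))
        if not scored:
            return module
        # max() keeps the first maximum, so ties resolve alphabetically.
        score, best = max(scored, key=lambda item: item[0])
        if score < min_weight:
            return module
        module.add(best)


# Hypothetical interactions and expression profiles across four conditions.
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]
expression = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.1, 3.9, 6.2, 8.0],
    "P3": [0.9, 2.2, 2.8, 4.1],
    "P4": [3.0, 1.0, 4.0, 2.0],
}
print(sorted(expand_from_seed(weight_edges(edges, expression), "P1")))
```

Scoring a candidate by its mean weight to the whole module, not by its single best edge, favours proteins that are coherently linked to several members. Other criteria, such as a density or modularity gain, would be equally valid choices.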
The integrated data and the concept of our approach provide reliable functional modules. Using yeast data, we provide evidence that our method gives accurate results in terms of both structural coherence and functional consistency.