Results 1-16 (16)

1.  GeneSigDB—a curated database of gene expression signatures 
Nucleic Acids Research  2009;38(Database issue):D716-D725.
The primary objective of most gene expression studies is the identification of one or more gene signatures: lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or in supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB, a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cell gene signatures and was constructed from more than 850 publications from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and download gene signatures in most common gene identifier formats.
PMCID: PMC2808880  PMID: 19934259
2.  Profiles of Genomic Instability in High-Grade Serous Ovarian Cancer Predict Treatment Outcome 
Purpose
High-grade serous cancer (HGSC) is the most common cancer of the ovary and is characterized by chromosomal instability. Defects in homologous recombination repair (HRR) are associated with genomic instability in HGSC, and are exploited by therapy targeting DNA repair. Defective HRR causes uniparental deletions and loss of heterozygosity (LOH). Our purpose is to profile LOH in HGSC and correlate our findings to clinical outcome, and compare HGSC and high-grade breast cancers.
Experimental Design
We examined LOH and copy number changes using single nucleotide polymorphism array data from three HGSC cohorts and compared results to a cohort of high-grade breast cancers. The LOH profiles in HGSC were matched to chemotherapy resistance and progression-free survival (PFS).
Results
LOH-based clustering divided HGSC into two clusters. The major group displayed extensive LOH and was further divided into two subgroups. The second group contained remarkably less LOH. BRCA1 promoter methylation was associated with the major group. LOH clusters were reproducible when validated in two independent HGSC datasets. LOH burden in the major cluster of HGSC was similar to triple-negative, and distinct from other high-grade breast cancers. Our analysis revealed an LOH cluster with lower treatment resistance and a significant correlation between LOH burden and PFS.
Conclusions
Separating HGSC by LOH-based clustering produces remarkably stable subgroups in three different cohorts. Patients in the various LOH clusters differed with respect to chemotherapy resistance, and the extent of LOH correlated with PFS. LOH burden may indicate vulnerability to treatment targeting DNA repair, such as PARP1 inhibitors.
PMCID: PMC4205235  PMID: 22912389
3.  Epithelial Progeny of Estrogen-Exposed Breast Progenitor Cells Display a Cancer-like Methylome 
Cancer research  2008;68(6):1786-1796.
Estrogen imprinting is used to describe a phenomenon in which early developmental exposure to endocrine disruptors increases breast cancer risk later in adult life. We propose that long-lived, self-regenerating stem and progenitor cells are more susceptible to the exposure injury than terminally differentiated epithelial cells in the breast duct. Mammospheres, containing enriched breast progenitors, were used as an exposure system to simulate this imprinting phenomenon in vitro. Using MeDIP-chip, a methylation microarray screening method, we found that 0.5% (120 loci) of human CpG islands were hypermethylated in epithelial cells derived from estrogen-exposed progenitors compared with the non–estrogen-exposed control cells. This epigenetic event may lead to progressive silencing of tumor suppressor genes, including RUNX3, in these epithelial cells, which also occurred in primary breast tumors. Furthermore, normal tissue in close proximity to the tumor site also displayed RUNX3 hypermethylation, suggesting that this aberrant event occurs in early breast carcinogenesis. The high prevalence of estrogen-induced epigenetic changes in primary tumors and the surrounding histologically normal tissues provides the first empirical link between estrogen injury of breast stem/progenitor cells and carcinogenesis. This finding also offers a mechanistic explanation as to why a tumor suppressor gene, such as RUNX3, can be heritably silenced by epigenetic mechanisms in breast cancer.
PMCID: PMC4172329  PMID: 18339859
4.  A multivariate approach to the integration of multi-omics datasets 
BMC Bioinformatics  2014;15:162.
To leverage the potential of multi-omics studies, exploratory data analysis methods that provide systematic integration and comparison of multiple layers of omics information are required. We describe multiple co-inertia analysis (MCIA), an exploratory data analysis method that identifies co-relationships between multiple high dimensional datasets. Based on a covariance optimization criterion, MCIA simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, to extract the most variant from each dataset and facilitate biological interpretation and pathway analysis.
We demonstrate integration of multiple layers of information using MCIA, applied to two typical “omics” research scenarios. The integration of transcriptome and proteome profiles of cells in the NCI-60 cancer cell line panel revealed distinct, complementary features, which together increased the coverage and power of pathway analysis. Our analysis highlighted the importance of the leukemia extravasation signaling pathway in leukemia that was not highly ranked in the analysis of any individual dataset. Secondly, we compared transcriptome profiles of high grade serous ovarian tumors that were obtained on two different microarray platforms and by next generation RNA-sequencing, to identify the most informative platform and extract robust biomarkers of molecular subtypes. We discovered that RNA-sequencing data processed using RPKM had greater variance than data processed with MapSplice and RSEM. We provided novel markers highly associated with tumor molecular subtype combined from four data platforms. MCIA is implemented and available in the R/Bioconductor “omicade4” package.
We believe MCIA is an attractive method for data integration and visualization of several datasets of multi-omics features observed on the same set of individuals. The method is not dependent on feature annotation, and thus it can extract important features even when they are not present across all datasets. MCIA provides simple graphical representations for the identification of relationships between large datasets.
PMCID: PMC4053266  PMID: 24884486
Multivariate analysis; Multiple co-inertia; Data integration; Omic data; Visualization
5.  RamiGO: an R/Bioconductor package providing an AmiGO Visualize interface 
Bioinformatics  2013;29(5):666-668.
Summary: The R/Bioconductor package RamiGO is an R interface to AmiGO that enables visualization of Gene Ontology (GO) trees. Given a list of GO terms, RamiGO uses the AmiGO visualize API to import Graphviz-DOT format files into R, and export these either as images (SVG, PNG) or into Cytoscape for extended network analyses. RamiGO provides easy customization of annotation, highlighting of specific GO terms, colouring of terms by P-value or export of a simplified summary GO tree. We illustrate RamiGO functionalities in a genome-wide gene set analysis of prognostic genes in breast cancer.
Availability and implementation: RamiGO is provided in R/Bioconductor, is open source under the Artistic-2.0 License and is available with a user manual containing installation, operating instructions and tutorials. It requires R version 2.15.0 or higher. URL:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3582261  PMID: 23297033
6.  Stem Cell-Like Gene Expression in Ovarian Cancer Predicts Type II Subtype and Prognosis 
PLoS ONE  2013;8(3):e57799.
Although ovarian cancer is often initially chemotherapy-sensitive, the vast majority of tumors eventually relapse and patients die of increasingly aggressive disease. Cancer stem cells are believed to have properties that allow them to survive therapy and may drive recurrent tumor growth. Cancer stem cells or cancer-initiating cells are a rare cell population and difficult to isolate experimentally. Genes that are expressed by stem cells may characterize a subset of less differentiated tumors and aid in prognostic classification of ovarian cancer. The purpose of this study was the genomic identification and characterization of a subtype of ovarian cancer that has stem cell-like gene expression. Using human and mouse gene signatures of embryonic, adult, or cancer stem cells, we performed an unsupervised bipartition class discovery on expression profiles from 145 serous ovarian tumors to identify a stem-like and more differentiated subgroup. Subtypes were reproducible and were further characterized in four independent, heterogeneous ovarian cancer datasets. We identified a stem-like subtype characterized by a 51-gene signature, which is significantly enriched in tumors with properties of Type II ovarian cancer; high grade, serous tumors, and poor survival. Conversely, the differentiated tumors share properties with Type I, including lower grade and mixed histological subtypes. The stem cell-like signature was prognostic within high-stage serous ovarian cancer, classifying a small subset of high-stage tumors with better prognosis, in the differentiated subtype. In multivariate models that adjusted for common clinical factors (including grade, stage, age), the subtype classification was still a significant predictor of relapse. 
The prognostic stem-like gene signature yields new insights into prognostic differences in ovarian cancer, provides a genomic context for defining Type I/II subtypes, and potential gene targets which following further validation may be valuable in the clinical management or treatment of ovarian cancer.
PMCID: PMC3594231  PMID: 23536770
7.  survcomp: an R/Bioconductor package for performance assessment and comparison of survival models 
Bioinformatics  2011;27(22):3206-3208.
Summary: The survcomp package provides functions to assess and statistically compare the performance of survival/risk prediction models. It implements state-of-the-art statistics to (i) measure the performance of risk prediction models; (ii) combine these statistical estimates from multiple datasets using a meta-analytical framework; and (iii) statistically compare the performance of competitive models.
Availability: The R/Bioconductor package survcomp is provided open source under the Artistic-2.0 License with a user manual containing installation, operating instructions and use case scenarios on real datasets. survcomp requires R version 2.13.0 or higher.
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3208391  PMID: 21903630
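To make the kind of performance measure mentioned in entry 7 concrete, here is a minimal sketch of a concordance index (C-index) for right-censored survival data. It is written from the standard textbook definition and is not taken from survcomp; the function name `concordance_index` and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """C-index: fraction of usable pairs that the risk score orders correctly.

    times       -- observed follow-up times
    events      -- 1 if the event (e.g. relapse) was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is usable only if the times differ and the shorter
        # time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher predicted risk failed first
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data (made up): four patients, the third one is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Combining estimates across datasets and comparing two models statistically, as the abstract describes, would need the standard errors that this sketch leaves out.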
8.  iBBiG: iterative binary bi-clustering of gene sets 
Bioinformatics  2012;28(19):2484-2492.
Motivation: Meta-analysis of genomics data seeks to identify genes associated with a biological phenotype across multiple datasets; however, merging data from different platforms by their features (genes) is challenging. Meta-analysis using functionally or biologically characterized gene sets simplifies data integration, is biologically intuitive and is seen as having great potential, but is an emerging field with few established statistical methods.
Results: We transform gene expression profiles into binary gene set profiles by discretizing results of gene set enrichment analyses and apply a new iterative bi-clustering algorithm (iBBiG) to identify groups of gene sets that are coordinately associated with groups of phenotypes across multiple studies. iBBiG is optimized for meta-analysis of large numbers of diverse genomics data that may have unmatched samples. It does not require prior knowledge of the number or size of clusters. When applied to simulated data, it outperforms commonly used clustering methods, discovers overlapping clusters of diverse sizes and is robust in the presence of noise. We apply it to meta-analysis of breast cancer studies, where iBBiG extracted novel gene set-phenotype associations that predicted tumor metastases within tumor subtypes.
Availability: Implemented in the Bioconductor package iBBiG
PMCID: PMC3463116  PMID: 22789589
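The discretization step described in entry 8 can be sketched as a simple threshold on enrichment p-values. This shows only that preprocessing idea, not the iBBiG bi-clustering algorithm itself. The 0.05 cutoff, the matrix orientation, the gene set names and the p-values are all assumptions made for illustration.

```python
def binarize(pvalues, alpha=0.05):
    """Turn a matrix of enrichment p-values into 0/1 calls."""
    return [[1 if p < alpha else 0 for p in row] for row in pvalues]


# Hypothetical enrichment p-values: rows are phenotype comparisons,
# columns are gene sets.
gene_sets = ["SET_A", "SET_B", "SET_C"]
pvalues = [
    [0.001, 0.20, 0.03],
    [0.004, 0.60, 0.45],
    [0.020, 0.01, 0.30],
]
binary = binarize(pvalues)

# Count how many comparisons call each gene set enriched.
support = {
    name: sum(row[k] for row in binary)
    for k, name in enumerate(gene_sets)
}
```

A bi-clustering method would then search this binary matrix for groups of gene sets that are switched on together across groups of phenotypes.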
9.  The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons 
Nucleic Acids Research  2011;40(Database issue):D984-D991.
Mounting evidence suggests that malignant tumors are initiated and maintained by a subpopulation of cancerous cells with biological properties similar to those of normal stem cells. However, descriptions of stem-like gene and pathway signatures in cancers are inconsistent across experimental systems. Driven by a need to improve our understanding of molecular processes that are common and unique across cancer stem cells (CSCs), we have developed the Stem Cell Discovery Engine (SCDE)—an online database of curated CSC experiments coupled to the Galaxy analytical framework. The SCDE allows users to consistently describe, share and compare CSC data at the gene and pathway level. Our initial focus has been on carefully curating tissue and cancer stem cell-related experiments from blood, intestine and brain to create a high quality resource containing 53 public studies and 1098 assays. The experimental information is captured and stored in the multi-omics Investigation/Study/Assay (ISA-Tab) format and can be queried in the data repository. A linked Galaxy framework provides a comprehensive, flexible environment populated with novel tools for gene list comparisons against molecular signatures in GeneSigDB and MSigDB, curated experiments in the SCDE and pathways in WikiPathways. The SCDE is available at
PMCID: PMC3245064  PMID: 22121217
10.  GeneSigDB: a manually curated database and resource for analysis of gene expression signatures 
Nucleic Acids Research  2011;40(Database issue):D1060-D1066.
GeneSigDB is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related disease to the community so they can compare the predictive power of gene signatures or use these in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, facetted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication quality image. All data in GeneSigDB can be downloaded in numerous formats including .gmt file format for gene set enrichment analysis or as an R/Bioconductor data file. GeneSigDB is available from
PMCID: PMC3245038  PMID: 22110038
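Entry 10 mentions .gmt downloads and comparing a user gene list against stored signatures. The sketch below assumes the usual GMT layout (tab-separated: set name, description, then gene identifiers) and ranks signatures by shared gene count and Jaccard index. The signature names, the gene lists and the helper names `parse_gmt` and `rank_by_overlap` are hypothetical. The significance test that the abstract refers to is not reproduced here.

```python
def parse_gmt(text):
    """Parse GMT text: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    signatures = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        signatures[name] = {g for g in genes if g}
    return signatures


def rank_by_overlap(query, signatures):
    """Return (name, shared gene count, Jaccard index), best match first."""
    query = set(query)
    rows = []
    for name, genes in signatures.items():
        shared = len(query & genes)
        union = len(query | genes)
        rows.append((name, shared, shared / union if union else 0.0))
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))


# Hypothetical two-signature file; names and gene lists are made up.
gmt_text = (
    "SIG_A\texample signature A\tBRCA1\tRUNX3\tUGT8\tITGB8\n"
    "SIG_B\texample signature B\tDSC2\tFERMT1\n"
)
sigs = parse_gmt(gmt_text)
ranking = rank_by_overlap(["BRCA1", "UGT8", "DSC2"], sigs)
```

In practice the identifiers in the query and in the signatures would first need to be mapped to a common nomenclature, which is the problem the standardized gene lists are meant to solve.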
11.  Confounding Effects in “A Six-Gene Signature Predicting Breast Cancer Lung Metastasis” 
Cancer research  2009;69(18):7480-7485.
The majority of breast cancer deaths result from metastases rather than from direct effects of the primary tumor itself. Recently, Landemaine and colleagues described a six-gene signature purported to predict lung metastasis risk. They analyzed gene expression in 23 metastases from breast cancer patients (5 lung, 18 non-lung) identifying a 21-gene signature. Expression of 16 of these was analyzed in primary breast tumors from 72 patients with known outcome, and six were selected that were predictive of lung metastases: DSC2, TFCP2L1, UGT8, ITGB8, ANP32E, and FERMT1. Despite the value of such a signature, our analysis indicates that their study ignored potentially important confounding factors and that their signature is instead a surrogate for molecular subtype.
PMCID: PMC3128918  PMID: 19723662
12.  RAP80 Targets BRCA1 to Specific Ubiquitin Structures at DNA Damage Sites 
Science (New York, N.Y.)  2007;316(5828):1198-1202.
Mutations affecting the BRCT domains of the breast cancer–associated tumor suppressor BRCA1 disrupt the recruitment of this protein to DNA double-strand breaks (DSBs). The molecular structures at DSBs recognized by BRCA1 are presently unknown. We report the interaction of the BRCA1 BRCT domain with RAP80, a ubiquitin-binding protein. RAP80 targets a complex containing the BRCA1-BARD1 (BRCA1-associated ring domain protein 1) E3 ligase and the deubiquitinating enzyme (DUB) BRCC36 to MDC1-γH2AX–dependent lysine6- and lysine63-linked ubiquitin polymers at DSBs. These events are required for cell cycle checkpoint and repair responses to ionizing radiation, implicating ubiquitin chain recognition and turnover in the BRCA1-mediated repair of DSBs.
PMCID: PMC2706583  PMID: 17525341
13.  Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data 
BMC Bioinformatics  2006;7:359.
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets, and compare both the gene lists produced and how these perform in class prediction of test datasets.
In this study, we compared the efficiency of the following feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical Bayes t-statistic, template matching, maxT, between group analysis (BGA), area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two class) microarray datasets. First, we found little agreement in gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Second, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers.
We report that the choice of feature selection method, the number of genes in the gene list, the number of cases (samples) and the noise in the dataset substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performed well when datasets had low numbers of samples or high levels of noise. The empirical Bayes t-statistic performed well across a range of sample sizes.
PMCID: PMC1544358  PMID: 16872483
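The disagreement between gene lists reported in entry 13 is easy to reproduce on a small scale. The sketch below ranks four made-up genes by two of the listed criteria, fold change and the Welch t-statistic, and intersects the top-2 lists. The expression values and helper names are assumptions for illustration. The data are treated as log2-scale, so fold change is taken as a difference of means.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(a, b):
    """Difference of group means (a log ratio for log-scale data)."""
    return mean(a) - mean(b)


def welch_t(a, b):
    """Welch t-statistic: mean difference over the unpooled standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def top_genes(data, score, k):
    """Rank genes by absolute score and return the top-k gene names."""
    ranked = sorted(data, key=lambda g: abs(score(*data[g])), reverse=True)
    return ranked[:k]


# Toy log2 expression values: gene -> (class 1 samples, class 2 samples).
data = {
    "g1": ([8.0, 8.1, 7.9], [6.0, 6.1, 5.9]),      # large, consistent change
    "g2": ([9.0, 5.0, 7.0], [4.0, 6.0, 2.0]),      # large but noisy change
    "g3": ([5.0, 5.01, 4.99], [4.5, 4.51, 4.49]),  # small, very consistent
    "g4": ([6.0, 6.5, 5.5], [6.1, 5.9, 6.0]),      # no real change
}
by_fc = top_genes(data, fold_change, 2)
by_t = top_genes(data, welch_t, 2)
shared = set(by_fc) & set(by_t)
```

Fold change favours the noisy gene with the largest mean difference, while the t-statistic favours the small but very consistent one, so only one gene appears in both lists.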
14.  Optimized between-group classification: a new jackknife-based gene selection procedure for genome-wide expression data 
BMC Bioinformatics  2005;6:239.
A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on performing multivariate ordination of groups, proved to be very efficient for both classification of samples into pre-defined groups and disease class prediction of new unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the prediction power of BGA.
We propose an optimized between-group classification (OBC) which uses a jackknife-based gene selection procedure. OBC emphasizes classification accuracy rather than feature selection. OBC is a backward optimization procedure that maximizes the percentage of between-group inertia by removing the least influential genes one by one from the analysis. This selects a subset of highly discriminative genes which optimize disease class prediction. We applied OBC to four datasets and compared it to other classification methods.
OBC considerably improved the classification and predictive accuracy of BGA, when assessed using independent data sets and leave-one-out cross-validation.
The R code is freely available [see Additional file 1] as well as supplementary information [see Additional file 2].
PMCID: PMC1261161  PMID: 16191195
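The backward procedure outlined in entry 14 can be illustrated with a greedy loop that leaves each gene out in turn and drops the one whose removal most increases the share of between-group inertia. This is a simplified sketch under stated assumptions, not the published OBC code: inertia is computed on plainly column-centred data, the leave-one-gene-out step stands in for the jackknife, and the helper names and toy matrix are hypothetical.

```python
def inertia_parts(samples, labels):
    """Per-gene (between-group, total) sums of squares; rows are samples."""
    parts = []
    for col in zip(*samples):
        grand = sum(col) / len(col)
        total = sum((x - grand) ** 2 for x in col)
        between = 0.0
        for group in set(labels):
            vals = [x for x, lab in zip(col, labels) if lab == group]
            between += len(vals) * (sum(vals) / len(vals) - grand) ** 2
        parts.append((between, total))
    return parts


def backward_select(samples, labels, n_keep):
    """Greedily remove genes until n_keep remain; return (genes, ratio)."""
    parts = inertia_parts(samples, labels)
    kept = list(range(len(parts)))

    def ratio(genes):
        between = sum(parts[j][0] for j in genes)
        total = sum(parts[j][1] for j in genes)
        return between / total if total else 0.0

    while len(kept) > n_keep:
        # Leave each gene out in turn; drop the one whose removal helps most.
        drop = max(kept, key=lambda j: ratio([k for k in kept if k != j]))
        kept.remove(drop)
    return kept, ratio(kept)


# Toy matrix (made up): 4 samples x 3 genes, two classes.
samples = [
    [1.0, 2.0, 1.0],
    [1.0, 4.0, 2.0],
    [5.0, 2.0, 3.0],
    [5.0, 4.0, 4.0],
]
labels = ["A", "A", "B", "B"]
kept, between_share = backward_select(samples, labels, n_keep=2)
```

Here the second gene varies only within classes, so it is removed first and the between-group share rises from 0.8 to about 0.95. Classification accuracy on held-out samples, which the abstract uses for assessment, is not evaluated in this sketch.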
15.  Expression Profiler: next generation—an online platform for analysis of microarray data 
Nucleic Acids Research  2004;32(Web Server issue):W465-W470.
Expression Profiler (EP) is a web-based platform for microarray gene expression and other functional genomics-related data analysis. The new architecture, Expression Profiler: next generation (EP:NG), modularizes the original design and allows individual analysis-task-related components to be developed by different groups and yet still work together seamlessly and share the same user interface look and feel. Data analysis components for gene expression data preprocessing, missing value imputation, filtering, clustering methods, visualization, significant gene finding, between group analysis and other statistical components are available from the EBI (European Bioinformatics Institute) web site. The web-based design of Expression Profiler supports data sharing and collaborative analysis in a secure environment. Developed tools are integrated with the microarray gene expression database ArrayExpress and form the exploratory analytical front-end to those data. EP:NG is an open-source project, encouraging broad distribution and further extensions from the scientific community.
PMCID: PMC441608  PMID: 15215431
16.  Cross-platform comparison and visualisation of gene expression data using co-inertia analysis 
BMC Bioinformatics  2003;4:59.
Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely reduced the comparability of array data. Current cross-platform comparisons of microarray gene expression data are usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses.
We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified.
CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results making it a particularly attractive method for the identification of relationships between large datasets.
PMCID: PMC317282  PMID: 14633289
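The "axes with maximum covariance" idea in entry 16 can be sketched for the first axis only. The code below column-centres two matrices that share the same samples, forms their cross-covariance matrix, and uses power iteration to find the pair of gene-weight vectors whose sample scores covary most. Plain centring is assumed in place of whatever ordination the paper applies to each dataset, and the helper names and toy data are made up for illustration.

```python
from math import sqrt


def center(matrix):
    """Subtract each column mean (rows are samples, columns are genes)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def normalise(vec):
    norm = sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def first_coinertia_axis(x, y, iterations=200):
    """Return (u, v, covariance) for the leading pair of axes."""
    x, y = center(x), center(y)
    n, p, q = len(x), len(x[0]), len(y[0])
    # c[j][k] = covariance between gene j of x and gene k of y.
    c = [[sum(xr[j] * yr[k] for xr, yr in zip(x, y)) / n for k in range(q)]
         for j in range(p)]
    u = normalise([1.0] * p)  # fixed start vector; a random one is more usual
    for _ in range(iterations):
        v = normalise([sum(c[j][k] * u[j] for j in range(p)) for k in range(q)])
        u = normalise([sum(c[j][k] * v[k] for k in range(q)) for j in range(p)])
    scores_x = [sum(a * b for a, b in zip(row, u)) for row in x]
    scores_y = [sum(a * b for a, b in zip(row, v)) for row in y]
    covariance = sum(a * b for a, b in zip(scores_x, scores_y)) / n
    return u, v, covariance


# Toy data (made up): the same 4 samples measured on two platforms,
# with 3 genes on the first platform and 2 on the second.
platform_a = [[1, 0, 2], [2, 1, 1], [3, 0, 0], [4, 1, -1]]
platform_b = [[2, 1], [4, 1], [6, 2], [8, 2]]
u, v, cov = first_coinertia_axis(platform_a, platform_b)
```

Note that the two platforms do not need matching genes, only matching samples, which is the property the abstract emphasises. Later axes would be obtained from a full singular value decomposition of the cross-covariance matrix rather than from this single power iteration.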
