Mounting evidence suggests that malignant tumors are initiated and maintained by a subpopulation of cancerous cells with biological properties similar to those of normal stem cells. However, descriptions of stem-like gene and pathway signatures in cancers are inconsistent across experimental systems. Driven by a need to improve our understanding of molecular processes that are common and unique across cancer stem cells (CSCs), we have developed the Stem Cell Discovery Engine (SCDE)—an online database of curated CSC experiments coupled to the Galaxy analytical framework. The SCDE allows users to consistently describe, share and compare CSC data at the gene and pathway level. Our initial focus has been on carefully curating tissue and cancer stem cell-related experiments from blood, intestine and brain to create a high quality resource containing 53 public studies and 1098 assays. The experimental information is captured and stored in the multi-omics Investigation/Study/Assay (ISA-Tab) format and can be queried in the data repository. A linked Galaxy framework provides a comprehensive, flexible environment populated with novel tools for gene list comparisons against molecular signatures in GeneSigDB and MSigDB, curated experiments in the SCDE and pathways in WikiPathways. The SCDE is available at http://discovery.hsci.harvard.edu.
Cells in adult non-germinal tissues such as blood, skin and intestine turn over briskly and are known to require stem cells for lifelong renewal. These tissue stem cells are capable of proliferation and self-renewal, and can produce differentiated progeny through the expression of tissue-specific genes. Recent evidence suggests that studying adult stem cells can provide insight into cancer cell biology. Only small fractions of tumor-derived cells are clonogenic in culture or tumorigenic in vivo (1,2). Cancers are therefore thought to rely on the activity of stem or stem-like cells that are tumorigenic and exhibit the cardinal properties of self-renewal and multi-lineage differentiation potential.
Stem and differentiated cells within a tumor are reported to differ in sensitivity toward therapy (3). Studies have independently established embryonic stem cell gene expression signatures where cancer subtypes with poor survival prognosis are enriched in treatment-resistant, stem-like cells. Stem cell signatures resulting in poor prognosis have so far been found in glioma, breast, lung, colon and esophageal cancers (4–10). Comparing stem cell populations therefore has the potential to identify new molecular targets for drug and immune therapies that destroy the self-renewing cancer stem cells (CSCs). However, descriptions of gene and pathway stem-like signatures across cancers are inconsistent across platforms, tissues and laboratories.
Driven by a need to understand CSC molecular profiles generated at the Harvard Stem Cell Institute (HSCI), we have developed a platform to integrate CSC experimental information: the Stem Cell Discovery Engine (SCDE; http://discovery.hsci.harvard.edu). We have collected, curated and integrated these data into the SCDE to permit molecular comparisons between normal and cancerous stem cells, between stem-cell compartments in blood, intestine and brain, and between mouse models and human tissues.
The SCDE is a modular online system designed to handle data submission, curation, analysis, integration and dissemination of stem cell-related experiments (Figure 1). The system has two components: (i) a tissue and cancer stem cell database accessible through the BioInvestigation Index (BII) (11) and (ii) a customized instance of the Galaxy analysis engine (12,13). It includes tools that integrate public stem cell data with user-submitted experiments. Its initial focus is on gene list manipulation, and interaction with the curated Gene Signatures Database (GeneSigDB) (14), Molecular Signatures Database (MSigDB) (15), and WikiPathways pathway database (16) (Figure 1). A description of the database in accordance with BioDBCore standards (17) is available in Supplementary Table S1.
The SCDE database provides a source of structured experimental information on assays, derived gene lists and pathway profiles. Heterogeneity in experimental information has been reduced by rigorous, manual curation of the experimental model, cell and tissue types, disease state, surface markers and other relevant data. Submitted user data is first checked for relevance, i.e. studies must be performed using well-defined stem cell, tissue stem cell and/or cancer stem cell populations, and must produce genome-scale data with potential to provide insight into the stem-like characteristics of cancers. All of the raw data with its sample characteristics must be available. Data input fields are then mapped to the ontologies listed in Table 1 according to species-specificity and overall coverage of the ontology. New terms are submitted to the ontology maintainers for future inclusion. This ensures that new terms are standardized and incorporated for community use. Experimental protocols and analytical methods are annotated with the goal of providing sufficient information to reproduce or perform similar experiments and to derive the processed data. Derived data in the form of gene lists are converted to standardized identifiers to be used for gene list comparisons within Galaxy.
We stored experimental metadata in the Investigation/Study/Assay (ISA-Tab) format, i.e. high level information about the experiment is recorded in the ‘investigation’ file, sample attributes and factors in the ‘study’ files, and protocols and analysis methods in the ‘assay’ files. This general purpose tab-delimited grammar manages metadata from diverse studies, and enables users to align with community-defined minimum information, ontologies and checklists (11,18,19). It comes with support tools for curation (including semi-automated annotation tagging through the NCBO BioPortal annotation service (20) to speed the process) and format conversion (http://isatab.sourceforge.net) to make it straightforward to submit data to international public repositories, such as the Gene Expression Omnibus (GEO) (21). ISA-Tab is supported and maintained by a global collaboration of biocurators (22). While the initial cost of curation is high, it allows for sharing of ISA-Tab configurations we have developed specifically for stem cell data that can be used within the various ISA tools by the stem cell community. The goal is to build a curation network and establish community involvement so that standards are agreed upon and adopted.
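Because ISA-Tab is a tab-delimited format, a study file can be read with standard tools. The following is a minimal sketch, not part of the SCDE or the ISA tools: the sample rows, the particular `Characteristics[...]` and `Factor Value[...]` columns, and the helper names `read_study_table` and `group_samples_by` are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical contents of a small ISA-Tab 'study' file (tab-delimited).
# Column names and values are illustrative assumptions, not SCDE records.
STUDY_TEXT = (
    "Source Name\tCharacteristics[organism]\tCharacteristics[cell type]\t"
    "Factor Value[disease state]\tSample Name\n"
    "source1\tMus musculus\thematopoietic stem cell\tnormal\tsample1\n"
    "source2\tMus musculus\tgranulocyte macrophage progenitor\tleukemia\tsample2\n"
    "source3\tMus musculus\tgranulocyte macrophage progenitor\tleukemia\tsample3\n"
)


def read_study_table(handle):
    """Read a tab-delimited study table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_samples_by(rows, column):
    """Map each value found in `column` to the sample names that carry it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row["Sample Name"])
    return groups


rows = read_study_table(io.StringIO(STUDY_TEXT))
by_state = group_samples_by(rows, "Factor Value[disease state]")
print(by_state)
```

Real ISA-Tab files may repeat column headers (for example, several ontology reference columns), which a dictionary-based reader would collapse, so production code would more likely rely on the dedicated ISA tools mentioned above.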
A primary focus was a selection of studies related to normal and CSCs, and in particular for three model systems—blood, intestine and brain. In these tissues, the behavior of native stem cells is especially well characterized, investigators generally agree on stem cell definitions, and cancer is common. Table 2 shows the distribution of data across organisms, tissues and types of measurements. The database integrates 53 public studies comprised of 1098 molecular assays from CSC-related experiments from multiple tissues, species and heterogeneous platforms. Five additional studies comprised of 84 assays are stored as private, unpublished data that are available to specific researchers upon login and are ready for dissemination upon publication. Fifteen studies were contributed by researchers in the HSCI community and an additional 40 studies related to CSC biology were selected from StemBase (23,24). Forty-six studies were performed in rodent models and 13 in human cells; these include two studies containing samples assayed from both rodent and human models. The database is made up in large part by microarray expression profiling studies but results from nucleotide sequencing (i.e. ChIP-seq) studies of histone methylation and transcription factor binding, histology and expression analysis by RT–PCR are also included.
Researchers can submit their own data or suggest public data to a curator, who manually curates it according to community-accepted standards and ontologies (Table 1). In cases where published studies have associated data deposited in ArrayExpress, the MAGEtoISA converter tool permits rapid conversion from MAGE-TAB to ISA-Tab format, which is then manually evaluated by a curator for completeness and corrected where necessary.
To ensure that all stem cell data are comparable, primary and derived data sets are organized in a standardized manner and disseminated to the public using a local instance of the BioInvestigation Index (BII). This data repository is designed to support storage, querying and display of multi-omics data sets (11). The annotated metadata allows users to search the entire corpus of experiments in the BII based on organism, measurement type (e.g. transcriptional profiling), technology (e.g. nucleotide sequencing), and platform (e.g. Illumina) or to search free text across all fields (Figure 2A). Study pages display the details of each experiment (Figure 2B–D). The annotation has focused on ensuring that cell types, tissues and experimental variables are consistently reported to improve query capabilities, and to establish sound annotation practices to describe stem cell research (e.g. descriptions of genetic modifications).
Published studies are automatically made publicly available. ISA-Tab formatted metadata can be downloaded for information pertaining to the assays, such as normalization procedures for microarray experiments and GEO accession identifiers where available. Raw primary data (e.g. CEL files for Affymetrix microarrays) and processed derived data (e.g. author-generated gene lists) can also be downloaded from the BII using the ‘Raw Data’ and ‘Processed Data’ buttons (Figure 2E). Alternatively, the data can be accessed within the SCDE Galaxy framework for analysis as described in the following section. Researchers with the appropriate access permissions can query unpublished data to perform early analyses, and upon publication, have the added benefit of exporting their ISA-Tab formatted data for submission to ArrayExpress using the conversion tools. The corresponding functionality for submission to GEO in MiniML format is in progress and will represent a valuable incentive for the stem cell community to use the SCDE as a first port of call for submission of CSC functional genomics data.
In addition to querying experimental metadata, the SCDE provides functionality to interrogate stem cell molecular profiles in a linked Galaxy instance, with the goal of identifying similarities and differences between normal and cancer stem cell experiments. All raw and processed data stored in the BII and several additional manually curated stem cell-related gene lists are accessible from within Galaxy for analysis.
Manual curation and consistent identifier conversion differentiate the SCDE from other gene list comparison tools. Derived gene lists have been mapped to standardized gene symbols using methods developed for GeneSigDB. Such standardization allows for comparisons to determine genes that are shared or unique across experiments. Tools are available to compare a single gene list (SCDE ListMatch) or multiple gene lists (SCDE ListCompare) against curated gene signatures in GeneSigDB, molecular signatures in MSigDB, derived gene lists from the SCDE database and pathways in WikiPathways. These tools allow users to identify genes in common with defined reference signatures and pathways (Figure 3). Results are summarized and ranked according to a hypergeometric test P-value and linked to the relevant overlapping gene sets (Figure 3B). For WikiPathways comparisons, a link is provided to visualize the gene matches in color-coded diagrams of canonical pathways (Figure 3C). The SCDE Intersect tool identifies genes that are common to multiple gene lists. By using the Galaxy interface, users can maintain a record of their analysis history and easily compare multiple data sets stored in their history.
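The ranking of overlaps by a hypergeometric test P-value can be illustrated with a short sketch. This is not the SCDE ListMatch implementation: the function names, the placeholder gene identifiers and the choice of background size (`universe_size`) are assumptions made for illustration, and the P-value shown is the standard upper-tail probability of observing at least the given overlap by chance.

```python
from math import comb


def hypergeom_overlap_pvalue(universe_size, signature_size, query_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of signature genes found when `query_size` genes are
    drawn without replacement from a background of `universe_size` genes,
    of which `signature_size` belong to the reference signature.
    """
    max_overlap = min(signature_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(signature_size, k)
        * comb(universe_size - signature_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_signatures(query, signatures, universe_size):
    """Return (name, shared genes, P-value) tuples, smallest P-value first."""
    query = set(query)
    results = []
    for name, genes in signatures.items():
        genes = set(genes)
        shared = query & genes
        p_value = hypergeom_overlap_pvalue(
            universe_size, len(genes), len(query), len(shared)
        )
        results.append((name, sorted(shared), p_value))
    return sorted(results, key=lambda item: item[2])


# Placeholder identifiers; a real analysis would use standardized gene symbols.
query_list = ["GENE_A", "GENE_B", "GENE_C"]
reference_signatures = {
    "signature_1": ["GENE_A", "GENE_B", "GENE_C"],
    "signature_2": ["GENE_X", "GENE_Y", "GENE_Z"],
}
ranked = rank_signatures(query_list, reference_signatures, universe_size=10)
for name, shared, p_value in ranked:
    print(name, shared, p_value)
```

The result depends strongly on the background size, which in practice should reflect the set of genes that could have been detected on the platform; the value of 10 used here is purely a toy setting.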
The SCDE database provides a repository for curated CSC data and a framework for developing methods to compare molecular information on stem cell related populations. We illustrate the functionality of the SCDE using the following use case as an example. A leukemia researcher enters the SCDE through the BII interface. A search for the term ‘leukemia’ in the free text search box produces five transcriptional profiling studies performed in mouse models. The user selects the first result (ARMSTRONG-S-1) to obtain further details of the study and is provided with information about genetic modifications, hematopoietic progenitor cell types, immunophenotypes, type of leukemia studied and the mouse strain used in the experiment. Wishing to perform a related experiment, the researcher downloads the experimental metadata in ISA-Tab format, which provides him with additional information about the sample cell types, labeling protocol, microarray chip used, number of replicates, normalization procedure, etc. After performing his experiment, the researcher returns to the SCDE to determine how similar his results are to the ARMSTRONG-S-1 study, or indeed to any of the experiments in the SCDE. Using the Galaxy web interface, he uploads his list of differentially expressed genes from his leukemia experiment and uses the ListMatch tool to determine the following: (i) significant overlap with gene signatures from SCDE experiments (this may reveal similarities to the leukemia studies or other hematopoietic stem cell experiments contained in the SCDE); (ii) genes enriched in curated signatures from GeneSigDB or MSigDB (such overlaps provide information about similar disease states, positional biases and functional groupings) and (iii) genes that overlap with known pathways from WikiPathways (genes are projected onto the canonical pathway diagram to indicate where they occur within the pathway).
Going a step further, the researcher uses the ListCompare tool to find the overlap with the ARMSTRONG-S-1 gene list with reference to canonical pathways in WikiPathways. This allows him to identify pathways that contain genes from both lists even where the intersection of the two lists is small, generating hypotheses about possible pathways to study further. Having done his analysis within Galaxy, the researcher saves the gene lists, parameters and results and can share this data with his collaborators or make it publicly available.
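The idea of finding pathways that contain genes from both lists, even when the lists themselves share few genes, can be sketched as follows. This is only an illustration of the comparison described in the use case, not the SCDE ListCompare code; the function name `pathways_hit_by_both`, the pathway names and the gene identifiers are hypothetical.

```python
def pathways_hit_by_both(list_a, list_b, pathways):
    """Return pathways that contain at least one gene from each list.

    The result maps each pathway name to a pair of sorted gene lists:
    the members found in `list_a` and the members found in `list_b`.
    """
    genes_a, genes_b = set(list_a), set(list_b)
    hits = {}
    for name, members in pathways.items():
        members = set(members)
        in_a, in_b = genes_a & members, genes_b & members
        if in_a and in_b:
            hits[name] = (sorted(in_a), sorted(in_b))
    return hits


# Placeholder data: the two lists share no genes, yet both touch pathway_1.
user_list = ["GENE_A", "GENE_B"]
reference_list = ["GENE_C", "GENE_D"]
example_pathways = {
    "pathway_1": ["GENE_A", "GENE_C", "GENE_E"],
    "pathway_2": ["GENE_B", "GENE_F"],
}
shared_pathways = pathways_hit_by_both(user_list, reference_list, example_pathways)
print(shared_pathways)
```

In this toy case the direct intersection of the two lists is empty, but `pathway_1` is still reported because each list contributes a member, which is the kind of observation that could suggest a pathway for further study.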
Compared with previous resources, the SCDE is unique in its community-oriented approach to identifying relevant experiments, capturing and curating study information, and integrating new analysis capabilities. The adoption of the ISA-Tab format permits inclusion of multiple diverse data types and demonstrates that the tools we have used are ready for scale-up. The Galaxy framework allows us to rapidly add relevant analysis methods developed by the growing Galaxy development community in which we are active participants. The implementation of open source software projects that are gaining community support will ensure that the SCDE continues to evolve. The tools developed for the SCDE Galaxy instance have been published on Bitbucket at http://bitbucket.org/hbc/galaxy-central-hbc. By publishing this resource and making the infrastructure available, we hope to develop the stem cell community and obtain feedback on annotation practices, relevant data sets and analytical methods.
While the comparison of gene signatures is informative, a systematic approach to compare and determine the role of key pathway contributions across different experimental systems and cancers against a consistent background is needed. A pathway fingerprinting method to determine functional similarity among experiments independently of platform or species is being developed for integration into the SCDE (Altschuler et al., submitted). We will continue to expand the SCDE to include additional CSC-related studies and new data types, and work with the stem cell community to further refine relevant ontology terms, as has been the case for the Cell Ontology (25). A further focus will be to develop methods to integrate epigenetic data with gene expression. Scientists interested in adding or curating studies, or in implementing analysis options that are not yet available, are encouraged to contact us at firstname.lastname@example.org.
Supplementary Data are available at NAR Online.
National Institutes of Health (1RC2CA148222-01 to R.S., S.A. and W.H.); Harvard Stem Cell Institute. Funding for open access charge: Harvard School of Public Health Dean's Fund; National Institutes of Health Stimulus awards.
Conflict of interest statement. None declared.
We acknowledge assistance from the Harvard Stem Cell Institute for contributing CSC data and are grateful for discussions and assistance from Dr Amit Sinha. We thank Dr Miguel Andrade-Navarro for access to the StemBase data. Thank you also to Emily Merrill and Dr Sudeshna Das for their input on ontology usage.