Cells in adult non-germinal tissues such as blood, skin and intestine turn over briskly and are known to require stem cells for lifelong renewal. These tissue stem cells are capable of proliferation and self-renewal, and can produce differentiated progeny through the expression of tissue-specific genes. Recent evidence suggests that studying adult stem cells can provide insight into cancer cell biology. Only small fractions of tumor-derived cells are clonogenic in culture or tumorigenic in vivo (1, 2). Cancers are therefore thought to rely on the activity of stem or stem-like cells that are tumorigenic and exhibit the cardinal properties of self-renewal and multi-lineage differentiation potential.
Stem and differentiated cells within a tumor are reported to differ in sensitivity toward therapy (3). Independent studies have established embryonic stem cell gene expression signatures in which cancer subtypes with poor survival prognosis are enriched in treatment-resistant, stem-like cells. Stem cell signatures associated with poor prognosis have so far been found in glioma, breast, lung, colon and esophageal cancers (4–10). Comparing stem cell populations therefore has the potential to identify new molecular targets for drug and immune therapies that destroy the self-renewing cancer stem cells (CSCs). However, descriptions of stem-like gene and pathway signatures are inconsistent across cancers, platforms, tissues and laboratories.
Driven by a need to understand CSC molecular profiles generated at the Harvard Stem Cell Institute (HSCI), we have developed a platform to integrate CSC experimental information: the Stem Cell Discovery Engine (SCDE; http://discovery.hsci.harvard.edu). We have collected, curated and integrated these data into the SCDE to permit molecular comparisons between normal and cancerous stem cells, between stem-cell compartments in blood, intestine and brain, and between mouse models and human tissues.
SCDE overview
The SCDE is a modular online system designed to handle data submission, curation, analysis, integration and dissemination of stem cell-related experiments. The system has two components: (i) a tissue and cancer stem cell database accessible through the BioInvestigation Index (BII) (11) and (ii) a customized instance of the Galaxy analysis engine (12, 13). It includes tools that integrate public stem cell data with user-submitted experiments. Its initial focus is on gene list manipulation and interaction with the curated Gene Signatures Database (GeneSigDB) (14), Molecular Signatures Database (MSigDB) (15) and WikiPathways pathway database (16). A description of the database in accordance with BioDBCore standards (17) is available in Supplementary Table S1.
Curation of experimental metadata and derived data
The SCDE database provides a source of structured experimental information on assays, derived gene lists and pathway profiles. Heterogeneity in experimental information has been reduced by rigorous, manual curation of the experimental model, cell and tissue types, disease state, surface markers and other relevant data. Submitted user data is first checked for relevance, i.e. studies must be performed using well-defined stem cell, tissue stem cell and/or cancer stem cell populations, and must produce genome-scale data with potential to provide insight into the stem-like characteristics of cancers. All of the raw data with its sample characteristics must be available. Data input fields are then mapped to the ontologies listed in according to species-specificity and overall coverage of the ontology. New terms are submitted to the ontology maintainers for future inclusion. This ensures that new terms are standardized and incorporated for community use. Experimental protocols and analytical methods are annotated with the goal of providing sufficient information to reproduce or perform similar experiments and to derive the processed data. Derived data in the form of gene lists are converted to standardized identifiers to be used for gene list comparisons within Galaxy.
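The identifier standardization step described above can be illustrated with a minimal sketch. It assumes a simple alias-to-symbol lookup table; the table contents, the function name `standardize_gene_list` and the normalization rules (trimming and upper-casing) are hypothetical and are not the GeneSigDB-derived methods used by the SCDE.

```python
# Hypothetical alias table: entries are illustrative only, not the SCDE mapping.
ALIAS_TO_SYMBOL = {
    "POU5F1": "POU5F1",
    "OCT4": "POU5F1",
    "PROM1": "PROM1",
    "CD133": "PROM1",
    "SOX2": "SOX2",
}


def standardize_gene_list(raw_genes, alias_to_symbol=ALIAS_TO_SYMBOL):
    """Return (standardized symbols in first-seen order, unmapped identifiers)."""
    standardized, unmapped, seen = [], [], set()
    for raw in raw_genes:
        # Normalize whitespace and case before the lookup (an assumed rule).
        key = raw.strip().upper()
        symbol = alias_to_symbol.get(key)
        if symbol is None:
            # Keep unmapped identifiers so a curator can review them.
            unmapped.append(raw)
        elif symbol not in seen:
            # Collapse aliases of the same gene into a single entry.
            seen.add(symbol)
            standardized.append(symbol)
    return standardized, unmapped


symbols, unmapped = standardize_gene_list(["Oct4", "POU5F1", " cd133 ", "FOO1"])
print(symbols, unmapped)
```

Returning the unmapped identifiers separately, rather than silently dropping them, reflects the manual review that the curation process implies.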
We store experimental metadata in the Investigation/Study/Assay (ISA-Tab) format, i.e. high-level information about the experiment is recorded in the ‘investigation’ file, sample attributes and factors in the ‘study’ files, and protocols and analysis methods in the ‘assay’ files. This general-purpose tab-delimited grammar manages metadata from diverse studies, and enables users to align with community-defined minimum information, ontologies and checklists (11, 18, 19). It comes with support tools for curation (including semi-automated annotation tagging through the NCBO BioPortal annotation service (20) to speed the process) and format conversion (http://isatab.sourceforge.net) to make it straightforward to submit data to international public repositories, such as the Gene Expression Omnibus (GEO) (21). ISA-Tab is supported and maintained by a global collaboration of biocurators (22). Although the initial cost of curation is high, it yields ISA-Tab configurations developed specifically for stem cell data, which the stem cell community can share and reuse within the various ISA tools. The goal is to build a curation network and establish community involvement so that standards are agreed upon and adopted.
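Because ISA-Tab is a tab-delimited grammar, a ‘study’ file can be read with general-purpose tools. The sketch below parses a small, made-up study table and groups samples by one characteristic. The rows and the helper names `read_study_table` and `samples_by_characteristic` are hypothetical. The column headers follow the usual ISA-Tab naming pattern, but this is not a complete or validated ISA-Tab file.

```python
import csv
import io

# Made-up study table; real ISA-Tab study files carry many more columns.
STUDY_TEXT = (
    "Source Name\tCharacteristics[organism]\tCharacteristics[cell type]\tSample Name\n"
    "donor1\tHomo sapiens\thematopoietic stem cell\tsample1\n"
    "mouse1\tMus musculus\tintestinal stem cell\tsample2\n"
    "mouse2\tMus musculus\tneural stem cell\tsample3\n"
)


def read_study_table(text):
    """Parse tab-delimited study text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def samples_by_characteristic(rows, name):
    """Group sample names by the value of one Characteristics[...] column."""
    column = f"Characteristics[{name}]"
    grouped = {}
    for row in rows:
        grouped.setdefault(row[column], []).append(row["Sample Name"])
    return grouped


rows = read_study_table(STUDY_TEXT)
print(samples_by_characteristic(rows, "organism"))
```

For real submissions, the dedicated ISA tools mentioned above remain the appropriate route, since they also validate the files against the configured checklists.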
Database contents
A primary focus was the selection of studies related to normal stem cells and CSCs, in particular for three model systems: blood, intestine and brain. In these tissues, the behavior of native stem cells is especially well characterized, investigators generally agree on stem cell definitions, and cancer is common. shows the distribution of data across organisms, tissues and types of measurements. The database integrates 53 public studies comprising 1098 molecular assays from CSC-related experiments from multiple tissues, species and heterogeneous platforms. Five additional studies comprising 84 assays are stored as private, unpublished data that are available to specific researchers upon login and are ready for dissemination upon publication. Fifteen studies were contributed by researchers in the HSCI community and an additional 40 studies related to CSC biology were selected from StemBase (23, 24). Forty-six studies were performed in rodent models and 13 in human cells; these include two studies containing samples assayed from both rodent and human models. The database consists largely of microarray expression profiling studies, but results from nucleotide sequencing (i.e. ChIP-seq) studies of histone methylation and transcription factor binding, histology and expression analysis by RT–PCR are also included.
Data acquisition and dissemination
Researchers can submit their own data or suggest public data to a curator, who manually curates it according to community-accepted standards and ontologies. In cases where published studies have associated data deposited in ArrayExpress, the MAGEtoISA converter tool permits rapid conversion from MAGE-TAB to ISA-Tab format, which is then manually evaluated by a curator for completeness and corrected where necessary.
To ensure that all stem cell data are comparable, primary and derived data sets are organized in a standardized manner and disseminated to the public using a local instance of the SCDE BioInvestigation Index (BII). This data repository is designed to support storage, querying and display of multi-omics data sets (11). The annotated metadata allows users to search the entire corpus of experiments in the BII based on organism, measurement type (e.g. transcriptional profiling), technology (e.g. nucleotide sequencing), and platform (e.g. Illumina) or to search free text across all fields (A). Study pages display the details of each experiment (B–D). The annotation has focused on ensuring that cell types, tissues and experimental variables are consistently reported to improve query capabilities, and to establish sound annotation practices to describe stem cell research (e.g. descriptions of genetic modifications).
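The two search modes described above, filtering on annotated fields and free-text search across all fields, can be sketched as follows. This is an illustration of the idea only and does not reflect how the BII implements its queries. The records, field names and the function `search_studies` are hypothetical.

```python
# Hypothetical study records; field names and values are illustrative only.
STUDIES = [
    {"id": "S1", "organism": "Mus musculus", "measurement": "transcription profiling",
     "technology": "DNA microarray", "platform": "Affymetrix"},
    {"id": "S2", "organism": "Homo sapiens", "measurement": "transcription profiling",
     "technology": "DNA microarray", "platform": "Illumina"},
    {"id": "S3", "organism": "Mus musculus", "measurement": "histone modification profiling",
     "technology": "nucleotide sequencing", "platform": "Illumina"},
]


def search_studies(studies, text=None, **filters):
    """Return studies matching every field filter and, if given, the free text."""
    results = []
    for study in studies:
        # Field filters: exact, case-insensitive match on each requested field.
        if any(str(study.get(field, "")).lower() != value.lower()
               for field, value in filters.items()):
            continue
        # Free text: case-insensitive substring match across all field values.
        if text is not None:
            haystack = " ".join(str(v) for v in study.values()).lower()
            if text.lower() not in haystack:
                continue
        results.append(study)
    return results


print([s["id"] for s in search_studies(STUDIES, organism="Mus musculus", platform="Illumina")])
print([s["id"] for s in search_studies(STUDIES, text="microarray")])
```

Exact matching on annotated fields only works reliably when the values are drawn from controlled vocabularies, which is one practical benefit of the ontology-based curation described earlier.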
Published studies are automatically made publicly available. ISA-Tab formatted metadata can be downloaded for information pertaining to the assays, such as normalization procedures for microarray experiments and GEO accession identifiers where available. Raw primary data (e.g. CEL files for Affymetrix microarrays) and processed derived data (e.g. author-generated gene lists) can also be downloaded from the BII using the ‘Raw Data’ and ‘Processed Data’ buttons (E). Alternatively, the data can be accessed within the SCDE Galaxy framework for analysis as described in the following section. Researchers with the appropriate access permissions can query unpublished data to perform early analyses, and upon publication, have the added benefit of exporting their ISA-Tab formatted data for submission to ArrayExpress using the conversion tools. The corresponding functionality for submission to GEO in MiniML format is in progress and will represent a valuable incentive for the stem cell community to use the SCDE as a first port of call for submission of CSC functional genomics data.
Querying CSC molecular signatures using Galaxy
In addition to querying experimental metadata, the SCDE provides functionality to interrogate stem cell molecular profiles in a linked Galaxy instance, with the goal of identifying similarities and differences between normal and cancer stem cell experiments. All raw and processed data stored in the BII and several additional manually curated stem cell-related gene lists are accessible from within Galaxy for analysis.
Manual curation and consistent identifier conversion differentiate the SCDE from other gene list comparison tools. Derived gene lists have been mapped to standardized gene symbols using methods developed for GeneSigDB. Such standardization allows for comparisons to determine genes that are shared or unique across experiments. Tools are available to compare a single gene list (SCDE ListMatch) or multiple gene lists (SCDE ListCompare) against curated gene signatures in GeneSigDB, molecular signatures in MSigDB, derived gene lists from the SCDE database and pathways in WikiPathways. These tools allow users to identify genes in common with defined reference signatures and pathways. Results are summarized and ranked according to a hypergeometric test P-value and linked to the relevant overlapping gene sets (B). For WikiPathways comparisons, a link is provided to visualize the gene matches in color-coded diagrams of canonical pathways (C). The SCDE Intersect tool identifies genes that are common to multiple gene lists. By using the Galaxy interface, users can maintain a record of their analysis history and easily compare multiple data sets stored in their history.
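A minimal sketch of how gene list overlaps might be ranked by a hypergeometric test, and how genes common to several lists can be found, is given here. The source names only a "hypergeometric test P-value", so the one-sided upper-tail formulation, the size of the gene universe and all function names below are assumptions rather than the ListMatch, ListCompare or Intersect implementations.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, reference_size, universe_size):
    """Upper-tail probability of seeing at least `overlap` shared genes by chance."""
    total = comb(universe_size, query_size)
    upper = min(query_size, reference_size)
    tail = sum(
        comb(reference_size, i) * comb(universe_size - reference_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def rank_signatures(query, references, universe_size):
    """Rank reference gene sets by the P-value of their overlap with the query."""
    query = set(query)
    ranked = []
    for name, genes in references.items():
        genes = set(genes)
        shared = sorted(query & genes)
        p = hypergeom_pvalue(len(shared), len(query), len(genes), universe_size)
        ranked.append((name, shared, p))
    # Smallest P-value (least likely by chance) first.
    return sorted(ranked, key=lambda item: item[2])


def intersect_lists(*gene_lists):
    """Genes present in every one of the supplied lists."""
    sets = [set(genes) for genes in gene_lists]
    return sorted(set.intersection(*sets)) if sets else []


# Made-up gene sets and an assumed universe of 20 genes, for illustration only.
references = {"sig1": ["A", "B", "C", "D"], "sig2": ["X", "Y"]}
for name, shared, p in rank_signatures(["A", "B", "C"], references, universe_size=20):
    print(name, shared, p)
print(intersect_lists(["A", "B", "C"], ["B", "C", "D"], ["C", "B", "Z"]))
```

The choice of gene universe strongly affects the P-values, so any real comparison would need to fix it explicitly, for example to the set of genes measurable on the platform.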