Content in the Epigenomics resource is derived primarily from data originally submitted to archival databases at the NCBI, specifically the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA) (15). Although originally established to warehouse gene expression data, GEO has become a general-purpose database for molecular abundance data from a wide variety of experiment types, including those aimed at epigenomics. In addition to the raw abundance data, extensive metadata may be provided with GEO submissions to fully describe the biological and experimental context. In recent years, molecular abundance has increasingly been measured with next-generation sequencing-based approaches. SRA serves as a repository for raw next-generation sequence data, together with detailed information on the sequencing instrument and other experimental variables.
In constructing the new Epigenomics database, we have identified the subset of GEO and SRA data that pertain to epigenomics, subjected them to additional review, and reorganized them in a fashion that is more useful for epigenomics researchers. In many cases, data producers provide WIG files as part of their GEO submissions, which allows the files to be used directly in the Epigenomics resource. However, because some submissions either lack WIG files or have files based on older genome assemblies, we have developed a pipeline for generating epigenetic tracks that primarily uses the processed output from the Bowtie aligner (17). The Epigenomics database currently has data tracks for epigenetic features, including histone modifications, DNA methylation, chromatin accessibility and expression of small non-coding RNAs. Data are also available for several chromatin-associated factors such as histone-modifying enzymes, transcription factors and components of the core transcriptional machinery. More of these data types will be included as they become available. Furthermore, gene expression data for relevant biological samples will also be included in Epigenomics.
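The text does not describe the internals of the track-generation pipeline, so the following is only a minimal sketch of the general idea of turning aligned read positions into a WIG track. It assumes that read start positions (0-based) have already been extracted from the aligner output for one chromosome; the function name `reads_to_wig` and the 25-bp bin size are hypothetical choices for illustration, not details of the actual pipeline.

```python
from collections import Counter


def reads_to_wig(read_starts, chrom, bin_size=25):
    """Bin 0-based read start positions into fixedStep WIG lines.

    Hypothetical helper: the real pipeline's binning and
    normalization steps are not described in the text.
    """
    # Count how many reads start in each fixed-width bin.
    counts = Counter(pos // bin_size for pos in read_starts)
    if not counts:
        return []
    last_bin = max(counts)
    # WIG coordinates are 1-based, so the first bin starts at position 1.
    lines = [f"fixedStep chrom={chrom} start=1 step={bin_size} span={bin_size}"]
    # Emit one value per bin, including empty bins, up to the last occupied bin.
    lines.extend(str(counts.get(b, 0)) for b in range(last_bin + 1))
    return lines


if __name__ == "__main__":
    for line in reads_to_wig([0, 10, 30, 99], "chr1"):
        print(line)
```

A production pipeline would typically also extend reads to the estimated fragment length and normalize for sequencing depth; those steps are omitted here.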
The two fundamental types of database records are ‘studies’ and ‘samples’, both of which are assigned unique accession numbers (with prefixes ESS and ESM, respectively). A study refers to one or more experiments with a common set of scientific aims. Most often an epigenomic study will correspond to a publication or to a publicly available data set. For each study, there is a brief summary of the scientific design, together with a listing of the biological samples that were studied and the epigenetic features examined. Full data source information is provided, including the submitter’s institution, links to the original data submissions in GEO and SRA, links to literature citations in PubMed and (where available) links to the full-text articles in PubMed Central.
Each study is associated with a collection of samples. A sample corresponds to the biological material that was examined and includes detailed biological source attributes with values drawn from controlled vocabularies. To unify and consistently assign biological attributes, extensive manual curation is performed using the submitted metadata as a foundation. This process may include examining the primary literature and researching online repositories of cell lines, mouse strains and tissues. There are over 20 biological attribute fields available, including strain, cultivar, ecotype, individual, gender, age, developmental stage, cell line, cell type, tissue type, health status and many others.
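The relationship between studies and samples can be illustrated with a small data model. This is a sketch only: the class names, field names and the assumption that an accession is the prefix followed by digits are hypothetical, since the text specifies only the ESS and ESM prefixes and not the full accession format or the database schema.

```python
import re
from dataclasses import dataclass, field

# Assumed format: prefix (ESS or ESM) followed by one or more digits.
ACCESSION_RE = re.compile(r"^(ESS|ESM)(\d+)$")


def record_type(accession):
    """Return 'study' or 'sample' based on the accession prefix."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an Epigenomics accession: {accession!r}")
    return "study" if match.group(1) == "ESS" else "sample"


@dataclass
class Sample:
    accession: str
    # Curated biological attributes, e.g. {"cell type": "...", "strain": "..."}.
    attributes: dict = field(default_factory=dict)


@dataclass
class Study:
    accession: str
    summary: str
    samples: list = field(default_factory=list)

    def attribute_values(self, name):
        """Sorted distinct values of one attribute across the study's samples."""
        return sorted(
            {s.attributes[name] for s in self.samples if name in s.attributes}
        )
```

Storing attributes as a mapping reflects the open-ended set of more than 20 attribute fields; a real schema would presumably validate values against the controlled vocabularies.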
The Epigenomics database was first released in June 2010. As of this writing, it contains 69 studies, 337 samples and over 1100 data tracks from five well-studied species (). Data tracks are currently available for global expression of micro and small RNAs (194 tracks), histone modifications (626 tracks), DNA methylation (128 tracks), chromatin accessibility (60 tracks) and various chromatin-associated factors, including RNA polymerase, transcription factors and histone-modifying enzymes (140 tracks), among others.
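As a quick consistency check, the per-category track counts quoted above can be summed and compared with the stated total of over 1100 tracks. The dictionary keys are shorthand labels chosen for this sketch; the counts are taken directly from the text.

```python
# Track counts per feature category, as quoted in the text.
track_counts = {
    "micro and small RNA expression": 194,
    "histone modifications": 626,
    "DNA methylation": 128,
    "chromatin accessibility": 60,
    "chromatin-associated factors": 140,
}

total = sum(track_counts.values())
print(total)  # 1148, consistent with "over 1100 data tracks"
```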
Epigenomics database current record holdings^a