Since the Human Genome Project concluded in 2003, international funding agencies, particularly the National Institutes of Health (NIH), have continued to focus on large-scale, community resource projects such as HapMap (1), 1000 Genomes (2), the ENCODE pilot (3) and many others. Included in this effort are model organism-specific projects, beginning with the sequence of the first multicellular organism, Caenorhabditis elegans, published in 1998 (4), which was quickly followed by Drosophila melanogaster in 2000 (5). Ultimately, the aim of all such large-scale projects is to provide resources for the greater research community. These projects almost always require a centralized Data Coordination Center (DCC) where the entirety of the data is integrated, undergoes quality control checks and is distributed to the community with sufficient experimental detail to be clear and useful.
The nature and composition of each large-scale project impose considerations that affect the data collection strategy employed by any particular DCC. Three major influences are the number of contributing laboratories, their geographic distribution and the number of different data types and protocols involved. The number of contributing laboratories may vary from a handful [the Drosophila genome primarily involved three labs (5)] to dozens (e.g. The Cancer Genome Atlas Project; http://cancergenome.nih.gov/wwd/program). In addition, geography can impose network bandwidth constraints for transferring and locating data, and time zone differences may constrain communications between groups. Furthermore, the data types generated may be homogeneous (e.g. HapMap produced SNPs using a limited number of protocols) or highly variable (e.g. ENCODE is using an eclectic assortment of assays to identify many different genomic features). In all cases, a project’s DCC must handle large quantities of data, ranging from a few hundred gigabytes to petabytes.
The model organism Encyclopedia of DNA Elements (modENCODE) initiative is designed to characterize the genomes of D. melanogaster and C. elegans. As a resource, modENCODE serves the model organism research communities, and complements the related human ENCODE project (http://www.genome.gov/10005107), with the ultimate objective of advancing comparative genomics. The consortium comprises 11 research projects: 4 projects for worm, 6 for fly and 1 contributing to both organisms. The modENCODE project was initially funded for 4 years, but has since been extended to 5 years. Of the approximately $17.5M annual budget (excluding supplemental funding), 55% supports D. melanogaster efforts, 30% supports C. elegans efforts and the remaining 15% is split equally between the DCC and the Data Analysis Center (DAC). These projects represent 52 different data production laboratories at 33 different research institutions in the USA, Canada and the UK. Even within the DCC, with three contributing institutions, geographic location is a consideration. The DCC principal investigator and three staff members (data liaisons and GBrowse development) are located in Toronto, Canada; one co-PI and four staff members (pipeline, data integration and liaisons) are in Berkeley, California; and a second co-PI and three staff members (modMine) are in Cambridge, UK. The DCC staff is charged with tracking, integrating and promptly making available to the research community all modENCODE data generated for the two organisms being studied. The worm and fly genomes are only 97 and 165 million base pairs, respectively, and are small in comparison to the human genome and to the data volumes likely to be produced by the 1000 Genomes or cancer genome projects. Thus, by volume, modENCODE is considered a medium-sized (10 terabyte) project.
Of the three factors described above, the most significant challenge for ENCODE and modENCODE is the diversity of feature types coming from the participating laboratories [e.g. transcription factor (TF) binding site characterization, mRNA transcription levels, ncRNAs, stage-specific gene models, chromatin states and DNA replication control], multiplied by the use of a wide variety of methods and platforms. This is further complicated for the modENCODE DCC by the need to accommodate and integrate data from two organisms. In addition, each participating laboratory must take advantage of cutting-edge technologies, and consequently data production often pushes the envelope of contemporary data storage capacity, requiring a DCC to keep pace.
The metadata challenge
In the context of these operational requirements, the modENCODE DCC’s overarching objective is to ensure that the community is provided with knowledge of the experimental conditions, protocols and verification checks used to generate each data set so that the corpus can be effectively used in future research. Perhaps the greatest challenge in making the large and diverse body of data available to the greater community is providing easy lookup of relevant submissions. Beyond a basic species-specific query, the types of questions that we want the community to be able to ask include: ‘What submissions use the Oregon-R strain?’, ‘Which transcription factor antibodies were produced in a rabbit host?’, ‘Find only those experiments where worms were grown at 23°C’, ‘Find the genomic regions expressed only during pupal stages’, etc. However, an interface is only useful if queries return all relevant results. The factors most critical to making such queries possible are uniformity in data representation, and the completeness and specificity of the associated metadata.
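To make the flavor of such metadata-driven queries concrete, the following is a minimal sketch in Python over a hypothetical in-memory list of submission records; the field names (strain, antibody_host, growth_temp_c) and the records themselves are illustrative assumptions, not the DCC's actual schema.

```python
# Hypothetical submission records; field names and values are illustrative only.
submissions = [
    {"id": 101, "organism": "D. melanogaster", "strain": "Oregon-R",
     "assay": "ChIP-seq", "antibody_host": "rabbit", "growth_temp_c": 25},
    {"id": 102, "organism": "C. elegans", "strain": "N2",
     "assay": "RNA-seq", "antibody_host": None, "growth_temp_c": 23},
]

def find(records, **criteria):
    """Return the records whose metadata match every supplied key/value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

# 'What submissions use the Oregon-R strain?'
print(find(submissions, strain="Oregon-R"))

# 'Find only those experiments where worms were grown at 23°C.'
print(find(submissions, organism="C. elegans", growth_temp_c=23))
```

Such queries only return complete results if every submission records its strain, antibody host and growth temperature in the same agreed-upon form, which is precisely why uniform representation and complete, specific metadata matter.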
Metadata standards have long been recognized for their utility in making experiments more understandable and integrative. For example, Minimum Information About a Microarray Experiment (MIAME), in conjunction with the Microarray and Gene Expression Data (MGED) ontology, has become the standard for describing microarray experiments in the major data repositories, including Gene Expression Omnibus (GEO), ArrayExpress (AE), the Short-Read Archive (SRA) and the National Center for Biotechnology Information (NCBI) (6). However, despite the existence of a standard ontology, each repository imposes its own level of ‘control’ on its MIAME-compliant data. AE takes a more controlled approach to collecting metadata, and many of the required MIAME items are specified through controlled vocabulary (CV) terms from the MGED ontology (7). NCBI, on the other hand, has taken a looser approach; its MIAME metadata is collected in free-text form. The benefit of the more controlled approach is that the resulting metadata is more uniform and more amenable to computational reasoning. The drawback is that specifying the metadata may not be quick or easy, since many biologists are unfamiliar with the CVs or ontologies used. NCBI’s approach presents a much lower barrier to entry, which they suggest encourages a high rate of deposition (8); however, the freedom of expression that comes with free text results in less consistent, and often underspecified, descriptions of the experimental details (9).
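As a toy contrast between the two approaches (the records below are invented for illustration and do not use actual MGED ontology terms), the same sample could be described with controlled-vocabulary fields or as free text:

```python
# Invented example records; neither reflects an actual GEO or ArrayExpress submission.
controlled = {
    "organism": "Drosophila melanogaster",        # one canonical spelling from a CV
    "developmental_stage": "third instar larva",  # term drawn from a shared vocabulary
    "antibody_host": "rabbit",
}

free_text = {
    "description": "ChIP on 3rd-instar fly larvae using a rabbit antibody",
}

# The controlled record can be filtered and aggregated directly ...
rabbit_hosted = [r for r in [controlled] if r.get("antibody_host") == "rabbit"]

# ... whereas the free-text record must be parsed, synonyms reconciled
# ('3rd-instar' vs 'third instar'), and missing fields tolerated.
```

The controlled form supports exact, complete retrieval at the cost of requiring submitters to learn the vocabulary; the free-text form shifts that burden onto every downstream user.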
With the success of MIAME, many additional ‘Minimum Information’ standards groups followed, collected together under the umbrella of the Minimum Information for Biological and Biomedical Investigations (MIBBI) Foundry (10). Of particular relevance is the Minimum INformation about a high-throughput SEQuencing Experiment (MINSEQE) (http://www.mged.org/minseqe/), although this proposal is still in draft form and does not yet have a concrete specification.
The NGS challenge
The modENCODE DCC's efforts to standardize its metadata collection were complicated by the rapid shift to next-generation sequencing (NGS) that occurred just as the project was getting underway. At the beginning of the modENCODE project, NGS throughput had begun an exponential rise that continues to this day, but GEO was only just starting to accept short-read data and the SRA was not yet up and running. Anticipating the change in technology usage, the modENCODE DCC began preparing to accept and process high-throughput NGS data. To this end, we created a concrete realization of the MINSEQE standards for the modENCODE project.
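The draft MINSEQE guidelines call for, roughly, a description of the experiment and its variables, the biological samples, the raw sequence reads, the processed (mapped or derived) data and the protocols used. A minimal sketch of what one such record might look like, together with the kind of completeness check a submission pipeline could run, is shown below; the field names, file names and required-field list are assumptions for illustration, not the concrete format the DCC adopted.

```python
# A hypothetical MINSEQE-style record; field and file names are illustrative,
# not the modENCODE DCC's actual submission format.
minseqe_record = {
    "experiment_description": "ChIP-seq of a transcription factor in embryos",
    "samples": [
        {"organism": "Drosophila melanogaster", "strain": "Oregon-R",
         "stage": "embryo 0-4 h", "replicate": 1},
    ],
    "experimental_variables": ["antibody", "developmental stage"],
    "raw_reads": ["reads_rep1.fastq"],        # unprocessed sequence reads
    "processed_data": ["peaks_rep1.gff3"],    # mapped or derived results
    "protocols": ["chromatin IP protocol", "library preparation protocol"],
}

# A simple completeness check a submission pipeline might apply automatically.
required = ["experiment_description", "samples", "raw_reads", "processed_data", "protocols"]
missing = [field for field in required if not minseqe_record.get(field)]
assert not missing, f"submission is missing required metadata: {missing}"
```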
From discussions with the ENCODE group and the experiences reported by AE, we knew that collecting metadata would be one of the largest challenges we faced. To support the types of queries mentioned above, the modENCODE DCC devoted considerable time and attention to the metadata collection process. This would require active collaboration with the data providers by biologically trained staff knowledgeable in the experimental techniques, data types, data formats and software that would be employed. Additionally, we knew the volume of data submitted would necessitate scalability and as much automation of the data quality control process as possible, yet the diversity of experiment and data types would require flexibility and swift responses to changing requirements; these two demands are often incompatible.
An effective consortium DCC must make a large volume of data readily accessible to the research community as soon as the data are experimentally verified. To respect the research objectives of the data producers, resource users are encouraged to observe a 9-month waiting period. During this time, they may freely use the modENCODE data in their own research programs, but must defer publication until either the waiting period has elapsed or they have conferred with, and obtained agreement from, the original producers. (The modENCODE data release policy is available at http://www.genome.gov/27528022). We present here several principles in the design of the modENCODE DCC and our approach to collecting, storing and cataloging data. We describe the ramifications of collecting thorough and deep metadata for describing experiments. The lessons we have learned are applicable both to large data centers and to small groups looking to host data for the broader community.