The development of high-throughput genomic and postgenomic technologies has transformed approaches to data handling and processing (1). One biological sample might be used to generate many kinds of “big” data in parallel, such as genome sequence (genomics), patterns of gene and protein expression (transcriptomics and proteomics), and metabolite concentrations and fluxes (metabolomics). Extensive computational processing is required for even basic analyses of such data; the challenges mount further when the outputs of two or more studies must be compared or integrated.
Grassroots movements (2-5), efforts such as the Science Commons, which is initiating an open-access data protocol (6), and top-down (funder-led) efforts (see table, page 235) have led to a range of policies for data management and sharing. A recent European Science Foundation consultation exercise confirmed that most funding agencies in European countries lack explicit, well-documented data-sharing policies (7). If we are to avoid squandering the immediate and longer-term value of big data, a focused strategy is essential.
Early policies were driven by the need to manage long-term data sets (those accrued over 30 or more years), such as those in the social and environmental sciences. More recently, policies have emerged in response to increased funding for high-throughput approaches in major 'omics fields. The European Commission has invited member states to develop policies on access to, dissemination of, and preservation of scientific knowledge and data (8).
Beyond public and private funding agencies, regulatory agencies such as the U.S. Food and Drug Administration (FDA) (9), European Medicines Agency (EMEA) (10), and U.S. Environmental Protection Agency (EPA) (11) are also working to define guidelines to facilitate electronic submission of traditional and 'omics data types. These, as well as industry guidelines, are beyond the scope of this document, but much could be learned from an exchange of ideas and practices (12).
The policies listed here share common principles. All aim to protect the value of cumulative data outputs, and all recognize data as a public good and data sharing as a way to accelerate subsequent exploitation. On a practical level, all acknowledge data providers' right of first use and right to appropriate accreditation. Likewise, these policies have been generated through the same basic process (table S1) (13).
Despite these commonalities, heterogeneity remains, as expected given the different communities each funder serves and the data types those communities generate. Care must be taken, though, that these differences do not impede seamless interoperability. The path a funding agency takes in supporting its data policy largely reflects the relative emphasis placed on managing versus sharing data. A focus on managing is often accompanied by investment in institutional infrastructure. Such centralization provides economy of scale, institutional memory, and reusable capability, but it also incurs a substantial direct cost that may compete with research funding (14). The UK Natural Environment Research Council (NERC) sustains a system of national data centers and has invested in the NERC Environmental Bioinformatics Centre (NEBC) to cover 'omics data (15, 16). Similarly, the UK Economic and Social Research Council provides a central data service for social scientists (17). Policies that focus on sharing tend to place more responsibility on researchers. For example, the UK Biotechnology and Biological Sciences Research Council (BBSRC) supports its data-sharing policy through funds that allow researchers to develop their own solutions from the bottom up.
Massive-scale raw data must be highly structured to be useful to downstream users. Standardized solutions are increasingly available for describing, formatting, submitting, and exchanging data (18, 19). These reporting standards include minimum information checklists, ontologies, and file formats. Minimum information checklists are simple, structured documents that reflect the consensus view of a community on the information to report about particular kinds of biological studies or instrument-based assays. Ontologies provide the controlled terms needed to describe those minimum information requirements. File formats define a shared syntax to transmit and exchange standardized information.
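To make the division of labor concrete, the sketch below (in Python, using hypothetical field names and ontology identifiers rather than any published checklist or ontology) illustrates how the three layers interact: the checklist specifies which fields a submission must report, the ontology constrains their values to controlled terms, and the file format supplies the exchange syntax.

```python
# Minimal sketch of a reporting standard's three layers; the field names and
# ontology identifiers below are illustrative, not drawn from a real standard.
import json

# Checklist: the minimum set of fields a submission must report.
REQUIRED_FIELDS = {"sample_id", "organism", "assay_type", "instrument"}

# Ontology: controlled terms that constrain free-text values so that
# records from different laboratories can be compared.
CONTROLLED_TERMS = {
    "organism": {"NCBITaxon:9606", "NCBITaxon:10090"},
    "assay_type": {"OBI:0000070", "OBI:0000366"},
}

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record complies."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
    for field, allowed in CONTROLLED_TERMS.items():
        if field in record and record[field] not in allowed:
            problems.append(f"{field} is not a recognized ontology term")
    return problems

record = {
    "sample_id": "S-001",
    "organism": "NCBITaxon:9606",   # controlled term for the study organism
    "assay_type": "OBI:0000070",    # controlled term for the assay class
    "instrument": "sequencer-X",
}

if not validate(record):
    # File format: serialize to an agreed syntax (JSON here) for exchange.
    print(json.dumps(record, indent=2))
```

In practice, such validators are produced and maintained by the standards communities themselves and are bound to real checklists and ontologies; the sketch shows only how the three layers fit together.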
Data sharing, and the good annotation practices it depends on, must become part of the fabric of daily research for researchers and funders.
There is now an escalating number of community-developed checklist, ontology, and file-format projects, a positive sign of community engagement. But this proliferation brings with it new sociological and technological challenges: ensuring interoperability and avoiding unnecessary overlap and duplication of effort. These projects largely focus on a particular technology or a specific biological knowledge domain (e.g., ontologies for anatomy, gene functions, or the environment) and are by nature fragmented and not designed to be interoperable. A range of activities are fostering harmonization and consolidation of these standards for checklists (5), ontologies (4), and representation of information in electronic formats (2, 3).
Many large coordinative initiatives (20-23) are working to address the problem of archiving and integrating data. The ELIXIR project (22) aims to construct and operate a common, sustainable bioinformatics research infrastructure to support the life sciences across Europe. The Infrastructure for Spatial Information in the European Community (INSPIRE) directive requires that European geospatial data be bound together into portals (23). Also widely useful are initiatives such as the Digital Curation Centre (DCC), which tracks data standards, documents best practice, and has published a data life-cycle model to underpin long-term data-preservation policies (24).