As the size and complexity of scientific datasets and the corresponding information stores grow, standards for collecting, describing, formatting, submitting and exchanging information are playing an increasingly active role. Several initiatives occupy strategic positions in the international scenario, both within and across domains. However, the job of harmonising reporting standards is still very much a work in progress; both software interoperability and data integration remain challenging as things stand.
The status quo with respect to standardization initiatives is summarized here, with particular emphasis on the motivation for, and the challenges of, ongoing synergistic activities amongst the academic community focused on the creation of truly interoperable standards.
Groups generating standards should engage with ongoing cross-domain activities to simplify the integration of heterogeneous data sets to the greatest possible extent.
In the area of life science, the cycle of data generation and processing is being vastly accelerated by the development of high-throughput experimental methods associated with genomic and post-genomic technologies (e.g. genomics, transcriptomics, proteomics, and metabolomics, hereafter referred to as ‘omics’). Biological and biomedical studies commonly range from simple single-assay studies to complex multi-assay studies. As an example of the latter type, consider the reporting of a study looking at the effect on a number of subjects treated with different drugs by characterizing the metabolic profile of their urine (by mass spectroscopy), measuring protein and gene expression in the liver (by mass spectrometry and DNA microarrays, respectively), and conducting conventional histological analysis. Omics studies are information intensive, and to record their complex structure it is necessary to define and capture the experimental metadata, including experimental design, sample source(s) and treatment(s), the preparation of the sample for the analytical assay, the processes and instruments used throughout, and the final data. It is widely recognized that capturing experimental metadata at this level of granularity is required both to correctly interpret the results that they contextualize and to enable efficient data sharing.
Many groups have risen to this challenge; standards for collecting, describing, formatting, submitting and exchanging both the data and metadata from such complex studies are either under development or have been released. Currently, several standards initiatives occupy strategic positions in the international scenario, largely falling into two groups identifiable by the needs of their respective user communities.
One group of initiatives is driven by regulatory frameworks, and often supported by accredited (de jure) Standards Developing Organizations (SDOs). Most significantly, these efforts focus on the Voluntary eXploratory Data Submissions (VXDS) and electronic data submission programs of the US Food and Drug Administration (FDA) [2–4] or on initiatives by other governmental agencies, such as the US Environmental Protection Agency (EPA). These initiatives also include long-standing efforts in the clinical and non-clinical domains alongside more recent activities in the pharmacogenomics area that add complex omics technologies to biomedical studies.
A second group of initiatives, addressing particular (omics or other) technologies or defined domains of application (e.g. systems biology, pathways), has emerged from the academic community, in many cases with the support of commercial organizations such as instrument vendors. Such initiatives (e.g., [8–14]) are focused on supporting tool interoperability and data exchange among public and proprietary systems through the development of three kinds of (de facto) reporting standards: ‘minimum information’ checklists (specifications of data set content, however encoded), ontologies (semantics) and file formats (syntax). Minimum information checklists are easy-to-read, structured documents that reflect the consensus view of the essential pieces of information that should be reported; ontologies provide the semantics needed to describe the minimum information requirements, and file formats provide the syntax to transmit and exchange them. A submission tool that combines these three kinds of reporting standards should, for example, guide researchers through the process of meeting the reporting requirements made by a given minimum information specification, enable straightforward practical use of ontology terms, and export the collected information in a standard format to a given database.
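The division of labour between the three kinds of reporting standard can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the checklist fields, the `EX:` term identifiers and the tab-delimited output layout are hypothetical and do not reproduce any published checklist, ontology or file format.

```python
import csv
import io

# Hypothetical 'minimum information' checklist (content):
# the fields a report must contain. Not taken from any real checklist.
CHECKLIST = ["organism", "sample_source", "treatment", "assay_technology"]

# Hypothetical controlled vocabulary standing in for an ontology (semantics):
# lower-cased free-text label -> made-up term identifier.
ONTOLOGY = {
    "homo sapiens": "EX:0001",
    "liver": "EX:0002",
    "urine": "EX:0003",
    "dna microarray": "EX:0004",
    "mass spectrometry": "EX:0005",
}


def missing_fields(record):
    """Return the checklist fields that are absent or empty in the record."""
    return [f for f in CHECKLIST if not str(record.get(f, "")).strip()]


def annotate(record):
    """Pair each value with a term identifier, or None if no term matches."""
    return {
        field: (value, ONTOLOGY.get(str(value).strip().lower()))
        for field, value in record.items()
    }


def export_tsv(annotated):
    """Serialise the annotated record as tab-delimited text (syntax)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["field", "value", "term_id"])
    for field in CHECKLIST:
        value, term = annotated.get(field, ("", None))
        writer.writerow([field, value, term or ""])
    return buffer.getvalue()


def submit(record):
    """Check content, attach semantics, then export in a fixed syntax."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError("missing required fields: " + ", ".join(gaps))
    return export_tsv(annotate(record))


if __name__ == "__main__":
    example = {
        "organism": "Homo sapiens",
        "sample_source": "Liver",
        "treatment": "drug A",
        "assay_technology": "DNA microarray",
    }
    print(submit(example))
```

In this sketch the three concerns are kept in separate functions, so that a different checklist, vocabulary or output format could in principle be substituted without touching the other two; a real submission tool would of course rely on the published standards rather than on these placeholders.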
Domain-specific initiatives are regarded as important because they address ‘real world’ data reporting requirements. Unfortunately, focusing on particular communities’ interests or technologies leads to duplication of effort. More seriously, the development of (largely arbitrarily) different standards severely hinders data integration. Nowadays researchers are able to perform multi-assay studies where the same sample is run through the full range of ‘omics’ and conventional technologies, in combination. In this specific case, it is critical that the reporting standards are designed to be interoperable and fit neatly together like the pieces of a jigsaw, with users being able to take the pieces that are relevant to report their study. This fragmentation severely hinders the interoperability of the databases and tools that implement such reporting standards, as illustrated by ArrayExpress and PRIDE, two EBI public repositories for transcriptomics and proteomics data respectively. These systems implement (non-interoperable) standards applicable only to their own ‘omics’ domains. Consequently, users have to deal with different submission formats and with diverse representations of the metadata and terminologies when depositing their datasets in these systems, and similarly when downloading other datasets. Such fragmentation has a strong impact on the user community, particularly by hampering deposition and integration of multi-assay studies.
Fortunately, a number of initiatives within the academic community aim to foster the harmonization and consolidation of the three kinds of reporting standard described above:
Content: Twenty-seven groups now participate in the Minimum Information for Biomedical or Biological Investigations (MIBBI) project, which offers a one-stop shop for those exploring the range of extant ‘minimum information’ checklists (such as MIAME) and fosters their collaborative, integrative development.
Semantics: More than 70 groups participate in the Open Biological Ontology (OBO) Foundry. The objective of the project is to encourage the development of orthogonal, interoperable ontologies.
Syntax: Several groups participate in the Functional Genomics Experiment (FuGE) project to develop a single generic data model that will underpin a variety of XML-based file formats by providing a common framework. Recently, a (growing) number of communities have begun a complementary initiative to collaboratively develop Investigation/Study/Assay (ISA-TAB), a tabular framework for presenting experimental metadata that uses a reference system to complement existing biomedical formats such as the Study Data Tabulation Model (SDTM).
These integrative cross-domain reporting standards are implemented by the BioInvestigation Index, a new prototype infrastructure at EBI set to provide users with a common structured representation and (public) storage mechanism for a variety of studies. Although it relies on EBI production systems such as ArrayExpress and PRIDE, the BioInvestigation Index shields users from the diverse reporting standards by implementing the MIBBI, OBO Foundry and ISA-TAB synergistic efforts in its annotation and submission tool.
To achieve interoperability from a technical perspective, ‘meta’ standardization projects such as MIBBI, OBO Foundry and ISA-TAB help by (i) resolving overlaps between domain-specific standards and (ii) plugging gaps where they exist. It is also anticipated that some reporting standards will be more mature (‘ready’ to be integrated) than others, particularly because development takes time and ‘buy-in’ both from potential users and from those that govern them (journals, funders, regulators). These are technically complex, but demonstrably tractable, tasks. By contrast, the sociological barriers facing these kinds of large-scale collaborations can be far more challenging, mandating extensive liaison between communities.
Managing the process of consensus building from start to finish takes time and expertise. However, the time participants can dedicate to these projects is chronically limited due to lack of financial support. The massively collaborative nature of such undertakings requires frequent face-to-face workshops to create the necessary conditions for the building of consensus. Unfortunately, for the initiatives that have emerged from the academic community, this is difficult without central grants or with limited funds. Despite this chronic resource limitation, the lack of standardization is so problematic for researchers and those who support them, repeatedly proving to be a significant bottleneck in the collection, sharing and integration of data, that both developers and potential users continue to participate on an almost exclusively voluntary basis.
Two stakeholders have pivotal roles to play as enablers. Journals increasingly require compliance with appropriate consensual reporting standards, contingent on the availability of appropriate software and public repositories [26–28]. Consistent reporting has a positive and long-lasting impact on the value of collective scientific outputs. This has also been recognized by funding agencies, which are increasingly playing an active role in the strategic stewardship of omics data, often through the development of data policies encouraging the use of (existing) standards and public standards-compliant repositories for data collection and management.
This paper has illustrated the growing number of standards and the complexity facing those attempting to use them, for example, to report or integrate datasets from multiple domains. We have also indicated the existence of a number of synergistic projects seeking to simplify the process of integrating reporting standards, where possible. Of course, this is not an exhaustive list; several coordinative infrastructure initiatives work to address the problem of sharing and archiving large amounts of data, according to common standards (e.g., [30–32]).
There are many benefits accruing from the development and acceptance of reporting standards. For example, limiting the range and variability of standards brings down the development and maintenance costs of standards-compliant products for commercial and academic software developers. This results in more appropriate resources for the biomedical and scientific community, making the job of capturing, annotating, integrating, sharing and exploiting (meta)data simpler, increasing the prima facie value of the data to others (secondary users) and, by extension, increasing the return on the investment of (public) funds that supported their generation.
Above all, ‘top-down’ coordination is needed to help bring these standardization efforts closer together and so address the fragmentation issue. Although regulatory- or biomedical-driven initiatives have far stricter guidelines than academia, much could be learned from an exchange of ideas and practices between these sectors.