Open discussion of ideas and full disclosure of supporting facts provide the bedrock for scientific discourse and new developments. Traditionally, this has been formally accomplished through published papers, in which both the salient ideas and the supporting facts are combined in a single discrete ‘package’. With the advent of methods for large-scale and high-throughput analyses, the generation and transmission of the underlying factual information – the data – are being transformed in an electronic process that involves submitting and retrieving information to and from scientific databases. For most projects, the standard requirement is that all relevant data must be made available at a publicly accessible site at the time of a paper’s publication [1].
One of the significant lessons from the Human Genome Project (HGP) was the recognition that making data broadly available prior to publication can be profoundly valuable to the scientific enterprise and lead to public benefits. This is particularly the case when there is a community of scientists that can productively use the data quickly – beyond what the data producers could accomplish themselves in a similar time period, and sometimes for scientific purposes that were not anticipated at the onset of the project. The principles for rapid release of genome-sequence data by the HGP were first formulated at a 1996 meeting held in Bermuda; these were then implemented as policy by several research funding agencies. In exchange for ‘early release’ of their data, the sequencing groups can request the right to be the first to describe and analyze their complete datasets in peer-reviewed publications. The human genome sequence [2] was the highest profile dataset rapidly released before publication, with assembled sequence data released within 24 hours of generation by each member of the consortium of international sequencing centers. This experience ultimately demonstrated that the broad and early availability of sequence data greatly benefited life sciences research by leading to many new insights and discoveries.
Recognizing that (1) advances in DNA sequencing technologies that allow massive datasets to be produced by an ever-growing number of laboratories have created a need to update policies related to the release of these data, and (2) extending early data release policies to other types of large datasets can be beneficial, a diverse and international group of scientists, ethicists, lawyers, journal editors, and representatives from funding agencies met in Toronto in May 2009, at a Data Release Workshop convened by Genome Canada and other international agencies. By design, the Toronto meeting continued discussions and policy development planning from previous meetings, in particular: the Bermuda meetings (in 1996, 1997 and 1998), which focused on genome sequence data generated by the HGP [3–5]; the Fort Lauderdale meeting (in 2003), which recommended that rapid pre-publication release be applied to other types of data whose primary utility was as a resource for the scientific community, and also established the responsibilities of the resource producers, resource users, and the funding agencies [6]; and the Amsterdam meeting (in 2008), which extended the scope of rapid data release to proteomics data [7]. Although these meetings’ recommendations were applicable to many genomics and proteomics projects, many outside the major centers and funding agencies remain unaware of the details of these policies.
Attendees of the Toronto meeting re-affirmed the value of rapid pre-publication data release for biological and medical datasets that have broad utility and agreed that pre-publication data release should go beyond genomics and proteomics studies to other datasets [e.g., chemical structure, metabolomic, and RNAi datasets, and annotated clinical resources (cohorts, tissue banks, and case-control studies)]. In each of these domains, there are diverse data types and study designs, ranging from large reference projects with broad utility (for which meeting participants endorsed pre-publication data release) to investigator-led hypothesis-testing and data-generating projects (for which the minimum standard must be the release of generated data at the time of publication). Several issues discussed at previous data release meetings were not revisited, as they were considered fundamental to all types of data release (whether pre-publication or publication-associated). These included: (1) specification of quality standards for all data; (2) creation of databases designed to facilitate usage of all released data types; (3) archiving of raw data in a retrievable form; (4) housing of both ‘finished’ and ‘unfinished’ data in databases; and (5) provision of long-term support for databases by funding agencies. New issues that were addressed include the importance of simultaneously releasing metadata (such as environmental/experimental conditions and phenotypes) that will enable users to fully exploit the data, as well as the complexities associated with human subjects data due to concerns about privacy and confidentiality.
At a practical level, the Toronto meeting developed a set of suggested ‘best practices’ for funding agencies, for scientists in their different roles (e.g., data producers, data analysts/users, and manuscript reviewers), and for journal editors (see Box 1).
Funding agencies should require rapid pre-publication data release for projects that generate datasets that have broad utility, are large in scale, and are ‘reference’ in character. Many such projects have emerged after discussions between funding agencies and the stakeholder scientific community prior to concentrating large amounts of funds in a limited number of data-producing groups, thereby ensuring the efficient generation of the data resource. Table 1 provides examples of projects using different designs, technologies, and approaches that have several of these attributes, but also shows projects that are more hypothesis-based for which pre-publication data release should not be mandated. It was agreed at the meeting that the requirements for pre-publication data release must be made clear when funding opportunities are first announced and that proactive engagement of funders is beneficial throughout the project, as exemplified by several genome-sequencing projects (e.g., for mouse and many other vertebrates), the International HapMap Project, the ENCODE project, the 1000 Genomes project, and most recently the International Cancer Genome Consortium, the Human Microbiome Project, and the MetaHIT project. For all projects with a data-generation component, the Toronto meeting participants recommended that funding agencies require that data-sharing plans be presented as part of grant applications and that these plans be subjected to peer review. Funding agencies should exercise flexibility in a range of circumstances, for example the possibility that large-scale data-generation projects need not necessarily lead to traditional publications, and that certain projects may only need to release some of their generated data prior to publication. Meanwhile, it is desirable to have general consistency in data-sharing policies among funding agencies, whenever possible.
At the same time, funding agencies and academic institutions should positively recognize investigators who adopt pre-publication data-release practices; this would be enabled by having released datasets recognized as part of grants and promotion processes as well as tracked using Internet systems similar to those used for traditional publications [8].
Rapid pre-publication data release can lead to tensions between the interests of the data-producing scientists who request a protected time period to publish a first description of a dataset and other scientists who wish to rapidly publish their own analyses based on the same data. To date, many papers have been published by third parties reporting research findings enabled by datasets released prior to publication. These have rarely affected subsequent publications authored by the data producers describing the datasets themselves. Nevertheless, the Toronto meeting participants recognized that this is an ongoing concern that can be addressed by fostering a scientific culture that encourages cooperation on the part of data producers, data analysts, reviewers, and journal editors.
Data producers should, as early as possible and ideally before large-scale data generation begins, clarify their overall plans and intentions for data analysis by providing a citeable statement that can be placed in the publication field of database submissions. This statement must provide clear details about the dataset to be produced, the associated metadata, the experimental design, pilot data, data standards, security, quality control procedures, expected timelines, data release mechanisms, and contact details for lead investigators. If data producers request a protected time period to allow them to be the first to publish the dataset, this should be limited to global analyses of the data and ideally expire within one year. The statement would preferably be a ‘marker paper’ that is subjected to peer review and published in a scientific journal. Alternatively, other citeable sources, such as digital object identifiers to specific pages on well-maintained funding agency or institutional web sites, could also be used. Data producers would benefit from defining a citeable reference for the database, as it can later be used to reflect the impact of the datasets [8].
In turn, the data analysts (i.e., data users) should carefully read the source information associated with a released dataset. Data analysts should pay particular attention to any caveats about data quality: rapidly released data may be less stable than the more mature data that become available later in a project, because they may not yet have undergone the full complement of quality control analyses. As such, it would be prudent for data analysts to assess the benefits and potential problems in immediately analyzing released data. They should communicate with data producers to clarify issues of data quality in relation to the intended analyses, whenever possible. In addition, data users should be aware that some datasets are associated with version numbers: the appropriate version number should be tracked and then provided in any published analyses of those data.
Resulting papers describing studies that do not overlap with the intentions stated by the data producers in the marker paper (or other citeable source) may be submitted for publication at any time, but must appropriately cite the data source. Papers describing studies that do overlap with the data producer’s proposed analyses should be handled carefully and respectfully, ideally including a dialogue with the data producer to see if a mutually agreeable publication schedule (such as co-publication or inclusion within a set of companion papers) can be developed. In this regard, it is important for data users to realize that, historically, many such dialogues have led to both coordinated publications and new scientific insights contributed by all parties. Despite the best intentions of all parties, occasional instances might occur when another researcher publishes the results of analyses carried out on pre-publication data and those analyses overlap with the planned studies of the data producer. While such instances are hopefully rare, these should be viewed as a small risk to the data producers, one that comes with the much greater overall benefit of early data release.
As reviewers of manuscripts submitted for publication, scientists should be mindful that pre-publication datasets are likely to have been released before extensive quality control is performed, and any unnoticed errors may cause problems in the analyses performed by third parties. Where the use of pre-publication data is limited or not critical to a study’s conclusions, the reviewers should only expect the normal scientific practice of clear citation and interpretation. However, when the main conclusions of a study rely on a pre-publication dataset, reviewers should be satisfied that the quality of the data is described and taken into account in the analysis.
Toronto meeting participants recommended that journals play an active role in the dialogue about rapid pre-publication data release (e.g., in both their guide to authors and instructions to reviewers). Journal editors should encourage reviewers to be aware that large-scale datasets may be subject to specific policies regarding how to cite and use the data. Ultimately, journal editors must rely on their reviewers’ recommendations for reaching decisions about publication. By emphasizing the importance of quality review of pre-publication datasets in the manuscript review process, greater awareness and recognition of data producers can be achieved and standards of analysis and publication will be raised.
Clinical, socio-demographic, genomic, and other data about human subjects participating in genetic and epidemiological research studies require particularly careful consideration due to the issues relating to privacy protection and the potential harms that could arise from misuse. These issues are critical to all databases housing information about human subjects, whether or not they contain pre-publication data. These complexities are increased by factors such as managing participant withdrawal or control of data usage once it is in the public domain. For these reasons, it is important to develop and implement robust governance models and procedures for human subjects data early in a project. Lessons can likely be learned from recent models adopted by several projects: Open Databases for data variables that cannot be used to identify individuals and Controlled Access Databases for clinical and genomic data that are associated with a unique but not directly identifiable individual [9]. Under such conditions, arguments can be made for the release of data for studies involving human subjects, as doing so can augment the opportunities for new discoveries that could ultimately benefit individuals, communities, and society at large.
The rapid pre-publication release of the human genome sequence data by the HGP constituted a landmark model for cooperation between heterogeneous communities of data generators and analysts, successfully demonstrating how 'big science' can be structured for biological research. This data release policy has served the field of genomics well. The benefits of its application to subsequent endeavours have been demonstrated both in providing useful datasets well in advance of a project's completion and in enabling novel scientific advances to be made worldwide. The Toronto meeting participants acknowledged that many issues remain with pre-publication release of data, that there is a range of opinions in the scientific community, that the landscape continues to change rapidly, and that policies need to be reviewed on a regular basis. Nonetheless, wider adoption of the general principles that are fundamental to sharing data as early as possible will positively impact the pace of scientific discovery and should be embraced in a practical and well-reasoned fashion.
The authors wish to thank the following funding agencies for supporting the Toronto Data Release Workshop: Biotechnology and Biological Sciences Research Council, the European Commission, Genome Canada, the National Human Genome Research Institute, the National Science Foundation, and the Wellcome Trust. We are also grateful to Genny Cardin and the staff at Genome Canada for logistical assistance.
A complete list of the authors and their affiliations is provided at URL.