At a practical level, the Toronto meeting developed a set of suggested ‘best practices’ for funding agencies, for scientists in their different roles (e.g., data producers, data analysts/users, and manuscript reviewers), and for journal editors (see Box 1).
Box 1. Guidelines for the Release of Pre-publication Data
- Rapid pre-publication data release should be encouraged for projects with the following attributes:
  - Large-scale (i.e., requiring significant resources over time)
  - Broad utility
  - Creating reference datasets
  - Associated with community buy-in, which is often the case with top-down initiatives
- Funding agencies should facilitate the specification of data release policies for relevant projects by:
  - Explicitly stating any data release requirements, especially mandatory pre-publication data release, in solicitations and instructions to applicants
  - Ensuring that evaluation of data release plans is part of the peer-review process
  - Proactively establishing analysis plans and timelines for projects releasing data pre-publication
  - Fostering investigator-initiated pre-publication data release
  - Helping to develop appropriate consent, security, access and governance mechanisms that protect research participants while encouraging pre-publication data release
  - Providing long-term support of databases
- Data producers should state their intentions and enable analyses of their data by:
  - Informing data users about the data being generated, data standards and quality, planned analyses, timelines, and relevant contact information, ideally through publication of a citable marker paper near the start of the project or by provision of a citable URL at the project or funding-agency website
  - Providing relevant metadata (e.g., questionnaires, phenotypes, environmental conditions, and laboratory methods) that will assist other researchers in reproducing and/or independently analyzing the data, while protecting the interests of individuals enrolled in studies focusing on humans
  - Ensuring that research participants are informed that their data will be shared with other scientists in the research community
  - Publishing their initial global analyses, as stated in the marker paper or citable URL, in a timely fashion
  - Creating databases designed to archive all data (including underlying raw data) in an easily retrievable form and facilitate usage of both pre-processed and processed data
- Data analysts/users should freely analyze released pre-publication data and act responsibly in publishing analyses of those data by:
  - Respecting the scientific etiquette that allows data producers to publish the first global analyses of their dataset
  - Reading the citable document associated with the project
  - Accurately and completely citing the source of pre-publication data, including the version of the dataset (if appropriate)
  - Being aware that released pre-publication data may be associated with quality issues that will be later rectified by the data producers
  - Contacting the data producers to discuss publication plans in the case of overlap between planned analyses
  - Ensuring that use of data does not harm research participants and is in conformity with ethical approvals
- Scientific journal editors should engage the research community about issues related to pre-publication data release and provide guidance to authors and reviewers on the third-party use of pre-publication data in manuscripts [6]
Funding agencies should require rapid pre-publication data release for projects that generate datasets that have broad utility, are large in scale, and are ‘reference’ in character. Many such projects have emerged after discussions between funding agencies and the stakeholder scientific community prior to concentrating large amounts of funds in a limited number of data-producing groups, thereby ensuring the efficient generation of the data resource. The accompanying table provides examples of projects using different designs, technologies, and approaches that have several of these attributes, but also shows projects that are more hypothesis-based for which pre-publication data release should not be mandated. It was agreed at the meeting that the requirements for pre-publication data release must be made clear when funding opportunities are first announced and that proactive engagement of funders is beneficial throughout the project, as exemplified by the several genome-sequencing projects (e.g., for mouse and many other vertebrates), the International HapMap Project, the ENCODE project, the 1000 Genomes project, and most recently the International Cancer Genome Consortium, the Human Microbiome Project, and the MetaHIT project. For all projects with a data-generation component, the Toronto meeting participants recommended that funding agencies require that data-sharing plans be presented as part of grant applications and that these plans be subjected to peer review. Funding agencies should exercise flexibility in a range of circumstances, for example by recognizing that large-scale data-generation projects need not necessarily lead to traditional publications, and that certain projects may only need to release some of their generated data prior to publication. Meanwhile, it is desirable to have general consistency in data-sharing policies among funding agencies, whenever possible.
At the same time, funding agencies and academic institutions should positively recognize investigators who adopt pre-publication data-release practices; this would be enabled by having released datasets recognized as part of grant and promotion processes as well as tracked using Internet systems similar to those used for traditional publications [8].
Examples of pre-publication data release guidelines for different project types.
Rapid pre-publication data release can lead to tensions between the interests of the data-producing scientists who request a protected time period to publish a first description of a dataset and other scientists who wish to rapidly publish their own analyses based on the same data. To date, many papers have been published by third parties reporting research findings enabled by datasets released prior to publication. These have rarely affected subsequent publications authored by the data producers describing the datasets themselves. Nevertheless, the Toronto meeting participants recognized that this is an ongoing concern that can be addressed by fostering a scientific culture that encourages cooperation on the part of data producers, data analysts, reviewers, and journal editors.
Data producers should, as early as possible and ideally before large-scale data generation begins, clarify their overall plans and intentions for data analysis by providing a citable statement that can be placed in the publication field of database submissions. This statement must provide clear details about the dataset to be produced, the associated metadata, the experimental design, pilot data, data standards, security, quality control procedures, expected timelines, data release mechanisms, and contact details for lead investigators. If data producers request a protected time period to allow them to be the first to publish the dataset, this should be limited to global analyses of the data and ideally expire within one year. This document would preferably be a ‘marker paper’ that is subjected to peer review and published in a scientific journal. Alternatively, other citable sources, such as digital object identifiers to specific pages on well-maintained funding agency or institutional web sites, could also be used. Data producers would benefit from defining a citable reference for the database, as it can later be used to reflect the impact of the datasets [8].
In turn, the data analysts (i.e., data users) should carefully read the source information associated with a released dataset. Data analysts should pay particular attention to any caveats about data quality: rapidly released data are less stable than the more mature data that become available later in a project, because they may not yet have undergone the full complement of quality control analyses. As such, it would be prudent for data analysts to weigh the benefits and potential problems of immediately analyzing released data. Whenever possible, they should communicate with data producers to clarify issues of data quality in relation to the intended analyses. In addition, data users should be aware that some datasets are associated with version numbers: the appropriate version number should be tracked and then provided in any published analyses of those data.
Resulting papers describing studies that do not overlap with the intentions stated by the data producers in the marker paper (or other citable source) may be submitted for publication at any time, but must appropriately cite the data source. Papers describing studies that do overlap with the data producer’s proposed analyses should be handled carefully and respectfully, ideally including a dialogue with the data producer to see if a mutually agreeable publication schedule (such as co-publication or inclusion within a set of companion papers) can be developed. In this regard, it is important for data users to realize that, historically, many such dialogues have led to both coordinated publications and new scientific insights contributed by all parties. Despite the best intentions of all parties, occasional instances might occur when another researcher publishes the results of analyses carried out on pre-publication data and those analyses overlap with the planned studies of the data producer. While such instances are hopefully rare, they should be viewed as a small risk to the data producers, one that comes with the much greater overall benefit of early data release.
As reviewers of manuscripts submitted for publication, scientists should be mindful that pre-publication datasets are likely to have been released before extensive quality control is performed, and any unnoticed errors may cause problems in the analyses performed by third parties. Where the use of pre-publication data is limited or not critical to a study’s conclusions, the reviewers should only expect the normal scientific practice of clear citation and interpretation. However, when the main conclusions of a study rely on a pre-publication dataset, reviewers should be satisfied that the quality of the data is described and taken into account in the analysis.
Toronto meeting participants recommended that journals play an active role in the dialogue about rapid pre-publication data release (e.g., in both their guide to authors and instructions to reviewers). Journal editors should encourage reviewers to be aware that large-scale datasets may be subject to specific policies regarding how to cite and use the data. Ultimately, journal editors must rely on their reviewers’ recommendations for reaching decisions about publication. By emphasizing the importance of quality review of pre-publication datasets in the manuscript review process, journals can foster greater awareness and recognition of data producers and raise standards of analysis and publication.