Journal articles, as immutable, citable archives of knowledge, have been, and continue to be, the mainstay of scholarly communication. Viewed by many scientists as the end product of their engagement in a piece of research, the "article" contains an argument or statement about a hypothesis, backed up by supporting data. However, as new technologies drive research toward larger and more complex datasets, these two features of the journal article are becoming increasingly disarticulated [1]. In some scientific disciplines – for example crystallography, astronomy and molecular biology – digital repositories have become important avenues for "publishing" data. This approach has found common cause with social and political forces that are arguing for greater accountability and transparency of science. The Open Science movement for the free use (and re-use) of data, results and protocols is championed by many as the best way to improve the collective societal return on our investment in scientific research [2]. Data publication is widely recognised as being central to delivering this. But in truth, outside a handful of disciplines, publication of scientific data is the exception, not the rule.
Data publication has the potential to deliver significant benefits from local to global scales. Organisations and research disciplines can benefit from increased recognition [3]. There are significant potential cost savings for funders through greater reuse of data [4], and economic benefits from stimulating entrepreneurial uses of data by commercial companies [5]. Data publication can help to discourage scientific misconduct [6], and in many cases (e.g. environmental and ecological data) provides the only outlet for data that are irreplaceable because of the unique circumstances in which they were collected. So why, when so much is to be gained from data publication, do scientists compromise scientific development, and effectively leave their work unfinished, by not publishing their data? I argue that it is not through lack of money or policy that scientists behave in this way. Likewise, misunderstandings and inertia within the scientific community are only partly to blame. A more likely cause is that the benefits to an individual of making their data publicly available are less evident to the scientist than they are to the governments, funding agencies and scientific community that support them. Only by addressing this imbalance, and making these benefits immediate and transparent to practising scientists, will data publication become the norm. Here are three suggestions on how this can be achieved:
Make it easy – developing the cyberinfrastructure
For those scientists for whom data publication is possible, it is too often considered a chore. Of course, it is dangerous to generalize across a multiplicity of scientific disciplines, each with its own specialised norms and practices, but as a taxonomist and systematist generating molecular, morphological and phylogenetic data in support of my biodiversity research program, my experience with data publication systems has always been a painful affair. Almost without exception they require a substantial time investment, sometimes involving personal contact with a remote database manager who massages my data into a form that can be readily parsed. If data publication is to become a part of normal scientific practice, it has to be easy to achieve. This requires a robust infrastructure that is quick and simple to use, works with the applications and data formats currently employed, and gives the scientist confidence that it will work and still be there when needed. Data standards are part of this process, but perhaps more important is the development of robust applications that hide the complexity of these data standards behind a well-designed interface. Funding agencies need to respond to these infrastructural needs, which have to be maintained beyond the typical lifecycle of a standard grant application if they are to have lasting impact. Related to this is a need for a career path and recognition structure for those informaticians who develop the software and standards associated with these systems [7]. Without this human infrastructure, the data, computational and communication components of this cyberinfrastructure cannot be sustained.
Make it citable – motivating data publication through peer recognition
A primary motivation for article publication is to demonstrate the authors' contribution to science [8]. This attracts peer recognition that influences the authors' reputation, employment and research opportunities. Article citation is the most common metric of peer recognition, and if a comparable metric could be brought to bear on data publication, it follows that the value and impact of data publication could be similarly tracked to motivate authors.
At present, data publication, where possible, is largely motivated by enforcement through the editorial practices of particular journals. These require that authors lodge data in a suitable repository as a prerequisite to publication. In this instance the citation of data usually takes the form of an opaque identifier (e.g. the GenBank accession number [9] or Web site URL) rather than naming the data authors or editors in a manner equivalent to a traditional article citation. This failure to cite the authors of an original data source has plenty of precedent: for example, publications that describe new species are rarely mentioned in subsequent studies. If they were, the scientific contributions of taxonomists would be amongst the most cited worldwide. Opaque identifiers will continue to be required for data publication for practical reasons, since large datasets are increasingly collaborative, often involving many hundreds of authors. Nevertheless, data publishers should be able to demonstrate the same editorial standards as article publishers, by making the authors' and editors' names and addresses readily accessible, preferably in a way that can be read by both humans and machines for computation of citation metrics. Not only would this introduce greater transparency and accountability in science, but it would also, through peer recognition, motivate authors to publish their data.
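To make the idea of a citation that is readable by both humans and machines concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any existing repository's format: the `DataCitation` class, its field names, the `citations_per_author` helper and the example records (including the identifier `EXAMPLE-0001`) are all hypothetical.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation record for a published dataset."""
    authors: tuple      # data authors' names, in citation order
    year: int           # year the dataset was published
    title: str          # descriptive title of the dataset
    repository: str     # name of the hosting repository
    identifier: str     # opaque identifier, kept alongside the names

    def human_readable(self) -> str:
        """Render the record like a traditional article citation."""
        names = ", ".join(self.authors)
        return (f"{names} ({self.year}). {self.title}. "
                f"{self.repository}. {self.identifier}")

    def machine_readable(self) -> str:
        """Serialise the same record as JSON for automated harvesting."""
        return json.dumps(asdict(self), sort_keys=True)


def citations_per_author(cited_records):
    """Count how often each author appears across a list of cited datasets."""
    counts = Counter()
    for record in cited_records:
        counts.update(record.authors)
    return counts


# Invented example records, for illustration only.
first = DataCitation(
    authors=("A. Author", "B. Author"),
    year=2010,
    title="Example morphological measurements",
    repository="Example Repository",
    identifier="EXAMPLE-0001",
)
second = DataCitation(
    authors=("A. Author",),
    year=2011,
    title="Example sequence alignments",
    repository="Example Repository",
    identifier="EXAMPLE-0002",
)

print(first.human_readable())
print(first.machine_readable())
print(citations_per_author([first, second]))
```

The point of the sketch is that the opaque identifier and the authors' names travel together in one record, so a simple citation count per author can be computed from the same data that a reader sees.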
Make it useful – moving beyond data archival
One reason for publishing data is to archive those data in a form that is available to others for reuse. This activity, however, has little value for the contributor, who already holds the data and may have to exert considerable effort to publish them. Automatically enhancing the value of the data to the contributor once they have been published can address this problem. This may be in the form of functional enhancements that facilitate the subsequent manipulation, editing and annotation of the data, or semantic enrichments that automatically connect data to other published sources [10]. This fusion of data might take the form of descriptive metadata that assists in data discovery, connection to definitions of concepts and terms found within the data set, and enhanced visualisations of data. Importantly, these enhancements should be reciprocal between linked data, enriching the value of old and new data alike as the knowledgebase grows. Not only does this enhance the discoverability of published data but, because these links are machine readable, it can facilitate computation and (perhaps eventually) semantic reasoning across the links.
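The reciprocal nature of such links can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not an account of how any real repository works: the `LinkedDataIndex` class, the idea of linking records that share a descriptive term, and the dataset names and terms are all hypothetical.

```python
from collections import defaultdict


class LinkedDataIndex:
    """Toy index that links published records sharing a descriptive term."""

    def __init__(self):
        self._terms = {}                # record id -> set of descriptive terms
        self._links = defaultdict(set)  # record id -> ids of linked records

    def publish(self, record_id, terms):
        """Add a record and link it, in both directions, to related records."""
        terms = set(terms)
        for other_id, other_terms in self._terms.items():
            if other_id != record_id and terms & other_terms:
                # The new record gains a link to the old one, and the old
                # record is enriched with a link back to the new one.
                self._links[record_id].add(other_id)
                self._links[other_id].add(record_id)
        self._terms[record_id] = terms

    def links(self, record_id):
        """Return the ids of all records linked to the given record."""
        return sorted(self._links.get(record_id, set()))


# Invented records and terms, for illustration only.
index = LinkedDataIndex()
index.publish("dataset-A", {"Aus bus", "wing length"})
index.publish("dataset-B", {"Aus bus", "habitat"})
index.publish("dataset-C", {"habitat"})

print(index.links("dataset-A"))
print(index.links("dataset-B"))
```

Publishing `dataset-B` does not only attach links to the new record; it also adds a link to the older `dataset-A`, which is the sense in which old and new data are enriched alike as the collection grows.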
How can we achieve all this with a multiplicity of distributed stakeholders, many of whom have conflicting or competing interests? To my mind, data stewardship is best accomplished in systems and repositories where the custodian has trusted status within relevant communities of practice. Such trust is earned with difficulty and lost with ease; therefore it makes most sense to place these repositories with scientific societies, institutions and journals that have a history of supporting, archiving and enabling these communities. This is counter to the trend toward large national data centres that must accommodate the diverse interests of potential contributors spanning many broad scientific disciplines. Scientific data exist in many types and formats, and are subject to varying legal, cultural, protection and practical constraints. They are often used in different ways according to their context and have varying life-cycle requirements. Who better to understand these needs than the communities that are generating and using the data? This, however, risks the construction of data silos – walled parochial gardens of disciplinary data that remain unconnected to the wider world.