1. Expanded Publication Process
We must expand what it means to publish in the biosciences. Traditional journal articles alone are ill-suited to capture the fruits of modern research: databases, Web sites, archived presentations and high-level commentaries are a real and valuable part of the scientific information landscape. Academic publication should reflect this.
To modernize academic publishing, we propose three main changes.
First, all scientists should be able to publish, share and access data on the Web. Funding agencies should include stable digital storage with research grants, and tie continued funding to the appropriate use of this storage. This ensures that every author is able to archive pre-prints, host supplementary and unpublished data, and make their findings widely available in digital format.
Second, we propose that journals expand the publication process to yield a broader spectrum of output. Beyond the traditional text-based article, authors should produce two key products: a brief lay summary of their work (similar to those required by the journal PLoS Medicine), and a machine-readable XML summary of pertinent facts in the article, which we term the Structured Digital Abstract (Figure 1). The lay summary assists public and non-specialist consumption of scientific research, while the Structured Digital Abstract would ease pressure on database curators and streamline the large-scale automatic deposition of author-vetted biological facts.
Figure 1. A schematic illustration of the proposed Structured Digital Abstract for a single genetics article. This document – a machine-readable summary of pertinent findings arranged for simple database deposit – would be coded in XML and ...
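As a minimal sketch of what a Structured Digital Abstract might look like in practice, the Python snippet below assembles a small XML document from a list of author-vetted facts. No schema is specified in this article, so the element and attribute names (`structuredDigitalAbstract`, `fact`, `subject`, `relation`, `object`), the `build_sda` helper, and the example gene names, DOI and database identifier are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_sda(article_doi, facts):
    """Build a hypothetical Structured Digital Abstract as an XML string.

    facts: list of dicts with keys 'subject', 'relation', 'object' and,
    optionally, 'subject_id' (a database identifier for the subject).
    All tag names here are illustrative assumptions, not a published schema.
    """
    root = ET.Element("structuredDigitalAbstract", {"article": article_doi})
    for fact in facts:
        node = ET.SubElement(root, "fact")
        subject = ET.SubElement(node, "subject")
        subject.text = fact["subject"]
        if "subject_id" in fact:
            # Link the named entity to its database record.
            subject.set("id", fact["subject_id"])
        ET.SubElement(node, "relation").text = fact["relation"]
        ET.SubElement(node, "object").text = fact["object"]
    return ET.tostring(root, encoding="unicode")


# Placeholder facts; none of these values come from a real article.
example_facts = [
    {
        "subject": "geneA",
        "subject_id": "exampledb:0001",
        "relation": "interacts_with",
        "object": "geneB",
    },
]
sda_xml = build_sda("10.0000/example.doi", example_facts)
print(sda_xml)
```

Because each fact is a separate element carrying its own identifier, a partner database could in principle load such a file record by record without re-reading the article text.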
Third, findings should be contributed to an appropriate database when the paper is published. (This already occurs in several fields: in protein structure determination, for instance, deposition of data sets to the Protein Data Bank is a standard co-requisite for publication.) A Structured Digital Abstract will facilitate such deposit; individual journals could partner with related databases to decide on format and spearhead such practices. We might begin by using currently available information extraction software such as BioRAT [9] – a program that distills key facts from full-text documents in the biosciences – at the pre-print stage. Authors could build upon a BioRAT-style initial summary of their paper to create the final Structured Digital Abstract, submitted along with the manuscript for journal publication.
2. Central Indexing of Data
A key thrust of our future vision is the interconnection of disparate data sources: journal text is intelligently linked to database resources, third-party commentaries, archived talks and Web sites. To begin this process, we must establish a reliable way to identify database objects – similar to the Digital Object Identifier (DOI®) already used for articles – and reference them consistently within journal text [11]. This conceptually simple idea will go a long way towards linking journal text with database information; for instance, a user browsing a given sequence feature in a database could instantly see and access all journal articles that reference that feature.
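The lookup from a database object to the articles that cite it can be sketched as a simple reverse index. The snippet below is an illustration only: the `build_reverse_index` helper, the article labels and the object identifiers are hypothetical, and it assumes each article already lists the database identifiers it references.

```python
from collections import defaultdict


def build_reverse_index(article_refs):
    """Map each database object identifier to the set of articles citing it.

    article_refs: dict of article identifier -> iterable of object identifiers
    referenced in that article's text.
    """
    index = defaultdict(set)
    for article_id, object_ids in article_refs.items():
        for object_id in object_ids:
            index[object_id].add(article_id)
    return index


# Placeholder data: three articles referencing two database objects.
article_refs = {
    "article-1": ["exampledb:feature42", "exampledb:feature7"],
    "article-2": ["exampledb:feature7"],
    "article-3": ["exampledb:feature42"],
}
index = build_reverse_index(article_refs)
print(sorted(index["exampledb:feature42"]))
```

A database page for `exampledb:feature42` could then list every citing article with one dictionary lookup, provided identifiers are written the same way in every article.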
As a corollary, editorial boards should regulate a move towards a unified, standard naming convention for biology. The LSID proposal to unify database-specific ID-management issues into a single system and assign a unique identifier to all objects in the life sciences represents a promising advance towards such unified nomenclature.
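To make the idea of a single identifier scheme concrete, the sketch below splits an LSID into its parts. It assumes the general LSID layout of a URN with the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`; the `parse_lsid` helper, the `LSID` tuple and the example identifier are illustrative and not taken from this article.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    # Expect "urn", "lsid", authority, namespace, object and optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier using a placeholder authority.
example = parse_lsid("urn:lsid:example.org:genes:12345")
print(example.authority, example.namespace, example.object_id)
```

Since the issuing authority and namespace are part of the identifier itself, two databases can assign the same local object number without collision.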
Once consistently labeled, gene names and other biological identifiers in article text should be annotated for association with their database counterparts. This need not be overly tedious, as much of the process could be automated and implemented by journals during publication: a software system could identify most putative anchors and produce a checklist which authors would simply approve or alter [11]. The name lists (gazetteers) used by existing information extraction tools for biology will be useful here [9]. The end result would be journal text that comes pre-annotated with unique meta-identifiers, a suitable scaffold for the next generation of search and indexing.
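The checklist step described above could be sketched roughly as follows. This is a deliberately naive exact-match approach, not the method of any existing tool: the `propose_anchors` helper is hypothetical, and the gazetteer entries map gene names to made-up placeholder identifiers.

```python
import re


def propose_anchors(text, gazetteer):
    """Return candidate anchors as (start, end, name, identifier) tuples.

    gazetteer: dict of name -> database identifier. Matching is exact and
    case-sensitive on whole words; a real system would also need to handle
    synonyms and ambiguous names.
    """
    candidates = []
    for name, identifier in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            candidates.append((match.start(), match.end(), name, identifier))
    # Sort by position so authors can review anchors in reading order.
    return sorted(candidates)


# Placeholder gazetteer; the identifiers are invented for illustration.
gazetteer = {"BRCA1": "exampledb:G0001", "TP53": "exampledb:G0002"}
sentence = "Loss of TP53 alters BRCA1 expression."
checklist = propose_anchors(sentence, gazetteer)
for start, end, name, identifier in checklist:
    print(f"{name} at {start}-{end} -> {identifier} (approve or alter?)")
```

Each tuple corresponds to one checklist item; the author's only task is to confirm or correct the proposed identifier.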
The efficient interrelation of biological data sources has been the subject of much recent work. In particular, current applications of the semantic web to the biological world are promising: for instance, projects such as Atlas [12] (large-scale data integration infrastructure) and YeastHub [13] (using Resource Description Framework (RDF) structures to warehouse tabular biological data) offer initial avenues for the handling of tomorrow's highly-annotated articles and data sets. Indeed, several prominent computer science researchers recently proposed the semantic web as the future of the Web in general [14].
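To illustrate in general terms how tabular data can be recast as resource-description statements, the sketch below turns one table row into subject-predicate-object triples written in an N-Triples-like form. It is not a description of how YeastHub or Atlas work: the `row_to_triples` helper, the `http://example.org/` base URI and the sample row are assumptions made for this example.

```python
def row_to_triples(row, key_column, base_uri="http://example.org/"):
    """Convert one table row (a dict) into a list of triple strings.

    The value in key_column names the subject; every other column becomes a
    predicate whose object is the cell value, written as a plain literal.
    """
    subject = f"<{base_uri}record/{row[key_column]}>"
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue
        predicate = f"<{base_uri}property/{column}>"
        # Escape backslashes and double quotes inside the literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Placeholder row; the values are not real biological data.
row = {"id": "rec1", "name": "geneA", "length": 1200}
triples = row_to_triples(row, key_column="id")
for triple in triples:
    print(triple)
```

Once every row is expressed this way, statements from different tables that share a subject URI can be merged without agreeing on a common table layout first.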
Immediate gains will accompany Web search engines indexing the full text of scientific articles. Already underway to some degree with the Google Scholar service [15], this will rapidly expand search power beyond abstracts and keywords and dramatically improve public access to scientific information. It bears mention that such indexing is clearly dependent on some form of open access to scientific literature; initial efforts to index full-text must rely upon the cooperation of publishers, or free repositories such as PubMed Central and institutional archives.
Until widespread open access to published literature is a reality, local archiving on institutional Web space is a convenient stopgap to permit interim indexing of full-text documents. It is estimated that over 90% of academic journals allow some form of author archiving, but with widely differing rules [16]. Until these rules are standardized, publicized and well understood, authors will remain hesitant to archive. The Open Access movement is already pressing this issue, and we strongly support the wide adoption of Science Commons publication agreements [17], which clarify author rights relating to manuscript archiving.
3. Credit for Digital Contributions to Science
Scientific contribution should not be measured solely by journal publications. Database maintenance is already vital to modern research, and we should implement a consistent citation system to credit database contributions. Full-text publication will remain the cornerstone of the research process – after all, human-readable discussion will always be in high demand – but recognition should also be afforded to those who create, maintain and update the database records we depend upon daily. If database contribution is properly acknowledged, we will see more widespread attention devoted to maintaining these key resources in the future.
Moreover, the ability to quickly establish if an idea has previously been put forth – and to properly credit it, if so – is important to scientists. Full author identification and centrally searchable content will simplify this process, and facilitate attribution and acknowledgment.
Community annotation of published research is a key step towards harnessing the full power of the Internet in scientific communication. The new journal PLoS One offers community-driven peer-review, permitting online discussion and rating of work by a wide spectrum of interested parties. With a tangible model for open review in place, it will be interesting to observe the success of this approach, and whether other publishers follow suit.
Finally, overall progress toward our future vision will likely change how we view authoring and editing in science. Specifically, curating biological databases is of increasing importance. This complex task demands scientific expertise paired with writing, editing, programming and database administration skills. We believe data management techniques will one day be taught to undergraduate-level scientists, as students and researchers of all levels learn to oversee and tend their corner of the digital data landscape.