The primary goal of biocuration is to stitch together a comprehensive, accurate, and up-to-date picture of current biological knowledge. This network provides researchers with easy access to detailed and highly cross-linked information that is traceable back to its source. To accomplish this, it is critical that the published data be readily available for curation. Budgetary constraints should not prevent any publication from being included in the curation process; when they do, the accessibility of the researcher’s data is diminished.
Recent initiatives to increase access, and institutional support for them, have begun to help change this by providing publication venues where everyone can access publications and their enclosed data freely and in a timely manner. Freely available articles benefit journal publishers by opening new avenues for those journal articles to be found in complex searches conducted against curated data at biological databases. Continued expansion of publication models that provide rapid access to the published literature for the widest possible audience, institutional support for publication in journals with high accessibility, and improved data access collaborations between biological databases and journal publishers will be important if biocuration is to successfully pursue its goal of complete and up-to-date curation of biological knowledge.
As the number of papers increases, several MODs have begun projects to assess the feasibility of utilizing various text-mining tools to aid and streamline the literature curation processes. Aside from technical problems such as obtaining full text in a format suitable for scanning, including figure legends and supplemental data, such tools must be able to accurately recognize key phrases in their proper context, and associate these with various controlled vocabularies (gene lists, phenotype and GO ontologies, etc.). For example, Van Auken et al. have reported a promising test using Textpresso (Muller et al. 2004) to curate GO cellular component annotations at WormBase (Van Auken et al. 2009). However, it is not clear how such a tool might deal with a much larger literature corpus such as MGI’s (at least 1,000 papers a month), or how well it might handle curation of complex phenotypes due to conditional knockouts and relate these phenotypes to GO biological process terms. More recently, MGI has evaluated several tools to ease the bottleneck of associating selected papers with specific genes and reports a possible increase in assignment throughput of 20–40% (Dowell et al. 2009). It is unlikely that such tools will replace the need for expert manual extraction of experimental data within a relevant paper, but they may aid in selecting papers for further scrutiny.
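As a rough illustration of the simplest form of the matching step described above, the following minimal Python sketch flags controlled-vocabulary terms in a sentence and pairs each with its identifier. It is not how Textpresso or any MOD pipeline works; the three-term vocabulary, the function names, and the example sentence are assumptions chosen for illustration. Plain dictionary matching of this kind ignores context, synonyms, and inflected forms, which is precisely the gap that more sophisticated tools and expert curators must fill.

```python
import re

# Illustrative vocabulary only: three GO cellular component terms.
# A real pipeline would load a full ontology, including synonyms.
VOCABULARY = {
    "nucleus": "GO:0005634",
    "mitochondrion": "GO:0005739",
    "plasma membrane": "GO:0005886",
}


def build_pattern(terms):
    """Compile one case-insensitive pattern matching any term as a whole phrase."""
    # Try longer terms first so multi-word phrases win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def find_candidate_annotations(sentence, vocabulary=VOCABULARY):
    """Return (term, identifier, character offset) for each vocabulary hit."""
    pattern = build_pattern(vocabulary)
    hits = []
    for match in pattern.finditer(sentence):
        term = match.group(1).lower()
        hits.append((term, vocabulary[term], match.start()))
    return hits


if __name__ == "__main__":
    example = "The protein localizes to the plasma membrane and the nucleus."
    for term, identifier, offset in find_candidate_annotations(example):
        print(f"{offset:3d}  {term}  {identifier}")
```

Hits produced this way are only candidates: a curator would still need to confirm that the sentence reports an experimental result for the gene product in question before recording an annotation.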
It is hoped that alternative curation models, likely involving more direct participation by the research community, will help address the data gap between the ever-increasing amount of information published in the literature and the amount of data available through curated biological databases. Several community curation models are currently under investigation, including partnerships between journals and databases (Seringhaus and Gerstein 2007; Ceol et al. 2008; Ort and Grennan 2008; Seringhaus and Gerstein 2008), direct editing of database entries by field experts (Menda et al. 2008), and the use of wikis to supplement existing curated databases (see, e.g., http://wiki.yeastgenome.org/index.php/Main_Page). Wikis provide a forum for researchers to discuss issues that may never be published, such as controversies about certain experimental results. Participation of the research community can greatly augment the curator’s ability to develop a unified, comprehensive, precise, accurate, and highly cross-referenced view of the current biological knowledge.
Just as biocuration matures, so does our understanding of the complex nature of biology. The concept of a gene, one of the most basic tenets of biology, is changing as our understanding of genetics expands. Biocurators will continue to find ways to integrate complex new biological concepts into their existing frameworks. Great challenges lie ahead for bioresearch and biocuration alike. It is becoming apparent that the current gene-based data models used by MODs need to be expanded to allow incorporation, access, and visualization of a growing number of complex entities and processes such as microRNAs, epigenetic influences, gene regulation networks/pathways, and protein complexes. With continued staffing of highly qualified biocurators, sufficient funding, and active collaboration between journals, biological databases and the research community, we can successfully meet these challenges.