The sequencing and analysis of the human genome have largely elucidated the ‘parts list’ of the cellular machinery in the form of approximately 25 000 genes. However, the comprehensive annotation of gene function remains a formidable challenge. The scale of the task ahead is illustrated by two simple analyses of links between PubMed and the Entrez Gene database.
The first analysis showed that while there are several well-studied genes that have thousands of indexed citations in the literature, that degree of functional annotation falls off steeply (1). Almost 80% of Entrez Gene entries had five or fewer linked references in PubMed; almost 50% had zero linked references (Figure 1A). This pattern was even more pronounced when examining links to the Gene Ontology (also shown in Figure 1A). Clearly, there remains much work to be done to functionally annotate the human genome and to comprehensively catalog these findings in gene annotation databases.
Figure 1. Analysis of links between the Entrez Gene and PubMed databases. (A) Examining the degree of gene annotation from the perspective of Entrez Gene, we found that while a few genes are very well annotated with links to PubMed references, the vast majority ...
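The coverage figures in the first analysis (the share of genes with zero, or with five or fewer, linked references) amount to a simple tally over gene-to-article links. The following is a minimal sketch of such a tally, not the procedure actually used for Figure 1; the function name `citation_coverage`, its parameters and the input layout (a list of gene IDs plus a list of `(gene_id, pubmed_id)` pairs) are assumptions made for illustration.

```python
from collections import Counter


def citation_coverage(gene_ids, gene_pubmed_links, threshold=5):
    """Summarize how sparsely a set of genes is linked to the literature.

    gene_ids: iterable of all gene identifiers under consideration
              (hypothetical input; includes genes with no links at all).
    gene_pubmed_links: iterable of (gene_id, pubmed_id) pairs.
    Returns (fraction of genes with zero links,
             fraction of genes with <= threshold links).
    """
    genes = set(gene_ids)
    if not genes:
        raise ValueError("gene_ids must not be empty")

    # Count distinct linked articles per gene; duplicate pairs count once.
    counts = Counter(gene_id for gene_id, _ in set(gene_pubmed_links))

    # Counter returns 0 for genes that never appear in the link table.
    zero = sum(1 for g in genes if counts[g] == 0)
    few = sum(1 for g in genes if counts[g] <= threshold)
    return zero / len(genes), few / len(genes)
```

Note that the second fraction includes the genes with no links, which matches the way the text reports "five or fewer" alongside "zero".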
A second analysis examined the rate at which PubMed entries were being linked to Entrez Gene (Figure 1B). Between 1970 and 2008, the number of publications added to PubMed grew at an annualized rate of ~3.4%. Over that same time period, the number of articles that are currently linked to at least one Entrez Gene entry ‘grew’ by ~18% per year, but still <4% of all PubMed entries in recent years (and <1.5% overall) are linked to Entrez Gene. Assuming that >4% of PubMed-indexed articles have relevance to human gene function, this finding suggests that the rate-limiting step is not generating the data, but capturing the derived knowledge in gene annotation databases.
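For readers unfamiliar with the term, an annualized growth rate is conventionally the constant yearly rate that would carry a starting count to an ending count over the period in question. The sketch below illustrates that standard compound-growth formula; it is an assumption that the rates quoted above were computed this way, and the function name and parameters are hypothetical.

```python
def annualized_growth_rate(start_count, end_count, years):
    """Constant yearly growth rate r such that
    start_count * (1 + r) ** years == end_count.

    start_count, end_count: yearly article counts at the two endpoints
                            (hypothetical inputs, not values from the text).
    years: number of years separating the two endpoints.
    """
    if start_count <= 0 or end_count <= 0 or years <= 0:
        raise ValueError("counts and years must be positive")
    return (end_count / start_count) ** (1.0 / years) - 1.0


def fraction_linked(linked_articles, total_articles):
    """Share of articles linked to at least one gene entry."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return linked_articles / total_articles
```

Because the linked subset starts from a much smaller base, it can grow far faster than the literature as a whole and still remain a small fraction of it, which is the situation the second analysis describes.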
Currently, the process of annotating gene function typically entails large-scale efforts by the model organism community (2–4) and genome annotation centers (5). Formal annotation of gene function often utilizes controlled vocabularies like Gene Ontology (6). While the annotation process can be aided by the use of computational tools, ultimately the assignment of gene function is a manual process requiring the attention of one or more domain experts (7). This centralized model has been very successful in its goal to systematically advance gene annotation, creating essential tools and ontologies in the process.
However, this model alone may not be sufficient to efficiently and systematically annotate gene function. Many leading voices in the gene annotation and model organism communities recently wrote a feature article in Nature describing the current state and future of biocuration (8). They noted the immense challenge to the curator community (typically numbering in tens to hundreds of people) to keep pace with the biomedical literature (currently 18 million articles in PubMed, roughly 750 000 new articles per year). Specifically, these curation experts suggest that merely preserving the existing models of gene annotation will lead to an increasing lag between curated data and biological knowledge, and that ‘sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation’ (8).
Thus, although leaders in the curation community have successfully set up a robust pipeline and infrastructure, and although the individuals in the curation community are clearly skilled in the annotation process, the resources devoted to this important task may simply be insufficient relative to the volume of biomedical data being generated.
Several recently published efforts attempt to harness the principle of ‘community intelligence’ (9–15). In particular, we introduced the Gene Wiki (11), an effort to systematically annotate articles in the online encyclopedia, Wikipedia, for approximately 9000 human genes. Articles were created or amended with content mined from structured gene annotation databases, including Entrez Gene, Ensembl, UniProt and the Protein Data Bank (PDB). Although the emphasis of the Gene Wiki is on describing human gene function, data from model organisms are often contributed as appropriate.
Here, we present an update describing the recent systematic improvements to the Gene Wiki. Moreover, we report on a retrospective analysis of Gene Wiki usage and editing. Finally, we offer some concluding remarks on general progress and challenges facing efforts to collaboratively engage the entire community of scientists.