Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia Wikipedia, this effort aims to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki ‘stubs’ for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki.
The sequencing and analysis of the human genome have largely elucidated the ‘parts list’ of the cellular machinery in the form of approximately 25 000 genes. However, the comprehensive annotation of gene function remains a formidable challenge. The scale of the task ahead is illustrated by two simple analyses of links between PubMed and the Entrez Gene database.
The first analysis showed that while there are several well-studied genes that have thousands of indexed citations in the literature, that degree of functional annotation falls off steeply (1). Almost 80% of Entrez Gene entries had five or fewer linked references in PubMed; almost 50% had zero linked references (Figure 1A). This pattern was even more pronounced when examining links to the Gene Ontology (also shown in Figure 1A). Clearly, there remains much work to be done to functionally annotate the human genome and to comprehensively catalog these findings in gene annotation databases.
A second analysis examined the rate at which PubMed entries were being linked to Entrez Gene (Figure 1B). Between 1970 and 2008, the number of publications added to PubMed grew at an annualized rate of ~3.4%. On examining that same time period, it was found that the number of articles that are currently linked to at least one Entrez Gene entry ‘grew’ by ~18% per year, but still <4% of all PubMed entries in recent years (and <1.5% overall) are linked to Entrez Gene. Assuming that >4% of PubMed-indexed articles have relevance to human gene function, this finding suggests that the rate limiting step is not generating the data, but capturing the derived knowledge in gene annotation databases.
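The annualized rates quoted above can be read as compound growth rates. The following is a minimal sketch of one common way to compute such a rate from a starting and an ending yearly count. The text does not state the exact method used, so the formula is an assumption, the function name `annualized_growth_rate` is hypothetical, and the counts in the usage line are placeholders rather than values from this study.

```python
def annualized_growth_rate(start_count, end_count, years):
    """Compound annual growth rate between two yearly counts.

    Assumes growth compounds once per year over `years` years.
    """
    if start_count <= 0 or end_count <= 0 or years <= 0:
        raise ValueError("counts and years must be positive")
    return (end_count / start_count) ** (1.0 / years) - 1.0


# Placeholder illustration only: a quantity that doubles over 10 years.
rate = annualized_growth_rate(100, 200, 10)
print(f"{rate:.1%}")  # prints 7.2%
```

Under this definition, a rate of ~18% per year sustained over several decades corresponds to growth of several hundred-fold, which is why the linked fraction can rise quickly yet remain small in absolute terms.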
Currently, the process of annotating gene function typically entails large-scale efforts by the model organism community (2–4) and genome annotation centers (5). Formal annotation of gene function often utilizes controlled vocabularies like Gene Ontology (6). While the annotation process can be aided by the use of computational tools, ultimately the assignment of gene function is a manual process requiring the attention of one or more domain experts (7). This centralized model has been very successful in its goal to systematically advance gene annotation, creating essential tools and ontologies in the process.
However, this model alone may not be sufficient to efficiently and systematically annotate gene function. Many leading voices in the gene annotation and model organism communities recently wrote a feature article in Nature describing the current state and future of biocuration (8). They noted the immense challenge to the curator community (typically numbering in tens to hundreds of people) to keep pace with the biomedical literature (currently 18 million articles in PubMed, roughly 750 000 new articles per year). Specifically, these curation experts suggest that merely preserving the existing models of gene annotation will lead to an increasing lag between curated data and biological knowledge, and that ‘sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation’ (8).
Thus, although leaders in the curation community have successfully set up a robust pipeline and infrastructure, and although the individuals in the curation community are clearly skilled in the annotation process, the amount of resources devoted to this important task may be simply insufficient relative to the volume of biomedical data being generated.
Recently, several published efforts have attempted to harness the principle of ‘community intelligence’ (9–15). In particular, we introduced the Gene Wiki (11), an effort to systematically annotate articles in the online encyclopedia, Wikipedia, for approximately 9000 human genes. Articles were created or amended with content mined from structured gene annotation databases, including Entrez Gene, Ensembl, UniProt and the Protein Data Bank (PDB). Although the emphasis of the Gene Wiki is on describing human gene function, data from model organisms are often contributed as appropriate.
Here, we present an update describing the recent systematic improvements to the Gene Wiki. Moreover, we report on a retrospective analysis of Gene Wiki usage and editing. Finally, we offer some concluding remarks on general progress and challenges facing efforts to collaboratively engage the entire community of scientists.
As described earlier (11), the initial Gene Wiki effort focused on creating or amending gene pages to include a free-text summary, an ‘infobox’ with links to public databases, Gene Ontology annotations and when available, protein structure identifiers. These data were primarily harvested from the Entrez Gene database (16). Since that publication, we have introduced several other systematic improvements to the Gene Wiki.
Our previously described Gene Wiki effort resulted in approximately 9000 Wikipedia pages on human genes. However, with the exception of a few well-developed gene pages that existed prior to our involvement, these pages were accessible only through search engines and not through links from other articles. In the parlance of Wikipedia, these stubs were ‘orphans’ that were disconnected from the Wikipedia network defined by links between articles.
To better link the pages in the Gene Wiki to other biomedically relevant pages, we systematically created links between gene pages based on known protein–protein interactions in the literature. Interactions were downloaded from the BioGRID database (http://thebiogrid.org). We conservatively filtered for interactions that were supported by two independent techniques or two separate publications. A new section for ‘Interactions’ was created on each Gene Wiki page with at least one entry, which contained both links to the partner’s gene page and inline references to the relevant publications. In total, we added 12 628 links on 3389 gene pages.
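The filtering rule described above (an interaction is kept if it is supported by two independent techniques or by two separate publications) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the record layout `(gene_a, gene_b, technique, publication_id)`, the function name `filter_interactions` and the example identifiers are all hypothetical, and interactions are assumed to be undirected.

```python
from collections import defaultdict


def filter_interactions(records):
    """Keep gene pairs with >=2 distinct techniques or >=2 distinct publications.

    Each record is a tuple (gene_a, gene_b, technique, publication_id).
    Returns a dict mapping each retained pair to its sorted publication IDs,
    which could then be used for inline references.
    """
    techniques = defaultdict(set)
    publications = defaultdict(set)
    for gene_a, gene_b, technique, publication_id in records:
        # Assumption: A-B and B-A describe the same undirected interaction.
        pair = tuple(sorted((gene_a, gene_b)))
        techniques[pair].add(technique)
        publications[pair].add(publication_id)
    return {
        pair: sorted(publications[pair])
        for pair in techniques
        if len(techniques[pair]) >= 2 or len(publications[pair]) >= 2
    }


# Placeholder records; gene names and publication IDs are invented.
example = [
    ("GENE1", "GENE2", "two-hybrid", "pub_a"),
    ("GENE2", "GENE1", "affinity capture", "pub_a"),  # second technique
    ("GENE1", "GENE3", "two-hybrid", "pub_b"),        # single evidence, dropped
]
print(filter_interactions(example))
```

Collecting evidence in sets means that repeated reports of the same technique or the same publication do not count twice toward the threshold.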
Better integration of the Gene Wiki into the larger network of Wikipedia articles greatly improves navigation between related topics. For example, readers can now easily browse from the breast cancer article, to the Gene Wiki page for the commonly mutated BRCA2 gene, to the page for EMSY (C11ORF30), the protein product of which has been shown to interact with BRCA2 and silence its transcriptional activity (17).
Recognizing the importance of structural biology data, we undertook a focused effort to increase links between the Gene Wiki and the PDB (18). We first uploaded thumbnail images of all PDB structures to the Wikimedia Commons, a repository for freely usable media. Images were downloaded from the PDBe (http://www.ebi.ac.uk/pdbe/), and in total, 66 693 images were uploaded to the Wikimedia Commons. To aid in browsing and searching, PDB images were also categorized according to their assignments in the Structural Classification of Proteins (SCOP) database (19). The set of PDB structures can be browsed at http://commons.wikimedia.org/wiki/SCOP.
The easy availability of thumbnail images for almost all PDB structures will encourage their incorporation into relevant Gene Wiki and Wikipedia articles. To begin this process, we added an image gallery of PDB thumbnails to every Gene Wiki page with solved structures. To maintain balance with the rest of the pages’ content, the image galleries were shown in an expandable window at the bottom of each Gene Wiki page. In total, PDB image galleries were added to 2852 Gene Wiki pages with a total of 16 018 links to PDB structures. For example, the PDB gallery for MDM2 (http://en.wikipedia.org/wiki/Mdm2) shows PDB structures corresponding to the unbound protein, as well as structures in complex with its p53 peptide ligand and two small molecule inhibitors.
There are currently approximately 9000 pages in the Gene Wiki collection. To satisfy Wikipedia’s ‘notability’ criterion, we initially limited our effort to genes with the most linked references in PubMed (as indexed in Entrez Gene). However, to enable other Wikipedia editors to easily create Gene Wiki pages for other genes of interest, we created a simple web tool to generate the properly formatted ‘wikitext’ for any arbitrary gene of interest. This tool can either be used to update existing content to the most recent data, or to create a new page where none previously existed.
This Gene Wiki formatting tool has been implemented as a BioGPS plugin, accessible at http://biogps.gnf.org/GeneWikiGenerator. By utilizing BioGPS as the search interface, users can search for their gene or genes of interest using most public identifiers and keywords. Upon clicking on a gene, the web tool returns the wikitext in three distinct text boxes, together with links and instructions on how to create a new Gene Wiki page. To allow programmatic usage of this web service, the output is returned as XML and formatted to HTML using an XML style sheet.
Previously, we suggested that the long-term success of community intelligence resources is dependent on a positive feedback among page utility, readers and editors (11). In the ideal case, each Gene Wiki page provides some baseline level of useful content, which then attracts a certain number of readers. Some (likely small) percentage of those readers will then become contributors, where their contribution could be something as trivial as fixing a typo or as substantial as summarizing a recent paper. Contributions improve the Gene Wiki page, which then draws more readers, and then a larger core of contributors. In other words, usage is directly proportional to utility, contribution rate is directly proportional to usage rate and utility is directly proportional to contribution rate.
The first step in this process, creating article ‘stubs’ that had general utility, was the focus of both our first Gene Wiki effort and the systematic improvements described earlier. In addition, we now have the necessary data on usage and editing patterns to retrospectively assess the other two edges of the positive feedback loop.
Usage was analyzed for the 6-month period between 1 January and 30 June 2009, over the 9678 current Gene Wiki pages. In total, these pages were viewed over 17 million times (3.9 million total page views per month). On a per article basis, the Gene Wiki averaged over 300 page views per page per month (Figure 2; see also Supplementary Table S1). Closer analysis of these statistics revealed a broad range of usage levels (Table 1). The top-viewed articles are primarily related to genes of general societal interest (e.g. insulin, erythropoietin) and are viewed tens of thousands of times per month, likely dominated by non-scientists. Near the 100th most-popular pages are gene pages that cross many areas of biology (e.g. interleukin 10, c-Met), and these pages are viewed thousands of times per month. Finally, near the 1000th most-viewed pages are genes that are likely of interest to a relatively small population of scientists (e.g. IGSF8, TRPC6), and these pages receive approximately 300 page views per month. We believe that these statistics are indicative of Gene Wiki usage by both scientists and non-scientists.
Supporting the future growth in usage of the Gene Wiki, we also found that >85% of all Gene Wiki pages are found within the top eight Google hits when searching by gene symbol (Figure 3). This figure represents a substantial increase over the ~60% observed shortly after the gene stubs were created (11).
We next examined the third leg of the positive feedback loop by analyzing the editing logs of Gene Wiki pages. During the same period between 1 January and 30 June 2009, there were a total of 6848 edits to 1893 Gene Wiki pages by 1923 unique users or Internet Protocol (IP) addresses. (In addition, automated edits by ‘bots’ accounted for 11 912 edits.) Edits over this period were quite constant at an average of about 1100 edits per month (SD = 171), 263 edits per week (SD = 69) and 38 edits per day (SD = 21). The cumulative effect of these edits was to increase the size of the text in the Gene Wiki by 2.28 MB (4.1%), approximately equivalent to the text of 19 research articles in PLoS Biology. For individual articles, changes in page size are plotted as a function of current page size and Google rank in Figure 4.
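The per-day mean and SD reported above can be derived from a log of edit dates by counting edits in each day of the window, including days with no edits. The sketch below is illustrative only: the function name `daily_edit_stats` and the input format are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether a sample or population SD was reported.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, stdev


def daily_edit_stats(edit_dates, start, end):
    """Mean and sample SD of edits per day over the inclusive window [start, end].

    `edit_dates` is an iterable of datetime.date objects, one per edit.
    Days without any edit contribute a count of zero.
    """
    counts = Counter(edit_dates)
    n_days = (end - start).days + 1
    per_day = [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]
    return mean(per_day), stdev(per_day)


# Placeholder log: 2 edits on day 1, none on day 2, 4 on day 3.
log = [date(2009, 1, 1)] * 2 + [date(2009, 1, 3)] * 4
print(daily_edit_stats(log, date(2009, 1, 1), date(2009, 1, 3)))
```

The same binning approach applies to weekly or monthly summaries by changing the bin width.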
When examining the usage statistics, we noticed spikes in the viewing of certain genes, especially those mentioned in the popular press. To explore this observation, we identified the 771 Gene Wiki pages with the most recent variability in monthly page views. Of these, 69 had been searched often enough to have data in Google Trends (http://www.google.com/trends), a service that quantifies how many Google searches have been done for a particular term over time relative to the total number of Google searches. The correlation between Gene Wiki page views and Google Trends over time is readily apparent, with 43% of examined pages showing a significant correlation (R > 0.3; P < 0.01).
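A correlation of this kind can be computed by pairing, for each time point, a page's view count with the search-volume index for the same term. The sketch below uses the Pearson correlation coefficient. This is an assumption, because the text reports R and P but does not name the statistic; the function name `pearson_r` and the two short series are hypothetical, and the significance test (P value) is not reproduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder series: monthly page views and a search-volume index.
page_views = [310, 295, 330, 1200, 420, 350]
search_index = [1.0, 0.9, 1.1, 3.8, 1.5, 1.2]
r = pearson_r(page_views, search_index)
print(r > 0.3)
```

A page would count toward the reported fraction only if its coefficient exceeded the threshold and the accompanying significance test also passed.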
In many cases, the strong relationship between page views and Google Trends was driven by articles in the popular press (Figure 5). For example, the Wikipedia article for human chorionic gonadotropin (HCG) is one of the most frequently viewed articles in the Gene Wiki, presumably for its common usage in pregnancy tests. In May 2009, the Wikipedia article for this gene experienced a sharp spike in views (and edits) when Manny Ramirez was suspended for using HCG as a performance-enhancing drug. Similarly, catalase is a frequently viewed article for its relevance to many areas of biology including aging and cancer. However, following a scientific report linking catalase function to premature gray hair in February 2009 (20), a prominent spike occurred in the viewing and editing of its Gene Wiki entry. Taken in sum, these data show a dynamic relationship between scientific publications, reports on this science in the popular press and usage of the Gene Wiki. These observations also underscore the potential opportunity and effectiveness of using the Gene Wiki for public outreach and scientific education.
With the explosion in biological wikis, it is clear that the community intelligence model resonates with the biology and scientific community (9–15). Despite the enthusiasm for the potential of this model, it is also clear that realizing this potential is not trivial. Many of these biological wikis appear to suffer from a lack of participation. Establishing a critical mass of users and useful content appears to be the most common obstacle in these efforts.
Because the Gene Wiki integrates directly with Wikipedia, establishing critical mass has not been an issue. Clearly, Wikipedia already had a critical mass of users and articles, and the Gene Wiki has been able to effectively leverage those resources as demonstrated by the usage and editing metrics presented above. Moreover, within the last year, the American Society for Cell Biology, the Society for Neuroscience and the National Institutes of Health have all held workshops or initiated efforts focused on science articles in Wikipedia. However, the Gene Wiki inherited a completely different set of challenges. First and most notably, Wikipedia allows users to remain completely anonymous, which often leads to fears of inaccuracy and bias. Second, Wikipedia is primarily focused on building unstructured articles (free text, images, diagrams, etc.) with relatively little attention to how contributed knowledge can be structured for downstream analyses in the way that Gene Ontology annotations, for example, can be utilized (21).
We intend to focus on these issues in future developments of the Gene Wiki. Although previous studies have suggested that Wikipedia is of comparable accuracy to traditionally curated works (22), other efforts have been developed to explicitly account for trustworthiness of content based on historical editing patterns of each user (23). Moreover, while we still believe that a completely unstructured Gene Wiki article is useful to the community (similarly to a gene-specific review article), we are also investigating methods to integrate community intelligence with data structure using novel technical solutions [e.g. Semantic MediaWiki (24)] and biomedical ontologies (25).
It is essential to emphasize that community intelligence efforts are not a replacement for traditionally curated gene annotation authorities (16,26–28). Rather, we believe that community intelligence resources are complementary to existing databases and offer a different set of strengths and weaknesses. Certainly, the data generation model is very different, and users of the Gene Wiki need to recognize that the Gene Wiki, like Wikipedia itself, should be treated differently from the primary literature and expert-curated databases.
Ultimately, we believe that a variety of solutions in the area of community intelligence are worth exploring. Future Gene Wiki development will focus on addressing the challenges described above, and we are also very enthusiastic about complementary efforts as they work to build critical mass and encourage participation. Regardless, the usage metrics presented above demonstrate that the Gene Wiki is relevant right now, certainly to the general public and also to a growing number of scientists. We hope that the scientific community embraces this opportunity both to collaboratively annotate gene function and to directly communicate with the public in science education and outreach.
Wikipedia is freely available for viewing at http://wikipedia.org, and the Gene Wiki Portal page can be accessed at http://en.wikipedia.org/wiki/Portal:Gene_Wiki. All text is licensed under the Creative Commons Attribution/Share-Alike License 3.0 (Unported).
Supplementary Data are available at NAR Online.
Funding for this work and for the open access charge was provided by the Novartis Research Foundation and the National Institutes of Health [Grant Number 1R01GM083924 to A.S.].
Conflict of interest statement. None declared.
The authors acknowledge Konrad F. Koehler for helpful suggestions and enthusiastic editing, Jeff Janes and Julia Turner for technical assistance, as well as the entire community of Wikipedia editors and the Molecular and Cellular Biology WikiProject (http://en.wikipedia.org/wiki/WP:MCB) for contributions and feedback.