Gene portals (e.g., Entrez Gene [1] and Ensembl [2]) and model organism databases (e.g., Mouse Genome Database [3], Rat Genome Database [4], FlyBase [5]) are popular and useful tools for researching gene annotation and enforcing data standards. These databases provide a large volume and diversity of information on each gene, including protein and transcript sequences, genome location, genomic structure, aliases, links to literature, and gene function. These sites are considered to be the definitive sources for these types of gene annotation. However, by their very nature as authoritative annotation sources, the data displayed on these sites must be subjected to a high degree of oversight by expert curators. In short, the data model used by gene portals and model organism databases focuses on large contributions from a relatively small number of contributors.
In contrast, the online encyclopedia Wikipedia uses a different model for collaboratively synthesizing knowledge, commonly referred to as the “Long Tail” [6]. The term was originally coined in reference to the power law relationship observed in Internet commerce; in Wikipedia, the Long Tail is typified by a relatively open data model that targets small contributions from a large population of contributors. Articles in Wikipedia can be freely edited by all users, including anonymous editors, and any registered user can create new articles. Established in 2001, the English Wikipedia currently contains over two million articles edited by over six million user accounts. A recent study found that the total number of contributions from new editors (fewer than 100 total edits) equals the number of contributions from the most established editors (more than 10,000 edits) [7], illustrating the collective importance of the Long Tail. Equally important, previous studies have shown that Wikipedia content on scientific topics rivals the online Encyclopedia Britannica in accuracy [8].
Despite the widespread use of Wikipedia for general interest topics, its use for scholarly subjects has been uneven. The potential power of applying the Long Tail model to gene annotation has been previously noted [9–11]. A loose organization of Wikipedia editors has spearheaded the creation and expansion of several thousand articles related to molecular and cellular biology (the “MCB Wikiproject”), including many gene-specific pages. These articles vary widely in quality, format, and completeness, ranging from relatively complete encyclopedic entries (e.g., “enzyme,” “oxidative phosphorylation,” and “RNA interference”) to very short collections of information called “stubs” (e.g., “amphinase” and “glomus cell”). As an example of the collaborative writing process, the article on RNAi has been edited 708 times by 232 unique editors since its initial creation in October 2002. On the subject of human genes, generally only the best-characterized genes and proteins have highly developed entries (e.g., “HSP90” and “NF-κB”).
In principle, a comprehensive gene wiki could have naturally evolved out of the existing Wikipedia framework, and as described above, the beginnings of this process were already underway. However, we hypothesized that growth could be greatly accelerated by systematic creation of gene page stubs, each of which would contain a basal level of gene annotation harvested from authoritative sources. Here we describe an effort to automatically create such a foundation for a comprehensive gene wiki. Moreover, we demonstrate that this effort has begun the positive-feedback loop between readers, contributors, and page utility, which will promote its long-term success.


