WormBase relies on a thorough manual curation pipeline to extract data from the corpus of C. elegans literature. This process ensures high quality and consistency of data and provides an atomized evidence trail for all data entered into the database, but it is time-consuming and labor-intensive.
We continue to refine our curation strategy for greater efficiency, breadth and depth of coverage. Notably, we have automated nearly 90% of ‘first-pass’ curation, which aims to flag and extract data types contained in each reference. We implemented these improvements using a two-tiered approach that combines programmatic natural language processing with web-based data submission forms targeted to authors. This process now identifies 27 distinct data types (for details, see http://www.wormbase.org/wiki/index.php/Curated_data_types). Five of these data types (alleles, small-scale RNAi experiments, transgenes, gene interactions and antibodies) are identified automatically through use of the text mining system Textpresso (12). In addition, we employ Textpresso for fact extraction, notably for curation of Gene Ontology (GO) cellular component terms (13).
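As an illustration of sentence-level fact extraction of this kind, the following minimal sketch flags sentences in which a gene name, a cellular component term and a localization verb co-occur. It is not Textpresso's interface: the term lists, the gene-name pattern and the function names are hypothetical and far smaller than any curated lexicon would be.

```python
import re

# Hypothetical category lexicons; a production system would use curated term lists.
GENE_PATTERN = re.compile(r"\b[a-z]{3,4}-\d+\b")  # names shaped like unc-54 or daf-16
COMPONENT_TERMS = {"nucleus", "mitochondria", "cytoplasm", "plasma membrane"}
LOCALIZATION_VERBS = {"localizes", "localized", "expressed", "detected"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as 'C. elegans' would be split incorrectly; this is
    acceptable only for a sketch.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_sentences(text):
    """Return (sentence, genes, components) for sentences matching all three categories."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        genes = GENE_PATTERN.findall(lowered)
        components = sorted(term for term in COMPONENT_TERMS if term in lowered)
        has_verb = any(verb in lowered for verb in LOCALIZATION_VERBS)
        if genes and components and has_verb:
            hits.append((sentence, genes, components))
    return hits
```

Sentences returned by `candidate_sentences` would still need review by a curator; the sketch only narrows down where a curator has to look.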
The identification and flagging of two data types (small- and large-scale RNAi experiments, and phenotype analysis) have been automated using support vector machines (SVMs; 14). We are testing SVMs on all curated data types and will adopt this method to classify and index papers for those data types that can be identified efficiently and reproducibly. For data types that are not amenable to SVMs, we are exploring other statistical methods (such as hidden Markov models and conditional random fields) as well as rule-based methods. For example, we now use rule-based methods to identify papers that discuss C. elegans homologs of genes associated with human disease.
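To make the SVM-based flagging concrete, the sketch below trains a linear SVM on bag-of-words features by subgradient descent on the regularized hinge loss. It is a toy illustration under stated assumptions, not the classifier described in (14): the feature representation, the hyperparameters, the omission of a bias term and the four training sentences are all invented for the example.

```python
import re
from collections import Counter


def features(text):
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(weights, feats):
    """Linear decision value; positive means 'flag the paper for this data type'."""
    return sum(weights.get(tok, 0.0) * n for tok, n in feats.items())


def train_linear_svm(texts, labels, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (no bias term)."""
    weights = {}
    data = [(features(t), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for feats, y in data:
            margin = y * score(weights, feats)
            # Regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for tok, n in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * n
    return weights


# Toy, invented training sentences: +1 = reports RNAi data, -1 = does not.
texts = [
    "RNAi knockdown caused embryonic lethality",
    "feeding RNAi produced sterile animals",
    "the antibody stained neuronal nuclei",
    "the transgene was integrated by irradiation",
]
labels = [1, 1, -1, -1]
weights = train_linear_svm(texts, labels)

# A new paper is flagged when its decision value is positive.
flag_rnai = score(weights, features("RNAi feeding caused lethality")) > 0
```

In practice one would train a separate classifier per data type on full-text papers with curator-assigned labels, and hold out labeled papers to check that identification is reproducible before adopting the classifier.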
A second refinement has been to integrate the research community directly into curation. Using our database of public biographical information, we automatically e-mail authors of new C. elegans papers and ask them to use a concise and time-efficient web form to identify which data types are relevant to their papers. So far, 413 of 864 authors (about 48%) have responded, giving us a large pool of expert first-pass curators and greatly speeding curation. Community input will further support our switch to an automated first-pass pipeline because it will help us assess the success and failure rates of automation.
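One way such an assessment could be carried out is to compare, per data type, the flags produced automatically with the flags submitted by authors. The sketch below computes precision and recall under the assumption that author flags serve as the reference; the function name, the data layout and the paper identifiers are hypothetical.

```python
from collections import defaultdict


def evaluate_flags(automated, author_reported):
    """Per data type, compare automated flags with author-submitted flags.

    Both arguments map a paper ID to the set of data types flagged for it.
    Only papers present in both mappings are compared. Treating author
    flags as the reference is an assumption of this sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for paper_id in automated.keys() & author_reported.keys():
        auto, ref = automated[paper_id], author_reported[paper_id]
        for data_type in auto & ref:
            counts[data_type]["tp"] += 1  # flagged by both
        for data_type in auto - ref:
            counts[data_type]["fp"] += 1  # flagged only automatically
        for data_type in ref - auto:
            counts[data_type]["fn"] += 1  # flagged only by the author

    report = {}
    for data_type, c in counts.items():
        flagged = c["tp"] + c["fp"]
        relevant = c["tp"] + c["fn"]
        report[data_type] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / relevant if relevant else None,
        }
    return report
```

A data type with low recall would indicate that automation misses papers authors consider relevant, while low precision would indicate over-flagging; either could guide which data types remain under manual first-pass review.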