Our usual curatorial workflow (Figure ) leaves out between 50 and 75% of all papers retrieved from keyword searches (although this ratio has fluctuated over time). Figure shows how many abstracts were retrieved with our search strategies each year, and how many of them were actually curated manually. One of the motivations for this work was to estimate whether our triage and curation efforts were retrieving all the potential information on transcriptional regulation in E. coli that was available in the PubMed-based literature, and to learn how to better explore, for curation, the full extent of the literature obtained in our keyword searches.
Figure 3. Curated and retrieved articles for RegulonDB, by year. Comparison between all references initially retrieved from PubMed using the RegulonDB curators' search algorithms, and references that were finally reviewed in full to populate the database.
A biologist reviewed a random sample drawn from the most comprehensive set of interactions that our system was able to extract but that were not found in RegulonDB. This exercise allowed us to explore information that a) was not retrieved during the manual curation process but should have been, b) was processed incorrectly, or c) although basically correct, was not completely relevant to our current purposes (for example, gene regulation other than transcriptional, although RegulonDB plans to add this information in the future). From a random sample of 96 interactions, we found 19 that represented relevant information not present in our reference database and that merited either a closer look at the sources or further analysis to establish whether they should be incorporated into RegulonDB. There were multiple reasons for this information being missing, among them: 1) the source papers had not been retrieved for curation or had not been curated yet, 2) the genes or TFs were mentioned with unusual synonyms, IDs, references or terms, which made their manual curation difficult, or 3) the evidence presented either was deemed insufficient by curators or was presented with a high level of hedging ("the molR gene probably regulates the expression of the chlD operon"). If we extrapolate this figure to the 1,545 interactions from the comprehensive, non-enriched set of 2,649 interactions that could not be matched with the relevant 3,108 RegulonDB entries, around 290 new interactions could be added to the database after manual review. RegulonDB curators will curate this automatically generated network to see whether they can integrate the data into the database, and will also search for other pertinent information, such as sites and distances for genes and promoters.
To test the linguistic processing of the system, we manually reviewed a random sample of 96 extracted interactions. We judged 81 of them (84%, the figure we report as overall precision) to have a basically correct semantic interpretation of the sentences, and 76 of them to be biologically correct to the point of including the right activation or repression function. The network gathered from all sources allowed us to obtain 45% of the entire human-curated RegulonDB network, while a more limited selection of 700-plus network-related papers (RN) accounted for 33% of that total. The "artificial" addition of multiple-entity objects such as operons and two-component protein systems from RegulonDB (information that was available from our reference database, but would not be for organisms not yet curated) increased the size of the global network by 10 percentage points (324 interactions). In most datasets the increase was less significant, and we believe that, as a whole, the information added through previous domain knowledge did not contribute greatly to the results.
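The precision figures from the sample review described above follow from simple ratios. The short Python sketch below recomputes them from the reported counts (96 reviewed, 81 semantically correct, 76 biologically correct); the function and variable names are illustrative assumptions and are not part of the original system.

```python
def precision(correct: int, reviewed: int) -> float:
    """Fraction of reviewed extractions that were judged correct."""
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive count")
    return correct / reviewed


# Counts reported for the manually reviewed random sample.
REVIEWED = 96
SEMANTICALLY_CORRECT = 81   # correct interpretation of the sentence
BIOLOGICALLY_CORRECT = 76   # also the right activation/repression function

semantic_precision = precision(SEMANTICALLY_CORRECT, REVIEWED)
biological_precision = precision(BIOLOGICALLY_CORRECT, REVIEWED)

print(f"semantic precision:   {semantic_precision:.1%}")
print(f"biological precision: {biological_precision:.1%}")
```

Under these counts, the 84% figure corresponds to the semantic-interpretation criterion; the stricter biological criterion gives a somewhat lower ratio.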
In a more extensive evaluation simulating full curation of NLP-generated networks, a domain expert reviewed 481 interactions that A) were obtained from processing the 12,059 abstracts retrieved using the RegulonDB search strategies and B) were not found in the RegulonDB database. Again, we wanted to test whether the extraction system could find relevant information that was missed at the triage or curation stages (a difficult task for human curators). We found 91 interactions that could be added to RegulonDB. In the rest, either there was an error in the inference made from the text or, even when the inference was correct, the data could not be added to RegulonDB for various reasons, among them: the regulation was not transcriptional, the interaction did not correspond to E. coli, or there was an error in the gene/protein identification module. In a few of the interactions the data seemed correct, but the complete papers were unavailable to check them fully before adding them to the database. The 18.91% of total interactions that was either immediately useful or seemed correct pending further analysis constitutes a reasonable addition to the curated network, one that seems to be worth the review effort by the curatorial team (potentially, close to 290 new interactions could be added if we consider the combined network obtained from all our corpora). In any retrieval system, the small "tail" of valuable missed information is always harder to come by than the obvious cases, and improving recall and precision is hard. It is clear that fully automatic processing of these sentences would require complex Artificial Intelligence inference engines customized for biological interpretation. Data that was not included in the RegulonDB network we tested against (such as other regulation mechanisms or plasmid genes) could nonetheless provide important relationships associated with the annotation process.
By doing a manual sweep of computer-generated curation, new or relevant information can be garnered that complements, expands or confirms human annotation.
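The yield and extrapolation figures reported for the two manual reviews can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the helper name is hypothetical, and applying each sample's rate to the 1,545 unmatched interactions is our assumption about how the estimate of roughly 290 new interactions is obtained.

```python
def extrapolate(useful: int, reviewed: int, unmatched_total: int) -> tuple[float, float]:
    """Return (useful rate in the sample, estimated useful count in the full set)."""
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive count")
    rate = useful / reviewed
    return rate, rate * unmatched_total


# Interactions in the combined network that could not be matched to RegulonDB.
UNMATCHED = 1545

# Smaller review: 19 relevant interactions out of a random sample of 96.
rate_small, est_small = extrapolate(19, 96, UNMATCHED)

# Larger review: 91 addable interactions out of 481 reviewed.
rate_large, est_large = extrapolate(91, 481, UNMATCHED)

print(f"96-sample rate:  {rate_small:.2%}, extrapolated: {est_small:.0f}")
print(f"481-sample rate: {rate_large:.2%}, extrapolated: {est_large:.0f}")
```

The rate from the larger review, applied to the 1,545 unmatched interactions, gives a value close to 290; the smaller sample yields a slightly higher estimate, which is expected given its size.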
An interesting issue to explore is why our IE system did not obtain a more complete network from these papers. In other words, what are the limitations of an extraction system such as the one implemented for this task? First, there is the issue of the availability of full-text papers, which contain orders of magnitude more information than abstracts. Of around 3,110 RegulonDB PubMed IDs as of June 2006, we were able to obtain just 2,475 (79.6% of the total). Even with the papers we actually collected, we had to deal with incorrect conversion from PDF format, inconsistencies in term usage, etc. Another problem is that not all the information is consistently presented in an explicit manner. Sometimes tables, graphics and illustrations provide what human curators need to generate relevant information, for example through some kind of domain inference that our system was not designed to perform. The sources used were also decisive factors in how well the system was able to generate a useful network. In order to compare the triage techniques employed in RegulonDB, we estimated what could be termed the "informational density" of the different corpora. We related the total size of the network obtained from each source to the number of distinct documents and the raw size of each corpus, as shown in Table . This allowed us to compare more accurately the quality and quantity of the network information obtained with each of the document sets. One of the ratios we estimated is the percentage of all interactions obtained that were found in RegulonDB (Column F), another measures the average size of each document in the corpus (G), and the last one shows how many RegulonDB interactions were obtained per document (H).
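The three ratios described above (columns F, G and H) can be sketched as follows. This is a minimal illustration in Python, assuming that each corpus is summarized by four counts; the class name, field names, size unit (kilobytes) and the numbers in the example are all hypothetical placeholders and are not taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Summary counts for one corpus (all field names are assumptions)."""
    name: str
    interactions_total: int         # all interactions extracted from the corpus
    interactions_in_regulondb: int  # extracted interactions also found in RegulonDB
    documents: int                  # number of distinct documents
    raw_size_kb: float              # raw size of the whole corpus

    def pct_in_regulondb(self) -> float:
        """Column F: percentage of extracted interactions found in RegulonDB."""
        return 100.0 * self.interactions_in_regulondb / self.interactions_total

    def avg_document_size_kb(self) -> float:
        """Column G: average size of a document in the corpus."""
        return self.raw_size_kb / self.documents

    def regulondb_interactions_per_document(self) -> float:
        """Column H: RegulonDB interactions obtained per document."""
        return self.interactions_in_regulondb / self.documents


# Placeholder values for illustration only; they are NOT the published figures.
example = CorpusStats(
    name="hypothetical abstracts set",
    interactions_total=1000,
    interactions_in_regulondb=400,
    documents=5000,
    raw_size_kb=10000.0,
)

print(example.name)
print(f"F: {example.pct_in_regulondb():.1f}% of interactions found in RegulonDB")
print(f"G: {example.avg_document_size_kb():.1f} KB per document")
print(f"H: {example.regulondb_interactions_per_document():.3f} RegulonDB interactions per document")
```

Keeping the counts and the derived ratios separate makes it straightforward to compare corpora of very different sizes on the same three measures.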
The density of RegulonDB-related information shows that a set of abstracts, although it yields fewer overall interactions than a full-text set with a similar number of documents, can contain more densely packed relevant information; we can nonetheless expect to retrieve a smaller quantity of information from it. The interesting comparisons here are between corpora gathered with different search strategies, which result in different numbers of documents. RegulonDB searches (RS) and STRING-IE abstract searches (ST) show a similar global network size but quite different numbers of relevant interactions per document, while the curated EcoCyc paper set (EA) compares fairly well with RS. These numbers by themselves are not an explicit guideline for obtaining a corpus that will maximize the potential information retrieved in Text Mining, but they do allow some degree of insight into diverse sources in which document numbers, sizes and relevance are very different. Until the automatic paper selection issue is satisfactorily solved, high-throughput Information Extraction techniques can help lessen the impact of this specific problem on total results, since the technology can process a smaller number of more informative papers just as well as a much more extensive set of less relevant papers, and still retrieve a significant number of useful interactions.
Informational density of various corpora