Community annotation
The central goal of WikiProteins is community annotation of biomedical concepts and their interactions. The basic principle of community annotation is that computers and experts interact in an iterative process of mining and curation, as pictured in Figure . The various new technologies, terms and approaches adopted to enable this process will be described in more detail below, but first the basic principles of the approach are explained.
The biomedical literature contains pertinent 'facts', that is, statements of relationships between concepts that are generally considered to be scientifically 'accepted'. Each new article contains many repetitious factual statements, with references, along with a limited number of 'novel' facts. New facts will frequently also cause novel co-occurrences. As a consequence of removing factual redundancy, the number of unique facts (and thus the concept space) expands with only a fraction of the total number of sentences in the biomedical literature (Figure ; see the 'Rationale and overview' section).
A growing subset of these relevant facts, such as the described functions of proteins, protein-protein interactions or protein-disease relationships, have already been annotated and curated in open access databases and ontologies, such as the UMLS and UniProtKB/Swiss-Prot, IntAct, and GO Annotation. These and other on-line resources have become indispensable tools for current biomedical research. However, the rate of growth of high throughput data and published information in the life sciences renders comprehensive and timely annotation of the literature for actual facts by any central team of experts an unachievable goal. Computer assistance in the annotation process is, therefore, urgently needed.
Recognizing concepts in free text is not trivial, not even for human readers, let alone for computers. The yeast protein CLB2 is an instructive example. The (incorrectly spelled) term 'Clb2', used as an example in [2], when typed into UniProtKB/Swiss-Prot, leads to 25 entries. One is the correct concept - the gene coding for G2/mitotic-specific cyclin-2 (see Figure for its WikiProteins page) - but the incorrect synonym used by the original authors is not listed in the corresponding Swiss-Prot record, neither as a synonym of the corresponding gene name nor of its protein. But Clb2 is, for instance, also a synonym for emb-9, which encodes the Collagen alpha-1(IV) chain in Caenorhabditis elegans.
In the Saccharomyces Genome Database [15], the formal name of the gene is CLB2, and the synonym Clb2 is not listed; however, the query term Clb2 leads to the correct gene. A focused database like Saccharomyces Genome Database can let its internal search engine be case insensitive and find CLB2 based on the query term Clb2, but in a wider context, case insensitivity leads to aggravation of the ambiguity problem. For example, in PubMed, the query 'Clb2' delivers papers on dental self-etching primers such as 'Clearfil Liner Bond 2' [PMID: 9522695, 12601887], on the Clb1 gene in the fungal pathogen Ustilago maydis [PMID: 14679309] and on calcineurin B-like proteins, such as CLB1 in Arabidopsis [PMID: 14617077].
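The effect of case folding on ambiguity can be illustrated with a minimal sketch. The term table below is a toy assumption made for illustration (in particular, the spelling assigned to the dental primer is hypothetical), and the functions `build_index` and `lookup` are not part of WikiProteins or of any database mentioned here.

```python
from collections import defaultdict

# Toy term-to-concept table. The entries are illustrative assumptions,
# not an export from Swiss-Prot, SGD or PubMed.
TERMS = [
    ("CLB2", "G2/mitotic-specific cyclin-2 (S. cerevisiae)"),
    ("Clb2", "emb-9 (C. elegans)"),
    ("CLB2", "Clearfil Liner Bond 2 (dental primer)"),  # assumed spelling
]


def build_index(terms, fold_case):
    """Map each term (optionally lower-cased) to the set of concepts it names."""
    index = defaultdict(set)
    for term, concept in terms:
        key = term.lower() if fold_case else term
        index[key].add(concept)
    return index


def lookup(index, query, fold_case):
    """Return the sorted concepts that a query term resolves to."""
    key = query.lower() if fold_case else query
    return sorted(index.get(key, set()))


exact_hits = lookup(build_index(TERMS, fold_case=False), "Clb2", fold_case=False)
folded_hits = lookup(build_index(TERMS, fold_case=True), "Clb2", fold_case=True)

print("case sensitive:  ", exact_hits)
print("case insensitive:", folded_hits)
```

In this toy table the case-sensitive lookup misses the intended yeast gene altogether, whereas the case-insensitive lookup finds it at the price of two additional homonyms, which is the trade-off described above.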
For computational meta-analysis this ambiguity is a severe limitation. In earlier microarray case studies we typically found that roughly 40% of all gene names in our lists have homonymy problems of some sort (unpublished data). Most of the re-writing rules to improve 'fuzzy' recall of gene and protein names have negative effects on precision and only marginal positive effects on recall [16]. Thus, non-standardized use of terms in the literature induces vast problems of homonymy and these are not easy to solve.
In WikiProteins, various algorithms have been implemented to keep the homonym problem to the minimum achievable with the current techniques for word sense disambiguation [17]. However, false positives for co-occurrence of two concepts in a sentence based on homonyms still happen occasionally and will be a disturbing factor in WikiProteins also. In contrast to 'read only' sources on the web, in WikiProteins, users are able to enrich the terminology system, thus improving concept recognition in future instances of indexing the same records.
In the natural language of standard scientific literature, the majority of simple facts have been described within one sentence, although in some cases a factual statement may be spread over multiple sentences. Attempting to mine these 'scrambled facts', in early case studies, only marginally increased the recall of actual facts and introduced many errors [18]. Attempts to mine multiple sentences and paragraphs in the broad biomedical literature for all individual instances of a unique factual statement have met with limited success and, in fact, may have very little added value for meta-analysis of the literature as a whole [1]. Unless the fact is very new, multiple instances of statements in sequential publications are only of use, from a computational point of view, to increase the likelihood that the statement is a consolidated fact. For well established facts one does not need to find the very last instance of the factual statement in all papers to be able to present the fact correctly in an ontological format such as the Knowlet. We have chosen, therefore, to analyse texts at the sentence level and accept the trade off with optimal recall of individual statements.
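Sentence-level analysis of this kind can be sketched as a simple co-occurrence count. The example below is a minimal illustration under stated assumptions: the mini-thesaurus, the naive sentence splitter and the two-sentence toy text are all hypothetical, and the code does not reproduce the Peregrine indexer or the computation of the Knowlet parameters.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-thesaurus: surface term -> preferred concept label.
THESAURUS = {"Clb2": "CLB2", "Cdc28": "CDC28", "Swe1": "SWE1", "Cdc5": "CDC5"}

# Toy text written for this sketch (not a quotation from the literature).
TEXT = "Clb2-bound Cdc28 phosphorylated Swe1. Cdc5 then promotes Swe1 degradation."


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def concepts_in(sentence):
    """Return the concepts whose surface term occurs as a whole word."""
    found = set()
    for term, concept in THESAURUS.items():
        if re.search(r"\b" + re.escape(term) + r"\b", sentence):
            found.add(concept)
    return found


def cooccurrences(text):
    """Count, per unordered concept pair, the sentences mentioning both."""
    counts = Counter()
    for sentence in split_sentences(text):
        for pair in combinations(sorted(concepts_in(sentence)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrences(TEXT)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Because pairs are only counted within one sentence, CDC28 and CDC5 are not linked in this toy text even though both relate to SWE1; this is the loss of recall that is accepted in exchange for fewer errors.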
For Knowlet construction the number of sentences found affects the value of the C parameter (Figure ), but in many instances where the C parameter is positive, there is either factual or associative information involved in the computation of the semantic association. Logical co-occurrences suggested by the mining technologies as 'potential facts' are actively presented to registered experts for community annotation. Where possible, confirmation of factual status should be reported in the Wiki with references to sentences in the peer reviewed literature as supporting evidence.
An additional major limitation of classic text mining approaches is that much of the relevant text is securely behind the firewalls of publishers and is not easily accessible for automated indexing. This is another reason why it is not possible to exclusively rely on computational text mining as a definitive source for facts. In fact, roughly 60% of protein-protein interactions mined from Swiss-Prot and IntAct cannot be found co-occurring in a PubMed sentence or even an abstract (H van Haagen and A Botelho-Bovo, in preparation). This should not be considered surprising, as much of the information leading to those annotations came from full text articles, and within these from tables and figures, many of which are not suited for computer indexing. Thus, a large, intrinsically motivated community of experts is needed to accelerate the curation and annotation process of mined 'potential facts'. Copying of relevant sentences from full text literature with reference to the original article is one of the goals of WikiProteins. Easy tools for recognition of new co-occurrences (that is, not occurring in PubMed), but only in full text articles, are under development. Digital object identifiers of the underpinning articles can be downloaded in the Wiki environment to support factual statements by registered scientists. As more new relationships are validated, this approach may lead to collaborative knowledge discovery. This iterative human-machine interaction is a perceived central aspect of community annotation.
Based partly on the concerns described above, several attempts have already been made to involve the scientific community in annotation [19-22], but so far with limited success. We postulate that this slow adoption of collaboration via web services is due both to the perception of immature applications for annotation and to the fact that distributed annotation is widely perceived by busy scientists as a service to their colleagues only, and much less as a crucial activity for their own research work with immediate positive returns. However, community annotation aims to create and support stable and growing communities of interest around certain concepts, such as genes/proteins, pathways, diseases and drugs, with incentives for keeping information fully up to date.
Several colleagues have recently communicated a spontaneously growing activity in the current Wikipedia environment to annotate protein and RNA related pages (A Bateman, personal communication). WikiProteins is automatically linked to such community annotations in Wikipedia through the on the fly concept recognition. More direct mapping approaches are being developed. This hyper-linking allows annotations in both environments to be captured in the concept space.
It should be emphasized that editing in Wikipedia is not restricted to traceable registered users and that Wikipedia is meant to represent a neutral point of view. WikiProteins is complementary in that it provides a more structured environment where more original data and scientific debate can be accommodated, as well as a direct collaboration with authoritative sources. We anticipate, therefore, a co-existence and complementary role for Wikipedia and WikiProteins.
Knowledge browsing
A second user scenario is the use of WikiProteins to browse quickly through the concept space for interesting relationships.
To demonstrate the current status of the Knowlet based system we will use the following sentence from PMID 15920482: "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation." Jensen et al [2] discussed this example in their review and made the following statement regarding this sentence: "Current ad hoc IR systems are not able to retrieve our example sentence when they are given the query 'yeast cell cycle'. Instead, this could be achieved by realizing that 'yeast' is a synonym for S. cerevisiae, that 'cell cycle' is a Gene Ontology term and that the word Cdc28 refers to a S. cerevisiae protein, and finally, by looking up the gene ontology terms that relate to Cdc28 to connect it to the yeast cell cycle. Although this will not be easy, we see this form of query expansion as the next logical step for ad hoc IR." WikiProteins is not to be perceived as an information retrieval (IR) system, but it is illustrated below that the concept space may nevertheless serve this stated need.
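The query expansion described in this quotation can be sketched in a few lines. All lookup tables below are hand-made assumptions standing in for a species thesaurus, the Gene Ontology and protein annotations; the function names are hypothetical and the sketch is not the method used by WikiProteins.

```python
# Toy lookup tables (assumptions for illustration only).
SPECIES_SYNONYMS = {"yeast": "S. cerevisiae"}
GO_TERMS = {"cell cycle"}
PROTEIN_SPECIES = {"Cdc28": "S. cerevisiae"}
PROTEIN_GO = {"Cdc28": {"cell cycle"}}

SENTENCE = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation."
)


def expand_query(query):
    """Expand a free-text query into species, GO terms and annotated proteins."""
    q = query.lower()
    species = {canon for word, canon in SPECIES_SYNONYMS.items() if word in q}
    go_terms = {term for term in GO_TERMS if term in q}
    # Keep proteins of a queried species that carry a queried GO annotation.
    proteins = {
        protein
        for protein, sp in PROTEIN_SPECIES.items()
        if sp in species and PROTEIN_GO.get(protein, set()) & go_terms
    }
    return species | go_terms | proteins


def matches(sentence, terms):
    """True if any of the terms occurs in the sentence (case-insensitive)."""
    text = sentence.lower()
    return any(term.lower() in text for term in terms)


expanded = expand_query("yeast cell cycle")
print(sorted(expanded))
print("plain query matches:   ", matches(SENTENCE, {"yeast cell cycle"}))
print("expanded query matches:", matches(SENTENCE, expanded))
```

The plain query string does not occur in the example sentence, whereas the expanded term set reaches it through the protein name Cdc28.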
First, when the full abstract [15920482] is put into the concept recognition window, the ambiguity in the language becomes quite apparent. 'S. cerevisiae' is called 'budding yeast' in the title and the only protein mentioned there is 'Swe1/Wee1'. Furthermore, the authors of this abstract have used several constructs that make text mining difficult as they enter conjugate terms such as 'mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog)', 'Clb2-Cdc28', 'Clb2-Cdc28-phosphorylated Swe1', 'Cdc28/Cdk1', and 'Cdc5/Polo'. Many difficulties are introduced by using non-preferred names for genes and proteins and, particularly, by using dashes and slashes that are not parts of the gene symbol, but are simply separators for conjugated terms. The text further mentions that Wee1 is a protein kinase.
Despite this high degree of ambiguity in the terminology in the test abstract 15920482, the Peregrine indexer recognizes several meaningful concepts in the abstract: the proteins Serine/threonine protein kinase; Wee1 like protein kinase; Protein arginine N-methyltransferase HSL7; Cell division control protein 2, based on the synonyms Cdk1 and Cdc28; the concepts bud neck, and mitotic entry; the GO term cyclin-dependent protein kinase regulator activity; Polo-Box domain, phosphorylation; and the organism Saccharomyces. A click on the PMID 15920482 will lead to the concept web-linked version of the abstract.
Notwithstanding the severe problems in this abstract for automated indexers due to ambiguity, the composite Knowlet that was automatically created from this abstract has the following concepts in the histogram (Figure ): cell division, cell cycle, Saccharomycetes, kinase activity, yeasts and mitosis. From this first case study it can be concluded, therefore, that the Knowlet of this abstract associates its content very strongly with the query terms 'yeast' and 'cell cycle', partly due to our thesaurus-based mapping of budding yeast to Saccharomyces. Further improvement of protein recognition and recognition in highly ambiguous text will dramatically improve this output.
When the selected sentence is taken by itself for indexing, only one of the proteins is correctly recognized by the indexer. Nevertheless 'cell cycle' and 'mitosis' are central concepts in the resulting Knowlet. The connection to 'yeast' disappears, which is due to the poor species-specific recognition of proteins in the sentence and the absence of a reference to yeast in the sentence itself.
As a second example, the respective proteins from the case study sentence were mapped with the WikiProteins dictionary look up to the following concepts with the preferred terms: Clb2 = G2/mitotic-specific cyclin-2 (S. cerevisiae) Swiss-Prot P24869; Cdc28 = Cell division control protein 28 (S. cerevisiae) Swiss-Prot P00546; Cdk1 = homolog of Cdc28; Swe1 = Mitosis inhibitor protein kinase SWE1 (S. cerevisiae) Swiss-Prot P32944; Cdc5 = Cell cycle serine/threonine-protein kinase CDC5/MSD2 (S. cerevisiae) Swiss-Prot P32562.
The Knowlets of these proteins were aggregated in the concept space. The system creates the Knowlet-output shown in Figure . In discovery mode (Figure ; preference for co-occurrences and associations over facts), the closest factually associated concept in the graph is 'mitosis'. The strong semantic association between 'mitosis' and the four source concepts is mainly caused by factual relations (GO annotation) of all four source proteins (Figure ). In addition, there are co-occurrences (Figure ), and, finally, there are many associative concepts (Figure ). The same Knowlet, presented in background mode, shows the concept 'cell cycle' prominently present for mainly the same reasons.
The main conclusion from this particular example is that the future aim to associate the selected sentence with the concepts 'yeast' and 'cell cycle' is, in fact, not primarily hampered by the fact that the two terms or their synonyms are not mentioned in the sentence. With this level of language complexity and ambiguity, the problem is more related to the lack of adequate computer-recognition of (wrongly spelled) terms (see also the 'Rationale and overview' section). Methods that take context and factual knowledge from databases into account, like the one described here, will relate the case study sentence to the desired terms.
It should be emphasized again that creating a factual and associated concept space around 'yeast cell cycle' with appropriate links to supporting sentences for each edge in the network is a more useful approach to knowledge discovery than the retrieval of a single sentence.
Collaborative knowledge discovery
The third scenario serves to demonstrate the potential for knowledge discovery using the WikiProteins resource and community annotation.
When the composite Knowlet of the concept 'antimalarials' and 46 known antimalarial drugs is viewed in discovery mode with the semantic filter on 'chemicals' only, there are three yellow rings, which represent concepts associated with this space only by indirect association (Figure ). These concepts are 'mdr gene/protein plasmodium', 'dihydrofolate reductase' and the drug 'tegafur'. The first two concepts are logical associations with malaria. The association with tegafur is not obvious: the drug does not have any co-occurrence in PubMed with 'malaria', 'plasmodium', or 'antimalarials', as checked by a regular PubMed search on 28 December 2007.
The interest of a researcher may be sparked by the enzyme and cell division related concepts in the Knowlet of the anti-neoplastic drug tegafur and this may lead to the construction of the Knowlet depicted in Figure , where the source concept represents 'tegafur'. The most highly associated enzyme in this Knowlet is 'thymidylate synthase' (TS).
When PubMed was consulted, out of 2,991 abstracts on tegafur, several mentioned the enzyme as a target for the drug. An 'AND' query with 'malaria' and TS yields 55 abstracts among which is the article 'Evaluation of the activities of pyrimethamine analogs against Plasmodium vivax and Plasmodium falciparum dihydrofolate reductase-thymidylate synthase (TS) using in vitro enzyme inhibition and bacterial complementation assays' by Bunyarataphan et al. [16954316]. This abstract contains the sentence: "The 50% inhibitory concentrations derived from PvDHFR-TS-dependent bacteria were correlated with their corresponding inhibition constants (Ki) from an enzyme inhibition assay, pointing to the likelihood that the potent enzyme inhibitors will also have potent anti-malarial activities." The procedure described has correctly revealed an indirect association in the concept space that could indicate that tegafur is a candidate anti-malarial drug.
When the connections in the concept space around antimalarials and tegafur are explored further, it becomes immediately obvious how logical it would be to reason that tegafur might indeed inhibit growth of malaria parasites, at least in vitro (Figure ). Obviously, multiple reasons could exist why the compound may not work, including physical reasons, such as prevention of entrance into erythrocytes based on the molecular size of tegafur. It is beyond the scope of this paper to investigate these associations any further, but it serves as an example of the principle of Knowlet-based discovery.
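The principle of finding concepts that are linked to a source concept only through an intermediate concept can be sketched as follows. The edge list is a toy assumption loosely modelled on the tegafur example, and the functions `neighbours` and `indirect_links` are hypothetical; the sketch ignores the weighting of factual, co-occurrence and associative relations that a Knowlet carries.

```python
from collections import defaultdict

# Toy undirected co-occurrence edges (illustrative assumptions only).
EDGES = [
    ("tegafur", "thymidylate synthase"),
    ("thymidylate synthase", "malaria"),
    ("pyrimethamine", "malaria"),
]


def neighbours(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def indirect_links(edges, source):
    """Return concepts two steps from `source` with no direct edge to it.

    The result maps each such concept to the intermediates that connect it.
    """
    graph = neighbours(edges)
    direct = set(graph[source])
    found = defaultdict(set)
    for intermediate in direct:
        for target in graph[intermediate]:
            if target != source and target not in direct:
                found[target].add(intermediate)
    return dict(found)


links = indirect_links(EDGES, "malaria")
for concept, via in links.items():
    print(concept, "via", sorted(via))
```

In this toy graph, tegafur is reported as indirectly associated with malaria via thymidylate synthase, while pyrimethamine is excluded because it is already directly connected.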