To retrieve scientific abstracts on green plants that relate to defense mechanisms, we used the MedlineRanker system [12]. Two MeSH (http://www.nlm.nih.gov/mesh/) terms (Host-Pathogen Interactions AND Plants) were used as the "training dataset" to rank 10,000 recently published abstracts from the whole MEDLINE database. After the MedlineRanker analysis, we retrieved the top 1,000 PubMed IDs from the generated ranking to be loaded as the "application dataset" for the next steps of our analysis [Additional file 1].
LAITOR is optimized to work by analyzing tagged scientific abstracts. For this purpose, we adopted the NLPROT [13] program as LAITOR's protein tagger. The plain text format (-f txt) must be chosen for the NLPROT output file, where bioentity names present in the text are tagged between "<n>" and "</n>" tags. The tagged protein names are filtered according to a user-defined bioentity dictionary, in our case study: a plant protein name and synonym dictionary.
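The tag extraction and dictionary filtering described above can be sketched as follows. This is a minimal illustration and not LAITOR's actual code: the function names are hypothetical, and case-insensitive matching against the dictionary is an assumption.

```python
import re

# Tag format taken from the text: names are wrapped in <n>...</n>.
TAG_PATTERN = re.compile(r"<n>(.*?)</n>")

def extract_tagged_names(tagged_text):
    """Return every name found between <n> and </n>, in order of appearance."""
    return [name.strip() for name in TAG_PATTERN.findall(tagged_text)]

def filter_by_dictionary(names, dictionary_terms):
    """Keep only tagged names present in the user-defined dictionary.

    Case-insensitive comparison is an assumption made for this sketch.
    """
    known = {term.lower() for term in dictionary_terms}
    return [name for name in names if name.lower() in known]

# Toy tagged sentence; "foo" stands for a tagged name absent from the dictionary.
sentence = "Expression of <n>PR1</n> and <n>PDF1.2</n> requires <n>foo</n>."
names = extract_tagged_names(sentence)
kept = filter_by_dictionary(names, ["PR1", "PDF1.2", "NPR1"])
```

Here `names` holds all three tagged strings, while `kept` retains only those that appear in the toy dictionary.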
Two protein dictionaries have been generated for the development of LAITOR. The first (named human protein dictionary) was created for the evaluation of LAITOR's performance (explained below) in the BioCreative II Interaction Article Subtask (IAS) [14]. The second (named plant protein dictionary) has been used to identify co-occurring green-plant protein pairs in abstracts related to host-pathogen interactions.
The human protein dictionary has been created by using all the protein records deposited for Homo sapiens [NCBI Taxonomy id: 9606] in the UniProt-SwissProt-TrEMBL (UP-SP-TR) database. This dictionary includes the definition(s) and synonym(s) for all human UP-SP-TR proteins. Furthermore, for each record, the corresponding NCBI Gene symbol and synonyms were used to enrich the representative terms of that protein. In the end, the human protein dictionary is composed of 87,537 records (IDs), comprising a total of 112,686 distinct protein terms, which have been complemented by the addition of 40,234 supplementary terms from the NCBI Gene database.
Additionally, specific gene names and synonyms for every organism in the NCBI Taxonomy database that has gene records in the NCBI Gene database have been used to create LAITOR-readable dictionaries. To use these dictionaries, users must provide the taxonomy identification number (Taxonomy ID) of the preferred organism followed by the extension ".dictionary" (e.g. "9606.dictionary" for "Homo sapiens" genes) during setup, as explained in LAITOR's documentation file.
For the plant dictionary, the complete Gene tab-delimited database from the Entrez website has been downloaded (5,317,958 records), which comprises 505,403 different organisms (Taxonomy IDs - TAXIDs). To keep only those records related to green-plant proteins, we used the NCBI Taxonomy database to select from the Gene table only those records with a TAXID corresponding to Viridiplantae organisms, which included 99,488 different records. In the end, the plant protein dictionary contained 148 plant organisms (0.02% of total organisms) and a total of 237,077 Gene records (4.45%), which included 217,224 distinct protein symbols and 62,521 synonyms (see one example for the gene PR1 of Arabidopsis thaliana in Additional file 2).
The resulting table displays two columns: one for the bioentity names and the second for their respective synonyms, so that each bioentity name can have as many lines (records) as it has synonyms (Additional file 2).
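The filtering by TAXID and the two-column layout can be illustrated with a short sketch. The column names (`tax_id`, `Symbol`, `Synonyms`), the "|" separator and the "-" placeholder for missing synonyms are assumptions modelled on NCBI Gene tab-delimited exports; the toy records and the function name are hypothetical, and the real Viridiplantae TAXID list would come from the NCBI Taxonomy database.

```python
import csv
import io

def build_dictionary(gene_table, allowed_taxids):
    """Return (name, synonym) rows for records whose TAXID is in allowed_taxids."""
    rows = []
    reader = csv.DictReader(gene_table, delimiter="\t")
    for record in reader:
        if record["tax_id"] not in allowed_taxids:
            continue  # record does not belong to a selected organism
        symbol = record["Symbol"]
        synonyms = [s for s in record["Synonyms"].split("|") if s and s != "-"]
        # One line per synonym; a name without synonyms still gets one line.
        if not synonyms:
            rows.append((symbol, ""))
        for synonym in synonyms:
            rows.append((symbol, synonym))
    return rows

# Toy table with illustrative values only (3702 is Arabidopsis thaliana).
toy_table = io.StringIO(
    "tax_id\tSymbol\tSynonyms\n"
    "3702\tPR1\tPR 1|PATHOGENESIS-RELATED GENE 1\n"
    "9606\tTP53\tP53\n"
    "3702\tNPR1\t-\n"
)
plant_rows = build_dictionary(toy_table, allowed_taxids={"3702"})
```

In this toy run the human record is dropped and PR1 yields two lines, one per synonym.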
Another aspect addressed by LAITOR is how to handle gene name ambiguity. Using the Taxonomy database to limit the number of entries reduced the chance of including names from other organisms, which would cause ambiguity among terms. However, some terms commonly occur for more than one organism, and different proteins from the same organism may share the same name or synonym. To cope with this, LAITOR creates a tag file in which the ambiguous terms identified in the analysis are normalized to the same name in the protein dictionary. Terms that match multiple protein names, or that are synonyms of multiple protein names, are marked in the LAITOR output. This warns users that such a term may be misinterpreted.
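One way to detect such ambiguous terms is sketched below. This is an assumption about how the check could be done, not a description of LAITOR's tag file; the function name and the toy dictionary rows are hypothetical.

```python
from collections import defaultdict

def find_ambiguous_terms(dictionary_rows):
    """Return terms (names or synonyms) that point to more than one protein name.

    dictionary_rows: iterable of (name, synonym) pairs, as in a two-column dictionary.
    Terms are compared in lower case, which is an assumption of this sketch.
    """
    term_to_names = defaultdict(set)
    for name, synonym in dictionary_rows:
        term_to_names[name.lower()].add(name)
        if synonym:
            term_to_names[synonym.lower()].add(name)
    return {
        term: sorted(names)
        for term, names in term_to_names.items()
        if len(names) > 1
    }

# Toy rows: "PR1" is both a name and a synonym of a different (made-up) entry.
rows = [("PR1", "PR-1"), ("PRB1", "PR1"), ("NPR1", "NIM1")]
ambiguous = find_ambiguous_terms(rows)
```

Terms returned by such a check are the ones that would be flagged to the user in the output.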
To check the co-occurrence and likely involvement of plant protein names along with the names of biotic and abiotic stimuli, a list of previously known stimuli and their synonyms has been provided as a Concept Dictionary (for example, Jasmonic Acid, Jasmonate and JA were included as the same concept). Both the Protein and Concept Dictionaries are available as additional material [Additional files 3
Additionally, to cover different contexts, we have loaded all the sub-headings of NCBI's Medical Subject Headings (MeSH) Trees (available at http://www.nlm.nih.gov/mesh/trees.html) as LAITOR concept dictionaries, as explained in LAITOR's documentation.
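The grouping of synonyms under one concept (as with Jasmonic Acid, Jasmonate and JA) can be sketched as a lookup table. The function names and the whole-word, case-insensitive matching are assumptions made for illustration only.

```python
import re

def build_concept_lookup(concepts):
    """Map every concept name and synonym (lower case) to its concept name."""
    lookup = {}
    for concept, synonyms in concepts.items():
        for term in [concept, *synonyms]:
            lookup[term.lower()] = concept
    return lookup

def find_concepts(sentence, lookup):
    """Return the distinct concepts whose terms occur as whole words in the sentence."""
    text = sentence.lower()
    found = []
    for term, concept in lookup.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text) and concept not in found:
            found.append(concept)
    return found

lookup = build_concept_lookup({"Jasmonic Acid": ["Jasmonate", "JA"]})
concepts = find_concepts("PDF1.2 is induced by JA and jasmonate.", lookup)
```

Both "JA" and "jasmonate" resolve to the single concept "Jasmonic Acid", so the concept is reported once.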
A list representing the different types of interactions or relationships between proteins was generated based on a previously published list [4]. It is composed of 76 terms, which have been included together with a total of 886 synonyms, as seen in Additional file 5, Table S2. Considering all terms, the biointeraction dictionary in its entirety is composed of 963 different words.
Once the abstracts to be analyzed had been retrieved and tagged for protein and gene names, biointeractions and concepts, LAITOR was used to perform a co-occurrence analysis [see Additional file 6].
At the sentence level, each line of the tagged abstracts was divided at every full stop ("."). We paid special attention to full stops in positions that do not mark the end of a sentence, as in species names (for example: A. thaliana) or protein names (for example: PDF1.2 protein).
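A sentence splitter that respects these two exceptions could look like the sketch below. The two rules used here (a stop followed by a non-space character is inside a token; a stop after a single capital letter is an abbreviated genus) are assumptions chosen to match the examples in the text, not LAITOR's actual rules, and they would wrongly merge a sentence that ends in a single capital letter.

```python
import re

def split_sentences(text):
    """Split text at full stops, skipping stops inside tokens and after genus initials."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        i = match.start()
        following = text[i + 1:i + 2]
        # A stop followed by a non-space character is inside a token (e.g. PDF1.2).
        if following and not following.isspace():
            continue
        # A stop after a single capital letter is read as a genus initial (e.g. A. thaliana).
        words = text[start:i].split()
        last_word = words[-1] if words else ""
        if len(last_word) == 1 and last_word.isupper():
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sentences = split_sentences(
    "PDF1.2 is induced in A. thaliana. JA activates PDF1.2 protein."
)
```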
Initially, the whole abstract is screened to store the occurrence of all bioentity names. Once all names are stored, each protein name is checked for its occurrence in each of the separated sentences. If a bioentity term is found (call it "Pair 1"), the script checks the same sentence for a second bioentity name, "Pair 2", different from Pair 1. To avoid redundancy, the script checks on the fly whether Pair 2 is a synonym of the previously identified Pair 1 and discards such cases.
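The pairing step with synonym filtering can be sketched as follows. The function name, the toy mapping from terms to dictionary names, and the choice to report each pair once in sorted order are assumptions of this illustration.

```python
from itertools import combinations

def find_pairs(sentence_terms, term_to_name):
    """Return unordered pairs of distinct proteins mentioned in one sentence.

    sentence_terms: bioentity terms found in the sentence, in order of appearance.
    term_to_name: maps each term (name or synonym) to its dictionary name.
    """
    pairs = []
    for first, second in combinations(sentence_terms, 2):
        # Discard cases where "Pair 2" is a synonym of "Pair 1".
        if term_to_name[first] == term_to_name[second]:
            continue
        pair = tuple(sorted((term_to_name[first], term_to_name[second])))
        if pair not in pairs:
            pairs.append(pair)
    return pairs

# Toy mapping: PR-1 is treated as a synonym of PR1, NIM1 as a synonym of NPR1.
term_to_name = {"PR1": "PR1", "PR-1": "PR1", "NPR1": "NPR1", "NIM1": "NPR1"}
pairs = find_pairs(["PR1", "NIM1", "PR-1"], term_to_name)
```

Although three terms occur in the toy sentence, only one protein pair is reported because two of the terms name the same protein.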
It has previously been reported that 90% of the biointeractions among proteins documented in the literature adopt the pattern "Protein-Biointeraction-Protein" [16], and this pattern has been chosen by approaches like iHOP [15] and HomoMINT [17]. Nevertheless, we adjusted LAITOR to identify other patterns of Protein-Protein or Protein-Concept co-occurrence, as explained below.
The co-occurrences identified by LAITOR are classified into four types. From the most to the least stringent, these types are:
Type 1: Both co-occurring protein names/synonyms must not refer to the same protein (common to all types of co-occurrences), they must be present in the same sentence of the abstract and, additionally, a term from the Biointeractions Dictionary must occur between the two terms. An optional extra step is the identification of a biological stimulus (represented as a term from the Concepts Dictionary) anywhere in the sentence, which is then associated with the interacting pair;
Type 2: Same as Type 1, except that the biointeraction may occur anywhere in the sentence;
Type 3: Same as Type 1, except that the occurrence of a biointeraction term in the sentence is not required;
Type 4: All the pairs of co-occurring protein names/synonyms mentioned in the abstract are considered, whether they are in the same sentence or not.
Thus, when LAITOR performs under type 4, the other co-occurrence types are included.
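The four types can be expressed as a small decision function. This is a sketch under stated assumptions: sentences are pre-tokenized on whitespace, biointeraction terms are single tokens, Type 3 is read as a same-sentence co-occurrence with no biointeraction term required, and the function and argument names are hypothetical.

```python
def classify_cooccurrence(tokens, pos_a, pos_b, interaction_terms, same_sentence=True):
    """Return the most stringent co-occurrence type (1-4) met by a protein pair.

    tokens: tokens of the sentence holding the pair (ignored if not same_sentence).
    pos_a, pos_b: token positions of the two protein names.
    interaction_terms: lower-case terms from the biointeraction dictionary.
    """
    if not same_sentence:
        return 4  # proteins co-occur only at the abstract level
    left, right = sorted((pos_a, pos_b))
    lowered = [token.lower() for token in tokens]
    if any(token in interaction_terms for token in lowered[left + 1:right]):
        return 1  # biointeraction term between the two names
    if any(token in interaction_terms for token in lowered):
        return 2  # biointeraction term elsewhere in the sentence
    return 3  # same sentence, no biointeraction term

interactions = {"regulate", "regulates", "activates"}
tokens = "A and B regulate C".split()
type_a_c = classify_cooccurrence(tokens, 0, 4, interactions)
type_a_b = classify_cooccurrence(tokens, 0, 2, interactions)
```

Returning the most stringent type that applies reflects the nesting of the categories: any pair that satisfies a stricter type also satisfies type 4.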
Multiple co-occurrences of types 1, 2 and 3 might happen in a given sentence. To cope with this, our system was adapted to perform an overlapped search. This means that in cases where two proteins (A and B) occur along with the same biointeraction, as in the sentence "A and B regulate C", the pairs "A-regulate-C" and "B-regulate-C" are identified as type 1 co-occurrences. Note that the co-occurring pair "A-B" will be assigned type 2. Moreover, in more complex sentences such as "A is regulated by B and activates C", the system will retrieve as co-occurrences of type 1 "A-regulated-B", "A-regulated, activates-C", and "B-activates-C" (together with type 2 "A-regulated, activates-B" and type 2 "A-regulated, activates-C"), thus overpredicting the number of different biointeractions between the A, B and C proteins. However, such complex sentences may not be very frequent. To determine whether they are a serious problem, we performed a series of manual evaluations of the results of LAITOR's analysis on several abstract datasets.
Protein term co-occurrences at the sentence level of scientific abstracts are potentially useful for the prediction of literature-based protein-protein interactions. Therefore, we have tested the performance of LAITOR in finding protein-protein interaction data in abstracts. For this purpose, we have used the BioCreative II test dataset for the Interaction Article Subtask (IAS) as the gold standard [14]. This "performance evaluation dataset" is composed of relevant (3,529) and irrelevant (1,957) abstracts for the curation of protein-protein interactions present in the MINT and IntAct databases [18]. Once LAITOR identifies a co-occurring protein pair in an abstract, that abstract is classified as positive (relevant). After the classification of all gold standard abstracts, the precision and recall are calculated for each of the four co-occurrence types (1-4), and the performance is compared to that of methods participating in the BioCreative II challenge. A receiver operating characteristic (ROC) curve was created by using the package ROCR [19]. Positive and negative performance evaluation datasets are provided as additional material [Additional file 7].
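Precision and recall for one co-occurrence type can be computed as in the sketch below, where an abstract counts as predicted positive if at least one co-occurring pair was found in it. The function name and the toy identifiers are hypothetical; the values are not results from the evaluation.

```python
def precision_recall(predicted_positive, relevant):
    """Compute precision and recall from two sets of abstract identifiers.

    predicted_positive: abstracts in which a co-occurring pair was found.
    relevant: abstracts labeled relevant in the gold standard.
    """
    true_positives = len(predicted_positive & relevant)
    precision = true_positives / len(predicted_positive) if predicted_positive else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Toy identifiers for illustration only.
relevant = {"pmid1", "pmid2", "pmid3", "pmid4"}
predicted = {"pmid1", "pmid2", "pmid9"}
precision, recall = precision_recall(predicted, relevant)
```

Repeating this calculation with the predictions obtained under each of the four types gives one precision/recall pair per type.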
A protein and stimuli co-occurrence analysis created by LAITOR from PubMed abstracts is parsed from a general output file into a tab-delimited text file (extension .co) that can be used as input by most network visualization software. By default, LAITOR generates inputs for two of these programs: EMBL Medusa [20] and EMBL Arena3D [21], which provide networks in one- and multi-dimensional charts, respectively, enabling the complex output generated by LAITOR to be handled efficiently.