MimoSA prototype design
The primary function of MimoSA is to support the process of annotating functional minimotifs and their metadata from the primary literature. Secondary functions include minimizing user errors and data redundancy, improving annotation efficiency through techniques such as automated motif/activity/target suggestions, and aiding in the identification of papers containing minimotif content through a machine learning-based ranking system. MimoSA features distinct components and algorithms, which streamline these processes.
The general annotation workflow is as follows (see Fig. ): Using the MimoSA client software, the annotator accesses the server housing the MnM database. The user selects a paper for annotation using the Paper List Viewer. Selection of a paper automatically triggers the opening of the Abstract Viewer and the Minimotif Annotation Form and directs an external web browser to online versions of the abstract and full text paper, if available. Based on the information in the viewers, the Minimotif Annotation Form is used to modify an existing or enter a new minimotif annotation, which is then committed to the database. The annotation status of the paper is updated using the Paper Tracker Form.
The components of MimoSA can be broken up into three functional categories: MnM database management tools, minimotif annotation tools, and paper management tools. Descriptions of each component follow.
The database management tools consist of a minimotif browser and a minimotif editor. The minimotif browser shown in Fig. displays all minimotif annotations in the MnM database and associated attributes in a scrollable window that also displays the total number of minimotifs. A Paper Browser is accessed from a tab and gives a list of papers that need annotation. From the paper or minimotif browsers, a Minimotif Annotation Form can be launched by double-clicking a row to enter a new minimotif annotation or modify an existing one (Fig. ). This opens a tabbed frame where all the minimotif attributes are displayed and can be added or changed. Minimotif annotations can be selected for export as comma-separated value (CSV) files for external manipulation, and a corresponding import function reads annotations from a CSV file. The minimotif annotations in the browser can be sorted by a number of different attributes selected from a drop-down menu.
Figure 2 Screenshots of MimoSA application database management windows. A. Motif Browser shows attributes of all minimotifs in the MnM database. B. Minimotif Data editing or entry form for entering information in the MnM database. C. Modification form for entering ...
The minimotif annotation tools consist of the Minimotif Annotation Form, the Abstract Viewer, and the Protein Sequence Validator. Multiple forms can be displayed at once. The Minimotif Annotation Form has a "clone" function, which opens a new instance of the form pre-filled with all of the minimotif-syntactical attributes except the minimotif's sequence and position. This is intended to make annotation more efficient for high-throughput minimotif discovery papers (e.g. phage display), where several attributes of a minimotif are varied in a controlled fashion, thus generating a broad landscape of similar minimotifs with subtle variations [17].
To assist the annotator in filling out the form, multiple types of support are provided. Double-clicking on any entry field in the form displays a context menu that gives suggested choices based on relevant content in the MnM database. In the Modification tab, selecting a modification from the context menu populates a different field in the form with a PSI-MOD accession number. The Abstract Viewer (Fig. ) automatically displays the PubMed abstract of the paper that has been selected and highlights keywords and terms in different colors based on attribute entries in the database. The coloring scheme is minimotif (purple), activity (blue), target (orange), putative minimotif (red), affinity (yellow), and protein domain (green); if the word "motif" is present, it is bolded. Selecting a paper with a right click also opens the abstract on the PubMed web site and a full-text version of the paper, if available, in a web browser. This enables efficient access to full-text papers and to other NCBI data using the "Links" hyperlink. Linked data of interest to the annotator include structure and RefSeq accession numbers.
Figure 3 Screenshot of MimoSA abstract and protein sequence viewers. A. Abstract Viewer shows the abstract of the paper selected. Words that match existing minimotif attributes are color-coded. B. Protein sequence viewer window shows the sequence of the protein ...
Another component that assists annotators is the Protein Sequence Validation function (Fig. ). Once an accession number has been entered, the protein sequence is automatically retrieved from a local version of public databases such as NCBI and displayed in the Protein Sequence Window. Once loaded, the position of the minimotif in the protein sequence is bolded. This ensures that the minimotif is indeed present in the selected protein.
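The check that this validation step performs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MimoSA's own code: the function names, the 1-based position convention, and the treatment of a lowercase "x" in a consensus as "any residue" are all hypothetical choices made for the example, and the sequence is made up.

```python
import re


def consensus_to_regex(consensus: str) -> str:
    """Translate a consensus such as '[RK]xx[RK]' into a regular expression.

    Assumption: a lowercase 'x' stands for any residue; residues are uppercase.
    """
    return consensus.replace("x", ".")


def motif_at_position(sequence: str, consensus: str, position: int) -> bool:
    """Return True if the consensus matches the sequence starting at `position`.

    Assumption: `position` is 1-based, counted from the first residue.
    """
    if position < 1 or position > len(sequence):
        return False
    pattern = re.compile(consensus_to_regex(consensus))
    # Pattern.match with a start index only succeeds if the match begins there.
    return pattern.match(sequence, position - 1) is not None


# Hypothetical sequence: 'RTAK' starts at residue 3.
print(motif_at_position("MARTAKGS", "[RK]xx[RK]", 3))  # True
print(motif_at_position("MARTAKGS", "[RK]xx[RK]", 2))  # False
```

A check of this kind catches an annotation whose stated position does not line up with the retrieved sequence, which is the error the bolded display is meant to expose.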
The paper management tools consist of the Paper Browser, Paper Status Window, and Paper Ranking components, which are addressed later. The Paper Browser shown in Fig. can be used to manage millions of papers. The Paper Browser displays metadata about the PubMed abstracts of all papers entered into a table of the MnM database. The metadata includes PubMed ID, authors, affiliation, journal, publication year, comments, tracking status, paper score, title, URL, abstract, and database source. A paper score (discussed later) is used as the default sort parameter, although the entire table can also be sorted by PubMed identifier, paper status, publication year, or journal using a pulldown menu. Since the table containing papers has more than 120,000 tuples, only the first 1,000 results of any sort are shown. When a PubMed identifier is entered and the "Add Paper" button is selected, the associated paper is retrieved from NCBI and inserted into the database. Any abstract can be retrieved for review by selecting "Launch by PubMed ID".
Screenshot of MimoSA paper browser and paper tracking windows. A. Paper Browser Window allows display of attributes for all papers in the MnM database. B. Paper Review Status Window shows review event history of papers in the database.
The Paper Status Window, a subcomponent of the Abstract Window, is used to track the annotation status of papers (Fig. ). Each time a paper is reviewed and the user updates the status of the paper, a "review event" is created and appended to the paper's history, which is stored in the database; the review event identifies the annotator and current status of the paper. Papers can be assigned one of a number of statuses shown in Table that correlate with different tracking functions.
Paper tracking status definitions
Modification of the minimotif miner data model and syntax
In order to better exploit MimoSA's functionality and facilitate unambiguous and accurate annotation, we recognized that some changes to the model we previously presented were required [7]. Our minimotif syntax defined the motif source as the protein that contains the minimotif. However, a consensus minimotif definition such as [RK]xx[RK] can have multiple occurrences in a minimotif source, so we needed to specify a position for the first minimotif residue relative to the start of the protein sequence in the sequence file specified by a protein sequence accession number. Another change we considered is that experiments that contribute to minimotif definitions may use either peptides or full-length proteins. We think it is important to specify this as an attribute since the two sources represent very different chemical entities. Finally, we have started using PSI-MOD and GO controlled vocabularies for indicating activities and post-translational modifications of minimotifs.
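To illustrate why a position attribute is needed, the following Python sketch lists every place where a consensus occurs in a sequence, including overlapping occurrences. It is a hedged example rather than part of the MnM data model: the function name, the 1-based positions, the reading of lowercase "x" as a wildcard, and the sequence itself are assumptions made for illustration.

```python
import re


def find_occurrences(sequence: str, consensus: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for every occurrence.

    Assumption: lowercase 'x' in the consensus means any residue.
    A lookahead is used so that overlapping occurrences are all reported.
    """
    regex = consensus.replace("x", ".")
    lookahead = re.compile(f"(?=({regex}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical sequence in which [RK]xx[RK] occurs twice, overlapping.
for position, residues in find_occurrences("MKRAKRAS", "[RK]xx[RK]"):
    print(position, residues)
# 2 KRAK
# 3 RAKR
```

Without the start position, an annotation for this hypothetical protein could not say which of the two occurrences was actually studied.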
Identification of papers with minimotif content
The MnM database contains many papers that were previously annotated for minimotif content, but many more papers have yet to be annotated. PubMed contains well over 19 million abstracts of scientific papers. Only those papers that have minimotif content are useful for annotation. Our first approach to pare down the paper list used keyword searches to identify papers that were likely to contain minimotif content; however, this approach was not efficient. Therefore, we developed new strategies and an efficiency metric for the evaluation and comparison of these strategies (see Additional File 1).
We initially evaluated six general strategies: Keywords/Medical Subject Headings (MeSH), date restriction, forward and reverse citations, authors with affiliations, and minimotif regular expressions. A detailed description of the strategies and results is presented in Additional File 1. These strategies were evaluated using a Minimotif Identification Efficiency (MIE) score, which is defined as the percentage of papers that contain minimotifs. Collectively, these strategies provided a list of approximately 120,000 abstracts, of which ~30% were expected to contain minimotifs based on extrapolation.
Design and training of the TextMine algorithm that scores papers for minimotif content
We wanted to score and rank these papers as a means to better identify the ~30% that contain minimotifs and develop a strategy for scoring all PubMed papers that can be used for future maintenance of the MnM database. To rank papers for minimotif content, we designed the Paper Scoring (PS) algorithm and trained the algorithm using structured data for defined paper sets in the MnM database.
The basic problem of interest can be stated as follows: given a research article (or an abstract), automatically rank the article by its likelihood of containing a minimotif. We used a subset of papers as a training set for the PS algorithm. Each article in a research article collection A, which is used for training, is read by hand and given a score of either 0, indicating that the paper does not contain minimotifs, or 1, indicating that the paper has at least one minimotif. A similar algorithm has been employed to characterize unknown microorganisms [19]. A crucial difference between the PS algorithm and that of Goh et al. is that the PS algorithm provides an ordering of the papers instead of using a filter threshold.
The workflow for this phase consists of the following steps. We start with disjoint sets P, N, and T of abstracts, which are positive, negative, or not yet reviewed for minimotif content, respectively. Let W be the ordered term vector found by taking all significant words from the documents of sets P and N; words such as "the", "of", and "new", which have no discriminatory value between P and N, are excluded. For each word w in W and each article a in P, we divide the number of instances of w in a by the size of a: this is the enrichment of w in a. We then sum these enrichments over all articles in P and divide by the size of P to obtain an overall enrichment of w in P, denoted wp. We repeat this calculation over set N and subtract the result from wp to arrive at a "score" for word w, which ranges from -1 to 1. Higher values indicate a more positive association with minimotif content. We now have a vector of decimal "scores", which has the same dimension as W, with one entry per term in the term vector. Call this vector S.
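The training phase just described can be sketched in Python as follows. This is a minimal illustration, not the TextMine implementation (which is a Java application): the tokenizer, the three-word stop list, the choice of "size of an article" as its total token count, and all function names are assumptions made for the example.

```python
import re

# Assumption: only the three words quoted in the text are excluded here;
# how "significant" words are actually selected is not specified.
STOP_WORDS = {"the", "of", "new"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def mean_enrichment(word: str, docs: list[list[str]]) -> float:
    """Average, over all documents, of (count of word) / (number of tokens)."""
    if not docs:
        return 0.0
    total = sum(doc.count(word) / len(doc) for doc in docs if doc)
    return total / len(docs)


def train_word_scores(positive: list[str], negative: list[str]):
    """Return the ordered term vector W and the score vector S."""
    pos = [tokenize(t) for t in positive]
    neg = [tokenize(t) for t in negative]
    vocabulary = sorted({w for doc in pos + neg for w in doc} - STOP_WORDS)
    # Score of a word = overall enrichment in P minus overall enrichment in N.
    scores = [mean_enrichment(w, pos) - mean_enrichment(w, neg) for w in vocabulary]
    return vocabulary, scores


W, S = train_word_scores(
    positive=["motif binds the domain", "the motif"],
    negative=["the cell of cell"],
)
for word, score in zip(W, S):
    print(word, round(score, 3))
```

Because each enrichment is a fraction between 0 and 1, each difference necessarily falls between -1 and 1, which matches the range stated above.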
Now, we compute a score for each unknown paper by combining word scores. This phase consists of the following steps.
1) Scan through the paper (or abstract) to count how many times each word w of W occurs in this article.
2) Construct a vector v of all values from (1) in which the order corresponds with S.
3) Compute the correlation between v and S and obtain a Pearson's correlation coefficient pc for each paper. If X and Y are any two random variables, then the Pearson's correlation coefficient between X and Y is computed as

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $\mu_X$ is the expected value of X, $\mu_Y$ is the expected value of Y, $\sigma_X$ is the standard deviation of X, and $\sigma_Y$ is the standard deviation of Y.
4) Thus, we have now computed a "score" of the article, which is the Pearson's correlation coefficient between the scored words from the training set W and respective enrichments of those words in the article n.
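The scoring steps above can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the TextMine code: the tokenizer, the function names, the example term and score vectors, and the decision to return 0.0 when the correlation is undefined (for instance, when an article contains none of the scored words) are all hypothetical.

```python
import math
import re


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # Assumption: treat an undefined correlation as neutral.
    return cov / math.sqrt(var_x * var_y)


def score_article(text: str, vocabulary: list[str], word_scores: list[float]) -> float:
    """Steps 1-3: count each word of W in the article, then correlate with S."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    counts = [float(tokens.count(w)) for w in vocabulary]
    return pearson(counts, word_scores)


# Hypothetical term vector W and score vector S.
W = ["binds", "cell", "domain", "motif"]
S = [0.125, -0.5, 0.125, 0.375]

articles = {
    "paper A": "the motif binds a motif",
    "paper B": "cell cell cell",
}
# Rank papers by score, highest first, rather than applying a cutoff.
ranked = sorted(articles, key=lambda k: score_article(articles[k], W, S), reverse=True)
print(ranked)
```

Pearson's coefficient is unchanged when a vector is multiplied by a positive constant, so using raw counts or counts divided by article length gives the same score for a given article.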
The pseudocode for the Paper Scoring (PS) algorithm is provided in Additional File 1. The correlation coefficients for the lexemes range from -1.000 to 1.000. This score positively correlates with the presence of minimotif content, as expected.
Paper ranking and evaluation of the paper scoring algorithm
The algorithm above is packaged as an independent application, TextMine, which can be used in conjunction with MimoSA or as a standalone open-source Java application that can be integrated with any annotation or analysis pipeline. For the test set, we selected 91 new articles, which we determined either to have or not to have minimotif content and which were disjoint from the training sets. All testing of the TextMine application was based on correlations of TextMine scores with this set.
The TextMine website and package provide a test data set that reproduces our analysis for a set of test papers. The current version of MimoSA, used for MnM annotation, sorts papers using TextMine scores calculated for 120,000 abstracts.
Paper scoring algorithm and training set size
Since the purpose of the algorithm is not simply to rank papers but rather to rank papers with increasing sensitivity over time, we evaluated the increase in the algorithm's efficacy with respect to larger training sets. We found that there was a degree of variation depending on training set sizes, but that overall, both positive and negative training elements improved the performance (Table ).
Larger training set sizes (negative, positive) modestly improve algorithm performance
To test TextMine's performance relative to the size of the training set, the application package includes an iteration module that allows the sizes of the positive and negative training sets to be specified (this iteration module generated the data in Table ). We recorded the performance for incrementally increased training set sizes and noted that, as the number of either positive or negative training documents increased, a modest performance improvement was observed. The performance of the algorithm is measured by the correlation coefficient between the calculated scores, which lie between -1 and 1, and the actual scores, which are either 0 or 1.
The table indicates that large increases in the number of positive training articles were comparable to small increases in the number of negative training articles, ultimately showing that both had modest increases in value with set size. A positive correlation coefficient between positive or negative training size and the algorithm performance was observed (0.52 and 0.46, respectively). The correlation score between TextMine scores and the training set scores showed modest increases with size (ranging from 0.59 to 0.66 when using 40 negative and 400 positive abstracts).
The Receiver Operating Characteristic (ROC) curve is a standard tool for visualizing the sensitivity and specificity of an algorithm that differentiates two populations. We have also included a ROC curve for the highest scoring training set, which had 400 positive and 40 negative articles. We found that this proportion was not required, and that significant correlations could also be obtained with smaller data sizes, as previously described. This curve is shown in Fig. . Notably, the area under the curve was above 0.89, indicating a high correlation between the score magnitude and the presence (1) or absence (0) of a minimotif. These data can be generated using the TextMine package; the steps for reproducing them are described in the TextMine application package.
ROC curve analysis of TextMine results. ROC curve as a measurement of the sensitivity and specificity of TextMine for a disjoint test set of 91 pre-scored papers. Area under curve = 0.89.
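For readers who want to see how an ROC curve and its area are obtained from scored papers, the following Python sketch computes both from a list of scores and 0/1 labels. It is a generic illustration, not the procedure shipped with TextMine: the function names and the example scores and labels are invented, and the code assumes that at least one positive and one negative paper are present.

```python
def roc_points(scores: list[float], labels: list[int]) -> list[tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    The threshold is swept from the highest score downwards; papers with
    equal scores are added together so that ties yield a single point.
    Assumption: labels contain at least one 1 and at least one 0.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def area_under_curve(points: list[tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Invented example: five papers with scores and hand-assigned labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
print(round(area_under_curve(roc_points(scores, labels)), 3))  # 0.833
```

An area of 1.0 means every positive paper outscores every negative one, while 0.5 corresponds to an ordering no better than chance.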
Because the general utility of this algorithm far exceeds the field of minimotif annotation, we have released TextMine as a stand-alone application that is cross-platform and database-independent.