There are several steps in the MAA algorithm. The first step is free-text matching of the MeSH vocabulary to the MEDLINE abstracts. To measure the performance of this step and subsequent steps, we compare the results to the manual annotations of these abstracts made by the MeSH indexers. In this comparison, we take into account that in some cases the indexer used a more general term (a "relative node") than the precise name of the chemical, such as "Benzodiazepines" instead of "Diazepam." Note that the algorithm cannot find all terms annotated by the indexers, because the indexers have access to the complete paper and the algorithm does not.
Subsequent steps in the algorithm attempt to reduce the number of false positives, which are the matches found by the algorithm but not by the indexers. The first filter, the "MeSH Filter", eliminates MeSH records that do not have an associated chemical structure. The second, "Tokens and Rules", discards partial matches to terms that follow chemical nomenclature rules or have additional chemical name tokens. The third, "Protein and Gene Names", screens out protein and gene names, as these names can contain the names of chemicals. Finally, the "TP filter" eliminates matches using MeSH terms that are also common English terms, such as "lead."
The comparison between the various steps in MAA and manual indexing is depicted in Figure and will be discussed in detail below. From left to right, each group of bars indicates the matching results for different steps in the algorithm. Within each group there are 4 bars, starting from the left:
1. False positive (blue): MeSH terms found in an abstract by MAA but not found by manual indexing.
2. True positive (red): MeSH terms found in an abstract by MAA and also found by manual indexing.
3. In Text, Not found (green): MeSH terms present in the abstract and found by manual indexing but not found by MAA.
4. Not in text (purple): MeSH terms not present in the abstract but found by manual indexing. As mentioned earlier, some terms are found in the body of a paper and not in the abstract. Since the MAA algorithm does not have access to the body of the paper, it is unable to find these terms. The value shown in the figure is likely an upper bound as it is possible that the algorithm may not find a term due to various potential issues (e.g. punctuation, spelling, unknown synonyms, etc.).
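The four bar categories above can be read as set operations on the terms of a single abstract. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `compare_annotations` is hypothetical, annotations are assumed to be plain sets of term strings, and "present in the abstract" is approximated by a naive lowercase substring test.

```python
def compare_annotations(maa_terms, indexer_terms, abstract_text):
    """Split the terms of one abstract into the four bar categories.

    Assumption: a term counts as "in text" if its lowercase form occurs
    anywhere in the lowercase abstract (a naive substring test).
    """
    text = abstract_text.lower()
    maa = {t.lower() for t in maa_terms}
    manual = {t.lower() for t in indexer_terms}

    missed = manual - maa  # indexed by hand, not found by MAA
    in_text_missed = {t for t in missed if t in text}
    return {
        "false_positive": maa - manual,
        "true_positive": maa & manual,
        "in_text_not_found": in_text_missed,
        "not_in_text": missed - in_text_missed,
    }


# Made-up example abstract and annotations, for illustration only.
example = compare_annotations(
    maa_terms={"diazepam", "lead"},
    indexer_terms={"diazepam", "glucose", "ethanol"},
    abstract_text="Diazepam and glucose may lead to altered responses.",
)
```

In this toy case "lead" is a false positive, "diazepam" a true positive, "glucose" is in the text but was not found, and "ethanol" is not in the text at all.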
1. Free-Text MeSH matching
As shown in Figure , the free-text MeSH string matching, if used to annotate small molecules directly, generates a significant number of false positives compared to the MeSH indexers' annotations. More than 70% of the MeSH terms appearing in the text were not annotated by indexers (see Table for values and Figure for graphs of the precision, recall and F measure). By inspection, it appears that in most cases the MeSH term is either a text fragment of another scientific concept (e.g., many protein names include parts of a chemical name) or is simply present but not relevant to the paper context (e.g., used in a descriptive or comparative sense, as a reagent in an experiment, etc.). Thus, in order to annotate a chemical term accurately, algorithmic filters need to be applied to the free-text matches. In some cases, the indexer failed to annotate a valid chemical name. However, this category was not examined, as it would have required another standard of truth, which was unavailable for the large set of abstracts considered in this study.
| Table 1. The values of precision (P), recall (R) and F value (F) of MeSH Automated Annotation (MAA) on title and full abstract; title, first and last sentences of abstract; and title-only annotations, respectively, with a series of algorithmic filters added cumulatively. |
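For reference, precision, recall and F measure can be computed from annotation counts as sketched below. This is a generic illustration rather than the authors' code: the function name is hypothetical, and it is assumed here that false negatives are all indexer annotations that MAA did not produce (both the "In Text, Not found" and "Not in text" categories). The counts in the example are made up.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F measure) from raw counts.

    Assumption: fn counts every indexer annotation missed by MAA.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts, for illustration only.
p, r, f = precision_recall_f(tp=30, fp=70, fn=30)
```

With these illustrative counts, precision is 0.3, recall is 0.5 and the F measure is 0.375.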
2. Using relative nodes in the comparison
In comparing results between the MAA and the manual indexers, tree-node expansion is applied to the results from the MAA system. The term "tree-node" comes from the hierarchical tree structure of the MeSH thesaurus. Examination of the MeSH annotations in MEDLINE finds that the indexers will sometimes select a higher level node in the MeSH tree than the nodes that correspond exactly to the chemicals mentioned in an abstract. For example, they may select a more generic term "Penicillins" instead of the "Penicillin G" and "Penicillin V" mentioned in a paper. This use of higher level ("super-concept") nodes also happens for MeSH substances as they are manually mapped to nodes in the MeSH tree. Therefore, for a given MeSH substance or MeSH term, we include its assigned MeSH tree node and/or super-concept node, respectively. This tree-node expansion significantly increases the number of matches between manual indexing and our MAA algorithm (Figure ), while also increasing the recall and precision due to the increase in the number of true positives (Table and Figure ).
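Tree-node expansion can be sketched using the fact that MeSH tree numbers are dot-separated, so a parent node is obtained by dropping the last segment. The code below is a hedged illustration: the function names, the `max_levels` parameter and the tree numbers themselves are invented for the example (they are not real MeSH tree numbers), and the text does not state how many levels above a matched node the authors include.

```python
def ancestors(tree_number, max_levels=1):
    """Return up to max_levels parent tree numbers, nearest first."""
    parts = tree_number.split(".")
    result = []
    for _ in range(max_levels):
        if len(parts) <= 1:
            break
        parts = parts[:-1]
        result.append(".".join(parts))
    return result


def expand_terms(matched_terms, term_to_tree, tree_to_term, max_levels=1):
    """Add the higher-level ("relative") nodes of each matched term."""
    expanded = set(matched_terms)
    for term in matched_terms:
        for tree_number in term_to_tree.get(term, []):
            for parent in ancestors(tree_number, max_levels):
                if parent in tree_to_term:
                    expanded.add(tree_to_term[parent])
    return expanded


# Toy hierarchy with invented tree numbers (NOT real MeSH tree numbers).
TERM_TO_TREE = {
    "Penicillins": ["X01.100"],
    "Penicillin G": ["X01.100.200"],
    "Penicillin V": ["X01.100.300"],
}
TREE_TO_TERM = {n: t for t, nums in TERM_TO_TREE.items() for n in nums}

expanded = expand_terms({"Penicillin G", "Penicillin V"}, TERM_TO_TREE, TREE_TO_TERM)
```

Here the expanded set also contains "Penicillins", so an indexer annotation of the generic term would count as a match.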
3. Improving free-text MAA by adding filters
3.1. MeSH filter: using terms with associated chemical structure
From the entire set of chemical MeSH terms, MeSH terms representing small molecules are extracted to generate a chemical dictionary. This constraint is implemented by selecting MeSH terms that can be mapped to PubChem (http://pubchem.ncbi.nlm.nih.gov/) CIDs, which identify unique small molecule structures. This filter removes about 65% of the total MeSH terms (see Table ; the number of MeSH terms changes from 521 K to 178 K). Filtered terms include protein names, category names, non-specific names and chemical names that cannot be represented by molecular structures. After filtering out terms, we expand the MeSH dictionary by adding 541 chemical formulas of inorganic compounds as synonyms of associated MeSH concepts. All of these chemical formulas are from Wikipedia [25] and were manually verified. This addition increases the ability of MAA to recognize chemicals such as 'KOH' and 'NaCl' if no compound common name is mentioned in the text. After updating our dictionary, we performed another free-text match and compared the results with manual indexing. The results are shown in the group of bars labeled "MeSH Filter" in Figure . Compared to "Relative Nodes", the number of FP cases after adding the MeSH Filter drops from 79 K to 30 K, while the number of TP cases decreases by less than 1 K. The performance of the MeSH filter, as indicated by precision, recall and F measure, is displayed in Table . The precision jumps from 0.32 to 0.54 at a cost of 0.03 in recall. As a result, the F measure increases from 0.45 to 0.62. The MeSH filter ensures that all MeSH terms extracted by MAA are entries in PubChem. It thus implicitly links chemical names in the literature to various features of the PubChem database, such as chemical structures, properties and bioactivities.
| Table 2. The number of MeSH terms in the dictionary after each filter is applied. |
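The MeSH filter and the formula expansion described in section 3.1 can be sketched as a dictionary-building step. This is an illustration under assumptions, not the authors' code: the function name, the data layout and the toy entries are hypothetical, and the integer IDs are placeholders rather than real PubChem CIDs.

```python
def build_chemical_dictionary(mesh_terms, term_to_cids, formula_synonyms):
    """Keep only terms that map to at least one CID, then add formulas.

    Returns a mapping from each retained MeSH term to its set of
    searchable names (the term itself plus any formula synonyms).
    """
    dictionary = {}
    for term in mesh_terms:
        if term_to_cids.get(term):  # no structure mapping: term is dropped
            dictionary[term] = {term}
    for formula, concept in formula_synonyms.items():
        if concept in dictionary:
            dictionary[concept].add(formula)
    return dictionary


# Toy inputs; the IDs are placeholders, NOT real PubChem CIDs.
mesh_terms = ["Sodium Chloride", "Potassium Hydroxide", "Trypsin"]
term_to_cids = {"Sodium Chloride": [1], "Potassium Hydroxide": [2]}
formula_synonyms = {"NaCl": "Sodium Chloride", "KOH": "Potassium Hydroxide"}

dictionary = build_chemical_dictionary(mesh_terms, term_to_cids, formula_synonyms)
```

In this toy run the protein name "Trypsin" is dropped because it has no structure mapping, while "NaCl" and "KOH" become searchable synonyms of their concepts.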
3.2. Tokens and Rules: removing false positive annotations by syntactic analysis
One reason for false-positive annotation is that a matched MeSH term is a sub-string of another entity name. The chemical tokens and chemical name decision rules (introduced in Methods part 4) were used to decide whether matched MeSH terms are full names or sub-strings.
Some of the applied rules are listed below:
[1]. If two words in front of a matched MeSH term are both chemical tokens, the MeSH term is treated as a FP annotation. This rule by itself yields a 0.25% increase in precision, a 0.34% decrease in recall and a 0.04% increase in F measure;
[2]. If one word in front of a matched MeSH term is a chemical token, the MeSH term is treated as a FP annotation. This rule by itself yields a 1.3% increase in precision, a 1.6% decrease in recall and a 0.24% increase in F measure;
[3]. If one word behind a matched MeSH term is a chemical token, the MeSH term is treated as a FP annotation. This rule by itself yields a 2.2% increase in precision, a 1.9% decrease in recall and a 0.68% increase in F measure. Note that the F measure is the harmonic mean of precision and recall, which is why the change in F measure is not exactly the difference between the changes in precision and recall.
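Rules [2] and [3] above can be sketched as a check on the words immediately surrounding a match. This is a simplified illustration, not the authors' implementation: it assumes whitespace tokenization and single-word MeSH terms, the function name is hypothetical, and the small `CHEMICAL_TOKENS` set is made up (the actual token list is described in additional file 1).

```python
# Hypothetical token set for illustration; the real list is much larger.
CHEMICAL_TOKENS = {"flavin", "dinucleotide", "benzoyl", "hydroxy"}


def is_partial_match(words, index, chemical_tokens=CHEMICAL_TOKENS):
    """Return True if the word at `index` looks like part of a longer name.

    Implements only rules [2] and [3]: the match is rejected when the
    word directly before or directly after it is a chemical token.
    """
    before = words[index - 1].lower() if index > 0 else None
    after = words[index + 1].lower() if index + 1 < len(words) else None
    return before in chemical_tokens or after in chemical_tokens


fragment = "as shown by the flavin adenine dinucleotide".split()
rejected = is_partial_match(fragment, fragment.index("adenine"))

standalone = "adenine was added to the buffer".split()
kept = not is_partial_match(standalone, standalone.index("adenine"))
```

In the first fragment "adenine" is flanked by chemical tokens and is rejected; in the second it stands alone and is kept.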
Using more than two tokens before and after the MeSH term did not yield any improvements. For example, if the algorithm checks 3 tokens before and after the matching MeSH term, the recall decreases by 0.72% and the precision decreases by 0.87%.
In addition to these name decision rules, we also created several prefix and suffix rules to check whether a matched term is an FP annotation. For example, if the token 'poly' is the prefix of a MeSH term, this MeSH term is treated as a FP annotation, yielding a 0.12% increase in precision, a 0.03% decrease in recall and a 0.07% increase in F measure; if 'ase' is the suffix of a MeSH term (except 'release' and 'base'), the MeSH term is treated as a FP annotation, yielding a 1.2% increase in precision, a 0.15% decrease in recall and a 0.75% increase in F measure. These rules were primarily heuristic in nature and were developed by manual examination of the annotations.
It is possible to apply hundreds of rules to increase the precision of MAA. However, the recall decreases as each rule is applied, so deciding which rules to use is nontrivial. In the MAA system, we select rules according to the computed F measure: if applying a rule produced a relatively significant increase in the F measure, the rule was kept.
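The rule-selection criterion can be sketched as a simple loop that keeps a rule only when it raises the F measure by some minimum amount. This is a hypothetical sketch: the text does not state the threshold used or whether rules were evaluated one at a time or cumulatively, so the greedy cumulative strategy, the `min_gain` parameter and the `evaluate_f` callback are all assumptions, and the gains in the example are made up.

```python
def select_rules(candidate_rules, evaluate_f, min_gain=0.001):
    """Greedily keep rules whose addition raises F by at least min_gain.

    evaluate_f is a caller-supplied function that returns the F measure
    obtained on an evaluation corpus with the given list of rules applied.
    """
    kept = []
    best_f = evaluate_f(kept)
    for rule in candidate_rules:
        f_with_rule = evaluate_f(kept + [rule])
        if f_with_rule - best_f >= min_gain:
            kept.append(rule)
            best_f = f_with_rule
    return kept


# Made-up F-measure gains per rule, standing in for a real evaluation.
toy_gains = {"one_word_after": 0.0068, "two_words_before": 0.0004, "three_words": -0.002}


def toy_evaluate_f(rules):
    return 0.60 + sum(toy_gains[r] for r in rules)


selected = select_rules(list(toy_gains), toy_evaluate_f)
```

With these toy gains only the first rule clears the threshold; the others are discarded because their contribution to the F measure is too small or negative.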
The following is an actual annotation of a PubMed abstract (PMID 16704345) to show how this filter works:
...Related enzymes are the ATP-dependent benzoyl-CoA reductase and the ATP-independent 4-hydroxybenzoyl-CoA reductase. Ketyl radical anions may also be generated by one-electron oxidation as shown by the flavin adenine dinucleotide (FAD)- and [4Fe-4S]-containing 4-hydroxybutyryl-CoA dehydratase....
The bold words are mapped MeSH terms, and the words underlined are chemical tokens found before or after MeSH terms. For example, according to the rules, "flavin-adenine-dinucleotide" is the complete name and MeSH term "adenine" is just part of this name. Thus, the MeSH term "adenine" is regarded as a false-positive by our MAA program.
In Figure , the group of bars labeled "Tokens and Rules" indicates the change after adding this filter. Compared to the previous group of bars (MeSH Filter), the blue bar (false-positive annotations) dropped by more than 5000, while the red bar (true-positive annotations) lost only 1300 annotations. Please see additional file 1 for a detailed description of chemical token generation and chemical name decision rules.
3.3. Protein and gene names: removing MeSH terms that are sub-strings of protein, gene and non-chemical MeSH terms
Chemical terms are a common part of protein names, such as "benzoyl-CoA reductase" and "4-hydroxybutyryl-CoA dehydratase" shown above. When these protein names are mentioned in text, it is likely that the topic of the paper is the protein rather than the chemical named in its prefix. To address this issue, we created a group of "negative vocabularies" to collect names that contain MeSH terms as sub-strings. In the MAA algorithm, if a term in the negative vocabularies is found in the text, any sub-string of it that is a MeSH term is not annotated. The protein and gene names are collected from MeSH and the NCBI Entrez Gene database. The performance of this method depends on the completeness of the negative vocabularies. It is not possible to construct a complete dictionary, as new names are generated every day. In Table , we can see that this filter results in only a small increase of the F measure at best. This is because the "Tokens and Rules" filter and the "Protein and Gene Names" filter are not mutually exclusive: some rules in section 3.2 already remove many protein names. If the "Tokens and Rules" and "Protein and Gene Names" filters are applied independently to the same corpus, the former yields 2.6% more precision and 0.5% more F measure. This result is possibly because the "Tokens and Rules" filter is designed to cover a superset of the cases handled by the "Protein and Gene Names" filter. Nevertheless, the protein and gene name filter is still useful in removing false positive matches for certain protein names.
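The negative-vocabulary idea can be sketched as span suppression: a MeSH match is dropped when it lies inside an occurrence of a longer negative-vocabulary name. This is a minimal sketch, not the authors' code: the function names are hypothetical, matching is a naive case-insensitive string search, and suppressing by position (rather than across the whole abstract) is an assumption. The example sentence and vocabularies are made up.

```python
import re


def find_spans(text, terms):
    """Return (start, end, term) for every case-insensitive occurrence."""
    lowered = text.lower()
    spans = []
    for term in terms:
        for match in re.finditer(re.escape(term.lower()), lowered):
            spans.append((match.start(), match.end(), term))
    return spans


def filter_with_negative_vocabulary(text, mesh_terms, negative_terms):
    """Drop MeSH matches that fall inside a negative-vocabulary match."""
    blocked = find_spans(text, negative_terms)
    kept = []
    for start, end, term in find_spans(text, mesh_terms):
        inside = any(b_start <= start and end <= b_end
                     for b_start, b_end, _ in blocked)
        if not inside:
            kept.append((start, end, term))
    return kept


# Made-up sentence and vocabularies, for illustration only.
sentence = "Glutamate dehydrogenase activity rose after glutamate was added."
matches = filter_with_negative_vocabulary(
    sentence,
    mesh_terms={"glutamate"},
    negative_terms={"glutamate dehydrogenase"},
)
```

Only the second, free-standing mention of "glutamate" survives; the first is part of the enzyme name and is suppressed.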
3.4. TP filter: removing MeSH terms with low TP ratios
Some MeSH terms, such as the dental sealant "Conclude" (also known as "Concise"), have a high false positive rate due to nonspecific matching. These terms are filtered out to improve match statistics. To do this, we pre-calculated the true positive ratio of each MeSH term using free-text string matching on the training set. A binary value (1 or 0) was assigned to each MeSH term to indicate whether or not it occurs in a given MEDLINE abstract. If a term was mentioned multiple times in an abstract, it was still counted as 1. The TP ratio for a specific MeSH term was calculated as the number of times the term was applied during manual indexing divided by the number of abstracts with free-text matches. This ratio is used to measure the propensity of a MeSH term to be correctly annotated in text. Some MeSH terms with their TP ratios are listed in Table . Common chemicals such as 'water' and 'glucose' tend to have a TP ratio below 50%. The term 'lead' has a TP ratio of only 11%, which indicates that in only 11 out of 100 papers is 'lead' indexed as a chemical element. Additional term types with low TP ratios include homonyms of common English words, such as 'link' and 'monitor', and acronyms with only a few characters, such as 'CI-2'. Using the TP ratio, one may set up a tunable threshold to eliminate non-specific MeSH terms in automatic annotation.
| Table 3. Selected MeSH terms with TP ratios ranked from lowest to highest based on a 230 K abstract corpus (Total 260 K abstracts minus 26 K testing corpus). |
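The TP ratio described above can be sketched as a per-term count over a training corpus. This is an illustration, not the authors' code: the function name and input layout are hypothetical, and it is assumed here that the numerator counts abstracts in which the term was both matched in the text and applied by the indexer. The toy corpus is made up.

```python
from collections import Counter


def tp_ratios(abstracts):
    """Compute the TP ratio of each term over a training corpus.

    abstracts: iterable of (matched_terms, indexed_terms) pairs of sets.
    Each term counts at most once per abstract, however often it occurs.
    """
    matched_counts = Counter()
    tp_counts = Counter()
    for matched, indexed in abstracts:
        for term in set(matched):
            matched_counts[term] += 1
            if term in indexed:
                tp_counts[term] += 1
    return {term: tp_counts[term] / n for term, n in matched_counts.items()}


# Made-up training corpus of four abstracts, for illustration only.
training = [
    ({"lead", "glucose"}, {"glucose"}),
    ({"lead"}, set()),
    ({"lead", "glucose"}, {"lead"}),
    ({"lead"}, set()),
]
ratios = tp_ratios(training)
```

In this toy corpus "lead" is matched in four abstracts but indexed in one (ratio 0.25), while "glucose" is matched in two and indexed in one (ratio 0.5).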
Once a threshold ratio is selected, MeSH terms with a ratio lower than the threshold will not be annotated on the testing data set. Selecting a reasonable threshold will remove false-positive annotations and increase the precision of MAA while not significantly reducing the recall. For example, if the threshold is set to 0.025, only 401 MeSH terms are eliminated in total, but 8297 FP annotations are removed (in Figure , this difference is shown by the blue bar when going from 'Proteins and Genes name' to 'TP ratio 0.025'), while 2466 TP annotations are lost (in Figure , this difference is shown by the red bar when going from 'Proteins and Genes name' to 'TP ratio 0.025'). In our study, the thresholds are adjusted from 0.025 to 0.4 to show the trade-off in recall as precision increases. Thresholds larger than 0.5 were not evaluated, since the MAA would lose more TP annotations than FP annotations. The best threshold ratio by F measure is between 0.1 and 0.2 (see Figure ). This TP ratio filter provides a degree of tunable accuracy for the MAA system.
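Applying the threshold can be sketched as a one-line filter over the matched terms. This is a hedged illustration: the function name is hypothetical, keeping terms that have no pre-computed ratio is an assumption, and apart from the 0.11 ratio for 'lead' quoted in section 3.4 the ratios below are made up.

```python
def apply_tp_filter(matched_terms, ratios, threshold):
    """Keep terms whose TP ratio is at least the threshold.

    Assumption: terms with no pre-computed ratio are kept.
    """
    return {t for t in matched_terms if ratios.get(t, 1.0) >= threshold}


# 'lead' uses the ratio quoted in the text; the other values are made up.
toy_ratios = {"lead": 0.11, "glucose": 0.45, "diazepam": 0.80}
matched = {"lead", "glucose", "diazepam"}

# Sweep a few thresholds from the range explored in the text.
surviving = {th: apply_tp_filter(matched, toy_ratios, th)
             for th in (0.025, 0.15, 0.4)}
```

Raising the threshold removes progressively more terms: all three survive at 0.025, 'lead' is dropped at 0.15, and only 'glucose' and 'diazepam' remain at 0.4 in this toy example.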
4. Term position in the text
The title is often a summary of a paper. In the abstract, the author often mentions objectives in the first sentence (FT) and conclusions in the last sentence (LT). The appearance of a chemical name in these parts of an abstract is likely to indicate a high degree of relevance. We performed MAA on the title, FT and LT of the abstracts, and then again on the title alone, to see if we could obtain higher precision. The results are presented in Figure and Table . In the title-only annotation, MAA achieves 96% precision if the TP filter threshold is set to 0.4. At this filter level, recall on the corpus is greater than 27%. Eliminating the TP filter for title-only annotation yields 91% precision with 33% recall. This may be because all words in the title are relatively important, reducing the need for the TP filter to remove non-specific MeSH terms. For an information retrieval task that requires a high degree of specificity, title-only MAA is a reasonable choice. Including the FT and LT in MAA yields lower precision than title-only MAA, but better precision than MAA on the entire abstract.
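Selecting the text to annotate for the three settings (title only; title plus first and last sentences; title plus full abstract) can be sketched as follows. This is a simplified illustration, not the authors' code: the function name and mode strings are hypothetical, and sentence splitting uses a naive regular expression that will mis-split abbreviations such as "e.g.". The title and abstract are made up.

```python
import re


def annotation_text(title, abstract, mode="full"):
    """Return the text that would be passed to the annotator."""
    if mode == "title":
        return title
    if mode == "full":
        return f"{title} {abstract}"
    if mode == "title_first_last":
        # Naive split: break after '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
        if not sentences:
            return title
        picked = [sentences[0]]
        if len(sentences) > 1:
            picked.append(sentences[-1])
        return " ".join([title] + picked)
    raise ValueError(f"unknown mode: {mode}")


# Made-up title and abstract, for illustration only.
toy_title = "Diazepam in anxiety"
toy_abstract = "We studied diazepam. Methods were standard. Diazepam reduced anxiety."
reduced = annotation_text(toy_title, toy_abstract, mode="title_first_last")
```

For the toy abstract, the middle sentence is dropped and only the title, first sentence and last sentence are kept.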
5. Comparison with other studies
In Table , the performance of MAA is compared to similar studies that matched MeSH to text. The top two rows in Table list results from Hettne [4] and Kolarik [13]. As a gold standard, Hettne and Kolarik used a text corpus with Kolarik's manual annotations of chemical terms, which were not restricted to the MeSH vocabulary. We applied MAA to Kolarik's testing corpus (2009 version, containing 100 full abstracts). The dictionary used in MAA is a combination of MeSH tree chemical terms (MeSH C) and MeSH substances (MeSH S), whereas both Hettne and Kolarik kept these vocabularies separate. We first performed free-text matching of the MeSH dictionary to Kolarik's corpus. The performance is very similar to Kolarik's results, which were also obtained using free-text matching. We then applied each filter cumulatively, as we did for our testing set in Section 3 of this paper. Once the TP filter threshold was set to 0.4, the performance we obtained was quite similar to that of Hettne's work, which used a "term disambiguation" pipeline to filter out some MeSH terms.
| Table 4. The comparison of precision (P), recall (R), and F measure (F) of this work (MAA) with those of other studies. |
MAA has a precision range of 0.44 ~ 0.79 with different filters, which is better than Kolarik's precision range of 0.34 ~ 0.44 (for MeSH C and MeSH S, respectively). MAA also gives a higher recall range (0.23 ~ 0.37) than Hettne's (0.22 ~ 0.07) or Kolarik's (0.27 ~ 0.10). The best F measure that MAA generated is 0.43, which is better than the F measure from either work (maximum of 0.34). Overall, the MAA results are closer to Kolarik's results if no filters are applied and to Hettne's results if the TP threshold is set to 0.4. However, as shown in Section 3.4, a higher TP threshold does not necessarily produce better performance as ranked by F measure. On our testing set of 26123 abstracts, the best performance of MAA was obtained when the TP threshold was set to 0.15; on Kolarik's corpus, it was obtained without applying the TP filter. This is consistent with results on our test corpus: while the TP filter significantly increases precision, it does so at the cost of recall.
The bottom rows of Table show results from our MAA system and the Medical Text Indexer (MTI), which was developed for NLM's Indexing Initiative system (IIS) with the goal of providing suggested annotations to MeSH indexers. The results of MTI are not restricted to chemical names, so we cannot directly compare the results of MTI to MAA, but we include them for reference. When MTI lists up to 25 recommendations for each article in a 273-article corpus, it provides a recall of 0.55 and a precision of 0.29.