|Home | About | Journals | Submit | Contact Us | Français|
Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently-proposed extension to conditional random fields that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches to that of Maximum Entropy (Max-Ent) and normal CRFs on three datasets, and improvement was obtained for all four methods over the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely-used F-measure, while the dictionary HMMs performed the best at finding entities that actually appear in the dictionary—the measure of most interest in our intended application.
Searching documents for entities that appear in them is a challenging subtask of information extraction (IE), especially when applied to medical and biological papers (Fukuda et al., 1998; Humphreys et al., 2000; Seki and Mostafa, 2003). Biomedical applications have special types of named entities that are different from those typically addressed by existing named entity recognition systems. These include names of genes, proteins, cell types, and drugs. Two basic approaches to entity recognition have been described: dictionary-based and context-based. In the dictionary-based approach, a pattern dictionary is constructed (Soderland and Lehnert, 1994). When a new document is presented, each textual n-gram in the document is scanned looking for matches to the patterns in the dictionary. Context-based extractors are usually based on machine learning. The name extraction problem is reduced to classification of individual words (Bikel et al., 1997; Demetriou and Gaizauskas, 2000). First a classifier determines whether each word is part of a named entity, and then the named entity is extracted by identifying the longest sequence of such words. Statistical machine learning techniques, for example hidden Markov models (Bikel et al., 1997), bootstrapping (Demetriou and Gaizauskas, 2000), and CRFs (Ryan and Pereira, 2004), are used to extract names. Machine learning techniques can also be used to construct context-sensitive pattern matching rules (Califf and Mooney, 1999) to extract named entities from text. Of course, these methods do not have to be used in isolation. Humphreys et al. (2000) used grammar rules as well as a lexicon to tag protein names.
Each approach has advantages and disadvantages when applied to protein name extraction. Dictionary-based extractors typically have low recall, unless they are coupled with a soft-matching scheme that handles variant entity names. Dictionary-based extractors also go out of date when the dictionary changes. In principle they can be updated by loading a new dictionary, but in practice, manual curation of a new dictionary is usually required. (For instance, our dictionary contains the entity AT as a protein name; if case-folding is allowed and this entity is not removed, then it would match the common word “at”). Context-based approaches do not in principle need updating when the set of entities changes. However, learned extractors do depend on the "dictionary" of entities available at training time, and it is unclear how they would perform on test sets containing a different distribution of entities—hence they too may go out of date over time.
A number of recent studies have evaluated different approaches specifically to recognizing protein names in MEDLINE abstracts. Franzén et al. (2002) described the YAPEX system, which achieved an F-measure of 67.1% on a dataset of 200 labeled abstracts. Kazama et al. (2002) compared the performance of Support Vector Machines (SVMs) and MaxEnt using the larger GENIA dataset, and found that SVMs gave better precision and F-measure, while MaxEnt gave better recall. Bunescu et al. (2004) compared the performance of a dictionary based approach, a rule learning system, boosted wrapper induction (BWI), SVMs, and MaxEnt on a dataset of over 700 abstracts. They concluded that machine learning approaches using SVMs and MaxEnt are able to identify protein names with higher accuracy than the other approaches. More recently Ryan and Pereira (2004) described an approach using conditional random fields (CRFs) with lexicon features, and achieved an overall F-measure of 82% on the BioCreative evaluation dataset.
In this paper we describe two new methods for recognizing protein names in abstracts. Our work is part of a larger system for extracting information from both images and text in journal articles (Murphy et al., 2004). This system, SLIF, creates a searchable database by mining on-line papers for fluorescence microscope images that show the subcellular localization of a protein. SLIF then analyzes the images, and associates them with the proteins and cell types in the accompanying caption. Queries to this database are expected to request information about the localization of known proteins (for instance, “find v-SNARE proteins that appear to localize to the early Golgi”). For queries such as the one above, recognizing entities that cannot be matched to the list of known proteins is of little interest; therefore, we have focused on recognizing entities from a fixed list which may change over time. Also, recall is more important than precision, since the end-user can filter out false positives with additional search constraints.
We therefore developed a novel learning method, dictionary HMMs (Dict-HMMs), which combines a dictionary with a hidden Markov model (HMM) to perform a soft match of phrases in text to entries in a dictionary. Dict-HMMs learn how to match words in a large uncurated dictionary. Only a small amount of training data is needed, and the learner is robust—meaning that if the training data is slightly different from the target set, the learner can still do reasonably well. To evaluate the Dict-HMMs approach, we compared its performance to MaxEnt, CRFs, and the recently-proposed semiCRFs, on datasets from different sources. Using a large dictionary from PIR-NREF1, we obtained an improvement over the best published results on two of the datasets.
Our algorithm begins by removing stop words, using the list of Seki and Mostafa (2003), and converting the remaining text to tokens. This tokenization is very important to the performance of Dict-HMMs, since there are many surface clues that indicate protein names. We used the rules of Ryan and Pereira (2004) to transform the original words, except that we also added rules for the suffixes “–in” and “–ase”, and also new rules indicating the mix of punctuation and characters. The rules used for token transformation are listed in Table 1. (For the other learning methods, these transformation rules were used as features, as discussed below.)
Markov models are mathematical models of stochastic processes that generate random sequences of outcomes according to certain probabilities. An HMM is one in which a sequence of emissions is observed, but the sequence of states the model goes through to generate the emissions is not known. An HMM contains the following elements:
Our approach utilizes entries in a protein name dictionary to determine the structure of the HMM model. The new HMM structure is shown in Figure 2. The state GE represents General English, which corresponds to non-protein text. Each sequence of states Si,1,…,Si,mi, which we will call a protein path, corresponds to one protein name from the dictionary—the entry with tokens ai,1,…, ai,mi and length mi. This correspondence is enforced by assigning Si,j, a high probability of emitting ai,j and a high probability of transitioning to Si,j+1. For instance, a token name like “v-SNARE Snc 2” would be associated with a length-three path in the HMM, where the first state Si1 is likely to emit the token “v-SNARE”. To allow soft matches to entries in the dictionary, the transition matrix also allows “jumping head” from state Si,j to Si,j+k or “looping” from state Si,j back to Si,j, as indicated by the grey edges in Figure 2. (For clarity we show only some grey edges) Each path in the HMM is thus similar to a profile HMM (Eddy, 1998).
This HMM will be very large. To reduce the number of parameters that must be estimated, we severely constrain the transition and emission matrices A and B. In the following paragraphs in this section, we describe more precisely the dictionary HMM by specifying its structure, alphabet, and emission and transition probabilities.
For reasons of efficiency, we never construct a complete dictionary HMM: instead, for each text that requires analysis, we will build a smaller HMM that contains only the most relevant protein paths. To select the paths included in a document-specific HMM, we first go through the text to be analyzed, and find all the relevant entries—i.e., dictionary entries which contain some word appearing in the text. Using a dictionary with 500,000 protein names, there will typically be a few hundred relevant entries for a 300-word abstract. The set of relevant entries is further reduced using as follows. First, adjacent tokens are grouped into clusters if they appear together in some protein name in the dictionary. Quite often tokens in a cluster appear together in several protein names. To further reduce the number of paths, for each cluster, we examine every relevant protein name P, and compute the score of P to the cluster. Here score=hits(P)/length(P), where hits(P) is the number of cluster tokens that match a token in P, and length(P) is the number of tokens in P. Finally, we select for each cluster the single protein that has the highest score (breaking ties randomly), and add only that protein’s path to the HMM. For example, if the words ‘signal recognition particle’ formed a cluster, and the relevant protein names were ‘signal recognition particle 54K protein’ and ‘signal recognition particle protein’, then only the latter entry would be selected, to generate one path in the Dict-HMM.
The set of observations include the tokens in the training set, the tokens in the dictionary, and a start symbol (indicating the start of observation sequence). With typical amounts of training data and our large dictionary, there is an imbalance in the number of tokens appearing in each source: the number of distinct tokens in the dictionary is more than 300,000, about 100 times the number of tokens in the training set. This leads to certain problems in smoothing frequency estimates. We thus divide the tokens into 3 categories: tokens only appearing in non-protein text; tokens only appearing in the dictionary, and tokens appearing in both. We then randomly sub-sample the dictionary-only tokens to obtain a smaller set of observations.
The initial probability of state GE is estimated from the training data. After estimating π(GE) = π0, we distribute the remaining possibility among the first states in the N paths, i.e., we let π(Si,1) = (1–π0)/N, where N is the number of paths.
Equation (1) defines the transition probability from state Si,j to a following state Si,j+k in a path. The parameter 0 < α < 1 allows “jumping” but assigns a higher probability to non-jumps and shorter jumps. Equation (2) forces the last state in a path to transition to the GE state. In Equation (3), γ is the probability of a transition from GE to GE, which is estimated from training data. Equation (4) defines the transition probability from state GE to state Si,k in a path. The factor (1–γ)/N means the probabilities for the transition from state GE to states in paths (which should sum up to 1–γ according to Equation 3) is distributed equally among all the N paths. The parameter 0<β<1 forces a higher probability to transitions from GE to the first state in a path, but allows a smaller probability of jumping to a state deeper in the path. The parameters C and Z are normalization factors.
The emission matrix is defined by the following models. First, Pr(wi|GE), the probability of state GE emitting a token wi, is estimated from the training data. For the states Si,j in a path, we divide the tokens it can emit into three categories: the tokens ai in the corresponding protein name entry; tokens only appearing in GE; and tokens appearing in the dictionary sample, excluding the ai’s.. Probabilities of state Si,j emitting symbols in these three categories sum up to 1-ε-δ, δ, and ε, respectively, as defined by Equations (5)–(7):
In Equation (5), ai,j is a token in the corresponding protein name entry, and the total probability 1-δ-ε is divided by mi. This means that states in one path have the same probability of emitting any of the tokens ai,1…,ai,mi, so the order of the tokens in a compound word is not important. In Equation (6), wl is a token appearing in the dictionary, excluding the ai’s, and in Equation (7), wl is a token appearing only in GE. The parameters 0 < ε, δ < 1 control the amount of variation allowed in protein names.
Table 2 summarizes the parameters in our model.
During testing we often encounter words that have not been seen during training: hence Equations (5) ~ (7) for emission probabilities need to be smoothed, by allowing novel tokens to appear with some probability. We used Good-Turing smoothing (Gale and Sampson, 1995): i.e., the frequency of tokens that only appear once in the training set is used to predict the total emission probability of unknown words.
Above we have defined the structure of the Dict-HMMs, as well as the transition and emission probabilities. We have also described how some of the parameters are set: specifically, the initial probabilities π, the parameter γ, which governs transitions from GE to GE, and the probabilities Pr(w|Dictionary) and Pr(w|GE) are all learned from the training data and the dictionary. It remains to learn the transition-matrix parameters α, β, and the emission-matrix parameters δ and ε.
To learn these parameters we use EM, and more specifically a variant of the usual Baum-Welch method. In the usual Baum-Welch method, the emission matrix B and transition matrix A are calculated in the M-step: however there is no guarantee they will follow our assumptions. Therefore after each M-step we compute the best set of values for α, β, δ and ε from A and B by fitting the models associated with each equation to the values associated with A and B. We then re-calculate A and B using these parameters, thus forcing A and B to follow the constraints imposed by equation (1) through (7). We stop updating the parameters when the likelihood of the observed sequence converges. Because the structure of dictionary HMMs is very dependent on the tokens present in the test data, we run the Baum-Welch algorithm on the test sequence, not on the training data, in our experiments. This is feasible since the Baum-Welch algorithm assumes the data are unlabelled.
Two additional variations of the Dict-HMM method were evaluated. One introduces a boosting-like strategy into the protein name finding process, and the other adds more states into the Dict-HMM structure.
In the standard Dict-HMMs, all the relevant paths are integrated into one HMM and protein names are extracted using this HMM. If the optimal parameters to extract protein name are different from those for another, the HMM may perform sub-optimally. To reduce the chance of this sort of interaction among paths, we used the following strategy.
This strategy thus extracts the single most likely protein name at each iteration, and ends when no more protein names are found.
The second variation was to add to the Dict-HMMs additional states, which include more information about the context surrounding a protein name. We added a pre-protein state and a post-protein state into the model, thus creating the HMM structure shown in Figure 3.
The remaining learning algorithms classify sequences of feature vectors, rather than sequences of tokens. Rather than transforming words in special tokens, as was done for the Dict-HMM, we implemented features (detectors) for the corresponding regularities. For example, there are two features associated with the rule of ‘has suffix -in’ for MaxEnt, one of which is:
In addition to such hand-coded features, we also used part-of-speech (POS) tags (from Brill’s POS tagger2) as features. We also included, for each token x, the output of the detectors and POS tags for the three tokens to the left and the right of x.
Maximum Entropy is widely used for inducing probabilistic tagging (Ratnaparkhi, 1996; McCallum, Freitag and Pereira, 2000). Maximum entropy gives a probability distribution of a possible tag y given a token x p(y | x) :
In this definition, each feature fi(x, y) is expressed as a binary function based on the current token x and its proposed classification y, λi is the corresponding feature weight, and Z(x) is a normalization factor.
Conditional Random Fields are another probabilistic tagging model (Lafferty, McCallum and Pereira, 2001). CRFs give the conditional probability of a possible tag sequence y = y1, … yn given the input token sequence x = x1, … xn:
In this definition, each fi is a function that measures a feature relating the state sj at position j with the input sequence around position j, λi is the corresponding feature weight, and Z(x) is the normalization factor.
Two recent papers (Cohen and Sarawagi, 2004; Sarawagi and Cohen, 2004) compared a number of methods for using dictionaries with CRF-like learning methods. The best results were obtained with a new learning method called semiCRFs. SemiCRFs construct and attach classifications to subsequences of a document, rather than to tokens. Since features can measure the properties of these subsequences, a feature measuring the similarity between a candidate segment and the closest element in a dictionary can be introduced. We used TFIDF or cosine similarity as the sole similarity measurement in our work (Cohen, et al., 2003), and otherwise followed the implementation of Sarawagi and Cohen (2004). To our knowledge, semiCRFs have not been previously evaluated for recognizing protein names. We used Minorthird3 for the implementation of MaxEnt, CRFs and semiCRFs.
Our dictionary was constructed by extracting the ‘protein name’ field from the PIR-NREF database. The extracted dictionary contains nearly 500,000 protein names. We used three datasets to evaluate our methods. The University of Texas, Austin dataset4 contains 748 labeled abstracts; the GENIA dataset5 contains 2000 labeled abstracts; and the YAPEX dataset6 contains 200 labeled abstracts.
The performance of Dict-HMMs, MaxEnt, CRFs and semiCRFs were compared for the three datasets. Table 3(a) shows results for all methods, along with published results for the same datasets. With respect to F-measure, the CRF variants improve over the best previous performance on two of the three datasets, and are competitive on the third (YAPEX). The Dict-HMM has lower F-measure performance, but unlike the other methods, appears to emphasize recall over precision.
Bunescu et. al. (2004) explored a wide range of techniques for combining dictionaries and machine learning techniques on the U Texas dataset. One of these, a MaxEnt method that uses a “dictionary tagger”, achieved the previous best result. The dictionary tagger used a set of hand-coded generalization rules to convert entries in a dictionary to “canonical” forms, and then tagged a word sequence as a protein name only if it is matched a known “canonical” protein name. Their dictionary was constructed by extracting protein names from Human Proteome Initiative of EXPASY and Gene Ontology Database. Their results are compared to our Dict-HMMs in Table 3(a). The comparison reveals that our Dict-HMM approach is competitive: it has a lower precision but higher recall than Bunescu et al’s dictionary lookup algorithm, and achieves a slight improvement in F-measure. On two of these three problems, semiCRFs—which make use of dictionary information as a feature—improve over conventional CRFs. However, the gains are modest. This is consistent with previous observations that, as measured by F1, the performance gain from dictionary features is largest for small training sets (Sarawagi and Cohen, 2004). This may be because for larger training sets, the most common protein names will be seen in the training data.
Table 3(a) also shows results for enhanced Dict-HMMs. With the boosting-like strategy, the F1-measure is improved by about 4%, the recall is improved by about 5%, the precision is improved by about 3%. With the additional states, the F1-measure is improved by about 4%, the recall is improved by about 2%, and the precision is improved by about 5%. With either of these improvements, the F1-measures for the U. Texas and GENIA datasets are higher than the best previously published results, but still lower than our implementations of MaxEnt and CRFs.
Though widely-used, F-measure is often not the best performance measure for specific applications. In our application, we are primarily concerned with finding protein names that can be matched to a known protein from the dictionary. To explore performance with respect to this goal, we calculated the TFIDF similarity score between each extracted protein name and the closest entry in the dictionary. Not surprisingly, the Dict-HMMs method finds only proteins with high similarity scores, while the CRF-based methods do not. For instance, we observed that on the U Texas data, the lowest similarity score for the Dict-HMMs was 0.89, while 26% of the names extracted by CRFs had a similarity score less than 0.89. In our particular application, these “novel”, dissimilar proteins are of less interest.
As a quantitative measure of performance in finding dictionary proteins, we calculated precision, recall and F-measure considering only those extracted protein names with a similarity score 0.9 or higher. This is shown in Table 3(b). According to this measure, performance is similar for the CRFs and Dict-HMMs methods, but CRFs had higher precision while Dict-HMMs had higher recall. The enhanced Dict-HMMs achieved the best performance according to the measure with TFIDF scores. For applications such as SLIF, in which non-dictionary entities have less benefit, the Dict-HMM is thus the preferred method.
Experiments were carried out to see how the performance of CRFs and Dict-HMMs depends on the size of training dataset. The 2000 abstracts in the GENIA dataset were split into a testing set of 300 abstracts and training datasets of between 50 and 500 abstracts. The curves for the F-measure, precision and recall are shown in Figure 4, averaged over 10 repetitions. Dict-HMM has comparable recall and precision, even with very small training sets. CRF also learns surprisingly fast. For small training sets, it is still true that the DictHMM has higher recall (and lower precision) than CRFs.
In practice, it is not easy to get labeled data. Therefore it is important for an algorithm to be able to work well when trained on one dataset and tested on another slightly different dataset. We refer to this property as robustness. Experiments were carried out to test the robustness of CRFs and Dict-HMMs. Each of the three datasets were split into a separate training set and a testing set, and a merged training set was obtained by mixing the three training sets. This merged training set was used to learn a model and then the model was applied to the testing sets from each source. The performance (for all protein names) is summarized in Table 4. CRFs and Dict-HMMs performed comparably.
Protein name recognition is recognized to be a challenging task. In this paper, we evaluated two new learning methods which make use of large dictionaries. One, Dict-HMM, represents a dictionary as a large hidden Markov model with many shared parameters, and learns to set those parameters to optimize the set of "soft" matches that are recognized. The second, semiCRFs, learns a semi-Markov variant of a conditional random field that uses distance of a phrase to a dictionary entry as a feature.
We compared the performance of Dict-HMMs, semiCRFs, MaxEnt and ordinary CRFs on three test datasets. For two of the datasets, we obtained better F-measure performance than the best previously published. The two CRFs variants also gave comparable results to the best previously obtained for the third dataset. While CRFs or semiCRFs gave the best performance according to F-measure, the boosted Dict-HMMs had a significantly higher recall than any previous system, and also extracted only names that are highly similar to ones in the dictionary. If “novel”, non-dictionary names are discounted, as in our intended application Dict-HMMs have the best performance overall.
Dict-HMMs have two additional advantages: parameters for the model can be learned from a small amount of training data, and the Viterbi path through the dictionary HMM can help identify the best-matching record in the dictionary. There is still much room for improvement in systems for addressing the protein recognition problem. We are currently exploring using filtering rules, and also an abbreviation finder, in the hopes of improving performance.
The work described here was supported in part by research grant 017396 from the Commonwealth of Pennsylvania Department of Health and by NIH K25 grant DA017357-01.