Background. We are witnessing an exponential increase in biomedical research citations in PubMed. However, translating biomedical discoveries into practical treatments is estimated to take around 17 years, according to the 2000 Yearbook of Medical Informatics, and much information is lost during this transition. Pharmaceutical companies spend huge sums to identify opinion leaders and centers of excellence. Conventional methods such as literature search, survey, observation, self-identification, expert opinion, and sociometry not only need much human effort, but are also noncomprehensive. Such huge delays and costs can be reduced by “connecting those who produce the knowledge with those who apply it”. A humble step in this direction is large scale discovery of persons and organizations involved in specific areas of research. This can be achieved by automatically extracting and disambiguating author names and affiliation strings retrieved through Medical Subject Heading (MeSH) terms and other keywords associated with articles in PubMed. In this study, we propose NEMO (Normalization Engine for Matching Organizations), a system for extracting organization names from the affiliation strings provided in PubMed abstracts, building a thesaurus (list of synonyms) of organization names, and subsequently normalizing them to a canonical organization name using the thesaurus.
Results: We used a parsing process that involves multi-layered rule matching with multiple dictionaries. The normalization process involves clustering based on weighted local sequence alignment metrics to address synonymy at word level, and local learning based on finding connected components to address synonymy at the level of whole organization names. The graphical user interface and Java client library of NEMO are available at http://lnxnemo.sourceforge.net.
Conclusion: NEMO is developed to associate each biomedical paper and its authors with a unique organization name and the geopolitical location of that organization. This system provides more accurate information about organizations than the raw affiliation strings provided in PubMed abstracts. It can be used for: a) bimodal social network analysis that evaluates the research relationships between individual researchers and their institutions; b) improving author name disambiguation; c) augmenting the National Library of Medicine (NLM)’s Medical Articles Record System (MARS) for correcting errors due to OCR on affiliation strings that are in small fonts; and d) improving PubMed citation indexing strategies (authority control) based on normalized organization name and country.
In recent years, many systems have been developed for disambiguating author names from PubMed abstracts (1; 2). Torvik et al. (2) use surface features such as title words, journal name, co-author names, medical subject headings, and language to estimate the probability that two PubMed articles that share the same author name were written by the same individual. An equivalent process for organization information is not available for PubMed abstracts.
Disambiguated author names are found to be directly useful for measuring research productivity of universities (3). However, such approaches have limitations (3) due to a) “extreme variance in the indication of affiliation information (full or abbreviated names in the original language or in English), or, conversely, identical indications for different institutions, for example universities located in the same city”, i.e., lack of highly precise systems for extraction of organization names; and b) “indication of the affiliation to research centers which naturally fall within a larger institution (for example, laboratories, departments, institutes, medical centers, etc.)”, i.e., lack of systems that disambiguate organizations and map them to the naturally fitting largest organization or organization group.
Considering that author disambiguation systems have achieved close to 100% accuracy (98% for Torvik et al.’s Author-ity system (2)), the missing piece for linking organizations to articles and their authors is a system that accurately extracts organization information from affiliation strings provided in PubMed, disambiguates them to canonical or most popular names, and maps the laboratories, departments, institutes, medical centers, etc. to a naturally fitting larger institution. Disambiguation and precise mapping of institutions will together be referred to as normalization. Yu et al. (4) obtained detailed information about investigators by automatically extracting organizations and related entities from affiliation strings of the articles. Their system had an accuracy of 87% for extracting organization names. In addition, NextBio (5) and BioMedExperts (6) possess systems for extracting organization names from affiliation strings, but the accuracy of these systems is not publicly known. Extracting organization names from free text is a well-studied problem (7). Recognizing organization names from the affiliation strings of PubMed abstracts is a different problem. Though the problem of normalization of organization names has been studied in open domains like Wikipedia and news articles as shown in (8), those systems had an accuracy of less than 80%. Thus, specialized tools are built for restricted domains such as gene mentions (9) and malignancy mentions (10). The aim of this research is to build a system that uses affiliation strings provided in PubMed to extract organization names with close to 100% accuracy and disambiguate the results.
Classification: There are mainly two relevant hierarchies of named entity types that have been proposed in the literature. The Bolt Beranek and Newman (BBN) categories (11), proposed in 2002 for a Question Answering task, consist of 29 types and 64 subtypes. Sekine's extended hierarchy (12), also proposed in 2002, consists of 200 subtypes. According to the BBN Hierarchy, our specific problem of extracting organization related named entities breaks down into identifying 4 major types of entities: a) Organization name – for classifying the names of the actual research groups, b) Geo-Political Entity name – names of country, state and city, c) Facility name – names of buildings and other man-made structures (Example: 1) Health Sciences West Bldg, 2) Room HSW1601), and d) Contact Information – Address, email and URL. While much Named Entity Recognition (NER) work has been reported by researchers (for instance, at the Message Understanding Conference 7, Automatic Content Extraction 2005 and Automatic Content Extraction 2008), few (this work and (4-6)) have looked at the task of organization NER from the authors' affiliation strings.
Approaches: Early NER systems were mostly based on hand-crafted rules which in general have good performance, but they are labor intensive insofar as they require regular inputs from experts familiar with the text. Many NER systems for different entities have been replaced by supervised learning systems which need less effort from the developers; in such systems, rules are automatically created from the positive and negative examples. Unfortunately, such methods need large amounts of annotated data. To our knowledge, there are no publicly available annotated corpora for the different organization-related named entities discussed above. This and the need for a highly accurate system encouraged us to build a rule-based system.
Data source: A journal submitting papers to PubMed is required to send a single field representing the organization name (http://www.ncbi.nlm.nih.gov/entrez/query/static/spec.html). Usually this is the organization of the first author, and in some cases the organization which appears first in the journal abstract. In general, each article is associated with one organization. Currently, there is no standard style for listing authors' affiliation strings; it is a free form text field with some moderate cultural preferences to list them in the following order: institution, city, state, and country. However, there are wide variations in their style and format. For example, a person may list Department first and then Institution or vice-versa.
Previous work: In a previously known attempt at recognizing organization names from PubMed abstracts, Yu et al. (4) assumed that most of the affiliation strings in PubMed articles adhere to the following format: [address component], [address component], …, [country].[email], where an address component according to them could be an organization name, city, state or facility name. Yu et al. assumed that the affiliation strings have the name of the country explicitly and one of the address components contains one of the institution character n-grams to indicate that component to be an organization. However, there are many affiliation strings which do not obey these assumptions. In the affiliation string, “Centaur Science Group, 1513 28th St NW, Washington, DC 20007 USA. (PubMed ID: 16796054)”, Centaur Science Group is an independent company that doesn’t contain any of the key words of Yu et al. In addition, there are many organizations like Mayo Clinic which are present in multiple places like Rochester (Minnesota), Jacksonville (Florida), Scottsdale (Arizona), Phoenix (Arizona) and Fairfax (Virginia). For the purposes of deeper analysis, these organizations should be considered as different, which means that it is also necessary to extract the city and state names. Thus, the problem of NER for organization names and its attached features is non-trivial, but the output should have high precision and recall for being useful for large scale statistical analysis.
To create a system that recognizes organization names with high accuracy, we propose to apply rules at multiple levels, with each level gradually converting the unstructured input text into structured fields. This is discussed in section 2.1 and illustrated in Figure 1. Although we are dealing with citations written in or translated to English, about 10% of the institution names (such as Universidad Central del Caribe and Carl von Ossietzky Universität Oldenburg) remain in their native language.
Several algorithms have been proposed for normalizing affiliation strings (13;14;15;16). Most of these algorithms are aimed at authority control to improve indexing of online bibliographic databases. Such algorithms have not been applied to PubMed, which is the largest bibliographic database for the biomedical domain. Although our main goal is to make such a system available in the biomedical domain, our approach for normalization is also novel. In addition to dealing with variations because of spelling and abbreviation while resolving synonymy, we address synonymy through language translation and transliteration and we automatically identify the basic hierarchical structure of organizations. We extend the edit-distance approach proposed by French et al. (13) by a) processing the affiliation strings at two levels – first to normalize variation caused by non-standard words and then the variation caused by non-standard combination of words, and b) using weighted local sequence alignment to disambiguate words and affiliation strings.
Summaries of our NER and normalization methods have been published (17;18); however, the present study adds details to make the algorithm reproducible. In addition to adding new methods and dictionaries, we have made the tool applicable to countries other than USA. Finally, we have released NEMO as a public web service/online application/graphical user interface which is available at http://lnxnemo.sourceforge.net .
Step 1 constitutes the Information (affiliation strings) Retrieval stage where the affiliation strings are retrieved using NCBI eutils; steps 2 and 3 constitute the Named Entity Recognition stage; steps 4-6 constitute the training stage of Normalization; step 7 constitutes the post-processing stage of normalization (manual data cleaning); step 8 constitutes the testing (actual execution) stage of Normalization. Steps 4-7 are executed only once during the initial training stage where 103,557 random affiliation strings are used for unsupervised clustering to find canonical names for the organizations and map them to a parent organization name.
Figure 1 illustrates the architecture of NEMO. Starting from accepting a query for a biomedical topic, NEMO retrieves the affiliation strings using NCBI eutils (eutils.ncbi.nlm.nih.gov) [module 1], extracts the names of organizations and their locations [modules 2 and 3] and culminates in normalizing (uniquely identifying) it with an organization name, and associating it with a cluster of organizations that it collaborates with or is part of [modules 4-8].
The Vedic Neti, Neti golden rule [One must find the soul by analysis saying, "This is not it. This is not it."] holds true for extracting the organization names from the affiliation string. The other 3 types of named entities – geopolitical entity name, facility name and contact information – conform better to a general pattern than the most important entity, organization name. This is mainly because of the inherent volatility in the naming of an organization and its usage compared to that of a country, state, or city in which naming conventions are governed by social and political laws. Thus, we first find the phrases which represent only the country, email address, URL, state, city, and street address (in that order) and then consider the left-over phrases to verify if they represent organization names. We store the geopolitical entity and other properties for future analysis. While this is the general framework, each subtask such as determining the name of the country requires use of multiple rules and multiple dictionaries. In total there are 30 different manually verified dictionaries or lists such as the dictionaries in Geoworldmap database (19), Internet domain – country mapping [IANA - http://www.iana.org/domains/root/db], stop words list (20), keywords for organizations (Table 1), keywords for addresses (Supplementary File 1) and zip code dictionary [http://www.zip-codes.com/zip-code-database.asp]. Table 2 details the steps in our process to extract organizations and related entities.
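The elimination order described above can be illustrated with a minimal sketch. It is not the NEMO implementation: the tiny dictionaries, the regular expressions, and the function name `classify_phrases` are all hypothetical stand-ins for the 30 verified dictionaries and the many rules of Table 2. It only shows the idea of removing country, email, URL, state, city, and street address phrases in that order and keeping the remainder as organization candidates.

```python
import re

# Hypothetical toy dictionaries; the real system uses large verified lists.
COUNTRIES = {"usa", "turkey"}
STATES = {"nc", "al", "dc"}
CITIES = {"durham", "washington"}

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
STATE_RE = re.compile(r"^([A-Za-z]{2})(\s+\d{5})?$")  # state code, optional zip
STREET_RE = re.compile(r"^\d+\s.*\b(st|street|ave|avenue|rd|road)\b", re.IGNORECASE)


def is_state(phrase):
    match = STATE_RE.match(phrase)
    return bool(match) and match.group(1).lower() in STATES


# Checks are applied in the order given in the text.
CHECKS = [
    ("country", lambda p: p.lower() in COUNTRIES),
    ("email", lambda p: bool(EMAIL_RE.match(p))),
    ("url", lambda p: bool(URL_RE.match(p))),
    ("state", is_state),
    ("city", lambda p: p.lower() in CITIES),
    ("street_address", lambda p: bool(STREET_RE.match(p))),
]


def classify_phrases(affiliation):
    """Label comma-separated phrases; leftovers become organization candidates."""
    remaining = [p.strip(" .") for p in affiliation.split(",") if p.strip(" .")]
    labels = {}
    for label, test in CHECKS:
        leftover = []
        for phrase in remaining:
            if test(phrase):
                labels.setdefault(label, []).append(phrase)
            else:
                leftover.append(phrase)
        remaining = leftover
    labels["organization_candidates"] = remaining
    return labels


result = classify_phrases(
    "Duke University Medical Center and Duke Clinical Research Institute, "
    "Durham, NC 27710, USA."
)
```

Splitting on commas is a simplification; affiliation strings that pack several entities into one phrase would need the additional rules the paper refers to.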
There are many more subtle rules and discussing all of them is beyond the scope of this paper. For example, the affiliation string: “USDA, ARS, Aquatic Animal Health Research Unit, Auburn University, AL, USA.” may appear to be associated with Auburn University, but it is actually from ARS of USDA. The confusion is caused because Auburn University is the name assigned to a city in the USA with the zip code 36849. But there are many affiliation strings where Auburn University is an organization entity. In this case, when there is an ambiguous phrase that could be a city or an organization, we check for other phrases in the sentence that may represent a city. If this is true, the ambiguous phrase is considered as an organization name, otherwise it is treated as a city name.
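The city-versus-organization rule above can be sketched as follows. The function name, the toy city set, and the second affiliation string are hypothetical and serve only to illustrate the rule as stated; the actual system consults its full dictionaries.

```python
def resolve_ambiguous_phrase(other_phrases, known_cities):
    """Label a phrase that could be either a city or an organization.

    If another phrase in the same affiliation string is a known city, the
    ambiguous phrase is treated as an organization; otherwise as a city.
    """
    has_other_city = any(p.strip().lower() in known_cities for p in other_phrases)
    return "organization" if has_other_city else "city"


# Hypothetical toy dictionary of city names.
KNOWN_CITIES = {"auburn", "auburn university", "houston"}

# Example from the text: no other city phrase, so "Auburn University" is a city.
usda_phrases = ["USDA", "ARS", "Aquatic Animal Health Research Unit", "AL", "USA"]
usda_label = resolve_ambiguous_phrase(usda_phrases, KNOWN_CITIES)

# Hypothetical string with an explicit city, so "Auburn University" is an organization.
dept_phrases = ["Department of Chemistry", "Auburn", "AL", "USA"]
dept_label = resolve_ambiguous_phrase(dept_phrases, KNOWN_CITIES)
```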
Our dictionaries were built initially for English and wherever applicable we translated them to additional languages using Google Translate (21). However, each dictionary item may have many synonyms in different languages and not all of them exist in the dictionary. If no information is extracted from a phrase in any of the steps in the extraction algorithm (Table 2), we translate individual phrases in the affiliation strings to English (using Google Translate) and execute all the steps (Figure 1, 3a-3f) again for that phrase. This approach is useful in cases where an address or organization keyword is missing from our dictionaries, but the English translation of the keyword exists within the dictionary list.
Described and Descriptor entities: There are two distinct types of named entities related to an organization – those which uniquely identify with real world organizations (we refer to these as “described entities”) and those which do not uniquely identify with real world organizations unless in the presence of a described entity. The primary role of the latter is to give more specific information about a described entity (thus, we refer to them as “descriptor entities”). All organizations containing a person name (recognized from an online dictionary), a place name (recognized from the dictionary of all major places in the world), or a directional modifier are recognized as “described entities”. The rest of the organizations are “descriptor entities”. There is at least one described entity per affiliation string. Examples of described entities are Jerome Lipper Center for Multiple Myeloma, University of Texas and North-western University. Examples of descriptor entities include School of Informatics and Dept. of Biomedical Informatics. Our glossary of person names is built using the following website: http://names.whitepages.com. This website is queried to learn whether a token is a person name. Both the positive and negative results are stored in our dictionary. After processing 103,557 random affiliation strings for extracting person names, the ambiguous names are manually removed. The process of updating the glossary continues perpetually while the software is being used. The glossary of places is built using the GeoWorldMap database (19).
Resolving polysemy: We addressed polysemy (same word having multiple senses) within the entity class of organization by mapping the extracted organizations to the least generic concept possible. For example, if an affiliation string is found to have Mayo Clinic as an organization and the only Geopolitical entity (GPE) recognized is USA (for the subtype country), we associated it with the Mayo Clinic group of organizations in USA. However, if our NER process also recognized the city (say Rochester), we associated it with the Rochester Branch of the Mayo Clinic group. The normalization process identified each described entity with a unique real world organization or a unique organization group as in the examples above. Since PubMed generally stores only the primary affiliation string of the first author, we associated each article with the normalized described entity that is estimated to have the highest number of articles in PubMed. For example, the article with PubMed ID – 15607955 has the following described entities after normalization: Yale University School of Medicine, and Boyer Center for Molecular Medicine. Of the two, Yale University School of Medicine has the higher number of articles; therefore, this article is associated with Yale University School of Medicine.
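The final selection rule can be written in a few lines. This is only a sketch: the function name and the article counts below are invented for illustration and are not figures from the study.

```python
def pick_primary_entity(described_entities, article_counts):
    """Return the described entity with the most articles (ties: first listed)."""
    return max(described_entities, key=lambda name: article_counts.get(name, 0))


# Hypothetical counts, for illustration only.
counts = {
    "Yale University School of Medicine": 1200,
    "Boyer Center for Molecular Medicine": 40,
}
primary = pick_primary_entity(
    ["Yale University School of Medicine", "Boyer Center for Molecular Medicine"],
    counts,
)
```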
Resolving synonymy: One major challenge in normalizing organization names is to identify and replace Non Standard Words (NSWs). NSWs can be broadly classified as (22):
1. Miscellaneous – these are made of unconventional word and phrase boundaries, intentional informal spelling, URL and formatting abnormalities. The unconventional boundaries and URLs were dealt with in the NER phase. Informal spellings which might appear in conversational text do not appear in the organization names submitted to PubMed. Some affiliation strings have formatting abnormalities such as the following: From the *Division of Pediatric Cardiology and, daggerDepartment of Pediatrics, Ataturk University, Faculty of Medicine, Erzurum, Turkey. (PubMed ID:19262418, Date: 2009 Feb 28. [Epub ahead of print]), where the symbol † from the original affiliation string in the journal is replaced with the sequence “dagger” instead of getting deleted. While such abnormalities might arguably be useful for indicating the presence of multiple affiliations in the journal abstract, they prevent the abstracts containing them from being retrieved. These kinds of mistakes are present even in recently manually indexed abstracts such as Plant Polymer Research, USDA,(dagger) ARS, National Center for Agricultural Utilization Research, 1815 N. University St., Peoria, IL 61604, USA. (PubMed ID: 19111748, Feb 2009). The NER stage does not remove these errors. These errors mainly appear at the stage of automatic scanning by the Medical Articles Record System (MARS) after escaping the notice of the Seek Affiliation program (23) and the manual supervisors of the National Library of Medicine’s indexing section.
2. Misspellings and abbreviations – Misspellings are dealt with at a later phase along with formatting abnormalities. Abbreviations need to be replaced by their full forms. For example, University of California at Los Angeles Medical School needs to replace UCLA Medical School. This problem is automatically solved (at the NER step – Table 2, step 2) by making sure that all the acronyms are expanded to their full form before tagging a phrase as an organization. For example, the organizations identified for the affiliation string “Baylor College of Medicine, USDA/ARS Children's Nutrition Research Center, Department of Pediatrics-Nutrition, Houston, Texas, USA. (PubMed ID: 12401707)” were expanded to Baylor College of Medicine, United States Department of Agriculture/Agriculture Research Service Children's Nutrition Research Center and Department of Pediatrics-Nutrition.
Thus, spelling variations and formatting mistakes can cause NSWs to appear in organization names that are extracted automatically. These two, along with the lack of consensus in the choice of words when referring to the organizations, are responsible for synonymy (different words for the same concept) in organization named entities. Table 3 shows an example of synonymy. The task of Normalization is to map all these 10 organization mentions to the same concept – Washington University School of Medicine. One common approach to solving this synonymy is to compare the recognized organization name against a list or dictionary of organization names. This approach is used in gene normalization systems (24) where the systems find the Entrez Database Gene identifiers for human genes or gene products mentioned in PubMed abstracts. Unlike gene names, organization names are volatile. Many organizations get renamed and some organizations become non-operational every few years. We are not aware of a public database of organizations that is actively maintained. In the present study, we propose a mechanism to automatically build a database of organization clusters, OrgDB, from 103,557 randomly selected affiliation strings from PubMed published between the years 1998 and 2008. Then, we used automatic techniques to match the database entries in order to normalize the organization names.
Two other important causes of synonymy in the non-English names of organizations are a) they might be referred sometimes by their English translation (Example: University of Chile in PMID: 20737243) and other times by the original name (Example: Universidad de Chile in PMID: 20735270); and b) the diacritic marks might be preserved (Example: Universitat Autònoma de Barcelona in PMID: 20735775) or removed (Example: Universitat Autonoma de Barcelona in PMID: 20734978). To disambiguate this synonymy, we translated each non-English organization name to English using Google Translate (21) and removed all diacritic marks.
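Removing diacritic marks, the second half of the step above, can be sketched with Python's standard `unicodedata` module. This is an illustration and not necessarily how NEMO does it; the translation step is not shown, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (accents, umlauts)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


with_marks = "Universitat Autònoma de Barcelona"
without_marks = strip_diacritics(with_marks)
```

After this step the two spellings cited above (PMID: 20735775 and PMID: 20734978) become identical strings, so they no longer need to be treated as separate variants.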
After the NER step and translation and transliteration to English, we use clustering to automatically build a dictionary for organizations and their synonyms [modules 4-7], and then dictionary based matching [module 8] using the dictionary we built in the previous step. We store the thesaurus of organizations as a database, OrgDB, starting from the organization names parsed from the affiliation strings. Each entry in the OrgDB database is a cluster that has the following features: a) a centroid string, b) a list of all organizations in the cluster, c) a matrix containing inter-component distance using the string similarity metric, d) the PubMed IDs of the articles containing at least one organization name from the cluster, e) the city, state and country of the cluster.
Clustering: Our approach to clustering is agglomerative and partitional. OrgDB is initially empty and each new organization name along with the related information is added to OrgDB and to one of the clusters already present in the database if the organization name is sufficiently close to its centroid (according to a threshold of edit distance as defined below). After adding a new organization to a cluster, the centroid is recomputed. Usually centroids are calculated by taking the average of the vector representations of the elements in the cluster. In a Euclidean vector space, it is an easy proof using Fermat’s theorem on stationary points that the centroid is the point that has the smallest sum of the squares of the Euclidean distances from each of the points in the cluster. To represent organization names in a continuous vector space similar to the document-term vector space used in information retrieval and also encode the order of the terms in the name of an organization, one may need to use an O(N^L) dimensional space, where N is the size of the vocabulary used in organization names (which is of the order of 1000) and L is the maximum number of words in an organization string being considered (which is of the order of 10). Since a vector space of dimensionality in the order of 10^30 is prohibitively large, we use the edit distance between strings as our kernel function for calculating the distance between two organization names without representing the organization names in a vector space. Since we do not have a numerical representation for the organization names, an organization name from the cluster which is the best approximation for an ideal centroid is chosen as the centroid.
The centroid is chosen to be the organization whose name has the least sum of edit-distances from the names of all organizations in the cluster. The GPE of the cluster is the set union of the GPE of all organizations; i.e., if the parsing process is not able to identify the city of one of the organizations while being able to identify the city in another organization within the same cluster, then the city of the latter organization becomes the city associated with the cluster.
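The centroid choice and the GPE union described above can be sketched as follows. The helper names and the word-set distance are hypothetical; the word-set distance is only a stand-in so the example runs on its own, whereas the paper uses its own edit-distance metric.

```python
def word_set_distance(a, b):
    """Stand-in distance (an assumption): number of words not shared."""
    return len(set(a.lower().split()) ^ set(b.lower().split()))


def choose_centroid(names, distance):
    """Pick the member with the least summed distance to all members."""
    return min(names, key=lambda c: sum(distance(c, other) for other in names))


def merge_gpe(gpes):
    """Union of GPE fields: the first known value for each field wins."""
    merged = {}
    for gpe in gpes:
        for field, value in gpe.items():
            if value and field not in merged:
                merged[field] = value
    return merged


members = [
    "Washington University School of Medicine",
    "Washington University Medical School",
    "Washington University School of Medicine St Louis",
]
centroid = choose_centroid(members, word_set_distance)
cluster_gpe = merge_gpe([
    {"city": None, "state": "MO", "country": "USA"},
    {"city": "St Louis", "state": "MO", "country": "USA"},
])
```

Because the centroid is always an actual member of the cluster, no vector representation of the names is needed, which matches the motivation given above.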
Each affiliation string is processed through the NER mechanism in order to get all the organization names along with their GPE. Each described entity among those organizations is compared with the centroid strings of the clusters from OrgDB which have the same GPE. The distance metric that is used in this case is discussed below. If one of the subtypes of the GPE – city, state, or country is missing, then all the clusters in OrgDB with the same set of remaining subtypes are used for edit-distance comparison. If the distance metric suggests that an organization is sufficiently close to the centroid of a cluster, then the organization is added to the cluster. If no cluster is close enough to the organization, a new cluster is added to OrgDB having the organization as the only component. Thus, we obtained clusters of organizations with each cluster having components with minor variations at a lexical level. This clustering step is performed for a sufficiently large number of affiliation strings (N=103,557) so that most of the organizations and their statistical distribution are known in a reasonable amount of time.
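A minimal sketch of this incremental clustering loop is given below. Several details are assumptions made for illustration: the cluster layout (a dict with `centroid`, `members`, `gpe`), the stand-in word-set distance, the threshold of 2, and the choice of the closest cluster when several qualify. The paper's own distance metric and thresholds are described in the following paragraphs.

```python
GPE_FIELDS = ("city", "state", "country")


def word_set_distance(a, b):
    """Stand-in distance (an assumption): number of words not shared."""
    return len(set(a.lower().split()) ^ set(b.lower().split()))


def gpe_compatible(g1, g2):
    """Fields known on both sides must agree; missing fields are ignored."""
    return all(not g1.get(f) or not g2.get(f) or g1[f] == g2[f] for f in GPE_FIELDS)


def medoid(names, distance):
    return min(names, key=lambda c: sum(distance(c, o) for o in names))


def add_to_orgdb(orgdb, name, gpe, distance=word_set_distance, threshold=2):
    """Add a described entity to a close-enough cluster, or start a new one."""
    candidates = [
        c for c in orgdb
        if gpe_compatible(c["gpe"], gpe) and distance(name, c["centroid"]) <= threshold
    ]
    if not candidates:
        cluster = {"centroid": name, "members": [name], "gpe": dict(gpe)}
        orgdb.append(cluster)
        return cluster
    cluster = min(candidates, key=lambda c: distance(name, c["centroid"]))
    cluster["members"].append(name)
    cluster["centroid"] = medoid(cluster["members"], distance)  # recompute centroid
    for field in GPE_FIELDS:  # fill GPE fields the cluster did not know yet
        if not cluster["gpe"].get(field) and gpe.get(field):
            cluster["gpe"][field] = gpe[field]
    return cluster


orgdb = []
add_to_orgdb(orgdb, "Duke University Medical Center",
             {"city": "Durham", "state": "NC", "country": "USA"})
add_to_orgdb(orgdb, "Duke University Medical Centre",
             {"state": "NC", "country": "USA"})
add_to_orgdb(orgdb, "Mayo Clinic",
             {"city": "Rochester", "state": "MN", "country": "USA"})
```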
String Similarity: For comparing the variants in the organization names, we are inspired by the biological sequence alignment algorithms that have been used recently for text mining applications such as sentence paraphrasing (25;26). There are two categories of sequence alignment: Global sequence alignment as implemented by the Needleman-Wunsch (NW) algorithm (27), and Local sequence alignment as implemented by the Smith-Waterman (SW) algorithm (28). The Needleman-Wunsch algorithm is a dynamic programming algorithm for pair-wise global sequence alignment, i.e., it is used to find the best possible alignment between two sequences. The Smith-Waterman algorithm is similar to the Needleman-Wunsch algorithm except that it seeks optimal sub-alignments instead of a global alignment and, as described in the literature, it is well tailored for pairs with considerable differences in length and type. Table 4 demonstrates the use of these algorithms; the first string is the string that we are interested in normalizing while the second string is from OrgDB. We calculate the NW and SW scores as implemented in the Neobio software (29). We use the basic scoring scheme of a reward of 1 for a match and a penalty of 1 for both a mismatch and a gap. The second example demonstrates (as also shown by Corderio et al. (26) in the context of paraphrasing sentences) that global sequence alignment may not be suitable for the purpose of comparing organization names since it can classify related strings as different. Although local sequence alignment may seem to suit our purpose, there are cases where it can classify unrelated strings as similar, as in Table 4 (c and d). The local sequence alignment wrongly identified WOMEN AND CHILDREN HOSPITAL LOS ANGELES with CHILDREN HOSPITAL LOS ANGELES instead of WOMEN’S & CHILDREN'S HOSPITAL.
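For readers unfamiliar with the two alignment scores, the following is a small, independent sketch of character-level Needleman-Wunsch and Smith-Waterman scoring under the basic scheme stated above (+1 match, -1 mismatch, -1 gap). It does not use or reproduce the Neobio API, and the example strings are chosen only to show how the two scores diverge when one name is contained in a longer one.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch), computed row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score (Smith-Waterman): best-scoring sub-alignment."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


short_name = "CHILDREN HOSPITAL"
long_name = "WOMEN AND CHILDREN HOSPITAL"
global_score = nw_score(short_name, long_name)  # penalized for the 10 extra characters
local_score = sw_score(short_name, long_name)   # full credit for the shared substring
```

The local score ignores the unmatched prefix entirely, which is the behavior that makes local alignment tolerant of length differences and also the reason it can over-match, as in the Table 4 case discussed above.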
This is why we need a different mechanism of comparison that initially has a strict scheme for comparing organization names (i.e., it holds off on the matching process until enough information is available to make a concise and reliable match call). This kind of approach can be aptly called “recalculation through self-training” and was recently adopted in building an efficient natural language parser (30), which is currently one of the best parsers available in the biomedical domain. Such a method of using local information from the training data to further enhance the value of the training set is referred to as “local learning” in the field of Artificial Intelligence (AI) (31).
Tight String Similarity (TSS): We are using the Levenshtein distance (32) (the most commonly used edit distance metric) between the two organization names, not at the character level but at the word level. We remove all the words that are defined as stop words in our dictionary (20), as the presence or absence of a stop word does not change the identity of an organization. Two given words of non-zero length are considered the same if they score more than 0.85 on the word similarity score (WS, defined in equation 1). The threshold of WS was chosen to be 0.85 in order to prevent a mismatch of more than 1 letter for every 7 letters (1-1/7 ~ 0.85).
For the Levenshtein calculation, the penalty for a gap of a word is the length of the word and the penalty for a mismatch between two words is the sum of their lengths. This penalty was chosen because larger words in general have more information; hence the penalty should be proportional to the length of the words. Since the current step is “Tight” String Similarity, we need to have a similarity match that is tight (i.e., strict) enough to prevent classifying different organizations as similar. Thus, we chose a conservative threshold of 4 (i.e., two organization phrases are similar if their Levenshtein scores are not more than 4). Using this similarity metric, we associate the described organizations identified from the NER process to one of the clusters in OrgDB if such a cluster exists, otherwise a new cluster in OrgDB is formed.
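The word-level Levenshtein calculation with length-proportional penalties can be sketched as follows. Two things here are assumptions and are labeled as such in the code: the stop-word list is a toy list, and `word_similarity` is a stand-in for WS (one minus the character edit distance divided by the longer word's length), because the defining equation is not reproduced in this text. The penalties and the thresholds of 0.85 and 4 follow the description above.

```python
STOP_WORDS = {"of", "the", "and", "at", "for", "in"}  # toy list (assumption)


def char_edit_distance(a, b):
    """Standard character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def word_similarity(w1, w2):
    """ASSUMPTION: stand-in for the paper's WS score, not its actual definition."""
    return 1.0 - char_edit_distance(w1, w2) / max(len(w1), len(w2))


def tss_distance(name_a, name_b, ws_threshold=0.85):
    """Word-level Levenshtein: gap costs len(word), mismatch costs both lengths."""
    a = [w for w in name_a.lower().split() if w not in STOP_WORDS]
    b = [w for w in name_b.lower().split() if w not in STOP_WORDS]
    prev = [0]
    for wb in b:
        prev.append(prev[-1] + len(wb))
    for wa in a:
        curr = [prev[0] + len(wa)]
        for j, wb in enumerate(b, 1):
            same = word_similarity(wa, wb) > ws_threshold
            substitute = prev[j - 1] + (0 if same else len(wa) + len(wb))
            curr.append(min(substitute, prev[j] + len(wa), curr[j - 1] + len(wb)))
        prev = curr
    return prev[-1]


def tss_similar(name_a, name_b, max_distance=4):
    return tss_distance(name_a, name_b) <= max_distance


typo_distance = tss_distance("Duke University Medical Center",
                             "Duke Univrsity Medical Center")
different_distance = tss_distance("Yale University", "Duke University")
```

With these penalties, a single short extra word can still fall within the threshold of 4, while replacing one distinguishing word with another almost never does.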
Recalculation: Because TSS assures standardization only on the words and not on the whole sentence, it mainly addresses the synonymy caused by NSWs. The organizations represented by two or more different clusters might still represent the same organization because of the lack of consensus in the choice of words while referring to the organization. Example: “The David Geffen School of Medicine at the University of California” and “DGSchool of Medicine at the University of California at Los Angeles”. So, in the recalculation step we find all the organizations related to the centroid of the present cluster. The algorithm is based on finding the connected component containing a vertex (33).
OrgDB is equivalent to an undirected graph, OrgG, with the vertices as the different clusters. An edge exists between two vertices (clusters) only if they are not from different cities or states and their corresponding centroids a and b score more than 0.90 on the Extended Smith-Waterman Score (ESS, defined in equation 2). This means that two organization names are related, if roughly for every 10 letters in the smaller name, only 1 letter can mismatch. This is still a conservative threshold, thus minimizing false positives or type-1 error.
If one of the two strings contains most of the other string (e.g., David Geffen School of Medicine and DGSchool of Medicine), then their ESS would be more than 0.90 and there will be an edge between the vertices corresponding to the clusters of these two strings in OrgG. Our recalculation step is equivalent to finding the “connected component” that contains the vertex corresponding to the cluster that contains the organization being normalized. In this paper, we will use the phrase “connected component” to refer to “connected component containing the vertex we are currently interested in”. The connected component is calculated by the breadth-first approach as elaborated in Table 5.
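A breadth-first search for the connected component of a given cluster can be sketched as follows. This is an illustrative sketch rather than a transcription of Table 5: ESS (equation 2) is not reproduced here, so the similarity function is passed in as a parameter, location equality stands in for “not from different cities or states”, and the toy clusters, locations and scores are invented for the example.

```python
from collections import deque

def make_edge_test(centroids, locations, similarity, threshold=0.90):
    """Return related(a, b): True when clusters a and b share an edge in OrgG."""
    def related(a, b):
        same_place = locations[a] == locations[b]
        return same_place and similarity(centroids[a], centroids[b]) > threshold
    return related

def connected_component(start, clusters, related, max_depth=None):
    """Breadth-first search from `start`; optionally stop expanding at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # vertices at the depth limit are kept but not expanded
        for other in clusters:
            if other not in seen and related(node, other):
                seen.add(other)
                queue.append((other, depth + 1))
    return seen

# Invented toy data: four clusters, with D located in a different city.
centroids = {"A": "a", "B": "b", "C": "c", "D": "d"}
locations = {"A": "Los Angeles, CA", "B": "Los Angeles, CA",
             "C": "Los Angeles, CA", "D": "Boston, MA"}
toy_scores = {frozenset("ab"): 0.95, frozenset("bc"): 0.93, frozenset("ad"): 0.95}

def toy_similarity(x, y):
    """Placeholder for ESS: looks up an invented pairwise score."""
    return toy_scores.get(frozenset((x, y)), 0.0)

related = make_edge_test(centroids, locations, toy_similarity)
component = connected_component("A", list(centroids), related)
```

In this toy graph, C is reached from A only through B, which is the kind of indirect synonym that pairwise string similarity alone would miss; D is excluded despite its high score with A because it is in a different city.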
Cleaning process: OrgDB is manually cleaned by: 1) removing all the clusters where the centroid is a descriptor entity, 2) merging of two clusters if the output of normalization of the centroids refers to the same organization, 3) expanding unambiguous abbreviations, 4) correcting the spelling mistakes in the names of centroid of clusters, and 5) repeating the Recalculation step.
Running/testing stage: The dictionary matching process [Figure 1, module 8] is illustrated in Figure 2 and Table 6. The input affiliation string is “Duke University Medical Center and Duke Clinical Research Institute, Durham, NC 27710, USA.” We get the organizations O1-O20 by first clustering and then finding the connected components up to a depth of 2 from the root vertex O1. To demonstrate the need for the pruning step, we consider expanding O19, which is a leaf. The organizations adjacent to it are Durham Veterans Affairs Medical Center and Veterans Affairs Medical Center. An examination of the PubMed IDs associated with articles in the clusters that are in the connected component of the above two organization vertices revealed that they did not collaborate (in any of the 103,557 publications we analyzed) with any organization in the connected component. Thus, we did not add these organizations to the connected component. Step 4 is continued for the rest of the organizations. The connected component gives the set of all the synonyms of the organization to be normalized. This set (Table 6) is further sorted in the decreasing order of number of components in the corresponding cluster in OrgDB. Depending on the objectives of normalization, the criterion to choose the representative organization varies. Currently, we choose the centroid string of the cluster with the largest number of publications as the normalized name, so that the normalized organization name is the name that is used most frequently. According to this criterion, Duke University Medical Center becomes the normalized name for Duke University Medical Center and Duke Clinical Research Institute.
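The pruning step and the choice of the normalized name can be sketched as follows. This is a simplified illustration under stated assumptions: sharing at least one PubMed ID is used as a proxy for collaboration, the PubMed IDs below are invented placeholder integers (not real identifiers or counts from our data), and the function names are hypothetical.

```python
def prune(candidates, component, pmids):
    """Keep only candidates sharing at least one PubMed ID with the component
    (assumed proxy for collaboration)."""
    component_pmids = set()
    for org in component:
        component_pmids |= pmids[org]
    return [org for org in candidates if pmids[org] & component_pmids]

def choose_normalized_name(component, publication_counts):
    """Pick the name with the most publications; sorting makes ties deterministic."""
    return max(sorted(component), key=lambda org: publication_counts[org])

# Invented placeholder PubMed IDs, for illustration only.
pmids = {
    "Duke University Medical Center": {1, 2, 3},
    "Duke Clinical Research Institute": {3, 4},
    "Durham Veterans Affairs Medical Center": {7},
    "Veterans Affairs Medical Center": {8, 9},
}
component = ["Duke University Medical Center", "Duke Clinical Research Institute"]
candidates = ["Durham Veterans Affairs Medical Center",
              "Veterans Affairs Medical Center"]

kept = prune(candidates, component, pmids)
counts = {org: len(ids) for org, ids in pmids.items()}
normalized = choose_normalized_name(component + kept, counts)
```

With these placeholder values, neither candidate shares a publication with the component, so both are pruned, and the name with the most publications is selected as the normalized name.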
As discussed above, we initially built the system for articles associated with USA organizations only. We then systematically extended our system to handle articles from more than 100 countries. A list of these countries is provided in Table 7. If an affiliation string is from USA, the component of NEMO built for USA is used; otherwise the component built for other countries is used. Thus, we have separate evaluations for: a1) determining whether the country is USA or not and extracting organizations and related entities from USA affiliation strings (stage 1); a2) normalization of organizations extracted from USA affiliation strings (stage 2); b1) determining the name of the country from the affiliation string (stage 3); b2) extracting organizations and related entities from all affiliation strings (stage 4); and b3) normalization of organizations extracted from all affiliation strings (stage 5). Once the corresponding component of NEMO is implemented, an evaluation is performed. Thus, rather than building the system all at once, the lifecycle of NEMO is divided into 5 separate stages. Each stage ends with an evaluation phase, which is equivalent to the testing phase in the software development life cycle. To make sure that our results are not high only because of over-fitting, we used different sets of affiliation strings for each of the different evaluations. Table 8 presents the results for the different stages of the evaluation process, assessed using precision, recall and f-score measures. Overall, the results (precision, recall and f-score) are excellent. Table 9 gives examples of false positives and false negatives for the NER stage. We did not quantitatively measure the recall of normalization. NEMO was often capable of normalizing organization names better than our human annotators, as demonstrated in Table 10 for USA organizations and in Table 11 for organizations from non-USA countries.
All the above discoveries were solely based on the organization information in OrgDB. Because we are using the sophisticated “connected components based recalculation” as opposed to the straightforward string similarity, we enjoy the advantage of discovering a richer set of synonyms than naïve approaches. For example, Harvard Medical School appeared in the synonym sets of most of its affiliates listed in http://hms.harvard.edu/admissions/default.asp?page=affiliates. Since Harvard Medical School has the highest rank in OrgDB, according to our chosen criterion of Normalization, all these organizations got automatically identified with it.
Evaluation methods: Each of our evaluations involved at least two expert data analysts (employees of Lnx Research, http://www.lnxpharma.com). Since the evaluation was done at 5 different time intervals over the past year, we did not necessarily have the same set of evaluators for each stage of evaluation. Each analyst judged the system’s output on each category as either true positive, false positive, false negative, or true negative. When there was a disagreement in a judgment, the analysts discussed it until all of them agreed. We believe that biasing the evaluators with the output of the system was reasonable because, even in cases where there are multiple correct outputs, it is acceptable for the system to generate one of the acceptable outputs. Creating a gold standard for the sake of evaluation would be much more expensive, especially in cases where there can be more than one correct output. We calculated the precision, recall and f-score values (34) using the true positive, false positive and false negative values for each category obtained after consensus of the analysts.
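For reference, the three measures can be computed from the consensus counts as in the following sketch. We assume here the standard balanced f-score (the harmonic mean of precision and recall); the counts in the usage example are arbitrary illustrative numbers, not values from Table 8.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and balanced f-score from consensus counts.
    Returns 0.0 for a measure whose denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Arbitrary illustrative counts (not taken from our evaluation).
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
```

Note that true negatives do not enter any of the three measures, which is why only the true positive, false positive and false negative counts are needed after consensus.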
Clustering for disambiguation: Since the centroid is the point that has the least sum of the squares of the distances from each vertex, we choose the organization that has the least sum of edit distances to every organization name in the cluster as the approximation for the centroid. While this training process has a time complexity of O(k*N), where k is the number of clusters (the same time complexity as k-means, a standard partitional clustering algorithm), identifying whether a new affiliation string belongs to one of these k clusters (dictionary matching phase [module 8 in Figure 1]) has a time complexity of only O(k). Compared to k-means, our algorithm has the major advantage that k need not be known a priori. We can obtain the same clusters using agglomerative hierarchical clustering (time complexity = O(N^2) >> O(k*N)) and splicing the dendrogram at a depth chosen to maintain the threshold of edit distance we fixed. In our approach, the value of the threshold of the edit distance needs to be known a priori.
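The centroid approximation described above (the member with the least sum of edit distances to all members, i.e., a medoid) can be sketched as follows. For brevity, this sketch uses a plain character-level Levenshtein distance as an assumed stand-in for the word-level metric used in TSS; the function names and the example cluster are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance (unit costs), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def approximate_centroid(names, distance=edit_distance):
    """Return the member with the least sum of distances to all members."""
    return min(names, key=lambda c: sum(distance(c, other) for other in names))

# Hypothetical cluster containing a misspelling and a longer variant.
cluster = ["Mayo Clinic", "Mayo Clinc", "Mayo Clinic Rochester"]
centroid = approximate_centroid(cluster)
```

In this hypothetical cluster the sums of distances are 11, 12 and 21 respectively, so “Mayo Clinic” is chosen. Because the centroid is always an actual member of the cluster, it can be reported directly as an organization name.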
Limitations: The main limitation of the organization name extraction component is the expansion of abbreviations using a glossary. Two-character and three-character abbreviations are usually very ambiguous. For example, in “Department of Experimental Therapy of ARS”, ‘ARS’ was initially being replaced with ‘Agricultural Research Service’. When we realized that ARS in this phrase refers to an organization in China, we deleted ARS from the glossary. We currently deal with ambiguous acronyms by not expanding them. In the next version of NEMO, we would expand each acronym differently based on the geopolitical location, such that ARS gets expanded to Agricultural Research Service when it is from USA, but not when it is from China. Secondly, the glossaries we used for concept extraction and the thesaurus we constructed during the training process of normalization are being corrected manually at regular intervals as new entries are automatically added through the use of NEMO by Lnx Research and others. This is labor-intensive and causes errors if there is a delay in correction. In the future, we plan to run NEMO on a much larger set of affiliation strings (we currently trained on about 100K), correct the glossaries, and stop expanding them.
Potential Impact: Identifying the organization associated with an article improves author disambiguation, as it is an important indicator of whether two authors with the same name are the same person. Author disambiguation has been used for diverse applications such as building social networks, normalizing gene names and analyzing collaborations. Farkas (35) was successful in using authors' information to improve the accuracy of a baseline gene normalization system from 80% to 97%. Large scale social network analysis of disambiguated author information is useful for finding key scientific leaders who are “low publishers” in scientific journals (36). The extracted and normalized organization names can be used for similar applications. A side application of our normalization process relates to the Seek Affiliation program implemented as part of the Medical Articles Record System (MARS) by NLM (23), which tries to solve the sub-problem of normalizing the affiliation string using string matching to correct errors made by OCR. This program achieved a precision of 86% and a recall of 88% on its test set of 519 articles; OCR errors arise because the affiliation strings are printed in a small font and often include superscripts in an even smaller font. Since NEMO accurately extracts different fields of affiliation strings and provides a rich synonym set for the extracted organization, it could potentially supplement the Seek Affiliation program through better matching. NEMO could also be used to automatically index the PubMed citations with the normalized organization name and country for improving information retrieval.
In this study, we presented NEMO (Normalization Engine for Matching Organizations) for extracting organization names from bibliographic databases such as PubMed. We constructed a process to automatically build a database of normalized organization names using affiliation strings in PubMed abstracts. This could be useful for the analysis of organization social networks and other large scale analyses of scientific organizations. The graphical user interface and Java client library of NEMO are available at http://lnxnemo.sourceforge.net .
Siddhartha Jonnalagadda and Philip Topham designed NEMO. Siddhartha Jonnalagadda implemented NEMO in Java and wrote the manuscript.
Attachment 1: Keywords_for_Addresses.xls
We sincerely thank Senthil Purushothaman, Ryan Peeler, Prachi Aghi, Surendar Swaminathan, Divya Shakti, Gus Cho, Shweta Dwivedi, and Debbie Gordon (Lnx Research) who participated in the evaluation of NEMO and Sogol Amjadi (Lnx Research) who diligently copy-edited the manuscript. Thanks also to Drs. Graciela Gonzalez (Arizona State University), Dina Demner-Fushman and George Thoma (National Library of Medicine), and João Cordeiro (University of Beira Interior, Portugal) for their valuable suggestions while building NEMO. This work was entirely funded intramurally by Lnx Research.
AI Artificial Intelligence
BBN Bolt Beranek and Newman
ESS Extended Smith-Waterman Score
GPE Geo Political Entity
MARS Medical Articles Record System
MeSH Medical Subject Headings
NER Named Entity Recognition
NEMO Normalization Engine for Matching Organizations (Name of the proposed system)
NLM National Library of Medicine
NSW Non Standard Word
TSS Tight String Similarity
URL Uniform Resource Locator
The authors declare that they have no competing interests.
1 We don’t distinguish between “facility name” [Examples: 1) Health Sciences West Bldg, 2) Room HSW1601 ] and “street address”, and refer to both of them as address.