The major component of the UMLS is the Metathesaurus, a repository of inter-related biomedical concepts. The two other knowledge sources in the UMLS are the Semantic Network, providing high-level categories used to categorize every Metathesaurus concept, and lexical resources including the SPECIALIST lexicon and programs for generating the lexical variants of biomedical terms. The Metathesaurus is the only resource presented in detail in this paper. Unless otherwise specified, the version described here is 2003AB (July 2003). The UMLS knowledge sources are updated quarterly.
Terminology of interest for bioinformaticists
Although the UMLS was not specifically developed for the needs of bioinformaticists, it includes terminologies used in bioinformatics. For example, recently integrated terminologies include the NCBI taxonomy, used for identifying organisms, and Gene Ontology, used for the annotation of gene products across various model organisms. The Metathesaurus also covers the biomedical literature with the MeSH, the controlled vocabulary used to index MEDLINE. Core subdomains such as anatomy, used across the spectrum of biomedical applications, are also represented in the Metathesaurus with the Digital Anatomist Symbolic Knowledge Base. Finally, the subdomain represented best is probably the clinical component of biomedicine, with general terminologies such as SNOMED® International (and soon SNOMED-CT®), Clinical Terms Version 3 and the International Classification of Diseases, to name a few. Clinical genetics resources include the Online Mendelian Inheritance in Man™ (OMIM™), represented in part, and the Online Multiple Congenital Anomaly/Mental Retardation (MCA/MR) Syndromes©. Other categories of terminologies in the Metathesaurus include specialized disciplines (e.g. nursing, psychiatry) and components of the clinical information system (e.g. diseases, drugs, procedures, adverse effects). Figure illustrates how the UMLS Metathesaurus, by integrating these various terminologies, can serve as a link between not only the vocabularies, but also the subdomains they represent.
The various subdomains integrated in the UMLS.
Terminology integration principles
In the UMLS, knowledge is organized by concept (i.e. meaning) (8). Synonymous terms are clustered together to form a concept and concepts are linked to other concepts by means of various types of relationships, resulting in a rich graph. Inter-concept relationships are either inherited from the structure of the source vocabularies or generated specifically by the editors of the Metathesaurus. Symbolic relationships can be hierarchical (e.g. ‘is a kind of’ or ‘isa’, ‘part of’) or associative (e.g. ‘location of’, ‘caused by’). Statistical relations between concepts from the MeSH vocabulary are also present, derived from the co-occurrence of MeSH indexing terms in MEDLINE citations. Finally, each Metathesaurus concept is broadly categorized by means of the semantic types (i.e. the 135 high-level categories found in the Semantic Network), assigned by the Metathesaurus editors.
Such a structure makes it easy for users to perform tasks such as:
(i) collecting the various terms used to name a concept;
(ii) extracting the relations of one concept to other concepts, either hierarchical or associative, symbolic or statistical; and
(iii) obtaining a set of concepts for a given category, using the list of concepts that were assigned a given semantic type.
More formally, synonymy is the lexical relation used to cluster biomedical terms into concepts. Hyponymy (‘isa’) and meronymy (‘part of’) relations provide the hierarchical framework on which the concepts are organized. Associative relations, including co-occurrence relations, extend this framework laterally, providing links across various subdomains. The categorization by semantic type may appear redundant with some of the hierarchical relations. In practice, however, because this categorization is independent of the structure of the source vocabularies, it provides a simple and stable means of semantic orientation in the Metathesaurus.
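The concept-centred organization described above, and the three tasks it supports, can be illustrated with a minimal Python sketch. This is a toy in-memory model, not the Metathesaurus distribution format or any UMLS API: the class names, the method names, the identifiers `X1` to `X3` and the sample terms and relations are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A cluster of synonymous terms with its semantic types."""
    cid: str                                   # hypothetical identifier
    terms: set = field(default_factory=set)
    semantic_types: set = field(default_factory=set)


class ConceptGraph:
    """Toy model: concepts as nodes, typed relationships as edges."""

    def __init__(self):
        self.concepts = {}
        # source concept id -> list of (relation, kind, target concept id)
        self.relations = defaultdict(list)

    def add_concept(self, cid, terms, semantic_types):
        self.concepts[cid] = Concept(cid, set(terms), set(semantic_types))

    def relate(self, source, relation, target, kind):
        # kind is 'hierarchical', 'associative' or 'statistical'
        self.relations[source].append((relation, kind, target))

    def terms_for(self, cid):
        """Task (i): collect the terms used to name a concept."""
        return sorted(self.concepts[cid].terms)

    def related(self, cid, kind=None):
        """Task (ii): extract relations, optionally filtered by kind."""
        return [(rel, tgt) for rel, k, tgt in self.relations[cid]
                if kind is None or k == kind]

    def concepts_of_type(self, semantic_type):
        """Task (iii): list the concepts assigned a given semantic type."""
        return sorted(c.cid for c in self.concepts.values()
                      if semantic_type in c.semantic_types)


# Illustrative data only; not extracted from the Metathesaurus.
graph = ConceptGraph()
graph.add_concept("X1",
                  ["Addison's disease", "Addison disease",
                   "Primary adrenal insufficiency"],
                  ["Disease or Syndrome"])
graph.add_concept("X2", ["Adrenal gland diseases"], ["Disease or Syndrome"])
graph.add_concept("X3", ["Adrenal gland", "Suprarenal gland"],
                  ["Body Part, Organ, or Organ Component"])
graph.relate("X1", "isa", "X2", kind="hierarchical")
graph.relate("X3", "location_of", "X1", kind="associative")

print(graph.terms_for("X1"))
print(graph.related("X1", kind="hierarchical"))
print(graph.concepts_of_type("Disease or Syndrome"))
```

Keeping the semantic types on the concept itself, rather than deriving them from the ‘isa’ edges, mirrors the point made above: the categorization remains usable even when the hierarchies inherited from the source vocabularies differ or change.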
Biomedical terminologies often contain more information than the mere terms and their inter-relations. Besides definitions, additional information may include cross-references, either internal (e.g. ‘See also…’ in MeSH, treated as associative relations) or external (i.e. cross-references to other terminologies or databases). In most cases, this information is represented in the Metathesaurus. For example, MeSH supplementary concept records include many proteins for which a GenBank identifier is provided. Similarly, whenever it is relevant, concepts from the Online MCA/MR Syndromes point to the related diseases in MeSH and OMIM, even when the corresponding OMIM concept is not in the Metathesaurus. Examples of such cross-references are provided below.