OMIM is one of the best-known and most widely used resources for detailed information about human genetic diseases (11). We were initially drawn to OMIM because it is familiar to our users and its data are indexed with NCBI Gene records, providing a wealth of genetic disease terms that could be easily integrated into CTD via shared gene accession identifiers (IDs). OMIM, however, is a flat list of different concepts (phenotypes, genes, phenotypes without genes, genes with phenotypes, etc.), which does not provide connections between similar diseases. For example, a query at OMIM with ‘breast cancer’ retrieves ‘BREAST CANCER’ (OMIM:114480) annotated to 21 genes, as well as ‘BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 3’ (OMIM:613399) annotated to one gene not currently associated with the ‘BREAST CANCER’ record. For CTD, we needed a way for our users to come to one umbrella term (e.g. breast neoplasms) and find information associated with individual and related diseases. While OMIM efficiently catalogs genetic diseases corresponding to mutations, CTD is also interested in environmental diseases, which are not necessarily associated with gene mutations, so we required a vocabulary that included non-genetic disorders as well.
We also needed a way to allow users to navigate between broad and specific disease levels. For example, instead of selecting data exclusively for ‘ALZHEIMER DISEASE’ (OMIM:104300), a CTD user might want a broader perspective for all neurodegenerative diseases, including Alzheimer, Parkinson and Lou Gehrig diseases. The flat OMIM structure does not provide a way to view aggregate information from such higher levels.
OMIM contains a mixture of different types of information, identifiable by a character prefix in front of the record ID. Since we wanted to avoid using OMIM gene pages as part of our disease vocabulary, our initial mapping excluded all OMIM records prefixed with an asterisk, which identifies records for gene descriptions. We only collected records prefaced with a number sign (# phenotype description, molecular basis known), a percent sign (% phenotype description, molecular basis unknown), a plus sign (+ gene and phenotype combined) or no symbol (phenotype description, Mendelian basis not clearly established). We also excluded deleted OMIM records, identifiable by a caret symbol, as well as terms that appear to describe a trait rather than a disease, such as ‘BLOOD GROUP, P SYSTEM’ (OMIM:111400).
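The prefix-based selection described above can be illustrated with a short sketch. This is not CTD's actual code: the function names, the representation of an entry as a prefixed string (e.g. '#114480') and the sample prefixes used in the usage note are assumptions made for illustration only.

```python
from typing import Tuple

# Prefixes collected for the disease vocabulary, as described in the text.
INCLUDED_PREFIXES = {
    "#": "phenotype description, molecular basis known",
    "%": "phenotype description, molecular basis unknown",
    "+": "gene and phenotype combined",
    "": "phenotype description, Mendelian basis not clearly established",
}

# Prefixes excluded from the disease vocabulary.
EXCLUDED_PREFIXES = {
    "*": "gene description",
    "^": "deleted record",
}

# Hypothetical curated exclusion list for traits; the text names
# 'BLOOD GROUP, P SYSTEM' (OMIM:111400) as one such term.
TRAIT_IDS = {"111400"}


def split_prefix(raw_entry: str) -> Tuple[str, str]:
    """Split an entry such as '#114480' into (prefix, mim_number)."""
    raw_entry = raw_entry.strip()
    if raw_entry and not raw_entry[0].isdigit():
        return raw_entry[0], raw_entry[1:]
    return "", raw_entry


def is_disease_candidate(raw_entry: str) -> bool:
    """True if the entry has an included prefix and is not a listed trait."""
    prefix, mim_number = split_prefix(raw_entry)
    return prefix in INCLUDED_PREFIXES and mim_number not in TRAIT_IDS
```

For instance, under these assumptions `is_disease_candidate("#114480")` is true, whereas any entry written with a leading asterisk or caret is rejected regardless of its number.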
To streamline its initial creation, MEDIC only included OMIM terms that were associated with an NCBI Gene accession ID. Since its inception, MEDIC has been updated by including new OMIM records as they are assigned new gene annotations.
MeSH is a controlled vocabulary thesaurus composed of over 26 000 primary terms that are used to index and annotate scientific abstracts in MEDLINE (7). Currently, the MeSH hierarchy is divided into 16 branches. The ‘Diseases’ [C] branch of MeSH, like other branches, is structured as a hierarchy that can be navigated between broad and specific terms (14). Hierarchies are extremely valuable in curation, as they allow associated data to be viewed at various levels of granularity, with data annotated to the children of a branch aggregated at each higher level of the hierarchy. As an indexing source at PubMed, MeSH provides an efficient way to triage the literature for specific articles to be used in disease curation. However, MeSH does not include genes that are known to be associated with its disease terms, it is deficient in many detailed diseases (especially complex syndromes), and it contains some idiosyncrasies that present challenges to data navigation and analysis. For example, ‘Autistic Disorder’ (MESH:D001321) is not a child in the ‘Diseases’ [C] branch, but rather maps to the ‘Psychiatry and Psychology’ [F] branch. As such, CTD would need to include both the entire ‘Diseases’ [C] branch (and its supplementary concept terms) and the [F03] ‘Mental Disorders’ (MESH:D001523) sub-branch since our users would expect autism spectrum disorders (and other mental disorders) to be listed in a manner similar to other diseases.
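To make the idea of aggregating annotations up a hierarchy concrete, the following minimal sketch rolls direct gene annotations up to every ancestor term. The hierarchy is a deliberately simplified toy, not the actual MeSH tree, and the gene symbols are placeholders; all names in the code are assumptions for illustration.

```python
from collections import defaultdict

# Toy child -> parents map. A term may have several parents, as in MeSH,
# but this is NOT the real MeSH tree.
PARENTS = {
    "Alzheimer Disease": ["Neurodegenerative Diseases"],
    "Parkinson Disease": ["Neurodegenerative Diseases"],
    "Neurodegenerative Diseases": ["Nervous System Diseases"],
    "Nervous System Diseases": [],
}

# Placeholder direct annotations: disease term -> set of gene symbols.
DIRECT = {
    "Alzheimer Disease": {"GENE_A", "GENE_B"},
    "Parkinson Disease": {"GENE_B", "GENE_C"},
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def aggregate(direct, parents):
    """Roll each term's annotations up to the term itself and its ancestors."""
    rolled = defaultdict(set)
    for term, genes in direct.items():
        rolled[term] |= genes
        for ancestor in ancestors(term, parents):
            rolled[ancestor] |= genes
    return dict(rolled)
```

In this toy example, a query at 'Neurodegenerative Diseases' would see the union of the genes annotated to both child terms, which is the kind of broader view a flat list cannot offer.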
For CTD's needs, we wanted to take advantage of both disease vocabularies: the familiarity and immediate genetic data offered by OMIM terms associated with NCBI Gene IDs, combined with the navigation utility and PubMed indexing feature of MeSH terms. An obvious solution was to create a merged vocabulary that integrated both OMIM and MeSH disease terms. In December 2006, two CTD biocurators spent three weeks manually reviewing, integrating and merging the appropriate OMIM disease terms (see above) into the MeSH disease hierarchy using a spreadsheet to form the basis of MEDIC.
MEDIC is updated on a monthly basis, and is freely available to download in a variety of formats from CTD (Figure 1). As of October 2011, MEDIC contains 9706 unique diseases (plus 58 074 disease synonyms), composed of 6197 primary MeSH terms and IDs, 1845 primary OMIM terms and IDs (made leaves of MeSH terms) and 1664 MeSH terms that contain 2593 OMIM terms merged to them (Figure 2).
Figure 1. MEDIC is freely available from CTD. To obtain the most recent version of MEDIC, use the ‘Downloads’ menu tab. The vocabulary can be downloaded in various formats including CSV, TSV (red circle and inset), XML and OBO. We encourage other ...
Figure 2. Components of MEDIC. As of October 2011, MEDIC contained 9706 unique disease primary terms and 58 074 synonyms. It includes 6197 MeSH primary terms, 1845 OMIM primary terms (as leaf nodes) and 1664 MeSH primary terms (that have 2593 OMIM primary ...
By combining the primary terms, synonyms and IDs from both OMIM and MeSH into a single resource, MEDIC becomes a flexible solution that can be mapped to other disease vocabularies or ontologies. For example, the current version of the DO also includes some terms, synonyms and IDs from OMIM, MeSH and SNOMED-CT, allowing groups that use the DO to migrate to MEDIC via term and ID mapping. Conversely, groups that start out by adopting MEDIC will have the flexibility to migrate to a more robust DO or other disease vocabulary in the future by similar term and ID mapping. Data management tools such as the interactive Ontology Lookup Service could help streamline and enhance the cross-platform analysis and mapping of these shared vocabularies (15).
MEDIC mapping guidelines
The MeSH disease hierarchy is used as the backbone of MEDIC, with OMIM terms either merged to a MeSH term or added as a leaf (child) to one or more MeSH terms. Where the same disease is represented in OMIM and MeSH, the OMIM name, synonyms and ID all become synonyms of the equivalent MeSH term. This fusion gives our users more power to query diseases at CTD. OMIM primary terms and synonyms are kept in their capitalized format on CTD web display, thereby allowing biocurators and users to readily distinguish between OMIM and MeSH terms.
We used the following guidelines in our manual mapping of OMIM terms to MeSH terms in the initial construction of MEDIC. In our analysis, we considered a number of factors, including: the semantic similarity of the OMIM disease term to a MeSH term as determined by the biocurator (e.g. OMIM ‘LUNG CANCER’ is similar to MeSH ‘Lung Neoplasms’), OMIM synonyms, the disorders described in the OMIM report, its accompanying cited literature and the MeSH terms annotated to its cited literature.
- An OMIM primary term is either merged directly to the most appropriate MeSH term or else is made a leaf (child) of one or more MeSH terms. Example: ‘LUNG CANCER’ (OMIM:211980) is merged to ‘Lung Neoplasms’ (MESH:D008175), while ‘MYELOPROLIFERATIVE DISORDER, CHRONIC, WITH EOSINOPHILIA’ (OMIM:131440) is made a leaf of two terms: ‘Myeloproliferative Disorders’ (MESH:D009196) and ‘Eosinophilia’ (MESH:D004802). An individual OMIM term cannot be both merged to and made a leaf of MeSH terms. An OMIM term cannot be made the leaf of another OMIM term.
- If an OMIM disease term uses the word ‘susceptibility’ in its name, then that term is merged to the MeSH disease term that is concordant with the core name of the OMIM term. Example: ‘ASTHMA, SUSCEPTIBILITY TO’ (OMIM:600807) is merged to ‘Asthma’ (MESH:D001249). However, if the OMIM ‘susceptibility’ term is a complex of different diseases that do not match a single MeSH term, the OMIM term should be added as a leaf beneath all the appropriate MeSH terms. Example: ‘BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 2’ (OMIM:612555) is added as a leaf node beneath both ‘Breast Neoplasms’ (MESH:D001943) and ‘Ovarian Neoplasms’ (MESH:D010051).
- If an OMIM primary term uses a phrase describing heritability (e.g. ‘hereditary’, ‘autosomal’, ‘X-linked’, etc.), then the term is added as a leaf beneath the most appropriate MeSH term(s). Example: ‘DEAFNESS, AUTOSOMAL DOMINANT 12’ (OMIM:601543) is added beneath ‘Deafness’ (MESH:D003638).
- If an OMIM primary term uses a numeral, then it is merged to the concordant MeSH term. Example: ‘SCHIZOPHRENIA 12’ (OMIM:608543) is merged to ‘Schizophrenia’ (MESH:D012559).
- If an OMIM primary term uses the word ‘type’, then the term is added as a leaf beneath the most appropriate MeSH term(s). Example: ‘SYNDACTYLY, TYPE 1’ (OMIM:185900) is added beneath ‘Syndactyly’ (MESH:D013576).
- For OMIM primary terms that describe syndromes, the biocurator first checks to see if that same syndrome exists in MeSH, and if it does, then the OMIM term is merged to the MeSH term. Example: ‘CHROMOSOME 5q DELETION SYNDROME’ (OMIM:153550) is merged to ‘5q- syndrome’ (MESH:C535323). If the OMIM syndrome is not in MeSH, then the OMIM term will become a leaf beneath one or more MeSH terms. Example: ‘ALOPECIA-MENTAL RETARDATION SYNDROME 2’ (OMIM:610422) is a leaf to both ‘Alopecia’ (MESH:D000505) and ‘Intellectual Disability’ (MESH:D008607).
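The structural constraints stated in the guidelines above (an OMIM term is either merged to a MeSH term or made a leaf of one or more MeSH terms, never both, and never a leaf of another OMIM term) can be expressed as a small data structure. The sketch below is illustrative only: the class and method names are hypothetical, the choice of which rule applies remains a biocurator decision that the code does not attempt to automate, and allowing only one merge target per OMIM term is our reading of the first guideline rather than a stated rule.

```python
class MedicSketch:
    """Toy container enforcing the merge-or-leaf constraints from the text."""

    def __init__(self, mesh_terms):
        # mesh_terms: dict mapping MeSH ID -> MeSH primary name
        self.mesh_terms = dict(mesh_terms)
        self.synonyms = {mesh_id: set() for mesh_id in self.mesh_terms}
        self.merged = {}  # OMIM ID -> MeSH ID it was merged to
        self.leaves = {}  # OMIM ID -> set of parent MeSH IDs

    def _require_mesh(self, mesh_id):
        # Parents and merge targets must be MeSH terms, so an OMIM term
        # can never become the leaf of another OMIM term.
        if mesh_id not in self.mesh_terms:
            raise ValueError(f"{mesh_id} is not a known MeSH term")

    def _require_unmapped(self, omim_id):
        if omim_id in self.merged or omim_id in self.leaves:
            raise ValueError(f"{omim_id} is already mapped")

    def merge(self, omim_id, omim_name, mesh_id, omim_synonyms=()):
        """Merge: the OMIM name, synonyms and ID become MeSH synonyms."""
        self._require_mesh(mesh_id)
        self._require_unmapped(omim_id)
        self.merged[omim_id] = mesh_id
        self.synonyms[mesh_id].update([omim_name, omim_id, *omim_synonyms])

    def add_leaf(self, omim_id, parent_mesh_ids):
        """Leaf: the OMIM term stays discrete beneath one or more MeSH terms."""
        parents = set(parent_mesh_ids)
        if not parents:
            raise ValueError("a leaf needs at least one MeSH parent")
        for mesh_id in parents:
            self._require_mesh(mesh_id)
        self._require_unmapped(omim_id)
        self.leaves[omim_id] = parents
```

Using examples from the guidelines, `merge("OMIM:211980", "LUNG CANCER", "MESH:D008175")` would record the OMIM name and ID as synonyms of ‘Lung Neoplasms’, while `add_leaf("OMIM:131440", ["MESH:D009196", "MESH:D004802"])` would place the chronic myeloproliferative disorder beneath both parents; a later attempt to merge OMIM:131440 would raise an error.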
Updating and maintaining MEDIC
MEDIC is updated by CTD on a monthly basis. Since both OMIM and MeSH are constantly refining their own respective databases, it is inevitable that MEDIC will fall out of synchronization from time to time. To ensure the continued completeness and high quality of MEDIC, we implemented a two-tiered quality control process.
From CTD’s perspective, the completeness of the MEDIC vocabulary is defined by its ability to capture OMIM-to-gene associations. To that end, we run a quarterly process that reads through the latest OMIM ‘mim2gene’ file and attempts to identify diseases that do not currently exist in MEDIC either as a discrete or merged term. All OMIM diseases are candidates for inclusion, with the exception of OMIM entries that are designated as no longer existing (i.e. caret prefix) and those designated as genes of known sequence (i.e. asterisk prefix). As the process reads through the ‘mim2gene’ file, if an OMIM disease is encountered that is not accounted for in MEDIC (and is considered valid for inclusion in CTD as defined above), it is checked against a list of OMIM terms that CTD has been unable to match to a MeSH term in the past (e.g. traits such as ‘BLOOD GROUP, P SYSTEM’). If the disease is not contained in the unmatched list, it is included in a report for CTD biocurators to review as the basis for entry of new terms into MEDIC.
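The completeness check can be sketched as a simple filter. The code below is an illustration under stated assumptions, not CTD's implementation: it assumes the ‘mim2gene’ data have already been parsed into (MIM number, prefix, gene ID) tuples, which is a hypothetical layout and not the actual file format, and it assumes that entries without a gene ID are skipped, in keeping with MEDIC's inclusion of only gene-associated OMIM terms.

```python
# Prefixes that disqualify an entry: caret (no longer existing) and
# asterisk (gene of known sequence), as described in the text.
EXCLUDED_PREFIXES = {"^", "*"}


def report_new_omim_diseases(rows, medic_omim_ids, unmatched_ids):
    """Return MIM numbers that biocurators should review for MEDIC.

    rows           -- iterable of (mim_number, prefix, gene_id) tuples;
                      hypothetical pre-parsed layout, gene_id may be None
    medic_omim_ids -- MIM numbers already in MEDIC (discrete or merged)
    unmatched_ids  -- MIM numbers previously found unmatchable to MeSH
    """
    report = []
    seen = set()
    for mim_number, prefix, gene_id in rows:
        if prefix in EXCLUDED_PREFIXES:
            continue  # deleted record or gene page
        if gene_id is None:
            continue  # assumption: only gene-associated entries qualify
        if mim_number in medic_omim_ids or mim_number in unmatched_ids:
            continue  # already handled, or known to be unmatchable
        if mim_number not in seen:
            seen.add(mim_number)  # report each disease once
            report.append(mim_number)
    return report
```

Keeping the unmatched list as a separate input means that terms already reviewed and rejected (such as traits) do not reappear in every report.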
The most recent MeSH and OMIM vocabularies are loaded from their respective databases to CTD each month. To ensure that MEDIC is synchronized with any changes in these vocabularies, CTD biocurators are notified of all disease name changes (whether by MeSH or OMIM) for all mapped terms. This notification is determined by computationally comparing the disease names that were used when the OMIM–MeSH mappings were originally made to the name of the disease in the most recent monthly download. The biocurators research the definitions of the terms in this list to determine if the semantics of the disease (and therefore potentially its association in MEDIC) have changed. Changes in accessions and/or dropped terms are also checked to ensure that they are properly addressed each month.
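The monthly name comparison can likewise be sketched in a few lines. This is a minimal illustration, assuming that the names recorded at mapping time and the names in the latest download are each available as a dictionary keyed by term ID; the function name and the IDs in the usage note are hypothetical placeholders.

```python
def find_name_changes(names_at_mapping, current_names):
    """Compare names stored at mapping time with the latest download.

    Both arguments map a term ID (MeSH or OMIM) to its primary name.
    Returns (changed, dropped): `changed` lists (term_id, old, new) tuples
    for biocurator review, and `dropped` lists IDs that are no longer
    present in the latest download.
    """
    changed = []
    dropped = []
    for term_id, old_name in sorted(names_at_mapping.items()):
        new_name = current_names.get(term_id)
        if new_name is None:
            dropped.append(term_id)
        elif new_name != old_name:
            changed.append((term_id, old_name, new_name))
    return changed, dropped
```

For example, given `{"EX:1": "Old Name", "EX:2": "Same Name"}` at mapping time and `{"EX:1": "New Name", "EX:2": "Same Name"}` in the current download, only EX:1 would be flagged for review. The comparison is deliberately exact; whether a renamed term has also changed in meaning is left to the biocurators.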
We have not yet resolved all quality control issues, including, for example, when OMIM changes the character prefix for an OMIM ID. This change can sometimes result in a phenotype report now becoming a gene page (identifiable by an asterisk), something we exclude from MEDIC. We are working on ways to identify and resolve such records in MEDIC. Even with its limitations, however, MEDIC has been a practical vocabulary to implement at CTD in the absence of a more formal, stable, and mature disease ontology.