Distribution of itemized entries and mapped phrases based on the UMLS
There are 36m itemized entries extracted from 14.7m documents that contain summary level data, an average of 2.43 entries per document. The number of unique itemized entries is 9.16m, with an average of 3.93 occurrences per entry, and about 7.4m entries occur only once in the corpus. Entries with over 100K occurrences in our corpus are shown in the second column of the table of most frequent itemized entries (rounded to the nearest thousand). There are 170m occurrences of mapped phrases, corresponding to 164k unique normalized phrases. The last column of the same table shows the total number of occurrences of the corresponding normalized phrases in the corpus (i.e., including those that occurred alone as itemized entries and those that co-occurred with other phrases). The accompanying figure shows the occurrence statistics of itemized entries and of mapped phrases. The x-axis represents 16 occurrence groups: group 1 and group 2 contain the items occurring once and twice, respectively, followed by the groups [2^i + 1, 2^(i+1)] for i from 1 to 13, and the last group contains those occurring more than 2^14 = 16,384 times. The y-axis is the base-2 logarithm of the number of itemized entries or mapped phrases. The figure shows that the distribution of itemized entries follows Zipf’s law almost perfectly (R² is close to 1). The distribution of mapped phrases also follows Zipf’s law.
Most frequent itemized entries in summary level data.
Statistics of itemized entries and the number of mapped phrases.
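The occurrence grouping used for the x-axis can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes the middle groups are the power-of-two ranges [2^i + 1, 2^(i+1)] for i = 1..13, and the names occurrence_group, group_histogram, and linear_fit_r2 are hypothetical. The least-squares fit is only one plausible way to check how close the log-scaled group sizes are to a straight line; the text does not say how R² was computed.

```python
from collections import Counter


def occurrence_group(count: int) -> int:
    """Map an occurrence count to a group index from 1 to 16."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    if count <= 2:
        return count  # groups 1 and 2: items seen once or twice
    if count > 2 ** 14:
        return 16  # last group: more than 16,384 occurrences
    # For count in [2^i + 1, 2^(i+1)], (count - 1).bit_length() == i + 1,
    # and that range is group i + 2.
    return (count - 1).bit_length() + 1


def group_histogram(counts):
    """Count how many items (entries or phrases) fall into each group."""
    return Counter(occurrence_group(c) for c in counts)


def linear_fit_r2(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Toy occurrence counts (made up for illustration, not corpus data).
toy_counts = [1, 1, 1, 2, 3, 4, 5, 8, 20000]
toy_histogram = group_histogram(toy_counts)
```

To reproduce a plot of this kind, one would take the base-2 logarithm of each group's size and pass the group indices and log values to linear_fit_r2.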
Distribution of SNOMED-CT phrases in the corpus
There are 199,720 normalized UMLS phrases that have corresponding SNOMED concepts. Among them, 99,261 (49.7%) occurred at least once in our corpus. The most frequent phrases are various qualifiers, including “history of”, “right”, “after”, “status post”, and “left”. The most frequent finding is “pain” and the most frequent disorder is “hypertension”. The table of SNOMED semantic tag statistics lists the number of mapped phrases (column 2) and their average number of occurrences (column 3) for SNOMED-CT semantic tags with at least 500 normalized phrases. For example, the second row indicates that there are 44,116 normalized disorder phrases with an average of 1,221 occurrences each. Disorders, findings, and procedures have the largest numbers of normalized phrases, while qualifier value and attribute phrases have the highest average number of occurrences in the corpus.
Distribution statistics of SNOMED Semantic Tags.
SNOMED-CT coverage statistics
There are 68.3m tokens in our corpus and 56.8% of them are covered by mapped phrases. Most of the tokens that failed to be mapped (corresponding to 88k unique tokens) are prepositions, conjunctions, or numbers (e.g., “by”, “and”, “with”, “of”, “the”). Over half of the 88k unique tokens occurred fewer than three times and are potentially typos (see the token distribution figure). We notice that a significant number of words are clinically relevant but appear in adjective form (e.g., “diarrheal”, “dystrophic”, “diabetic”, “mycotic”, “neuropathic”, “posttraumatic”, or “premenopausal”). When their associated concepts are not included in SNOMED-CT, they are considered unmapped. For example, the phrases “posttraumatic arthritis” and “diabetic ulcerations” are not included in SNOMED-CT; the best mappings found for them are “arthritis” and “ulcerations”, so the adjectives are counted as unmapped tokens for the corresponding entries. Creating a dictionary that maps adjectival forms to the noun forms present in SNOMED-CT could resolve these mappings. There are 19.5k unique unmapped tokens, corresponding to a total of 2.45m tokens, that end with one of three popular adjective suffixes (“ic”, “al”, or “ive”). Additionally, some of the unmapped tokens are synonyms of known terms and will require manual curation to map them to the correct codes.
Token distribution in the corpus. Mapping of distribution of total tokens to the distribution of unique tokens is shown using dashed lines.
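The suffix tally described above can be approximated with a simple string check. The sketch below is an assumption about how such a count could be made, not the procedure used in the study; the function name suffix_stats and the toy counts are hypothetical.

```python
from collections import Counter

# The three adjective suffixes named in the text.
ADJECTIVE_SUFFIXES = ("ic", "al", "ive")


def suffix_stats(unmapped_token_counts, suffixes=ADJECTIVE_SUFFIXES):
    """Return (unique tokens, total occurrences) for tokens ending in a suffix."""
    unique = 0
    total = 0
    for token, count in unmapped_token_counts.items():
        # str.endswith accepts a tuple and matches any of its members.
        if token.lower().endswith(suffixes):
            unique += 1
            total += count
    return unique, total


# Toy counts of unmapped tokens (made up for illustration, not corpus data).
toy_unmapped = Counter(
    {"diabetic": 5, "mycotic": 2, "premenopausal": 1, "with": 9, "pame": 3}
)
toy_unique, toy_total = suffix_stats(toy_unmapped)
```

A pure suffix match also catches non-adjectives that happen to end in these letters, so a count obtained this way is an upper bound on the number of adjectival forms and would still need review before building an adjective-to-noun dictionary.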
After ignoring stop words or non-functional words, 5.55m (60.5%) unique entries, corresponding to a total of 28.9m (80.3%) itemized entries, can be mapped to a set of SNOMED-CT concepts. If we allow one unmapped token per entry, 8.11m (88.5%) unique entries, corresponding to a total of 34.5m (96%) itemized entries, can be mapped. We notice that 32k unique entries, corresponding to 306k itemized entries, could not be mapped to any SNOMED-CT code. The most frequent entry with no code is the string “PAME”, which stands for “pre-anaesthetic medical evaluation”.
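The entry-level rule, with and without a tolerance of one unmapped token, can be illustrated as follows. This is a minimal sketch under stated assumptions: each entry is already tokenized, a per-token flag says whether the token is covered by a mapped phrase, and the stop-word list is a small illustrative set rather than the list used in the study. The names unmapped_tokens and entry_is_mapped are hypothetical.

```python
# Illustrative stop words only; the study's actual list is not given.
STOP_WORDS = frozenset({"by", "and", "with", "of", "the"})


def unmapped_tokens(tokens, covered_flags, stop_words=STOP_WORDS):
    """Tokens that are neither covered by a mapped phrase nor stop words."""
    return [
        token
        for token, covered in zip(tokens, covered_flags)
        if not covered and token.lower() not in stop_words
    ]


def entry_is_mapped(tokens, covered_flags, max_unmapped=0, stop_words=STOP_WORDS):
    """True if the entry has at most max_unmapped uncovered content tokens."""
    return len(unmapped_tokens(tokens, covered_flags, stop_words)) <= max_unmapped


# Toy entry: the coverage flags are assumed for illustration.
tokens = ["posttraumatic", "arthritis", "of", "the", "knee"]
covered = [False, True, False, False, True]

strict = entry_is_mapped(tokens, covered)                   # no tolerance
lenient = entry_is_mapped(tokens, covered, max_unmapped=1)  # one token allowed
```

In this toy entry only “posttraumatic” is left over, so the entry fails the strict rule but passes once a single unmapped token is tolerated.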
Compositional level statistics
The composition-level table and figure show the number of SNOMED-CT normalized phrases needed to represent each itemized entry. Most of the entries can be represented by one to three SNOMED-CT normalized phrases. There are 565k unique entries, corresponding to a total of 16,522k itemized entries with an average of 29.22 occurrences per unique entry, that can be represented using one SNOMED code. 83% of the entries were mapped to three or fewer concepts. The proportion of phrases that can be encoded with three or fewer concepts would be much larger, given that many itemized entries in the problem lists consist of several phrases. A limitation of our composition analysis is that we have not considered the post-coordination rules described in SNOMED-CT, but have simply combined the concepts found by the dictionary lookup.
Statistics of the composition level for the entries.
Compositional statistics. The x-axis shows the number of SNOMED-CT phrases needed to encode an entry. The y-axis is the number of unique entries and the total number of itemized entries.
There are a total of 4.03m pairs of concepts that co-occur at least 100 times in the corpus. Only a very small portion (16,500 out of 1.44m, or 1.14%) of the pairs from the relationship table are found among those 4.03m pairs. Note that the actual coverage of SNOMED-CT relationships for co-occurring pairs can be much higher than 1.14%, since certain relationships can be obtained through ontological propagation. After filtering out pairs with χ² scores below 10,000, 0.86m pairs remain, including 14,499 (87.9%) of the 16,500 pairs from the SNOMED-CT relationship table. We manually examined the top-ranked pairs and found that they are semantically related. For example, the procedure “Tylectomy” and the disorder “Intraductal carcinoma in situ of breast” have a χ² score of 217,318. When concepts co-occur significantly in itemized entries, this can indicate novel relationships among those concepts, since physicians tend to group closely related problems into a single itemized entry.
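For readers who want to reproduce a score of this kind, the sketch below computes a Pearson chi-square statistic for one concept pair. The text does not state exactly how the scores were computed, so this assumes a 2x2 contingency table over itemized entries (entry contains concept A or not, by entry contains concept B or not) with no continuity correction. The function name chi_square_2x2 and the toy counts are hypothetical.

```python
def chi_square_2x2(n_both, n_a, n_b, n_total):
    """Pearson chi-square for the co-occurrence of two concepts.

    n_both  : entries containing both concepts
    n_a     : entries containing concept A
    n_b     : entries containing concept B
    n_total : all entries considered
    """
    a = n_both                         # A and B
    b = n_a - n_both                   # A without B
    c = n_b - n_both                   # B without A
    d = n_total - n_a - n_b + n_both   # neither
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a concept present in all or none of the entries
    return n_total * (a * d - b * c) ** 2 / denominator


# Toy counts (made up for illustration, not corpus data).
associated = chi_square_2x2(n_both=10, n_a=20, n_b=20, n_total=100)
independent = chi_square_2x2(n_both=4, n_a=20, n_b=20, n_total=100)
```

With corpus-scale totals the statistic grows with the sample size, which is consistent with a cutoff as large as 10,000 still retaining 0.86m pairs.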
One limitation of the study is that we excluded terms with more than 10 words, fewer than three letters, or more than 100 letters during the dictionary lookup. Therefore, our study does not account for one- or two-letter terms (mostly abbreviations) or very long phrases. One- and two-letter terms are highly ambiguous under our dictionary lookup procedure and are not easy to disambiguate. Another limitation is that we use mapped phrases instead of mapped concepts for the coverage statistics. Because one string can be mapped to several concepts in SNOMED-CT (most of the time, related concepts), it is sometimes infeasible or unrealistic to map one phrase to one SNOMED-CT code.
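The term filter described above amounts to a small predicate. The sketch below is illustrative only: it assumes that words are whitespace-separated and that “letters” means alphabetic characters, neither of which is specified in the text, and the name keep_term is hypothetical.

```python
def keep_term(term, max_words=10, min_letters=3, max_letters=100):
    """True if a dictionary term passes the length limits described in the text."""
    word_count = len(term.split())
    # Assumption: "letters" counts alphabetic characters only.
    letter_count = sum(ch.isalpha() for ch in term)
    return word_count <= max_words and min_letters <= letter_count <= max_letters


# Toy dictionary terms for illustration.
candidates = ["MI", "pain", "status post"]
kept = [term for term in candidates if keep_term(term)]
```

Under these assumptions a two-letter abbreviation such as “MI” is dropped before lookup, which is the behaviour the limitation refers to.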
The list of adjectival forms that were not mapped to SNOMED-CT can be used to improve the sensitivity of automated mapping tools. The concept co-occurrence method may be useful for discovering concept relations to augment SNOMED-CT. Overall, the statistics presented in this paper should help researchers enhance the use of SNOMED-CT for coding summary level clinical information.