We use data from the Stanford Clinical Data Warehouse (STRIDE) to extract drug indications (and co-morbidities) from the clinical record. STRIDE contains EHRs for 1.6 million patients, 15 million encounters, 25 million coded ICD9 diagnoses, and a combination of pathology, radiology, and transcription reports totaling over 9.5 million unstructured clinical notes over a period of 17 years. After filtering out patients to satisfy HIPAA requirements (e.g., rare diseases, celebrity cases, mental health), we annotated 9,078,736 notes for 1,044,979 patients. The gender split is roughly 60% female, 40% male. Ages range from 0 to 90 (adjusted to satisfy HIPAA requirements), with an average age of 44 and standard deviation of 25.
For evaluation, we also use drug indication data from the Medi-Span® (Wolters Kluwer Health, Indianapolis, IN) Drug Indications Database™, which is linked to both RxNORM and SNOMED-CT. Drug indications are classified according to their source: FDA approved label, accepted use, or limited evidence. We consider all of these as “known” uses; those classified as FDA approved are known “on-label” uses. There are 8,253 distinct on-label drug–indication pairs (normalized by ingredient), 2,944 distinct accepted uses, and 3,849 distinct uses having limited evidence.
In addition to using Medi-Span as validation data, we use the National Drug File Reference Terminology (NDFRT) ontology, which specifies drug indications via the may_treat relation. NDFRT is also directly linked to RxNORM and SNOMED-CT via the UMLS Metathesaurus. We consider these as “known” uses as well. There are 5,429 drug–indication pairs (normalized by ingredient) specified by NDFRT may_treat.
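As an illustration of how the two sources can be pooled into a single set of known uses, the following is a minimal sketch and not our actual pipeline. The record layout, the category labels (`fda_label`, `accepted_use`, `limited_evidence`, `ndfrt_may_treat`), and the placeholder drug and disease names are all assumptions made for the example; none of them are taken from Medi-Span or NDFRT.

```python
from collections import defaultdict

# Hypothetical toy records (placeholders, not real database content).
# Medi-Span style: (ingredient, indication, source category).
MEDISPAN = [
    ("drug_a", "disease_x", "fda_label"),
    ("drug_a", "disease_y", "accepted_use"),
    ("drug_b", "disease_z", "limited_evidence"),
]
# NDFRT style: (ingredient, indication) pairs from the may_treat relation.
NDFRT_MAY_TREAT = [
    ("drug_a", "disease_x"),
    ("drug_b", "disease_w"),
]


def build_known_uses(medispan, ndfrt):
    """Map each distinct (ingredient, indication) pair to the sources asserting it."""
    sources = defaultdict(set)
    for ingredient, indication, category in medispan:
        sources[(ingredient, indication)].add(category)
    for ingredient, indication in ndfrt:
        sources[(ingredient, indication)].add("ndfrt_may_treat")
    return dict(sources)


def is_on_label(known, pair):
    """A pair is on-label only if some source lists it under the FDA approved label."""
    return "fda_label" in known.get(pair, set())


known = build_known_uses(MEDISPAN, NDFRT_MAY_TREAT)
print(len(known), "distinct known drug-indication pairs")
```

Keying on the normalized (ingredient, indication) pair means that a use asserted by both sources is counted once, which is why a pooled total can be smaller than the sum of the per-source counts.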
Overall, Medi-Span and NDFRT contribute 18,218 distinct drug–indication pairs constituting known usage. These pairs cover 2,004 ingredients and 2,674 indications. In STRIDE, 145,498 distinct drug terms normalize into 4,871 ingredients, and we keep all of these ingredients. Of the disease terms in the ontologies we use, 281,844 aggregate into (i.e., are subsumed by) the 2,674 indications from Medi-Span and NDFRT (e.g., Amok becomes mania); we discard all other disease terms.
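The normalization and aggregation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: the lookup table `DRUG_TERM_TO_INGREDIENT`, the `PARENTS` is-a map, and the function names are hypothetical, and the only mapping taken from the text is that Amok is subsumed by mania (the parent of mania shown below is invented for the example).

```python
# Hypothetical drug-term lookup; in practice this would be derived from RxNORM.
DRUG_TERM_TO_INGREDIENT = {
    "brand_a 500 mg tablet": "ingredient_a",
    "ingredient_a hcl": "ingredient_a",
}

# Hypothetical is-a edges: child term -> list of direct parent terms.
PARENTS = {
    "amok": ["mania"],
    "mania": ["mood disorder"],  # assumed parent, for illustration only
}

# Stand-in for the 2,674 indications drawn from Medi-Span and NDFRT.
TARGET_INDICATIONS = {"mania"}


def normalize_drug(term):
    """Return the ingredient for a drug term, or None if the term is unknown."""
    return DRUG_TERM_TO_INGREDIENT.get(term.lower())


def subsuming_indications(term, parents, targets):
    """Walk up the is-a edges and collect target indications that equal or subsume term."""
    found, seen, stack = set(), set(), [term]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in targets:
            found.add(node)
        stack.extend(parents.get(node, []))
    return found


def aggregate_disease_term(term):
    """Return the indications a term maps to; an empty set means the term is discarded."""
    return subsuming_indications(term.lower(), PARENTS, TARGET_INDICATIONS)


print(normalize_drug("Brand_A 500 mg Tablet"))
print(aggregate_disease_term("Amok"))
print(aggregate_disease_term("mood disorder"))
```

In this sketch a term more general than every target indication, such as the assumed "mood disorder" node, maps to nothing and is dropped, mirroring the rule that terms not subsumed by a known indication are discarded.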