Seeking drug-related information is one of the major online activities of today's healthcare professionals and consumers. A wide variety of drug-related resources now exist, including but not limited to the biomedical literature in PubMed®, clinical trials in ClinicalTrials.gov [2], adverse drug effects in FDA's Spontaneous Reporting System, and consumer-level drug monographs in MedlinePlus® and PubMed Health [4]. Because these resources are heterogeneous, they are not currently linked to each other. Their contents, however, are often complementary, so users would benefit from integrated access to all sources relevant to a single drug. There is thus a growing need to build cross-links between these resources for the same drug entity, so that users of one site can be pointed to relevant information on other sites. To this end, a critical step is identifying the drug entity in the corresponding narrative text.
Biomedical named entity recognition (NER) is a challenging task, but it is a prerequisite for many downstream tasks such as relationship extraction [5]. Over the years, most NER tools have been developed to automatically recognize genes and gene products in free text using one of three approaches: dictionary-based, rule-based, and machine-learning-based. By contrast, less work has addressed drug entity identification. This may be due in part to the difficulty of defining a drug entity in text. In earlier work involving automatic drug entity identification [6], a drug was simply defined by its generic name or active ingredient. Such an approximation may be appropriate for those applications, but a formal definition of a drug should take other important specifications into account. For instance, a drug's dosage form (DF) is the physical form in which the drug is produced and dispensed. It is one of the most important specifications of a drug because it affects how the drug is administered to a patient. Drugs with the same ingredients but different dosage forms can have different uses. For example, timolol as an ophthalmic solution (eye drops) is used to treat glaucoma, whereas timolol as an oral tablet is used to treat high blood pressure, to improve survival after a heart attack, and to prevent migraine headaches. Furthermore, dosage forms affect drug absorption and distribution in the human body. To confirm drug efficacy and optimize drug therapy, a drug's pharmacokinetic and pharmacodynamic properties should therefore be modeled and tested across different dosage forms. For example, tacrolimus, a macrolide with potent immunosuppressive effects, is available as an oral capsule and as an injectable solution, and its pharmacokinetic properties have been studied across intravenous, oral, and intramuscular dosage forms [10]. Because dosage form matters in both drug development and consumption, it is crucial that drug-related information resources provide accurate dosage form information. Therefore, the goal of this work is to automatically identify drug dosage form information in free text and subsequently normalize it using a standardized nomenclature.
Our proposed method is rule-based and relates to two areas of previous work. The first is clinical drug normalization: Peters et al. examined the complexity, ambiguity, and variability of clinical drug names (e.g., 'Metoprolol Succinate 200 mg sa Tab'). They processed each clinical drug name as a string and defined a set of rules, such as expanding abbreviations (e.g., tab to tablet), to normalize it. However, in their study the dosage form was not segmented from the clinical drug name for further normalization. The second is medication information extraction from clinical narratives in electronic medical records (EMRs). Recent studies have focused on extracting both drug names and related attributes such as strength, route, frequency, form, and duration. The 2009 i2b2 challenge task was to identify mentions of drugs and drug-related information, such as dosages and routes of administration, in discharge summaries [12]. However, there was no requirement to normalize mentions to any standardized nomenclature. More recently, several natural language processing (NLP) systems, such as MedEx [13] and MTERMS [14], have been developed to automatically normalize identified drugs to concepts in one or more terminologies. These research efforts contribute to the field of automated medication reconciliation across the care continuum [15]. They successfully applied NLP techniques to summarize and encode medication data with high performance (F-measure > 90%). In these systems, mentions of drug dosage form are captured from clinical narratives but are not further normalized to a standard controlled vocabulary.
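The two steps discussed above, expanding abbreviations in a drug string and mapping a captured dosage form mention to a controlled term, can be illustrated with a minimal sketch. It is not the method of Peters et al. or of this study: the abbreviation table (apart from tab to tablet, which the text mentions), the small target vocabulary, and the function names `expand_abbreviations` and `find_dosage_forms` are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table; only "tab" -> "tablet" comes from the text.
ABBREVIATIONS = {"tab": "tablet", "cap": "capsule", "soln": "solution"}

# Illustrative target vocabulary: lowercase surface form -> normalized term.
# A real system would load the full list from a standardized nomenclature.
DOSAGE_FORMS = {
    "oral tablet": "Oral Tablet",
    "oral capsule": "Oral Capsule",
    "ophthalmic solution": "Ophthalmic Solution",
    "injectable solution": "Injectable Solution",
}


def expand_abbreviations(text):
    """Lowercase the text, split it into alphanumeric tokens, expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def find_dosage_forms(text):
    """Return the normalized dosage forms whose surface form occurs in the text."""
    expanded = expand_abbreviations(text)
    found = []
    for surface, normalized in DOSAGE_FORMS.items():
        # Word boundaries prevent matches inside longer tokens.
        if re.search(r"\b" + re.escape(surface) + r"\b", expanded):
            found.append(normalized)
    return found


print(find_dosage_forms("Timolol 10 mg oral tab"))
```

Under these assumptions, the abbreviation is expanded first so that a single dictionary of full surface forms suffices for the lookup; handling of synonyms, word-order variation, and ambiguous abbreviations is deliberately left out.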
In this study, we present a computational method to identify dosage forms in full-text drug monographs and normalize them to a standardized nomenclature. Specifically, we used the American Hospital Formulary Service® (AHFS) drug monographs provided by the American Society of Health-System Pharmacists® (ASHP) as our corpus and RxNorm [17] as the standardized nomenclature. To the best of our knowledge, this is the first work on drug dosage form identification and normalization. For evaluation, we first randomly selected approximately 10% of the AHFS drug monographs and produced human annotations (the gold standard). In addition, to evaluate our method on the entire test data, we developed a silver standard by automatically extracting the known dosage forms from the drug products currently listed in the drug monographs.
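Comparing system output against a gold or silver standard is commonly summarized with precision, recall, and F-measure. The sketch below assumes that, for each monograph, both the system output and the reference annotation are sets of normalized dosage form names; the function name `precision_recall_f1` and the example values are hypothetical and are not results from this study.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one monograph's dosage forms."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one of two predictions is correct, one of two gold forms is found.
system_output = {"Oral Tablet", "Oral Capsule"}
reference = {"Oral Tablet", "Ophthalmic Solution"}
print(precision_recall_f1(system_output, reference))
```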