RxNorm is a standardized nomenclature for clinical drug entities developed by the National Library of Medicine. In this paper, we audit relations in RxNorm for consistency and completeness through the systematic analysis of the graph of its concepts and relationships.
The representation of multi-ingredient drugs is normalized in order to make it compatible with that of single-ingredient drugs. All meaningful paths between two nodes in the type graph are computed and instantiated. Alternate paths are automatically compared and manually inspected in case of inconsistency.
The 115 meaningful paths identified in the type graph can be grouped into 28 groups with respect to start and end nodes. Of the 19 groups of alternate paths (i.e., with two or more paths) between the start and end nodes, 9 (47%) exhibit inconsistencies. Overall, 28 (24%) of the 115 paths are inconsistent with other alternate paths. A total of 348 inconsistencies were identified in the April 2008 version of RxNorm and reported to the RxNorm team, of which 215 (62%) had been corrected in the January 2009 version of RxNorm.
The inconsistencies identified involve missing nodes (93), missing links (17), extraneous links (237) and one case of mix-up between two ingredients. Our auditing method proved effective in identifying a limited number of errors that had defeated the quality assurance mechanisms currently in place in the RxNorm production system. Some recommendations for the development of RxNorm are provided.
Terminology development in biomedicine largely relies on the manual work of human editors (sometimes called modelers) [e.g., 1, 2]. Although sometimes facilitated by the use of knowledge representation formalisms such as description logics, this process is known to be error-prone [3, 4]. Many approaches have been proposed for analyzing large biomedical terminologies, based on the properties of their terms [3-6], on their structure [7-10] and on their semantics [3, 4, 11, 12]. Most approaches focus on auditing hierarchical relations, which form the backbone of biomedical terminologies [7, 8, 10]. Many terminology developers include quality assurance and quality control processes as part of the development cycle. However, such mechanisms fail to capture many errors, and independent researchers and the user community play an important role in identifying and reporting errors in biomedical terminologies.
From a structural perspective, most biomedical terminologies can be seen as directed graphs in which nodes are concepts and links are semantic relationships. A path between two concepts can be characterized by the sequence of relationships that need to be traversed in order to reach a target concept from a source concept. While broad terminologies (e.g., SNOMED CT, NCI Thesaurus) usually have a complex model of meaning (or T-Box in description logics-based terminologies), specialized terminologies such as RxNorm (presented in detail later) only define a few major categories (or types) in the domain and their interrelations. In such terminologies, most of the assertions hold among instances of these categories (rather than among the categories themselves) and are associative rather than hierarchical. The graph of types provides a model against which graphs of instances can be validated. For example, all paths defined in the model are expected to be instantiated, allowing checking for completeness (of nodes and links at the instance level.) Similarly, alternate paths between two types known to be consistent at the type level are also expected to be consistent at the instance level.
The objective of this study is to audit relations in RxNorm for consistency and completeness through the systematic analysis of the graph of its concepts and relationships (at the instance level, in reference to the type level.) More specifically, we hypothesize that the traversal of equivalent paths yielding different results is indicative of errors in the graph of instances, including missing links and erroneous links, and possibly missing nodes.
RxNorm is a standardized nomenclature for clinical drug entities developed by the National Library of Medicine. RxNorm is one of a suite of designated standards for use in U.S. Federal Government systems for the electronic exchange of clinical health information. RxNorm has been used as part of a mediation strategy to exchange medication data between the Veterans Affairs (VA) and the Department of Defense (DoD) clinical information systems and as a drug vocabulary for personal health records. It is also expected to become an enabling resource for applications such as e-prescribing and medication reconciliation.
The RxNorm data set is organized around eight major categories, called “term types” in RxNorm parlance (presented in bold, sans serif typeface, while instances of these categories are shown in italic typeface.) There are four categories for generic drugs and four equivalent categories for branded drug entities. The four categories for generic drugs (referred to hereafter as generic concepts) are for ingredient alone (ingredient), ingredient plus strength (clinical drug component), ingredient plus dose form (clinical drug form) and ingredient plus strength and dose form (clinical drug.) Analogously, the four categories for branded drug entities (referred to hereafter as branded concepts) are brand name (alone), branded drug component (brand name plus strength), branded drug form (brand name plus dose form) and branded drug (brand name plus strength and dose form). Table 1 lists the eight major categories and some instances. The dataset under investigation in this study (April 1, 2008) comprises, after excluding obsolete data, 3,460 ingredients (ignoring specific salts), 9,740 brand names, 13,362 clinical drug components, 13,868 branded drug components, 18,097 clinical drugs, 14,539 branded drugs, 8,160 clinical drug forms and 11,376 branded drug forms.
As shown in Figure 1, relations are defined among branded concepts and among generic concepts. For each brand name concept, there exists one or more branded drug components, branded drugs and branded drug forms. Each ingredient is associated with one or more clinical drug components, clinical drugs and clinical drug forms. Moreover, the RxNorm drug entities are related to each other by a well-defined set of named relationships (presented in italic, sans serif typeface.) For example, brand name concepts are related to branded drug component concepts by the relationships ingredient_of and has_ingredient, the latter being the inverse relationship. Examples of relations at the instance level include:
Figure 1 shows all relationships between the various kinds of drug entities. It must be noted that the relationship isa defined between branded drug and branded drug form and between clinical drug and clinical drug form does not have the usual semantics of the subsumption relation of the same name (e.g., as defined in ), but simply links an entity with ingredient (resp. brand name), strength and dose form to the corresponding entity with ingredient (resp. brand name) and dose form, but no strength.
In addition to relations among branded concepts and among generic concepts, RxNorm also defines relations between branded concepts and generic concepts. As illustrated in Figure 1, most relations are between entities at the same level (e.g., ingredient plus strength to brand name plus strength.) This relationship is called tradename_of from branded concepts to generic concepts, the inverse relationship being has_tradename. Additionally, RxNorm defines the relationship consists_of between branded drugs and clinical drug components, with constitutes as its inverse. Examples of relationships at the instance level include:
It should be noted that all relations in RxNorm are systematically mirrored by inverse relations. As shown in Figure 1, for each link between two type nodes (e.g., ingredient_of between ingredient and clinical drug component), there is an inverse link (e.g., has_ingredient between clinical drug component and ingredient.) At the instance level, all relations in RxNorm are also represented bidirectionally, i.e., for each relation (e.g., Cetirizine ingredient_of Cetirizine 5 MG), the inverse relation (i.e., Cetirizine 5 MG has_ingredient Cetirizine) is also recorded in the RxNorm dataset. For this reason we often represent the links between drug entities as undirected (instead of bidirectional.) The (undirected) representation of Zyrtec 5 MG Oral Tablet is shown in Figure 2.
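The mirroring property described above lends itself to a simple mechanical check. The following is a minimal sketch, assuming relations are available as (subject, relationship, object) triples; the function name `missing_inverses` and the triple layout are illustrative assumptions, and the inverse table only covers the relationship pairs named in the text (the inverse of isa is not named there and is left out).

```python
# Inverse relationship names as given in the text; this table may be incomplete.
INVERSE = {
    "ingredient_of": "has_ingredient",
    "has_ingredient": "ingredient_of",
    "tradename_of": "has_tradename",
    "has_tradename": "tradename_of",
    "consists_of": "constitutes",
    "constitutes": "consists_of",
}


def missing_inverses(triples):
    """Return the inverse triples that are absent from the given triples."""
    present = set(triples)
    return sorted(
        (obj, INVERSE[rel], subj)
        for subj, rel, obj in present
        if (obj, INVERSE[rel], subj) not in present
    )


# Toy data: the first relation is mirrored, the second deliberately is not.
triples = [
    ("Cetirizine", "ingredient_of", "Cetirizine 5 MG"),
    ("Cetirizine 5 MG", "has_ingredient", "Cetirizine"),
    ("Cetirizine 5 MG [Zyrtec]", "tradename_of", "Cetirizine 5 MG"),
]
unmirrored = missing_inverses(triples)
```

On a dataset that is fully mirrored, as RxNorm is reported to be, the result would be an empty list, which is what justifies treating links as undirected.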
While all branded concepts stand in a relation to some generic drug concepts, some generic drug concepts are not linked to any branded concepts. For example, there is no branded concept corresponding to Cetirizine 10 MG Extended Release Tablet, which means that this particular ingredient, strength and dose form combination is not commercialized under a particular brand, but rather available as a generic drug.
For single-ingredient drugs there is a strict correspondence in RxNorm between branded and generic drug entities. To each branded drug entity (e.g., brand name) corresponds one generic drug entity of the equivalent type (e.g., ingredient.) Additionally, as shown in Figure 2, each branded drug entity is related to only one brand name and similarly each generic drug entity is related to only one ingredient. In contrast, there is no such correspondence between generic and branded concepts for multi-ingredient drugs. Namely, while each multi-ingredient branded drug, branded drug component and branded drug form is related to only one brand name, multi-ingredient clinical drugs and clinical drug forms are related to multiple ingredients and clinical drug components. In addition, multi-ingredient brand names are related to multiple ingredients, and multi-ingredient branded drug components and branded drugs are related to multiple clinical drug components.
As shown in Figure 3, the branded drug (Sulfamethoxazole 400 MG / Trimethoprim 80 MG Oral Tablet [Bactrim]) is linked to one clinical drug (Sulfamethoxazole 400 MG / Trimethoprim 80 MG Oral Tablet.) However, the branded drug is linked to one branded drug component (Sulfamethoxazole 400 MG / Trimethoprim 80 MG [Bactrim]), whereas the corresponding clinical drug is linked to two clinical drug components (Sulfamethoxazole 400 MG and Trimethoprim 80 MG), one for each ingredient (Sulfamethoxazole and Trimethoprim) of this multi-ingredient drug.
The number of relations asserted at the instance level in the dataset under investigation in this study (April 1, 2008) is listed in Table 2. The counts are given in reference to the normalized representation described in section 3.1, so that the number of inconsistencies can be related to these counts.
A browser called RxNav was developed in 2004 to access the RxNorm dataset and display graphically all related concepts and the relations between them. RxNav uses web services to access the RxNorm data. In early 2008 the web services that access the RxNorm data were enhanced and made available publicly. The current application programming interface (API) comprises functions for resolving drug names and codes into RxNorm identifiers, for accessing the properties of drug concepts, and for getting the related concepts of RxNorm entities. Here, we take advantage of the latter set of functions for exploring the RxNorm graph computationally.
The methods used in this study can be summarized as follows. We start by creating a normalized representation of multi-ingredient drugs. Then, we identify all meaningful paths between two categories, for all the instances of the source category. Finally, we assess the consistency of alternate paths between pairs of categories by comparing sets of instances reached through the various alternate paths. These three steps are presented in detail below.
As explained in the last paragraphs of section 2.2, the representation of multi-ingredient drugs differs in RxNorm for generic concepts compared to branded concepts. For example, as shown in Figure 2 for single-ingredient drugs and in Figure 3 for multi-ingredient drugs, each multi-ingredient branded drug, branded drug component and branded drug form is related to only one brand name, whereas multi-ingredient clinical drugs and clinical drug forms are related to multiple ingredients and clinical drug components.
This representation is adapted to common uses of RxNorm as there is no such thing in practice as a combination of ingredients. However, we found this difference to be a hindrance to our auditing endeavor. Instead of using different algorithms for auditing single- and multi-ingredient drugs, we chose to modify the schema of RxNorm so that the same algorithm could be used on both single- and multi-ingredient drugs.
The normalization process we propose only affects multi-ingredient drugs. As illustrated by the differences between Figure 3 and Figure 4, normalization occurs at the level of ingredients and clinical drug components and their relations to other generic concepts, namely clinical drugs (for clinical drug components) and clinical drug forms (for ingredients), as well as to the corresponding branded concepts, namely brand names (for ingredients), and branded drug components and branded drugs (for clinical drug components.) The normalization process simply reifies multi-ingredient entities (i.e., transforms multi-ingredient entities into single-ingredient-like entities.)
In practice the normalization process creates new ingredient concepts for combinations of ingredients and new clinical drug component concepts for combinations of clinical drug components. For example, as shown in Figure 4, the two ingredients of the brand name Bactrim, Sulfamethoxazole and Trimethoprim, are grouped into the new ingredient concept Sulfamethoxazole / Trimethoprim. Similarly, the two clinical drug components of the branded drug component Sulfamethoxazole 400 MG / Trimethoprim 80 MG [Bactrim], Sulfamethoxazole 400 MG and Trimethoprim 80 MG, are grouped into the new clinical drug component concept Sulfamethoxazole 400 MG / Trimethoprim 80 MG. The relations of the newly created concepts are adapted accordingly. A single link is created from the new ingredient Sulfamethoxazole / Trimethoprim to both the clinical drug form Sulfamethoxazole / Trimethoprim Oral Tablet and the brand name Bactrim. Similarly, a single link is created from the new clinical drug component Sulfamethoxazole 400 MG / Trimethoprim 80 MG to both the clinical drug Sulfamethoxazole 400 MG / Trimethoprim 80 MG Oral Tablet (ingredient_of) and the branded drug component Sulfamethoxazole 400 MG / Trimethoprim 80 MG [Bactrim] (tradename_of.) Finally, a single link is also created between the new ingredient and the new clinical drug component (ingredient_of.) All links are represented bidirectionally. The original links are removed, and so are the original ingredients and clinical drug components if they do not participate in any other single- or multi-ingredient drug entities.
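The reification step can be illustrated with a small sketch. It assumes that links between clinical drug components and clinical drugs are held as (component, drug) pairs and that the combined concept is named by joining the sorted member names with " / "; both the data layout and the naming rule are assumptions made for illustration, not a description of the actual implementation. The same idea would apply to ingredients and clinical drug forms.

```python
def reify_components(links):
    """Replace the several component links of each drug by one combined link.

    links: set of (component, drug) pairs.
    Returns a new set of pairs in which every drug has exactly one component.
    """
    components_by_drug = {}
    for component, drug in links:
        components_by_drug.setdefault(drug, set()).add(component)

    normalized = set()
    for drug, components in components_by_drug.items():
        # A single-ingredient drug keeps its component unchanged, because
        # joining one name yields that name.
        combined = " / ".join(sorted(components))
        normalized.add((combined, drug))
    return normalized


BACTRIM_SCD = "Sulfamethoxazole 400 MG / Trimethoprim 80 MG Oral Tablet"
links = {
    ("Sulfamethoxazole 400 MG", BACTRIM_SCD),
    ("Trimethoprim 80 MG", BACTRIM_SCD),
    ("Cetirizine 5 MG", "Cetirizine 5 MG Oral Tablet"),  # illustrative single-ingredient drug
}
normalized = reify_components(links)
```

Because only links are emitted, an original component that participates in no other drug disappears from the normalized graph, while one that also occurs in a single-ingredient drug is retained through that drug's link.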
A path between two drug concepts can be characterized by the sequence of relationships that need to be traversed in order to reach a target drug concept from a source drug concept. For example, one path between clinical drug component (SCDC) and branded drug component (SBDC) is SCDC → SCD → SBD → SBDC, through the relationships constitutes, has_tradename and consists_of. Because all relations are mirrored with inverse relations in RxNorm, an inverse path can be found between SBDC and SCDC (i.e., SBDC → SBD → SCD → SCDC), traversing the inverse relationships in reverse order, i.e., going through the relationships constitutes, tradename_of and consists_of.
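The construction of an inverse path can be sketched as follows, assuming a path is stored as a list of category abbreviations plus the list of relationships traversed between them (a representation chosen here for illustration). The inverse table is limited to the relationship pairs named in the text.

```python
# Relationship pairs named in the text; the table may be incomplete.
INVERSE = {
    "ingredient_of": "has_ingredient",
    "has_ingredient": "ingredient_of",
    "tradename_of": "has_tradename",
    "has_tradename": "tradename_of",
    "consists_of": "constitutes",
    "constitutes": "consists_of",
}


def inverse_path(nodes, relationships):
    """Reverse the node sequence and invert each relationship in reverse order."""
    return list(reversed(nodes)), [INVERSE[r] for r in reversed(relationships)]


nodes = ["SCDC", "SCD", "SBD", "SBDC"]
relationships = ["constitutes", "has_tradename", "consists_of"]
inv_nodes, inv_relationships = inverse_path(nodes, relationships)
# inv_nodes         -> ["SBDC", "SBD", "SCD", "SCDC"]
# inv_relationships -> ["constitutes", "tradename_of", "consists_of"]
```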
Moreover, after normalization of the representation of multi-ingredient drugs, the exploration of any path is functionally equivalent to the exploration of the inverse path. For example, auditing the path SCDC → SCD → SBD → SBDC from Cetirizine 5 MG is equivalent to auditing the path SBDC → SBD → SCD → SCDC from Cetirizine 5 MG [Zyrtec]. For this reason, of the 56 pairs of drug entities, only half of them (28) need to be considered for auditing purposes (Figure 5.)
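The counts of 56 and 28 follow directly from the eight categories: there are 8 × 7 ordered (source, target) pairs, and half as many once a pair and its inverse are treated as one. A short sketch, using the category abbreviations that appear in the paths of this paper:

```python
from itertools import combinations, permutations

TYPES = ["IN", "SCDC", "SCD", "SCDF", "BN", "SBDC", "SBD", "SBDF"]

ordered_pairs = list(permutations(TYPES, 2))    # (source, target) with source != target
unordered_pairs = list(combinations(TYPES, 2))  # one representative per inverse pair

print(len(ordered_pairs), len(unordered_pairs))  # 56 28
```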
For these 28 pairs of drug entities in RxNorm, we want to explore all paths between source and target drug concepts at the instance level. Most of the paths between a source and a target drug concept are expected to be equivalent. For example, as shown in Figure 2, there are multiple possible paths between the ingredient Cetirizine and the clinical drug form Cetirizine Oral Tablet, including IN → SCDF and IN → SCDC → SCD → SCDF. These two paths are expected to be equivalent, i.e., to reach the same set of clinical drug form target concepts from the source ingredient concept.
As is the case for graph traversal in general, we allow each node of the RxNorm graph to be traversed only once in order to avoid infinite recursion (e.g., SCDC → SCD → SCDC → SCD → ….) More importantly, the traversal of the RxNorm graph is also influenced by the nature of drug information. The following elements restrict how the RxNorm graph may be traversed. First, some generic concepts do not have any associated branded concepts (e.g., there is no branded drug corresponding to the clinical drug Cetirizine 10 MG Extended Release Tablet.) Second, some generic concepts are associated with several branded concepts (e.g., Coumadin, Jantoven, Marfarin and Warfin are brand names for the ingredient Warfarin.) Third, only a limited number of strength and dose form combinations exist for a given ingredient or branded drug (e.g., 1 MG/ML is an appropriate strength for the dose form oral solution, but 10 MG is not.) And fourth, not all brands produce all strengths and dose forms of a given drug (e.g., Warfarin is available in various strengths for the dose form Oral Tablet, but the only strength available for the brand name Marfarin is 4 MG.) For these four reasons, we know that some paths will predictably be different from paths with the same source and target concepts. In the auditing process, we want to ignore such predictable differences and focus on identifying discrepancies among paths expected to be equivalent.
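The rule that each node may be traversed only once amounts to enumerating simple paths in the type graph. The sketch below does this with a depth-first search. The edge list is an assumption reconstructed from the relations named in the text (Figure 1 remains the authoritative source), and no domain constraint is applied, so the enumeration is expected to produce more paths than the set of meaningful paths retained in this study.

```python
from collections import defaultdict

# Undirected type-level links, reconstructed from the text (assumption).
TYPE_EDGES = [
    ("IN", "SCDC"), ("IN", "SCDF"), ("IN", "BN"),
    ("SCDC", "SCD"), ("SCDC", "SBDC"), ("SCDC", "SBD"),
    ("SCD", "SCDF"), ("SCD", "SBD"), ("SCDF", "SBDF"),
    ("BN", "SBDC"), ("BN", "SBD"), ("BN", "SBDF"),
    ("SBDC", "SBD"), ("SBD", "SBDF"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def simple_paths(adjacency, source, target, visited=()):
    """Yield every path from source to target that visits no node twice."""
    path = visited + (source,)
    if source == target:
        yield path
        return
    for neighbor in sorted(adjacency[source]):
        if neighbor not in path:  # each node is traversed at most once
            yield from simple_paths(adjacency, neighbor, target, path)


ADJACENCY = build_adjacency(TYPE_EDGES)
in_to_scdf = list(simple_paths(ADJACENCY, "IN", "SCDF"))
```

Both IN → SCDF and IN → SCDC → SCD → SCDF appear in `in_to_scdf`, along with paths that detour through branded concepts; filtering out such detours is the role of the domain constraints.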
Based on our knowledge of the subject matter and our experience in defining rules for traversing the RxNorm graph in RxNav, we defined a priori four constraints that allow us to avoid processing meaningless (predictably inconsistent) paths.
These constraints were easily implemented through a regular expression applied to the sequence of transitions for a given path and to the sequence of states (i.e., lists of properties) for all the nodes in a path, emulating a finite state automaton.
A total of 230 meaningful paths remain after all constraints have been applied. Since all relations in RxNorm are bidirectionally recorded, there exist 115 pairs of inverse paths. Only one copy needs to be explored for each path pair. As shown in Table 3, these 115 paths can be grouped into 28 classes with respect to source and target nodes in the path.
The RxNorm API was used to explore the paths. In particular, the function getRelatedByRelationship() was used for querying the instances of a given type that could be reached from a given RxNorm entity (instance) through a given link.
Each of the 115 meaningful paths (of categories) was explored as follows. Starting from the category corresponding to the first node in the path (source category), all instances of this node were retrieved. For each instance of the source category, we recorded the set of instances of the target category which could be reached, following the links indicated in the path of categories. The complete set of instances reached for a given path is the union of the sets of target instances reached from each source instance.
For example, the path SCDC→SCD→SBD→SBDC is explored as follows. The list of instances of SCDC (source instances) includes Warfarin 1 MG. As shown in Figure 11, the only SCD instance that can be reached from Warfarin 1 MG through the relationship constitutes is Warfarin 1 MG Oral Tablet. From this SCD instance, following the relationship has_tradename, two SBD instances can be reached: Coumadin 1 MG Oral Tablet and Jantoven 1 MG Oral Tablet. The SBD Coumadin 1 MG Oral Tablet leads to the SBDC Warfarin 1 MG [Coumadin] (target category) through the relationship consists_of. Similarly, the SBD Jantoven 1 MG Oral Tablet leads to the SBDC instance Warfarin 1 MG [Jantoven]. In summary, the source SCDC instance Warfarin 1 MG leads to two target SBDC instances, Warfarin 1 MG [Coumadin] and Warfarin 1 MG [Jantoven], through the path SCDC→SCD→SBD→SBDC, and therefore contributes two target SBDC instances to this path. Overall, this path yields 13,868 target instances.
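The exploration of the Warfarin example can be sketched as below. The lookup table stands in for the RxNorm API; `get_related` is a hypothetical local helper, not the signature of getRelatedByRelationship(), and the table contains only the four relations mentioned in the example. Each step of the path pairs a relationship with the expected target category, since a relationship such as consists_of can lead to more than one category.

```python
# Toy instance graph: (instance, relationship, target category) -> related instances.
RELATED = {
    ("Warfarin 1 MG", "constitutes", "SCD"): {"Warfarin 1 MG Oral Tablet"},
    ("Warfarin 1 MG Oral Tablet", "has_tradename", "SBD"): {
        "Coumadin 1 MG Oral Tablet",
        "Jantoven 1 MG Oral Tablet",
    },
    ("Coumadin 1 MG Oral Tablet", "consists_of", "SBDC"): {"Warfarin 1 MG [Coumadin]"},
    ("Jantoven 1 MG Oral Tablet", "consists_of", "SBDC"): {"Warfarin 1 MG [Jantoven]"},
}


def get_related(instance, relationship, target_type):
    """Hypothetical stand-in for an API call returning related instances."""
    return RELATED.get((instance, relationship, target_type), set())


def follow_path(sources, steps):
    """Return the union of target instances reached from all source instances."""
    reached = set()
    for source in sources:
        frontier = {source}
        for relationship, target_type in steps:
            frontier = {
                target
                for node in frontier
                for target in get_related(node, relationship, target_type)
            }
        reached |= frontier
    return reached


# The path SCDC -> SCD -> SBD -> SBDC.
STEPS = [("constitutes", "SCD"), ("has_tradename", "SBD"), ("consists_of", "SBDC")]
targets = follow_path(["Warfarin 1 MG"], STEPS)
```

Here `targets` holds the two SBDC instances of the example. Applied to all SCDC source instances, the same union would give the complete target set for the path.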
Alternate (meaningful) paths between a given source entity and a given target entity are expected to be equivalent. Alternate paths are equivalent if the same set of target instances is reached from a given set of source entities. A set of alternate paths is consistent if all alternate paths in the set are equivalent.
For example, there are three alternate paths between clinical drug component (SCDC) and branded drug component (SBDC), through entities including clinical drugs (SCD) and branded drugs (SBD):
The three alternate paths between SCDC and SBDC yield the same set of 13,868 target instances and are deemed equivalent. The set of paths between SCDC and SBDC is therefore deemed consistent.
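The equivalence test can be sketched as a comparison of target sets against a reference path. The function name, the report layout and the toy target names ("form A", etc.) are illustrative assumptions. Sets rather than counts are compared, because two paths may reach the same number of targets without reaching the same targets.

```python
def compare_paths(results, reference):
    """Compare each alternate path with the reference path.

    results: dict mapping a path label to the set of target instances reached.
    Returns a dict with one entry per path that is not equivalent to the reference.
    """
    reference_targets = results[reference]
    report = {}
    for path, targets in results.items():
        if path == reference:
            continue
        missing = reference_targets - targets
        extra = targets - reference_targets
        if missing or extra:
            report[path] = {"missing": sorted(missing), "extra": sorted(extra)}
    return report


# Toy target sets with placeholder names.
results = {
    "IN->SCDC->SCD->SCDF": {"form A", "form B"},
    "IN->SCDF": {"form A", "form B", "form C"},
}
report = compare_paths(results, reference="IN->SCDC->SCD->SCDF")
consistent = not report
```

In this toy case the group is not consistent: the direct path reaches one target ("form C") that the reference path does not.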
The results of the exploration of the 115 meaningful paths are summarized in Table 3. In order to reduce the amount of information in this table, we only display one typical path for each set of equivalent paths. For example, from the three equivalent paths presented above for the start node SCDC and end node SBDC, column 3 confirms that there are indeed three paths, although only one of them (SCDC→SBDC) is actually listed in column 4. In fact, column 5 indicates that there are two other unlisted equivalent paths for this path. Column 6 lists the number of target instances reached for each set of equivalent paths in the April 2008 version of RxNorm. The remaining columns present information pertaining to the evaluation and will be discussed later. Each of the 28 rows of Table 3 presents the list of alternate paths between a given pair of start and end nodes and shows which alternate paths contain inconsistencies. Paths free of inconsistencies – one for each group – are called reference paths and are indicated in bold.
Of the 28 groups of alternate paths expected to be consistent, 9 groups contain only one path and could not be checked for inconsistencies. Of the 19 groups having more than one path, all alternate paths are equivalent in 10 groups (53%), while 9 groups (47%) exhibit inconsistencies. Overall, 28 paths (represented by 20 typical paths) are not equivalent to the reference path from the same group. These 28 inconsistent paths represent 24% of the 115 meaningful paths.
All inconsistencies identified by our method were reported in September 2008 to our NLM colleagues in charge of RxNorm, who provided feedback on our findings. Their assessment is presented in the section below, along with the analysis of inconsistencies. Additionally we repeated the experiment on the January 2009 version of RxNorm in order to determine whether any of the inconsistencies reported had been corrected (Table 3.)
The analysis of Table 3 reveals that inconsistencies in four paths (BN→SBDF, IN→SCDF, SCDF→SBDF and IN→BN) are actually responsible for the inconsistencies observed in 12 of the 20 inconsistent (typical) paths. The reason is that these four paths are included as proper subpaths in the other eight paths. For example, SCDF→SBDF is a proper subpath of IN→SCDC→SCD→SCDF→SBDF from the group IN-SBDF.
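The subpath relation used here can be made explicit: a path is a proper subpath of another if its nodes occur as a shorter, contiguous run in the longer path. A minimal sketch, with paths represented as lists of category abbreviations (an assumed representation):

```python
def is_proper_subpath(sub, path):
    """True if `sub` occurs as a contiguous, strictly shorter run of `path`."""
    n, m = len(sub), len(path)
    return n < m and any(path[i:i + n] == sub for i in range(m - n + 1))


long_path = ["IN", "SCDC", "SCD", "SCDF", "SBDF"]
found = is_proper_subpath(["SCDF", "SBDF"], long_path)
```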
The degree of inconsistency observed among alternate paths (i.e., the difference in number of target nodes reached, compared to the reference path) was generally small. For example, for the group IN-SCDF, the reference path yields 8,104 target instances, while the inconsistent alternate path yields 8,160 target instances. The 56 additional instances represent 0.7% of the target instances for this group.
Through manual analysis of the inconsistencies observed among alternate paths, this study revealed three major types of issues at the origin of the inconsistencies. The various types of inconsistency identified in the paths are presented in Table 3.
These inconsistencies involved clinical drug form (or branded drug form) entities linked to some ingredient (resp. brand name), but not linked to a clinical drug (resp. branded drug) entity. A total of 93 such inconsistencies were identified, affecting nine of the 20 paths exhibiting inconsistencies.
According to the RxNorm team, these inconsistencies do not necessarily violate the RxNorm editorial rules and can be justified by the fact that these clinical drug forms and branded drug forms are active concepts in at least one of the source vocabularies integrated in RxNorm. However, these entities might have an active status only in those source vocabularies updated with a lesser frequency (compared to most source vocabularies updated on a monthly basis.) Therefore, we believe these clinical drug forms and branded drug forms should have a special status as they are part of incomplete RxNorm graphs and might cause problems in applications (e.g., in computerized prescription systems.)
The three following subtypes of inconsistency can be distinguished based on the analysis of inconsistent paths.
Overall, of the 93 inconsistencies of type 1, 28 had been corrected in the January 2009 version of RxNorm.
These inconsistencies involved ingredient and brand name concepts linked to one another in a manner different from that used to relate the corresponding clinical drug and branded drug concepts. A total of 254 such inconsistencies were identified, affecting five of the 20 paths exhibiting inconsistencies.
In all three cases, the direct path IN→BN is inconsistent with alternate paths, such as IN→SCDC→SBD→BN. (We consider IN→SCDC→SBD→BN to be the reference path as it ensures that there is some SCD, through the SCDC, or SBD linked to the IN and BN.) According to the RxNorm team, these inconsistencies correspond to errors and are in the process of being corrected, when they have not been corrected already.
The three following subtypes of inconsistency can be distinguished based on the analysis of inconsistent paths.
Overall, of the 254 inconsistencies of type 2, 186 had been corrected in the January 2009 version of RxNorm.
What looks like a mix-up between two ingredients causes one inconsistency that is reflected in seven of the 20 paths exhibiting inconsistencies. In this case, although the alternate paths sometimes exhibit the same numbers of target instances, the sets of target instances are actually different. The two ingredients involved in the mix-up are Omega-3 Acid Ethyl Esters (USP) and Fatty Acids, Omega-3. As nothing general is to be learned from this error, we do not report it here in detail. This problem had been corrected in the January 2009 version of RxNorm.
Overall, the major types of inconsistency identified in the RxNorm dataset include extraneous nodes (type 1 inconsistencies), missing relations (type 2a inconsistencies) and extraneous relations (type 2b inconsistencies.) Of the 348 inconsistencies identified in the April 2008 version of RxNorm, 215 (62%) had been corrected in January 2009.
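The totals reported above can be cross-checked from the per-type counts given in the text (93 for type 1, 17 missing plus 237 extraneous links for type 2, and one ingredient mix-up). The dictionary keys below are labels chosen for this sketch.

```python
identified = {"type 1": 93, "type 2": 17 + 237, "mix-up": 1}
corrected = {"type 1": 28, "type 2": 186, "mix-up": 1}

total_identified = sum(identified.values())
total_corrected = sum(corrected.values())
percent_corrected = round(100 * total_corrected / total_identified)

print(total_identified, total_corrected, percent_corrected)  # 348 215 62
```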
Both the number of inconsistent alternate paths and the number of inconsistencies identified through their analysis are relatively modest (348 inconsistencies for 92,602 drug entities and 192,773 relations), which is a testimony to the high quality and careful curation of the RxNorm database. However, we believe this study is significant, because the underlying errors are difficult to identify. In fact, these inconsistencies had defeated the quality assurance mechanisms currently in place in the RxNorm production system and had not been reported to (and acted upon by) the RxNorm team by the user community in the several years RxNorm has been available. We believe that only a systematic, principled analysis can identify such errors in a large dataset. The list of inconsistencies we identified was shared with the RxNorm developers.
RxNorm relations link together the various kinds of drug entities. Exhaustiveness and correctness of the relations are important parameters if RxNorm is to be used in applications, such as electronic prescription systems and in conjunction with decision support systems. For example, in a prescription system, physicians should not be presented with ingredients for which no branded drugs are available. For decision support systems relying on links between brand names and ingredients to check drug interactions, it is critical that all necessary relations be consistently implemented.
The method we developed is fully automated and performs a systematic evaluation of the entire RxNorm dataset. The availability of the RxNorm API allowed us to reduce low-level programming to a minimum. Unlike other auditing methods, the graph-based process we developed for analyzing RxNorm characterizes inconsistencies and groups them in categories according to their origin. As a result, the inconsistencies reported to the RxNorm team can be processed in groups and the appropriate quality assurance mechanisms can be added to the production system.
Because of the specificity of RxNorm among biomedical terminologies (limited domain, absence of hierarchical structure, strong underlying graph model), traditional approaches to auditing terminologies are not directly applicable to RxNorm. Conversely, the graph-based approach developed for auditing RxNorm is not easily applicable to other biomedical terminologies. However, this study illustrates the need for automated, scalable methods, applied systematically to a terminology by an independent group of researchers.
Other limitations include the need for modifying the schema of the RxNorm database prior to running the auditing experiment. However, we see this limitation as minor, because the transformation is fully automated and, more importantly, it enables us to use the same simple algorithm for processing both single- and multi-ingredient drugs. Although applied to the entire RxNorm dataset, this study deliberately focuses on the eight major drug categories and ignores categories including dose forms (DF), generic pack (GPCK) and branded pack (BPCK.) However, these eight major categories represent more than 99% of all RxNorm entities at the instance level. In future work, we plan to audit the remaining categories as well. Inherent to this method is the impossibility of auditing paths between pairs of entities for which only a single path is available.
Finally, other approaches could be used to address the same issue, including role composition in a description logic-based environment. However, RxNorm is not available in any native description logic representation format and we found RxNorm to be amenable to graph-based approaches.
The normalization process developed for this study, which makes the representation of generic concepts compatible with that of branded concepts, is a critical element of our method. However, we do not recommend that the RxNorm developers change the current representation. In fact, reified combinations of ingredients and clinical drug component entities are artificial constructs, with no equivalent in the real world, and would therefore not be useful to most users of RxNorm.
Some of the inconsistencies detected in this study call for additional quality assurance processes to be implemented in the RxNorm production system. For example, it would be easy to check if a given clinical drug form with links to an ingredient is also linked to at least one clinical drug.
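Such a check could look like the sketch below, which lists clinical drug forms that have an ingredient link but no clinical drug link. The data layout (sets of pairs), the function name and the placeholder instance names are assumptions made for illustration; they do not describe the RxNorm production system.

```python
def forms_without_drug(ingredient_links, drug_links):
    """Return clinical drug forms linked to an ingredient but to no clinical drug.

    ingredient_links: set of (ingredient, clinical drug form) pairs.
    drug_links: set of (clinical drug, clinical drug form) pairs.
    """
    forms_with_ingredient = {form for _, form in ingredient_links}
    forms_with_drug = {form for _, form in drug_links}
    return sorted(forms_with_ingredient - forms_with_drug)


# Placeholder data: "X Oral Solution" has an ingredient but no clinical drug.
ingredient_links = {("X", "X Oral Tablet"), ("X", "X Oral Solution")}
drug_links = {("X 5 MG Oral Tablet", "X Oral Tablet")}
flagged = forms_without_drug(ingredient_links, drug_links)
```

The same function would apply on the branded side by passing brand name to branded drug form links and branded drug to branded drug form links.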
This study forced us to formalize what constitutes a meaningful path for traversing the RxNorm graph. Although a small number of constraints are required for ensuring meaningful traversal of the graph, we found it difficult to formulate these constraints. As the use of RxNorm increases, we suggest that guidance be added to the RxNorm documentation regarding traversal of the RxNorm graph.
Finally, the RxNorm graph contains some redundancy, but redundancy is not present systematically throughout the graph. On the one hand, it might be better to provide users with the minimal number of relations necessary for traversing the graph in a meaningful way. This option would call for removing the direct relation between ingredient and clinical drug form, for example, as it can be reconstructed through the path IN→SCDC→SCD→SCDF. On the other hand, it might be useful to some users to have a fully saturated set of relations. This option would call for adding a direct relation between ingredient and clinical drug, mirroring the relation between brand name and branded drug on the brand side.
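To make the redundancy concrete, the following Python sketch compares the direct relation between an ingredient and its clinical drug forms with the forms reached through the path IN→SCDC→SCD→SCDF. It is an illustration under stated assumptions rather than the code used in this study: the dictionary-of-sets representation, the function names `follow` and `compare_paths`, and the toy identifiers are hypothetical.

```python
def follow(start_nodes, relation):
    """Return every node reachable from start_nodes in one step of relation."""
    reached = set()
    for node in start_nodes:
        reached |= relation.get(node, set())
    return reached

def compare_paths(ingredient, direct, steps):
    """Compare the targets of a direct relation with those of a multi-step path.

    `direct` and each element of `steps` map a node id to a set of node ids
    (an assumed layout, one mapping per relation type).
    """
    via_path = {ingredient}
    for relation in steps:
        via_path = follow(via_path, relation)
    via_direct = direct.get(ingredient, set())
    return {
        "only_direct": via_direct - via_path,  # reached by the direct link only
        "only_path": via_path - via_direct,    # reached by the long path only
    }

# Toy data (invented identifiers): the direct relation reaches scdf2,
# which the path IN→SCDC→SCD→SCDF does not.
in_to_scdc = {"in1": {"scdc1"}}
scdc_to_scd = {"scdc1": {"scd1"}}
scd_to_scdf = {"scd1": {"scdf1"}}
in_to_scdf = {"in1": {"scdf1", "scdf2"}}

result = compare_paths("in1", in_to_scdf, [in_to_scdc, scdc_to_scd, scd_to_scdf])
print(result)
```

When both sets are empty the two paths agree and the direct relation is redundant for that ingredient. A non-empty set signals a discrepancy but does not say which side is wrong, so the case still needs manual inspection.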
Through the graph-based method we developed for auditing RxNorm and applied to the entire RxNorm dataset (April 2008), we identified 348 inconsistencies, including missing nodes (93), missing links (17), extraneous links (237) and one case of mix-up between two ingredients. We shared our findings with the RxNorm team. Of these inconsistencies, 215 (62%) had been corrected in the January 2009 version of RxNorm and the remaining inconsistencies are under review. Our auditing method proved effective in identifying a limited number of errors that had defeated the quality assurance mechanisms currently in place in the RxNorm production system, despite the high quality and careful curation of the RxNorm dataset in general. Based on our analysis, we recommended some changes to the RxNorm quality assurance process, as well as additions to the RxNorm documentation.
This study illustrates the need for principled, automated, scalable methods, applied systematically to the entire content of a terminology by an independent group of researchers. The lessons learned from this auditing experiment can be summarized as follows. Auditing needs to be grounded in domain knowledge (e.g., the constraints defined for selecting meaningful paths). Because curation is a labor-intensive process, auditing methods need to have good specificity if they are to be used to focus the attention of the editors of the terminology on particular areas. It is also useful for auditing methods to characterize the errors they identify in order to facilitate the work of the editors. Auditing methods need to be automated and scalable, so they can be repeatedly applied to the entire content of the terminology as necessary (e.g., when updates become available). Independent auditing is important, because close proximity to the production process – including its tools, constraints (e.g., time and resources), culture and traditions – makes it difficult to imagine or implement solutions that deviate from the production routine (e.g., modifying the database schema for auditing purposes). Finally, the result of the auditing process should be used not only to identify areas of the content of the terminology in need of review, but, more importantly, to inform the quality assurance process implemented as part of the terminology production environment. In other words, quality assurance has to be thought of as a proactive, not reactive, process in the life cycle of a terminology.
This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM). The authors wish to thank the RxNorm development team at NLM. In particular, we wish to thank Tammy Powell for providing useful feedback on the inconsistencies identified in this study and Stuart Nelson for his support.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
1 RxNorm provides a “semantic normal form” for drug names and the RxNorm documentation refers to the 8 major categories as ingredient (IN), semantic clinical drug component (SCDC), semantic clinical drug form (SCDF), semantic clinical drug (SCD), brand name (BN), semantic branded drug component (SBDC), semantic branded drug form (SBDF) and semantic branded drug (SBD). For readability in this article, we drop the “semantic” qualifier from these names.
2 Other categories in RxNorm include drug forms (DF), generic pack (GPCK) and branded pack (BPCK). We deliberately focus on the 8 major categories, which represent more than 99% of all RxNorm entities at the instance level.