Data transformation summary
A total of 43 096 800 person records across both databases were included in this analysis; of this total, 76.1% are PHARMetrics records. The number of persons remained unchanged during the transformation process and is the same in the native and transformed data. The rofecoxib cohort comprises 1.07% of the total PHARMetrics persons and 1.82% of GE persons. The acute myocardial infarction cohort includes 0.67% of PHARMetrics persons and 0.52% of GE persons. The accompanying table provides summary statistics regarding the number of persons found within each native and transformed source file.
Person counts in native and transformed data for each database
Terminology Dictionary mapping
In total, 97.91% of PHARMetrics and 96.68% of GE drugs associated with prescription or medication records found within the Person data mapped to Concepts in our Terminology Dictionary (instance annotations). This represents 79.39% of PHARMetrics and 69.79% of GE unique Product Identifiers found within the Drug Reference files (distinct annotations). A review of the unmapped Product Identifiers reveals items such as ‘100 CC SYRINGE’ and ‘ATTENDS BRIEFS LARGE’ contained within the medication files of the source data; it is appropriate that these are not annotated for our purposes.
In both PHARMetrics and GE, multiple Vioxx Product Identifiers correctly annotate to the Drug Concept Vioxx within the Terminology Dictionary, and GE references to the generic Product Identifier Rofecoxib correctly annotate to the Rofecoxib NOC Concept.
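The difference between the instance and distinct annotation rates reported above can be illustrated with a minimal sketch. The function name `mapping_coverage`, its parameters, and the toy data are hypothetical and are not taken from the study; the sketch only assumes that the instance rate is computed over person-level records and the distinct rate over unique identifiers in the reference file.

```python
from collections import Counter

def mapping_coverage(record_codes, reference_codes, mapped_codes):
    """Return (instance_rate, distinct_rate) as percentages.

    record_codes:    source identifiers, one per person-level record
    reference_codes: set of unique identifiers in the reference file
    mapped_codes:    set of identifiers that annotate to a dictionary Concept
    """
    counts = Counter(record_codes)
    total_records = sum(counts.values())
    mapped_records = sum(n for code, n in counts.items() if code in mapped_codes)
    distinct_mapped = len(reference_codes & mapped_codes)
    instance_rate = 100.0 * mapped_records / total_records
    distinct_rate = 100.0 * distinct_mapped / len(reference_codes)
    return instance_rate, distinct_rate

# Hypothetical toy data: frequently used identifiers are mapped,
# while a rarely used non-drug item is not.
records = ["VIOXX 25MG"] * 6 + ["ROFECOXIB"] * 3 + ["100 CC SYRINGE"]
reference = {"VIOXX 25MG", "ROFECOXIB", "100 CC SYRINGE"}
mapped = {"VIOXX 25MG", "ROFECOXIB"}

instance_rate, distinct_rate = mapping_coverage(records, reference, mapped)
```

In this toy case the instance rate exceeds the distinct rate because the unmapped identifier occurs rarely in the person data, which is the same pattern seen in the drug mapping figures above.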
For Conditions, 84.56% of PHARMetrics and 84.11% of GE ICD-9 codes found within actual Person data (instance annotations) annotate to MedDRA Concepts, representing 75.41% of PHARMetrics and 85.47% of GE unique ICD-9 codes in the Condition Reference files (distinct annotations). Unmapped ICD-9 E and V codes account for approximately 12.01% and 13.80% of the unique codes in the Condition Reference files for PHARMetrics and GE, respectively, since there is no corresponding concept within MedDRA for a majority of these health and service indicator codes. The remaining 12.58% (PHARMetrics) and 0.73% (GE) of unique ICD-9 codes in the Condition Reference files are also unmapped because they are not represented in the current ICD-9 coding dictionary; they may reflect transcription errors during data entry or local modifications to the official ICD-9 coding scheme. These non-standard and/or invalid ICD-9 codes represent only 0.16% and 0.11% of the conditions found within the actual PHARMetrics and GE Person data, respectively. All ICD-9 codes starting with ‘410’ appropriately annotated to the Condition Concept Acute Myocardial Infarction.
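Read as shares of all unique ICD-9 codes in the Condition Reference files, the three distinct-code percentages reported above (mapped, E and V codes, and non-standard codes) sum to 100% for each database. The short check below uses only the figures quoted in the text; the dictionary keys are labels chosen here for illustration.

```python
# Shares of unique ICD-9 codes in the Condition Reference files (percent),
# as quoted in the text.
breakdown = {
    "PHARMetrics": {"mapped": 75.41, "e_v_codes": 12.01, "non_standard": 12.58},
    "GE": {"mapped": 85.47, "e_v_codes": 13.80, "non_standard": 0.73},
}

# Round to two decimals to absorb floating-point noise in the sums.
totals = {db: round(sum(parts.values()), 2) for db, parts in breakdown.items()}
```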
The accompanying table provides detailed statistics produced from the annotated Terminology Dictionary.
Standardized Terminology Dictionary mapping performance
Data aggregation results for drugs are consistent with the transformation rules which aggregate multiple occurrences of the same Drug Concept within the allowable persistence window of 30 days. The aggregation process reduces the overall number of Person drug records in both databases. For PHARMetrics, the total number of Drug Eras found in the transformed data is only 43.11% of the total number of Person drug records within the native data; for GE this number is 41.74%. The reduction for rofecoxib Drug Eras is similar, at 35.84% and 40.58% of the number of native rofecoxib drug records for PHARMetrics and GE, respectively.
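The aggregation of drug records into Drug Eras can be sketched as an interval merge. This is a minimal illustration, not the study's transformation code: the function name `build_drug_eras`, the record layout, and the rule that the gap is measured from the end of the current era to the start of the next record are all assumptions; only the 30-day persistence window comes from the text.

```python
from datetime import date, timedelta
from itertools import groupby

PERSISTENCE_WINDOW = timedelta(days=30)  # persistence window stated in the text

def build_drug_eras(records, window=PERSISTENCE_WINDOW):
    """Merge drug records into eras per (person, Drug Concept).

    records: iterable of (person_id, concept, start_date, end_date).
    A record joins the current era when it starts no more than `window`
    after the era's current end (assumed rule); otherwise a new era begins.
    """
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    eras = []
    for (person, concept), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        era_start = era_end = None
        for _, _, start, end in group:
            if era_start is None:
                era_start, era_end = start, end
            elif start - era_end <= window:
                era_end = max(era_end, end)  # extend the current era
            else:
                eras.append((person, concept, era_start, era_end))
                era_start, era_end = start, end
        eras.append((person, concept, era_start, era_end))
    return eras

# Hypothetical records: the second starts 11 days after the first ends
# (merged); the third starts 82 days after the second ends (new era).
records = [
    (1, "rofecoxib", date(2001, 1, 1), date(2001, 1, 30)),
    (1, "rofecoxib", date(2001, 2, 10), date(2001, 3, 11)),
    (1, "rofecoxib", date(2001, 6, 1), date(2001, 6, 30)),
]
eras = build_drug_eras(records)
```

Here three native records collapse into two eras, which shows how the number of Drug Eras can fall well below the number of native drug records.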
Within the native data, the average number of drug records per person is 21.9 in PHARMetrics and 22.52 in GE. The aggregation process reduces this number to an average of 9.44 Drug Eras per person in PHARMetrics and 9.40 Drug Eras per person in GE. The consistency of results across both database types is also seen for rofecoxib, although the average number of rofecoxib drug records per person is lower in both the native and transformed data.
The average length of a Drug Era in the transformed data is longer in GE than in PHARMetrics, although the length of rofecoxib exposure in GE is only slightly longer. In PHARMetrics, the overall average length of a transformed Drug Era is 2.55 times the average length of a prescription record in the native data (as measured by days' supply); a rofecoxib Drug Era is on average 2.91 times the length of a native prescription. Length of exposure cannot be obtained directly from the GE raw data; it must be inferred from a defined set of utilization rules, such as those applied here to enable estimation.
Condition Eras within the CDM represent the concept of an ‘episode of care’, which does not exist in the native data but is constructed from diagnosis information during the transformation process. The Condition aggregation process produces different results between the two types of databases, reflecting differences in the underlying motivation for recording a condition in each data source. For PHARMetrics, the number of Condition Eras in the transformed data is only 22.19% of the number of Conditions found in the native data; for GE this number is 75.64%, representing less aggregation of Condition Eras. For acute myocardial infarction, the pattern was consistent: Acute Myocardial Infarction Condition Eras represent only 5.28% and 62.01% of the number of original diagnoses of acute myocardial infarction found within PHARMetrics and GE, respectively.
Within the native data sources, the average number of diagnosis records per person is 110.52 in PHARMetrics and 7.95 in GE. The aggregation process reduces this number to 24.53 Condition Eras per person for PHARMetrics and 6.01 Condition Eras per person for GE. Condition aggregation also reduces the number of Acute Myocardial Infarction Condition Eras per person in the transformed data in both databases.
The average length of a Condition Era within the transformed data is significantly longer in GE than in PHARMetrics, both in the overall data and for Condition Eras representing acute myocardial infarctions.
The accompanying table presents the results of the data aggregation steps for the entire database as well as for rofecoxib and acute myocardial infarction.
Impact of data aggregation on drugs and conditions
The CDM performance metrics were produced from a single analysis program executed against the transformed data in both databases. An accompanying table provides summary statistics produced by the analysis program regarding the total number of persons within each transformed source file, as well as the count of persons in the rofecoxib and acute myocardial infarction cohorts. The transformed person counts match the statistics produced from the native data.
The accompanying tables summarize the cohort demographics for the rofecoxib and acute myocardial infarction cohorts compared with the database background for the transformed GE and PHARMetrics data, and compare the concomitant medications found within each transformed database for the same two cohorts. Although the underlying data within each source database were recorded in different ways and for different reasons, the CDM has allowed us to execute one systematic analysis across both databases utilizing standard definitions and assumptions and to present the results in a consistent format enabling common interpretation.
Demographic summary of rofecoxib cohort in each transformed database.
Demographic summary of the acute myocardial infarction (AMI) cohort in each transformed database.
Comparison of concomitant medications within each transformed database for the rofecoxib cohort.
Comparison of concomitant medications occurring during acute myocardial infarction Condition Eras, within each transformed database.
These results highlight many similarities, as well as a few differences, in the underlying populations captured by administrative claims versus electronic medical records databases. For instance, 19.9% of the persons in the GE database are 65 years old or older, while only 8.8% within the PHARMetrics database are 65 or older. This is characteristic of claims databases: the elderly population in the USA is eligible for healthcare coverage provided by the government rather than private coverage, and is therefore less represented in private claims data. Although the concomitant medications reported among comparable cohorts are strikingly similar across both databases, one major discrepancy, concomitant aspirin use in both cohorts, highlights differences in the underlying data capture purposes of the two data sources. Over-the-counter medications such as aspirin are not typically reimbursed by health insurance providers, but their use is recorded by healthcare providers in an electronic medical record (EMR).