The Pharmacogenomics Evidence Mapping for Reasoning with Individualized Clinical Data (PEMRIC) model incorporates an evidential approach to calculating phenotype scores from patient-specific genotype data, in the context of making prescribing decisions. Six different approaches to calculating phenotype scores are described in this manuscript that use the following methods:
- Method of selecting evidence: enzyme-oriented or drug-oriented (ie, KB contains all ‘allelic variant’–‘enzyme activity’ association facts reported in the literature, or KB contains only drug-specific ‘allelic variant’–‘enzyme activity’ facts)
- Method of calculating phenotype score: enzyme activity or metabolizer activity (ie, calculations incorporate ‘allelic variant’–‘enzyme activity’ facts or ‘genotype’–‘metabolizer activity’ facts, respectively)
- Method of calculating phenotype score: unweighted or weighted (ie, weights assigned to enzymes according to their level of involvement in metabolic activities).
Below, we provide an overview of PEMRIC model methods for selecting and reasoning with evidence to provide phenotype scores.
A model for reasoning with pharmacogenomic evidence and patient clinical data
The PEMRIC model builds on the approach used to develop the DIKB that incorporates an EB, a KB, and use of the DIKB evidence taxonomy. The PEMRIC model extends the DIKB model to include evidence sources and clinical data sources (see ). Evidence sources utilized in this study are SuperCYP, PharmGKB, and a review article that reports ‘genotype’–‘metabolizer activity’ associations for CYP enzymes involved in tamoxifen metabolism.2 The clinical data source used in this study was produced by the Specific Estrogen Receptor Modulator Pharmacogenetics (SERM) group (now called the Consortium on Breast Cancer Pharmacogenomics, or COBRA)15 and made publicly available through the PharmGKB website. The DIKB model and the PEMRIC model both represent knowledge as frames and take a rule-based strategy for reasoning. With a frame-based approach, objects are represented as classes, classes have attributes (or slots), and slots have assigned values.
We represent the following objects for the EB and KB: patient (patient class), patient genotype data (patient-enz-genotype class), evidence of medication PK mechanism (med-metabolite class), evidence of ‘allelic variant’–‘enzyme activity’ association (enz-allele-activity-publication class), evidence of ‘genotype’–‘metabolizer activity’ association (metabolizer-activity-publication class), and evidence of enzyme contribution in a medication metabolic pathway (enz-contribution class). lists examples of each class and describes components of the frame-based representation. Each object has a class name and one or more ‘slots,’ where each slot represents an important attribute of the object. In the PEMRIC model, each object represents an assertion (or fact).
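The frame-based representation can be illustrated with a minimal Python sketch. This is not the JESS encoding used in the study: the slot names of the first class follow the pubmed-id, enz-allele, enz-activity, and evidence-type slots named in this manuscript, whereas the slot names of the second class and the placeholder PubMed ID are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnzAlleleActivityPublication:
    """Frame for one 'allelic variant'-'enzyme activity' evidence item."""
    pubmed_id: str      # slot: pubmed-id
    enz_allele: str     # slot: enz-allele, e.g. "CYP2C9*2"
    enz_activity: str   # slot: enz-activity, e.g. "decreased"
    evidence_type: str  # slot: evidence-type, e.g. "in vitro"


@dataclass(frozen=True)
class PatientEnzGenotype:
    """Frame for one genetic test result (slot names are hypothetical)."""
    patient_id: str
    enzyme: str
    allele_1: str
    allele_2: str


# Each instance is one assertion (fact); the PubMed ID is a placeholder.
fact = EnzAlleleActivityPublication(
    pubmed_id="PMID-PLACEHOLDER",
    enz_allele="CYP2C9*2",
    enz_activity="decreased",
    evidence_type="in vitro",
)
print(fact.enz_allele, fact.enz_activity)
```

Making the frames immutable (frozen) mirrors the idea that an assertion, once stated, is a fixed fact that rules can match against.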
Example Java Expert System Shell (JESS) facts derived from the clinical data source, PharmGKB, SuperCYP, and review article sources
The PEMRIC model also incorporates rule-based reasoning using forward chaining inference. Rules are if–then statements, where a given set of conditions will lead to a set of results. Assertions are used to determine whether conditions defined in a rule are true. Given the conditions defined in a rule are true, an action will take place. Forward chaining inference specifically starts with a collection of assertions used to infer new assertions until a goal is reached or until nothing new can be inferred. Therefore, the action is often to infer a new assertion, which is then added to the EB or KB. Reasoning concludes when all relevant assertions are considered in calculating a phenotype score. Rules included in the PEMRIC EB and KB are described in .
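Forward chaining as described above can be sketched in a few lines of Python. This is a simplified stand-in for the JESS rule engine, not its actual algorithm; the tuple layout of facts, the `kb-association` label, and the publication identifiers are assumptions introduced for the example.

```python
def forward_chain(facts, rules):
    """Apply every rule repeatedly until no rule infers a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = rule(known) - known
            if new_facts:
                known |= new_facts
                changed = True
    return known


def believe_in_vitro(known):
    """IF a publication reports an association AND it is an in vitro
    experiment THEN assert the association (hypothetical fact layout)."""
    inferred = set()
    for fact in known:
        if fact[0] != "reports":
            continue
        _, publication, allele, activity = fact
        if ("has-evidence-type", publication, "in vitro experiment") in known:
            inferred.add(("kb-association", allele, activity))
    return inferred


facts = {
    ("reports", "pub-1", "CYP2D6*10", "decreased"),
    ("has-evidence-type", "pub-1", "in vitro experiment"),
    ("reports", "pub-2", "CYP2C9*2", "decreased"),  # no evidence type given
}
result = forward_chain(facts, [believe_in_vitro])
print(("kb-association", "CYP2D6*10", "decreased") in result)
```

The loop stops when a full pass over the rules adds nothing, which corresponds to the "until nothing new can be inferred" condition.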
Pseudocode description of evidence base (EB) and knowledge base (KB) rules
The PEMRIC model incorporates a subset of DIKB taxonomy evidence types to specify the belief criteria for including evidence of ‘allelic variant’–‘enzyme activity’ associations in the KB. Evidence types specified for publications include in vitro (in vitro experiment evidence type in box 1) and in vivo (retrospective and clinical trial study evidence types in box 1) evidence. A belief criterion distinct from the DIKB evidence taxonomy was specification of the drug of focus in a published study. We took two approaches to implementing the PEMRIC model to facilitate comparison of the predictability of phenotype scores calculated using an enzyme-oriented approach (without the criterion) and using a drug-oriented approach (with the criterion). Taking an enzyme-oriented approach, any publication reporting an ‘allelic variant’–‘enzyme activity’ association for a gene is included in the KB. For the enzyme-oriented approach, the following was one of three rules (a rule for each evidence type) included in the EB specifying whether an ‘allelic variant’–‘enzyme activity’ association is believed to be sufficient evidence for inclusion in the KB: IF publication q reports that allelic variant d AND publication q has-evidence-type ‘in vitro experiment’ THEN allelic variant d. With the drug-oriented approach, evidence provided by a publication is included in the KB if the drug of focus in the study is a drug being taken by the patient. For the drug-oriented approach, the following rule was one of three rules included in the EB: IF publication q reports that allelic variant d AND patient p is taking medication m AND publication q AND publication q has-evidence-type ‘in vitro experiment’ THEN allelic variant d.
The model also has flexibility to incorporate other evidence types. A new evidence type can be incorporated into the system by adding a rule to the EB, for example, by adding a rule specifying that an ‘allelic variant’–‘enzyme activity’ assertion of a non-traceable drug-label statement evidence type is believed to be sufficient evidence to be included in the KB: IF publication q reports that allelic variant d AND publication q has-evidence-type ‘a non-traceable drug-label statement’ THEN allelic variant d, where a non-traceable drug-label statement is an assertional statement found in a drug label that does not provide any traceable citations for its evidence support.8
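The belief criteria for the two approaches, and the way a new evidence type can be admitted, can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the EB rules themselves: the dictionary keys, the function name `is_believed`, and the exact evidence type labels are hypothetical.

```python
# Evidence types accepted in the evaluation (labels are approximations).
ACCEPTED_TYPES = {"in vitro experiment", "retrospective study", "clinical trial"}


def is_believed(evidence, accepted_types, patient_medications=None):
    """Return True if an evidence item may enter the KB.

    patient_medications=None gives the enzyme-oriented approach;
    passing a set of drugs gives the drug-oriented approach.
    """
    if evidence["evidence_type"] not in accepted_types:
        return False
    if patient_medications is None:
        return True
    return evidence["drug"] in patient_medications


evidence = {
    "enz_allele": "CYP2D6*10",
    "enz_activity": "decreased",
    "evidence_type": "in vitro experiment",
    "drug": "tamoxifen",
}
label_statement = dict(evidence, evidence_type="a non-traceable drug-label statement")

print(is_believed(evidence, ACCEPTED_TYPES))                  # enzyme-oriented
print(is_believed(evidence, ACCEPTED_TYPES, {"tamoxifen"}))   # drug-oriented

# Admitting a new evidence type only requires extending the accepted set,
# which plays the role of adding one more rule to the EB.
extended = ACCEPTED_TYPES | {"a non-traceable drug-label statement"}
print(is_believed(label_statement, extended))
```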
Prototype reasoning system design in a tamoxifen case study context
Our reasoning system is a prototype implementation of the PEMRIC model for reasoning with pharmacogenomics knowledge and clinical data. The initial trigger for the system is retrieval of a medical record number of a patient who is prescribed tamoxifen. Given the patient is being prescribed tamoxifen, the patient's genomics data (clinical source data), and KB facts, our reasoning system calculates a phenotype score. System components are summarized in .
Pharmacogenomics evidence sources and tamoxifen case study data
Data and knowledge sources from which we derive assertions include two evidence sources, one review article, and one clinical data source. The clinical data source produced by COBRA includes data for 30 subjects who received 20 mg/day tamoxifen. Genotype information utilized in this work includes the results from CYP3A5, genetic tests. The mode of ascertaining genotypes is specified in the PharmGKB dataset.16 Supplementary phenotypic information was obtained directly from the COBRA group. Specific phenotypic information that was utilized in this work includes measured amounts of endoxifen and NDM at 4 months after initiation of tamoxifen. Two patients were excluded from analyses because they did not have recorded values for endoxifen and NDM plasma levels. We considered metabolite plasma levels at 4 months because tamoxifen serum concentrations reach a steady state by 4 months.17 The primary phenotype we wished to predict using our scoring algorithms was the endoxifen/NDM plasma concentration ratio (as a marker for drug metabolism efficacy).
Evidence sources include PharmGKB and SuperCYP. We derived computable assertions from evidence of the drug PK pathway reported in PharmGKB. Also, with a focus on CYPs, we derived ‘allelic variant’–‘enzyme activity’ association assertions from SuperCYP (and in some cases PharmGKB). Since our evaluation is focused on patient genotype in the context of drug metabolism, we assume all PK pathway knowledge from PharmGKB to be true. In our evaluation, in vitro experiment, retrospective study, and clinical trial evidence types (the latter two referred to as ‘in vivo’ studies in the SuperCYP database) are considered acceptable evidence to support a given ‘allelic variant’–‘enzyme activity’ association assertion.
Currently, the information needed to define ‘genotype’–‘metabolizer activity’ associations is not captured in PharmGKB or SuperCYP (ie, metabolizer activity levels associated with various genotypes). CYP metabolizer activity levels in patients taking tamoxifen that result from CYP2C9, genotypes are described in a single review article,2 but CYP3A5 is not covered. Therefore, we derived assertions for CYP2C9, from the review article, and a conservative approach was taken to define ‘genotype’–‘metabolizer activity’ associations for CYP3A5 (ie, designation as ‘extensive metabolizer’ with at least one wild-type allele, and ‘intermediate metabolizer’ otherwise). Similar to metabolic properties reported in PharmGKB, these assertions are assumed to be true.
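The conservative CYP3A5 assignment described above amounts to a two-branch rule. A minimal sketch follows, assuming alleles are given as star-allele strings and that *1 denotes the wild-type allele; the function name is hypothetical.

```python
def cyp3a5_metabolizer(allele_1, allele_2, wild_type="*1"):
    """Conservative assignment: extensive metabolizer with at least one
    wild-type allele, intermediate metabolizer otherwise."""
    if wild_type in (allele_1, allele_2):
        return "extensive metabolizer"
    return "intermediate metabolizer"


print(cyp3a5_metabolizer("*1", "*3"))
print(cyp3a5_metabolizer("*3", "*3"))
```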
Collecting pharmacogenomics evidence for tamoxifen case study
Using tamoxifen as an example, we performed the following manual steps:
- Step 1. Define assertions for tamoxifen metabolism properties according to tamoxifen PK pathway details available on the PharmGKB website.3 18 As mentioned previously, these assertions are assumed to be true. Therefore, they are directly included in our KB.
- Step 2. Perform a SuperCYP polymorphism search for each enzyme involved in the tamoxifen PK pathway. Results include reports of ‘allelic variant’–‘enzyme activity’ associations and reference the PubMed IDs of publications containing evidence of each relationship.
- Step 3. Define gene ‘allelic variant’–‘enzyme activity’ association assertions and classify evidence types for each publication using the DIKB evidence taxonomy.
- Step 4. Enter evidence items into the EB.
- Step 5. Define ‘genotype’–‘metabolizer activity’ association assertions according to reports summarized in the review article2 (see ). These assertions are assumed to be true and are directly included in our KB.
Genotypes and their associated metabolizer phenotypes
Reasoning rules and objects
Our prototype system performs reasoning over PK pathway knowledge, and ‘allelic variant’–‘enzyme activity’ or ‘genotype’–‘metabolizer activity’ association knowledge to calculate a phenotype score. EB ‘allelic variant’–‘enzyme activity’ association facts are included in the KB if there is at least one primary research article classified as having drug metabolism identification experiment results to support the fact and according to whether an enzyme-oriented or drug-oriented approach is being taken. All PK pathway knowledge facts are included in the KB. We define both facts and rules using the Java Expert System Shell (JESS),19 a Java-based rule engine and scripting environment. When contradictory evidence was observed (eg, CYP2D6*10 has reports indicating non-functional and decreased enzyme activity), we included the most commonly reported value in the KB (eg, CYP2D6*10 leading to decreased enzyme activity). If there was a tie, facts were included in the KB according to the following priority: increased>wild-type>decreased>non-functional. Once reasoning concludes, the system assigns a patient their phenotype score.
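The handling of contradictory evidence (most commonly reported value, with ties broken by the stated priority) can be sketched as follows. The function name and the list-of-labels input format are assumptions for illustration, and the input is assumed to be non-empty.

```python
from collections import Counter

# Tie-break priority stated in the text, highest priority first.
PRIORITY = ["increased", "wild-type", "decreased", "non-functional"]


def resolve_activity(reported):
    """Pick the most commonly reported activity; break ties by PRIORITY."""
    counts = Counter(reported)
    top = max(counts.values())
    tied = [activity for activity in counts if counts[activity] == top]
    return min(tied, key=PRIORITY.index)


# Two reports of 'decreased' outvote one report of 'non-functional'.
print(resolve_activity(["decreased", "non-functional", "decreased"]))
# One report each: the tie goes to the higher-priority label.
print(resolve_activity(["non-functional", "decreased"]))
```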
We will describe the main approaches taken to select evidence for phenotype score calculations (enzyme-oriented and drug-oriented approaches), to define facts used for calculations (‘allelic variant’–‘enzyme activity’ and ‘genotype’–‘metabolizer activity’ association assertions), and to incorporate into phenotype score calculations weighting values indicating the involvement of enzymes in metabolic activities (weighted and un-weighted approaches).
Six phenotype scoring algorithms are investigated in total: (1) enzyme-oriented, un-weighted, ‘allelic variant’–‘enzyme activity’ scoring algorithm; (2) drug-oriented, un-weighted, ‘allelic variant’–‘enzyme activity’ scoring algorithm; (3) drug-oriented, un-weighted, ‘genotype’–‘metabolizer activity’ scoring algorithm; (4) enzyme-oriented, weighted, ‘allelic variant’–‘enzyme activity’ scoring algorithm; (5) drug-oriented, weighted, ‘allelic variant’–‘enzyme activity’ scoring algorithm; and (6) drug-oriented, weighted, ‘genotype’–‘metabolizer activity’ scoring algorithm. An ‘enzyme-oriented, un-weighted, ‘genotype’–‘metabolizer activity’ scoring algorithm’ and an ‘enzyme-oriented, weighted, ‘genotype’–‘metabolizer activity’ scoring algorithm’ are not investigated because the review article2 utilized to determine metabolizer activity is specific to studies involving tamoxifen.
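The count of six follows from crossing the three binary design choices and dropping the two enzyme-oriented ‘genotype’–‘metabolizer activity’ combinations. A small sketch of that enumeration, with label strings chosen here for illustration:

```python
from itertools import product

selections = ["enzyme-oriented", "drug-oriented"]
weightings = ["un-weighted", "weighted"]
fact_types = ["allelic variant-enzyme activity", "genotype-metabolizer activity"]

# Keep every combination except enzyme-oriented selection paired with
# genotype-metabolizer facts, which the tamoxifen-specific review
# article cannot support.
algorithms = [
    (selection, weighting, fact_type)
    for selection, weighting, fact_type in product(selections, weightings, fact_types)
    if not (selection == "enzyme-oriented"
            and fact_type == "genotype-metabolizer activity")
]

for algorithm in algorithms:
    print(", ".join(algorithm))
```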
Evidence selection: enzyme-oriented approach
As described above in Step 3, in our evidence collection methods, we derive gene ‘allelic variant’–‘enzyme activity’ assertions from publications. Using the enzyme-oriented approach to reasoning, the following objects were captured for each ‘allelic variant’–‘enzyme activity’ fact: the PubMed ID of the publication containing the evidence; gene allelic variant (eg, CYP2C9*2); allele activity (eg, decreased); and evidence type (eg, in vitro) (see pubmed-id, enz-allele, enz-activity, and evidence-type slots for enz-allele-activity-publication object in ).
All evidence of gene ‘allelic variant’–‘enzyme activity’ was included in the KB as long as the evidence criterion described in the previous section was satisfied. With the drug-oriented approach, information about the drug studied is also incorporated in our reasoning algorithms.
Evidence selection: drug-oriented approach
The drug-oriented approach to reasoning required adding a drug slot to the enz-allele-activity-publication object for each ‘allelic variant’–‘enzyme activity’ fact representing the drug studied (see ). Evidence was then included in the KB if: (1) defined evidence criteria are satisfied; and (2) the study involves the drug of interest (ie, tamoxifen).
A subset of the phenotype scoring algorithms considered in this work incorporates enzyme weighting in the calculation. A weighted approach incorporates a numeric value indicating the involvement of an enzyme in the metabolic activities of the drug (ie, allelic variants in genes encoding major drug-metabolizing enzymes are weighted higher than those in genes encoding minor enzymes). All enzymes involved in the metabolism of the drug are considered equal contributors in the un-weighted approach.
Phenotype score calculation: un-weighted approach
The phenotype score is calculated according to equations (1) and (2):

$$P = \sum_{g} a_g \qquad (1)$$

$$P = \sum_{g} m_g \qquad (2)$$

where $P$ is the phenotype score calculated as the sum of allele activity levels ($a_g$) or metabolizer activity levels ($m_g$), across all genes ($g$). The ‘allelic variant’–‘enzyme activity’ scoring system and ‘genotype’–‘metabolizer activity’ scoring system are described below. We implemented both un-weighted and weighted approaches because we were interested in whether accounting for the relative contribution of enzymes involved in a drug metabolic pathway would affect our ability to predict drug metabolism efficacy.
Phenotype score calculation: weighted approach
With the weighted approach, the phenotype score is calculated according to equations (3) and (4):

$$P = \sum_{g} w_g a_g \qquad (3)$$

$$P = \sum_{g} w_g m_g \qquad (4)$$

These differ from equations (1) and (2) in the inclusion of a weight factor ($w_g$). Each allelic or genotype activity value is multiplied by the weight factor assigned to that gene. Genes encoding major and minor metabolizing enzymes for a given drug are assigned different weight factors. We added a new object to describe the enzyme contribution in the drug PK pathway (see enzyme-contribution object in ). An enzyme was identified as major or minor according to PharmGKB PK pathway evidence. In this case, enzymes CYP2D6 and CYP3A5 were described as major metabolizing enzymes, and CYP2C9 and CYP2C19 as minor metabolizing enzymes. The numeric major/minor values assigned are as follows:
- Gene encodes major metabolizing enzyme: 1.0
- Gene encodes minor metabolizing enzyme: 0.5
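The effect of the weight factors can be shown with a minimal sketch. The weights are those listed above for the tamoxifen case; the function name, the input layout (a list of numeric activity values per gene), and the example patient values are assumptions made for illustration, not data from the study.

```python
# Major enzymes weigh 1.0 and minor enzymes 0.5, as listed above.
WEIGHTS = {"CYP2D6": 1.0, "CYP3A5": 1.0, "CYP2C9": 0.5, "CYP2C19": 0.5}


def phenotype_score(activity_by_gene, weights=None):
    """Sum activity values across genes, optionally scaled per gene.

    activity_by_gene maps a gene to its numeric activity values
    (assumed here: one value per allele, or a single metabolizer value).
    weights=None treats every gene as an equal contributor.
    """
    total = 0.0
    for gene, values in activity_by_gene.items():
        weight = 1.0 if weights is None else weights[gene]
        total += weight * sum(values)
    return total


# Hypothetical patient, one activity value per allele.
patient = {
    "CYP2D6": [1.0, 0.5],
    "CYP3A5": [0.0, 0.0],
    "CYP2C9": [1.0, 0.5],
    "CYP2C19": [1.0, 1.0],
}
print(phenotype_score(patient))           # un-weighted
print(phenotype_score(patient, WEIGHTS))  # weighted
```

In this example the minor enzymes contribute half as much to the weighted score, so the two scores differ only through CYP2C9 and CYP2C19.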
Phenotype scoring algorithms use one of two methods that leverage existing pharmacogenomics knowledge to assign numeric values to patient genetic variant status. One method assigns numeric values according to ‘allelic variant’–‘enzyme activity’ associations, and the other method assigns numeric values according to ‘genotype’–‘metabolizer activity’ associations.
Phenotype score: ‘allelic variant’–‘enzyme activity’ scoring system
Allelic activities (a) utilized in the phenotype score calculations are assigned according to a convention consistent with other studies10–12:
- Increased allele activity: 1.5
- Wild-type allele activity: 1.0
- Decreased allele activity: 0.5
- Non-functional allele activity: 0.0
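As a worked illustration of this scoring system, the sketch below maps activity labels to the numeric values listed above and sums them for a patient. The function name and the example genotype are hypothetical, and summing both alleles of each gene is an assumption about how the per-gene values are combined.

```python
ALLELE_ACTIVITY = {
    "increased": 1.5,
    "wild-type": 1.0,
    "decreased": 0.5,
    "non-functional": 0.0,
}


def allele_activity_score(activity_labels_by_gene):
    """Un-weighted score: sum the numeric activity of every allele."""
    return sum(
        ALLELE_ACTIVITY[label]
        for labels in activity_labels_by_gene.values()
        for label in labels
    )


# Hypothetical patient with KB-derived activity labels for two genes.
patient_labels = {
    "CYP2D6": ("wild-type", "decreased"),
    "CYP2C19": ("increased", "non-functional"),
}
print(allele_activity_score(patient_labels))
```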
Phenotype score: ‘genotype’–‘metabolizer activity’ scoring system
Patient genotypes are assigned to metabolizer phenotypes as described in (see enz-allele-activity-publication object in ). Metabolizer activities (m) utilized in the phenotype score calculations are assigned according to a convention similar to the allele activity scoring system:
- Ultrarapid metabolizer (UM) metabolizer activity: 1.5
- Extensive metabolizer (EM) metabolizer activity: 1.0
- Intermediate metabolizer (IM) metabolizer activity: 0.5
- Poor metabolizer (PM) metabolizer activity: 0.0
PEMRIC model implementation: reasoning rules
Rules for reasoning with assertions defined in the KB are summarized in . Rules for including facts about ‘allelic variant’–‘enzyme activity’ associations, taking an enzyme-oriented approach, are described. Rules to assign metabolizer activity values based on an individual's genotype (two allelic variant activity values), that is, ‘genotype’–‘metabolizer activity’ associations, are not shown. Neither are the rules for applying a drug-oriented approach shown.
PEMRIC model implementation: reasoning facts
We took two main approaches (enzyme-oriented and drug-oriented) to encoding EB facts that were relevant to our tamoxifen case study dataset.
Taking the enzyme-oriented approach, our EB included 85 ‘allelic variant’–‘enzyme activity’ facts (from SuperCYP). However, there is currently only one publication reporting results from a tamoxifen study within SuperCYP. Therefore, in order to facilitate our drug-oriented approach to reasoning, we supplemented ‘allelic variant’–‘enzyme activity’ facts derived from SuperCYP with facts derived from curated publications in PharmGKB. PharmGKB has curated drug–gene relationships as well as curated gene variant annotations. Two authors of this manuscript reviewed publications curated by PharmGKB as having evidence of gene–tamoxifen relationships for CYP2D6, CYP2C19, and CYP3A5 genes (32 publications). This set of publications was narrowed down to nine publications that are in vitro experiment, or retrospective or clinical trial study evidence types, and that define the activity level for at least one gene allele.
For each PharmGKB publication, the authors read the publication and manually recorded all slot values for enz-allele-activity-publication object facts (example values are shown in ). Multiple enz-allele-activity-publication facts can be derived from a single publication (ie, one publication can report results on multiple populations and multiple allelic variants).
The facts directly included in the KB are: seven PK pathway facts (PharmGKB), 140 facts from the clinical data source (facts representing the existence of 28 test patients, and facts for the results of five genetic tests for each test patient), and 40 ‘genotype’–‘metabolizer activity’ association facts (one for each combination of increased, wild-type, decreased, and non-functional for two alleles of each gene). Fifteen of the 40 were ‘genotype’–‘metabolizer activity’ facts derived from the review article and 25 were assigned using a conservative approach (see ‘Activity scores for phenotype prediction: tamoxifen case study’ section). Example facts are shown in .
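The figure of 40 ‘genotype’–‘metabolizer activity’ facts can be checked by enumeration. The sketch assumes that a genotype is an unordered pair of allele activity levels and that the four genes are those named in the genotype distribution caption (CYP3A5, CYP2D6, CYP2C9, and CYP2C19); both assumptions are ours.

```python
from itertools import combinations_with_replacement

ACTIVITIES = ["increased", "wild-type", "decreased", "non-functional"]
GENES = ["CYP3A5", "CYP2D6", "CYP2C9", "CYP2C19"]

# Unordered pairs of four activity levels give 10 combinations per gene.
genotype_facts = [
    (gene, first, second)
    for gene in GENES
    for first, second in combinations_with_replacement(ACTIVITIES, 2)
]
print(len(genotype_facts))
```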
Tamoxifen case study evaluation: statistical methods
We evaluated whether phenotype scores (incorporating information about multiple enzymes) are predictive of differences in the endoxifen/NDM plasma ratio (as a marker for drug metabolism efficacy) for 28 patients. For each of six phenotype scoring algorithms, we performed both linear and quantile linear regression where the independent variable was the phenotype score and the dependent variable was the endoxifen/NDM plasma ratio. We report the more conservative of the two approaches (the quantile regressions). In addition, based on a suggestion from a reviewer, we performed quantile regression with bootstrapped standard errors. The number of bootstrap replicates was set at 2500. We found that one approach to calculating a phenotype score passed tests for significance, compared to three approaches without bootstrapping.
We also investigated how our knowledge-based approach to calculating phenotype scores compares to a simple metric representing the genotype of (1) a single gene, and (2) multiple genes. For the simple metric, for each patient, the genotype of each gene is designated as having two wild-type alleles (Wt/Wt), two variant alleles (Vt/Vt), or one variant and one wild-type allele (Wt/Vt). Simple metrics for individual genes were assigned as follows: Wt/Wt=0, Wt/Vt=1, Vt/Vt=2. Although not discussed in detail in this manuscript, we also evaluated a dominant model (Vt/Vt=0, Wt/Vt=1, Wt/Wt=1) and a recessive model (Vt/Vt=1, Wt/Vt=0, Wt/Wt=0) for representing the genotypes of multiple genes. A simple metric for multiple genes was the sum of these values across all genes. See for frequency counts. To evaluate the predictive power of individual genes, quantile linear regression was performed where the independent variable was the simple metric of an individual gene and the dependent variable was the endoxifen/NDM plasma ratio. To evaluate the predictive power of multiple genes, quantile linear regression was performed where the independent variable was the simple metric for multiple genes, and the dependent variable was the endoxifen/NDM plasma ratio.
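The three genotype codings can be written down directly as lookup tables. In this sketch the codings follow the values given above, while the name `ADDITIVE` for the primary simple metric, the function name, and the example genotypes are hypothetical.

```python
# Codings as given in the text; "ADDITIVE" is our label for the
# primary simple metric (Wt/Wt=0, Wt/Vt=1, Vt/Vt=2).
ADDITIVE = {"Wt/Wt": 0, "Wt/Vt": 1, "Vt/Vt": 2}
DOMINANT = {"Vt/Vt": 0, "Wt/Vt": 1, "Wt/Wt": 1}
RECESSIVE = {"Vt/Vt": 1, "Wt/Vt": 0, "Wt/Wt": 0}


def simple_metric(genotype_by_gene, coding=ADDITIVE):
    """Multi-gene simple metric: sum of per-gene coded values."""
    return sum(coding[genotype] for genotype in genotype_by_gene.values())


# Hypothetical patient genotypes.
patient_genotypes = {
    "CYP3A5": "Vt/Vt",
    "CYP2D6": "Wt/Vt",
    "CYP2C9": "Wt/Wt",
    "CYP2C19": "Wt/Vt",
}
print(simple_metric(patient_genotypes))
print(simple_metric(patient_genotypes, DOMINANT))
print(simple_metric(patient_genotypes, RECESSIVE))
```

The single-gene metric is the same lookup applied to one gene, for example `ADDITIVE[patient_genotypes["CYP2D6"]]`.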
Distribution of genotypes: CYP3A5 Wt=*1, Vt=*3,*6; CYP2D6 Wt=*1, Vt=*4,*6; CYP2C9 Wt=*1, Vt=*2,*3; and CYP2C19 Wt=*1, Vt=*2.
The predictive performance of the phenotype scores and of the simple metrics on the endoxifen/NDM plasma ratio was assessed with R², and significance was evaluated with p values. All statistical analyses were performed using Stata V.11.2 (StataCorp LP).