Drug pharmacokinetics parameters, drug interaction parameters, and pharmacogenetics data have been unevenly collected in different databases and published extensively in the literature. Without an appropriate pharmacokinetics ontology and a well-annotated pharmacokinetics corpus, it will be difficult to develop text mining tools for pharmacokinetics data collection from the literature and for pharmacokinetics data integration from multiple databases.
A comprehensive pharmacokinetics ontology was constructed. It can annotate all aspects of in vitro pharmacokinetics experiments and in vivo pharmacokinetics studies. It covers all drug metabolism and transportation enzymes. Using our pharmacokinetics ontology, a PK-corpus was constructed to present four classes of pharmacokinetics abstracts: in vivo pharmacokinetics studies, in vivo pharmacogenetic studies, in vivo drug interaction studies, and in vitro drug interaction studies. A novel hierarchical three level annotation scheme was proposed and implemented to tag key terms, drug interaction sentences, and drug interaction pairs. The utility of the pharmacokinetics ontology was demonstrated by annotating three pharmacokinetics studies; and the utility of the PK-corpus was demonstrated by a drug interaction extraction text mining analysis.
The pharmacokinetics ontology annotates both in vitro pharmacokinetics experiments and in vivo pharmacokinetics studies. The PK-corpus is a highly valuable resource for the text mining of pharmacokinetics parameters and drug interactions.
Pharmacokinetics (PK) is an important translational research field that studies drug absorption, disposition, metabolism, excretion, and transportation (ADMET). PK systematically investigates the physiological and biochemical mechanisms of drug exposure in multiple tissue types, cells, animals, and human subjects. There are two major molecular mechanisms of a drug’s PK: metabolism and transportation. Drug metabolism happens mainly in the gut and liver, while drug transportation occurs in all tissue types. If PK can be interpreted as what the body does to the drug, pharmacodynamics (PD) can be defined as what the drug does to the body. A drug’s pharmacodynamic effect ranges widely from molecular signals (such as its targets or downstream biomarkers) to clinical symptoms (such as efficacy or side effect endpoints).
Drug-drug interaction (DDI) is another important pharmacology concept. A DDI occurs when one drug’s PK or PD response is changed by the presence of another drug. PD based drug interactions have a wide range of interpretations (e.g. from molecular markers to clinical endpoints). PK based drug interaction mechanisms are well defined: metabolism enzyme based and transporter based DDIs. Pharmacogenetic (PG) variations in a drug’s PK and PD pathways can also affect its responses. In this paper, we focus our discussion on PK, PK based DDI, and PK related PG.
Although significant efforts have been invested to integrate biochemistry, genetics, and clinical information for drugs, significant gaps exist in the area of PK. For example, DrugBank (http://www.drugbank.ca/) does not have in vitro PK and its associated DDI data; DiDB (http://www.druginteractioninfo.org/) does not have sufficient PG data; and PharmGKB (http://www.pharmgkb.org/) does not have sufficient in vivo and in vitro PK and associated DDI data. As an alternative approach for collecting PK data from the published literature, text mining has only recently begun to be explored [1-4].
For both database construction and literature mining, the main challenge of PK data integration is the lack of a PK ontology. In this paper, we first developed a PK ontology. A PK corpus was then constructed, which facilitated DDI text mining from the literature.
The PK ontology is composed of several components: experiments, metabolism, transporter, drug, and subject (Table 1). Our primary contribution is the ontology development for the PK experiment, and integration of the PK experiment ontology with other PK-related ontologies.
Experiment specifies in vitro and in vivo PK studies and their associated PK parameters. Table 2 presents definitions and units of the in vitro PK parameters. The PK parameters of the single drug metabolism experiment include the Michaelis-Menten constant (Km), maximum velocity of the enzyme activity (Vmax), intrinsic clearance (CLint), metabolic ratio, and fraction of metabolism by an enzyme (fmenzyme). In the transporter experiment, the PK parameters include apparent permeability (Papp), the ratio of the basolateral to apical permeability and apical to basolateral permeability (Re), radioactivity, and uptake volume. There are multiple drug interaction mechanisms: competitive inhibition, non-competitive inhibition, uncompetitive inhibition, mechanism based inhibition, and induction. IC50 is the inhibitor concentration that reduces enzyme activity to 50%; it is substrate dependent, and it does not imply the inhibition mechanism. Ki is the inhibition constant for competitive, non-competitive, and uncompetitive inhibition. It represents the inhibitor concentration that reduces enzyme activity to 50%, and it is independent of substrate concentration. Kdeg is the degradation rate constant for the enzyme. KI is the concentration of inhibitor associated with half maximal inactivation in mechanism based inhibition, and Kinact is the maximum degradation rate constant in the presence of a high concentration of inhibitor in mechanism based inhibition. Emax is the maximum induction rate, and EC50 is the concentration of inducer associated with half maximal induction.
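The contrast between a substrate dependent IC50 and a substrate independent Ki can be illustrated with a minimal sketch. It assumes simple Michaelis-Menten kinetics with purely competitive inhibition, for which the standard Cheng-Prusoff relation IC50 = Ki(1 + [S]/Km) holds; the function names and the parameter values used in the usage lines are hypothetical and are not taken from Table 2.

```python
def velocity_competitive(s, i, vmax, km, ki):
    """Reaction velocity under competitive inhibition (Michaelis-Menten form).

    s: substrate concentration, i: inhibitor concentration,
    vmax, km: Michaelis-Menten parameters, ki: inhibition constant.
    """
    return vmax * s / (km * (1.0 + i / ki) + s)


def ic50_competitive(s, km, ki):
    """Cheng-Prusoff relation for competitive inhibition: IC50 = Ki * (1 + S / Km)."""
    return ki * (1.0 + s / km)


# Hypothetical values, chosen only for illustration.
VMAX, KM, KI = 100.0, 5.0, 2.0
for substrate in (1.0, 10.0, 50.0):
    ic50 = ic50_competitive(substrate, KM, KI)
    uninhibited = velocity_competitive(substrate, 0.0, VMAX, KM, KI)
    inhibited = velocity_competitive(substrate, ic50, VMAX, KM, KI)
    # Ki stays fixed while IC50 grows with the substrate concentration;
    # at the IC50 the velocity is half of the uninhibited velocity.
    print(substrate, ic50, inhibited / uninhibited)
```

Under these assumptions the same Ki yields a different IC50 at each substrate concentration, which is why an IC50 should be reported together with the substrate concentration at which it was measured.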
The in vitro experiment conditions are presented in Table 3. Metabolism enzyme experiment conditions include buffer, NADPH sources, and protein sources. In particular, protein sources include recombinant enzymes, microsomes, hepatocytes, etc. Sometimes, genotype information is available for the microsome or hepatocyte samples. Transporter experiment conditions include bi-directional transporter, uptake/efflux, and ATPase. Other factors of in vitro experiments include pre-incubation time, incubation time, quantification methods, sample size, and data analysis methods. All of this information can be found in the FDA draft guidance on drug interaction studies (http://www.abclabs.com/Portals/0/FDAGuidance_DraftDrugInteractionStudies2006.pdf).
The in vivo PK parameters are presented in Table 4. All of the information is summarized from two textbooks [1,8]. There are several main classes of PK parameters. Area under the concentration curve parameters are (AUCinf, AUCSS, AUCt, AUMC); drug clearance parameters are (CL, CLb, CLu, CLH, CLR, CLpo, CLIV, CLint, CL12); drug concentration parameters are (Cmax, CSS); extraction ratio and bioavailability parameters are (E, EH, F, FG, FH, FR, fe, fm); rate constants include the elimination rate constant k, absorption rate constant ka, urinary excretion rate constant ke, Michaelis-Menten constant Km, distribution rate constants (k12, k21), and two rate constants in the two-compartment model (λ1, λ2); blood flow rates are (Q, QH); time parameters are (tmax, t1/2); volume of distribution parameters are (V, Vb, V1, V2, Vss); the maximum rate of metabolism is Vmax; and ratios of PK parameters that present the extent of the drug interaction are (AUCR, CL ratio, Cmax ratio, Css ratio, t1/2 ratio).
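As an illustration of how a few of these parameters relate to sampled concentrations, the following sketch computes AUCt with the linear trapezoidal rule, an AUC ratio (AUCR) between two study phases, and a half-life from an elimination rate constant (t1/2 = ln 2 / k). The function names and the concentration-time values are hypothetical; the sketch is not the procedure of any particular study in the corpus.

```python
import math


def auc_trapezoid(times, concentrations):
    """AUC from the first to the last sampling time by the linear trapezoidal rule."""
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two matched time/concentration points")
    points = list(zip(times, concentrations))
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for (t0, c0), (t1, c1) in zip(points, points[1:]))


def half_life(k):
    """Half-life from a first-order elimination rate constant: t1/2 = ln(2) / k."""
    return math.log(2.0) / k


# Hypothetical sampling times (h) and plasma concentrations (ng/mL).
times = [0.0, 1.0, 2.0, 4.0]
substrate_alone = [0.0, 10.0, 8.0, 4.0]
with_inhibitor = [0.0, 20.0, 16.0, 8.0]

# AUCR: AUC of the substrate with the inhibitor divided by AUC of the substrate alone.
aucr = auc_trapezoid(times, with_inhibitor) / auc_trapezoid(times, substrate_alone)
```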
Table 4 also shows that two types of pharmacokinetics models are usually presented in the literature: non-compartment models and one- or two-compartment models. Multiple items need to be considered in an in vivo PK study. The hypotheses include the effect of bioequivalence, drug interaction, pharmacogenetics, and disease conditions on a drug’s PK. The design strategies are very diverse: single arm or multiple arms, cross-over or fixed order design, with or without randomization, with or without stratification, pre-screening or no pre-screening based on genetic information, prospective or retrospective studies, and case reports or cohort studies. The sample size includes the number of subjects and the number of plasma or urine samples per subject. The time points include sampling time points and dosing time points. The sample type includes blood, plasma, and urine. The drug quantification methods include HPLC/UV, LC/MS/MS, LC/MS, and radiographic.
CYP450 family enzymes predominantly exist in the gut wall and liver. Transporters are tissue specific. Table 5 presents the tissue specific transporters and their functions. The probe drug is another important concept in pharmacology research. An enzyme’s probe substrate is a substrate that is primarily metabolized or transported by this enzyme. To prove experimentally whether a new drug inhibits or induces an enzyme, the enzyme’s probe substrate is always used to demonstrate the enzyme’s activity before and after inhibition or induction. An enzyme’s probe inhibitor or inducer is a drug that primarily inhibits or induces this enzyme. Similarly, an enzyme’s probe inhibitor needs to be used when investigating whether a drug is metabolized by this enzyme. Table 6 presents all the probe inhibitors, inducers, and substrates of CYP enzymes. Table 7 presents all the probe inhibitors, inducers, and substrates of the transporters. All of this information was collected from the industry standard (http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm064982.htm) and reviewed in a top pharmacology journal.
Metabolism. The cytochrome P450 superfamily (officially abbreviated as CYP) is a large and diverse group of enzymes that catalyze the oxidation of organic substances. The substrates of CYP enzymes include metabolic intermediates such as lipids and steroidal hormones, as well as xenobiotic substances such as drugs and other toxic chemicals. CYPs are the major enzymes involved in drug metabolism and bioactivation, accounting for about 75% of the total number of different metabolic reactions. CYP enzyme names and genetic variants were mapped from the Human Cytochrome P450 (CYP) Allele Nomenclature Database (http://www.cypalleles.ki.se/). This site describes the effects of CYP450 genetic mutations on the protein sequence and enzyme activity, with associated references.
Transport proteins are proteins that serve the function of moving other materials within an organism. Transport proteins are vital to the growth and life of all living things. They are involved in the movement of ions, small molecules, or macromolecules, such as another protein, across a biological membrane. They are integral membrane proteins; that is, they exist within and span the membrane across which they transport substances. Their names and genetic variants were mapped from the Transporter Classification Database (http://www.tcdb.org). In addition, we added the probe substrates and probe inhibitors to each of the metabolism and transportation enzymes (see the description above).
Drug names were created using the drug names from DrugBank 3.0. DrugBank consists of 6,829 drugs, which can be grouped into the categories of FDA-approved drugs, FDA-approved biotech drugs, nutraceuticals, and experimental drugs. The drug names are mapped to generic names, brand names, and synonyms.
Subject included existing ontologies: the human disease ontology (DOID), the Suggested Ontology for Pharmacogenomics (SOPHARM), and the mammalian phenotype ontology (MP) from http://bioportal.bioontology.org (see Table 1). The PK ontology was implemented with Protégé and uploaded to the BioPortal ontology platform.
A PK abstract corpus was constructed to cover four primary classes of PK studies: clinical PK studies (n = 56); clinical pharmacogenetic studies (n = 57); in vivo DDI studies (n = 218); and in vitro drug interaction studies (n = 210). The PK corpus was constructed manually. The abstracts of clinical PK studies were selected from our previous work, in which the most popular CYP3A substrate, midazolam, was investigated. The clinical pharmacogenetic abstracts were selected based on the most polymorphic CYP enzyme, CYP2D6. We consider these two selection strategies to be representative of in vivo PK and PG studies. For the drug interaction studies, the abstracts were randomly selected from a PubMed query that used the probe substrates, inhibitors, and inducers for metabolism enzymes reported in Table 6.
Once the abstracts had been identified in the four classes, they were annotated manually (Figure 1). The annotation was first carried out by three master’s level annotators (Shreyas Karnik, Abhinita Subhadarshini, and Xu Han) and one Ph.D. annotator (Lang Li). They have different training backgrounds: computational science, biological science, and pharmacology. Any differentially annotated terms were further checked by Sara K. Quinney and David A. Flockhart, a Pharm.D. and an M.D. scientist, respectively, both with extensive pharmacology training. For annotations on which these two reviewers disagreed, a group review (Drs Quinney, Flockhart, and Li) was conducted to reach the final agreed annotations. In addition, a random subset of 20% of the abstracts that had consistent annotations among the four annotators (three master’s level and one Ph.D.) was double checked by two Ph.D. level scientists.
A structured annotation scheme was implemented to annotate three layers of pharmacokinetics information: key terms, DDI sentences, and DDI pairs (Figure 2). The DDI sentence annotation scheme depends on the key terms, and the DDI pair annotations depend on the key terms and DDI sentences. Their annotation schemes are described as follows.
Key terms include drug names, enzyme names, PK parameters, numbers, mechanisms, and change. The boundaries of these terms were judged across annotators by the following standards.
• Drug names were defined mainly based on DrugBank 3.0. In addition, drug metabolites were also tagged, because they are important in in vitro studies. The metabolites were identified by either a prefix or a suffix: oxi, hydroxyl, methyl, acetyl, N-dealkyl, N-demethyl, nor, dihydroxy, O-dealkyl, and sulfo. These prefixes and suffixes arise from the reactions of phase I metabolism (oxidation, reduction, hydrolysis) and phase II metabolism (methylation, sulphation, acetylation, glucuronidation).
• Enzyme names covered all the CYP450 enzymes. Their names are defined in the human cytochrome P450 allele nomenclature database, http://www.cypalleles.ki.se/. The variations of the enzyme or gene names were considered. Its regular expression is (?:cyp|CYP|P450|CYP450)?[0-9][a-zA-Z][0-9](?:\*[0-9])?$.
• PK parameters were annotated based on the in vitro and in vivo PK parameter ontology defined in Tables 2 and 4. In addition, some PK parameters have alternative names: CL = clearance, t1/2 = half-life, AUC = area under the concentration curve, and AUCR = area under the concentration curve ratio.
• Numbers such as dose, sample size, the values of PK parameters, and p-values were all annotated. If presented, their units were also covered in the annotations.
• Mechanisms denote the drug metabolism and interaction mechanisms. They were annotated by the following regular expression patterns: inhibit(e(s|d)?|ing|ion(s)?|or)$, catalyz(e(s|d)?|ing)$, correlat(e(s|d)?|ing|ion(s)?)$, metaboli(z(e(s|d)?|ing)|sm)$, induc(e(s|d)?|ing|tion(s)?|or)$, form((s|ed)?|ing|tion(s)?|or)$, stimulat(e(s|d)?|ing|ion(s)?)$, activ(e(s)?|(at)(e(s|d)?|ing|ion(s)?))$, and suppress(e(s|d)?|ing|ion(s)?)$.
• Change describes the change of PK parameters. The following words were annotated in the corpus to denote the change: strong(ly)?, moderate(ly)?, high(est)?(er)?, slight(ly)?, significant(ly)?, obvious(ly)?, marked(ly)?, great(ly)?, pronounced(ly)?, modest(ly)?, probably, may, might, minor, little, negligible, doesn’t interact, affect((s|ed)?|ing|ion(s)?)?$, reduc(e(s|d)?|ing|tion(s)?)$, and increas(e(s|d)?|ing)$.
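How such patterns might be applied to a sentence can be sketched as follows. This is a minimal illustration, not the annotation software used for the corpus: it assumes whitespace tokenization with surrounding punctuation stripped, applies the enzyme name pattern and three of the mechanism patterns listed above to each token, and uses hypothetical names (tag_tokens, ENZYME, MECHANISM) and a made-up example sentence.

```python
import re

# Enzyme name pattern as given in the key term rules (ASCII hyphens in the ranges).
ENZYME = re.compile(r"(?:cyp|CYP|P450|CYP450)?[0-9][a-zA-Z][0-9](?:\*[0-9])?$")

# Three of the mechanism patterns listed above; the rest follow the same form.
MECHANISM = [re.compile(pattern) for pattern in (
    r"inhibit(e(s|d)?|ing|ion(s)?|or)$",
    r"metaboli(z(e(s|d)?|ing)|sm)$",
    r"induc(e(s|d)?|ing|tion(s)?|or)$",
)]


def tag_tokens(sentence):
    """Return (token, tag) pairs for tokens matching an enzyme or mechanism pattern."""
    tags = []
    for raw in sentence.split():
        token = raw.strip(".,;:()")  # assumption: simple punctuation stripping
        if ENZYME.search(token):
            tags.append((token, "ENZYME"))
        elif any(pattern.search(token) for pattern in MECHANISM):
            tags.append((token, "MECHANISM"))
    return tags


# Made-up example sentence, not taken from the corpus.
example_tags = tag_tokens("Ketoconazole inhibited CYP3A4 and midazolam metabolism.")
```

Drug names are not tagged in this sketch because they require a dictionary lookup rather than a pattern.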
The middle level annotation focused on the drug interaction sentences. Because the two interacting drugs were not necessarily both present in a sentence, sentences were categorized into two classes:
• Clear DDI Sentence (CDDIS): two drug names (or a drug-enzyme pair in an in vitro study) are in the sentence with a clear interaction statement, i.e. an interaction, a non-interaction, or an ambiguous statement (such as possible or might).
• Vague DDI Sentence (VDDIS): one drug or enzyme name is missing from the DDI sentence, but it can be inferred from the context. A clear interaction statement is also required.
Once DDI sentences were labeled, the DDI pairs in the sentences were further annotated. Because of the fundamental difference between in vivo and in vitro DDI studies, their DDI relationships were defined differently. In in vivo studies, three types of DDI relationships were defined (Table 8): DDI, ambiguous DDI (ADDI), and non-DDI (NDDI). Four conditions are specified to determine these DDI relationships. Condition 1 (C1) requires that at least one drug or enzyme name is contained in the sentence; condition 2 (C2) requires that the other interacting drug or enzyme name can be found from the context if it is not in the same sentence; condition 3 (C3) specifies numeric rules to define the DDI relationships based on the PK parameter changes; and condition 4 (C4) specifies the language expression patterns for DDI relationships. Using the rules summarized in Table 8, DDI, ADDI, and NDDI can be defined by C1 ^ C2 ^ (C3 ^ C4). The priority rank of in vivo PK parameters is AUC > CL > t1/2 > Cmax. In in vitro studies, six types of DDI relationships were defined (Table 8). DDI, ADDI, and NDDI were similar to the in vivo DDIs, but three more drug-enzyme relationships were further defined: DEI, ambiguous DEI (ADEI), and non-DEI (NDEI). C1, C2, and C4 remained the same for in vitro DDIs. The main difference is in C3, in which Ki or IC50 (inhibition) or EC50 (induction) was used to define the DDI relationship quantitatively. The priority rank of in vitro PK parameters is Ki > IC50. Table 9 presents eight examples of how DDIs or DEIs were determined in the sentences.
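The priority ranks stated above can be read as a simple selection rule: when a sentence reports several PK parameters, the highest ranked one decides the relationship. The sketch below shows only this selection step. The numeric thresholds of condition C3 are defined in Table 8 and are deliberately not reproduced here; the function name and the example values are hypothetical.

```python
# Priority ranks as stated in the text, highest priority first.
IN_VIVO_PRIORITY = ["AUC", "CL", "t1/2", "Cmax"]
IN_VITRO_PRIORITY = ["Ki", "IC50"]


def primary_parameter(reported, priority):
    """Return the highest-priority PK parameter among those reported, or None.

    reported: mapping from PK parameter name to its reported value or change.
    """
    for name in priority:
        if name in reported:
            return name
    return None


# Hypothetical sentence reporting a Cmax ratio and a CL ratio: CL outranks Cmax.
chosen = primary_parameter({"Cmax": 1.4, "CL": 0.6}, IN_VIVO_PRIORITY)
```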
Krippendorff’s alpha was calculated to evaluate the reliability of annotations from four annotators. The frequencies of key terms, DDI sentences, and DDI pairs are presented in Table 10. Their Krippendorff’s alphas are 0.953, 0.921, and 0.905, respectively. Please note that the total DDI pairs refer to the total pairs of drugs within a DDI sentence from all DDI sentences.
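For readers unfamiliar with the statistic, a minimal sketch of Krippendorff’s alpha is given below. It assumes nominal (categorical) labels and the standard coincidence-matrix formulation, alpha = 1 - D_o / D_e; the text does not state which variant or software was used, so this is an illustration and not the original computation. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the annotators
    assigned to one unit (missing annotations are simply left out).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # units with fewer than two labels carry no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement counts (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Toy example with two annotators and three units (labels are illustrative).
toy_alpha = krippendorff_alpha_nominal([["DDI", "DDI"], ["DDI", "NDDI"], ["NDDI", "NDDI"]])
```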
The PK corpus was constructed by the following process. Raw abstracts were downloaded from PubMed in XML format. The XML files were then converted into the GENIA corpus format following the gpml.dtd from the GENIA corpus. The sentence detection in this step was accomplished by using the Perl module Lingua::EN::Sentence, which was downloaded from The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org). GENIA corpus files were then tagged with the three levels of PK and DDI annotations described above. Finally, a cascading style sheet (CSS) was implemented to differentiate colours for the entities in the corpus. This feature allows users to visualize annotated entities. We would like to acknowledge that a DDI Corpus was recently published as part of a text mining competition, DDIExtraction 2011 (http://labda.inf.uc3m.es/DDIExtraction2011/dataset.html). Their DDIs were clinical outcome oriented, not PK oriented. They were extracted from DrugBank, not from PubMed abstracts. Our PK corpus complements their corpus very well.
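The first two steps of this process, reading PubMed XML and splitting abstracts into sentences, can be sketched in a few lines. This is a stand-in under stated assumptions, not the original pipeline: a naive regular expression replaces Lingua::EN::Sentence, the output is a simplified sentence-wrapped XML that is not validated against gpml.dtd, and the function names, the pmid attribute, and the sample record (including its PMID of 0) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Naive boundary: sentence-ending punctuation, whitespace, then a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def abstract_sentences(pubmed_xml):
    """Return a list of (pmid, sentences) for each PubmedArticle in the XML string."""
    root = ET.fromstring(pubmed_xml)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        text = " ".join("".join(node.itertext()).strip()
                        for node in article.iter("AbstractText"))
        sentences = [s for s in SENTENCE_BOUNDARY.split(text) if s]
        records.append((pmid, sentences))
    return records


def to_sentence_xml(pmid, sentences):
    """Wrap sentences in a simplified article/abstract/sentence XML structure."""
    article = ET.Element("article", {"pmid": pmid})
    abstract = ET.SubElement(article, "abstract")
    for sentence in sentences:
        ET.SubElement(abstract, "sentence").text = sentence
    return ET.tostring(article, encoding="unicode")


# Hypothetical record for illustration only.
SAMPLE = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation><PMID>0</PMID>"
    "<Article><Abstract><AbstractText>Ketoconazole increased midazolam AUC. "
    "Clearance of midazolam decreased.</AbstractText></Abstract></Article>"
    "</MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
records = abstract_sentences(SAMPLE)
```

A regular expression of this kind will split incorrectly on abbreviations and on sentences that begin with a lower-case term, which is one reason to prefer a dedicated sentence splitter in practice.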
This example shows how to annotate a pharmacogenetics study with the PK ontology. We used a published tamoxifen PG study. The key information from this tamoxifen PG trial was extracted as a summary list. The pre-processed information was then mapped to the PK ontology (column 2 in Additional file 1: Table S1). This PG study investigates the genetic effects (CYP3A4, CYP3A5, CYP2D6, CYP2C9, CYP2B6) on the tamoxifen pharmacokinetics outcome (tamoxifen metabolites) among breast cancer patients. It was a single arm longitudinal study (n = 298); patients took SOLTAMOX™ 20 mg/day, and the drug steady state concentration was sampled 1, 4, 8, and 12 months after the tamoxifen treatment. The study population was a mix of Caucasians and African Americans. In Additional file 1: Table S1, the trial summary is organized by the PK ontology.
This was a cross-over three-phase drug interaction study (n = 24) between midazolam (MDZ) and ketoconazole (KTZ). Phase I was MDZ alone (IV 0.05 mg/kg and PO 4 mg); phase II was MDZ plus KTZ (200 mg); and phase III was MDZ plus KTZ (400 mg). Genetic variables include CYP3A4 and CYP3A5. The PK outcome is the MDZ AUC ratio before and after KTZ inhibition. Its PK ontology based annotation is shown in Additional file 1: Table S1, column three.
This was an in vitro study that investigated the drug metabolism activities of three enzymes (CYP3A4, CYP3A5, and CYP3A7) in a recombinant system. Using 10 CYP3A substrates, the authors compared the relative contributions of the three enzymes to the metabolism of the 10 drugs. Its PK ontology based annotation is shown in Additional file 1: Table S2.
We implemented a previously described all-paths graph kernel approach for the DDI extraction. Prior to performing DDI extraction, the testing and validation DDI abstracts in our corpus were pre-processed and converted into the unified XML format. The following steps were conducted:
• Drugs were tagged in each of the sentences using a dictionary based on DrugBank. This step revised the drug name annotations in the corpus described above. One purpose was to reduce redundant synonymous drug names. The other purpose was to keep only the parent drugs and remove the drug metabolites from the tagged drug names in our initial corpus, because parent drugs and their metabolites rarely interact. In addition, enzymes (i.e. CYPs) were also tagged as drugs, since enzyme-drug interactions have been extensively studied and published. The regular expression of enzyme names in our corpus was used to remove redundant synonymous gene names.
• Each sentence was subjected to tokenization, PoS tagging, and dependency tree generation using the Stanford parser.
• All C(n, 2) drug pairs from the n tagged drugs in a sentence were generated automatically, and they were assigned the default label of no drug interaction. Please note that if a sentence had only one drug name, this sentence did not have a DDI. This setup limited us to considering only clear DDI sentences (CDDIS) in our corpus.
• The drug interaction labels were then manually flipped based on their true drug interaction annotations from the corpus. Please note that our corpus had annotated DDIs, ADDIs, NDDIs, DEIs, ADEIs, and NDEIs. Here only DDIs and DEIs were labeled as true DDIs. The other ADDIs, NDDIs, ADEIs, and NDEIs were all categorized as no drug interaction.
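The pair generation and labeling steps above can be sketched as follows. The sketch assumes that the tagged entities of one sentence are available as a list and that the corpus annotations can be looked up per unordered pair; the function name, the data layout, and the example annotations are hypothetical and chosen only to illustrate the labeling rule (DDI and DEI are positive, everything else is negative).

```python
from itertools import combinations

# Only these corpus labels count as true interactions in the extraction task.
POSITIVE_LABELS = {"DDI", "DEI"}


def candidate_pairs(entities, annotations):
    """Generate all C(n, 2) entity pairs of a sentence with binary labels.

    entities: tagged drug or enzyme names in one sentence.
    annotations: mapping from frozenset({a, b}) to the corpus label of that pair.
    Pairs without an annotation keep the default label (no interaction).
    """
    pairs = []
    for a, b in combinations(entities, 2):
        label = annotations.get(frozenset((a, b)))
        pairs.append((a, b, label in POSITIVE_LABELS))
    return pairs


# Hypothetical sentence with three tagged entities and two annotated pairs.
entities = ["midazolam", "ketoconazole", "CYP3A4"]
annotations = {
    frozenset(("midazolam", "ketoconazole")): "DDI",
    frozenset(("ketoconazole", "CYP3A4")): "ADEI",
}
pairs = candidate_pairs(entities, annotations)
```

A sentence with a single tagged entity yields no pairs, which matches the restriction to clear DDI sentences noted above.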
The sentences were then represented with dependency graphs using the interacting components (drugs) (Figure 3). The graph representation of a sentence was composed of two items: i) one dependency graph structure of the sentence; and ii) a sequence of PoS tags (which was transformed into a linear order “graph” by connecting the tags with a constant edge weight). We used the Stanford parser to generate the dependency graphs. Airola et al. proposed to combine these two graphs into one weighted, directed graph. This graph was fed into a support vector machine (SVM) for DDI/non-DDI classification. More details about the all-paths graph kernel algorithm can be found in the work of Airola et al. A graphical representation of the approach is presented in Figure 3.
DDI extraction was performed on the in vitro and in vivo DDI corpora separately. Table 11 presents the training and testing sample sizes in both corpus sets, and Table 12 presents the DDI extraction performance. In extracting in vivo DDI pairs, the precision, recall, and F-measure in the testing set are 0.67, 0.79, and 0.73, respectively. In the in vitro DDI extraction analysis, the precision, recall, and F-measure in the testing set are 0.47, 0.58, and 0.52, respectively. In our earlier DDI research published in the DDIExtraction 2011 Challenge, we used the same algorithm to extract both in vitro and in vivo DDIs at the same time, and the reported F-measure was 0.66. This number lies between our current in vivo DDI extraction F-measure of 0.73 and in vitro DDI extraction F-measure of 0.52.
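The reported F-measures are consistent with the balanced F-measure (F1), the harmonic mean of precision and recall. The short check below assumes that F1 is the measure used; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported for the two testing sets.
in_vivo_f = round(f_measure(0.67, 0.79), 2)   # in vivo testing set
in_vitro_f = round(f_measure(0.47, 0.58), 2)  # in vitro testing set
```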
Error analysis was performed on the testing samples, and Table 13 summarizes the results. Among the known reasons for the false positives and false negatives, the most frequent is that there are multiple drugs in the sentence or that the sentence is long. Other reasons are that there is no direct DDI relationship between two drugs, but the presence of words such as dose or increase leads to a false positive prediction; that the DDI is presented in an indirect way; or that an NDDI is inferred from adjectives (little, minor, negligible).
A comprehensive PK ontology was constructed. It annotates both in vitro PK experiments and in vivo PK studies. Using our PK ontology, a PK corpus was also developed. It consists of four classes of PK studies: in vivo PK studies, in vivo PG studies, in vivo DDI studies, and in vitro DDI studies. This PK corpus is a highly valuable resource for text mining of drug interaction relationships.
We previously developed entity recognition tools to tag PK parameters and their associated numerical data. We showed that for one drug, midazolam, we achieved very high accuracy and recall in tagging the PK parameter clearance (CL) and its associated numerical values. However, using our newly developed PK corpus, we could not reproduce such good performance for a more general class of drugs and PK parameters. This area needs much further investigation.
The PK ontology is available in OWL for download at http://rweb.compbio.iupui.edu/corpus/ontology/ and can be accessed by using any OWL editor or viewer, e.g., Protégé. The PK corpora are available in XML at http://rweb.compbio.iupui.edu/corpus/.
ADMET: Absorption, disposition, metabolism, excretion, and transportation; DDI: Drug-drug interaction; KTZ: Ketoconazole; MDZ: Midazolam; POS: Part of speech; PK: Pharmacokinetics; PG: Pharmacogenetics.
The authors declare that they have no competing interests.
H-YW developed the three level hierarchical PK and DDI annotation scheme for the corpus; SK designed the PK corpus annotation implementation scheme and was one of the master’s level annotators; AS designed the PK ontology and was one of the master’s level annotators; ZW applied the PK ontology to three PK studies; SP collected the pharmacogenetics abstracts; Xu Han was one of the master’s level annotators; Chienwei Chiang collected the ontology information for the transporters; LLiu advised on the use of Protégé; MB, LMR and SKQ defined the in vitro and in vivo PK terminologies; SKQ was one of the Ph.D. level annotators; DF confirmed the disagreed annotations and double checked the PK terminologies and study design; and LLi contributed the idea, guided this research, and wrote the manuscript. All authors read and approved the final manuscript.
Additional file 1: Table S1. Clinical PK studies. Table S2. In vitro PK studies.
This work is supported by the U.S. National Institutes of Health grants R01 GM74217 (Lang Li) and AHRQ Grant R01HS019818-01 (Malaz Boustani), 2012ZX10002010-002-002 (Lei Liu), and 2012ZX09303013-015 (Lei Liu).