J Biomed Inform. Author manuscript; available in PMC 2017 April 1.
Published in final edited form as:
PMCID: PMC4837034

Prediction of Black Box Warning by Mining Patterns of Convergent Focus Shift in Clinical Trial Study Populations Using Linked Public Data

Handong Ma, MA and Chunhua Weng, PhD



To link public data resources for predicting post-marketing drug safety label changes by analyzing the Convergent Focus Shift patterns among drug testing trials.


We identified 256 top-selling prescription drugs between 2003 and 2013 and divided them into 83 BBW drugs (drugs with at least one black box warning label) and 173 ROBUST drugs (drugs without any black box warning label) based on their FDA black box warning (BBW) records. We retrieved 7499 clinical trials that each had at least one of these drugs for intervention from ClinicalTrials.gov. We stratified all the trials by pre-marketing or post-marketing status, study phase, and study start date. For each trial, we retrieved drug and disease concepts from clinical trial summaries to model its study population using medParser and SNOMED-CT. The Convergent Focus Shift (CFS) pattern was calculated and used to assess the temporal changes in study populations from pre-marketing to post-marketing trials for each drug. Then we selected 68 candidate drugs, 18 with a BBW and 50 without, that each had at least nine pre-marketing trials and nine post-marketing trials for predictive modeling. A random forest predictive model was developed to predict BBW acquisition incidents based on CFS patterns among these drugs. Pre- and post-marketing trials of BBW and ROBUST drugs were compared to look for differences in their CFS patterns.


Among the 18 BBW drugs, we consistently observed that the post-marketing trials focused more on recruiting patients with medical conditions previously unconsidered in the pre-marketing trials. In contrast, among the 50 ROBUST drugs, the post-marketing trials involved a variety of medications for testing their associations with target intervention(s). We found it feasible to predict BBW acquisitions using different CFS patterns between the two groups of drugs. Our random forest predictor achieved an AUC of 0.77. We also demonstrated the feasibility of the predictor for identifying long-term BBW acquisition events without compromising prediction accuracy.


This study contributes a method for post-marketing pharmacovigilance using Convergent Focus Shift (CFS) patterns in clinical trial study populations mined from linked public data resources. These signals are otherwise unavailable from individual data resources. We demonstrated the added value of linked public data and the feasibility of integrating clinical trial summaries and drug safety labels for post-marketing surveillance. Future research is needed to ensure better accessibility and linkage of heterogeneous drug safety data for efficient pharmacovigilance.

Keywords: Black Box Warning, Clinical Trial, Patient Selection, Convergent Focus Shift, Random Forest



Clinical trials are the gold standard for generating high-quality medical evidence. Pre-marketing clinical trials, ranging from phase I to III (sometimes also phase 0), validate the safety and efficacy of novel prescription drugs. Phase IV clinical trials, often called post-marketing surveillance trials, are designed to collect post-marketing drug information, including risks, benefits, and optimal use. Phase IV studies are crucial for clinical decision-making. The Food and Drug Administration (FDA) is responsible for regulating most medicines in order to ensure their safety. Black box warnings (BBWs) are the most severe medication-related safety warnings that the FDA can place on a drug label to indicate major drug-related risks [1]. If the post-marketing trials find too many adverse events, the FDA can restrict the use of the drug or even mandate that it be withdrawn from the market. About 20% of approved chemical entities are later found to cause severe adverse events and hence either receive a black box warning label or are withdrawn from the market [2]. Within the biomedical informatics domain, BBWs are frequently used as important predictors or gold standards in predicting adverse drug reactions (ADRs) or drug-drug combination safety [3,4]. Unfortunately, studies have shown a significant lag between a drug's approval and its acquisition of a BBW, ranging from 2 to 170 months [1]. This delay can cause substantial and unnecessary loss to patients and the healthcare industry. Therefore, it is crucial to predict human drug toxicity and future FDA actions accurately and in a timely manner.

Despite the challenges, many attempts have been made to tackle this problem. Previous studies utilized the FDA’s Adverse Event Reporting System (AERS) [5], medical literature [6], online health forums [7], or other data sources [8] to predict adverse effects or even FDA safety actions. Hochberg et al. found that by using AERS data from the 2–3 years following approval, more than half of the FDA actions that occurred in the next 2–4 years were predictable [5]. Natural language processing (NLP) and machine learning methods have also been developed for this purpose. A recent study used an ensemble classifier (bagging) to identify drugs that are most similar to watch-list and withdrawn drugs [7]. Despite those studies, researchers have rarely considered using clinical trial study population descriptions to predict future FDA actions.

The specification of the study population of a trial reflects the focus of the trial when testing a drug. A recently published study reviewed the importance of streamlining eligibility criteria, which play an essential role in clinical and translational research for study population specification [9]. In practice, a study may only focus on a particular population subgroup, for example, those with a certain medical condition. The study population specification is important for excluding factors that introduce confounders to an experiment. Often, researchers design the eligibility criteria to generate a “pure” but not “typical” trial population because they focus on a certain subgroup of patients [10]. This study population focus can shift over time, especially after a drug is launched. However, the phenomenon of study population focus shift has not received adequate attention or been utilized, partly because of the lack of data in the past.

With the massive public clinical trial information and drug safety reports available nowadays, we have an opportunity to forecast potential future BBW acquisitions before the completion of Phase IV trials. From ClinicalTrials.gov, which requires timely status updates for all clinical trials, we can get a complete picture of the past and present study population focuses of existing trials of varying phases and gain insights to inform BBW forecasting [11]. Several related studies have retrieved useful information from the eligibility criteria section on ClinicalTrials.gov [12–14]; however, the relationship between study population focus and future drug outcome remains unknown. A deeper insight into clinical trial patient selection may enable us to predict which drugs are harmful based on their study population descriptions.

In this study, we investigated the correlation between drug safety label changes and study population focus shift patterns for existing interventional drug trials. We defined the Convergent Focus Shift (CFS) pattern for each prescription drug as the converged focus in post-marketing trials compared to that in pre-marketing trials. We hypothesized that drugs with potential safety warnings have different CFS patterns compared to those without warnings. For example, studies of the smoking cessation drug Chantix recruited mainly smokers without cardiovascular disease before its approval by the FDA (pre-marketing trials). However, many studies shifted their focus to depressed patients after the drug was approved for sale, which was followed by reports of serious side effects in depressive patients [15]. Since monitoring the CFS pattern does not require trial outcomes, it has little time lag for post-marketing pharmacovigilance compared to traditional outcome-based warning systems. Understanding the study population CFS patterns and their correlation with adverse events may help researchers assess a drug’s potential safety issues and predict future black box warning acquisitions.


2.1 Candidate Drug Selection

We evaluated FDA-approved prescription drugs for human use that were among the top sellers in the United States between 2003 and 2013, based on drug type information from the drugs@fda database [16]. We assessed the popularity of a drug based on its retail sales in the US, using the sales data reported in [17]. We identified 402 drugs that appeared at least once in the top-selling lists during these 11 years, including 200 drugs hitting the list between 2003 and 2013 and 100 drugs between 2011 and 2013. Those drugs were in common daily use and might affect a large patient population, which makes it crucial to predict their safety-related issues.

2.2 Black Box Warning Label Extraction

At present, there exists no satisfactory FDA black box warning label database that contains both the label text and the BBW acquisition date. Many previous studies manually checked the Physicians’ Desk Reference (PDR) Network [18,19] for drug labeling information. This manual method could only roughly identify the year of BBW acquisition and was thus too imprecise for this study. Instead, we gathered the BBW information in a two-step semi-automatic manner. First, we automatically extracted all drug labeling information from PDR websites and created a drug database with FDA boxed warning text. This database contained 3068 drugs along with their black box warning labels (if any). Second, we obtained each drug’s first BBW acquisition date via manual web search, based on the dates specified in news and FDA safety communications, so that the BBW data were precise at the month level. If no specific day was provided for BBW acquisition, we used the first day of that month in calculation. For example, if the resource states that “In January 2014, the FDA issued a black box warning for losartan”, we set the acquisition date to 01/01/2014. We excluded drugs without month information from this study.
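The month-level date rule above can be illustrated with a minimal sketch. The function name `normalize_bbw_date` and the two accepted input formats are assumptions made for illustration; the source does not describe how dates were actually recorded.

```python
from datetime import date, datetime

def normalize_bbw_date(text):
    """Return a date for 'MM/DD/YYYY' or 'Month YYYY' input, or None otherwise.

    Hypothetical helper: when only month and year are given, strptime
    defaults the day to 1, which matches the first-day-of-month rule.
    """
    for fmt in ("%m/%d/%Y", "%B %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    # No month information (e.g. year only): such a drug would be excluded.
    return None

print(normalize_bbw_date("January 2014"))  # 2014-01-01
```

Returning `None` for year-only input mirrors the exclusion of drugs that lack month information.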

In order to tag the popular drugs with black box warning information, we mapped the drug names from the top-selling lists to the Physicians’ Desk Reference (PDR). In addition to the exact matches of drug names, we did a manual mapping to retrieve drug names with different semantic representations. We grouped trials for each drug regardless of dosages (50mg, 100mg, etc.) or product types (spray, tablets, scalp-solution, oral-solution, etc.). For example, we mapped Lamisil-tablets to Lamisil-oral, Exelon-patch to Exelon, and Children’s-Zyrtec-syrup to Zyrtec-syrup. We separated the drugs into two groups: those labeled with a black box warning of unexpected adverse events and those without one. We refer to the drugs with a BBW label as the “BBW” group and those without a warning as the “ROBUST” group. Only the “BBW” drugs with both label content and acquisition date information were included as candidate drugs in further analysis.

2.3 Clinical Trial Information Processing

We retrieved all interventional trial lists for all candidate drugs using the ClinicalTrials.gov search API. A trial can be mapped to multiple drugs if it uses a multiple-drug intervention. All trial contents were then downloaded from the ClinicalTrials.gov database [20] with information from all structured fields defined in its XML schema. The first marketing date for each drug was retrieved from the drugs@fda database [16]. If multiple “start marketing” dates occurred for a single drug, we chose the earliest date, since most of the later dates only pertain to different products of the same drug. For example, one of the most popular drugs, Nexium (Esomeprazole Magnesium), has multiple “start marketing” dates in the drugs@fda database, ranging from 06/28/2001 to 02/22/2013. While the first product was the delayed-release capsule with dosages of 20mg and 40mg, most of the later marketing dates were related to different dosages or forms, such as granule and injection, of the same ingredient. Thus, we used 06/28/2001 as the marketing date for our analysis.

Based on a drug’s “start marketing” date from the drugs@fda database and the trial start date from ClinicalTrials.gov, we defined two groups of trials for all included human prescription drugs: 1) phase 0, I, II, and III trials started before the drug marketing date (pre-marketing trials) and 2) phase IV trials started after the marketing date (post-marketing trials). A trial must have well-defined start date, enrollment number, and eligibility criteria sections to be included in this study [21]. Note that all interventional trials of a drug were collected by their start dates rather than the first received dates or primary end dates in ClinicalTrials.gov.
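The grouping rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the phase label strings and the function names are assumptions, and trials that fit neither definition (for example, a phase IV trial that started before marketing) are simply left unassigned here.

```python
from datetime import date

# Assumed phase labels; the actual registry field values may differ.
PRE_MARKETING_PHASES = {"Phase 0", "Phase 1", "Phase 2", "Phase 3"}

def first_marketing_date(start_marketing_dates):
    """Pick the earliest of a drug's 'start marketing' dates."""
    return min(start_marketing_dates)

def classify_trial(phase, trial_start, marketing_date):
    """Return 'pre-marketing', 'post-marketing', or None if neither rule applies."""
    if phase in PRE_MARKETING_PHASES and trial_start < marketing_date:
        return "pre-marketing"
    if phase == "Phase 4" and trial_start > marketing_date:
        return "post-marketing"
    return None

# Nexium example from the text: earliest and latest 'start marketing' dates.
nexium = first_marketing_date([date(2013, 2, 22), date(2001, 6, 28)])
print(nexium)                                              # 2001-06-28
print(classify_trial("Phase 3", date(2000, 5, 1), nexium)) # pre-marketing
print(classify_trial("Phase 4", date(2005, 3, 1), nexium)) # post-marketing
```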

To ensure the data quantity and quality for every drug, we set a minimum trial count cut-off score for each drug. Only drugs with at least that number of trials were retained. We designed an evaluation method that runs the predictive model iteratively with different minimum thresholds and identifies the value with the best predictive power and the largest sample size. Based on the evaluation result, we chose only drugs with at least nine trials in each group, i.e., at least nine pre-marketing trials and at least nine post-marketing trials, as candidate drugs for further study. After this screening process, the remaining drugs were used to assess textual coherence and focus shift, and to build the random forest predictor. We found 256 drugs with BBW status, among which 68 had adequate clinical trials on ClinicalTrials.gov and were included for model training and validation. The data pre-processing workflow above is summarized in Figure 1.

2.4 Clinical Trial Feature Extraction

The clinical trial study population description used for this study exists mainly in free-text form. We implemented a natural language processing pipeline to transform the free-text criteria into human readable decision rules. In this study, we chose the combination of eligibility criteria, condition, official title, and brief summary sections as our information fields. Trial titles and summaries contain valuable information about the intervention and trial population. For example, a trial for the smoking cessation drug Chantix was named “Varenicline + Prazosin for Heavy Drinking Smokers” (NCT02193256). Adverse Drug Events (ADEs) and Drug-Drug Interactions (DDIs) are the leading indications for drug safety [22].

We extracted disease concepts from the information fields by using regular expressions to search trial contents for disease concepts in SNOMED-CT (version: 201405) [23]. For drug name extraction, we utilized the medParser application from the medication information extraction system MedEx, which achieved competitive precision and recall for clinical narratives [24]. MedEx was developed by Xu et al. and focuses on extracting detailed medication data from text. It was ranked second among all 20 teams in the Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records [25]. In an evaluation on 50 discharge summaries and 25 clinical notes, it achieved precision of 95% and 97%, and recall of 92% and 88%, respectively [24]. We used the default values for all parameters when parsing the clinical eligibility texts. To increase the precision of the named entity recognition process, we only matched and included standard drug names (both generic and brand names) and disease names in SNOMED-CT. The disease and drug lists that we retrieved using this method were used to model the focus (study population) of the trial.
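A minimal sketch of dictionary-based concept matching with regular expressions is given below. The three-term vocabulary is a toy stand-in for SNOMED-CT disease names, and `build_matcher` and `extract_concepts` are hypothetical helpers; the source does not specify the patterns that were actually used.

```python
import re

def build_matcher(terms):
    """Compile one case-insensitive pattern that matches any whole term."""
    # Longest terms first, so a multi-word concept wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, flags=re.IGNORECASE)

def extract_concepts(text, matcher):
    """Return matched concepts in order of appearance, lower-cased."""
    return [m.group(0).lower() for m in matcher.finditer(text)]

# Toy vocabulary (assumption): a real run would load SNOMED-CT disease names.
TOY_DISEASES = ["depression", "cardiovascular disease", "alcohol dependence"]
matcher = build_matcher(TOY_DISEASES)

criteria = "Smokers with a history of depression; no Cardiovascular Disease."
print(extract_concepts(criteria, matcher))
# ['depression', 'cardiovascular disease']
```

Word-boundary anchors keep a term from matching inside a longer word, which is one simple way to favor precision over recall.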

2.5 Textual Coherence and Focus Shift Measurement

The CFS pattern comprises two parts: 1) intra-group coherence shift and 2) inter-group focus shift across trial groups. The intra-group coherence indicates the similarity among trials within pre-marketing trials or post-marketing trials, respectively. When comparing the coherence shift between pre-marketing and post-marketing trials, we can tell whether the focuses of studies concerning a certain drug diverge or converge after the drug launch. The inter-group focus shift indicates whether post-marketing trials are focusing on the same patient population compared to pre-marketing trials; that is, whether the research’s “hot spot” changes.

In order to quantify the coherence of trials within the same trial group, we need to find a way of estimating the similarity between two word-frequency vectors. There are several ways to represent the coherence between two frequency vectors. We chose Jensen-Shannon divergence (JSD), the symmetrized and smoothed version of the Kullback–Leibler divergence, to calculate the intra-group coherence level [26,27]. This is the method Boyack et al. used to calculate within-cluster coherence [27].


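$$\mathrm{JSD}(p, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m)$$

In this formula, m = (p+q)/2, D_KL is the Kullback–Leibler divergence, p is the probability of a word in a trial's eligibility fields, and q is the probability of the same word in the cluster of trials in that trial group (pre-/post-marketing).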


Here, N is the count of all the pre-marketing/post-marketing clinical trials for a certain drug. For each drug, we calculated two JS-Divergence scores: one for its pre-marketing trials (JSDpre) and the other for its post-marketing trials (JSDpost). For a particular drug, a low pre-marketing JS-Divergence score and a high post-marketing JS-Divergence score mean that researchers tended to have the same focus before the drug was approved and differing focuses after the drug launch.
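The intra-group coherence computation can be sketched as below. Two points are assumptions rather than details stated in the text: the group-level score is taken as the mean of the per-trial divergences over the N trials, and logarithms are base 2. Each trial is represented as a non-empty list of extracted concepts.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in a non-empty token list."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]

def kl(p, m):
    """Kullback-Leibler divergence D_KL(p || m), skipping zero-probability terms."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def group_jsd(trials):
    """Mean JSD between each trial and the pooled trial group (assumed aggregation)."""
    vocab = sorted({w for t in trials for w in t})
    pooled = [w for t in trials for w in t]
    q = distribution(pooled, vocab)
    return sum(jsd(distribution(t, vocab), q) for t in trials) / len(trials)

same_focus = [["depression", "asthma"], ["depression", "asthma"]]
split_focus = [["depression"], ["asthma"]]
print(group_jsd(same_focus), group_jsd(split_focus))
```

Trials that share the same concepts yield a score of zero, and the score grows as the trials in a group drift apart.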

We used Cosine distance (Dc) to measure the inter-group focus shift. We treated trials in the same group as a whole and generated a single corpus containing all the diseases/drugs from all trials. Then, we compared the difference between the pre-marketing trial corpus (A) and post-marketing trial corpus (B) using Cosine distance. The larger the distance is, the greater the difference in focus between pre-marketing and post-marketing trials is. For example, given a drug X with three pre-marketing trials and two post-marketing trials, we computed the coherence shift level as follows (Figure 2).

Figure 2
Computational frameworks for coherence/focus shifts on simulated dataset.
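The inter-group focus shift can be sketched in a few lines. This is an illustration under the assumption that each corpus is a bag of extracted concepts weighted by raw counts; the source does not say whether any other term weighting was applied.

```python
import math
from collections import Counter

def cosine_distance(corpus_a, corpus_b):
    """1 - cosine similarity between the concept-count vectors of two corpora."""
    a, b = Counter(corpus_a), Counter(corpus_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Toy corpora (assumption): pooled disease concepts of all trials in each group.
pre = ["nicotine dependence", "nicotine dependence", "asthma"]
post = ["nicotine dependence", "depression", "depression"]
print(round(cosine_distance(pre, post), 3))  # 0.6
```

A distance of 0 means both groups mention the same concepts in the same proportions, while 1 means they share none.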

2.6 Model Generation and Assessment

We built the random forest predictor [28] using the randomForest package in R. From all candidate variables, we selected six to build the model based on their Gini importance and discarded the others. The number of variables randomly sampled as candidates at each split (mtry) was selected based on the smallest Out-Of-Bag (OOB) error rate. We used the ROCR and pROC packages to compute the confidence interval of the ROC curve.

To assess the precision of the model’s prediction of future BBW acquisition, we designed a set of analyses that limit the visibility of the training and validation sets. In other words, we wanted to assess the model's performance in predicting events that will happen in one, two, or three years, and so on. First, we set up a pool of time points from 01/01/2006 to 04/09/2015 with a time interval of one year, i.e., 01/01/2006, 01/01/2007, …, 01/01/2015, and 04/09/2015. The intuition was to train and validate the model with only the data available before each simulated time point. This process was designed to simulate the real-world scenario in which we train the model on 01/01/2006 and return every year after the training to assess the predictive accuracy with the newest data available at that time. For example, when the training date was set to 01/01/2009, we trained the model only with trials that started before 01/01/2009, and only drugs that had acquired a BBW label before 01/01/2009 (seven drugs) were labeled BBW. BBW drugs that acquired their labels after 01/01/2009 would be false negatives in this training set. We validated the model seven times using the seven time points after 01/01/2009, i.e., 01/01/2010, 01/01/2011, …, 04/09/2015. We updated the drug labeling information for each validation. For the one-year validation (01/01/2010), we had four new BBW-labeled drugs between 2009 and 2010. Thus, we had 11 BBW drugs in the validation set. We then assessed the model’s accuracy in predicting BBW incidents for those new drugs. We repeated the process for each subsequent year until the date of writing (04/09/2015), by which time we had 13 BBW drugs to validate. We plotted and analyzed the AUCs for each of the validations.
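The time-limited labeling described above can be sketched as follows. The drug names and acquisition dates are invented placeholders, and the function names are hypothetical; only the time points and the "label acquired before the cut-off" rule come from the text.

```python
from datetime import date

def time_points(first_year=2006, last_year=2015, final=date(2015, 4, 9)):
    """Yearly time points from 01/01/2006 to 01/01/2015, plus the writing date."""
    return [date(y, 1, 1) for y in range(first_year, last_year + 1)] + [final]

def labels_as_of(bbw_dates, cutoff):
    """A drug counts as BBW only if its label was acquired before the cut-off."""
    return {drug: (d is not None and d < cutoff) for drug, d in bbw_dates.items()}

points = time_points()
training_date = date(2009, 1, 1)
validation_dates = [p for p in points if p > training_date]
print(len(points), len(validation_dates))  # 11 7

# Placeholder data (assumption): acquisition date, or None for a ROBUST drug.
toy_bbw_dates = {
    "drug_a": date(2007, 6, 1),
    "drug_b": date(2009, 8, 1),
    "drug_c": None,
}
print(labels_as_of(toy_bbw_dates, training_date))
# {'drug_a': True, 'drug_b': False, 'drug_c': False}
print(labels_as_of(toy_bbw_dates, validation_dates[0]))
# {'drug_a': True, 'drug_b': True, 'drug_c': False}
```

In this toy run, `drug_b` is a false negative in the 2009 training labels and becomes a positive at the first validation point, which is the situation the yearly re-validation is designed to measure.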


3.1 Clinical Trial Eligibility Composition

Among the 402 popular drugs, 342 were successfully mapped to PDR for label extraction. Of those, 256 drugs had valid marketing dates and BBW acquisition dates and were included in the first part of the analysis. Of the 256 drugs, 83 were labeled with a BBW (32.4%) and 173 were not. Compared to a recent study [20], which examined 748 FDA-approved NMEs between 1975 and 2009 and found 114 (15.2 percent) with one or more BBWs, our result showed a significantly larger proportion of BBW acquisition (p-value: 4 × 10⁻⁹) among popular drugs on the market.

For each trial in the analysis, researchers used 315 words on average in describing the trial’s title, condition, summary, and criteria (with stop words removed), including 7.1 disease concepts and 6.3 drug concepts. There were several common scenarios in which a drug name may be mentioned in a clinical trial’s eligibility criteria: 1) The drug is used for intervention in the trial (inclusion); 2) The drug has a function similar to that of the study drug; 3) The drug may interact with the study drug and introduce noise to the trial; and 4) The drug history makes the patients unsuitable for the study. Similarly, there were several scenarios for a disease to be mentioned: 1) The disease is the target disease of the trial (inclusion); 2) Patients with the disease may not be able to participate in or finish the trial; 3) The disease has a complex or unknown relationship with the target disease and researchers want to distinguish between them; and 4) The intervention is developed for a certain disease domain, so other related diseases are excluded.

3.2 BBW Acquisition Trends

The average time for a drug to acquire a BBW label was 9.15 (±1.49) years after the marketing date for all 83 BBW drugs. In general, 45.7% of all BBW acquisitions happened within seven years of marketing, which was consistent with Lasser et al. [2], who reported a 50% rate of BBW acquisition. Their report also showed that out of 548 new chemical entities approved in 1975–1999, 56 (10.2%) acquired a new black box warning or were withdrawn. However, results from the 256 top-selling drugs identified from 2003 to 2013 showed a much higher rate of BBW acquisition (32.4%). This difference might be partially caused by the popularity of the selected drugs. Compared to a drug that is seldom used, a popular drug is used more often, especially in combination with other medications. Thus, more ADEs and DDIs might be identified and monitored. In our dataset, all BBW acquisition events but one happened during or after 2005. We also observed a significant increase in BBW acquisition incidents with significantly shortened acquisition time for popular drugs from 2003 to 2013 (Figure 3d).

Figure 3
Analysis of BBW acquisition trends

3.3 Coherence Shift Analysis

After applying the minimum trial count cut-off score for drug filtering, the remaining 68 drugs were included in the following coherence shift and focus shift analysis as well as in detailed model training and validation. We retrieved and analyzed 7499 clinical trials that studied those drugs. We divided the 68 drugs into two groups for coherence analysis by their BBW status: 18 with a BBW label (BBW group) and 50 without one (ROBUST group). Coherence level was measured separately for pre-marketing trials and post-marketing trials. For each trial group, we analyzed the trial coherence and focus using the methods mentioned above (Figure 4a). With JS-Divergence as the coherence measure, a higher divergence score indicates a greater difference between trials. The result showed that the divergence score of all clinical trials (both BBW drugs and ROBUST drugs) increased for post-marketing trials (Phase IV trials) compared to pre-marketing trials (Phase 0–III trials). The p-value for the KS-test was 1.5 × 10⁻³, indicating a significantly higher level of divergence for post-marketing trials. Since the similarity of clinical eligibility text indicates the researchers’ focus, the focus of post-marketing Phase IV trials diverged more than that of pre-marketing Phase 0–III trials. This observation corresponded to the fact that drug companies conducting Phase 0–III trials tend to focus on getting the drug approved for a specific study population. After a drug is approved, the original company, together with many other organizations or academic institutions, tends to conduct extended trials to test the safety and efficacy of the drug. Also, many researchers test existing drugs in different sub-domains to address possible side effects or off-label usages.

Figure 4
Coherence Shift for Robust/BBW Drugs

When we analyzed BBW drugs and ROBUST drugs separately, we observed two different coherence shift patterns. The divergence score of ROBUST drugs increased significantly (p = 0.01) (Figure 4b). For BBW drugs, however, the increase was less significant (p = 0.07). This means the trial focus of BBW drugs did not diverge much: research groups tended to follow the same patient groups in their independent studies. Perhaps a hidden factor, such as a reported ADE/DDI or other safety concerns, drives researchers to focus on the same aspect. However, there are still two alternative explanations: 1) the content of BBW drugs’ eligibility criteria does not change over time, so researchers just repeat previous trials; and 2) the content of the eligibility criteria does change but in the same direction, which can be caused by a common factor. We tested the inter-group trial content similarity in order to resolve this ambiguity.

3.4 Focus Shift Analysis

We defined eligibility focus shift as the content dissimilarity between pre-marketing and post-marketing trials for a certain drug. Whereas coherence captures the intra-group dissimilarity (among trials of the same trial group), the focus shift captures the inter-group dissimilarity (between different trial groups). This measure can be used as an indicator of the overall change in emphasis before and after drug launch. We used Cosine distance to measure the focus shift in this study. Note that we conducted the analysis at both the drug concept level and the disease concept level. We then analyzed each level separately.

We found that the focus of post-marketing trials differed from that of the corresponding pre-marketing trials. The average Cosine distance was 0.329 for drugs and 0.328 for diseases. The greater the Cosine distance is, the larger the difference in trial contents is. We found a set of post-marketing trials having distinct content, with a Cosine distance of more than 0.6. Unlike the coherence shift, which moved in a similar direction at the drug level and the disease level, the focus shift showed different patterns at the two levels. For BBW drugs, the focus shift at the drug concept level was smaller than that of ROBUST drugs (p = 0.15), while it was larger at the disease concept level (p = 0.08). The combined plot for coherence and focus shift is shown in Figure 5. These results suggest that post-marketing BBW drug trials focused on analyzing patients with different diseases rather than different drugs. One possible explanation is that a high incidence of drug side effects is a consequence of biased pre-marketing patient selection: the pre-marketing trial participants were not representative of the general patient population. Thus, many post-marketing drug trials are designed to assess the drug's safety for patients of a subdomain, e.g., those having a certain disease. Using the example of Chantix again, a much higher percentage of post-marketing trials studied patients with depression. For this reason, post-marketing trials for potential BBW drugs often show a greater concern for patients with different diseases.

Figure 5
Different Eligibility Shift trends for BBW and ROBUST drugs

3.5 Random Forest Predictor

We started with dozens of variables as candidate predictors, including average enrollment count, popular years, and sales, to name a few. However, after the standard feature selection process based on feature importance scores measured by the widely adopted Gini index [29], only six variables were significantly more important than the others. The six variables were Focus Shift (disease), Focus Shift (drug), Coherence Shift (drug), Coherence Shift (disease), Mean Enrollment (pre-marketing), and Marketing Date. The importance ranking of those variables is shown in Figure 6a. The most important variable for this model was “Focus Shift (Disease)”. The order of importance might change with different random forest resampling; however, those six variables were consistently more important than those that were discarded. The best mtry value for the model was 2, with an OOB error of 17.65% compared to 19.12% for an mtry value of 3. When applied to the most up-to-date training set, the model achieved an AUC of around 0.77. Of the top 10 predictions, only two were not BBW-labeled (one with detailed analysis below). Sixteen out of 18 BBW drugs (89%) were identified within the top 47% of all predictions. The top twenty predictions can be found in Table 1. To validate the predictive contribution of focus shift and coherence shift, we rebuilt the model with these variables excluded and plotted the resulting baseline ROC (gray solid line in Figure 6b). The AUC for the baseline model was 0.59. Thus, the focus shift and coherence shift variables increased the AUC by 0.18. The increase is more pronounced for top predictions.
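For readers less familiar with the AUC values quoted above, the following minimal sketch shows how an AUC can be computed from ranked predictions. It uses the pairwise (Mann-Whitney) formulation rather than the ROCR and pROC packages used in the study, and the scores and labels are invented for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (assumption): 1 = BBW drug, 0 = ROBUST drug.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, so the reported gain from 0.59 to 0.77 means that BBW drugs are ranked above ROBUST drugs considerably more often once the shift variables are included.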

Figure 6
Random Forest Predictor
Table 1
Top 20 Drugs Predicted to Acquire Black Box Warning Using Our Random Forest Predictor

The first false positive prediction was the anticonvulsant drug Topiramate (brand name Topamax). This drug was within the top three predictions and was not labeled with a BBW. We did a manual web search for its post-marketing performance and found the following report:

“…the increased risk of suicidal thoughts and actions as a result of using Topamax was brought to the attention of the public by the FDA in 2008. The FDA initially considered adding a black box warning. Instead, the FDA decided to add the warning to the labels and require manufacturers to develop and disperse a medication guide to warn patients of the dangers…”

Also, the drug has been labeled with most of the other safety labels such as ADVERSE REACTIONS (AR), WARNINGS (W) and PRECAUTIONS (P) by the FDA since 2009 [30]. More detailed validation of ROBUST drugs can be found in Table 2.

Table 2
Other Safety Label for ROBUST drugs (from FDA medwatch)

3.6 Minimum Trial Count Cut-off Score Selection

There is a tradeoff between drug sample size and data quality. Since we use textual analysis at the disease and drug levels, a drug with few trials will introduce a large bias into the final statistics. However, if the drug sample size is too small for training, the model performance will also suffer. To determine the minimum trial count cut-off score that gives the highest prediction accuracy with a reasonable sample size, we performed model evaluations using different cut-off scores. The evaluation result is shown in Figure 7. As the cut-off score increased, the drug inclusion criteria became stricter. Thus, the candidate drug count decreased from 121 (when the cut-off score equals two) to 58 (when the cut-off score equals ten). As shown in the figure, the model achieved the best performance when the cut-off score was eight (that is, more than eight trials in each group). The prediction would usually achieve an AUC greater than 0.75. The sample size corresponding to this cut-off score was 68 drugs, among which 18 were BBW drugs. Thus, we chose this cut-off for the random forest predictor in this study.

Figure 7
Model Evaluations with Different Minimum Trial Count Cut-off Scores
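The cut-off selection described in this section amounts to filtering drugs by their trial counts before each model evaluation. The Python sketch below illustrates only that filtering step. The record layout, the function names `eligible_drugs` and `sweep_cutoffs`, and the toy counts are assumptions made for illustration, not data or code from this study. It reads "more than eight trials in each group" as a strict inequality on both the pre-marketing and the post-marketing count.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical per-drug record:
# (pre-marketing trial count, post-marketing trial count, has a BBW label)
DrugCounts = Dict[str, Tuple[int, int, bool]]


def eligible_drugs(counts: DrugCounts, cutoff: int) -> List[str]:
    """Keep drugs with more than `cutoff` trials in BOTH trial groups."""
    return sorted(
        name
        for name, (pre, post, _) in counts.items()
        if pre > cutoff and post > cutoff
    )


def sweep_cutoffs(counts: DrugCounts, cutoffs: Iterable[int]) -> Dict[int, Tuple[int, int]]:
    """Map each cut-off score to (number of eligible drugs, number of BBW drugs)."""
    summary = {}
    for cutoff in cutoffs:
        kept = eligible_drugs(counts, cutoff)
        n_bbw = sum(1 for drug in kept if counts[drug][2])
        summary[cutoff] = (len(kept), n_bbw)
    return summary


# Toy counts, invented for illustration only.
toy_counts: DrugCounts = {
    "drug_a": (12, 30, True),
    "drug_b": (9, 9, False),
    "drug_c": (3, 40, False),
    "drug_d": (10, 8, True),
}

summary = sweep_cutoffs(toy_counts, range(2, 11))
for cutoff, (n_drugs, n_bbw) in summary.items():
    print(f"cut-off {cutoff}: {n_drugs} drugs, {n_bbw} with BBW")
```

In a full evaluation, each cut-off would be followed by training and scoring the predictor on the retained drugs, so that the sample size and the resulting AUC can be compared side by side.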

3.7 Long-term Accuracy of BBW Prediction

We set up 11 time points (0–10), each representing a year from 2006 to 2015, to test the model’s predictive power. The results are shown in Figure 8. In general, predictive accuracy increased significantly with more training data and decreased slightly over time. For the 2006 training set, only one of 17 drugs was labeled BBW. With such a small and biased training set, the model performed even worse than random guessing. Similar conditions applied to models trained with data before 2008. For the 2009 training set, seven of 41 drugs were labeled BBW, with six additional BBW labels not yet known. The model performance for one-year validation (2007) was 0.712 and remained constant for the following two years (with 11 BBW drugs for validation). Four-year accuracy was 0.675, with another BBW acquired in 2012. Finally, the AUC was 0.687 for the remaining years up to the current date (a total of 13 BBW drugs). This result shows that, given enough training data, the model can predict future BBW acquisition events quite well, with little accuracy loss even for long-term predictions. With more training data after 2010, model performance remained between 0.75 and 0.80 for almost all validation sets. The out-of-bag (OOB) error rate using the full dataset for training remained below 20%. Overall, sample size had a significant impact on predictive power: with few samples for training in 2006, 2007, and 2008, the models performed no better than random guessing, whereas from 2009 onward predictive power increased and stayed above 0.7 most of the time. Given a sufficiently large training set, the model was even able to predict BBW acquisition five years ahead without much decrease in accuracy.

Figure 8
Model Predictive Range Assessments
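The long-term assessment in this section can be pictured as fixing the model scores at a training year and re-scoring them against the BBW labels known at successively later years. The sketch below is one possible reading of that procedure, not the authors' code. The helper names `labels_at`, `auc`, and `auc_at_horizon`, the BBW years, and the scores are all hypothetical. The AUC is computed with the standard pairwise (Mann-Whitney) definition rather than with any particular library.

```python
from typing import Dict, List, Optional


def labels_at(bbw_year: Dict[str, Optional[int]], year: int) -> Dict[str, bool]:
    """A drug counts as BBW at `year` if its first BBW was acquired by then."""
    return {drug: (y is not None and y <= year) for drug, y in bbw_year.items()}


def auc(scores: List[float], labels: List[bool]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: year of first BBW (None = never),
# and scores from a hypothetical model trained on data up to 2009.
toy_bbw_year = {"drug_a": 2005, "drug_b": 2010, "drug_c": None, "drug_d": 2012}
toy_scores = {"drug_a": 0.9, "drug_b": 0.6, "drug_c": 0.2, "drug_d": 0.7}


def auc_at_horizon(train_year: int, horizon: int) -> float:
    """Score the fixed predictions against labels known `horizon` years later."""
    truth = labels_at(toy_bbw_year, train_year + horizon)
    drugs = sorted(toy_scores)
    return auc([toy_scores[d] for d in drugs], [truth[d] for d in drugs])


for horizon in (1, 3):
    print(f"{horizon}-year validation AUC: {auc_at_horizon(2009, horizon):.3f}")
```

Keeping the scores fixed while the label set grows is what makes it possible to see whether accuracy degrades as the validation horizon lengthens.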


4.1 Importance of BBW label prediction

In this study, we first analyzed the hidden drug safety signals in linked clinical trial summaries and FDA warnings. We showed that descriptions of clinical trial study populations can be used to predict future BBW acquisition. Currently, the time it takes for a drug with serious side effects to receive an FDA warning ranges from several years to more than 10 years. During this time, many physicians remain unaware of and unprepared for the drug’s potential side effects. Bioinformaticians rely on these “outdated” BBW labels as predictors and gold standards to facilitate the secondary use of clinical data. A timelier prediction system could alert the public to potential BBW labels earlier. Our system complements existing prediction systems that use electronic health records or other online data sources by using authoritative data and tackling the problem from a new angle. Clinical trial summaries are written by experts who have thought carefully about the research procedure of a study. They contain valuable yet largely unused signals for predicting future drug safety events. By performing aggregate analysis of clinical trial summaries, we can gain deep insight into the current state of drug development. With this method, we may not need to wait for the outcome of a multi-year trial to understand potential drug safety issues.

4.2 Limitations and Improvements

As an initial proof of concept, this study aims to verify the existence of the Convergent Focus Shift (CFS) pattern and its capacity to predict BBW acquisition events. We defined two trial groups based on drug marketing dates and trial phases and calculated the CFS from the inter- and intra-group variances of the two groups. We did not exclude post-marketing trials conducted after BBW dates, for three reasons. First, BBW acquisitions have become faster in recent years. Consequently, for drugs with a BBW, the majority of post-marketing trials occur after BBW acquisition, and using BBW dates to filter out post-marketing trials would cause a dramatic loss of eligible BBW drugs. Second, a drug can carry multiple BBW labels, so it is hard to tell which BBW date should serve as the separation point for focus changes. We could have identified multiple BBW acquisition dates for each drug and stratified trials into multiple time periods separated by these dates; however, that design would need far more data than were available for this study. Third, the current design allows more trials to be analyzed, which increases the natural language processing precision. We are aware of the pros and cons of using this design.


Pros:
  • More trials are retained in the training set, which increases the number of drugs eligible for prediction in this study (those above the Minimum Trial Count Cut-off Score).
  • A longer time period allows the focus shift to be detected.
  • This type of research suffers from inherently small sample sizes due to information fragmentation. Allowing more trials to be aggregated for analysis therefore increases the NLP precision and the CFS assessment accuracy.


Cons:
  • The target population focus of some trials, such as follow-up studies of BBW warnings, can be influenced by prior BBW acquisitions. The focus shift for these trials therefore does not occur independently of BBW events.
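The design discussed in this section rests on comparing the concepts extracted from pre-marketing trials with those extracted from post-marketing trials. As a rough illustration of such a group comparison, the sketch below computes the Jensen-Shannon divergence, the Shannon-entropy-based measure introduced in reference [26], between the concept frequency distributions of two trial groups. This is not the CFS formula used in this study, which also involves intra-group variance; the function names and the toy concept lists are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def distribution(concepts: List[str]) -> Dict[str, float]:
    """Relative frequency of each concept in a non-empty list of concept mentions."""
    counts = Counter(concepts)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in bits; assumes q is non-zero wherever p is non-zero."""
    return sum(pv * math.log2(pv / q[c]) for c, pv in p.items() if pv > 0)


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    mixture = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in vocab}
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Toy concept mentions, invented for illustration only.
pre_marketing = ["epilepsy", "epilepsy", "seizure", "migraine"]
post_marketing = ["epilepsy", "obesity", "obesity", "alcohol dependence"]

shift = js_divergence(distribution(pre_marketing), distribution(post_marketing))
print(f"Inter-group divergence: {shift:.3f} bits")
```

A larger value means that the post-marketing trials mention a different mix of concepts than the pre-marketing trials; applying the same measure to pairs of trials within one group would give a comparable intra-group baseline.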

The analysis and prediction method in this study depends heavily on the availability and quality of data. Although thousands of drugs are on the market today, only a few were retained for detailed analysis after pre-screening for drugs with complete records in the various databases. Data quality is therefore the bottleneck for such analyses and will remain so for the foreseeable future. Several problems need to be addressed: 1) improving the data mapping process across different databases; 2) bridging the gap between researchers who have data mining knowledge and the government authorities that produce the data; and 3) keeping the information current so that predictions are timely.

In this study, we chose drugs and diseases as the two entities for detailed analysis. Additional drug safety information is hidden in other parts of the original trial text, which we did not explore with the proposed method. For example, lab results and unstructured narratives of patients’ symptoms are valuable for profiling the study population. There are also unconsidered relationships among concepts within the plain text; that type of information might not be easily extractable with current processing techniques. Through this analysis, we found that the granularity of concepts might also relate to future drug performance. Many of the ROBUST drugs tend to correlate with more general concepts, such as mental disorder or bipolar disorder. One possible reason is that, without many adverse events to worry about, researchers apply these drugs to other domains, which might not be as specific as those studied for BBW drugs. Capturing such information will require moving from concept-level processing to phrase-level or complex relationship-level processing. As a first step, we should apply negation detection in the context of this study. Because a single idea can have many semantic representations, it would also be valuable to detect those similar representations and harmonize them into a standardized human-readable format.
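To make the negation step mentioned in this section concrete, the sketch below flags a concept as negated when a trigger word appears shortly before it in the sentence. It is a deliberately simplified, trigger-window heuristic written for illustration; the trigger list, the window size, and the function name `is_negated` are assumptions, and a production system would need a validated negation algorithm and a much larger trigger lexicon.

```python
import re

# Hypothetical, deliberately short trigger list for illustration.
NEGATION_TRIGGERS = ("no", "not", "without", "denies", "free of", "absence of")


def is_negated(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if a trigger occurs within `window` words before the concept.

    Only the first occurrence of the concept is checked, and triggers that
    follow the concept (for example "... was ruled out") are ignored.
    """
    text = sentence.lower()
    start = text.find(concept.lower())
    if start < 0:
        return False  # concept not mentioned at all
    preceding_words = re.findall(r"[a-z]+", text[:start])[-window:]
    context = " ".join(preceding_words)
    return any(
        re.search(rf"\b{re.escape(trigger)}\b", context)
        for trigger in NEGATION_TRIGGERS
    )


examples = [
    "Patients with no history of bipolar disorder",
    "Patients diagnosed with bipolar disorder",
]
for sentence in examples:
    print(sentence, "->", is_negated(sentence, "bipolar disorder"))
```

Concepts flagged this way could be excluded from, or counted separately in, the concept frequencies of a trial, so that an exclusion criterion is not mistaken for a target condition.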

4.3 Future Applications in secondary use of clinical trial summaries

This study represents a possible application of clinical data reuse for improving medication safety. One conclusion we can draw is that, as clinical data continue to grow in volume, clinical data reuse becomes feasible at a global level. Beyond the content we currently focus on, we can also analyze research behavior, trends, and so on. We should also be aware that clinical data analysis is never a purely statistical or mathematical problem. Besides being statistically sound, our analyses should be logically reasonable and acceptable to domain experts in order to gain confidence and find applications. Finally, our current predictive model cannot predict how long it will take for a drug to be BBW-labeled; temporal prediction is therefore a natural and meaningful follow-up project.


This study contributes a novel method based on Convergent Focus Shift (CFS) measurements that leverages linked public data for pharmacovigilance, and it found a significant difference in study population changes between drugs with and without BBW labels. Post-marketing trials for BBW-labeled drugs focused on recruiting patients with previously unconsidered medical conditions, while post-marketing trials for unlabeled drugs had more diversified focuses. This finding sheds new light on the secondary use of linked public data. As an application, we built a random forest predictor to predict future black box warning label acquisition events. We also showed that the predictor is capable of identifying long-term BBW acquisition events without a significant decrease in accuracy. With the large volume of data available today, we can take advantage of the useful information hidden in the data and make smarter clinical decisions in the future.

Figure 1
The workflow of pre-processing steps.


  • Black box warning acquisition time decreases over time among drugs
  • Linked and FDA Safety labels reveal hidden drug safety signals
  • Unsafe drugs differ from safe drugs in convergent focus shift in their test trials
  • Pre- and post-marketing trials are more different for unsafe drugs than safe drugs
  • Black Box Warning acquisition was predictable with AUC of 0.77 using random forest


This study was funded by NLM grant R01LM009886, “Bridging the semantic gap between clinical research eligibility criteria and clinical data.” (PI: Weng).


Black Box Warning
Convergent Focus Shift
Adverse Drug Event
Drug-Drug Interaction
Physicians’ Desk Reference




1. Panagiotou OA, Contopoulos-Ioannidis DG, Papanikolaou PN, Ntzani EE, Ioannidis JP. Different black box warning labeling for same-class drugs. J Gen Intern Med. 2011;26:603–610.
2. Lasser KE, Allen PD, Woolhandler SJ, Himmelstein DU, Wolfe SM, et al. Timing of new black box warnings and withdrawals for prescription medications. JAMA. 2002;287:2215–2220.
3. Harpaz R, DuMouchel W, LePendu P, Bauer-Mehren A, Ryan P, et al. Performance of Pharmacovigilance Signal-Detection Algorithms for the FDA Adverse Event Reporting System. Clinical Pharmacology & Therapeutics. 2013;93:539–546.
4. Roper N, Stensland KD, Hendricks R, Galsky MD. The landscape of precision cancer medicine clinical trials in the United States. Cancer Treat Rev. 2015;41:385–390.
5. Hochberg AM, Reisinger SJ, Pearson RK, O’Hara DJ, Hall K. Using data mining to predict safety actions from FDA adverse event reporting system data. Drug Information Journal. 2007;41:633–643.
6. Shetty KD, Dalal SR. Using information mining of the medical literature to improve drug safety. Journal of the American Medical Informatics Association. 2011;18:668–674.
7. Chee BW, Berlin R, Schatz B. Predicting adverse drug events from personal health messages. American Medical Informatics Association; 2011. p. 217.
8. Gibbons RD, Amatya AK, Brown CH, Hur K, Marcus SM, et al. Post-approval drug safety surveillance. Annual Review of Public Health. 2010;31:419.
9. Kim ES, Bernstein D, Hilsenbeck SG, Chung CH, Dicker AP, et al. Modernizing eligibility criteria for molecularly driven trials. Journal of Clinical Oncology. 2015;33:2815–2820.
10. Hoertel N, Le Strat Y, Lavaud P, Dubertret C, Limosin F. Generalizability of clinical trial results for bipolar disorder to community samples: findings from the National Epidemiologic Survey on Alcohol and Related Conditions. The Journal of Clinical Psychiatry. 2013;74:265–270.
11. Califf RM, Zarin DA, Kramer JM, Sherman RE, Aberle LH, et al. Characteristics of clinical trials registered in ClinicalTrials.gov, 2007–2010. JAMA. 2012;307:1838–1847.
12. Weng C, Yaman A, Lin K, He Z. Smart Health. Springer; 2014. Trend and Network Analysis of Common Eligibility Features for Cancer Trials in; pp. 130–141.
13. Humphreys K, Harris AH, Weingardt KR. Subject Eligibility Criteria Can Substantially Influence the Results of Alcohol-Treatment Outcome Research. Journal of Studies on Alcohol and Drugs. 2008;69:757.
14. Miotto R, Weng C. Unsupervised mining of frequent tags for clinical eligibility text indexing. Journal of Biomedical Informatics. 2013;46:1145–1151.
16. Food and Drug Administration. Drugs@FDA. 2011.
17. Weng C, Li Y, Ryan P, Zhang Y, Liu F, et al. A Distribution-based Method for Assessing The Differences between Clinical Trial Target Populations and Patient Populations in Electronic Health Records. Applied Clinical Informatics. 2014;5:463–479.
18. PDR Staff. Physicians’ Desk Reference. PDR Network; 2011.
19. Lasser KE, Allen PD, Woolhandler SJ, Himmelstein DU, Wolfe SM, et al. Timing of new black box warnings and withdrawals for prescription medications. JAMA. 2002;287:2215–2220.
20. Frank C, Himmelstein DU, Woolhandler S, Bor DH, Wolfe SM, et al. Era of faster FDA drug approval has also seen increased black-box warnings and market withdrawals. Health Affairs. 2014;33:1453–1459.
21. Ross J, Tu S, Carini S, Sim I. Analysis of eligibility criteria complexity in clinical trials. AMIA Summits on Translational Science Proceedings. 2010;2010:46.
22. Bates DW, Cullen DJ, Laird N, Petersen LA, Small SD, et al. Incidence of adverse drug events and potential adverse drug events: implications for prevention. JAMA. 1995;274:29–34.
23. Schulz S, Hanser S, Hahn U, Rogers J. The Semantics of Procedures and Diseases in SNOMED® CT. Methods of Information in Medicine. 2006;45:354.
24. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, et al. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association. 2010;17:19–24.
25. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association. 2010;17:514–518.
26. Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory. 1991;37:145–151.
27. Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, et al. Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS One. 2011;6:e18029.
28. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
29. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth; 1984.
30. Ioannidis JP, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383:166–175.