HHS Public Access author manuscript; accepted for publication in a peer-reviewed journal.
Addiction. Author manuscript; available in PMC 2010 November 1.
Published in final edited form as:
PMCID: PMC2763048

A Bayesian Model for Estimating the Effects of Drug Use when Drug Use may be Under-reported



We present a statistical model for evaluating the effects of substance use when substance use might be under-reported. The model is a special case of the Bayesian formulation of the “Classical” measurement error model, requiring that the analyst quantify prior beliefs about rates of under-reporting and the true prevalence of substance use in the study population.


Prospective study.


A diversion program for youths on probation for drug related crimes.


257 youths at risk for re-incarceration.


The effects of true cocaine use on recidivism risks while accounting for possible under-reporting.


The proposed model showed a 60% lower mean time to re-incarceration among actual cocaine users. This effect size is about 75% larger than that estimated in the analysis that relies only on self-reported cocaine use. Sensitivity analyses comparing different prior beliefs about the prevalence of cocaine use and rates of under-reporting uniformly indicate larger effects than the analysis assuming that everyone tells the truth about their drug use.


The proposed Bayesian model allows one to estimate the effect of actual drug use on study outcome measures.

Keywords: Under-reporting, self-report, Bayesian methods, measurement error, criminal justice, recidivism, cocaine use, adolescent drug use


Virtually all researchers in addiction medicine, substance abuse epidemiology, drug and alcohol abuse prevention, pharmacological therapies for addiction or any field related to drug and alcohol use eventually face a pervasive and vexing problem: to what extent can self-reported drug and alcohol use be trusted? There is a stigma to drug and alcohol use that might promote under-reporting, and clinicians recognize that addicts commonly deny the extent of their drug and alcohol use. Among criminal justice populations there may be serious punitive repercussions for using drugs, thus promoting under-reports. Our research regularly involves youths incarcerated and on probation for substance-related violations, adult DWI offenders, heroin addicts, and volunteers in clinical trials of treatments for alcohol dependence. Almost without exception, in any paper that we have submitted for publication, or proposal submitted for funding, we have been asked by reviewers to address potential bias due to under-reporting.

Concern about the effects of inaccurate self-reports cannot be overstated, and much research has been dedicated to addressing various facets of the problem. Several themes in previous work have thus emerged, including assessment of the validity of self-report (1;2), variation in the validity of self-report among subgroups (e.g. men vs. women) (3) or substances queried (e.g. alcohol vs. opiates) (4), theories on the cognitive mechanisms that might induce more or less accurate self-reports (5), and alternative methods of querying drug and alcohol use (6). Another approach is to use biomarkers only, such as urinalysis or breathalyzer results, recognizing the limited time horizon of use that these measures cover. Researchers within nutritional epidemiology have developed sophisticated measurement error models that augment self-reported alcohol use with food diaries collected from a subset of a study sample. The common assumption is that a diary is more accurate than self-report, though heavily stigmatized behavior such as illicit drug use is not considered (7). To our knowledge, methods that address the problem of under-reporting illicit drug use during the analysis phase of a study have not been developed and applied within addiction medicine.

Our goal is to present a framework for estimating the effects of “true” drug use when drug use might be under-reported. Our motivating example is the effect of recent drug use on recidivism among youths on probation. These youths have been arrested and incarcerated at least once for a substance-use violation, and have been released on probation. Our goal is to estimate the extent to which cocaine use in the last 90 days predicts re-incarceration risk. The model described, and the results presented, estimate the impact of true cocaine use on recidivism rates while incorporating uncertainty due to self-report.

Our approach is to treat the problem of under-reported drug use as a special case of measurement error modeling, for which statistical methods, particularly Bayesian methods, have been extensively developed (8-10). Our goals are to: 1) Explicitly define the role of under-reporting in a statistical model, and 2) Allow one to formally evaluate the impact of different assumptions about under-reporting. We wish to emphasize that this methodology does NOT eliminate the effects of under-reporting, nor does it allow one to ignore under-reporting in a study design. Additionally, our approach does not devalue previous work on the topic of under-reporting. In fact, the Bayesian methodology that we propose allows one to directly incorporate previous research into the analysis through the use of informative priors.


The study population is composed of juvenile offenders recently released from a Juvenile Detention Center (JDC) located in a large, Southwestern US city and mandated to attend a highly structured diversion program designed as an alternative to incarceration for youth. A total of 288 youths were recruited over a 20-month enrollment period from January 2005 through September 2006. Recruitment was limited to youth mandated to the diversion program and willing to participate in a cognitive-behavioral therapy program to reduce substance use and risky behaviors. (Evaluation of the therapy is ongoing, though no impact of the therapy on recidivism was directly observed.) Eligible youth were English-speaking, between 14 and 18 years of age, with a prior substance use referral to the juvenile justice system. Youth who were more than 13 and one-half years old were allowed to participate, as were youth who were no more than 19 years and two months old.

The Form 90 (11) was one of a battery of assessments administered to the youths upon recruitment into the study. This instrument asks respondents to self-report all drug use, by type of drug and route of administration for the past 90 days. Additionally, a complete record of all youths booked into the JDC from January 1, 2002 to March 31, 2007 was obtained from the detention center administration. Study participants were matched to the booking database on last name, first name, gender, and date of birth. Seventeen youths were too old to be re-incarcerated into the JDC after enrollment into the study, and eight youths could not be matched into the JDC booking database using their names and dates of birth. An additional six youths did not complete the Form 90, and were excluded from this analysis. The resulting sample of 257 youths recruited into the study provided valid baseline cocaine use in the last 90 days and could be matched to the JDC booking database. Days from enrollment into the study until the first re-incarceration event constituted the outcome measure. Participants were censored if they were not re-incarcerated before March 31, 2007 or their 18th birthday. The number of days until re-incarceration was flagged as censored for these participants and defined as ending on the earlier of these two dates.
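The construction of the outcome measure described above can be sketched in Python. This is a minimal illustration and not the authors' code: the function names, argument names, and the handling of February 29 birthdays are hypothetical. Only the March 31, 2007 end of the booking records and the 18th-birthday rule come from the text.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_END = date(2007, 3, 31)  # last date covered by the booking records (from the text)


def eighteenth_birthday(dob: date) -> date:
    """Return the 18th birthday; Feb 29 births fall back to Mar 1 (an assumption)."""
    try:
        return dob.replace(year=dob.year + 18)
    except ValueError:  # Feb 29 does not exist in the target year
        return date(dob.year + 18, 3, 1)


def days_to_event(enrolled: date, dob: date,
                  first_rebooking: Optional[date]) -> Tuple[int, bool]:
    """Return (days, censored) for one participant.

    A participant is censored when no re-booking occurs before the earlier
    of the study end date and the 18th birthday.
    """
    censor_date = min(STUDY_END, eighteenth_birthday(dob))
    if first_rebooking is not None and first_rebooking < censor_date:
        return (first_rebooking - enrolled).days, False
    return (censor_date - enrolled).days, True
```

For a youth who is never re-booked, the returned number of days is a lower bound on the time to re-incarceration, which is how a censored observation enters a time-to-event model.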

Statistical Model

We wish to estimate the relative time to re-booking as a result of cocaine use. Unfortunately, we only have self-reports of cocaine use from the Form 90, not true use, and we strongly suspect that youths are under-reporting. Thus, the goal is to estimate the relative re-incarceration rate while recognizing uncertainty due to our reliance on self-report.

Our approach to the problem of under-reporting is a special case of the “Classical” measurement error model, and follows a Bayesian formulation largely paraphrasing Richardson and Gilks (8-10). The model requires that we specify a “Disease Model”, which describes the relationship between time to re-incarceration and true cocaine use; a “Measurement Model”, which identifies the relationship between the self-reported cocaine use and the true cocaine use; and an “Exposure Model,” which is a model for true cocaine use. While this model is not new, nor particularly complex, some of the terminology might be unfamiliar. We refer the reader to Richardson and Gilks (9) for an introduction to Bayesian measurement error models in epidemiological research.

(1) Disease Model

We estimate the effects of cocaine use on time to re-incarceration using a log-normal accelerated failure time model. Specifically, we model the time in days to re-incarceration for the ith youth, denoted Ti, given a binary indicator Ci of true cocaine use for that youth (where Ci = 1 if the youth actually used cocaine, and Ci = 0 otherwise) as

$$\log(T_i) \sim \mathrm{Normal}\left(\beta_0 + \beta_1 C_i,\ \sigma^2\right)$$

where σ is the lognormal scale parameter, β0 is the intercept, and β1 measures the relative decrease in the mean days to re-incarceration due to true cocaine use. Recidivism is a complex phenomenon that might be influenced by many factors. Other covariates may be included in the Disease model, such as gender or ethnicity, though we presently set these aside for the purposes of illustrating the model.
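To see why β1 can be read as a relative change in the mean time to re-incarceration, one can work through the mean of a lognormal variable. The derivation below uses a standard property of the lognormal distribution rather than anything stated in the text, and it assumes, as in the Disease Model, that σ is the same for users and non-users.

```latex
\begin{align*}
\log T_i \mid C_i &\sim \mathrm{Normal}\!\left(\beta_0 + \beta_1 C_i,\ \sigma^2\right) \\
\mathrm{E}[T_i \mid C_i] &= \exp\!\left(\beta_0 + \beta_1 C_i + \tfrac{\sigma^2}{2}\right) \\
\frac{\mathrm{E}[T_i \mid C_i = 1]}{\mathrm{E}[T_i \mid C_i = 0]} &= \exp(\beta_1)
\end{align*}
```

A negative β1 therefore shortens the expected time to re-incarceration by the factor exp(β1). The same factor applies to the median, since the median of Ti is exp(β0 + β1Ci).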

(2) Measurement Model

Ri is the self-reported cocaine use of the ith subject. We assume that Ri (given Ci) ~ Bernoulli((1-λ)Ci), where λ is the rate at which cocaine use is under-reported in the study population. Under this model, the probability of self-reporting cocaine use is zero for a true non-user (Ci = 0) and is (1-λ), the probability of telling the truth, for a true user (Ci = 1). The formulation makes a key assumption explicit: youths never over-report cocaine use. We do not believe that over-reporting is particularly significant in addiction medicine, a position that is supported by some empirical research. Williams and Nowatzki (2005) and Kim et al. (2000) compared urinalysis and self-reports of cocaine use among adolescents and found over-reporting rates of only 0% and 1.4%, respectively. The implications of relaxing this assumption are discussed below. We note that if λ is fixed at exactly zero (i.e. no one ever under-reports cocaine use), then the model reduces to a model of time to re-incarceration given self-report, which amounts to replacing Ci with Ri in the Disease Model described above.

(3) Exposure Model

Ci is the (unobserved) actual cocaine use of the ith subject, and is assumed to have a binary or Bernoulli distribution Ci~Bernoulli(θ), where θ is the prevalence of true cocaine use in the study population. Ci may also depend on covariates, though this is not explored in the current study.
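The three sub-models can be read together as a recipe for generating data, which a short simulation makes concrete. The sketch below is an illustration under stated assumptions, not the analysis reported here: the function names are hypothetical, the parameter values passed in are chosen by the user, and censoring is ignored for simplicity.

```python
import math
import random


def simulate(n, theta, lam, b0, b1, sigma, seed=0):
    """Draw n (T, R, C) triples from the Exposure, Measurement and Disease Models."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = 1 if rng.random() < theta else 0            # Exposure Model: true use
        r = 1 if rng.random() < (1 - lam) * c else 0    # Measurement Model: self-report
        t = rng.lognormvariate(b0 + b1 * c, sigma)      # Disease Model: days to event
        rows.append((t, r, c))
    return rows


def mean_log_time_gap(rows, use_reported):
    """Mean log(T) among the 1s minus mean log(T) among the 0s.

    use_reported=True groups by self-report R; False groups by true use C.
    """
    idx = 1 if use_reported else 2
    ones = [math.log(row[0]) for row in rows if row[idx] == 1]
    zeros = [math.log(row[0]) for row in rows if row[idx] == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)
```

Calling `simulate(20000, 0.45, 0.43, 5.0, -0.91, 1.9)` and comparing `mean_log_time_gap(rows, True)` with `mean_log_time_gap(rows, False)` shows the mechanism the model is meant to address: the group that denies use is a mixture of true non-users and under-reporting users, so the gap based on self-report is pulled towards zero relative to the gap based on true use.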

The true prevalence of cocaine use, θ, and the rate of under-reporting, λ, require explicit probability distributions, called “priors” in the Bayesian framework. These are quantifications of the assumptions that the investigators take into the analysis about true under-reporting and true cocaine use. By combining these priors with the data using Bayes theorem, one's priors are updated by the available data. By conducting several analyses under different priors, one explores the sensitivity of the analysis to prior beliefs. In this way we formally evaluate different assumptions about rates of under-reporting.

We consider three different priors for the prevalence of cocaine use in the last 90 days among juvenile offenders. The prevalence is assumed to follow Beta distributions shown in Figure 1. Beta distributions are commonly used as priors for probabilities (12). Prior (a) corresponds to the initial belief that the true prevalence of cocaine use in the last 90 days is about 10% (the mean for a Beta(x,y) prior is x/(x+y), so this Beta(1,9) prior has a mean of 1/(1+9) = 1/10 or 10%), with much of the probability concentrated towards 0%. This corresponds to initial beliefs that cocaine use in the last 90 days is somewhat unusual in this population. Prior (b) expects the prevalence to be about 45%, but allows for considerable uncertainty in the true prevalence. This prior is appropriate for an investigator who is fairly certain that the true prevalence is between 10% and 80%, but is uncertain about its true value within that range. Prior (c) has an expected prevalence that is fairly high, about 56%, and is almost certainly within about 30% and 80%. Our uncertainty about the prevalence of cocaine use in this sample leads us to believe that prior (b) is most appropriate, but we compare results to priors (a) and (c) in the sensitivity analysis.

Figure 1
Prior probability plots for the prevalence of cocaine use in the last 90 days among juvenile offenders. (a) = Beta(1,9), (b)=Beta(2.5,3), (c) = Beta(10,8).

We also consider three different Beta prior distributions for the rate of under-reporting, λ (Figure 2). Prior (q) corresponds to an initial belief that under-reporting is rare in this population, with only 1 in 6 users not admitting drug use. However, this prior admits that the rate of under-reporting may be as high as 50%. Prior (r) is more moderate, with an expected rate of under-reporting of about 43%, but admits that values between about 1% and 90% are plausible. Note the concentration to the left in this distribution, which assumes that lower values of under-reporting are more likely than higher values. This prior corresponds to our initial guess of the rate of under-reporting in this population. Finally, prior (s) assumes much higher rates of under-reporting, with an expected rate of about 70%, and assumes that a rate lower than about 30% is extremely unlikely.

Figure 2
Prior probability plots for rates of under-reporting cocaine use among juvenile offenders. (q) = Beta(1,5), (r)=Beta(1.5,2), (s) = Beta(7,3).
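The prior means quoted for the six Beta distributions follow from the formula x/(x+y) given earlier. The short Python sketch below recomputes them, along with the standard deviation as a rough indication of how diffuse each prior is. The Beta parameters are taken from the captions of Figures 1 and 2; the function and dictionary names are hypothetical.

```python
import math

# Beta(x, y) parameters from the captions of Figures 1 and 2.
PRIORS = {
    "a": (1, 9), "b": (2.5, 3), "c": (10, 8),   # prevalence of cocaine use, theta
    "q": (1, 5), "r": (1.5, 2), "s": (7, 3),    # rate of under-reporting, lambda
}


def beta_mean(x, y):
    """Mean of a Beta(x, y) distribution."""
    return x / (x + y)


def beta_sd(x, y):
    """Standard deviation of a Beta(x, y) distribution."""
    return math.sqrt(x * y / ((x + y) ** 2 * (x + y + 1)))


def summarize_priors(priors=PRIORS):
    """Return {label: (mean, sd)} for each prior."""
    return {label: (beta_mean(x, y), beta_sd(x, y)) for label, (x, y) in priors.items()}
```

Swapping in other (x, y) pairs is a quick way to check whether a candidate prior matches one's stated beliefs before it is used in a sensitivity analysis.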

Finally, we also require priors on the terms included in the disease model. β1, the parameter of interest describing the effect of cocaine use on days until re-incarceration is given a normal prior with mean 0 and a very large variance (e.g. 10,000). This is sometimes referred to as an “objective” prior in that nearly any effect of true cocaine use on recidivism is plausible. β0 is given a normal prior centered at 5, which we observed in our original analysis of this data, and with variance 10 to allow for a wide range of possible intercepts. σ is given a half-normal (0.526) prior, which has mean = 1.91 and is also based on our initial analysis of this data.
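One practical detail when moving from the variances quoted above to WinBUGS code is that the BUGS `dnorm` distribution takes a precision (1/variance) as its second argument, not a variance. The helper names below are hypothetical; the conversion itself is simply a reciprocal.

```python
import math


def precision_from_variance(variance):
    """Convert a variance to the precision expected by BUGS-style dnorm."""
    return 1.0 / variance


def variance_from_precision(precision):
    """Convert a BUGS-style precision back to a variance."""
    return 1.0 / precision


# Variances quoted in the text for the priors on b0 and b1.
b0_precision = precision_from_variance(10)       # used as dnorm(5, b0_precision)
b1_precision = precision_from_variance(10000)    # used as dnorm(0, b1_precision)
```

These values correspond to the `dnorm(5,0.1)` and `dnorm(0,0.0001)` statements in Appendix 1.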

The joint posterior distribution of the parameters is evaluated using Markov Chain Monte Carlo (MCMC). We used WinBUGS for this analysis, the code for which is shown in Appendix 1. We used three sets of initial parameter values and allowed a 1000-iteration ‘burn-in’ period, followed by 5000 iterations of the sampler to generate posterior summaries. The marginal posterior distribution for each parameter is summarized by the median and the 2.5- and 97.5-percentiles, which serve as Bayesian “confidence intervals”. We also computed the difference between the upper and lower limits as a measure of the posterior dispersion in the parameter estimates. A Bayesian confidence interval for the β1 parameter that does not include zero is analogous to a “statistically significant” result at the 5% level in a traditional frequentist analysis. We note that there are several computational issues to consider when using MCMC, and it is advisable to work with an experienced Bayesian statistician. A full description of the pitfalls in Bayesian analysis is beyond the scope of this article, though we refer readers to the collection of papers edited by Gilks et al. (13).
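The posterior summaries described above (median, 2.5- and 97.5-percentiles, and their difference) can be computed from any list of MCMC draws. The sketch below is one way to do so in Python; the function names are hypothetical and the linear-interpolation percentile rule is an assumption, since WinBUGS may use a slightly different convention.

```python
def percentile(sorted_draws, p):
    """Linear-interpolation percentile (p in [0, 100]) of pre-sorted draws."""
    k = (len(sorted_draws) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_draws) - 1)
    return sorted_draws[lo] + (sorted_draws[hi] - sorted_draws[lo]) * (k - lo)


def summarize(draws):
    """Median, 95% interval, interval width, and whether the interval excludes zero."""
    s = sorted(draws)
    lower = percentile(s, 2.5)
    median = percentile(s, 50)
    upper = percentile(s, 97.5)
    return {
        "median": median,
        "lower": lower,
        "upper": upper,
        "dispersion": upper - lower,
        "excludes_zero": lower > 0 or upper < 0,
    }
```

Applied to the post-burn-in draws of β1, the `excludes_zero` flag corresponds to the informal notion of “statistical significance” used in the text.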


Seventy-eight (30.3%) of the 257 youths recruited into the study self-reported cocaine use in the last 90 days. Youths self-reporting cocaine use in the last 90 days spent on average 12 days in jail in the last three months, while youths self-reporting no cocaine use in the last 90 days spent on average 11.99 days in jail in the last three months. Fifty-two of the 78 youths (66.7%) reporting use were subsequently re-incarcerated, as were 97 of the 179 youths (54.2%) self-reporting no cocaine use in the last 90 days. Figure 3 shows Kaplan-Meier estimates of the re-incarceration rates for each group of self-reported use. Also shown in Figure 3 are the recidivism functions estimated using the lognormal regression model, assuming that there is no under-reporting of cocaine use (the “naïve analysis”). The effect of self-reported cocaine use on time to recidivism was −0.51 (95% CI = −1.05 to 0.02; shown in the first row of Table 1). This indicates that mean time to re-incarceration among self-reported users is about exp(−0.51) = 0.6 times that of self-reported non-users, corresponding to about 40% shorter mean time to re-incarceration among self-reported users than non-users. However, because the interval includes zero, the naïve analysis does not show a statistically significant difference between self-reported users and non-users.

Figure 3
Kaplan-Meier (solid lines) and lognormal regression model (dashed lines) estimates of the recidivism functions for self-reporting users (black lines) and non-users (grey lines). The circles indicate recidivism functions estimated using the Bayesian under-reporting ...
Table 1
Posterior sensitivity analysis for the under-reporting model presented in the text.

Table 1 also shows results of the Bayesian analysis incorporating different prior beliefs about the rates of under-reporting and the prevalence of cocaine use in this study population. Rows correspond to different priors on λ and θ. The shaded row corresponds to our priors on the distributions of λ and θ.

Several important features are illustrated in Table 1. First, all effect sizes of true cocaine use in the under-reporting model are larger than in the naïve model. Depending on the priors used, the difference can be considerable. Our priors on θ and λ (priors ‘b’ and ‘r’) yield a posterior median for β1 of −0.91, which is more than 75% larger in magnitude than the naïve analysis. This effect size corresponds to 60% shorter time to re-incarceration among true cocaine users than non-users. Furthermore, the 95% interval does not include 0, corresponding to “statistical significance” that was otherwise not observed in the naïve analysis.
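The arithmetic behind these comparisons can be checked in a few lines of Python. The two coefficients are the naïve estimate and the posterior median quoted in the text; the function names are hypothetical.

```python
import math


def time_ratio(beta1):
    """Ratio of mean time to re-incarceration, users vs. non-users."""
    return math.exp(beta1)


def percent_shorter(beta1):
    """Percent reduction in mean time to re-incarceration implied by beta1."""
    return 100.0 * (1.0 - math.exp(beta1))


def relative_increase(adjusted, naive):
    """Proportional increase in effect magnitude of `adjusted` over `naive`."""
    return abs(adjusted) / abs(naive) - 1.0


NAIVE_BETA1 = -0.51      # self-report only (first row of Table 1)
ADJUSTED_BETA1 = -0.91   # posterior median under priors (b) and (r)
```

Here `percent_shorter(NAIVE_BETA1)` is roughly 40, `percent_shorter(ADJUSTED_BETA1)` is roughly 60, and `relative_increase(ADJUSTED_BETA1, NAIVE_BETA1)` is a little above 0.75, in line with the figures reported above.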

As prior beliefs about θ and λ increase in magnitude, there is a general tendency towards larger effect sizes. With priors on λ that are not concentrated towards zero (that is, priors other than ‘q’), the posterior intervals for β1 have upper limits that are always below zero. Conversely, prior (q), which concentrates λ closer to 0% under-reporting, yields posterior intervals that include zero. This is not surprising, since models that assume under-reporting is relatively rare most closely resemble the naïve analysis, which assumes that there is never any under-reporting.

Accompanying the increased effect sizes is increased uncertainty about the effect size, as measured by the posterior dispersion in the fourth column of table 1. Commensurate with the belief that subjects self-report the truth about their drug use, the naïve analysis estimates comparatively lower uncertainty about the true effect size than the model that allows for under-reporting.

Figure 3 also shows the estimated recidivism functions for the under-reporting model using our prior (b) for the prevalence of cocaine use in the last 90 days and prior (r) for the rates of under-reporting. The analysis drops the recidivism rates of non-users, causing a larger estimated difference between true users and non-users than one observes using self-report alone. Effectively, the analysis separates the true users from the true non-users among those who deny use of cocaine in the last 90 days.
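A simplified view of how deniers are split into true users and true non-users comes from applying Bayes' theorem to the Measurement and Exposure Models alone. The sketch below is an illustration, not the full posterior calculation: it ignores the additional information carried by each youth's time to re-incarceration, and the function names are hypothetical.

```python
def prob_reports_use(theta, lam):
    """P(R = 1): a youth reports use only if a true user who tells the truth."""
    return theta * (1.0 - lam)


def prob_user_given_denial(theta, lam):
    """P(C = 1 | R = 0) under the no-over-reporting assumption.

    Deniers are either under-reporting users (theta * lam)
    or true non-users (1 - theta).
    """
    return (theta * lam) / (theta * lam + (1.0 - theta))
```

For example, with θ = 0.5 and λ = 0.5, one third of those denying use would be expected to be true users. In the full model this probability is further sharpened for each youth by how quickly he or she was re-incarcerated.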

The analysis also updates our priors on the rates of under-reporting and the prevalence of cocaine use. Our prior on θ was centered at 0.4, and had a plausible range from 0.1 to 0.8. The posterior on θ centers at 0.6, and ranges from 0.3 to 0.8. Similarly, the prior on λ centered at 0.4 and ranged from 0 to 0.9. The posterior centered at 0.5 and ranged from 0.1 to 0.6. In both cases the data modified the point estimate of each parameter and reduced the range of plausible values.

Ultimately, some degree of consensus is achieved across the different priors. For example, we compared a variety of priors on λ and θ, and found that all of the posterior medians for θ were between 33% and 68%. All of the lower limits of the posterior confidence intervals were above 26%. These are remarkable statements about true cocaine use in this population without observing actual cocaine use. Likewise, the analysis updated our beliefs about the rate of under-reporting in this sample, though comparison across different priors was less consensual than for θ. All posterior medians for λ were between 10% and 56%, and all of the upper limits were below 68%. Again, this is a remarkable result given that there was no gold standard for testing the truthfulness of self-reported cocaine use.


The proposed model shows that under a variety of prior assumptions about rates of under-reporting and prevalence of cocaine use among juvenile offenders, the effects of cocaine use on recidivism, and the uncertainty of these effects, are greater in magnitude than in a model that assumes no under-reporting. While this result does not necessarily generalize to other analyses of the effects of drug use when drug use might be under-reported, it illustrates the utility of the proposed under-reporting model and the simplicity of Bayesian sensitivity analysis.

The proposed model assumes that subjects will never over-report illicit drug use. While we believe that this is a reasonable assumption in our data, some might wish to incorporate over-reporting into our model. To do so, define a new parameter, δ, which is the rate of over-reporting. The Measurement Model becomes Ri ~ Bernoulli( Ci(1-λ) + (1-Ci)δ ). This model will require highly informative priors on each of the parameters to achieve identifiability, and must be subjected to careful sensitivity analysis. This model is worth exploring in detail, especially outside of the criminal justice setting where penalties from admitting drug use might be less severe.
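The extended Measurement Model, and the reason it needs informative priors, can be illustrated with a small Python sketch. The function names and the numerical values in the usage note are hypothetical, and the illustration considers the self-reports alone, setting aside whatever information the time-to-event data contribute.

```python
def prob_report(c, lam, delta=0.0):
    """P(R = 1 | C = c) with under-reporting rate lam and over-reporting rate delta."""
    return c * (1.0 - lam) + (1 - c) * delta


def marginal_report_rate(theta, lam, delta=0.0):
    """P(R = 1), averaging over true use with prevalence theta."""
    return theta * prob_report(1, lam, delta) + (1.0 - theta) * prob_report(0, lam, delta)
```

Setting `delta=0.0` recovers the original model. The identifiability problem shows up when two different parameter sets imply the same observable report rate: `marginal_report_rate(0.5, 0.4, 0.0)` and `marginal_report_rate(0.4, 0.4, 0.1)` both equal 0.30, so the self-reports by themselves cannot distinguish between them.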

Finally, our chosen priors on the rate of under-reporting are relatively uninformative. This ignores an abundance of previous research comparing self-reports to urinalysis results among criminal justice populations (4). This work can be formally introduced into our model by constructing a summary prior, though this is not a trivial task and requires careful thought (12).

The under-reporting model that we propose is very general and can apply in a wide variety of circumstances. However, the model only pertains to situations where one investigates drug use as a risk factor for other outcomes, such as health status or criminal behavior. Where one is interested in drug use itself as an outcome, for example in a clinical trial to investigate the impact of Brief Motivational Interviewing on cannabis use (14), a different model is required. Additionally, the current model requires that self-reported drug or alcohol use be reported on a binary scale. Examples include National Institute on Alcohol Abuse and Alcoholism (NIAAA) definitions of risky vs. non-risky drinking (15), any drug use, or diagnoses of drug or alcohol use disorders using DSM-IV. Expanding the model to include continuous or ordinal self-reports, and to cover situations where drug use is a study outcome, is worthy of future research.


Many studies in addiction medicine rely on self-reported drug and alcohol use. Additionally, many large-scale, national surveys such as the National Survey of Drug Use and Health use self-reported drug and alcohol use. All analyses of these data must confront the issue of under-reporting. We propose an analytical framework for investigating the impacts of under-reporting on the effects of drug and alcohol use. We emphasize that this methodology does not eliminate the problem of under-reporting, but requires that the investigator makes explicit, quantitative assumptions about under-reporting. Furthermore, like any Bayesian analysis, guidance from an experienced Bayesian analyst is strongly recommended. Our analysis of the effects of cocaine use on recidivism among incarcerated youths indicates uniformly larger effect sizes when examined using the under-reporting model, though we are hesitant to ascribe any generalizability to these findings. We believe that the model is worthy of exploration in other situations.


This study was funded by National Institute on Drug Abuse grant number 1 R21 DA016571-1A1. We appreciate the contribution of Liz Wozniak for editing assistance.

Appendix 1

Example WinBUGS code for the under-reporting model described in the text.

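model {
  for (i in 1:257) {
    # Disease Model: lognormal time to re-incarceration;
    # I(CEN[i],) bounds censored times below by the censoring time.
    T[i] ~ dlnorm(mu[i], tau)I(CEN[i],)
    mu[i] <- b0 + C[i]*b1
    # Measurement Model: true users report use with probability 1 - lambda.
    R[i] ~ dbern(phi[i])
    phi[i] <- C[i]*(1 - lambda)
    # Exposure Model: unobserved true cocaine use.
    C[i] ~ dbern(theta)
  }
  # Priors (b) and (r) on theta and lambda (Figures 1 and 2).
  theta ~ dbeta(2.5, 3)
  lambda ~ dbeta(1.5, 2)
  # Priors on the regression coefficients and scale parameter.
  # WinBUGS expresses scale in terms of the precision, or 1/scale^2.
  b0 ~ dnorm(5, 0.1)
  b1 ~ dnorm(0, 0.0001)
  x.tau ~ dnorm(0, 2.4)
  tau <- 1/(abs(x.tau)*abs(x.tau))
}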


Declaration of Conflicts of Interests: None.

Reference List

1. Midanik LT. Validity of self-reported alcohol use: a literature review and assessment. British Journal of Addiction. 1988;83:1019–29. [PubMed]
2. Magura S, Goldsmith D, Casriel C, Goldstein PJ, Lipton DS. The validity of methadone clients’ self-reported drug use. International Journal of the Addictions. 1987;22:727–49. [PubMed]
3. Kim JS, Fendrich M, Wislar JS. The validity of juvenile arrestees’ drug use reporting: a gender comparison. J Res Crime Del. 2000;37:419–32.
4. Williams RJ, Nowatzki N. Validity of adolescent self-report of substance use. Subst Use Misuse. 2005;40:299–311. [PubMed]
5. Del Boca FK, Darkes J. The validity of self-reports of alcohol consumption: state of the science and challenges for research. Addiction. 2003 Dec;98:1–12. [PubMed]
6. Richter L, Johnson PB. Current methods of assessing substance use: A review of strengths, problems, and developments. J Drug Issues. 2001;31:809–32.
7. Thiebaut AC, Freedman LS, Carroll RJ, Kipnis V. Is it necessary to correct for measurement error in nutritional epidemiology? Comment on: Annals of Internal Medicine. Annals of Internal Medicine. 2007;124:65–7. [PubMed]
8. Richardson S, Gilks WR. A Bayesian approach to measurement error problems in epidemiology using conditional independence models. American Journal of Epidemiology. 1993;138:430–42. [PubMed]
9. Richardson S, Gilks WR. Conditional independence models for epidemiological studies with covariate measurement error. Stat Med. 1993;12:1703–22. [PubMed]
10. Gilks WR, Richardson S, Spiegelhalter DJ. Measurement Error. Chapman & Hall; Boca Raton, FL: 1996. pp. 401–14.
11. Westerberg VS, Tonigan JS, Miller WR. Reliability of form 90D: An instrument for quantifying drug use. Substance Abuse. 1998;19:179–90. [PubMed]
12. Spiegelhalter DJ, Abrams K, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. John Wiley and Sons; San Francisco, CA: 2004.
13. Gilks W, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice. Chapman and Hall; London: 2006.
14. D'Amico EJ, Miles JN, Stern SA, Meredith LS. Brief motivational interviewing for teens at risk of substance use consequences: A randomized pilot study in a primary care clinic. J Subst Abuse Treat. 2008;35:53–61. [PubMed]
15. Dawson DA. Methodological issues in measuring alcohol use. Alcohol Research and Health. 2003;27:18–29. [PubMed]