Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptNIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC Oct 6, 2011.
Published in final edited form as:
PMCID: PMC3188406
Cocaine Dependence Treatment Data: Methods for Measurement Error Problems With Predictors Derived From Stationary Stochastic Processes
Yongtao Guan, Associate Professor, Yehua Li, Assistant Professor, and Rajita Sinha, Professor of Psychiatry
Yongtao Guan, Division of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06520;
Yongtao Guan: yongtao.guan/at/; Yehua Li: yehuali/at/; Rajita Sinha: rajita.sinha/at/
Yongtao Guan and Yehua Li are joint first authors of this article.
In a cocaine dependence treatment study, we use linear and nonlinear regression models to model posttreatment cocaine craving scores and first cocaine relapse time. A subset of the covariates are summary statistics derived from baseline daily cocaine use trajectories, such as baseline cocaine use frequency and average daily use amount. These summary statistics are subject to estimation error and can therefore cause biased estimators for the regression coefficients. Unlike classical measurement error problems, the error we encounter here is heteroscedastic with an unknown distribution, and there are no replicates for the error-prone variables or instrumental variables. We propose two robust methods to correct for the bias: a computationally efficient method-of-moments-based method for linear regression models and a subsampling extrapolation method that is generally applicable to both linear and nonlinear regression models. Simulations and an application to the cocaine dependence treatment data are used to illustrate the efficacy of the proposed methods. Asymptotic theory and variance estimation for the proposed subsampling extrapolation method and some additional simulation results are described in the online supplementary material.
Keywords: Bias correction, Method-of-moments correction, Subsampling extrapolation
Cocaine dependence remains a serious public health problem in the United States with more than one million individuals meeting criteria for current dependence (SAMSHA 2004). Cocaine abuse causes serious concerns to society because of its association with criminal activities as well as the high cost to treat health problems related to it. In the battle against cocaine addiction, cocaine craving and relapse pose the greatest challenges. Despite the availability of efficacious behavioral treatments, relapse rates remain high (Sinha 2001, 2007).
Previous studies have revealed that one’s baseline cocaine-use behavior is predictive of cocaine relapse (Fox et al. 2006), along with many other risk factors such as age and gender, cocaine withdrawal severity (Kampman et al. 2001), and stress and negative mood (Sinha 2001, 2007). To describe the baseline cocaine-use behavior, daily cocaine-use trajectory data are often collected in a short period prior to treatment. Summary statistics derived from these trajectories are then used as predictors in a subsequent analysis to explain cocaine craving and cocaine relapse. Among these, the most popular choices are the baseline cocaine-use frequency and average daily use amount (Carroll et al. 1993; Sinha et al. 2006).
Because the baseline trajectories are random, we view them each as a realization from a stochastic process. The parameters governing the distribution of the process then characterize one’s true baseline cocaine-use behavior. However, the summary statistics derived from an observed realization are only estimates of these parameters and may often contain large error, leading to a measurement error problem. In a regression setting, the use of error-prone variables as predictors may cause severe bias to regression coefficients that are associated with these variables and those that are correlated with them (Carroll et al. 2006). Thus, the effect of measurement error due to the summary statistics must be accounted for.
We stress that the kind of measurement error problem faced here is different from the classical measurement error problems. For the latter, the measurement error is often assumed to have a known and constant variance and also follow certain distribution, for example, a normal distribution. When the variance is unknown, replicates of the error-prone variable or some instrumental variables are needed in order to estimate the variance (Carroll et al. 2006). In the current setting, the summary statistics are derived based on only one single realization of the underlying process for each subject, that is, there are no replicates, and the variances of the summary statistics are typically unknown and unequal across subjects. In addition, the errors in the summary statistics are usually non-Gaussian.
In the joint modeling literature, many authors have considered the so-called second-level regression models, where the covariates are subject-specific regression parameters of some longitudinal measurements. A partial list of such work includes Wang, Wang, and Wang (2000), Song, Davidian, and Tsiatis (2002), Li, Zhang, and Davidian (2004), Li, Wang, and Wang (2007). In all these articles, strong parametric assumptions are made for the measurement error, such as that they are normally distributed. In a recent application in the Sleep Heart Health Study, Crainiceanu et al. (2009) proposed building a regression model to predict coronary heart disease status using summary statistics derived from the electroencephalogram (EEG) δ-power spectra. They adopted a functional data analysis approach and proposed a simulation extrapolation (SIMEX) method to account for the measurement error due to smoothing the EEG curves, where the simulation step was done by bootstrapping the (independent) residuals. To account for the randomness in each experiment, they proposed a Bayesian hierarchical model that jointly models the EEG data from multiple visits. Similar to the authors mentioned above, we are also interested in building a second-level model where some covariates are summary statistics derived from the baseline cocaine-use trajectories. It is becoming an increasingly more important problem in statistical practice to account for the estimation error in such summary statistics.
In this article, we develop two robust bias-correction methods to account for the effect of measurement error when a subset of the predictors are summary statistics derived from some stationary stochastic processes. We consider both linear and nonlinear regression models. For the former, we propose a computationally simple method-of-moments bias-correction method, which is closely related to its counterpart in classical measurement problems. For the latter, we propose a novel subsampling extrapolation (SUBEX) method, which can yield approximately unbiased estimator in the presence of measurement error for a wide range of problems. The SUBEX is motivated by the widely used SIMEX method proposed by Cook and Stefanski (1994), which involves a simulation step to increase the variance in the measurement error and an extrapolation step to correct for the bias. The simulation step in SIMEX is difficult to apply here because of the challenges highlighted earlier. We instead develop a novel subsampling algorithm to achieve the variance inflation. The proposed methods do not assume any distribution for the measurement error, nor require knowledge on its variance or the availability of replicates.
We organize the remainder of the article as follows. In Section 2 we describe a real dataset that has motivated our research. We discuss the proposed bias-correction methods in Section 3, and assess their numerical properties through simulation in Section 4. We then apply them to analyze the motivating data example in Section 5. We conclude this article with a discussion in Section 6. Additional theoretical and simulation results are given in the online Supplementary Materials.
2.1 Description of Data
One hundred and forty two cocaine dependent individuals between the age of 19 and 51 were admitted to the Clinical Neuroscience Research Unit (CNRU) of the Connecticut Mental Health Center, for two to four weeks of inpatient treatment for cocaine dependence. The CNRU is a locked inpatient treatment and research facility with no access to alcohol or drugs and limited access to visitors. The study was conducted in two separate periods with 59 enrolling in the first period and 83 enrolling in the second.
Upon treatment entry, all subjects were interviewed by means of the Structured Clinical Interview for DSM-IV (First et al. 1995) to collect baseline demographic variables and daily cocaine-use history in the 90 days prior to admission. The latter was documented using a 90-day time-line follow-back (TLFB; Sobell and Sobell 1993) Substance Use Calendar, which is a reliable instrument for assessing self-report drug use in alcoholic and drug abusing populations (Fals-Stewart et al. 2000; see Section 2.3 for more discussion on the TLFB). In addition, the Tiffany Cocaine Craving Questionnaire-Brief (CCQ–Brief) (Tiffany et al. 1993), which measures “desire for using cocaine at this moment,” was also collected.
After completion of the inpatient treatment, all participants were invited back for follow-up interview(s) to assess cocaine use outcomes. For the 59 subjects in the first study, only one interview was administrated at day 90 after treatment. For the 83 subjects in the second study, four interviews were given at days 14, 30, 90, and 180 after treatment. Daily cocaine record was collected based on the TLFB procedure during each interview for the period before. Urine toxicology screen was conducted to check the accuracy of a reported relapse or abstinence in the same period. Some subjects reported abstinence but their urine samples tested positive. Hence, these subjects had relapsed in the period proceeding the first positive urine sample test date. However, the exact relapse date is unknown. Thus, our relapse data are interval censored. In addition, the CCQ–Brief scores were also measured, typically only at day 90 for the first 59 subjects but at days 14, 90, and 180 for second 83 subjects.
2.2 Exploratory Analysis of Baseline Trajectories
The 90-day baseline daily cocaine-use trajectory data were given in the form of actual daily use amount (in gram equivalents) that was estimated by the study participants. However, the data may contain large errors due to two reasons. First, it is generally difficult to recall the exact use amount over a long period of time, at least more difficult than to recall use/no use, say. Second, it is challenging to develop a common scale to assess the amount used because of different methods of consumption. For example, some subjects may smoke cocaine in the form of crack, others may use it in powder form intranasally, and others may inject it; requiring all subjects to convert the amounts into the exact gram equivalents would simply be unrealistic. Given these reasons, we also consider dichotomized trajectories, that is, trajectories comprised of no use (=0) and any use (=1). We illustrate the main ideas in the remainder of Section 2 based on such trajectories. However, we will also consider the trajectories of actual use amounts in Section 5.
Our primary goal here is to assess the effect of one’s long-term cocaine use behavior on posttreatment cocaine craving and cocaine relapse. The vast majority of the study participants had had multiple years of cocaine use history before enrolling in the study. Because the baseline period is relatively short compared to the cocaine use history, it is sensible to assume that the observed baseline cocaine use pattern was stable and also reflective of one’s long-term cocaine use behavior. However, the cocaine use behavior could have been altered temporarily because of the impending admission to treatment. To be sure, we have examined the plot of daily percentages of individuals reporting cocaine use [Figure 1(a)]. The plot indicates that there were more subjects using cocaine on the second last day before treatment admission but much fewer on the last day. It also reveals that the daily percentages in the first six days appeared to be lower than average, which could be due to, for example, recalling error. Since these data may not be most representative for one’s long-term cocaine-use behavior, we have removed them from our analysis. This leaves us with an 80-day long baseline period. Figure 1(b) plots the baseline trajectories for four representative subjects. The trajectories suggest that the daily cocaine use patterns are reasonably consistent over the baseline period for these subjects.
Figure 1
Figure 1
Plot (a): Percentage of individuals reporting cocaine use by day; the solid line shows the daily percentage, the dotted lines are the pointwise 95% confidence envelope, the horizontal dashed line is the average daily percentage. Plot (b): Individual baseline (more ...)
We next study the dependence within each trajectory. Let Bij be a binary variable to denote a cocaine use (=1) or no use (=0) for the ith subject on the jth day, i = 1, …, 142 and j = 1, …, 80. Let r ≤ 40 be a positive integer. Define
equation M1
If j < r, we set Bi(jr) = Bi(80+jr). This is equivalent to assuming that the first and last days are next to each other. Note that gi(r) is a second-order statistic and thus may contain useful information about the dependence in a trajectory.
Figure 2(a) shows values of gi(r) for the four selected trajectories shown in Figure 1(b). It reveals a large variability among the individual gi(r), which range from consistently near-zero values for Trajectory 1 and a constant value of one for Trajectory 4. The former suggests a very low baseline frequency whereas the latter implies cocaine use everyday. The plot of gi(r) for Trajectory 2 reaches a peak every seven days, suggesting a weekly cocaine use pattern; however, the same plot for Trajectory 3 shows a slightly decreasing trend, implying that the dependence becomes weaker as the time lag r increases. Combined together, the plot illustrates the highly complex nature of the dependence in these trajectories. Consequently, the distribution of the summary statistics and hence that of the resulting measurement error are also expected to be highly complex. A strong parametric assumption for the measurement error may then be unrealistic and potentially restrictive.
Figure 2
Figure 2
Plots (a) and (b) show plots of gi(r) and of equation M87 versus X ˜(p) for the four individual trajectories given in Figure 1(b). The symbols in each plot match those in Figure 1(b).
2.3 Construction of Baseline Summary Statistics
We illustrate the construction of baseline summary statistics based on the four trajectories given in Figure 1(b). A most obvious choice is the baseline frequency, expressed as
equation M2
where Bij is as defined in Section 2.2. For the four given trajectories, the resulting baseline frequencies are 0.125, 0.725, 0.763, and 1, respectively. Note that Trajectories 2 and 3 yield similar baseline frequencies, although they look different both visually and as indicated by the plots of gi(r).
To better distinguish the patterns between Trajectories 2 and 3, we may consider
equation M3
where r* is a positive integer and ψ (r) is a properly defined function. For example, if r* = 7 and ψ(r) = 0 except when r = 7, then equation M4 simply quantifies the strength of a weekly pattern. We may also set ψ(r) as an (estimated) principal component of gi(r). For our data, the second and third principal components (Figure 3) describe variabilities due to weekly patterns and short-range dependence, respectively. The obtained summary statistics are −0.1945 and 0.0037 for Trajectory 2, and are 0.0688 and −0.0474 for Trajectory 3. Hence, Trajectory 2 exhibits more evidence for a weekly pattern while Trajectory 3 appears to have a stronger short-range dependence. The first principal component essentially yields a summary statistic that is comparable with the baseline frequency. Note that the summary statistics defined in (1) and (2) can both be viewed as the sample mean of a stochastic process associated with the ith subject. We will focus on this case in our article.
Figure 3
Figure 3
Plot of the (estimated) first, second, and third principal components of gi(r), denoted by the dotted, solid, and dashed lines, respectively.
The obtained summary statistics are random mainly due to two reasons. First, the baseline cocaine use trajectories, even if they can be accurately observed, are only realizations of some stochastic processes. Suppose that the baseline period could be repeated, then the resulting new trajectories, and hence the summary statistics based on them, would most likely be different. This will be especially true for Trajectories 1 and 3. Such kind of randomness cannot be reduced, because it is an inherent part of the stochastic nature of one’s cocaine-use behavior. Second, the accuracy of a reconstructed baseline trajectory will depend on a number of factors, such as patient characteristics and their patterns of cocaine use as well as methods and settings in which data were collected. Some of these factors are random and therefore may introduce additional variability in the obtained baseline trajectories. The TLFB is designed to minimize such variabilities. It uses a calendar prompt and a number of other memory aids (e.g., the use of key dates such as holidays, birthdays, newsworthy events and other personal events as anchor points) to render more accurate recalls of daily cocaine uses. The subjects are also asked to recall lengthy periods of time when they completely abstained or used in a very regular pattern (e.g., used consecutively or weekly); patterns in such periods can often be identified more accurately. It has been shown that the TLFB can provide reliable daily cocaine-use data that have high (1) retest reliability, (2) correlation with other cocaine use measures, and (3) agreement with collateral informants’ reports of patients’ cocaine use as well as results obtained from urine assays (Fals-Stewart et al. 2000). Hence, the variability due to self-report should be small, especially for those with marked regular cocaine use patterns such as Trajectories 2 and 4.
Despite the well recognized effectiveness of TLFB, under-reporting could still be a concern as with most other self-report data on use of illicit drugs. However, there are considerably less incentives for one to deny a use that occurred in a baseline period compared to during treatment and follow-up. Several studies have in fact reported a high degree of agreement between patients’ self-report recent drug use and urine testing results obtained at the time of admission to treatment (Brown, Kranzler, and Del Boca 1992; Sherman and Bigelow 1992). In light of these facts, we do not consider the potential underreporting in the baseline data but will do so when analyzing the relapse data.
Another potential source of variability in the summary statistics is due to the use of an estimated function ψ(r). However, this is often small when the sample size is large and can be eliminated when ψ(r) is decided in advance (e.g., as is the case with baseline frequency). Our main focus of the article is to account for the variability caused by the stochastic nature in one’s cocaine-use behavior, although our approach can also be applied when the baseline trajectories each are contaminated with a stationary error process of mean zero, i.e., when there is no systematic underreporting. The major challenge is that the distribution of the summary statistics may be very difficult to model, as a result of the highly complex within-trajectory dependence shown in Section 2.2. We will develop methods without having to specify the within-trajectory dependence.
3.1 Notation and Setup
Let Ni be a stationary stochastic process that has generated the ith cocaine use trajectory over the baseline period (0, τ], where τ > 0. Let Wi be a summary measure that is calculated from a realization of Ni. We assume that Wi takes the following general form:
equation M5
where equation M6 is either Ni or a different stationary process that is derived from Ni. In other words, Wi is the sample mean of equation M7. We present Wi in a more general integral form as given in (3), but its discrete counterpart can be easily derived; for example, see the baseline frequency and the second-order summary statistics given in (1) and (2). For ease of presentation but without losing generality, we simply write equation M8 as Ni.
Suppose that the distribution of Ni depends on a set of parameters Λi. Let Xi = E(Wii) be a desired predictor that describes some aspect of the long-term cocaine use behavior for the ith subject, for example, the long-term cocaine-use frequency. Let Ui be a measurement error such that Wi = Xi + Ui; this specifies the classical measurement error model (Carroll et al. 2006). The magnitude of Ui can vary greatly across subjects. For example, for the four baseline trajectories that are given in Figure 1(b), Ui is expected to be small for Trajectories 2 and 4, but may be much larger for Trajectories 1 and 3. In regression analyses, the use of Wi as a substitute for Xi may lead to biased estimators for the regression coefficients.
To correct for the bias, it is often necessary to estimate the variability in Wi. When independent replicates of Wi are available, this is rather straightforward (Carroll et al. 2006). However, independent replicates are not available for our data, because only one baseline trajectory was observed for each subject. However, for any given baseline period, we may obtain multiple equal-length subsets. Let S(t, p) = (t, t + ] be one such subset, where 0 < p ≤ 1 and 0 ≤ t ≤ (1 − p) τ. The summary statistics defined on S(t, p), denoted by Wi(t, p), can then be treated as replicates of each other. We will derive our proposed bias-correction methods based on these replicates. However, Wi(t, p) are not independent because of the within-trajectory dependence and also possible overlapping in the subsets.
Let Ui(t, p) denote the measurement error that is associated with Wi(t, p). By stationarity and condition (3), Wi(t, p) is an unbiased estimator for Xi and hence Ui(t, p) has an expectation equal to zero when conditional on Λi. We next consider the variance of t, p). Define equation M9, where equation M10 is the variance of Ui conditional on Λi. Let αi be a constant that is typically nonnegative. We assume that
equation M11
Condition (4) is a standard assumption for sample-mean-like statistics (Politis, Romano, and Wolf 1999). For the motivating data given in Section 2, it essentially requires that the baseline trajectories are weakly correlated, and that the rate of decay in the correlation is dominated by the inverse of the length of the time interval. This condition is used mainly to ensure that our proposed methods are asymptotically valid as both n and τ increase to infinity, where n is the total number of subjects. However, in a practical setting, τ is often small relative to the dependence range and hence the last term of (4) can be quite large compared to the two leading terms. We investigate the performance of our proposed methods under different levels of dependence through simulation (see Section 4).
Based on condition (4), we can link the variance of the measurement error in Wi(t, p) to that in Wi. This link will be critical for the development of our proposed bias-correction methods based on Wi(t, p). Specifically, it follows from straightforward algebra that
equation M12
if p > p0 for some p0 > 0 such that p0τ is sufficiently large. In the special case that Ni is an independent process, then (5) holds exactly with αi = 0 and consequently equation M13. However, this is unrealistic for our data given the evidence of within-trajectory dependence shown in Figure 2. In a more general setting, condition (5) implies that p var[Ui(t, p)] is biased for equation M14, but the bias can be reduced by adding back (1 − p)αi/(2).
We introduce the following notation. Let Yi be the response and let Zi be some covariates observed without error, both for the ith subject. Let Y = (Y1, …, Yn) be an n × 1 column vector. Similarly define W, X, and U as the n × 1 vectors based on Wi, Xi, and Ui, respectively, and define Z as the matrix obtained by stacking together the row vectors equation M15, where equation M16 denotes the transpose of Zi.
3.2 Method-of-Moments Bias Correction for Linear Models
We illustrate our proposed bias-correction method through the following simple model:
equation M17
where θ = (β, η) is a vector of regression coefficients and εi: i = 1, …, n are independent random variables with mean zero. Based on the observed data (Y, W, Z), the least-squared estimator for θ can be written as
equation M18
Note that W = X + U. It follows from some simple algebra that
equation M19
where equation M20. Clearly [theta w/ hat] is biased for θ because σ2 ≠ 0. If an estimator for σ2 is available, we can then obtain a method-of-moments bias-corrected estimator, say [theta w/ hat]mom, by subtracting the variance estimator from the first element of A. Thus, the problem has become a variance estimation problem.
We develop a nonparametric variance estimator for σ2. A nonparametric estimator is desirable because of the complex within-trajectory dependence. To develop our method, let t be an arbitrary time point satisfying 0 ≤ t < τkpτ, where k = [1/p] and [x] denotes the integer part of a positive constant x. We require p ≤ 1/2 so that k ≥ 2. We then partition the interval (t, t + kpτ] into k nonoverlapping subintervals each of length . Let Wi(t + jpτ, p) be the summary statistic defined on the (j + 1)th subinterval for the ith trajectory, where j = 0, 1, …, k − 1. Define
equation M21
equation M22
In the special case that τ = kpτ, we simply set equation M23.
By condition (3), the average of the summary statistics t + Wi(jpτ, p) is Wi(t, kp). Thus, equation M24 intends to estimate the variance of Ui(t, p). Define equation M25. Recall that Wi(t + jpτ, p) is an unbiased estimator for Xi. It then follows from condition (5) that
equation M26
In light of (6), define
equation M27
We propose to regress (p) on X (p) at some preselected values of p, say (p1, …, pJ) where J ≥ 2. The resulting intercept and (minus) slope estimators, denoted by [sigma with hat]2 and [alpha], can then be treated as estimators for σ2 and α, respectively. Moreover, a scatter plot of X (p) and (p) can be used as a practical tool to assess whether condition (4) is reasonable.
To understand the contributions of individual trajectories to the calculation of [sigma with hat]2, we plot i(p) against X (p) in Figure 2(b) for the four trajectories shown in Figure 1(b); here Wi is the baseline frequency. Note that Trajectory 3 has significantly larger i(p) values than the other three trajectories. This reflects the greatly increased uncertainty in the empirical baseline frequency given the strong within-trajectory dependence for Trajectory 3. It is also interesting to see that Trajectories 2 and 4 have the smallest i(p) values, as a result of their highly regular patterns. Combined together, these observations suggest that [sigma with hat]2 is mostly affected by trajectories that exhibit strong within-trajectory dependence, but much less so by those that possess highly regular patterns.
3.3 A Subsampling Extrapolation Approach
The method-of-moments bias-correction method discussed in the last section is limited to linear models. Cook and Stefanski (1994) proposed a SIMEX method which can yield approximately unbiased estimators in the presence of measurement error for both linear and nonlinear models. The key idea is to generate some pseudo “remeasurements,” Wi(ζ), where equation M28, i = 1, …, n. This is typically achieved through simulation by adding to each error-prone variable Wi an independent error with variance equal to equation M29 when equation M30 is known, or by taking suboptimal linear combinations of the replicate measurements when equation M31 is unknown but independent replicates of Wi are available (Devanarayan and Stefanski 2002). Regression analysis without any bias correction is then conducted in terms of Wi(ζ) to obtain [theta w/ hat] (ζ). Because ζ = −1 corresponds to the case of no measurement error, a bias-corrected estimator can then be obtained by extrapolating a fitted model to [theta w/ hat] (ζ) at ζ = −1.
It is difficult to apply the simulation step of SIMEX here, because there is only one baseline trajectory observed per subject and the variance of the summary statistic defined on it is also unknown. The lack of independent replicates is common for data arising from stochastic processes such as time series and spatial data. If the within-process dependence is reasonably weak, one can create approximately independent subreplicates using subsampling techniques. The obtained sub-replicates can then be used in applications such as estimating the variance and distribution of a general statistic calculated from the data (Politis, Romano, and Wolf 1999). In the current setting, the use of subsampling entails considering the summary statistics Wi(t, p) that are defined over the subintervals S(t, p) = (t, t + ], where 0 ≤ t ≤ (1 − p)τ and p0 < p < 1 for some p0 > 0. By condition (5), we can immediately conclude
equation M32
Because xi(p) is typically larger than one, Wi(t, p) has a variance that is larger than equation M33.
The above interesting observation, together with the fact that Wi(t, p) is an unbiased estimator for Xi, has inspired us to treat Wi(t, p) as a pseudo remeasurement for Wi. Regression analysis without accounting for measurement error can be done based on data {Yi, Wi(ti, p), Zi}, where ti [set membership] [0, (1− p) τ], i = 1, …, n. Let [theta w/ hat] (p; t1, …, tn) denote the resulting estimator. We then integrate out the ti’s and obtain
equation M34
It is generally difficult to calculate (9) exactly. We use Monte Carlo average of [theta w/ hat] (p; t1, …, tn) based on multiple randomly selected ti’s as an estimate for (9), where 0 ≤ ti ≤ (1 − p)τ.
To correct for the bias, we also need to obtain xi(p), where
equation M35
Because αi and equation M36 are both unknown, xi(p) must be estimated. One quick approximation is to set xi(p) ≈ 1/p. This is valid if equation M37 is small and/or p is large so that equation M38 is much smaller than one. Alternatively we may estimate equation M39 in xi(p) by [alpha]/[sigma with hat]2. This can lead to less biased estimators if equation M40 is approximately constant. In either case, the estimate is free of i so we rewrite it as x̂(p).
Following Cook and Stefanski (1994), we propose to fit a parametric model to (x̂(p), [theta w/ hat] (p)) based on some preselected values of p, where [theta w/ hat] (p) is as defined in (9). We then use the extrapolated value at p = ∞ from the fitted model as our bias-corrected estimate for θ. The validity of extrapolating at p = ∞ can be justified in two ways. First, note that p = ∞ means that each Ni is observed over an infinitely long time interval. This is equivalent to that Xi’s are observed without error, since Wi converges to Xi almost surely assuming ergodicity (Daley and Vere-Jones 1988). Second, note that the extra variance added to Wi is approximately equation M41 under our approach, but would be equation M42 if SIMEX could be used. Solving 1/p − 1 = ζ for ζ = −1 leads to p = ∞. From this perspective, our extrapolation criterion coincides with that used for SIMEX.
To conduct the extrapolation, we also need a parametric model for (x̂(p), [theta w/ hat] (p)). We consider a quadratic model defined in terms of x̂(p),
equation M43
where Γ = (γ1, γ2, γ3). Carroll et al. (2006) suggested that the SIMEX version of the above model often works well in practice. In our simulation, we also consider a cubic model,
equation M44
with the understanding that now Γ = (γ1, γ2, γ3, γ4).
The asymptotic theory for the proposed SUBEX estimator is derived in our online appendix. Following similar lines as those in Carroll et al. (1996), we show that the SUBEX estimator has an asymptotically normal distribution. Inspired by the theory, we also propose a sandwich formula in the Appendix to estimate the variance of the estimator. Our simulation results confirm that the sandwich formula works well.
3.4 Comparison to Two Naive Alternatives
It is intuitively appealing to assume independence between subsets of a baseline cocaine use trajectory observed over the intervals (0, τ/2] and (τ/2, τ]. Recall that Wi(0, 1/2) and Wi(τ/2, 1/2) are the counterparts of Wi based on the data in (0, τ/2] and (τ/2, τ], respectively. In the linear model case, this assumption leads to the naive variance estimator,
equation M45
In the nonlinear model case, we may instead use the pseudo “remeasurements”:
equation M46
A bias-corrected estimator can then be obtained based on equation M47 instead of equation M48 in the linear case, and SIMEX can be applied based on the pseudo “remeasurements” Wi(ζ) in the nonlinear case. The latter is simply a special case of the empirical SIMEX when two independent replicates are available for the error-prone variable (Devanarayan and Stefanski 2002).
For the naive variance estimator, it follows from (6) that
equation M49
Thus, equation M50 underestimates σ2 if α > 0, where the bias is of order τ−2 and can be approximated by −2[alpha]/(τ2[sigma with hat]2). This immediately implies that equation M51 is not consistent for σ2 unless τ → ∞. However, for our motivating data τ cannot be too large, because of the difficulty to reconstruct a reliable baseline trajectory over an extensively long period of time.
In contrast, our proposed estimator [sigma with hat]2 has a much smaller bias, which in turn will yield less biased regression coefficient estimates. To see this point, suppose that Ni is m-dependent, that is, the dependence in Ni completely vanishes beyond a lag distance m. Then it is easy to show that (4) becomes
equation M52
Suppose that m/τp ≤ 1/2. Then (6) becomes
equation M53
The above relationship implies that [sigma with hat]2 is an unbiased estimator for σ2. Write equation M54 for [sigma with hat]2 to make explicit its dependence on n. By the strong law of large numbers, equation M55 converges to its expected value almost surely. It then follows that equation M56 is consistent for σ2, in the sense that equation M57 almost surely as n → ∞. This in turn implies that the bias-corrected estimator [theta w/ hat]mom is consistent for θ. Note that the consistency result requires only τ > 2m and n → ∞ but not τ → ∞.
To further illustrate their differences, we estimate equation M58 by our proposed variance estimator and the naive estimator for the four baseline trajectories given in Figure 1(b), where Wi is the baseline frequency. Specifically, the former estimator is obtained by regressing i(p) against X (p), where i(p) and X (p) are as defined in (7), and the latter estimator is simply equation M59. Our proposed variance estimator yields a much larger estimate (=0.0308) than the naive one (=0.0264) for Trajectory 2; this is expected given its relatively strong within-trajectory dependence. The estimates for the other three trajectories are all similar. These results demonstrate the difference by accounting for the negative bias in the naive estimator when there is a strong within-trajectory dependence.
For the new pseudo “remeasurements” Wi(ζ), it follows from condition (5), the definition of Wi and stationarity of Ni that
equation M60
Clearly, the amount of variance inflation is smaller than the desired amount, that is, equation M61. Let ζ = 1/p − 1. Then, we can rewrite the above as
equation M62
For our proposed SUBEX, we use Wi(t, p) as the pseudo “remeasurements,” where Wi(t, p) is the counterpart of Wi defined on (t, t + ]. Recall from (8) that
equation M63
If αi ≥ 0 and p ≥ 1/2, then var[Wi(t, p)|Λi] ≥ var[Wi(ζ)|Λi], where ζ = 1/p− 1. Thus, the variance of Wi(t, p) is closer to the target (inflated) variance equation M64 than that of Wi(ζ). This implies that the proposed SUBEX with x̂(p) = 1/p can be more effective in bias correction than the empirical SIMEX. Moreover, it suggests that we should set p ≥ 1/2. The proposed SUBEX with x̂(p) = 1/p − (1 − p)[alpha]/(p2τ2[sigma with hat]2) is generally even more effective in bias reduction. This can be seen from our simulation results in Section 4.
3.5 Extension to Nonstationary Processes
We argued in Section 2.2 that it is reasonable to assume stationarity for the baseline trajectories in our motivating data. However, there may be situations where such an assumption is questionable. To generalize our methods, let Ni be a nonstationary stochastic process with mean given by Xi + λ(t), where λ (t) changes over time but is fixed across processes at any given t. For identifiability let equation M65. Define Wi, Ui, and Wi(t, p) as in Section 3.1. It is easy to see that E(Uii) = 0. Also define
equation M66
equation M67
Clearly, equation M68 and hence E[Ui(t, p)|Λi] = 0.
Suppose that the variance of Ui(t, p) satisfies condition (4). In the linear model case, our proposed method-of-moments bias-correction method can be extended to the current setting by replacing Wi(t, p) therein with equation M69. The validity of this approach follows analogously as in the stationary case. In the nonlinear model case, observe the facts that equation M70 and
equation M71
We therefore use equation M72 as the pseudo “remeasurements.” In practice we replace λ (t) by its consistent estimator [lambda with circumflex] (t), given that such an estimator exists.
4.1 Simulation 1: Linear Model
We simulate independent tuples {Yi, Ni, Zi}, for i = 1, …, n. Here Ni is a stationary point process with intensity Xi, where Xi ~ 1 + Log Normal(0, 0.5); Zi is an error-free Bernoulli random variable independent of Xi and with mean equal to 0.5; and Yi is given by
equation M73
where θ = (θ0, θ1, θ2)T = (1, 1, 1)T and εi ~ Normal(0, 0.52). We assume that Xi is unknown but can be estimated by Wi = Ni((0, τ])/τ, where Ni((0, τ]) denotes the number of events of Ni contained in the time interval (0, τ]. Note that Wi takes the general form given in (3).
We consider two scenarios for Ni. Under the first scenario, Ni is a homogeneous Poisson process. To simulate Ni, we first generate a Poisson random variable Mi with mean equal to τXi, and then randomly simulate Mi events in (0, τ] according to a uniform distribution. Under the second scenario, Ni is a homogeneous Poisson cluster process. To simulate Ni, we first generate a homogeneous Poisson process with intensity ρi = Xi/3 as the parent process. For each parent, we then generate m = 3 children on average according to a Poisson distribution, and disperse the children locations independently following a normal distribution centered at the parent location and with standard deviation ω = 2. The final process contains only the children event times. For Poisson processes, all events in disjoint time intervals are independent. However, this is not true for Poisson cluster processes.
We repeat the simulation 200 times in each case. For each simulated dataset, we apply our proposed MOM and SUBEX estimators to estimate θ. For the former, we obtain [alpha] and [sigma with hat]2 based on 20 equally spaced p-values in [1/3, 1/2] when regressing (p) on X (p). For the latter, we assign x̂(p) = 1/p and x̂(p) = 1/p − (1 − p)[alpha]/(p2τ2[sigma with hat]2) and refer to the resulting estimators as SUBEX1 and SUBEX2, respectively. We set the p-values used for both estimators to be from 0.6 to 0.95 with an increment of 0.05, and use a quadratic function for the extrapolation. In the online Appendix, we show additional results for SUBEX with a finer choice of p-values and also with a cubic extrapolating function. These results suggest that the SUBEX method is not sensitive to the choice of p-values, since increasing the number of p does not change the results much. Using the cubic function can reduce the bias but also increases the variance of the SUBEX estimator, and leads to a less efficient estimator in terms of mean squared error.
For comparison, we also apply the naive estimator by treating Wi as if it was Xi, and the naive MOM (referred to as MOM1) and the empirical SIMEX estimators described in Section 3.4. For the last two estimators, we assume that Ni in (0, τ/2] and (τ/2, τ] are independent. This is true for the Poisson processes, but not for the Poisson cluster processes. In Table 1, we report the bias, standard error, mean of the estimated standard error and relative efficiency of the considered estimators. The relative efficiency of an estimator is defined as the mean squared error of the naive estimator divided by that of the estimator under consideration.
Table 1
Table 1
Simulation results for linear model
In the Poisson process case, we set τ = 10 and n = 200. From Table 1, we see that the naive estimator for θ1 has a large bias due to the estimation error in Wi. However, the biases in the other estimators are much smaller. Since Ni is Poisson, there is no need to consider within-trajectory dependence. As a result, the naive MOM1 estimator, the empirical SIMEX, and SUBEX1 have slightly higher relative efficiencies than the proposed MOM2 and SUBEX2 estimators, both of which are designed to account for within-trajectory dependence. However, none of the estimators for θ2 suffers from attenuation, and the naive estimator has the smallest standard error. This is because Zi is measured without error and is also independent with Xi. This phenomena has been repeatedly observed throughout our simulation. We will therefore discuss only the results for θ1 from now on.
In the Poisson cluster process case, we set τ = 10, 20 and n = 200. From Table 1, we see that the naive estimator is even more biased than in the Poisson process case. This is because of the increased estimation error in Wi due to correlation. Our proposed MOM2 estimator has the smallest bias among all estimators: the bias when τ = 10 is −0.1869 whereas the next smallest bias in magnitude is −0.4009; the bias when τ = 20 further reduces to −0.1014, which is again the smallest among all estimators. The relative efficiency for the proposed MOM2 estimator is also the highest. In particular, it can be eight times as efficient as the naive estimator when τ = 20. The two SUBEX estimators perform significantly better than the empirical SIMEX. This is expected following our discussion in Section 3.4. Between the two, SUBEX2 works slightly better than SUBEX1, because it better accounts for within-trajectory dependence. In general, the amount of information used to estimate Xi increases as τ increases. This fact explains the better results for all estimators when τ = 20.
4.2 Simulation 2: Survival Model
An important part of the project is to model the first relapse time after treatment. As explained in Section 2, the data are partially interval censored. We use the second simulation study to mimic the real data and to check the performance of the proposed estimators.
We simulate Xi, Ni, and Zi as in Section 4.1, for i = 1, …, n, and simulate the failure time Ti through a Cox proportional hazard model with the hazard function
equation M74
where λ0(t) = t is the (unobserved) baseline hazard function. The regression coefficients are θ = (β, η)T = (1, 1)T. We assume censoring at random and set the censoring times to be (0.2, 0.5, 1). Let the censoring indicator δi be a binary variable independent of Xi and Zi, with P(δ;i = 1) = 0.5. When δi = 1, the event time Ti is censored in the interval between the two censoring times closest to Ti; if Ti is less than 0.2, it is censored in [ equation M75]; if an event time is over 1, it is automatically right censored at 1. Overall, about 12% of the observations are right censored, 43% are interval censored, and the rest 45% are observed.
For interval censored data, it is difficult to separate estimating the baseline hazard function from estimating θ by using approaches such as the partial likelihood (Cai and Betensky 2003). Many authors have considered modeling the baseline hazard as spline functions. We follow Cai and Betensky (2003) and Ruppert, Wand, and Carroll (2003) to model the log baseline hazard as a linear spline function:
equation M76
where x+ [equivalent] max(x, 0) and κk’s are the knots. There are two immediate benefits for using this model. First, λ0(·) is guaranteed to be nonnegative, so that we do not need any constraint on the parameters when maximizing the likelihood. Second, since [var phi](·) is modeled as piecewise linear polynomial, the cumulative hazard function can be written out in an explicit form. For higher-order spline functions, such explicit expressions are not available.
Let Θ be the collection of all parameters, including both θ and the spline coefficients in (11). Then, Θ can be estimated by maximizing the penalized likelihood
equation M77
where [ell]i(Θ) is the log-likelihood function for the ith subject, equation M78 is a tuning parameter that controls the smoothness of [var phi](t), and b = (b1, b2, …, bK)T. Cai and Betensky (2003) gave the detailed expression for [ell]i(Θ) and also suggested a data-driven procedure to estimate equation M79. As pointed out by Apanasovich, Carroll, and Maity (2009), estimation of the parametric component in a semiparametric setting is not sensitive to the choice of the smoothing parameter. We therefore use a fixed yet reasonable equation M80 in our simulation so as to reduce the computational burden.
As in the linear model case, we use both homogeneous Poisson processes and Poisson cluster processes for Ni. We consider five estimators: the estimator using the true Xi, the naive estimator that replaces Xi by Wi, the two SUBEX estimators, that is, SUBEX1 and SUBEX2, and the empirical SIMEX. The results in Table 2 are obtained by modeling [var phi](t) with a 10-knot spline function. The spline function provides a sufficient approximation, as can be seen from the fact that the estimator based on the true Xi is almost unbiased. Thus, all biases in the other four estimators are mainly caused by the estimation error in Wi. We have also tried K = 25 knots and obtained very similar results.
Table 2
Table 2
Simulation results in the interval-censored failure time data case
In the Poisson process case, we set τ = 10 and n = 200. Similar to the linear model case, the naive estimator is severely biased, but the SUBEX and the empirical SIMEX estimators can significantly reduce the bias. Since there is no within-trajectory dependence in Ni, the empirical SIMEX perform better than the two SUBEX estimators, and there is no benefit for using SUBEX2 over SUBEX1.
In the Poisson cluster process case, we set τ = 10, 20 and n = 200. We also consider n = 400 when τ = 20. From Table 2, we see that both SUBEX1 and SUBEX2 have much smaller bias in magnitude than the empirical SIMEX but are also more variable. Thus, there is a trade-off between reducing the bias and increasing the variance. Between the two SUBEX estimators, SUBEX2 is slightly more effective in reducing the bias than SUBEX1. In terms of relative efficiency, the two SUBEX estimators perform significantly better than the empirical SIMEX except when τ = 20 and n = 200. In this case, the relative efficiency of all three estimators are similar.
When τ = 10, the bias is quite large after the bias correction. This is because the within-trajectory dependence is strong in this case. When τ increases to 20, the biases decrease significantly in all cases. However, when n increases from 200 to 400, the biases are nearly unchanged for all estimators, although the standard errors decrease. In general, the bias of the SUBEX estimators depends on the strength of the within-trajectory dependence relative to the length of the observation interval, that is, τ, but not on the sample size n. The statistic 2[alpha]/(τ 2[sigma with hat]2) given in Section 3.4 can be used to gauge the level of dependence. Specifically, it is approximately equal to 0.2519 when τ = 10 and equal to 0.1228 when τ = 20.
In the online Appendix, we show additional simulation results with a finer choice of p-values and also with a cubic extrapolating function. Similar to the linear model case, we find that using the cubic function can reduce the bias but inflate the variance of the final estimator. Increasing the number of p-values does not change the results much, although variability of the new estimator does become smaller.
5.1 Definition of Variables
We apply our proposed bias-correction methods to analyze the motivating dataset given in Section 2. The main outcomes of interest are the (log) CCQ–Brief scores collected at days 90 and 180 (when available) after treatment and the time to first relapse. For the latter, we focus only on data in the first 90 days from participants in the second study, because the self-report cocaine use data obtained in this period are believed to be more reliable and are also supplemented with more frequent urine samples for these subjects.
The predictor variables used in both analysis include age, gender (=1 for female and 0 for male), race (=1 for African American and 0 for the rest), number of cocaine-use years and number of anxiety disorders present at the baseline interview. For the first analysis involving the CCQ–Brief scores, we also include an indicator variable which is equal to one if the measurement was take at day 90 and equal to zero otherwise. For a given baseline cocaine-use trajectory, we calculate the baseline frequency and baseline average daily-use amount. We include one of them each time as an error-prone predictor variable in the analyses. We have also conducted a separate analysis to include the second-order statistics defined in (2) with ψ(r) therein being the second and third principal components of gi(r), r = 1, …, 7; these summary statistics characterize one’s weekly use pattern and short-range dependence, respectively. Figure 4 shows histograms for these four summary statistics. However, neither the weekly pattern nor the short-range dependence is significant, likely because the associated second-order summary statistics are tightly distributed around their means [see Figure 4(c) and (d)] and also the sample size is relatively small. A significant relationship could be obtained if more subjects are included. In light of this result, we focus on the baseline frequency and baseline average daily use amount in our discussion below.
Figure 4
Figure 4
Histograms of the four summary statistics derived from the baseline cocaine use trajectories. The online version of this figure is in color.
5.2 Analysis of CCQ–Brief Scores
Our analysis is based on 126 participants whose CCQ–Brief scores were available at days 90 and/or 180. Eight of the remaining 16 participants never returned for any interview nor provided any cocaine-use record. It is difficult to assess the causes of dropout for these individuals. The other eight participants dropped out before day 90 with partial cocaine-use records reported. Among these, six dropped out on or before day 15, but only two of them relapsed before dropout. Thus, there is no strong evidence suggesting a connection between the missing mechanism and the success of treatment. Given the relatively small number of missing observations, we conduct our analysis based on the complete data.
Figure 5 shows a scatter plot of X (p) and (p) for p = k/τ, where k = 30, 31, …, 40 and τ = 80. There is clearly a linear trend between X (p) and (p), which suggests that condition (4) is reasonable. By regressing (p) on X (p), we obtain [sigma with hat]2 = 2.269 and [alpha]/τ2 = 0.4261. Recall that 2[alpha]/(τ2[sigma with hat]2) can be regarded as the amount of bias of the naive variance estimator equation M81 relative to σ2. Based on [alpha] and [sigma with hat]2, we conclude that equation M82 approximately underestimates σ2 by 37.56%. This bias is quite large and in turn will yield biased regression coefficient estimates. As we discussed in Section 3.3, the bias is mainly caused by underestimation of equation M83 for trajectories that are similar to the third trajectory given in Figure 1(b).
Figure 5
Figure 5
Scatter plot of X (p) and (p) for p = k/τ where k = 30, 31, …, 40 and τ = 80.
Table 3 lists the estimated regression coefficients and their standard errors for our proposed estimators based on the proposed MOM and the two SUBEX estimators used in the simulation. The standard errors are obtained by bootstrap. The estimated coefficients from SUBEX1 are very close to those from the MOM estimator, whereas those from SUBEX2 are slightly more different. For baseline frequency, the resulting estimates are equal to 0.565, 0.577, and 0.662 for these three estimators, respectively. For comparison, we also apply the naive estimator, the naive MOM estimator based on equation M84, and the empirical SIMEX estimator. For baseline frequency, the estimates are equal to 0.476, 0.529, and 0.537, which are all smaller than those given by our proposed methods.
Table 3
Table 3
Results for the analysis of the CCQ–Brief score data
We interpret our findings based on the proposed MOM estimator, in light of its better performance over the SUBEX as suggested by our simulation. The results suggest that both race and baseline frequency are significant at the 0.01 level. Because the signs for these estimated coefficients are positive, we conclude that African Americans and heavy baseline cocaine user tend to have high (log) CCQ–Brief scores three/six months after treatment (95% confidence intervals 0.1116 to 0.5027 for race and 0.1523 to 0.9782 for baseline frequency). In addition, the number of current anxiety disorders is significant at the 0.10 level but not at the 0.05 level. Thus, the (log) CCQ–Brief scores appear to increase with the number of current anxiety disorders. The remaining variables are all insignificant. In particular, the (log) CCQ–Brief scores are not significantly different at days 90 and 180 after treatment.
We have conducted additional analysis by defining the baseline average daily-use amount as the error-prone variable but have found it insignificant at the 0.10 level; this is likely due to the significantly larger measurement error associated with this particular summary statistic. Among the remaining variables, race still remains significant at the 0.01 level but all the others are not significant at the 0.10 level. We have also repeated our analysis by using the (log) CCQ–Brief score collected at day 14 as the response variable. In this case, no variable is significant at the 0.10 level.
5.3 Analysis of First Relapse Time
We model the time-to-first-relapse data by the proportional hazard model given in (10). We use data from 79 subjects participating in the second study; data for the remaining four subjects in the same study were completely missing. The first relapse time was determined from the self-report posttreatment cocaine use trajectories. To ensure the quality of the self-report data, urine samples were tested on days 14, 30, 90, and 180 to check whether a patient had lied. As a result, the relapse time data are partially interval-censored. If a positive urine test was obtained before the self-report relapse time, then the true relapse time was interval censored between the first positive urine sample test date and the previous negative test date (or the discharge date if no previous test was available). Since data are less reliable after the first 90 days, any relapse time greater than 90 is considered as right censored. This data structure is similar to that considered in the simulation study in Section 4.2: about 50.6% of the subjects have observed relapse time, 31.6% are interval censored and 17.8% are right censored.
We use the baseline average daily use amount as the error-prone variable. Table 4 lists the estimated regression coefficients and their standard errors from the proposed SUBEX and the empirical SIMEX estimators. As we can see from the table, both age and cocaine use years have a significant effect on the time to first relapse. Specifically, cocaine use years has a positive effect on the hazard function, meaning that a longer cocaine use history results in a quicker relapse. The variable, age, on the other hand, has a negative effect. This means that among users with the same length of cocaine use history, those who are older tend to remain abstinent for a longer period of time.
Table 4
Table 4
Results for the analysis of the time-to-first-relapse data
For the baseline average daily use amount, the naive estimator surprisingly shows a significant negative effect on the hazard function. A similar result was also reported in Sinha et al. (2006). However, this is very counterintuitive, since it implies a longer time to relapse for those who used more during the baseline period. The empirical SIMEX estimator yields results very similar to the naive estimator and hence fails to correct for the bias in this case. In contrast, our proposed SUBEX estimators yield estimates that are much closer to zero than the naive and the empirical SIMEX estimators; SUBEX2, which accounts for within-trajectory dependence, even gives an insignificant positive estimate. Both estimators suggest that the effect due to baseline average daily use amount is insignificant after accounting for effects due to the other variables. This is intuitively more reasonable.
In many scientific problems in particular substance use research, it is common to have summary statistics derived from some stochastic processes as covariates. The cocaine dependence treatment dataset considered in this article is one example of such problems. The estimation error in these summary statistics causes estimation bias in the regression coefficients like in classical measurement error problems. As we have illustrated through simulation and data analysis, the estimated coefficients can be severely attenuated and can even result in wrong inference, if the measurement error is not properly accounted for.
The fact that the summary statistics are derived from stochastic processes presents new challenges. Unlike in classical measurement error problems, the error in the summary statistics is heteroscedastic and depends on individual stochastic processes. To correct for the bias in the estimators, we propose a new method-of-moments approach for linear models and a subsampling extrapolation method that is generally applicable to both linear and nonlinear models. The methods we have proposed are based on novel subsampling techniques that take into account of the correlation within individual processes, and have shown good performance in both simulation and real data analysis.
It is important to note that there could be significant bias (typically underreporting) associated with self-report cocaine use data. The TLFB is no exception to that limitation. Other than the TLFB, cocaine use in addicted individuals has been measured in other ways such as using toxicology analysis of cocaine metabolites in urine, blood or hair samples. Indeed all patients included in this study were tested with urine samples upon admission to the inpatient treatment unit. However, detection of cocaine in blood or urine samples is limited to 2–3 days since use and hence requires multiple sampling in an ongoing basis to evaluate pattern of drug use over a defined period. As such, it cannot be used to assess pattern of use in the recent past such as would be needed as patients enter a treatment facility. While hair samples can provide presence or absence of drug over a 90-day period, they do not provide any information on when a subject used drugs and at what frequency. Furthermore, hair samples are not reliable for all types of hair and often it is problematic to detect drug in high curly or wavy hair. Cost of hair sample testing is also high and hence it presents a feasibility challenge. Finally, new daily self-report methods such as ecological momentary assessment, which involves moment-to-moment monitoring of daily drug use with electronic diaries while patients are in the real world, has also been used recently (Epstein and Preston 2010; Preston et al. 2009). However, this method has been used in community and outpatient treatment samples of drug abusers and it would be difficult to assess recent drug-use patterns when subjects are entering a treatment facility.
Given the limitations with other data collection methods as being explained above, we relied on structured interview techniques (i.e., the TLFB) to obtain self-report baseline cocaine uses in our study. To ensure data quality, we followed strictly the well-developed TLFB interview procedures and also took additional measures. For example, all research assistants who collected the data had been trained by Ph.D.-level psychologists and each one of them had already had over three years’ experience in administration of similar assessments; they were also closely supervised when conducting these interviews. Moreover, all subjects had been informed upfront that all data would be only coded by a number and be kept strictly confidential, and that the federal certificate of confidentiality granted to this study would protect the research information from being legally summoned by the courts. Hence there would be no legal consequences to them by providing the most honest responses. They were also informed that they would be removed from the study if they were found out not being truthful about their drug use. All interviews were conducted in a quiet and comfortable testing room, and excellent rapport was established with subjects.
With the aforementioned additional measures we believe that we have minimized the chance of underreporting, however this problem cannot be completely eliminated. The effect of measurement error due to self report has also been studied in other settings, most notably in nutritional epidemiology. For example, Kipnis et al. (1999) modeled the self-report dietary intake by Qij = α0 + α1Ti + ri + εij, where Qij is the jth replicate of the self-report dietary intake from the ith subject, Ti is the true intake, ri is a subject-specific bias and εij is a random error. Different from the classical measurement error models, the model proposed by Kipnis et al. allows a subject-specific bias, ri, which is modeled as a random effect with distribution equation M85 for some equation M86. They argued that this subject-specific bias if ignored would cause further attenuation to the regression coefficients and make them even less significant. We expect a similar phenomenon to happen in our setting as well if there is serious underreporting. To account for the self-report error, Kipnis et al. assumed that a reference instrument was available and could be modeled as Fij = μ0 + Ti + si + uij, where si is another subject-specific random effect possibly correlated with ri and uij is a random error in the instrument variable. As pointed out by the authors, even with the additional information in Fij, this model is complicated and some of the parameters may have identifiability issues. Neither such a reference instrument nor replicates of the baseline trajectories are available in our data. Pushing for additional baseline measurements is key in order to better account for the potential bias due to underreporting for substance use research. However, this is beyond the scope of this article.
Supplementary Material
supplementary material
Yongtao Guan’s research was supported by NSF grant DMS-0845368 and by NIH grant 1R01DA029081-01A1. Yehua Li’s research was supported by NSF grant DMS-0806131. Rajita Sinha’s research was supported by NIH grants P50DA016556 and UL1DE019586.
Theory for SUBEX and simulation: Asymptotic theory and variance estimation for the proposed SUBEX estimator and some additional simulation results. (supplement.pdf, pdf file)
Contributor Information
Yongtao Guan, Division of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06520.
Yehua Li, Department of Statistics, University of Georgia, Athens, GA 30602.
Rajita Sinha, Child Study Center, Yale University, New Haven, CT 06511.
  • Apanasovich TV, Carroll RJ, Maity A. SIMEX and Standard Error Estimation in Semiparametric Measurement Error Model. Electronic Journal of Statistics. 2009;3:318–348. [PMC free article] [PubMed]
  • Brown J, Kranzler HR, Del Boca FK. Self-Reports by Alcohol and Drug Abuse Inpatients: Factors Affecting Reliability and Validity. British Journal of Addictions. 1992;87:1013–1024. [PubMed]
  • Cai T, Betensky RA. Hazard Regression for Interval-Censored Data With Penalized Spline. Biometrics. 2003;59:570–579. [PubMed]
  • Carroll KC, Power M, Bryant K, Rounsaville BJ. One Year Follow-Up Status of Treatment-Seeking Cocaine Abusers: Psychopathology and Dependence Severity as Predictors of Outcome. Journal of Nervous and Mental Disease. 1993;181(2):71–79. [PubMed]
  • Carroll RJ, Küchenhoff H, Lombard F, Stefanski LA. Asymptotics for the SIMEX Estimator in Nonlinear Measurement Error Models. Journal of the American Statistical Association. 1996;91(433):242–250.
  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. New York: Chapman & Hall; 2006.
  • Cook JR, Stefanski LA. Simulation-Extrapolation Estimation in Parametric Measurement Error Models. Journal of the American Statistical Association. 1994;89:1314–1328.
  • Crainiceanu CM, Caffo BS, Di C-Z, Punjabi NM. Non-parametric Signal Extraction and Measurement Error in the Analysis of Electroencephalographic Activity During Sleep. Journal of the American Statistical Association. 2009;104:541–555. [PMC free article] [PubMed]
  • Daley DJ, Vere-Jones D. An Introduction to the Theory of Point Processes. New York: Springer-Verlag; 1988.
  • Devanarayan V, Stefanski LA. Empirical Simulation Extrapolation for Measurement Error Models With Replicate Measurements. Statistics & Probability Letters. 2002;59:219–225.
  • Epstein DH, Preston KL. Daily Life Hour by Hour, With and Without Cocaine: An Ecological Momentary Assessment Study. Psychopharmacology (Berl) 2010;211(2):223–232. [PMC free article] [PubMed]
  • Fals-Stewart W, O’Farrell TJ, Freitas TT, McFarlin SK, Rutigliano P. The Timeline Follow-Back Reports of Psychoactive Substance Use by Drug-Abusing Patients: Psychometric Properties. Journal of Consulting and Clinical Psychology. 2000;68:134–144. [PubMed]
  • First M, Spitzer R, Gibbon M, Williams J. Structured Clinical Interview for DSMIV: Patient Edition. Washington, DC: American Psychiatric Press Inc; 1995.
  • Fox HC, Garcia M, Milivojevic V, Kreek MJ, Sinha R. Gender Differences in Cardiovascular and Corticoadrenal Response to Stress and Drug Cues in Cocaine Dependent Individuals. Psychopharmacology. 2006;185(3):348–357. [PubMed]
  • Kampman KM, Volpicelli JR, Mulvaney F, Alterman AI, Cornish J, Gariti P, Cnaan A, Poole S, Muller E, Acosta T, Luce D, O’Brien C. Effectiveness of Propranolol for Cocaine Dependence Treatment May Depend on Cocaine Withdrawal Symptom Severity. Drug and Alcohol Dependence. 2001;63:69–78. [PubMed]
  • Kipnis V, Carroll RJ, Freedman LS, Li L. Implication of a New Dietary Measurement Error Model for Estimation of Relative Risk: Application to Four Calibration Studies. American Journal of Epidemiology. 1999;150:642–651. [PubMed]
  • Li E, Wang N, Wang N-Y. Joint Models for a Primary Endpoint and Multiple Longitudinal Covariate Processes. Biometrics. 2007;63:1068–1078. [PubMed]
  • Li E, Zhang D, Davidian M. Conditonal Estimation for Generalized Linear Models When Covariates Are Subject-Specific Parameters in a Mixed Model for Longitudinal Measurements. Biometrics. 2004;60:1–7. [PMC free article] [PubMed]
  • Politis DN, Romano JP, Wolf M. Subsampling. New York: Springer-Verlag; 1999.
  • Preston KL, Vahabzadeh M, Schmittner J, Lin JL, Gorelick DA, Epstein DH. Cocaine Craving and Use During Daily Life. Psychopharmacology (Berl) 2009;207(2):291–301. [PMC free article] [PubMed]
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. New York: Cambridge University Press; 2003.
  • SAMHSA. National Household Survey on Drug Abuse. 2004 available at
  • Sherman MF, Bigelow GE. Validity of Patients Self-Reported Drug Use as a Function of Treatment Status. Drug and Alcohol Development. 1992;30:1–11. [PubMed]
  • Sinha R. How Does Stress Increase Risk of Drug Abuse and Relapse? Psychopharmacology. 2001;158:343–359. [PubMed]
  • Sinha R. The Role of Stress in Addiction Relapse. Current Psychiatry Reports. 2007;9:388–395. [PubMed]
  • Sinha R, Garcia M, Paliwal P, Kreek MJ, Rounsaville BJ. Stress-Induced Cocaine Craving and Hypothalamic-Pituitary-Adrenal Responses Are Predictive of Cocaine Relapse Outcomes. Archives of General Psychiatry. 2006;633:324–331. [PubMed]
  • Sobell L, Sobell M. Timeline Follow Back: A Technique for Assessing Self-Reported Ethanol Consumption. In: Allen J, Litten R, editors. Techniques to Assess Alcohol Consumption. Totowa, NJ: Humana Press, Inc; 1993.
  • Song X, Davidian M, Tsiatis AA. An Estimator for the Proportional Hazard Model With Multiple Longitudinal Covariates Measured With Error. Biostatistics. 2002;3(4):511–528. [PubMed]
  • Tiffany ST, Singleton E, Haertzen CA, Henningfield JE. The Development of a Cocaine Craving Questionnaire. Drug and Alcohol Dependence. 1993;34:19–28. [PubMed]
  • Wang CY, Wang N, Wang S. Regression Analysis When Covariates Are Regression Parameters of a Random Effect Model for Observed Longitudinal Measurements. Biometrics. 2000;56:487–495. [PubMed]