Misclassification of binary outcome variables is a known source of potentially serious bias when estimating adjusted odds ratios. Although researchers have described frequentist and Bayesian methods for dealing with the problem, these methods have seldom fully bridged the gap between statistical research and epidemiologic practice. In particular, there have been few real-world applications of readily grasped and computationally accessible methods that make direct use of internal validation data to adjust for differential outcome misclassification in logistic regression. In this paper, we illustrate likelihood-based methods for this purpose that can be implemented using standard statistical software. Using main study and internal validation data from the HIV Epidemiology Research Study, we demonstrate how misclassification rates can depend on the values of subject-specific covariates, and illustrate the importance of accounting for this dependence. Simulation studies confirm the effectiveness of the maximum likelihood approach. We emphasize clear exposition of the likelihood function itself, to permit the reader to easily assimilate appended computer code that facilitates sensitivity analyses as well as the efficient handling of main/external and main/internal validation-study data. These methods are readily applicable under random cross-sectional sampling, and we discuss the extent to which the main/internal analysis remains appropriate under outcome-dependent (case-control) sampling.
The consequences of misclassified binary outcome or exposure variables when estimating a crude odds ratio (OR) are well understood.1–5 Existing literature also covers the use of validation data to estimate crude ORs while adjusting for misclassification in case-control and cross-sectional studies,6–11 considering the relative merits of external versus internal validation study designs.1; 11–12 In regression applications, many researchers advocate the use of validation data to adjust for measurement error in continuous predictors.13–17
Regarding outcome misclassification for discrete responses, Magder and Hughes18 outline the problem under logistic regression and advocate maximum likelihood via an expectation-maximization algorithm.19 Their work primarily addresses the case of known misclassification probabilities (i.e., sensitivities and specificities) characterizing the observed outcome variable. While continuing to focus on the known sensitivities/specificities case, Neuhaus20 provides further insight into asymptotic bias and efficiency in the broader realm of the generalized linear model, as well as a more efficient computational maximum likelihood approach. Recent articles in the epidemiologic literature demonstrate Monte Carlo-based techniques that similarly facilitate sensitivity analyses with misclassified binary variables.21–22
Other related research includes extensions to settings with count or discrete survival outcomes.23–25 To incorporate validation data, some authors gravitate toward Bayesian approaches using prior assumptions about misclassification probabilities.26–28 From the parametric frequentist perspective, Carroll et al.11 provide general expressions for likelihood functions that accommodate internal validation data. Alternative developments include robust modeling of sensitivity and specificity via kernel smoothers,29 with comparisons of that approach versus parametrically modeling their dependence upon covariates.30
Our aim is to provide guidance for epidemiologists seeking accessible and efficient methods for obtaining validation data-based estimates of logistic regression parameters when the outcome is misclassified. We keep to a likelihood-based approach, as it avoids explicit specification of prior distributions and is readily facilitated for binary outcomes. In the general case, we model the dependence of sensitivity and specificity upon covariates via a second logistic regression model, promoting a flexible and intuitively appealing analytic approach.
The methodology that we illustrate is a direct expansion of the known misclassification rate setting considered by Magder and Hughes18 and Neuhaus,20 a covariate-adjusted extension of well-discussed methods for estimating crude ORs,6–9 and ultimately an application of the general main/validation study maximum likelihood approach outlined in Carroll et al.11 However, there have been few if any real-world applications of the latter approach making use of internal validation data, and such application presents computational challenges to the practicing epidemiologist. Thus, our goal is to bring this approach for addressing outcome misclassification in regression closer to the forefront of epidemiologic research. We pursue this aim by highlighting an instructive example involving misclassified outcome status in the HIV Epidemiology Research Study, by transparent exposition of appropriate likelihood functions, and by providing appendices with straightforward computer code that connects directly with that exposition.
Assume we wish to fit the following logistic regression model to cross-sectional data (we discuss implications of case-control sampling later):
We use the symbol τ for easy reference to Eq. (1) in Appendix 1. Instead of the true (0,1) response Y, suppose the primary (main) study relies upon an error-prone (0,1) alternative Y*. It is known that misclassification in Y* potentially invalidates estimates of (β0,…,βP) based on the “naïve” model that replaces Y by Y* in Eq. (1). The magnitudes and directions of biases in the naïve estimates depend upon the diagnostic properties of Y* as a substitute for Y.18
In the non-differential case,1 the critical diagnostic properties boil down to two parameters, sensitivity (SE) and specificity (SP):
If misclassification is differential, however, then sensitivity and specificity can vary according to subject-specific variables, making effects of misclassification less predictable.1 Thus, we define
where the vector X is usually some subset of (X1, X2,…, XP).
Suppose first that no validation data are available so that one has only main study data consisting of (yi*, xi1, …, xiP) on the ith experimental unit (i=1,…, nm). In this case, each independent record contributes the following likelihood term:
While it may technically be possible to estimate (β1,…,βP) based only on main study data without supplying values of misclassification probabilities,11; 18 these parameters will be weakly identifiable at best.11 Neuhaus20 notes that estimability of misclassification rates is compromised under mis-specification of the primary model [Eq. (1) here], further emphasizing the limited utility of a main study-only analysis. Thus, use of Eq. (5) is effectively limited to sensitivity analysis, wherein one supplies assumed values of SEx and SPx.
Both the EM approach of Magder and Hughes18 and the alternative maximum likelihood conceptualization of Neuhaus20 purport to maximize Eq. (5) after pre-specifying sensitivity and specificity values. For an implementation under non-differential misclassification that adapts readily to the differential case, Appendix 1 provides ready-to-use computer code utilizing the capacity for user-specified log-likelihood functions in the SAS NLMIXED procedure.31 To specify the likelihood to the level of detail required for programming, note that Eq. (5) may be written as follows under non-differentiality:
where Pr(Y = 1 | Xi = xi) = exp(τi)/[1 + exp(τi)], with τi obtained via Eq. (1). A special case of general likelihood expressions provided in Carroll et al.,11 this structure is directly reflected in the first sample program in Appendix 1.
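To make the sensitivity-analysis use of Eq. (5) concrete, the following Python sketch maximizes the main study-only likelihood for a single covariate under assumed non-differential sensitivity and specificity. (The paper's appendices instead provide SAS NLMIXED code; the data and parameter values here are hypothetical.)

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(42)

# Hypothetical main-study data: true Y follows Eq. (1); only Y* is recorded
n, beta0, beta1 = 5000, -1.0, 1.0
SE, SP = 0.75, 0.90                    # assumed (not estimated) values
x = rng.normal(size=n)
y = rng.binomial(1, expit(beta0 + beta1 * x))
ystar = np.where(y == 1, rng.binomial(1, SE, n), rng.binomial(1, 1 - SP, n))

# Negative log of the main study-only likelihood, Eq. (5)
def neg_loglik(beta):
    p = expit(beta[0] + beta[1] * x)       # Pr(Y=1 | x) via Eq. (1)
    pstar = SE * p + (1 - SP) * (1 - p)    # Pr(Y*=1 | x)
    return -np.sum(ystar * np.log(pstar) + (1 - ystar) * np.log(1 - pstar))

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
# fit.x holds the misclassification-adjusted estimates of (beta0, beta1)
```

Because SE and SP are supplied rather than estimated, this fit represents a single line of a sensitivity analysis; in practice one would repeat it over a grid of plausible (SE, SP) pairs.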
Because sensitivity analysis is seldom a fully satisfying solution, we emphasize using validation data to estimate (β1,…,βP) in Eq. (1) without pre-specifying sensitivity and specificity values. When the validation sample is external1 (i.e., separate from the main study), we confine attention to the non-differential case because external studies seldom measure the same covariates as the main study. External validation is also limited by a need to assume “transportability,” i.e., that sensitivity and specificity parameters targeted in the validation sample are identical to those operating in the main study.11–12 In the remainder of the paper, we use the shorthand “main/external” and “main/internal” to refer to settings in which main study data are combined with external or internal validation data, respectively.
Given that our primary focus is upon the analysis of main/internal study data as required in the motivating example, we relegate details of the main/external case to Appendix 2. The structure of the resulting main/external likelihood is reflected in the second SAS NLMIXED program found in Appendix 1.
Our main interest lies in the case in which an internal validation sample (of size nv) is randomly selected from the overall study sample. Again, main study experimental units contribute records of the form (yi*, xi1, …, xiP). In contrast, resources are expended toward those selected for validation to augment their records with the true outcome status (yi). Benefits of this supplemental data collection effort include removal of concern about transportability and flexibility to allow general patterns of differential misclassification.
As a first example, consider the case of two covariates, one continuous (X1) and one binary (X2), where sensitivity and specificity depend on X2. That is, define SEt = Pr(Y*=1 | Y=1, X2=t) and SPt = Pr(Y*=0 | Y=0, X2=t) (t=0,1). Main study contributions remain of the form in (4), yielding the following main study likelihood:
where I(.) is a binary (0,1) indicator for whether the condition in parentheses is true. In contrast, internal validation data records contribute terms of the form
yielding an internal validation subsample likelihood as follows:
Again, the full likelihood is proportional to L = Lm× Lv.
For the general case in which model (1) includes arbitrary predictors (X1,…, XP), we assume sensitivity and specificity depend on (X2*,…, XK*), which may denote a subset of (X1,…, XP) and/or include other variables or interaction terms. We favor a second logistic model to define associations between these predictors and sensitivity / specificity:
Assuming an adequate internal validation sample, Eq. (9) allows us to flexibly account for differential misclassification. It does so in a potentially robust manner when (X2*,…, XK*) consists of categorical variables. For subject i contributing predictor values xi, Eq. (9) implies that
where . Maximum likelihood estimates (MLEs) for the differential sensitivity and specificity parameters follow from the MLE of θ = (θ1,…,θK).
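Because the validation-sample likelihood contribution factors as Pr(Y* | Y, X)Pr(Y | X), the θ parameters of model (9) can be estimated (albeit not fully efficiently) by ordinary logistic regression of Y* on (Y, X*) within the validation subsample alone, which is convenient for predictor selection. The following Python sketch illustrates this with simulated data and hypothetical parameter values (the paper's own implementation uses SAS NLMIXED):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(11)

# Hypothetical validation records: gold-standard Y, binary covariate X2, error-prone Y*
nv = 300
y  = rng.binomial(1, 0.4, nv)
x2 = rng.binomial(1, 0.5, nv)
theta_true = np.array([-2.0, 3.0, 1.0])   # hypothetical (theta0, theta1, theta2)
ystar = rng.binomial(1, expit(theta_true[0] + theta_true[1]*y + theta_true[2]*x2))

# Ordinary logistic-regression ML for model (9): regress Y* on (Y, X2)
Z = np.column_stack([np.ones(nv), y, x2])
def nll(t):
    eta = Z @ t
    return np.sum(np.logaddexp(0.0, eta) - ystar * eta)   # negative log-likelihood
theta_hat = minimize(nll, np.zeros(3), method="BFGS").x

# Covariate-specific sensitivity and specificity implied by Eq. (10)
x2_levels = np.array([0, 1])
SE_hat = expit(theta_hat[0] + theta_hat[1] + theta_hat[2]*x2_levels)
SP_hat = 1 - expit(theta_hat[0] + theta_hat[2]*x2_levels)
```

The last two lines translate the fitted θ's into sensitivity and specificity at each level of X2, as in Eq. (10).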
The full likelihood for this general case is proportional to L = Lm× Lv, where
(identical to Eq. (6) except for covariate effects on sensitivity and specificity), and
This likelihood structure is reflected in the third SAS NLMIXED program in Appendix 1 (http://links.lww.com). The likelihood itself is equivalent to a general expression found in the paper by Carroll et al.11 We present it more explicitly here to enhance its clarity and connection with the provided program.
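The joint main/internal likelihood in Eq. (11)–(12) can likewise be maximized with a general-purpose optimizer. The Python sketch below is an illustrative stand-in for the third SAS NLMIXED program, using simulated data and hypothetical parameter values: main-study records marginalize over the unobserved Y, while validation records contribute the joint probability of (Y*, Y).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(7)

# Simulated cross-sectional data; all parameter values are hypothetical
n = 4000
x1 = rng.normal(size=n)
x2 = rng.binomial(1, 0.5, n)
beta_true  = np.array([-0.5, 1.0, 0.7])     # Eq. (1)
theta_true = np.array([-2.0, 3.5, 1.0])     # model (9), differential in x2
y     = rng.binomial(1, expit(beta_true[0] + beta_true[1]*x1 + beta_true[2]*x2))
ystar = rng.binomial(1, expit(theta_true[0] + theta_true[1]*y + theta_true[2]*x2))

# Random internal validation subsample: Y is observed only there
nv = 800
val = np.zeros(n, dtype=bool)
val[rng.choice(n, nv, replace=False)] = True

def safelog(p):
    return np.log(np.clip(p, 1e-12, 1 - 1e-12))   # numerical safeguard

def neg_loglik(par):
    b, t = par[:3], par[3:]
    p  = expit(b[0] + b[1]*x1 + b[2]*x2)    # Pr(Y=1 | x), Eq. (1)
    e1 = expit(t[0] + t[1] + t[2]*x2)       # SE_x   = Pr(Y*=1 | Y=1, x)
    e0 = expit(t[0] + t[2]*x2)              # 1-SP_x = Pr(Y*=1 | Y=0, x)
    # Main-study terms, Eq. (11): marginalize over the unobserved Y
    pstar = e1*p + e0*(1 - p)
    ll_m = ystar*safelog(pstar) + (1 - ystar)*safelog(1 - pstar)
    # Validation terms, Eq. (12): joint probability of (Y*, Y) given x
    ey = np.where(y == 1, e1, e0)
    ll_v = (ystar*safelog(ey) + (1 - ystar)*safelog(1 - ey)
            + safelog(np.where(y == 1, p, 1 - p)))
    return -(ll_m[~val].sum() + ll_v[val].sum())

fit = minimize(neg_loglik, np.zeros(6), method="BFGS")
# fit.x[:3] estimates (beta0, beta1, beta2); fit.x[3:] estimates the theta's
```

Refitting with the covariate coefficient of the misclassification model constrained to zero yields the likelihood ratio test of non-differentiality described in the following paragraph.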
As with any parametric model, Eq. (9) makes likelihood ratio tests based on Eq. (11)–(12) available to aid in model selection and to assess whether predictors are associated with sensitivity and specificity. In particular, this permits testing the hypothesis of completely non-differential misclassification, i.e., H0: θ2 = θ3 = … = θK = 0.
Prior treatments of outcome misclassification18 offered limited or no applicability under outcome-dependent sampling, despite well-known classical results32 establishing the utility of logistic regression for retrospective studies. It is thus of interest to explore whether and to what extent the recommended maximum likelihood approach accommodates the case-control design. By “case-control” here, we imply that sampling is done based on the error-prone response (Y*), with a higher sampling probability applied to “cases” (those with Y*=1) than to “controls” (those with Y*=0). We find that, with certain caveats, the internal validation study-based analysis proposed here can be used without modification despite the application of such “case” oversampling.
Specifically, the method described in the previous subsection yields valid estimates of (β1,…,βP) under model (1) when sampling favors those with Y*=1, assuming non-differential misclassification of case/control status. As with the classic case,32 the intercept loses its original interpretation. For similar reasons, the likelihood-based estimates of sensitivity and specificity will no longer reflect the true diagnostic properties of Y*. Rather, these tend to be inflated and deflated, respectively, in concert with the oversampling of cases according to Y*. In fact, the fallibility of the internal validation-based sensitivity/specificity estimators due to “case” oversampling is key to the validity of the (β1,…,βP) estimates, as these estimators reflect the “operating” sensitivity and specificity of Y* under the sampling strategy employed. In contrast, direct analysis based on external validation data (or even employing correct assumed values of sensitivity and specificity) misconstrues the “operating” sensitivity and specificity, generally yielding inconsistent estimates of (β1,…,βP). This may explain why methods18; 20–22 that are not based on internal validation data encounter problems for case-control studies.
The validity of the main/internal validation study-based maximum likelihood approach for such case-control sampling with non-differential outcome misclassification recalls theoretical results in the statistical literature,33 and can be demonstrated by noting that terms involving the selection probabilities applied to those with Y*=1 and Y*=0 factor out of the likelihood. In contrast, no such clean factorization occurs under differential misclassification. Nevertheless, if differential outcome misclassification is appropriately modeled via Eq. (11)–(12), empirical evidence via simulation under large samples suggests that the MLEs for some elements of (β1,…,βP ) in model (1) may remain valid under “case” oversampling. Specifically, our experimentation suggests that β coefficients in model (1) remain reliably estimable if they correspond to predictor variables that are not needed in the second regression model (9) that defines sensitivity and specificity. A simulation study illustrating these points follows after the example section.
Our example concerns data on bacterial vaginosis status for women in the HIV Epidemiology Research Study. A total of 1,310 (871 HIV-infected and 439 at-risk uninfected) women were enrolled into this prospective study across four U.S. cities from 1993 to 1995.34 Researchers diagnosed bacterial vaginosis semi-annually by two different techniques, referred to as the “CLIN” (clinically-based) and “LAB” (laboratory-based) methods. A CLIN diagnosis required the presence of three or more specific clinical conditions based on a modification of Amsel's criteria,35 while LAB diagnoses were made via a sophisticated Gram-staining technique.36 Prior references37–38 provide details on these methods in the study. As in Gallo et al.,38 we treat the more costly LAB method as a gold standard assessment, while the CLIN approach represents an accessible error-prone substitute. These authors found evidence of low sensitivity for the CLIN method, and suggested that its accuracy may suffer due to wide heterogeneity in bacterial vaginosis cases or due to the need for technicians to be trained in order to properly apply the subjective Amsel criteria.38
A unique feature of this example is that both LAB and CLIN diagnoses were made regularly. Thus, in addition to fitting a “naïve” main study-only version of model (1) with CLIN status (Y*) substituted for LAB (Y), we were able to fit Eq. (1) to data using the assumed gold standard (Y) on all subjects. While the illustration of validation data-based adjusted analyses then requires ignoring LAB data on a random subset, an advantage is that we have an “ideal” complete-data model for comparison.
We use data from the 4th semi-annual study visit on 982 black, white, and Hispanic women who were 25 years or older at enrollment. Available variables potentially associated with bacterial vaginosis status include age, race, HIV status (0 if negative, 1 if positive), and HIV risk group (0 if via sexual contact; 1 if intravenous drug use). Study site and CD4 counts among HIV positives showed little association with bacterial vaginosis status in this sample.
Median age at enrollment was 37 years. Other potential bacterial vaginosis risk factors are distributed as follows: race/ethnicity (60% black, 24% white, 16% Hispanic); HIV status (69% positive, 31% negative); HIV risk group (47% sexual, 53% intravenous drug use). Among women with data on bacterial vaginosis, 41% were positive via the LAB method, versus 25% based on CLIN. Unadjusted estimates were 0.53 (sensitivity) and 0.94 (specificity), suggesting that CLIN yields a low risk of false positives but high risk of false negatives.
For an “ideal” comparative analysis, we first fit Eq. (1) to all women, with the gold standard diagnosis (LAB; 1 vs. 0) as the outcome. Preliminary analyses revealed similar bacterial vaginosis prevalence among white and Hispanic women, so we created a binary variable (0 if non-black, 1 if black). Initially dichotomizing age at the median, we assessed second- and higher-order interactions among age, race, HIV status, and risk group. A likelihood ratio test supported elimination of all 11 interaction terms.
A total of 924 women, with complete data on both bacterial vaginosis assessments and all risk factors, contributed to the fitted models summarized in Table 1. The upper half of the table summarizes the fit of the resulting version of model (1) for LAB status, in which we treat age (in years) continuously:
We then fit the same model upon substituting the error-prone CLIN diagnosis as the outcome (lower half of Table 1). The two analyses differ markedly in terms of magnitude of the estimated OR for HIV risk group (1.50 for LAB, 2.68 for CLIN), and directionality of the estimated OR for HIV status (1.19 for LAB, 0.71 for CLIN).
To illustrate misclassification adjustment, we selected a random internal validation subset of size nv=300 women. Predictor selection via model (9) fit to these 300 women revealed no independent association between race and CLIN status. Pairwise and higher-order interactions among LAB status, risk group, HIV status, and age (dichotomized for purposes of estimating sensitivity and specificity) were non-significant as a group. The version of Eq. (9) utilized in the main/internal validation study likelihood is
where AGEGTMED indicates whether a subject's age at enrollment exceeded the median.
The upper half of Table 2 summarizes a complete analysis of the data via the joint likelihood in Eq. (11)–(12). For comparison, the lower half of Table 2 gives corresponding results assuming non-differential misclassification [restricting θ2=θ3=θ4=0 in Eq. (14)]. The likelihood ratio test comparing the joint models with and without the non-differentiality assumption was highly significant (χ2=20.1, P<0.001), strongly confirming a need to account for dependence of the sensitivity and specificity of the CLIN diagnosis upon subject-specific covariates. Note that the analysis in the upper half of Table 2 yields the same interpretations as the “ideal” analysis (upper half, Table 1), in terms of directionalities and magnitudes of the estimated ORs. In contrast, results in the lower half of Table 2 are similar to those of the “naïve” analysis (lower half, Table 1), showing an elevated estimate for risk group and negative directionality for HIV status. This highlights the value of internal validation data for modeling sensitivity and specificity.
Table 3 provides the MLE of (θ0, θ1, θ2, θ3, θ4) in Eq. (14) based on the joint likelihood Eq. (11)–(12). Note that all three predictors (risk group, HIV status, and age) are independently associated with sensitivity and specificity. Table 3 also provides corresponding MLEs of (SE, SP) via Eq. (9)–(10), with multivariate delta method-based standard errors (details available from the authors). Holding other variables constant, sensitivity tends to be higher (and specificity lower) for those who are in the intravenous drug use risk group, younger, or HIV-negative. The variations in these estimates give further credence to the differential nature of outcome misclassification in this real-data example.
Our primary simulation experiment evaluates the general main/internal validation study analysis outlined in (9)–(12), under conditions mimicking the example. Four predictors (X1–X4) were randomly generated with distributions like those observed at study Visit 4 for “Race” (black vs. non-black), “Risk Group,” “HIV status,” and “age,” respectively. True outcomes (Y) were simulated according to Eq. (13), with β coefficients equal to the estimates reported in the top portion of Table 2. Error-prone outcomes (Y*) were generated via Eq. (14), with θ's equal to the estimates at the top of Table 3. For 1,000 such datasets, we conducted the “naïve” analysis in addition to two main/internal validation analyses based on Eq. (11)–(12). The first of these assumed the appropriate differential misclassification model, and the second incorrectly assumed non-differentiality.
Table 4 summarizes the results. The “naïve” analysis produces highly biased estimates, with means comparable to the estimates from the example with CLIN as the outcome (Table 1, bottom). Main/internal validation study-based analysis assuming the correct differential misclassification model produces reliable estimates of all four β coefficients, and excellent confidence interval (CI) coverage. In contrast, the main/internal analysis based on erroneously assuming non-differentiality produces average parameter estimates remarkably similar to the estimates reported in the lower half of Table 2. These are invalid except for the estimate of β1, corresponding to the predictor (X1) that was unassociated with sensitivity and specificity in model (14).
Table 5 summarizes simulations assessing the internal validation study-based methods under “case-control” sampling as previously described. The version of model (1) for generating data was as follows:
where X1 is standard normally distributed and X2 is a Bernoulli(0.5) binary predictor. The true regression coefficients were (β0,β1,β2) = (−0.4, 2.0, 0.5). For both scenarios in Table 5, approximately 5,200 observations were first generated via the above model in a cross-sectional manner. Error-prone (Y*) values were then generated, potentially allowing sensitivity and specificity to vary with X2 [i.e., assuming SEt = Pr(Y*=1 | Y=1, X2=t) and SPt = Pr(Y*=0 | Y=0, X2=t) (t=0,1)]. To mimic case-control sampling, we utilized 100% of data records with Y*=1 in each case but retained only a 5% random sample of those with Y*=0. Under these conditions, each simulated “case-control” sample contained approximately 1,500 observations, of which 500 were randomly selected into an internal validation sample. The main/internal validation study likelihood to analyze each data set is specified in Eq. (7)–(8).
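The effect of such outcome-dependent sampling on the “operating” sensitivity and specificity can be checked directly. In the Python sketch below (using the same generating model with a hypothetical large cohort), oversampling records with Y*=1 inflates the sensitivity, and deflates the specificity, that an internal validation sample would estimate:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(3)

# Large hypothetical cohort generated cross-sectionally from the model above
N = 200_000
x1 = rng.normal(size=N)
x2 = rng.binomial(1, 0.5, N)
y = rng.binomial(1, expit(-0.4 + 2.0*x1 + 0.5*x2))

# Non-differential misclassification with SE = SP = 0.8
ystar = np.where(y == 1, rng.binomial(1, 0.8, N), rng.binomial(1, 0.2, N))

# "Case-control" sampling on the error-prone outcome:
# keep every record with Y*=1, but only a 5% random sample of those with Y*=0
keep = (ystar == 1) | (rng.random(N) < 0.05)
ys, yk = ystar[keep], y[keep]

# Operating sensitivity and specificity in the retained sample
op_se = ys[yk == 1].mean()        # inflated well above the true 0.8
op_sp = 1 - ys[yk == 0].mean()    # deflated well below the true 0.8
```

Under SE = SP = 0.8 with a 5% retention rate for Y*=0, the operating sensitivity works out to 0.8/(0.8 + 0.2×0.05) ≈ 0.99 and the operating specificity to 1 − 0.2/(0.2 + 0.8×0.05) ≈ 0.17; these are the quantities an internal validation analysis consistently estimates, which is precisely why the β estimates remain valid.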
The top half of Table 5 summarizes results for a non-differential case, in which SE1=SP1=SE0=SP0=0.8. As noted above under “Comments Regarding Case-Control Data,” maximum likelihood estimates of the SE and SP parameters differed from the true value of 0.8 on average, reflecting the “operating” sensitivity and specificity under “case” oversampling. However, estimates of β1 and β2 are quite reliable, with means near the true values of 2 and 0.5 and near-nominal CI coverage. The bottom half of Table 5 summarizes a differential case, where SE1=0.8, SP1=0.7, SE0=0.6, and SP0=0.9. Note that the coefficient (β1), corresponding to the predictor (X1) that was not associated with sensitivity and specificity, remains validly estimated. As also mentioned above, however, validity for estimating β2 is lost owing to X2's direct association with sensitivity and specificity. In both cases shown in Table 5, “naïve” analysis based on Y* for case-control status yielded severe bias.
We have considered the problem of outcome misclassification in logistic regression, with emphasis on clearly specifying likelihood functions corresponding to main/external and main/internal validation study designs. This emphasis distinguishes our work from related prior references in the epidemiologic literature,18, 21–22 which do not pursue the incorporation of validation data. Although validation data-based maximum likelihood methods are outlined in the comprehensive text of Carroll et al.,11 the treatment there is purposefully general and therefore made without a real-data example or facilitating computations. With the practicing epidemiologist in mind, we have sought to motivate such methods for handling outcome misclassification with a real-world study, and to make them fully accessible via user-friendly programs that directly reflect the likelihood specifications and utilize common software for optimization.31
Our treatment includes detailed evaluation of maximum likelihood methodology via simulation studies. These simulations, along with the HIV study example, clearly demonstrate the importance of internal validation subsampling when misclassification is differential. The results in Tables 2 and 4 illustrate that outcome misclassification adjustment via an erroneous assumption of non-differentiality may offer only marginal improvement over “naïve” analysis based on the error-prone outcome (Y*).
We have demonstrated how the methods considered here can be used in the case-control setting, for which little discussion about outcome misclassification in logistic regression appears in the literature and for which prior proposals18 were not applicable. Assuming appropriate model specifications, we find that the maximum likelihood approach for the main/internal validation design illustrated here remains directly applicable in case-control studies with random “case” (i.e., based on Y*) oversampling under non-differential outcome misclassification. While further investigation of the impact of outcome-dependent sampling is warranted when misclassification is differential with respect to covariates, empirical studies suggest that the maximum likelihood approach maintains validity for estimating primary regression parameters associated with predictor variables that are not associated with sensitivity and specificity values.
Future work could involve extensions of past research on cost-efficiency39 to the logistic regression setting considered here, because the ultimate appeal of main/internal validation study designs is their potential for conserving resources. Somewhat along these lines, we experimented with further simulations under the same conditions as were assumed in producing Table 4, but varying the size of the internal validation subsample. We found that decreasing the validation sampling fraction to select as few as 5% of the 1,000 subjects very seldom produced numerical problems with the maximum likelihood routine, despite expected increases in variability of the adjusted log odds ratio estimates. From a practical standpoint, the simulation program used to produce the results in Table 4 is a sharable resource that could aid an investigator in determining the validation fraction indicated for a particular study, and provide insight into the cost-efficiency of a main/internal validation design.
There may also be interest in regression-based methods to adjust for outcome misclassification in situations where no gold standard exists, but one has access to replicates of an error-prone outcome measure or to a diagnostic measure viewable as an “alloyed” gold standard.40–41 Additionally, there would be value in making methods29 that non-parametrically estimate the distribution of Y* | (Y, X) more readily accessible in practice. Nevertheless, the logistic regression approach advocated in Eq. (9)–(10) facilitates likelihood ratio testing and is potentially robust when all predictors in Eq. (9) are categorical, given the freedom to saturate that model.
The following is a list of the HIV EPIDEMIOLOGY RESEARCH STUDY GROUP: Robert S. Klein, Ellie Schoenbaum, Julia Arnsten, Robert D. Burk, Chee Jen Chang, Penelope Demas, and Andrea Howard, from Montefiore Medical Center and the Albert Einstein College of Medicine; Paula Schuman, and Jack Sobel, from the Wayne State University School of Medicine; Anne Rompalo, David Vlahov, and David Celentano, from the Johns Hopkins University School of Medicine; Charles Carpenter, and Kenneth Mayer, from the Brown University School of Medicine; Ann Duerr, Lytt I. Gardner, Charles M. Heilig, Scott Holmberg, Denise Jamieson, Jan Moore, Ruby Phelps, Dawn Smith, and Dora Warren, from the Centers for Disease Control and Prevention; and Katherine Davenny, from the National Institute of Drug Abuse.
Funding: This work was supported by National Institute of Nursing Research Grant 1RC4NR012527-01, by National Institute of Environmental Health Sciences Grant 2R01-ES012458-5, and by PHS Grant UL1 RR025008 from the Clinical and Translational Science Award Program, National Institutes of Health, National Center for Research Resources. The HER Study was supported by the Centers for Disease Control and Prevention: U64/CCU106795, U64/CCU206798, U64/CCU306802, and U64/CCU506831.
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.