|Home | About | Journals | Submit | Contact Us | Français|
To evaluate the probabilities of a disease state, ideally all subjects in a study should be diagnosed by a definitive diagnostic or gold standard test. However, since definitive diagnostic tests are often invasive and expensive, it is generally unethical to apply them to subjects whose screening tests are negative. In this article, we consider latent class models for screening studies with two imperfect binary diagnostic tests and a definitive categorical disease status measured only for those with at least one positive screening test. Specifically, we discuss a conditional independent and three homogeneous conditional dependent latent class models and assess the impact of misspecification of the dependence structure on the estimation of disease category probabilities using frequentist and Bayesian approaches. Interestingly, the three homogeneous dependent models can provide identical goodness-of-fit but substantively different estimates for a given study. However, the parametric form of the assumed dependence structure itself is not “testable” from the data, and thus the dependence structure modeling considered here can only be viewed as a sensitivity analysis concerning a more complicated non-identifiable model potentially involving heterogeneous dependence structure. Furthermore, we discuss Bayesian model averaging together with its limitations as an alternative way to partially address this particularly challenging problem. The methods are applied to two cancer screening studies, and simulations are conducted to evaluate the performance of these methods. In summary, further research is needed to reduce the impact of model misspecification on the estimation of disease prevalence in such settings.
Screening for a specific disease or condition is a fundamental component of human disease control and prevention. The objective of screening is to classify asymptomatic people as likely or unlikely to have the disease or condition of interest. People who appear likely to have the disease or condition are examined further for a diagnosis, and those people who are diagnosed with the disease are treated. Therefore, screening can reduce the morbidity and mortality of the disease among people screened and can enable early treatment for diagnosed cases. Screening programs for cancer and heart diseases are well established in many countries. In many screening programs, a population with known size n is screened by two imperfect binary diagnostic tests. If the results of both diagnostic tests are negative, no further screening is undertaken. If either of the two diagnostic tests is positive, then a full evaluation of the disease using a gold standard classification is undertaken .
For estimating diagnostic accuracy without a gold standard, it is well known that if the conditional independence assumption is incorrectly assumed, parameter estimates may be biased [2–4]. When the disease status D is a binary random variable, Albert and Dodd  showed that the estimation of diagnostic accuracy and prevalence is sensitive to the choice of dependence structure for studies with multiple diagnostic tests. The dependence structure was specified using a Gaussian random effects model [6,7] and a finite mixture model . They showed that it is difficult to distinguish between different dependence structures in the absence of a gold standard test in most practical situations (i.e., unless there are more than 10 tests). Albert  proposed methods for estimating diagnostic accuracy of multiple binary tests with an imperfect reference standard when information about the diagnostic accuracy of the imperfect test is available from external data sources. Furthermore, using the same dependence structure, Albert and Dodd  examined the effect of model misspecification on the estimation of test accuracy and prevalence when a binary gold standard is partially verified. They showed that for extreme biased sampling the estimation is sensitive to the choice of dependence structure. Other latent class models with a focus on diagnostic accuracy have also been considered in a single study [11,12] as well as in a meta-analysis . In addition, Black and Craig  discussed the estimation of disease prevalence in a scenario involving two imperfect tests in the absence of a gold standard and proposed Bayesian model averaging for inference over the conditional independence and dependence models. However, those dependence models are not directly applicable in the setting that we are considering because there are only two diagnostic tests, and more importantly, if both diagnostic tests are negative, no further gold standard classification will be applied.
Let T1, T2, and D be the random variables denoting the two screening tests and the disease status, respectively. In this article, we consider T1 and T2 to be binary variables with value 1 indicating test positive and 0 indicating test negative, and D to be a categorical variable with value d = 1, 2,…, K indicating the classes of disease. Let be the observed frequency with D= d, T1 = i and T2 = j (i = 0, 1 and j = 0, 1), be the observed frequency with T1 = i and T2 = j, and be the total number of observations. Furthermore, let πij = P(T1 = i, T2 = j), , Pd = P(D = d), , and denote the corresponding joint, marginal and conditional probabilities. In most studies, we only observe frequencies of and due to the nature of screening. The frequencies of are usually not observed, although the margin is observed. The data structure contains 3d+1 observed frequencies, which in general allows for the estimation of a maximum of 3d free parameters. In this paper we will not consider special cases when only less than 3d free parameters are identifiable, e.g., when there are zeros among the observed frequencies.
One way to write the likelihood function (ignoring constant terms) in this setting is in terms of Pd and (d = 1, 2,…, K; i = 0, 1; j = 0, 1) with constraints of and as follows,
This parameterization involves a mixture likelihood in the first term and prevents a closed-form solution for the maximum likelihood estimators (MLE). It contains 4d−1 free parameters. Without further assumptions, the parameters in equation (1) are not identifiable. However, this parameterization allows for direct specification of commonly used assumptions, usually specified through some constrains on . For example, the frequently used conditional independence assumption [15,16] assumes that the two tests T1 and T2 are independent conditioning on the disease status D, i.e., T1T2|D, and the number of free parameters in equation (1) is reduced to 3d−1 giving model identification since , and . For convenience, we denote the conditional independence model as the model with as the corresponding MLEs. Under the homogeneous dependence assumptions (i.e., the α, θ, and ρ models that will be discussed in Sections 2 and 3), the number of free parameters in equation (1) is reduced to 3d and the models become saturated and equivalent to the alternative parameterization below .
An alternative parameterization of the log-likelihood function can be written in terms of πij (i = 0, 1; j = 0, 1) and with constraint of as follows,
This representation relates to previous work in other settings [18–20]. This model is a saturated model with 3d parameters. The maximum likelihood equations are tractable and yield MLEs in closed-form. Omitting the algebra, we obtain, ij = xij/n (i, j =0, 1), and if i + j>0. The existence of closed-form solutions for this alternative parameterization allows for closed- form solutions for the α and θ homogeneous conditional dependent models , which will be briefly discussed in detail in Sections 2.1 and 2.2. Furthermore, this saturated alternative parameterization also suggests that the probability of having disease class d for those with both tests negative (i.e., ), and further the overall probability of having disease class d (i.e. ) are not identifiable without some additional “non-testable” assumptions. Thus, the dependence structure modeling considered in this paper itself is not “testable”, and can only be viewed as a sensitivity analysis for the estimation of disease prevalence. Similar to generalized linear models with non-ignorable missing data mechanism , the type of sensitivity analyses play an important role in the estimation and inference in this problem.
In similar settings when the gold standard was only measured on those who screened positive, Cheng et al.  and Pepe and Alonzo  have examined the potential overwhelming impact of the correlation between the two screening tests on the estimation of absolute test accuracy parameters. Both suggest using relative test accuracy for comparing disease screening tests. However, to our knowledge, no one has assessed the impact of the misspecification of conditional dependence structures, which can be specified by a homogeneous dependence parameter for two diagnostic tests, on the estimation of disease class probabilities in such screen-positive ascertained studies.
In this article, we empirically assess the impact of misspecification of the conditional dependence structure on the estimation of disease class probabilities through two case studies and simulations. Specifically, in Sections 2.1 and 2.2, we define the MLEs for the homogeneous dependent α and θ models, and in Section 2.3 we propose the homogeneous correlation coefficient conditional dependent ρ model. Bayesian approaches, which incorporate prior beliefs about dependence, are developed for the three models in Section 3 as an alternative to the maximum likelihood methods. Furthermore, we discuss Bayesian model averaging in Section 3.4 as an alternative way to address the challenging estimation problem since the three homogeneous dependent models can provide the same goodness-of-fit for the data but substantively different estimates  and the dependence structure itself is not “testable”. In Section 4, we compare the results for the two case studies using both the maximum likelihood methods and the Bayesian approaches. The two case studies were reanalyzed recently by Böhning and Patilea  using a capture-recapture approach under the α and θ model assumptions. Our focus here is to compare the estimates under the α, θ and ρ model assumptions using both the maximum likelihood methods and Bayesian approaches. A simulation study is conducted in Section 5 and a brief discussion is presented in Section 6.
In Section 2.1 and 2.2, we will briefly introduce the homogeneous conditional dependence α and θ models, recently proposed by Böhning and Patilea  using a capture-recapture approach. Using the alternative parameterization of the maximum likelihood as presented in equation (2), Chu and Nie  presented closed-form maximum likelihood solutions under the α and θ model assumptions.
Under this model, the association of the two tests T1 and T2 conditional on the disease status D as measured by the odds ratio is assumed to be homogeneous over all disease categories, i.e., (d = 1, 2, …, K) is assumed to be homogenous. By Bayes’ theorem, we obtain . With and simple algebra, we obtain the solution of under this homogeneity assumption. Thus, by plugging in the closed-form solutions of MLEs of π (i, j =0, 1) and from equation (2), the closed-form MLEs of α and are
where the superscript α indicates the homogeneous odds ratio assumption.
Under this model, ratio of conditional (conditional on test T2 being positive) and unconditional probabilities of test T1 being positive is assumed to be homogeneous over all disease categories, i.e., (d = 1, 2, …, K) is assumed to be homogenous. By Bayes’ theorem, we obtain . With and simple algebra, we obtain the solution of . Thus, the closed- form MLEs of θ and are
where the superscript θ indicates the homogeneous relative risk assumption.
In this section, we propose an alternative homogeneous conditional dependence model, the ρ model. Under this model, the correlation of the two tests T1 and T2 is assumed to be homogeneous over all disease categories, i.e., ρd(d = 1, 2, …, K) is assumed to be homogenous ρ. Let be the covariance between two tests in the dth disease group, then we have , and . The bounded range of correlations is determined by the marginal probability of testing positive and . Specifically, the correlation coefficients ρ satisfies
Let the MLEs of ρ and Pd be denoted as and , where the superscript ρ indicates the homogeneous correlation coefficient assumption. They do not have a closed-form solution under this model.
All three models assume a homogeneous dependence structure. This is a rather strong assumption. However, because all three homogeneous dependent models are already saturated, heterogeneous dependent models are not identifiable without additional constraints on test accuracy parameters (e.g., assuming the test accuracy parameters are the same for the two diagnostic tests, which is a much stronger assumption in general). Furthermore, the three homogeneous dependent models can provide the same goodness-of-fit for the data but substantively different estimates , a natural way addressing this problem might be through the frequentist model average estimators . Let wα, wθ and wρ with constraint of wα + wθ + wρ= 1 be the corresponding weights for the α, the θ, and the ρ models, the weighted model average estimator can be defined as . However, the MLEs and are usually correlated since they are based on the same data. Due to the technical difficulty of computing the variance-covariance matrix between ( ) and for the computation of the standard error of , we do not consider the frequentist model average estimator in this article. We will consider the Bayesian model averaging counterpart in Section 3.4.
In practice, it is often of interest to test the difference between the estimated probabilities of disease states using different dependence assumptions (i.e., the α, θ or ρ model). Since the closed-form maximum likelihood solutions for the α and θ models are based on the likelihood function as presented in equation (2), a Wald-type test comparing is directly available with the standard error obtained by the delta method. Due to the technical difficulty of computing the variance-covariance matrix between ( ) and , comparing the MLEs of ( ) with is not straightforward. In practice, bootstrapping methods can be used as an alternative way to compute the corresponding p-values and 95% confidence intervals .
We developed a SAS macro (SAS Institute, Cary, NC) to implement the models discussed above parameterized both in terms of Pd and as in equation (1) for the homogeneous ρ model, and in terms of πij and as in equation (2) for the homogenous α and θ models. To describe disease class prevalence and to implement the constraints of 0 < Pd < 1 and , we used the linear generalized logit model  which is widely applied in categorical data analysis. This model has an inverse link function defined as (d = 1, 2, …, K) with βK = 0. We used the delta method to compute the standard error of functions of MLEs and their confidence intervals based on normal approximation. The two parameterizations (i.e., in terms of Pd, and in terms of πij, ) provide exactly the same results.
In this Section, we discuss the Bayesian approaches [27,28]. Because the Bayesian approach and the frequentist approach use different frameworks, they can be considered complementary. When relatively large studies are combined with weak prior distributions, inferences obtained by Bayesian and frequentist methods generally agree. However, the Bayesian framework is particularly attractive when suitable prior distributions can be constructed to incorporate known constraints and subject-matter knowledge on model parameters . The Bayesian framework allows direct construction of 100(1−α)% equal tail and highest probability density (HPD) credible intervals of general functions of the estimated parameters without having to rely on asymptotic approximations. Furthermore, the Bayesian framework provides direct implementation of model averaging , which provides a natural way to address the problem of selecting a model from several competing models that give equal goodness-of-fit but potentially different inferences for a particular study.
To implement the constrains of and under the α model, we re-parameterize and as follows,
Let f(α, ad, bd, Pd) be the prior joint distribution of (α, ad, bd, Pd) (d = 1, 2, …, K), the joint posterior distribution given the observed frequencies is proportional to
To implement the constrains of and under the θ model, we re-parameterize and as follows,
Let be the prior joint distribution of (θ, , Pd) (d = 1, 2, …, K), the joint posterior distribution given the observed frequencies is proportional to
The feasible range of θ is determined by the marginal probability of testing positive and and is implemented through the addition of the four indicator functions I(·) in equation (8).
To implement the constrains of and ρd=ρ (d = 1, 2, …, K), we re-parameterize and as in equation (6). Let f(ρ, ,Pd) be the prior joint distribution of (θ, ,Pd) (d = 1, 2, …, K) and be the covariance, the joint posterior distribution given the observed frequencies is proportion to
The feasible range of correlation determined by the marginal probability of test positive and as in equation (5) is implemented through the addition of the four indicator functions I(·) in equation (9).
The homogeneous dependence models are saturated. Therefore, they provide the same goodness-of-fit for the data, but can provide substantively different estimates. Bayesian model averaging (BMA) provides a natural way to address this problem . The posterior distribution of the quantity of interest Pd given data is
where M1 …, MK are the models considered, and the posterior probability for model Mk is given by , where pr(Data|Mk)= ∫ pr(Data|k, Mk)pr(k|Mk)dk is the integrated likelihood of model Mk, and k is the vector of parameters of model Mk, pr(k|Mk) is the prior density of k under model Mk, pr(Data|k, Mk) is the likelihood, and pr(Mk) is the prior probability that Mk is the true model (assuming one of the models considered is true). In this paper, we assume equal prior probabilities for the α, θ and ρ models, i.e., pr(Mk)=1/3 for k=1,2,3.
In the Bayesian models discussed above, computation was done using Markov chain Monte Carlo (MCMC)  in WinBUGS  and BRUGs in R (http://www.r-project.org). Burn-in consisted of 50,000 iterations; 50,000 subsequent iterations were used for posterior summaries. Convergence of Markov chains was assessed using the Gelman and Rubin convergence statistic [33,34]. To describe disease class prevalence and to implement the constrain of 0 < P < 1 and , we use the linear generalized logit model which has inverse link function defined as (d = 1, 2, …, K) with βK = 0 . We selected proper but diffuse prior distributions for the hyperparameters . Specifically, the hyper-priors for the parameters were assumed to be as follows: 1) Vague priors of N(0, 103) were assumed for βd s (d = 1, 2, …, K−1) in the generalized logit transformed probabilities of disease classes Pd s; 2) Uniform prior of [−1.0, 1.0] was assumed for correlation coefficient ρ; 3) Vague priors of N(0, 103) were assumed for α and θ on the log scale to ensure α > 0 and θ > 0; and 4) Vague priors of N(0, 103) were assumed for ad s and bd s in the α model to directly implement the homogeneous odds ratios assumption, and for s and s in the logit scale for the θ and ρ models.
For the purpose of comparing the performance of different models, we reanalyzed the data from two screening studies, in which the disease status has been evaluated only for those who tested positive for at least one of the two tests. The first study consists of data from the Health Insurance Plan Study for screening breast cancer in New York . The study was carried out by the Health Insurance Plan, a prepaid comprehensive medical care plan with 750,000 subscribers enrolled in 31 medical groups. Periodic screening for breast cancer using mammography as well as clinical physical examination was performed for women aged 40 to 64 years who were chosen at random. In this study, 307 out of 20,211 women, who were test positive by either physical examination or mammography, underwent biopsy for the classification of two disease states: no cancer (d = 1) or cancer (d = 2). The second study is the multicenter study comparing cervicography with the standard pap smear cytology test for detecting cervical cancer between November 1991 and December 1992 . In this study, 228 out of 5,192 women, who were test positive by either cervicography or the standard pap smear cytology test, underwent biopsy for the classification of three disease states: not present (d = 1), low grade (condyloma) (d = 2) and high grade (invasive cancer) (d = 3). Table 1 presents the observed frequencies in the two screening studies.
Table 2 presents the estimates of the conditional dependence parameters (i.e., α, θ and ρ) when using both the maximum likelihood method and the Bayesian method. We use the triple of percentiles, 2.55097.5, to display a parameter estimate (or posterior median) with its 95% confidence (or credible) interval, as suggested by Louis and Zeger . In summary, both approaches suggest statistically significant dependence when using all three models for the two studies. Tables 3 and and44 present the estimated probabilities of the disease classes under the three homogenous dependence models as well as under the independence model, when using the maximum likelihood method and the Bayesian method, respectively. The twice negative likelihood is presented in Table 3 for comparing the goodness-of-fit of the independent model, and the homogeneous dependent α, θ, and ρ models, which demonstrate that the α, θ, and ρ models give exactly the same goodness-of-fit for both studies. In addition, the BMA estimates across the three conditional dependence models are presented in Table 4. In summary, the estimates were consistent between the maximum likelihood and Bayesian approaches except for the probabilities of not present and low grade cervical cancer in the multicenter study detecting cervical cancer using the ρ model, potentially due to the constrains implemented in the Markov chain Monte Carlo samplings. Specifically, the estimated probability of low grade cervical cancer is estimated to be 54255883 per 1000 women using the Bayesian approach, but only 0115308 per 1000 women using the maximum likelihood method.
As an interesting observation, we found that the difference between the estimated probabilities of disease states using different dependence assumptions (i.e., the α, θ or ρ model) can be statistically significant and practically meaningful. For example, in the Health Insurance Plan Study for breast cancer screening in New York, the estimated probability of having breast cancer using the maximum likelihood method is 34893 per one thousand women assuming the α model, while the estimate is 2875122 per one thousand women assuming the θ model, and 2711 per one thousand women assuming the ρ model. The difference between the estimated probabilities of having breast cancer assuming α and θ models is 142740 per one thousand women with a p-value less than 0.001 by a Wald-type test. The non-overlapping 95% confidence intervals between the estimated probabilities assuming the ρ model and the α (or θ) model suggests a statistically significant difference at least at the 5% significant level. In addition, using the maximum likelihood approach, the estimated probability of having invasive high grade cervical cancer is 286194 per one thousand in the multicenter study for detecting cervical cancer assuming the θ model, which is about eight times higher than the estimate of 5811 per one thousand women assuming the ρ model, and the 95% confidence intervals do not overlap. The Bayesian approaches gave similar inferences to the frequentist approaches. This substantial difference in estimated probability high grade cervical cancer can have an impact on cancer surveillance and prevention. Unfortunately, the data does not contain any information to differentiate those dependent models since they all give the same goodness-of-fit. Thus, without some sensible assumptions, the disease prevalence may not be estimable from the data set, even with Bayesian model averaging, particularly if proposed models in BMA do not contain the correct model (which is arguably true in practice given that an infinite large number of models exist and potentially many can give same goodness of fit).
To further study how the disease status probability estimates vary with the dependent model assumption and to evaluate the impact of misspecification of different dependent models on the estimation of the probabilities of disease classes, we performed four sets of simulations assuming the independent model, the α, θ, and ρ dependent models, respectively. For ease of presentation and interpretation, we considered two disease strata. The simulation parameters are: the probabilities of disease classes Pd = (0.8,0.2), the marginal conditional probabilities of test T1 being positive , and the marginal conditional probabilities of test T2 being positive . In the α and θ models, we used two values of α (or θ) = 1.25 and 3.0. In the ρ model, we used two values of ρ = 0.2 and 0.6. The sample sizes considered were n = 5000 and 25000. For each combination of α (or θ, or ρ) and n values, we generated 2,000 replicates. For each replicate, we computed the estimators under the independent model, the dependent α, θ, and ρ models, using both the maximum likelihood and Bayesian approaches. In addition, the BMA estimators across the three dependent models were computed. We used the true values of Pd, and , α = 1, θ =1, and ρ = 0 as the starting values in the maximum likelihood optimization procedures and the Bayesian Markov chain Monte Carlo sampling procedures.
Table 5 presents the means of the estimated disease prevalence across 2,000 replicates, using both the maximum likelihood and Bayesian approaches. For the Bayesian models, posterior medians were used as estimates for disease prevalence for a single replicate. If the true underlying model is the conditional independence model, fitting the α, θ and ρ dependence models will provide unbiased estimates for the disease prevalence. However, if the underlying model is one of the three dependence models, assuming independence provides biased estimates for disease prevalence. In addition, if the underlying model is a dependence model, assuming an incorrect dependence structure leads to biased estimates for disease prevalence. One interesting observation is that Bayesian model averaging (BMA) estimates tend to be less biased than the estimates under a misspecified dependence model. Furthermore when the underlying model is the α dependent model, the BMA estimates lead to nearly unbiased estimates. For all the scenarios, the maximum likelihood and Bayesian approaches provide similar estimates.
Table 6 presents the average length of the 95% confidence/credible intervals or the precision of the disease prevalence estimates across 2,000 replicates when using the maximum likelihood and Bayesian approaches. We found that if the true underlying model is the conditional independence model, assuming the α and θ dependence models leads to intervals that are too wide. For example, the 95% confidence/credible interval length using the θ dependence model is about twice than that using the true independence model. This suggests a substantive efficiency loss when conservatively assuming the α and θ dependence models. However, if the ρ dependence model is assumed, the average interval lengths are only slightly inflated. On the other hand, if the underlying model structure is one of the three dependence models, assuming independence leads to intervals that are too narrow (and biased). In addition, if the underlying model is the θ model, incorrectly assuming the α and ρ dependence models also leads to underestimation of the interval length. Furthermore, the results are highly concordant between the maximum likelihood and Bayesian approaches. Note that the average interval lengths of the BMA estimates are generally larger than those under any dependence model alone, regardless of whether the dependence model is correctly or incorrectly specified. This is due to the fact that the BMA estimates incorporate the additional uncertainty from model specification.
Table 7 presents the coverage performance of the 95% confidence/credible intervals of the disease prevalence across 2,000 replicates using both the maximum likelihood and Bayesian approaches. The coverage upon misspecification using dependent models is still around 95% if the true underlying model is the conditional independence model, possibly due to the negligible bias and wider confidence/credible intervals upon such misspecification, as suggested in Tables 5 and and6.6. However, if the underlying model structure is one of the three dependent models, the coverage upon misspecification decreases as the degree of dependence increases and as the sample size increases. In addition, the results suggest that if the underlying model is the ρ model, decent coverage tends to be difficult when the model is misspecified. In general, the Bayesian 95% credible intervals show slightly better coverage compared with the maximum likelihood 95% confidence intervals. More importantly, the coverage of the BMA intervals generally exceeds 90%, which has much better performance than the intervals from any single misspecified model. One reason for the better coverage using BMA is that such intervals are generally wider than those under a single model alone, and the true underlying model is included in the model averaging.
For screening studies where a categorical disease status is verified only if at least one out of the two binary screening tests being positive, we investigated three homogeneous dependence models (i.e., the α, θ, and ρ models) with two case studies and four sets of simulation studies, in which the ρ model is proposed in this paper. If the true underlying model is the conditional independence model, assuming the α and θ dependence models leads to intervals that are too wide (i.e., the 95% confidence/credible interval length of the θ model can be as twice as that of the model), while the ρ dependence model only slightly inflated the average interval lengths. By two real data analyses and simulation studies, we demonstrated that the three homogeneous dependence models can provide substantively different estimates for a study although with the same goodness-of-fit. We discussed both the frequentist and Bayesian approaches, and evaluated the impact of model misspecification on the estimation of disease class probabilities. Furthermore, we discussed Bayesian model averaging as an alternative way to partially address this particularly challenging estimation problem. Although we focused on the inference of disease class probabilities in this article, the same conclusion applies to the inference of the cell probabilities for two negative tests, i.e., , and the unknown cell frequency . We did not discuss the impact of misspecification of dependence structures on the estimation of test accuracies because it has been well studied from frequentist perspective [5,10,39]. It might be of interest to compare the performance of frequentist and Bayesian approaches on the estimation of test accuracy parameters under different settings such as low, moderate and high sensitivities and specificities.
The results imply that large differences in the estimated disease class probabilities may occur when assuming different dependence models, which can have a substantial impact on disease surveillance and prevention. Other more robust statistical methods, e.g. the generalized estimation equations [40,41], may be used to reduce the impact from misspecification of the dependence structure in this setting. We do not intend to suggest that these homogeneous dependence models are useless in practice because we cannot statistically differentiate between them based on the data alone. Caution against using these models due to the possible misspecification should be balanced with the need to estimate disease status probabilities. Furthermore, we realize that there are many more potential dependence structures than what we have considered, e.g., one could argue that the tests are dependent only for the cases but are independent for the controls . Depending on the problem in hand, some assumptions may be justifiable and preferable. In addition, note that the indistinguishable characteristic of these models is based on goodness-of-fit statistics alone. We can always use additional information such as expert opinion, historic information on sensitivities and specificities of the two binary diagnostic tests, and/or the range of dependence parameters to assist our choice of selecting a homogeneous dependence model. For the Bayesian approach, the additional information can be formulated as informative priors to improve the posterior inference. However, how to solicit and formulate informative priors in this case deserves thorough investigation and is beyond our current scope.
A potential strategy to justify the homogeneous assumptions of the α, θ and ρ is to incorporate a design element into the screening study that allows the selection of homogeneity models. This strategy could be randomly selecting a subset of both test negatives for ascertainment by a gold standard. However, in cases when a gold standard test is invasive and/or expensive, it is generally considered unethical to apply it to subjects whose screening tests are negative. In this case, if historical data or additional sample from a set of confirmed cases and controls in a similar population is available for determining test accuracy parameters, one can use the data to guide the selection of homogeneity dependence models.
In cases when there is no scientific justification to prefer a particular dependence model over the others, we suggest to treat those dependence models (including the three homogeneous dependence models that we have considered) as sensitivity analyses, and investigate how the dependence structure will impact the estimation of probabilities of disease classes. If there is a clinically significant difference, caution should be taken with any statistical inference. As a last choice, if the dependence structure assumption cannot be reasonably determined, Bayesian model averaging (BMA) may be preferable to any single model, but there is a heavy price to pay for the BMA: 1) the computations become more complex and 2) the credible intervals get much larger in some cases (to reflect the added uncertainty).
Assuming that models used in the Bayesian averaging includes the correctly specified model, the simulation results show that BMA inference generally performs better than any misspecified model alone, especially with respect to the interval coverage performance. In practice, all candidate models can be misspecified and thus one can argue that Bayesian model averaging may not be effective in reducing bias. Intuitively, if some models tend to overestimate and the other models tend to underestimate the parameters of interest, then the Bayesian model averaging will be effective in reducing bias compared to a specific misspecified model. However, we realize that if all models tend to overestimate (or underestimate) the parameters of interest or if the estimates from the incorrect models are far away from the correct model estimates, then the Bayesian model averaging may not be effective in reducing bias. In addition, because the data do not contain information to distinguish between conditional dependence models, one should not expect that the posterior model probabilities to be accurately estimated in practice, casting some doubt on the utility of the BMA estimate in this case.
In this article we consider only homogeneous dependence models which are identifiable from are data setting. Some researchers [43,44] have argued that one can do better in using a non-identifiable model with some informative prior information compared to a less realistic model with strong assumptions that is identifiable. Further research on an expanded model, potentially with heterogeneous dependence structure, may shed more light on the impact of prior misspecification versus model misspecification and the trade-off between an expanded non-identifiable model with less model assumption but more prior assumption and an identifiable model with stronger model assumption but less prior assumption on the estimation of disease prevalence in the case that we discussed.
We assumed that a perfect gold standard (or definitive) test exists, which may limit the usage of the proposed methods, because arguably all diagnostic tests are imperfect and even those with theoretically perfect properties can be rendered imperfect by laboratory or human errors. It may be fruitful for further methodological research to incorporate measurement errors of the third stage gold standard test, e.g., by a sensitivity analysis  or multiple imputation . However, this is beyond our present scope.
Another important potential bias in the estimation of disease prevalence is selection bias as to who participates in the screening program. We acknowledge that the estimates from our method can be biased if those who participate in the screening program are not representative of the target population whose prevalence is being estimated. If the information on who tend to participate in the screening program is available, further adjustment for the selection bias can be done by e.g., multiple imputation or inverse probability weighting (i.e., weighting each participant by the inverse of its estimated probability of participating the screening program).
Haitao Chu was supported in part by the Lineberger Cancer Center Core Grant CA16086 from the U.S. National Cancer Institute. The authors are grateful to the editor and two anonymous referees for their constructive comments and suggestions which have greatly improved this manuscript.