Two commonly used measures of performance of diagnostic medical procedures are sensitivity and specificity. Customarily they are modeled directly. That is, we explain Pr(positive screening result | occurrence of outcome) and Pr(negative screening result | non-occurrence of outcome) using binary regression models, introducing risk factors associated with the particular outcome. We obtain estimated functional forms for these probabilities and, as we change the levels of the risk factors, we can see how these probabilities change, how they co-vary as functions of the risk factors. Often such dependence is depicted through receiver operating characteristic curves (see, e.g. Pepe [1]), which assist in revealing the nature of the indirect co-variation.

Under a Bayesian perspective, these probabilities are random variables, each with a posterior distribution at each set of risk factor levels. We could then refer to the investigation of the previous paragraph as `first order', i.e. we examine, say, posterior means and the functional dependence between these means.

Our contribution here is to introduce a `second-order' analysis in this setting. More precisely, at any set of risk factor levels, there is a 2×2 table of screening result by outcome with entries that are probabilities summing to 1. If we model these random cell probabilities as functions of the risk factors, we obtain dependent random variables. Since sensitivity and specificity are functions of these cell probabilities, they are dependent as well. It is this stochastic dependence that we seek to learn about here. In particular, we illuminate the nature of this dependence under different stochastic models for the random cell probabilities. Beyond quantifying this stochastic dependence, it is also of interest for learning how predictable one measure is given the other.
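To fix notation, a minimal numerical sketch (the cell values below are hypothetical, chosen only for illustration) of how sensitivity and specificity are induced as deterministic functions of the four cell probabilities:

```python
import numpy as np

# Hypothetical 2x2 cell probabilities at one fixed set of risk factor levels:
# screening result (+/-) by outcome (disease present/absent).
p_pos_dis, p_neg_dis = 0.04, 0.01      # Pr(+, disease), Pr(-, disease)
p_pos_nodis, p_neg_nodis = 0.05, 0.90  # Pr(+, no disease), Pr(-, no disease)

# The four cells are probabilities that sum to 1.
assert np.isclose(p_pos_dis + p_neg_dis + p_pos_nodis + p_neg_nodis, 1.0)

# Sensitivity and specificity are functions of the cells, so if the cells
# are modeled as (dependent) random variables, the induced sensitivity and
# specificity are dependent random variables as well.
se = p_pos_dis / (p_pos_dis + p_neg_dis)        # Pr(+ | disease)    = 0.8
sp = p_neg_nodis / (p_neg_nodis + p_pos_nodis)  # Pr(- | no disease) ≈ 0.947
prevalence = p_pos_dis + p_neg_dis              # Pr(disease)        = 0.05
print(se, sp, prevalence)
```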

The key point is that in order to study this dependence, we need to proceed from a model for the joint cell probabilities. If we build a model for sensitivity and then, separately, a model for specificity, they become independent *by assumption*. Moreover, though we illustrate within the Bayesian framework, the same argument applies to a classical analysis. To study the stochastic dependence between an estimated sensitivity and an estimated specificity, we need to induce them from a model for the joint cell probabilities.

There are many possible ways to jointly model the four cell probabilities. Evidently, since they sum to 1, only three specifications are needed. Regardless, any such model will be a (nonlinear) reparametrization of any other. However, this does not imply that a similar dependence structure will be induced. In its simplest form, if we add a model for the probability of the outcome to the models for sensitivity and specificity, we uniquely determine the cell probabilities but impose independence for the latter two probabilities. In fact, if for a given vector of risk factors, *X*, we denote the three random variables by *Se*(*X*), *Sp*(*X*), and *O*(*X*), then the joint density for these variables takes the form *f*_{1}(*Se*(*X*)) *f*_{2}(*Sp*(*X*)) *f*_{3}(*O*(*X*)).
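For concreteness, a simulation sketch of this independent specification (the Beta distributions and their parameters are hypothetical, used only to stand in for the separate models *f*_{1}, *f*_{2}, *f*_{3}): independent draws of *Se*(*X*), *Sp*(*X*), and *O*(*X*) do determine the four cells uniquely and coherently, but no dependence between sensitivity and specificity can ever emerge, because it has been excluded by assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Independent specification: separate (hypothetical Beta) models for
# sensitivity Se(X), specificity Sp(X), and outcome probability O(X).
se = rng.beta(80, 20, n)   # Se(X) ~ f1
sp = rng.beta(90, 10, n)   # Sp(X) ~ f2
o = rng.beta(5, 95, n)     # O(X)  ~ f3

# These three uniquely determine the four cell probabilities:
p_pos_dis = se * o               # Pr(+, disease)
p_neg_dis = (1 - se) * o         # Pr(-, disease)
p_neg_nodis = sp * (1 - o)       # Pr(-, no disease)
p_pos_nodis = (1 - sp) * (1 - o) # Pr(+, no disease)

# Coherent: the cells are non-negative and sum to 1 for every draw ...
assert np.allclose(p_pos_dis + p_neg_dis + p_neg_nodis + p_pos_nodis, 1.0)

# ... but Se and Sp are independent by construction, so their sample
# correlation is near zero regardless of the risk factor levels.
print(round(np.corrcoef(se, sp)[0, 1], 3))
```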

Hence, an important issue that underlies what we are exploring is *coherence*. Clearly, we cannot write down three arbitrary functions of the cell probabilities and then model them. This will not ensure that the resulting four cell probabilities are non-negative and sum to 1. And, if this is not guaranteed, then probabilities computed from these cell probabilities need not be non-negative and less than 1. Hence, a specification is coherent if it does ensure this. Unfortunately, the natural coherent specification can be challenging to fit; an alternative version proves more tractable. We discuss these issues in detail below.
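One standard coherent route, shown here purely as an illustration (it is not necessarily the specification adopted below), is to model the four cells through a multinomial-logit map of three unconstrained functions of *X*. Positivity and the sum-to-1 constraint then hold by construction, so any probability derived from the cells is automatically valid:

```python
import numpy as np

def cells_from_logits(eta):
    """Map three unconstrained reals (e.g. linear predictors in the risk
    factors X) to a coherent vector of four cell probabilities.

    The fourth logit is pinned at 0 for identifiability; the softmax map
    guarantees every cell is positive and the cells sum to exactly 1.
    """
    z = np.append(np.asarray(eta, dtype=float), 0.0)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical logit values at one setting of the risk factors:
p = cells_from_logits([-0.5, -2.0, 2.9])
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)

# Any probability computed from these cells, e.g. sensitivity,
# then automatically lies in (0, 1).
se = p[0] / (p[0] + p[1])
assert 0.0 < se < 1.0
print(np.round(p, 3), round(se, 3))
```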

Within the range of medical screening procedures, we are motivated by screening mammography. Screening mammography examinations are used by radiologists to provide a preliminary assessment regarding the presence or absence of breast cancer in healthy, asymptomatic women and therefore help determine whether a patient requires more advanced examination. More accurate diagnosis of this class of tumors requires other procedures such as diagnostic mammograms, biopsy, or ultrasound.

Despite attempts to standardize radiologist performance [2] and to limit the level of subjectivity involved, it is well established that there is considerable variability in radiologist performance in reading/interpreting film (see [3–7]). There is also a growing literature that attempts to explain [8, 9] and quantify the extent of this variability (see [10, 9]). Strategies to compare physicians in a fairer fashion, by taking into account differences in case mix, have been proposed (see [11–13]).

In particular, Woodard *et al*. [14] propose a Bayesian approach to model radiologist performance rates. They focus on the probability of correctly detected cancers at the screening level (sensitivity) and the probability of correctly identified non-cancer cases at the screening level (specificity), given a collection of patient and radiologist characteristics. As a result, assessment of clinical performance for particular patient types becomes available. The work of Woodard *et al*. employs a retrospective data set to model sensitivity and specificity. In other words, the two performance measures are modeled separately, conditioning on the occurrence (or absence) of cancer. Evidently, if one were to view the four joint probabilities for screening outcome (+/−) and cancer outcome (presence/absence) as random, these probabilities are dependent, in fact negatively associated since their sum is 1. Then the induced sensitivity and specificity would be random and would be expected to be dependent as well. Intuitively, why should the unknown (random) probability of recall given cancer present be independent of the unknown (random) probability of recall given cancer absent?

Hence, in the spirit of the above discussion, our contribution is to develop a joint model for sensitivity and specificity through hierarchical specifications that uniquely and coherently determine the cell probabilities in the foregoing 2×2 table. Using two different models, we will be able to assess the nature of the dependence between these two performance measures. In particular, we do this for a large screening data set taken from three registries that are part of the Breast Cancer Surveillance Consortium (BCSC); see [15].

The potential clinical relevance of this work is to encourage those who study performance of screening tests to think jointly with regard to explaining screening result and disease outcome. Understanding the behavior of the joint process enables enhanced insight into the joint behavior of the induced sensitivity and specificity.

Thus, the format of the paper is as follows. In Section 2, we briefly review the data set used to investigate the foregoing dependence. In Section 3, we discuss coherent modeling for the joint distribution of screening outcome and disease outcome. In Section 4, we briefly discuss computational issues associated with fitting the models of Section 3. In Section 5, we analyze the data set under two different models presented in Section 3, in particular with regard to the dependence between sensitivity and specificity.