A key feature of clustered data is the potential for dependence between interpretations made by the same radiologist. Intuitively, observations for the same radiologist are more “similar” to one another than to observations for another radiologist. This dependence arises because of heterogeneity across radiologists: differences in skill levels, thresholds for recalling patients, patient populations, and/or practice or facility characteristics (4-6, 17, 18). One can account for such between-radiologist differences by including appropriate radiologist-specific covariates in a regression model; however, in many instances unexplained heterogeneity, and hence dependence, will remain.

Statistical models that assess predictors of interpretive performance must take into account the potential dependence among multiple interpretations made by the same radiologist. Naïve methods that ignore clustering, such as the traditional chi-square test or logistic regression, yield biased standard error estimates. These methods rely on the data being a sample of independent observations, while clustered data are inherently dependent. This dependence typically lessens the amount of statistical information about parameters of interest below what the overall sample size would suggest. For example, consider a study consisting of 50,000 mammography examinations interpreted by 10 radiologists. It is tempting to think 50,000 independent observations are available for analysis; however, the *effective number of independent observations* is closer to 10 (*i.e*., the number of radiologists). Therefore, naïve standard error estimates will be too small and inference based on confidence intervals and *p*-values will be statistically invalid.
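As a rough illustration of how quickly the effective sample size collapses, the sketch below applies the standard design-effect approximation for equal-sized clusters, n_eff = n / (1 + (m − 1)ρ), where m is the number of examinations per radiologist and ρ is the intraclass correlation. The function name and the ρ values are hypothetical and are not taken from the study; the approximation applies most directly to quantities that vary only between radiologists.

```python
def effective_sample_size(n_total, n_clusters, icc):
    """Design-effect approximation for equal-sized clusters.

    n_total    : total number of observations (e.g., mammograms)
    n_clusters : number of clusters (e.g., radiologists)
    icc        : assumed intraclass correlation (hypothetical input)
    """
    m = n_total / n_clusters                # observations per cluster
    design_effect = 1.0 + (m - 1.0) * icc   # variance inflation due to clustering
    return n_total / design_effect


# Hypothetical ICC values for the 50,000-examination, 10-radiologist example.
for icc in (0.0, 0.01, 0.05, 0.5, 1.0):
    n_eff = effective_sample_size(50_000, 10, icc)
    print(f"ICC = {icc:4.2f} -> effective n ~ {n_eff:,.0f}")
```

With ρ = 0 the full 50,000 observations are recovered, with ρ = 1 the effective size equals the number of radiologists, and under this approximation even modest correlation leaves only a small fraction of the nominal sample size.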

Two broad classes of regression models have been used to account for potential dependence in the analysis of interpretive performance: marginal and conditional models (19-21). Historically, the two approaches were developed specifically to account for dependence in clustered or longitudinal data. While both approaches achieve this goal, careful consideration of the model assumptions highlights important differences in the interpretation of the model results; a consequence of these differences is that the two modeling approaches address different scientific hypotheses. Indeed, the approaches are distinguished by the labels “marginal” and “conditional” because of implicit differences in the interpretations of the component parameters.

Conditional or cluster-specific models

Dependence among observations within a cluster can be induced by between-radiologist heterogeneity that is not explained by measured covariates. Conditional or cluster-specific models are a general class of regression models that account for dependence within clusters by introducing cluster-specific parameters directly into the model specification. These parameters serve to capture unmeasured between-cluster heterogeneity. An example of a cluster-specific logistic regression model for a performance measure, say recall rate, is

$$\operatorname{logit}\,\pi^{C}(X_{ij}, b_{i}) = \beta_{0}^{C} + \beta_{1}^{C} X_{ij,1} + \dots + \beta_{p}^{C} X_{ij,p} + b_{i} \qquad (3)$$

where *b*_{i} is a radiologist-specific parameter, and *π*^{C}(*X*_{ij}, *b*_{i}) is the conditional probability of recall given the covariates *X*_{ij} = (*X*_{ij,1}, …, *X*_{ij,p}) and the radiologist-specific parameter *b*_{i}. Intuitively, the radiologist-specific effect *b*_{i} induces dependence across the multiple observations from the *i*^{th} radiologist, because a large positive value of *b*_{i} indicates that each mammogram-specific probability of being recalled, given by (3), will be high, while a large negative value of *b*_{i} indicates that each mammogram-specific probability of recall will be low. In more general conditional models, the single radiologist-specific effect *b*_{i} can be replaced with a vector of effects, potentially depending on observed covariates.
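The role of the radiologist-specific effect can be made concrete with a small sketch. It assumes that model (3) has the usual random-intercept logistic form, logit *π*^{C} = *β*_{0}^{C} + *β*_{1}^{C}*X*_{ij,1} + … + *b*_{i}; the function names and all numeric values (a 10% baseline recall rate, a conditional odds ratio of 2, and the three values of *b*_{i}) are hypothetical.

```python
import math


def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_recall_prob(x, beta, b_i):
    """Conditional probability pi^C(X_ij, b_i) under an assumed random-intercept model.

    x    : covariate values for one mammogram (length p)
    beta : [beta_0, beta_1, ..., beta_p] conditional coefficients
    b_i  : radiologist-specific effect
    """
    eta = beta[0] + sum(bk * xk for bk, xk in zip(beta[1:], x)) + b_i
    return expit(eta)


# Hypothetical coefficients: 10% recall when x = 0 and b_i = 0, conditional OR of 2.
beta = [math.log(0.10 / 0.90), math.log(2.0)]

for b_i in (-2.0, 0.0, 2.0):
    p0 = conditional_recall_prob([0], beta, b_i)
    p1 = conditional_recall_prob([1], beta, b_i)
    print(f"b_i = {b_i:+.1f}: P(recall | x=0) = {p0:.3f}, P(recall | x=1) = {p1:.3f}")
```

Every mammogram read by a radiologist with a large positive *b*_{i} has a raised recall probability, which is what makes that radiologist's interpretations resemble one another; the odds ratio between *x* = 1 and *x* = 0 is the same for every value of *b*_{i}.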

Inspection of model (3) reveals that estimation is required for *p*+1+*N* parameters: *β*_{0}^{C}, *β*_{1}^{C}, …, *β*_{p}^{C}, *b*_{1}, …, *b*_{N}, where *p* is the number of covariates and *N* is the number of radiologists. Thus, the number of parameters in model (3) is directly linked to the sample size; the number of radiologist-specific effects increases with the number of radiologists. In such settings, traditional estimation methods (such as maximum likelihood) can break down (22). One way to overcome this problem is to use conditional logistic regression (23). This approach, which has its roots in the analysis of matched case-control studies, takes the *N* radiologist-specific effects to be nuisance parameters and uses the statistical technique of conditioning to eliminate them from the likelihood. As a result, the task of estimation is concentrated on the *p*+1 regression parameters. An additional consequence of the conditioning, however, is that one is no longer able to estimate the effect of any covariate that varies solely between radiologists, such as gender or average annual interpretive volume. In such settings, an alternative to removing the *N* radiologist-specific effects from the task of estimation is to impose some distributional assumptions on how the radiologist-specific effects vary across the population of radiologists. A common distributional assumption, for example, is that the *b*_{i} are normally distributed, with zero mean and constant variance, σ^{2}. In this case, the task of estimation reduces to *p*+2 parameters: the *p*+1 *β*^{C} regression coefficients and the unknown variance term, σ^{2}. With this approach, the *b*_{i} are treated as random variables and distinguished from the fixed model terms (i.e., the *β*^{C} regression coefficients). As such, the combination of model (3) with distributional assumptions concerning the *b*_{i} parameters is often referred to as a random effects or hierarchical model.

In observational, community-based settings such as the BCSC, mammography cases are typically interpreted by one or two radiologists. Multiple-reader multiple-case (MRMC) studies provide researchers with a potentially more efficient design where each case is interpreted by multiple radiologists, thereby reducing one source of study variability: differences across the cases (24). While several random effects models have been proposed for analyzing continuous performance measures collected from MRMC studies (24), extending the framework outlined above to analyze a binary performance measure is straightforward. Specifically, equation (3) could be modified to include an additional random effect corresponding to the case number, to account for correlation among multiple interpretations made on the same case.

Marginal or population-averaged models

An alternative to incorporating a radiologist-specific parameter into the mean model to account for dependence within clusters is to model the population mean as a function of covariates only, as in the case of independent data, and then adjust for the dependence within clusters in the calculation of the standard errors. Consider a model for a performance measure *π*_{ij} based solely on the observed *X*_{ij}:

$$\operatorname{logit}\,\pi^{M}(X_{ij}) = \beta_{0}^{M} + \beta_{1}^{M} X_{ij,1} + \dots + \beta_{p}^{M} X_{ij,p} \qquad (4)$$

A common technique for estimating the parameters in model (4) is that of generalized estimating equations (GEE) (25). Specifically, an estimate of the vector of marginal regression coefficients **β**^{M} = (*β*_{0}^{M}, *β*_{1}^{M}, …, *β*_{p}^{M}) is obtained by solving the estimating equations

$$\sum_{i=1}^{N} \mathbf{D}_{i}^{T} \mathbf{V}_{i}^{-1} \left\{ \mathbf{Y}_{i} - \pi^{M}(\mathbf{X}_{i}) \right\} = \mathbf{0} \qquad (5)$$

where **D**_{i} is the first derivative of the vector *π*^{M}(**X**_{i}) with respect to the regression parameters **β**^{M}, and **V**_{i} is the assumed variance-covariance matrix for the vector of observed outcomes for the *i*^{th} radiologist, **Y**_{i}. Intuitively, the solution to the estimating equations, **β̂**^{M}, is the value of the regression parameters **β**^{M} that provides the closest correspondence between the observed outcomes, **Y**_{i}, and what is expected under the assumed model, *π*^{M}(**X**_{i}). Provided the mean model (4) is correctly specified, the estimating equations (5) are unbiased (*i.e*., have zero expectation) regardless of the specific choice of the assumed variance-covariance matrix **V**_{i}; hence, the corresponding regression parameter estimates **β̂**^{M} are consistent (asymptotically unbiased).

Estimation of standard errors that take into account the dependence within clusters is straightforward. The *sandwich* or *robust* variance estimator is most commonly used and is well-known to be robust in the sense that valid inference is obtained for the marginal regression coefficients even if the variance-covariance matrix is misspecified. That is, the sandwich variance estimator accounts for arbitrary dependence among observations within a cluster, thereby ensuring valid inference (25). GEE methods that use standard software have also been proposed for non-nested clusters or crossed studies, such as MRMC studies, where the same cases are interpreted by multiple radiologists (8, 26).
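To show the mechanics in the simplest possible case, the sketch below works through GEE for an intercept-only marginal logistic model with an independence working correlation. It assumes the estimating equations take the standard form, sum over radiologists of **D**_{i}ᵀ**V**_{i}⁻¹(**Y**_{i} − *π*), and compares the naive (model-based) standard error with the sandwich standard error. The simulated data, the function names, and the parameter values are all hypothetical; this is not the analysis reported in the paper, and a real analysis would use established GEE software.

```python
import math
import random


def simulate_clusters(n_clusters, m, beta0, sigma, seed=0):
    """Simulate binary outcomes with a normal radiologist-specific intercept (hypothetical data)."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_clusters):
        p_i = 1.0 / (1.0 + math.exp(-(beta0 + rng.gauss(0.0, sigma))))
        clusters.append([1 if rng.random() < p_i else 0 for _ in range(m)])
    return clusters


def gee_intercept_only(clusters):
    """Independence-working-correlation GEE for logit(pi) = beta0.

    Returns (beta0_hat, naive_se, robust_se).
    """
    n = sum(len(y) for y in clusters)
    # With D_i = pi(1-pi) * 1 and V_i = pi(1-pi) * I, the estimating equation
    # reduces to sum_i sum_j (Y_ij - pi) = 0, solved by the overall proportion.
    p_hat = sum(sum(y) for y in clusters) / n
    beta0_hat = math.log(p_hat / (1.0 - p_hat))

    # "Bread": sum_i D_i' V_i^{-1} D_i = n * pi(1-pi).
    bread = n * p_hat * (1.0 - p_hat)
    # "Meat": sum over clusters of the squared cluster-level score.
    meat = sum((sum(y) - len(y) * p_hat) ** 2 for y in clusters)

    naive_se = math.sqrt(1.0 / bread)      # assumes independent observations
    robust_se = math.sqrt(meat) / bread    # sandwich: bread^-1 * meat * bread^-1
    return beta0_hat, naive_se, robust_se


clusters = simulate_clusters(n_clusters=10, m=500,
                             beta0=math.log(0.1 / 0.9), sigma=1.0, seed=1)
beta0_hat, naive_se, robust_se = gee_intercept_only(clusters)
print(f"beta0_hat = {beta0_hat:.3f}, naive SE = {naive_se:.3f}, robust SE = {robust_se:.3f}")
```

The point estimate is identical to ordinary logistic regression; only the standard error changes. With substantial between-radiologist heterogeneity, the sandwich standard error is noticeably larger than the naive one, reflecting the reduced information in clustered data.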

Interpretation of regression parameters

The nomenclature adopted to distinguish the two regression-based approaches for analyzing clustered data (conditional and marginal) arose from differences in the interpretation of their component parameters. To illustrate this, consider the interpretation of the conditional and marginal log-odds ratios *β*_{1}^{C} and *β*_{1}^{M} from models (3) and (4), respectively. The interpretation of both parameters relates to differences in performance (on the log-odds scale) between two populations of mammograms (the unit of analysis here); the two populations differ in the value of the covariate *X*_{ij,1}, while all other components are held constant. Suppose, for example, we are interested in the effect of a binary measure of annual interpretive volume *X*_{ij,1}, which takes the value 1 if the *i*^{th} radiologist had a high volume (based on some criteria) during the year the mammogram was interpreted and 0 otherwise, after adjusting for patient age *X*_{ij,2}. From (3), the interpretation of the conditional log-odds ratio can be derived via

$$\beta_{1}^{C} = \operatorname{logit}\,\pi^{C}(X_{ij,1}=1, X_{ij,2}, b_{i}) - \operatorname{logit}\,\pi^{C}(X_{ij,1}=0, X_{ij,2}, b_{i}) \qquad (7)$$

Hence, in addition to holding patient age (*X*_{ij,2}) constant, interpreting the conditional log-odds ratio *β*_{1}^{C} requires holding constant, or conditioning on, the value of the radiologist-specific effect, *b*_{i}. Consequently, *β*_{1}^{C} is referred to as a “conditional” or “cluster-specific” parameter. In contrast, the interpretation of the marginal log-odds ratio, derived via

$$\beta_{1}^{M} = \operatorname{logit}\,\pi^{M}(X_{ij,1}=1, X_{ij,2}) - \operatorname{logit}\,\pi^{M}(X_{ij,1}=0, X_{ij,2}) \qquad (8)$$

does not require conditioning on anything beyond the two measured predictor variables. In particular, the interpretation of the marginal log-odds ratio does not require conditioning on the radiologist-specific effect *b*_{i}; hence *β*_{1}^{M} describes differences in performance between two populations of mammograms, averaging across all radiologists. Consequently, *β*_{1}^{M} is referred to as a “population-averaged” or “marginal” parameter. Here, the term “marginal” is a statistical term referring to marginalizing (integrating or averaging) over the distribution of a random variable (in this instance, the random variable is the radiologist-specific effect *b*).

Connections between the two models

Although the two regression frameworks are presented separately, and have differing interpretations, the marginal and conditional means are connected mathematically via the convolution equation

$$\pi^{M}(X_{ij}) = \int \pi^{C}(X_{ij}, b)\, dG(b) \qquad (9)$$

In this expression, G(*b*) is the distribution of the random effects across clusters, often taken to be Normal with zero mean and a constant variance. Examination of this expression reveals that the marginal mean is equal to the average of the conditional mean, averaging with respect to the distribution of the random effect *b*.

The relationship between the marginal and conditional means, given by equation (9), indicates that both are well-defined in any given context. That is, given specification of the random effects distribution G(*b*), the two associations could be considered simultaneously. In practice, one typically decides which of the models is of primary scientific interest, and the modeling framework is chosen accordingly.
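The averaging described above has no closed form for a logistic conditional mean with normal random effects, but it is simple to approximate numerically. The sketch below assumes the conditional mean is expit(η + *b*) with *b* ~ N(0, σ²) and evaluates the integral with a plain trapezoid rule; the function names, grid settings, and example values are hypothetical choices made for illustration.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def marginal_mean(eta, sigma, half_width=8.0, steps=1600):
    """Approximate the integral of expit(eta + b) dG(b) for b ~ N(0, sigma^2).

    Uses a trapezoid rule over standardized values z = b / sigma on
    [-half_width, half_width]; eta is the conditional linear predictor at b = 0.
    """
    if sigma == 0.0:
        return expit(eta)  # no heterogeneity: marginal and conditional means agree
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        end_weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += end_weight * density * expit(eta + sigma * z)
    return total * h


# Hypothetical example: conditional mean of 0.10 at b = 0, random-effect SD of 2.
eta = math.log(0.10 / 0.90)
print(f"conditional mean at b = 0: {expit(eta):.3f}")
print(f"marginal mean (SD = 2):    {marginal_mean(eta, 2.0):.3f}")
```

Because the inverse logit is nonlinear, the marginal mean is pulled toward 0.5 relative to the conditional mean evaluated at *b* = 0, and the pull grows with the random-effect standard deviation.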

Numerical differences in conditional and marginal regression parameters

Comparing equations (7) and (8) indicates that the crucial difference in interpreting the two types of regression parameters is whether or not one conditions on the radiologist-specific random effect, *b*_{i}. For linear regression and ANOVA models for analyzing continuous performance measures collected from MRMC studies (27-32), the marginal and conditional regression coefficients can be shown to be numerically equivalent. However, for logistic regression models, such as those considered here, the values of the marginal and conditional odds ratios will typically not be numerically equivalent. Exceptions arise when the random effects have no variability across clusters, or when the true value of the conditional log-odds ratio *β*_{1}^{C} equals zero *and* the variability of the random effect distribution does not depend on the covariate *X*_{ij,1} (in which case the marginal log-odds ratio *β*_{1}^{M} also equals zero).

In most settings, the numerical difference between the marginal and conditional regression coefficients depends on the various components of the model as well as the underlying variation (magnitude and shape) of the distribution of the radiologist-specific random effects in the population. If the random effects are normally distributed with constant variance (specifically, if the variance does not depend on the covariate *X*_{ij,1}), the marginal odds ratio will be attenuated toward 1.0 compared to the conditional odds ratio (16, 20). The accompanying figure shows a hypothetical example that illustrates this attenuation. The solid line represents the average radiologist-specific effect of volume on sensitivity of mammography for a hypothetical conditional odds ratio of 2.0 measuring the increased odds of an abnormal mammogram among women with cancer corresponding to an increase in volume of 2,000 and a radiologist-specific effect standard deviation of 2.0. The dashed line shows the relationship for the corresponding marginal odds ratio of 1.5, which is attenuated relative to the conditional effect.
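A widely used closed-form approximation gives a feel for the size of this attenuation. It is not taken from this paper; it is the standard logistic-normal approximation, and it assumes normally distributed random intercepts with constant variance σ²:

```latex
\beta_{1}^{M} \;\approx\; \frac{\beta_{1}^{C}}{\sqrt{1 + c^{2}\sigma^{2}}},
\qquad c = \frac{16\sqrt{3}}{15\pi} \approx 0.588 .
```

For the hypothetical values used here (conditional odds ratio 2.0, σ = 2.0), this gives *β*_{1}^{M} ≈ log(2.0)/1.54 ≈ 0.45, or a marginal odds ratio of roughly 1.57, in the neighborhood of the value of about 1.5 quoted above. The approximation is only a guide; exact values require evaluating the averaging integral numerically.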

More generally, the value of one parameter given the other and the distributional assumptions of the radiologist-specific random effects can be derived via the relationship given by equation (9). Table 1 shows how the numerical values of the conditional and marginal odds ratios, exp(*β*_{1}^{C}) and exp(*β*_{1}^{M}), differ under various conditions in the simple setting of a single binary predictor. As noted above, when the random effect variance does not depend on the predictor *X*, the marginal odds ratio is attenuated toward 1.00, with the extent of the attenuation depending on the value of the conditional odds ratio, the intercept, and the random effects standard deviation. For example, when the conditional odds ratio is 2.00, the overall baseline mean is 0.50, and the random effects standard deviation is 0.50 in both the *X*=0 and *X*=1 groups, then the marginal odds ratio is 1.93. If the (common) standard deviation is 2.00, the attenuation is greater and the marginal odds ratio is 1.52.

**Table 1.** Marginal odds ratios for various values of the conditional odds ratio, the conditional intercept (*β*_{0}^{C}) and corresponding mean response when *X*=0 (*π*_{0}^{C}), and the standard deviation (SD) of the normally distributed radiologist-specific effects.

In contrast, if the radiologist-specific effect variability depends on *X*, the marginal odds ratio can be either attenuated or increased relative to the conditional odds ratio. In some cases, the marginal effect can even be opposite in direction to the conditional effect. For example, when the conditional odds ratio is 2.00, the overall baseline mean is 0.10, and the random effects standard deviation is 2.00 when *X*=0 and 0.50 when *X*=1, then the marginal odds ratio is 0.94. Last, it is also important to note that if a covariate has no conditional effect (*i.e*., the conditional odds ratio is 1.00), the marginal odds ratio could be different from 1.00 if the variability of the radiologist-specific effect depends on *X*. In other words, if high-volume radiologists have the same conditional performance as low-volume radiologists, but high-volume radiologists are less variable in their interpretations, they will have a higher marginal performance than low-volume radiologists for performance measures with means above 50%.
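The worked examples in this section can be checked numerically. The sketch below assumes a single binary predictor, normally distributed random intercepts whose standard deviation may differ between the *X*=0 and *X*=1 groups, and a "baseline mean" interpreted as the conditional mean at *b* = 0 when *X*=0 (as the Table 1 caption suggests). Function names and integration settings are hypothetical.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def marginal_mean(eta, sigma, half_width=8.0, steps=1600):
    """Trapezoid-rule approximation of E[expit(eta + b)] for b ~ N(0, sigma^2)."""
    if sigma == 0.0:
        return expit(eta)
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        end_weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += end_weight * density * expit(eta + sigma * z)
    return total * h


def marginal_odds_ratio(cond_or, baseline_mean, sd_x0, sd_x1):
    """Marginal odds ratio implied by a conditional odds ratio.

    baseline_mean is taken to be the conditional mean at b = 0 when X = 0
    (an interpretive assumption); sd_x0 and sd_x1 are the random-effect SDs
    in the X = 0 and X = 1 groups.
    """
    beta0 = math.log(baseline_mean / (1.0 - baseline_mean))
    beta1 = math.log(cond_or)
    p0 = marginal_mean(beta0, sd_x0)
    p1 = marginal_mean(beta0 + beta1, sd_x1)
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Scenarios described in the text: (conditional OR, baseline mean, SD at X=0, SD at X=1).
scenarios = [(2.0, 0.50, 0.5, 0.5), (2.0, 0.50, 2.0, 2.0), (2.0, 0.10, 2.0, 0.5)]
for cond_or, base, sd0, sd1 in scenarios:
    m_or = marginal_odds_ratio(cond_or, base, sd0, sd1)
    print(f"cond OR={cond_or}, baseline={base}, SD=({sd0}, {sd1}) -> marginal OR={m_or:.2f}")
```

Under these assumptions the three scenarios should return values close to the 1.93, 1.52, and 0.94 quoted in the text, including the reversal of direction when the random-effect variability is much larger in the *X*=0 group.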

For binary covariates, differences between the numerical values of the conditional and marginal parameters do not depend on the covariate distribution (*i.e*., the prevalence of the binary covariate). While we have focused here on a binary covariate, for continuous covariates the differences between the two parameters may depend on the covariate distribution, in addition to the factors considered in Table 1. In addition, in the case of a continuous covariate, it is important to note that the marginal and conditional effects of that covariate will not both be linear on the same scale (*e.g*., logit), unless the random effects are assumed to follow a specific distribution called a bridge distribution (33, 34).

Implications for science

Given the possible differences in the magnitude and direction of the conditional and marginal effects, it is important to consider carefully whether inference should be made at the radiologist or population level before analyzing clustered data with nonlinear regression models. For instance, the question of whether interpretive volume influences radiologists’ interpretive performance can be thought of in two ways. First, we may be interested in whether the sensitivity and specificity of mammography examinations interpreted by high-volume radiologists in the United States are better than these performance measures for mammography examinations interpreted by low-volume radiologists. This is a population-level question comparing the performance of mammography examinations interpreted by two different types of radiologists. In contrast, we may want to know whether an individual radiologist’s interpretive performance improves when his or her interpretive volume increases, controlling for other traits of that radiologist that influence performance. This is a radiologist-specific question that examines changes in an individual’s performance when one condition is changed but everything else about that radiologist remains constant. Both questions may be of interest, for example, to policy makers considering whether to increase the current interpretive volume requirements for certification. If the current requirement of ≥960 mammograms over the prior 2 years were increased to 2,000 mammograms, radiologists with 2-year volumes below 2,000 would have to either (a) stop interpreting mammography, leaving these mammograms to be interpreted by the group of remaining high-volume radiologists (the effect of this on the performance of mammography in the United States is estimated from the marginal model) or (b) increase their annual volume to meet the new guidelines (the effect of this on an individual radiologist’s performance is estimated from the conditional model).