A key feature of clustered data is the potential for dependence between interpretations made by the same radiologist. Intuitively, observations from the same radiologist are more “similar” to one another than to observations from another radiologist. This dependence arises because of heterogeneity across radiologists: differences in skill levels, thresholds for recalling patients, patient populations, and/or practice or facility characteristics (4). One can account for such between-radiologist differences by including appropriate radiologist-specific covariates in a regression model; however, in many instances unexplained heterogeneity, and hence dependence, will remain.
Statistical models that assess predictors of interpretive performance must take into account the potential dependence among multiple interpretations made by the same radiologist. Naïve methods that ignore clustering, such as the traditional chi-square test or logistic regression, yield biased standard error estimates. These methods rely on the data being a sample of independent observations, while clustered data are inherently dependent. This dependence typically lessens the amount of statistical information about parameters of interest below what the overall sample size would suggest. For example, consider a study consisting of 50,000 mammography examinations interpreted by 10 radiologists. It is tempting to think 50,000 independent observations are available for analysis; however, the effective number of independent observations is closer to 10 (i.e., the number of radiologists). Therefore, naïve standard error estimates will be too small and inference based on confidence intervals and p-values will be statistically invalid.
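The loss of information described above can be made concrete with the standard design-effect formula, 1 + (m − 1)ρ. The sketch below rests on assumptions the text does not state: equal numbers of examinations per radiologist (m), a single exchangeable within-radiologist correlation ρ, and a target quantity that varies only between radiologists (such as an overall rate). The function name `effective_sample_size` and the correlation values are hypothetical and chosen only for illustration.

```python
def effective_sample_size(n_obs: int, n_clusters: int, icc: float) -> float:
    """Approximate effective sample size for equally sized clusters.

    icc is an assumed intracluster correlation:
    0 means independent observations, 1 means fully dependent observations.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    cluster_size = n_obs / n_clusters
    design_effect = 1.0 + (cluster_size - 1.0) * icc
    return n_obs / design_effect


if __name__ == "__main__":
    # 50,000 examinations interpreted by 10 radiologists, as in the example above.
    for icc in (0.0, 0.01, 0.05, 0.5, 1.0):
        n_eff = effective_sample_size(50_000, 10, icc)
        print(f"assumed ICC = {icc:4.2f} -> effective n = {n_eff:10.1f}")
```

Under these assumptions, even a modest within-radiologist correlation brings the effective sample size far below 50,000, and it reaches the number of radiologists when the correlation equals 1.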
Two broad classes of regression models have been used to account for potential dependence in the analysis of interpretive performance: marginal and conditional models (19). Historically, the two approaches were developed specifically to account for dependence in clustered or longitudinal data. While both approaches achieve this goal, careful consideration of the model assumptions highlights important differences in the interpretation of the model results; a consequence of these differences is that the two modeling approaches address different scientific hypotheses. Indeed, the approaches are distinguished by the labels “marginal” and “conditional” because of implicit differences in the interpretations of the component parameters.
Conditional or cluster-specific models
Dependence among observations within a cluster can be induced by between-radiologist heterogeneity that is not explained by measured covariates. Conditional or cluster-specific models are a general class of regression models that approach the problem of accounting for dependence within clusters by introducing cluster-specific parameters directly into the model specification. These parameters serve to capture unmeasured between-cluster heterogeneity. An example of a cluster-specific logistic regression model for a performance measure, say recall rate, is

$$\operatorname{logit}\left(\pi^{C}_{ij}\right) = \beta^{C}_{0} + \beta^{C}_{1} X_{ij,1} + \cdots + \beta^{C}_{p} X_{ij,p} + b_i, \tag{3}$$

where $b_i$ is a radiologist-specific parameter, and $\pi^{C}_{ij}(X_{ij}, b_i)$ is the conditional probability of recall given the covariates $X_{ij} = (X_{ij,1}, \ldots, X_{ij,p})$ and the radiologist-specific parameter $b_i$. Intuitively, the radiologist-specific effect $b_i$ induces dependence across the multiple observations from the $i$th radiologist, because a large positive value of $b_i$ indicates that each mammogram-specific probability of being recalled, given by (3), will be high, while a large negative value of $b_i$ indicates that each mammogram-specific probability of recall will be low. In more general conditional models, the single radiologist-specific effect $b_i$ can be replaced with a vector of effects, potentially depending on observed covariates.
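To make the role of the radiologist-specific effect concrete, the following minimal sketch evaluates a random-intercept logistic model of the form logit(π) = β0 + β1·x1 + … + βp·xp + b_i for a few values of b_i. The function names and the numerical values (a 10% recall probability for a radiologist with b_i = 0, and effects of −1 and +1) are hypothetical and are not taken from the article.

```python
import math


def expit(x: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def conditional_recall_probability(beta0, betas, x, b_i):
    """Conditional probability of recall given covariates x and radiologist effect b_i."""
    linear_predictor = beta0 + sum(b * xk for b, xk in zip(betas, x)) + b_i
    return expit(linear_predictor)


if __name__ == "__main__":
    beta0 = logit(0.10)  # hypothetical: 10% recall when b_i = 0 and no covariates
    for b_i in (-1.0, 0.0, 1.0):
        p = conditional_recall_probability(beta0, [], [], b_i)
        print(f"b_i = {b_i:+.1f} -> conditional recall probability = {p:.3f}")
```

Because the same b_i enters the linear predictor of every mammogram read by radiologist i, all of that radiologist's recall probabilities shift together, which is the source of the within-radiologist dependence.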
Inspection of model (3) reveals that estimation is required for $p+1$ regression coefficients, $\beta^{C}_{0}, \ldots, \beta^{C}_{p}$, and $N$ radiologist-specific effects, $b_1, \ldots, b_N$, where $p$ is the number of covariates and $N$ is the number of radiologists. Thus, the number of parameters in model (3) is directly linked to the sample size; the number of radiologist-specific effects increases with the number of radiologists. In such settings, traditional estimation methods (such as maximum likelihood) can break down (22). One way to overcome this problem is to use conditional logistic regression (23). This approach, which has its roots in the analysis of matched case-control studies, takes the $N$ radiologist-specific effects to be nuisance parameters and uses the statistical technique of conditioning to eliminate them from the likelihood. As a result, the task of estimation is concentrated on the $p+1$ regression parameters. An additional consequence of the conditioning, however, is that one is no longer able to estimate the effect of any covariate that varies solely between radiologists, such as gender or average annual interpretive volume. In such settings, an alternative to removing the $N$ radiologist-specific effects from the task of estimation is to impose some distributional assumptions on how the radiologist-specific effects vary across the population of radiologists. A common distributional assumption, for example, is that the $b_i$ are normally distributed, with zero mean and constant variance, $\sigma^2$. In this case, the task of estimation reduces to $p+2$ parameters: the $p+1$ regression coefficients (including the intercept) and the unknown variance term, $\sigma^2$. With this approach, the $b_i$ are treated as random variables and distinguished from the fixed model terms (i.e., the $\beta^{C}$ regression coefficients). As such, the combination of model (3) with distributional assumptions concerning the $b_i$ parameters is often referred to as a random effects or hierarchical model.
In observational, community-based settings such as the BCSC, mammography cases are typically interpreted by one or two radiologists. Multiple-reader multiple-case (MRMC) studies provide researchers with a potentially more efficient design where each case is interpreted by multiple radiologists, thereby reducing one source of study variability: differences across the cases (24). While several random effects models have been proposed for analyzing continuous performance measures collected from MRMC studies (24), extending the framework outlined above to analyze a binary performance measure is straightforward. Specifically, equation (3) could be modified to include an additional random effect corresponding to the case number, to account for correlation among multiple interpretations made on the same case.
Marginal or population-averaged models
An alternative to incorporating a radiologist-specific parameter into the mean model to account for dependence within clusters is to model the population mean as a function of covariates only, as in the case of independent data, and then adjust for the dependence within clusters in the calculation of the standard errors. Consider a model for the marginal mean of a performance measure, $\pi^{M}_{ij}$, based solely on the observed $X_{ij}$:

$$\operatorname{logit}\left(\pi^{M}_{ij}\right) = \beta^{M}_{0} + \beta^{M}_{1} X_{ij,1} + \cdots + \beta^{M}_{p} X_{ij,p}. \tag{4}$$

A common technique for estimating the parameters in model (4) is that of generalized estimating equations (GEE) (25). Specifically, an estimate of the vector of marginal regression coefficients $\beta^{M} = (\beta^{M}_{0}, \ldots, \beta^{M}_{p})$ is obtained by solving the estimating equations

$$\sum_{i=1}^{N} \left(\frac{\partial \pi^{M}_{i}(\beta^{M})}{\partial \beta^{M}}\right)^{T} V_i^{-1} \left(Y_i - \pi^{M}_{i}(\beta^{M})\right) = 0, \tag{5}$$

where $\partial \pi^{M}_{i}(\beta^{M}) / \partial \beta^{M}$ is the first derivative of the vector $\pi^{M}_{i}(\beta^{M})$ with respect to the regression parameters $\beta^{M}$, and $V_i$ is the assumed variance-covariance matrix for the vector of observed outcomes for the $i$th radiologist, $Y_i$. Intuitively, the solution to the estimating equations, $\hat{\beta}^{M}$, is the value of the regression parameters $\beta^{M}$ that provides the closest correspondence between the observed outcomes, $Y_i$, and what is expected under the assumed model, $\pi^{M}_{i}(\beta^{M})$. Provided the mean model (4) is correctly specified, the estimating equations (5) are unbiased (i.e., have zero expectation) regardless of the specific choice of the assumed variance-covariance matrix $V_i$; hence, the corresponding regression parameter estimates $\hat{\beta}^{M}$ are consistent (asymptotically unbiased).
Estimation of standard errors that take into account the dependence within clusters is straightforward. The sandwich variance estimator is most commonly used and is well known to be robust in the sense that valid inference is obtained for the marginal regression coefficients even if the variance-covariance matrix is misspecified. That is, the sandwich variance estimator accounts for arbitrary dependence among observations within a cluster, thereby ensuring valid inference (25). GEE methods that use standard software have also been proposed for non-nested clusters or crossed studies, such as MRMC studies, where the same cases are interpreted by multiple radiologists (8).
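The contrast between naïve and sandwich standard errors can be sketched in the simplest possible marginal model: an intercept-only model for an overall recall proportion with an independence working covariance. In that special case the sandwich estimator reduces to a closed form built from squared sums of within-radiologist residuals. The simulation settings below (20 radiologists, 500 examinations each, a baseline recall probability of 0.10, a random-effect standard deviation of 1.0) and the function names are hypothetical; this is an illustration under those assumptions, not the estimator as implemented in any particular GEE software.

```python
import math
import random


def simulate_clusters(n_radiologists, n_exams, baseline, sd, seed=0):
    """Simulate binary recall outcomes from a random-intercept logistic model."""
    rng = random.Random(seed)
    beta0 = math.log(baseline / (1.0 - baseline))
    data = []
    for _ in range(n_radiologists):
        b_i = rng.gauss(0.0, sd)                      # radiologist-specific effect
        p_i = 1.0 / (1.0 + math.exp(-(beta0 + b_i)))  # that radiologist's recall probability
        data.append([1 if rng.random() < p_i else 0 for _ in range(n_exams)])
    return data


def naive_and_sandwich_se(clusters):
    """Return (estimate, naive SE, cluster-robust SE) for an overall proportion."""
    n = sum(len(y) for y in clusters)
    p_hat = sum(sum(y) for y in clusters) / n
    # Naive variance treats all n observations as independent.
    naive_var = p_hat * (1.0 - p_hat) / n
    # Sandwich variance: the "bread" is 1/n and the "meat" is the sum over
    # radiologists of the squared total residual within each radiologist.
    meat = sum((sum(y) - len(y) * p_hat) ** 2 for y in clusters)
    robust_var = meat / n ** 2
    return p_hat, math.sqrt(naive_var), math.sqrt(robust_var)


if __name__ == "__main__":
    clusters = simulate_clusters(n_radiologists=20, n_exams=500, baseline=0.10, sd=1.0, seed=1)
    p_hat, naive_se, robust_se = naive_and_sandwich_se(clusters)
    print(f"estimate = {p_hat:.3f}, naive SE = {naive_se:.4f}, sandwich SE = {robust_se:.4f}")
```

With substantial between-radiologist heterogeneity, the sandwich standard error is typically several times larger than the naïve one, reflecting the reduced information in clustered data.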
Interpretation of regression parameters
The nomenclature adopted to distinguish the two regression-based approaches for analyzing clustered data (conditional and marginal) arose from differences in the interpretation of their component parameters. To illustrate this, consider the interpretation of the conditional and marginal log-odds ratios $\beta^{C}_{1}$ and $\beta^{M}_{1}$ from models (3) and (4), respectively. The interpretation of both parameters relates to differences in performance (on the log-odds scale) between two populations of mammograms (the unit of analysis here); the two populations differ in the value of the covariate $X_{ij,1}$, while all other remaining components are held constant. Suppose, for example, we are interested in the effect of a binary measure of annual interpretive volume $X_{ij,1}$, which takes the value 1 if the $i$th radiologist had a high volume (based on some criteria) during the year the mammogram was interpreted and 0 otherwise, after adjusting for patient age $X_{ij,2}$. From (3), the interpretation of the conditional log-odds ratio can be derived via

$$\beta^{C}_{1} = \operatorname{logit}\, \pi^{C}_{ij}(X_{ij,1} = 1, X_{ij,2}, b_i) - \operatorname{logit}\, \pi^{C}_{ij}(X_{ij,1} = 0, X_{ij,2}, b_i). \tag{7}$$
Hence, in addition to holding patient age ($X_{ij,2}$) constant, interpreting the conditional log-odds ratio $\beta^{C}_{1}$ requires holding constant, or conditioning on, the value of the radiologist-specific effect, $b_i$. Consequently, $\beta^{C}_{1}$ is referred to as a “conditional” or “cluster-specific” parameter. In contrast, the interpretation of the marginal log-odds ratio, derived via

$$\beta^{M}_{1} = \operatorname{logit}\, \pi^{M}_{ij}(X_{ij,1} = 1, X_{ij,2}) - \operatorname{logit}\, \pi^{M}_{ij}(X_{ij,1} = 0, X_{ij,2}), \tag{8}$$

does not require conditioning on anything beyond the two measured predictor variables. In particular, the interpretation of the marginal log-odds ratio does not require conditioning on the radiologist-specific effect $b_i$; hence $\beta^{M}_{1}$ describes differences in performance between two populations of mammograms, averaging across all radiologists. Consequently, $\beta^{M}_{1}$ is referred to as a “population-averaged” or “marginal” parameter. Here, the term “marginal” is a statistical term referring to marginalizing (integrating or averaging) over the distribution of a random variable (in this instance, the random variable is the radiologist-specific effect $b$).
Connections between the two models
Although the two regression frameworks are presented separately, and have differing interpretations, the marginal and conditional means are connected mathematically via the convolution equation

$$\pi^{M}_{ij}(X_{ij}) = \int \pi^{C}_{ij}(X_{ij}, b)\, dG(b). \tag{9}$$

In this expression, $G(b)$ is the distribution of the random effects across clusters, often taken to be Normal with zero mean and a constant variance. Examination of this expression reveals that the marginal mean is equal to the average of the conditional mean, averaging with respect to the distribution of the random effect $b$.
The relationship between the marginal and conditional means, given by equation (9), indicates that both are well-defined in any given context. That is, given specification of the random effects distribution $G(b)$, the two associations could be considered simultaneously. In practice, one typically decides which of the models is of primary scientific interest, and the modeling framework is chosen accordingly.
Numerical differences in conditional and marginal regression parameters
Comparing equations (7) and (8) indicates that the crucial difference in interpreting the two types of regression parameters is whether or not one conditions on the radiologist-specific random effect, $b_i$. For linear regression and ANOVA models for analyzing continuous performance measures collected from MRMC studies (27), the marginal and conditional regression coefficients can be shown to be numerically equivalent. However, for logistic regression models, such as those considered here, the values of the marginal and conditional odds ratios will typically not be numerically equivalent. Exceptions include when the random effects have no variability across clusters, or when the true value of the conditional log-odds ratio $\beta^{C}_{1}$ equals zero and the variability of the random effect distribution does not depend on the covariate $X_{ij,1}$ (in which case the marginal log-odds ratio $\beta^{M}_{1}$ also equals zero).
In most settings, the numerical difference between the marginal and conditional regression coefficients depends on the various components of the model as well as the underlying variation (magnitude and shape) of the distribution of the radiologist-specific random effects in the population. If the random effects are normally distributed with constant variance (specifically, if the variance does not depend on the covariate $X_{ij,1}$), the marginal odds ratio will be attenuated toward 1.0 compared to the conditional odds ratio (16). Figure 2 shows a hypothetical example that illustrates this attenuation. The solid line represents the average radiologist-specific effect of volume on the sensitivity of mammography, for a hypothetical conditional odds ratio of 2.0 (measuring the increased odds of an abnormal mammogram among women with cancer corresponding to an increase in volume of 2,000) and a radiologist-specific effect standard deviation of 2.0. The dashed line shows the relationship for the corresponding marginal odds ratio of 1.5, which is attenuated relative to the conditional effect.
Figure 2. Hypothetical radiologist-specific (solid lines) and population-averaged (dashed line) curves showing the effect of annual interpretive volume on sensitivity. The thick solid line is the radiologist-specific sensitivity by volume for an average radiologist, …
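A rough feel for the size of this attenuation can be obtained from a widely used closed-form approximation for logistic models with a normally distributed random intercept of constant variance. The approximation is not part of the article's own derivation, and the constant $c$ below comes from approximating the logistic function by a rescaled normal distribution function, so the result should be read as a sketch rather than an exact relationship:

```latex
% Approximate attenuation of the marginal log-odds ratio when
% b_i ~ Normal(0, sigma^2) and sigma^2 does not depend on the covariates.
\beta^{M}_{1} \;\approx\; \frac{\beta^{C}_{1}}{\sqrt{1 + c^{2}\sigma^{2}}},
\qquad c = \frac{16\sqrt{3}}{15\pi} \approx 0.588 .

% Worked example: conditional odds ratio 2.0, sigma = 2.0.
\beta^{M}_{1} \;\approx\; \frac{\log 2}{\sqrt{1 + 0.346 \times 4}}
  = \frac{0.693}{1.544} \approx 0.449,
\qquad \exp(0.449) \approx 1.57 .
```

The approximate value of 1.57 is close to, but not identical with, the marginal odds ratio of about 1.5 quoted for the same hypothetical setting; exact values require evaluating the integral that links the marginal and conditional means.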
More generally, the value of one parameter given the other and the distributional assumptions of the radiologist-specific random effects can be derived via the relationship given by equation (9). Table 1 shows how the numerical values of the conditional and marginal odds ratios, $\exp(\beta^{C}_{1})$ and $\exp(\beta^{M}_{1})$, differ under various conditions in the simple setting of a single binary predictor. As noted above, when the random effect variance does not depend on the predictor $X$, the marginal odds ratio is attenuated toward 1.00, with the extent of the attenuation depending on the value of the conditional odds ratio, the intercept, and the random effects standard deviation. For example, when the conditional odds ratio is 2.00, the overall baseline mean is 0.50, and the random effects standard deviation is 0.50 in both the $X=0$ and $X=1$ groups, then the marginal odds ratio is 1.93. If the (common) standard deviation is 2.00, the attenuation is greater and the marginal odds ratio is 1.52.
Table 1. Marginal odds ratios for various values of the conditional odds ratio, the conditional intercept (β0C) and corresponding mean response when X=0 (π0C), and the standard deviation (SD) of the normally distributed radiologist-specific effects.
In contrast, if the radiologist-specific effect variability depends on X, the marginal odds ratio can be either attenuated or increased relative to the conditional odds ratio. In some cases, the marginal effect can even be in the opposite direction from the conditional effect. For example, when the conditional odds ratio is 2.00, the overall baseline mean is 0.10, and the random effects standard deviation is 2.00 when X=0 and 0.50 when X=1, then the marginal odds ratio is 0.94. Last, it is also important to note that if a covariate has no conditional effect (i.e., the conditional odds ratio is 1.00), the marginal odds ratio could be different from 1.00 if the variability of the radiologist-specific effect depends on X. In other words, if high-volume radiologists have the same conditional performance as low-volume radiologists, but high-volume radiologists are less variable in their interpretations, they will have higher marginal performance than low-volume radiologists for performance measures with means above 50%.
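The marginal odds ratios quoted in the worked examples can be approximated by numerically averaging the conditional mean over a normal random-effects distribution. The sketch below assumes a single binary predictor, a conditional intercept equal to the logit of the baseline mean, and normally distributed radiologist-specific effects whose standard deviation may differ between the X=0 and X=1 groups. The function names and the simple trapezoid rule are choices made here for illustration, not the method used to produce Table 1.

```python
import math


def expit(x: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_mean(eta: float, sd: float, n_steps: int = 4000, z_max: float = 8.0) -> float:
    """Average expit(eta + b) over b ~ Normal(0, sd^2) with the trapezoid rule."""
    h = 2.0 * z_max / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        z = -z_max + k * h                      # standard normal grid point
        weight = 0.5 if k in (0, n_steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density * expit(eta + sd * z)
    return total * h


def marginal_odds_ratio(cond_or: float, baseline: float, sd_x0: float, sd_x1: float) -> float:
    """Marginal odds ratio implied by a conditional odds ratio for a binary predictor."""
    beta0 = math.log(baseline / (1.0 - baseline))  # conditional intercept
    beta1 = math.log(cond_or)                      # conditional log-odds ratio
    p0 = marginal_mean(beta0, sd_x0)               # marginal mean when X = 0
    p1 = marginal_mean(beta0 + beta1, sd_x1)       # marginal mean when X = 1
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


if __name__ == "__main__":
    scenarios = [
        (2.0, 0.50, 0.5, 0.5),  # constant SD of 0.50
        (2.0, 0.50, 2.0, 2.0),  # constant SD of 2.00
        (2.0, 0.10, 2.0, 0.5),  # SD depends on X
    ]
    for cond_or, baseline, sd0, sd1 in scenarios:
        m_or = marginal_odds_ratio(cond_or, baseline, sd0, sd1)
        print(f"cond OR={cond_or}, baseline={baseline}, SDs=({sd0}, {sd1}) -> marginal OR={m_or:.2f}")
```

Setting `cond_or` to 1.0 with unequal standard deviations in the two groups gives a marginal odds ratio different from 1.0, which is the situation described in the last part of the paragraph above.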
For binary covariates, differences between the numerical values of the conditional and marginal parameters do not depend on the covariate distribution (i.e., the prevalence of the binary covariate). While we have focused here on a binary covariate, for continuous covariates the differences between the two parameters may depend on the covariate distribution, in addition to the factors considered in Table 1. In addition, in the case of a continuous covariate, it is important to note that the marginal and conditional effects of that covariate will not both be linear on the same scale (e.g., logit), unless the random effects are assumed to follow a specific distribution called a bridge distribution (33).
Implications for science
Given the possible differences in the magnitude and direction of the conditional and marginal effects, it is important to consider carefully whether inference should be made at the radiologist or population level before analyzing clustered data with nonlinear regression models. For instance, the question of whether interpretive volume influences radiologists’ interpretive performance can be thought of in two ways. First, we may be interested in whether the sensitivity and specificity of mammography examinations interpreted by high-volume radiologists in the United States are better than these performance measures for mammography examinations interpreted by low-volume radiologists. This is a population-level question comparing the performance of mammography examinations interpreted by two different types of radiologists. In contrast, we may want to know whether an individual radiologist’s interpretive performance improves when his or her interpretive volume increases, controlling for other traits of that radiologist that influence performance. This is a radiologist-specific question that examines changes in an individual’s performance when one condition is changed but everything else about that radiologist remains constant. Both questions may be of interest, for example, to policy makers considering whether to increase the current interpretive volume requirements for certification. If the current requirement of ≥960 mammograms over the prior 2 years were increased to 2,000 mammograms, radiologists with 2-year volumes below 2,000 would have to either (a) stop interpreting mammography, leaving these mammograms to be interpreted by the group of remaining high-volume radiologists (the effect of this on the performance of mammography in the United States is estimated from the marginal model) or (b) increase their annual volume to meet the new guidelines (the effect of this on an individual radiologist’s performance is estimated from the conditional model).