A diagnostic test, in its simple form, tries to detect presence of a particular condition (disease) in a sample. Usually there are several studies where performance of the diagnostic test is measured by some statistic. One may want to combine such studies to get a good picture of performance of the test, a meta-analysis. Also, for a particular disease there may be several diagnostic tests invented, where each of the tests is subject of one or more studies. One may also want to combine all such studies to see how the competing tests are performing with respect to each other, and choose the best for clinical practice.
To pool several studies and estimate a summary statistic some assumptions are made. One such assumption is that differences seen between individual study results are due to chance (sampling variation). Equivalently, this means all study results are reflecting the same "true" effect [
1]. However, meta-analysis of studies for some diagnostic tests show that this assumption, in some cases, is not empirically supported. In other words, there is more variation between the studies that could be explained by random chance alone, the so-called "conflicting reports". One solution is to relax the assumption that every study is pointing to the same value. In other words, one accepts explicitly that different studies may correctly give "different" values for performance of the same test.
For example, sensitivity and specificity are a pair of statistics that together measure the performance of a diagnostic test. One may want to compute an average sensitivity and an average specificity for the test across the studies, hence pooling the studies together. Instead, one may choose to extract odds ratio (OR) from each paper (as test performance measure), and then estimate the average OR across the studies. The advantage is that widely different sensitivities (and specificities) can point to the same OR. This means one is relaxing the assumption that all the studies are pointing to the same sensitivity and specificity, and accepts that different studies are reporting "truly different" sensitivity and specificity, and that the between-study variation of them is not due to random noise alone, but because of difference in choice of decision threshold (the cutoff value to dichotomize the results). Therefore the major advantage of OR, and its corresponding receiver-operating-characteristic (ROC) curve, is that it provides measures of diagnostic accuracy unconfounded by decision criteria [
2]. An additional problem when pooling sensitivities and specificities separately is that it usually underestimates the test performance [[
3], p.670].
The above process may be used once more to relax the assumption that every study is pointing to the same OR, thus relaxing the "OR-homogeneity" assumption. In other words, in some cases, the remaining variation between studies, after utilizing OR as the summary performance measure, is still too much to be attributed to random noise. This suggests OR may vary from study to study. Therefore one explicitly assumes different studies are measuring different ORs, and that they are not pointing to the same OR. This difference in test performance across studies may be due to differences in study design, patient population, case difficulty, type of equipment, abilities of raters, and dependence of OR on threshold chosen [
4]. Nelson [
5] explains generating ROC curves that allow for the possibility of "inconstant discrimination accuracy", a heterogeneous ROC curve (HetROC). This means the ROC curve represents different ORs at different points. This contrasts with the fact that the homogeneous-ROC is completely characterized by one single OR.
There are a few implementations of the heterogeneous ROC. One may classify them into two groups. The first group is exemplified by Tosteson and Begg [
6]. They show how to use ordinal regression with two equations that correspond to location and scale. The latent scale binary logistic regression of Rutter and Gatsonis [
4] belong to this group. The second group contains implementations of Kardaun and Kardaun [
7], and Moses et al [
8]. Moses et al explain a method to plot such heterogeneous ROC curve under some parametric assumptions, and they call it summary ROC (SROC).
When comparing two (or more) diagnostic tests, where each study reports results on more than one test, the performance statistics (in the study results) are correlated. Then standard errors computed by SROC are invalid. Toledano and Gatsonis [
9] use the ordinal regression model, and account for the dependency of measurements by generalized estimating equations (GEE). However, to fit the model they suggest using a FORTRAN code.
We propose a regression model that accommodates more general heterogeneous ROC curves than SROC. The model accommodates complex missing patterns, and accounts for correlated results [
10]. Furthermore, we show how to implement the model using widely available statistical software packages. The model relaxes OR-homogeneity assumption. In the model, when comparing two (or more) tests, each test has its own trend of ORs across studies, while the trends of two tests are (assumed to be) proportional to each other, the "proportional odds ratio" assumption. We alleviate dilemma of choosing weighting schemes such that do not bias the estimates [[
11], p.123], by fitting the POR model to 2-by-2 tables. The model assumes a binomial distribution that is more realistic than a Gaussian used by some implementations of HetROC. Also, it is fairly easy to fit the model to (original) patient level data (if available).
Besides accounting better for between-study variation, we show how to use the POR model to "explain why" such variation exists. This potentially gives valuable insights and may have direct clinical applications. It may help define as to when, where, how, and on what patient population to use which test, to optimize performance.
We show how to use "deviation" contrast, in parameterization of categorical variables, to relax the restriction that a summary measure may be reported only if the respective interaction terms in the model are insignificant. This is similar to using grand mean in a "factor effects" ANOVA model (compared to "cell means" ANOVA model).
We show how to use nonparametric smoothers, instead of parametric functions of true positive rate (TPR) and/or false positive rate (FPR), to generate heterogeneous ROC for a single diagnostic test across several studies.
Our proposed POR model assumes the shape of the heterogeneous ROC curve is the same from one test to the other, but they differ in their locations in the ROC space. This assumption facilitates the comparison of the tests. However, one may want to relax the POR assumption, where each test is allowed to have a heterogeneous ROC curve with a different shape. One may implement such generalized comparison of the competing diagnostic tests by a mixed effects model. This may improve generalizability of meta-analysis results to all (unobserved) studies. Also, a mixed effects model may take care of remaining between-study variation better.