The formal statistical framework that we develop here allows for a rigorous analysis of the effect of the pattern of correlations between markers on the added predictive ability of marker combinations as measured by the area under the ROC curve (AUC), a common metric of predictive performance. Added predictive ability is defined as the increase in AUC over the level obtained with the primary marker alone. In order to derive analytic results, we make two assumptions, one concerning the distributions of the marker levels in the populations of interest and the other concerning the nature of the algorithm for combining the multiple marker values. Later, we will show that in practice the basic findings of this analysis are still generally upheld even under deviations from these assumptions.
In biology, the term “correlation” is often used loosely to denote a relationship between two factors or effects. In statistics, correlation has a specific definition with respect to the values of a pair of quantitative variables (eg, concentrations of two markers). The level of correlation, as summarized in the correlation coefficient r, measures the tendency of one variable to increase (positive correlation) or decrease (negative correlation) as the other variable increases. The correlation coefficient r ranges from −1 to 1, with 1 indicating perfect positive correlation, 0 indicating no linear correlation (equivalent to independence when the two variables are jointly normal), and −1 indicating perfect negative correlation. On a two-dimensional scatter plot, the more closely the points cluster around a positively (negatively) sloped regression line, the higher the magnitude of positive (negative) correlation.
The first assumption we make, about the distribution of marker levels, is that, within cases and within controls, each marker is lognormally distributed, ie, the log of the marker concentration is normally distributed. For many, but not all, markers, levels are approximately lognormal. Further, we assume that the multivariate normal (MVN) distribution describes the distribution (in cases and in controls) of log values of a set of markers of interest. The MVN specifies not only the parameters (mean and standard deviation) of the normal distribution of each (log) marker, but also the correlations between the markers.
The second assumption concerns the nature of the multi-marker algorithm. Specifically, we assume that the algorithm is linear, ie, that it is a weighted sum of (log) marker concentrations. With a linear algorithm, along with the MVN assumption about marker distributions, one can analytically compute the following: (1) the weights for the optimal algorithm involving all the markers in the set and (2) the AUC of the resulting optimal algorithm.6
Formulas for these are given in the Appendix.
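For orientation, the standard result for linear combinations under the multivariate normal model can be written as below. We assume, but cannot confirm from this section, that it corresponds to the Appendix formulas. The symbols δ (vector of mean differences, cases minus controls), Σ1 and Σ0 (covariance matrices of the log markers in cases and controls), w (weight vector), and Φ (standard normal distribution function) are notation introduced here for illustration.

```latex
\begin{aligned}
\boldsymbol{\delta} &= \boldsymbol{\mu}_1 - \boldsymbol{\mu}_0, \\
\mathbf{w} &\propto (\Sigma_1 + \Sigma_0)^{-1}\,\boldsymbol{\delta}, \\
\mathrm{AUC} &= \Phi\!\left(\sqrt{\boldsymbol{\delta}^{\top}(\Sigma_1 + \Sigma_0)^{-1}\,\boldsymbol{\delta}}\right).
\end{aligned}
```

For a single marker j this reduces to AUC = Φ(δj / √(σ1j² + σ0j²)), the usual binormal expression.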
We now demonstrate, and also mathematically prove (in the Appendix), several important qualitative properties relating the pattern of marker correlations to the ability of a multi-marker linear algorithm to provide increased predictive ability (AUC). We initially concentrate on the case of only two markers, and later sketch how the results can be extended to three or more markers.
With two population groups, cases and controls, there is a separate MVN distribution for each group; thus for any pair of markers there are two correlation parameters, one in cases (say r1) and one in controls (say r0). Note that, in practice, these two correlation coefficients are often quite different. The pivotal quantity in determining the ability of the second marker to add predictive value to a primary marker turns out to be a weighted average of r1 and r0, which we denote by C. Specifically, C = [σ11σ12r1 + σ01σ02r0]/A, where σij is the standard deviation of the distribution of (log) marker j (j = 1, 2) in group i (1 = cases, 0 = controls) and A = σ11σ12 + σ01σ02. Because the weights are positive, C is negative whenever both correlations are negative and positive whenever both are positive.
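As a minimal sketch of this definition, the weighted average C can be computed directly from the four standard deviations and the two within-group correlations. The function name, argument layout, and numeric inputs below are hypothetical and chosen only for illustration.

```python
def weighted_correlation(sd_cases, sd_controls, r_cases, r_controls):
    """Weighted average C of the within-group correlations r1 and r0.

    sd_cases    = (sigma_11, sigma_12): SDs of log-markers 1 and 2 in cases
    sd_controls = (sigma_01, sigma_02): SDs of log-markers 1 and 2 in controls
    """
    w_cases = sd_cases[0] * sd_cases[1]           # weight on r1
    w_controls = sd_controls[0] * sd_controls[1]  # weight on r0
    a = w_cases + w_controls                      # normalizing constant A
    return (w_cases * r_cases + w_controls * r_controls) / a


# Hypothetical inputs: the markers vary more in cases, so r1 dominates C.
c_mixed = weighted_correlation((2.0, 2.0), (1.0, 1.0), r_cases=0.5, r_controls=-0.5)
# With equal SDs everywhere, C is the simple average of r1 and r0.
c_equal = weighted_correlation((1.0, 1.0), (1.0, 1.0), r_cases=-0.7, r_controls=0.0)
print(c_mixed, c_equal)
```

When r1 and r0 have opposite signs, as in the first call, the sign of C depends on which group has the larger product of standard deviations.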
Figure 1 illustrates, for the situations of C > 0, C = 0, and C < 0, how the increase in AUC of the optimal two-marker combination over that of the primary marker alone (denoted ΔAUC) is related to the AUC of the second marker alone (denoted AUC-2). For C ≤ 0, ΔAUC is monotonically increasing as a function of AUC-2. Further, the more negative the value of C, the higher the curve is throughout the entire range of AUC-2. In contrast, for C > 0, the curves have a quadratic shape. As AUC-2 increases from the null level (AUC = 0.5), ΔAUC initially decreases, eventually reaches 0, and then begins increasing. Note that ΔAUC = 0 implies that the second marker adds nothing to the AUC of the primary marker alone; in this case the optimal weight for the second marker is 0. Unlike the C ≤ 0 situation, the curves for different C > 0 values intersect each other, with none lying above another over the entire range of AUC-2.
Figure 1 Relationship between correlation and ΔAUC for 2-marker combinations. ΔAUC for the 2-marker combination is plotted against the AUC of marker 2 alone; each curve represents a different value of the correlation of marker 1 with marker 2.
From the figure, it is clear that the AUC of the 2nd marker alone does not by itself determine the level of increase in AUC with the combination. A marker with negative correlation (C < 0) may have lower AUC-2 than a marker with positive correlation but still have substantially greater ΔAUC. Even for the same level of positive correlation, a lower AUC-2 value may give rise to a greater value of ΔAUC.
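These comparisons can be reproduced numerically with a short sketch. It assumes unit standard deviations for both log markers in both groups (so that r1 = r0 = C) and uses the standard binormal AUC expression, which we assume matches the Appendix formulas. The function name and the chosen AUC values are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def delta_auc(auc1, auc2, c):
    """Increase in AUC from optimally adding marker 2 to marker 1.

    Assumes unit SDs in both groups, so r1 = r0 = C and each mean
    difference equals sqrt(2) * Phi^{-1}(AUC of that marker alone).
    """
    d1 = sqrt(2.0) * PHI.inv_cdf(auc1)
    d2 = sqrt(2.0) * PHI.inv_cdf(auc2)
    # Quadratic form d' (Sigma_1 + Sigma_0)^{-1} d for the 2 x 2 case
    q = (d1 * d1 - 2.0 * c * d1 * d2 + d2 * d2) / (2.0 * (1.0 - c * c))
    return PHI.cdf(sqrt(q)) - auc1


# Tabulate Delta-AUC against AUC-2 for a primary marker with AUC 0.76.
for c in (-0.5, 0.0, 0.5):
    row = [round(delta_auc(0.76, auc2, c), 3) for auc2 in (0.5, 0.6, 0.7, 0.8)]
    print(f"C = {c:+.1f}: {row}")
```

Under these assumptions the quadratic form exceeds that of the primary marker alone by (δ2 − Cδ1)²/(2(1 − C²)), so ΔAUC vanishes exactly where δ2 = Cδ1; this is the dip seen in the C > 0 curves.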
To illustrate how, in the C > 0 situation, a second marker with predictive ability of its own may fail to add anything to the AUC of the primary marker alone, suppose that the second marker is simply the primary marker plus some additive noise (eg, measurement error). Then, although the second marker has predictive value, it clearly cannot add any predictive value to the noiseless version of itself. Note in this (added noise) situation the two markers will necessarily be positively, but not perfectly, correlated in cases and in controls.
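The added-noise case can be checked with a small sketch, assuming the standard result that the optimal weights are proportional to (Σ1 + Σ0)⁻¹δ. The helper name and all numeric values (unit standard deviation for the primary marker, noise standard deviation 0.5, mean shift 1) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def optimal_weights(d1, d2, var1, var2, cov12):
    """Weights proportional to S^{-1} d for S = [[var1, cov12], [cov12, var2]].

    S is the sum of the case and control covariance matrices.
    """
    det = var1 * var2 - cov12 * cov12
    return (var2 * d1 - cov12 * d2) / det, (var1 * d2 - cov12 * d1) / det


# Hypothetical setup: marker 1 has SD 1 in both groups and mean shift d;
# marker 2 = marker 1 + independent noise with SD 0.5 (variance 0.25).
d, noise_var = 1.0, 0.25
var1 = 2 * 1.0                # Var(X1), summed over cases and controls
var2 = 2 * (1.0 + noise_var)  # Var(X1 + e), summed over both groups
cov12 = 2 * 1.0               # Cov(X1, X1 + e) = Var(X1), summed

w1, w2 = optimal_weights(d, d, var1, var2, cov12)
auc_marker1 = NormalDist().cdf(d / sqrt(var1))
auc_marker2 = NormalDist().cdf(d / sqrt(var2))
auc_combined = NormalDist().cdf(sqrt(d * w1 + d * w2))
print(w1, w2, auc_marker1, auc_marker2, auc_combined)
```

In this sketch the noisy copy has an AUC above 0.5 on its own, yet it receives weight 0 and the combined AUC equals that of the primary marker.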
Note also from the figure that in both the C > 0 and the C < 0 situations (but not with C = 0), a second marker with no predictive ability of its own (AUC-2 = 0.5) will necessarily add something to the AUC of the primary marker alone when optimally combined with it, a finding that may seem counter-intuitive. To help understand this phenomenon, consider predicting a person’s sex based on their own height (primary marker) and their father’s height (marker 2). Clearly, one’s father’s height is not at all predictive of sex, but it is correlated with one’s own height. Consider an individual who is 5 feet 10 inches tall. With that information alone, one would predict that the person is male, since there are many times more men than women of that height. However, suppose we had the additional information that the person’s father was 6 feet 6 inches tall. Such a man’s son would likely be much taller than 5′10″, whereas his daughter might well be 5′10″, which could shift the prediction to female.
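The height example can be mimicked with a small simulation. All numbers below (mean heights, standard deviations, and the 0.5 regression slope on the father's height) are hypothetical values chosen for illustration and are not data from this study; men are treated as "cases" and women as "controls".

```python
import random
from bisect import bisect_left


def empirical_auc(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case has the higher score."""
    controls = sorted(control_scores)
    wins = sum(bisect_left(controls, s) for s in case_scores)
    return wins / (len(case_scores) * len(controls))


def simulate(rng, n, mean_height):
    """Draw (own height, father's height) pairs in inches; parameters are hypothetical."""
    pairs = []
    for _ in range(n):
        father = rng.gauss(70.0, 3.0)
        own = mean_height + 0.5 * (father - 70.0) + rng.gauss(0.0, 2.6)
        pairs.append((own, father))
    return pairs


rng = random.Random(0)
men = simulate(rng, 2000, 70.0)    # "cases"
women = simulate(rng, 2000, 64.5)  # "controls"

auc_own = empirical_auc([o for o, _ in men], [o for o, _ in women])
auc_father = empirical_auc([f for _, f in men], [f for _, f in women])
# Subtracting half the father's height removes the inherited component;
# this is the optimal linear combination under these simulation settings.
auc_combined = empirical_auc([o - 0.5 * f for o, f in men],
                             [o - 0.5 * f for o, f in women])
print(auc_own, auc_father, auc_combined)
```

The father's height alone should give an AUC near 0.5, while the combination should exceed the AUC of own height alone, because the second marker explains variation in the primary marker that is unrelated to sex.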
All of the above results can be demonstrated and proven mathematically; this is described in the Appendix.
The effect of correlation on the ability of a second marker to add to the predictive ability of a primary marker can also be demonstrated with scatter plots. Figures 2A, 2B, and 2C show the situation with C = 0, C > 0, and C < 0, respectively. Compared to the no-correlation (C = 0) situation, the ability of the two markers to differentiate between cases and controls is substantially diminished when C > 0 and substantially enhanced when C < 0 (note that the univariate AUCs of the markers are the same in all three plots). Also shown is a fourth panel, where the correlation in cases is the same as in Figure 2C but the correlation in controls is zero; the degree of differentiation between cases and controls is a bit less here than in Figure 2C.
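The qualitative ordering seen in the scatter plots can be checked analytically. The sketch below uses the individual AUCs reported for Figure 2 (0.760 and 0.714) and correlations of 0, 0.7, and −0.7, and additionally assumes unit standard deviations in both groups, which the figure legend does not state. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def combined_auc(auc1, auc2, r_cases, r_controls):
    """AUC of the optimal linear combination of two markers (unit SDs assumed)."""
    d1 = sqrt(2.0) * PHI.inv_cdf(auc1)
    d2 = sqrt(2.0) * PHI.inv_cdf(auc2)
    # Entries of S = Sigma_cases + Sigma_controls when all variances are 1
    a = b = 2.0
    c = r_cases + r_controls
    q = (b * d1 * d1 - 2.0 * c * d1 * d2 + a * d2 * d2) / (a * b - c * c)
    return PHI.cdf(sqrt(q))


scenarios = {
    "r = 0.0 in both groups": (0.0, 0.0),
    "r = 0.7 in both groups": (0.7, 0.7),
    "r = -0.7 in both groups": (-0.7, -0.7),
    "r = -0.7 in cases, 0 in controls": (-0.7, 0.0),
}
aucs = {name: combined_auc(0.760, 0.714, r1, r0)
        for name, (r1, r0) in scenarios.items()}
for name, auc in aucs.items():
    print(f"{name}: combined AUC = {auc:.3f}")
```

Under these assumptions the combined AUC is lowest with positive correlation, higher with no correlation, higher again when only the cases are negatively correlated, and highest when both groups are negatively correlated.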
Figure 2 Scatter plots of marker 1 by marker 2 values for cases (blue triangles) and controls (red dots). Individual AUCs of marker 1 and 2 are 0.760 and 0.714, respectively. Correlation (in both cases and controls) is 0.0 in (A), 0.7 in (B) and −0.7 in (C).
Some of the mathematical results described above for two markers can be extended to three or more markers. For example, if markers B and C each add no value when linearly combined (optimally) with marker A, then B and C together will also add no value when linearly combined (optimally) with A.
We note here a technical consideration. Heretofore, we have been assuming that each marker is up-regulated in cases, ie, that it tends to have higher levels in cases than in controls. Mathematically, it can be shown that, for the purposes of assessing the increase in predictive value, positive correlation for a marker that is up-regulated is equivalent to negative correlation for a marker that is down-regulated and vice-versa (note we are assuming that the primary marker is up-regulated). Therefore, all of the conclusions above about negative and positive correlation are reversed if the secondary marker(s) is down-regulated in cases (or more generally, has the opposite regulation of the primary marker). Thus, if a second marker is down-regulated in cases, positive correlation of it with a primary (up-regulated) marker will lead to greater added AUC and negative correlation to a lesser added AUC. Since most cancer biomarkers are up-regulated in cases, we stated in the abstract and introduction that negative correlations are conducive to added predictive value; the caveat should be added that this is assuming that both markers have the same direction of regulation.
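The equivalence can be seen in the two-marker case from the standard binormal expression AUC = Φ(√Q), which we assume matches the Appendix formulas. The symbols Q, a, b, c, and δj (mean difference of log marker j, cases minus controls) are notation introduced here for illustration.

```latex
\begin{aligned}
Q &= \frac{b\,\delta_1^{2} - 2c\,\delta_1\delta_2 + a\,\delta_2^{2}}{ab - c^{2}},
\qquad
a = \sigma_{11}^{2} + \sigma_{01}^{2},\quad
b = \sigma_{12}^{2} + \sigma_{02}^{2},\quad
c = \sigma_{11}\sigma_{12}\,r_1 + \sigma_{01}\sigma_{02}\,r_0 = A\,C, \\
(\delta_2,\; r_1,\; r_0) &\;\longmapsto\; (-\delta_2,\; -r_1,\; -r_0)
\quad\Longrightarrow\quad
c \mapsto -c,\qquad c\,\delta_1\delta_2 \mapsto c\,\delta_1\delta_2,\qquad Q \mapsto Q .
\end{aligned}
```

Replacing marker 2 by its negative turns a down-regulated marker into an up-regulated one and reverses the sign of both correlations, hence of C, while leaving Q and therefore the combined AUC unchanged.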