We may be able to better understand what is happening in borrowing strength by adopting a Bayesian,6
multilevel modeling perspective. (Though there are important differences between the original James-Stein method and the Bayesian one, we chose this approach because it helps us develop a more intuitive understanding of the material; other approaches exist as well.) We know that at the end of the season each baseball player will have a particular batting average which we will assume (as Efron and Morris did) is the true value of interest. However, while we are still collecting data, the batting averages will continue to fluctuate: If a particular player is on a hot streak, his average may shoot higher than his end-of-season batting average. On the other hand, if he is facing a series of top-notch pitchers, his average thus far may underestimate his eventual season performance. This is the first level of the model—the individual players.
The lower half of Figure 1 shows the first level of this model for three players: one player (the center distribution) is fairly average in his batting ability, one (on the right) is above average, and one (on the left) is below average. We have indicated the players’ true averages with the lettered ticks (a–c). The data collected from each player at the beginning of the season give us one observed point (i.e., one observed batting average); these are labeled in the graphs with the filled shapes. Note that the distribution itself shows us the probability of observing a particular batting average early in the season given the underlying true batting average. Remember, though, that the distributions are constructed around points (a–c) that we cannot observe or have not yet observed. This should seem very similar to flipping three strange coins, where a–c would represent the unknown (true) probability of heads for a particular coin, and a filled shape would indicate the percentage of heads in the first ten tosses. The widths of the distributions are the same, indicating that we have a roughly equal amount of information available for each player.
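The first level of this model can be sketched with a short simulation. The true averages below are hypothetical stand-ins for players a–c, and the 45 at-bats is an assumed early-season sample size (the one Efron and Morris used), not a value taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true (end-of-season) batting averages for the three
# players sketched in Figure 1: below average, average, above average.
true_averages = {"player_a": 0.220, "player_b": 0.265, "player_c": 0.310}

n_at_bats = 45  # assumed early-season sample size

# Level 1: each observed early-season average is one binomial draw
# around the player's unobservable true average.
observed = {
    name: rng.binomial(n_at_bats, p) / n_at_bats
    for name, p in true_averages.items()
}
```

Each entry of `observed` plays the role of one filled shape in the figure: a single noisy draw from a distribution centered on a truth we never see directly.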
Figure 1 Two-level, Hierarchical Model of Individual Baseball Players (Level 1) and the Players as a Group (Level 2). Note that in keeping with standard notation, we have labeled the individual level ‘1’ and the group level ‘2.’
The upper half of Figure 1 shows the second level of the model. This distribution describes, collectively, all eighteen batters of interest to us. It is centered around the average ability of the group, and its width corresponds to how variable the batting ability is within the group. This distribution is thus made up of a bunch of (relatively) fair individual batters, as well as some (relatively) excellent and poor ones.
A link exists between the first and second levels because the observed batting averages for players in the first level are the points that we have available to us for defining the group distribution in the second level. While there exists a theoretical distribution which appropriately captures the mean and variance of the players’ abilities (which is drawn in the second level of Figure 1), we cannot directly observe the parameters that define that distribution. Instead, we have only the observed batting averages available to us. The question then becomes, how do we best determine the parameters (i.e., mean and variance) of the group (level 2) distribution?
We may, for instance, be willing to make a rough assumption about the shape of the group performance distribution (e.g., that it is bell shaped), but we may not feel that we can make an educated guess about where that distribution should be centered (its mean) or how wide it should be (its variance). In order to learn about our group distribution, we find the values for the mean and variance that best explain the entire collection of observed batting averages. At the beginning of the baseball season, though we have only some of the information that will eventually be available to us, we can still find the best parameters for the group distribution. The specific approach used to pick the group mean and variance depends on the statistical tack being used. When conducting an empirical Bayesian analysis,7
the observed group mean and variance are used to define the theoretical level 2 distributions while, in a fully Bayesian analysis, prior information and Bayes theorem are used. With the James-Stein approach (which is not Bayesian), we do not explicitly consider the group distribution; rather we derive a shrinkage factor based on the individual values, the overall average, and the overall variance.
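The James-Stein calculation just described can be sketched concretely. This is a simplified version that assumes every observed average has the same known sampling variance `sigma2` (in practice Efron and Morris worked with a variance-stabilizing transformation, which we skip here), and it uses the common positive-part variant of the shrinkage factor:

```python
import numpy as np

def james_stein(y, sigma2):
    """Shrink each individual estimate toward the grand mean.

    y      : observed individual averages (assumed equal, known variance)
    sigma2 : common sampling variance of each observed average
    """
    y = np.asarray(y, dtype=float)
    k = len(y)
    grand_mean = y.mean()
    ss = np.sum((y - grand_mean) ** 2)
    # Shrinkage factor built from the individual values, the overall
    # average, and the overall variance; (k - 3) reflects that the
    # grand mean itself is estimated from the data.
    c = max(0.0, 1.0 - (k - 3) * sigma2 / ss)  # positive-part variant
    return grand_mean + c * (y - grand_mean)
```

Note that no group-level distribution appears anywhere in the function: as the text says, the James-Stein approach works directly from the individual values, their overall average, and their spread.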
Now that we have constructed our first and second levels and shown how they are linked, we can describe conceptually how shrinkage works. To correct a particular player’s batting average for early season streaks, we first take into account the data generated thus far by that player. We then look at where that player appears to be in terms of the group distribution. Since it is unlikely that we would find a player many standard deviations away from the mean, we suspect that a player whose early data are quite extraordinary is on a streak that is not entirely representative of where he will end up, especially if he has had relatively few at-bats so far. The closer the data are to the center of our group distribution, the less suspicious we are. The correction comes from a weighted average of the player’s individual data and the group mean, with the weights determined by how narrow we feel the group distribution appears to be: The narrower the group distribution appears to be, the more we will pull the data points toward the center, because players have to be less far from the mean to be considered exceptional.
Considered from a slightly different perspective, the shrinkage correction occurs because we believe that all the players’ means come from a single distribution (the one at level 2). The information we gain about 17 players gives us some insight into how the 18th will perform—it is unlikely, though possible, that his performance will be considerably different from theirs. If, for instance, the pack of 17 players seems to be performing particularly well, then we are more willing to believe (unless we see sufficient contrary data) that the 18th player is also performing well. And how well the group is performing is what determines the location (mean) of the level 2 distribution.
As mentioned earlier, we add an additional level of complexity when we recognize that the individuals’ distributions also have associated variances or uncertainties. This is easier to comprehend in terms of information—the more information available about a single individual, the smaller the uncertainty or variance and the narrower the distribution. How far we shrink each player’s individual average toward the group mean should also take into account these information (or inverse variance) weights. If we have a lot of information available for a particular individual, we have less reason to worry that his exceptional performance (good or bad) represents primarily a random fluctuation; we would therefore not want to push his estimated batting average as much toward the center of the group distribution. In other words, the batting average in a series of many at-bats (as opposed to few at-bats) would more likely approach the true batting average for that player, reducing the need for correction toward the group mean. Similarly, when we calculate the group mean, we should weight more heavily the averages of players with more at-bats.
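One way to sketch this information weighting in code is below. The binomial approximation for each player's sampling variance and the fixed between-player variance `tau2` are illustrative assumptions (in a real analysis `tau2` would itself be estimated), not the article's specific method:

```python
import numpy as np

def precision_weighted_shrinkage(y, n, tau2):
    """Shrink each player's average toward a precision-weighted group mean.

    y    : observed batting averages
    n    : at-bats behind each average (more at-bats -> less shrinkage)
    tau2 : assumed between-player variance (width of the level 2 distribution)
    """
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    # Within-player sampling variance of each average (binomial approximation);
    # more at-bats means a smaller variance, i.e., more information.
    s2 = y * (1.0 - y) / n
    # Group mean weighted by the information (inverse variance) in each average,
    # so players with many at-bats count for more.
    w = 1.0 / (s2 + tau2)
    group_mean = np.sum(w * y) / np.sum(w)
    # Fraction of each individual estimate kept; the rest is pulled to the mean.
    b = tau2 / (tau2 + s2)
    return b * y + (1.0 - b) * group_mean
```

A player with 200 at-bats keeps most of his own average, while a player with 10 at-bats and the same average is pulled much closer to the group mean, exactly the behavior described above.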
The technique we described above requires inferring the parameters for the group distribution using only information from the individuals who comprise it, an approach known as empirical Bayes estimation.7
In a fully Bayesian approach, we would make a prior educated guess about the group (level 2) distribution before
considering the accumulating players’ data, and describe our prior beliefs by introducing a third level to the model in Figure 1, from which the parameters for the level 2 distribution would be sampled. Whether empirical or fully Bayes, the players’ individual data can be thought of as being pulled toward the center; the only difference is whether that center is determined using the data alone (empirical Bayes) or using both the data and prior beliefs (fully Bayes).8
Further, the closer the center is to the true center (whether derived empirically or with prior information), the more efficient these methods will be in improving the estimates.
With James-Stein estimation, if we are considering three or more independent level 1 estimates, the expected total mean squared error (across all the estimates) is guaranteed to be lower than if we were to use the naïve estimates. With four or more level 1 estimates (and the assumption of exchangeability in place of the stronger independence assumption), this expected error is reduced further using empirical Bayes estimation. (Although counterintuitive, the James-Stein method is valid with any independent groups, so that estimating the cost of various teas in China may enjoy reduced error if the batting averages are considered in the same problem. Of course, the more different these groups are, the smaller the reduction in expected error will be.)
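A small Monte Carlo experiment can illustrate this guarantee. The number of quantities, their true means, and the repetition count below are arbitrary choices made for the sketch; the point is only that the total squared error of the shrunk estimates comes out lower than that of the naïve estimates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true means for k = 10 unrelated quantities; the James-Stein
# guarantee only requires three or more independent estimates.
true_means = rng.normal(0.0, 1.0, size=10)
sigma2 = 1.0  # known sampling variance of each naive estimate

naive_sse, js_sse = 0.0, 0.0
for _ in range(5000):
    y = rng.normal(true_means, np.sqrt(sigma2))       # naive estimates
    grand = y.mean()
    ss = np.sum((y - grand) ** 2)
    c = max(0.0, 1.0 - (len(y) - 3) * sigma2 / ss)    # positive-part factor
    z = grand + c * (y - grand)                        # shrunk estimates
    naive_sse += np.sum((y - true_means) ** 2)
    js_sse += np.sum((z - true_means) ** 2)
```

Across the repetitions, `js_sse` falls below `naive_sse` even though the ten quantities have nothing to do with one another, mirroring the teas-in-China point above.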
All of the techniques mentioned thus far, whether James-Stein or Bayesian, demonstrate the principle of bias-variance tradeoff.8
When we shift the original estimates toward a specific point to reduce the squared error, we are introducing bias in the sense that we can no longer expect that our estimates, on average (i.e., across multiple repetitions of the experiment), are equally likely to fall on either side of the true value. While this may seem statistically disconcerting, from a clinical perspective the tradeoff is generally worth it to achieve more accurate estimates. To reiterate an earlier point, we choose to use shrinkage not because it guarantees better results for estimating any particular baseball player’s true average (it doesn’t), but rather because as a strategy it is likely to yield more correct estimates overall than if we were to use the naïve (non-shrunk) estimates.
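The tradeoff described above is the standard decomposition of mean squared error for an estimator $\hat{\theta}$ of a true value $\theta$:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{\theta}] - \theta\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{variance}}
```

Shrinkage deliberately makes the first term nonzero, but it reduces the second term by more, so the total expected error falls.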