For each grant proposal reviewers from the relevant scientific community are asked to report their evaluations within a pre-defined scale. The average grade obtained through this process is considered a valid estimate of the “true” value of the proposal.

The survey sample size is a crucial parameter in determining whether we can rely on these mean estimates. Elementary sampling techniques give us the minimum number of respondents that are needed for the evaluation procedure to deliver reliable estimates:

In expression (1), n is the minimum required sample size or number of evaluators. Z

_{α/2} is the upper percentile of the standard normal distribution. For a 95% confidence interval and an alpha (type I error, i.e. the probability of rejecting the null hypothesis when it is true) of .05, Z

_{α/2} is equal to 1.96. The parameter σ represents the underlying standard deviation. Finally, L indicates the desired half-width of the interval between two consecutive evaluations or the precision of the evaluation.

There are two important implications of this equation. First, the inverse correlation between n and L indicates that more reviewers are needed to obtain a more fine-grained or precise evaluation. Moreover, this relation is exponential so that greater precision comes with an increasingly greater number of reviewers.

Second, typically the standard deviation σ of a population is not observed and needs to be estimated. Since the data necessary to estimate σ for the review of biomedical research proposals have not been collected in a statistically robust sampling system, we have relied on a model system of peer review with short movie proposals reviewed on a scale from 1 to 5 by undergraduate students [Lacetera, Kaplan, Kaplan, submitted]. We used short movie proposals in order to increase the potential sample size since all undergraduate students could be considered expert enough to grade the proposals. In this study 10 proposals were scored by an average of 48 reviewers. The average standard deviation was approximately 1.0 with a standard deviation considerably less than 0.1. Therefore, we estimate σ to be equal to 1. Obviously, a more accurate estimate of the standard deviation can eventually be obtained for each form of application requested by NIH, although it should be clear that a large number of independent evaluators is required to make any estimate of σ reliable.

Using equation (1), we can assess the effect of having 4 reviewers for each proposal. With four reviewers and a standard deviation of 1, the review would be expected to distinguish applications at the level of the unit interval:

Thus, four reviewers would be able to distinguish among whole integer scores.

Yet, in the evaluation of grant proposals NIH currently uses a 41-grade scale with a range of scores from 1.0 to 5.0

[3]. Moreover, these scores are averaged to yield a score with 3 significant figures instead of 2

[3]. It is this number, inappropriately expanded to 3 significant figures by averaging, that is used by NIH in their scoring decisions. Although NIH does not explain the rationale for the conversion of their scores to 3 significant figures, with 80,000 applications per year it seems likely that the NIH peer review system needs that level of precision to facilitate their making choices close to the funding line. As a consequence to the use of scores with 3 significant figures, differences as small as 0.01 are used in making funding decisions. Nevertheless, in order to obtain reliable scores with a precision level of 0.01, an unrealistically large number of reviewers would be needed:

Expression (3) implies that, in order for a mean score of 3.56 to be taken as reliable and therefore as identifying a better, more promising proposal than one receiving a rating of 3.57, the evaluation of almost 40 thousand referees would need to be obtained.

In the exponential relationship between the number of reviewers and the precision of the ratings that would provide reliable estimates of the mean is shown. On the x-axis, smaller numbers indicate higher precision. Even for a precision level of 0.1, as many as 384 reviewers would be required.

The disconnect between the needed precision in order to allocate funds in a fair way and the number of reviewers required for this level of precision demonstrates a major inconsistency that underlies NIH peer review. With only four reviewers used for the evaluation of applications, an allocation system that requires a precision level in the range of 0.01 to 0.1 is not statistically meaningful and consequently not reliable. Moreover, the 4 reviewers NIH proposes are not independent which degrades the precision that could be obtained otherwise.

Consequently, NIH faces a major challenge. On the one hand, a fine-grained evaluation is mandated by their review process. On the other hand, for such criterion to be consistent and meaningful, an unrealistically high number of evaluators, independent of each other, need to be involved for each and every proposal.

Further insights can be derived from the analysis of expression (1). The value of σ is a measure of the underlying variability in the ratings. The minimum number of reviewers for any given degree of ratings precision decreases with decreasing standard deviations. The standard deviation across ratings is also an indicator of the degree of agreement among different reviewers. If the standard deviation is small, for instance equal to 0.01 instead of our previous working estimate of 1.0, there is essentially consensus among the referees. If σ

=

0.01, then the following relation holds:

Therefore, 4 independent evaluators can provide statistical legitimacy only under the circumstance of all evaluators giving essentially the same evaluation. For proposals that are expected to be more controversial, as potentially transformative ideas have been proposed to be

^{5}, a small number of evaluators would lead to unreliable mean estimates.

Our estimate of σ is not based on an analysis of biomedical research experts judging research projects close to their area of specialty. Scoring standard deviations for large numbers of experts obtained in a statistically acceptable sampling system have not been collected. Instead, as described above, we have used a model system that has allowed us to readily collect opinion data about proposals with undefined potential. Although we believe our estimate is reasonable, it is informative to visualize how the sample size estimate varies with different values of standard deviation for a level of precision of 0.1 (). It is evident that small sample sizes are able to provide levels of precision only when the standard deviation is exceptionally small. We used a level of precision of 0.1 because the NIH peer review system mandates scoring at this precision level. For greater levels of precision, as suggested by the conversion from 2 significant figures to 3, the increase in sample size is steeper with increasing standard deviation.

The importance of scoring accuracy ultimately relates to the rank ordering of proposals. In our model system there were 5 movie proposals with mean scores ranging from 3.46 to 3.64. We have analyzed how the rank ordering of these 5 proposals varied as reviewers were randomly included in the analysis from 1 to 40 reviewers (). What is most striking in these graphs is the extreme variability in the rank ordering with low numbers of reviewers. For instance, the upper-left and lower-left panels of show proposals that had relatively good rankings with less than 10 reviewers but that ended with relatively poor rankings with over 30 reviewers. Conversely, the lower-right panel shows a proposal that began with poor rankings but settled at with the best ranking after 25 reviewers. Even the addition of 1 reviewer can markedly change the rank ordering of the proposals and consequently the funding decision. This effect is especially apparent when there are few reviewers. The number of reviewers has profound implications in terms of the actual funding decisions that are eventually made.