Peer review is now the principal mechanism for selecting grant applications for funding [1], [2]. In this process, inter-reviewer agreement is important for ease of application ranking. Both Wiener *et al.* [3] and Hartmann *et al.* [4] found high inter-reviewer agreement in rating proposals. Green *et al.* [5] demonstrated that the rating intervals of the scale (0.5 or 0.1) did not influence the final assessment. Nevertheless, reviewers still disagree about some proposals because of differing scientific backgrounds, differing perceptions of the proposal, or undeclared conflicts of interest. Proposals with discordant peer-review ratings need to be discussed before a global ranking of proposals can be established. We propose a simple method to help selection committees identify proposals that require discussion because of lack of agreement in peer reviews.

Example

Let us consider the example of 20 proposals submitted to a fictitious funder and assessed by 3 reviewers. Ratings are displayed in Table 1, and for each proposal we have estimated the intra-proposal mean rating and standard deviation. Disagreement among ratings translates into a high intra-proposal standard deviation, as for proposals 3, 14, 19, 20 and 15, for example.

**Table 1.** Fictitious example of a number of proposals submitted for funding and rated by 3 raters, for application of the formula by Giraudeau *et al.* [9] to identify proposals with discordant peer-review ratings (see appendices).
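The per-proposal summaries just described can be sketched as follows. The ratings are illustrative stand-ins (the actual Table 1 values are not reproduced here); proposal 3 is chosen so that its mean (13.9) and standard deviation (about 4.2) match the figures quoted later in the text.

```python
import statistics

# Illustrative ratings on a 0-20 scale, 3 reviewers per proposal
# (assumed values -- not the paper's Table 1).
ratings = {
    1: [14.0, 15.5, 15.5],
    2: [10.5, 11.0, 11.8],
    3: [18.0, 14.0, 9.7],   # discordant: reviewers disagree widely
}

# Intra-proposal mean rating and standard deviation, as in Table 1.
summary = {}
for proposal, scores in ratings.items():
    summary[proposal] = (statistics.mean(scores),    # intra-proposal mean
                         statistics.stdev(scores))   # intra-proposal SD

for proposal, (mean, sd) in summary.items():
    print(f"proposal {proposal}: mean = {mean:.1f}, sd = {sd:.1f}")
```

Disagreement among reviewers shows up directly as a large intra-proposal standard deviation for proposal 3.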

A simplistic approach

A simple way to identify proposals with discordant peer-review ratings would be to specify a ceiling intra-proposal standard deviation: each proposal with an intra-proposal standard deviation greater than this ceiling value would be considered as having discordant peer-review ratings. Nevertheless, such an approach would have two limitations. First, the ceiling standard deviation would depend heavily on the rating scale (and would therefore differ for each funder). Second, the ceiling standard deviation should be fixed relative to the inter-proposal heterogeneity rather than set as an absolute value. Thus, in our example, if we consider the proposal rating means (i.e., the series 15.0, 11.1 … 13.9 in Table 1), the inter-proposal standard deviation is estimated at 2.3. An intra-proposal standard deviation of 3 or 4 would then be unacceptably high, but would not be had the estimated inter-proposal standard deviation been around 5.
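The relative-ceiling idea can be sketched as follows. Both the ratings and the particular rule (flag a proposal when its intra-proposal standard deviation exceeds the inter-proposal standard deviation of the means) are illustrative assumptions, not values or thresholds prescribed by the text.

```python
import statistics

# Illustrative ratings (assumed -- not the paper's Table 1). The ceiling
# rule below, flagging a proposal whose intra-proposal SD exceeds the
# inter-proposal SD of the mean ratings, is one possible way to make the
# threshold relative rather than absolute.
ratings = {
    1: [14.0, 15.5, 15.5],
    2: [10.5, 11.0, 11.8],
    3: [18.0, 14.0, 9.7],
}

means = [statistics.mean(s) for s in ratings.values()]
inter_sd = statistics.stdev(means)  # heterogeneity across proposals

flagged = [p for p, s in ratings.items() if statistics.stdev(s) > inter_sd]
print(f"inter-proposal SD = {inter_sd:.1f}; flagged proposals: {flagged}")
```

With these values only proposal 3 is flagged; with a wider spread of proposal means, the same intra-proposal standard deviation would pass unflagged, which is the point of a relative ceiling.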

Underlying concept of the proposed approach

Considering that the underlying question of our research is agreement, we focus on the intraclass correlation coefficient (ICC), the parameter usually assessed for continuous outcomes [6]. This coefficient is defined as the ratio of the inter-subject variance (here the inter-proposal variance) to the total variance (here the inter-proposal variance plus the intra-proposal variance). Thus, the ICC theoretically varies between 0 and 1 [7], where 0 indicates a total lack of agreement among ratings and 1 indicates perfect agreement with no intra-proposal variance. In our example the ICC is estimated at 0.366 (using the ANOVA estimator, in the absence of an explicit maximum-likelihood estimator when the number of ratings per proposal varies [8]), which can be interpreted as 36.6% of the total variation being due to inter-proposal variability (i.e., the “true” variability) and 63.4% to lack of agreement among reviewers.
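For a balanced design (a fixed number *k* of ratings per proposal), the one-way ANOVA estimator of the ICC can be sketched as below. The ratings are illustrative, so the resulting estimate differs from the 0.366 reported for the full 20-proposal example.

```python
import statistics

# One-way ANOVA estimator of the ICC for a balanced design:
# ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k ratings per proposal.
# Ratings are illustrative (assumed -- not the paper's Table 1).
ratings = [
    [14.0, 15.5, 15.5],
    [10.5, 11.0, 11.8],
    [18.0, 14.0, 9.7],
    [16.0, 16.5, 17.0],
]
k = 3  # raters per proposal

means = [statistics.mean(r) for r in ratings]
msb = k * statistics.variance(means)                            # between-proposal mean square
msw = statistics.mean(statistics.variance(r) for r in ratings)  # within-proposal mean square

icc = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC = {icc:.3f}")
```

A large within-proposal mean square (driven here by the discordant third proposal) inflates the denominator and pulls the ICC down, mirroring the interpretation in the text.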

Giraudeau *et al.* [9] derived an analytical formula that assesses the influence of a subject (here, a proposal) on the estimate of the ICC (Appendix S1). For a given proposal (named *i*_{0} for convenience), this influence is actually the sum of two antagonistic effects: a positive effect, related to the *i*_{0} mean rating (the ICC would be high with a very low [or very high] mean rating for a proposal), and a negative effect, related to the variance of the *i*_{0} ratings (the ICC would be low with high heterogeneity of ratings). Giraudeau *et al.* developed an explicit formula in the balanced case (i.e., with a common fixed number of ratings per proposal), but this formula still accurately approximates the influence of a proposal in the unbalanced case (i.e., when the number of peer-review ratings varies among proposals) (Appendix S2). In our example, if we focus on proposal 3, the first term (effect) is estimated as 0.0134 and the second term as −0.0618 (Table 1). Because this proposal has a mean rating not very different from the global mean (i.e., 13.9 *vs* 15.7), the first term is small. In contrast, because of the disagreement in ratings for this proposal, its intra-proposal standard deviation is estimated as 4.2 and the second term is high in absolute value. If this proposal were discarded from the sample, the re-estimated ICC would be 0.415, which is derived from 0.366 (the whole-sample ICC estimate) minus 0.0134 (the positive effect of the mean rating) minus −0.0618 (the negative effect of the intra-proposal standard deviation).
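The analytical decomposition itself is given in Appendix S1 and is not reproduced here. As a numeric stand-in, the influence of each proposal can be approximated by simply re-estimating the ICC with that proposal left out, as in the leave-one-out sketch below (illustrative ratings, balanced design).

```python
import statistics

def icc_anova(ratings, k=3):
    """One-way ANOVA ICC estimator for a balanced design (k ratings per subject)."""
    means = [statistics.mean(r) for r in ratings]
    msb = k * statistics.variance(means)
    msw = statistics.mean(statistics.variance(r) for r in ratings)
    return (msb - msw) / (msb + (k - 1) * msw)

# Illustrative ratings (assumed -- not the paper's Table 1); the third
# proposal is the discordant one.
ratings = [
    [14.0, 15.5, 15.5],
    [10.5, 11.0, 11.8],
    [18.0, 14.0, 9.7],
    [16.0, 16.5, 17.0],
]

icc_all = icc_anova(ratings)
influences = []
for i in range(len(ratings)):
    loo = ratings[:i] + ratings[i + 1:]
    # Positive change: removing proposal i raises the ICC, i.e. the
    # proposal was pulling agreement down (a discussion candidate).
    influences.append(icc_anova(loo) - icc_all)

for i, infl in enumerate(influences, start=1):
    print(f"proposal {i}: leave-one-out ICC change = {infl:+.3f}")
```

The discordant proposal produces by far the largest positive leave-one-out change, which is the same signal the analytical formula extracts without refitting, and which a selection committee can use to decide which proposals to discuss.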