The legitimacy of the approval procedure at funding agencies for basic research depends strongly on whether the reliability, validity, and fairness of the procedure are guaranteed. Quality control undertaken by peers in the traditional peer review of grant proposals is essential in most research funding organizations for establishing valid and evidence-based approval decisions by the board of trustees. According to Marsh, Jayasinghe, and Bond, one of the most important weaknesses of the peer review process is that the ratings given to the same proposal by different reviewers typically differ. This results in a lack of inter-rater reliability (IRR). Cicchetti defines IRR as “the extent to which two or more independent reviews of the same scientific document agree” (p. 120). Overviews of the literature on the reliability of peer reviews for grant applications come to conclusions similar to those for journal peer review: there is, on average, a low level of IRR.
To calculate IRR in the case of continuous ratings, intraclass correlations (ICCs) are often used; roughly speaking, the ICC is defined as the ratio of the variance of the mean ratings across all reviewers of a grant proposal to the total variance across all reviewers’ ratings of a proposal.
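This variance-ratio definition can be sketched with a small simulation (a minimal illustration only, not the estimation procedure used in this study; the simulated data and the classical one-way random-effects estimator are assumptions chosen for clarity):

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC for a (proposals x reviewers) matrix.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the mean
    square between proposals, MSW the mean square within proposals,
    and k the number of reviewers per proposal.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    msb = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

rng = np.random.default_rng(42)
n_proposals, n_reviewers = 500, 2
true_quality = rng.normal(0.0, 1.0, size=n_proposals)            # proposal's "true" quality
noise = rng.normal(0.0, 1.0, size=(n_proposals, n_reviewers))    # reviewer disagreement
ratings = true_quality[:, None] + noise

# With equal quality and noise variances, the population ICC is 0.5;
# the estimate should fall close to that value.
print(round(icc_oneway(ratings), 2))
```

A high ICC means that most rating variance lies between proposals (reviewers agree); a low ICC means that reviewers of the same proposal disagree almost as much as reviewers of different proposals.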
Studies on the IRR of peer reviews mostly report average values across all submitted proposals. But this can lead to biased estimates of the actual IRR if reviewers’ ratings are not homogeneous. Reviewers in some scientific disciplines may, on average, rate proposals more strictly than reviewers in other disciplines do (heterogeneity with respect to the mean). According to Marsh et al., reviewers’ ratings are affected by a number of covariates called bias factors – including the applicant’s gender, the reviewer’s gender, and the grant sum requested – that have nothing to do with the quality of a proposal. As a result, the variance of the mean ratings as well as the total variance can also be explained by these covariates.
Over and above that, properties of grant proposals can also affect the total variance of reviewers’ ratings (heterogeneity with respect to the variance). For instance, it can be supposed that reviewers’ ratings vary more greatly in the humanities and social sciences than in the natural sciences. This may be due to the lack of uniform evaluation standards or to more greatly varying quality of the proposals. Mallard et al. pointed out, “noting that evaluators focus on the intellectual merits of proposals or articles provides little leverage for analyzing procedural fairness when conflicting criteria are used to define intellectual merit, as is generally the case in the social sciences and humanities” (p. 577). A greater variance of ratings in certain scientific disciplines can, but need not, be accompanied by a lower ICC. If the IRRs in the humanities and social sciences are found to be comparable to the IRRs in the natural sciences, as Jayasinghe et al. found, the higher variability of reviewers’ ratings in the humanities and social sciences is due to the more greatly varying quality of the proposals in these disciplines. Including further covariates (such as the grant sum requested or the time point of the final approval decision) makes it possible to determine the specific combination of conditions that leads to differences in the variability of reviewers’ ratings. In addition to the heterogeneity of ratings with regard to the mean and variance, proposals can also differ with regard to the ICC itself (heterogeneity with respect to the ICC). Thus, the ICCs can vary with regard to various covariates, as Marsh et al. found.
As this overview of studies on reliability shows, when examining the IRR of peer reviews using ICCs, it is necessary to include all three components (mean, variance, ICC) in the statistical analysis to obtain reliable information about the IRR.
In this study, we will determine the ICCs controlling for the specific bias factors. This means that the ICCs are calculated on condition that all proposals have the same values of the included covariates.
The generalized estimating equations (GEE) approach (especially its further development by Yan and Fine) makes it possible to model the heterogeneity of the ICC statistically with a set of covariates while simultaneously considering the heterogeneity of variances and the impact of bias factors. However, empirical analysis of the heterogeneity of reviewers’ ratings requires a large database. We decided to conduct an empirical study of the heterogeneity of ICCs and its multiple determinants, taking as an example the Austrian Science Fund (FWF). The data consisted of all FWF proposals generated by the FWF review procedure between 1999 and 2009; all scientific disciplines were represented in the database. This makes it an ideal database for the purpose of this study.
In the following, the FWF will be described in more detail and the research questions presented. The data on which the analysis was based will be characterized and the statistical approach explained. The results are then reported and discussed.
The Austrian Science Fund (FWF)
The FWF is Austria’s central funding organization for basic research. The body responsible for funding decisions at the FWF is the board of trustees, made up of 26 elected reporters and 26 alternates 
For each grant application, the FWF obtains at least two international expert reviews; the number of reviewers depends on the amount of funding requested. The expert review consists (among other things) of an extensive written comment and a rating providing an overall numerical assessment of the application. At the FWF board’s decision meetings, the reporters present the written reviews and ratings of each grant application. The FWF does not enforce any quotas or specific budgets for individual scientific disciplines; as a result, applications from all fields and disciplines compete with one another at the five decision meetings held each year. In the period under study here (1999 to 2009), the approval rate of proposals was 44.2%. From 1999 to 2004, the approval rate dropped continuously, from 53.4% in 1999 to 36.2% in 2004; after that, it increased slightly, to 42.9% in 2008, but dropped again to 32.2% in 2009. The year 2004 thus represents a turning point in the development of the approval rate over time. One reason for this development is that in the years 2002 to 2004, the number of grant applications and the grant sum requested exceeded the funding budget, so that the approval rate dropped from 49.2% to 36.2%. With a low approval rate, a proposal had to achieve a higher mean reviewers’ rating to be approved.
Specifically, our paper addresses the following four research questions (in parentheses: the corresponding component in the GEE framework):
- How reliable are the reviewers’ single ratings of the quality of the projects (that is, the overall evaluation of the proposed research by a single reviewer)?
- Is the ICC homogeneous across all proposals, or does it vary with certain characteristics of proposals or reviewers (intraclass correlation)?
- Is the total variability of reviewers’ ratings equal for all proposals, or does it vary with certain characteristics of proposals or reviewers (variance)? Does the ICC change if variance heterogeneity is considered?
- Is there any impact of covariates on reviewers’ overall ratings of a proposal (mean)? Does the ICC change if this impact is permitted?