We found evidence that biomedical journal peer review largely succeeds in selecting high impact articles for publication and rejecting lower impact articles, but the process is far from perfect. While 71% of submissions are correctly classified, 29% are not: some accepted articles have lower than average impact, and some rejected articles have higher than average impact. This rate of successful sorting of submissions is nearly identical to that seen in a study of peer review in a high impact chemistry journal. We found that raters had good internal consistency in the ratings they gave in the 6 quality domains and good agreement between these ratings and their recommendation regarding publication, but low inter-rater reliability. This is similar to findings from previous studies. However, we found that the editors' decisions regarding publication were fairly accurate in discriminating high from low impact articles.
There are several possible explanations for this finding. First, editors commonly solicit reviewers with different backgrounds and perspectives. For example, an article using qualitative methods about patient-doctor communication may prompt the editor to obtain a review from a reviewer with expertise in qualitative methods and another from an expert in patient-doctor communication. Two experts looking at the same paper from different perspectives may well rate the article differently and make divergent recommendations. Second, we had no assessment of the quality of each of the reviews. Review quality varies widely from reviewer to reviewer, and this could contribute to the lack of agreement. It is uncertain whether two highly rated reviewers would have better agreement rates. One study found that there was low agreement between the editor's decision and reviewer recommendations regarding publication, even among reviews that were rated as high quality. It is also possible that different reviewers value some article traits more highly than others: some may emphasize clarity of writing, others the timeliness or originality of the material. It is therefore not surprising that we, like others, have found low inter-rater agreement among the reviewers of scientific articles.
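To make the notion of low inter-rater agreement concrete, the following is a minimal sketch of one common chance-corrected agreement statistic, Cohen's kappa, applied to two reviewers' recommendations. It is an illustration only: the text above does not state which agreement statistic was used in this study, and the recommendation lists below are hypothetical, not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of submissions on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical recommendations from two reviewers on the same 8 submissions.
reviewer_1 = ["accept", "reject", "accept", "reject",
              "accept", "reject", "accept", "reject"]
reviewer_2 = ["accept", "accept", "reject", "reject",
              "accept", "reject", "reject", "accept"]

raw_agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

In this invented example the reviewers agree on half of the submissions, which is exactly what chance alone would produce given how often each recommends acceptance, so kappa is zero. Raw percent agreement can therefore overstate how much two reviewers actually concur.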
What is notable is that from this morass of conflicting advice comes a decision that fairly accurately discriminates high from low impact articles. While editors are clearly influenced by reviewers' recommendations, they appear to synthesize the comments and ratings and arrive at decisions that are more accurate than would be suggested by the weak relationship between individual reviewer quality ratings or recommendations and article impact.
There are a number of limitations to our study. First, an alternative explanation for the internal consistency of reviewer ratings is a halo effect, in which a rater might tend to assign the same number for all quality domains assessed. While this could partially explain the Cronbach's alpha for the six quality domains, it would not explain the consistency of the relationship between quality ratings and the specific recommendation made. Second, an alternative explanation for the finding that rejected articles have lower impact is that there is a natural selection that occurs as authors decide where to submit their articles. The typical submission pattern is for authors to submit first to higher impact journals and then to lower impact ones. While it is likely that this bias contributes to our findings, it is probably not as strong a factor for a journal like the Journal of General Internal Medicine (JGIM), with a modest impact factor, as it would be for a more highly rated journal. Third, an alternative explanation for the incremental increase in citation rates from articles rejected without review to those rejected after review is that the authors, in their submission to another journal, incorporated the advice they received from the JGIM reviewers and editors. While it is possible that this accounts for some of the difference in citation rates between articles rejected with and without review, it is unlikely to explain the entire difference. Moreover, if such an effect existed, it would tend to reduce the difference we found between those articles published in JGIM and those that were reviewed but published in a journal other than JGIM.
We also found evidence that 3 reviewers are better than 2, as the percentage of submissions correctly classified increased from 35% to 69%. It is impossible to determine the optimal number of reviews from our data. It is also uncertain whether the extra costs associated with obtaining additional reviews would be worthwhile, since the editors' decisions appear to discriminate reasonably well between high and low impact articles.
There is interest in using absolute cut points of quality scores to make decisions about accepting or rejecting articles. Our data suggest that making editorial decisions based on total quality scores or the score on a specific quality domain would not adequately discriminate between high and low impact articles, as demonstrated in the ROC curve.
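The cut point question can be sketched as follows: for each candidate threshold on a total quality score, compute the true and false positive rates, and summarize discrimination across all thresholds with the area under the ROC curve (AUC). This is a minimal illustration under stated assumptions: the scores and the high impact labels (1 = high impact, 0 = low impact) are hypothetical, and the rule "accept if score is at or above the threshold" is assumed for illustration rather than taken from the study.

```python
def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) for each cut point."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one high impact and one low impact article")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Articles at or above the threshold would be accepted under this rule.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(scores, labels):
    """Probability that a random high impact article outscores a random low impact one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one high impact and one low impact article")
    # Ties count as half a win, the usual convention for the rank-based AUC.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical total quality scores and impact labels for 8 articles.
total_scores = [22, 18, 25, 26, 20, 17, 24, 19]
high_impact = [1, 0, 0, 1, 1, 0, 0, 1]

for threshold, fpr, tpr in roc_points(total_scores, high_impact):
    print(f"cut point {threshold}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
print(f"AUC = {auc(total_scores, high_impact):.3f}")
```

An AUC of 0.5 corresponds to chance and 1.0 to perfect separation. In the invented data above the AUC is 0.625, and no single threshold achieves a high true positive rate without also admitting many low impact articles, which is the pattern a weakly discriminating score produces.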
Like that of most journals, the JGIM peer review process has an element of subjectivity. While the deputy editors undergo some training to standardize the process of decision-making, external peer reviewers are volunteers. They are given limited written instructions and may access the JGIM website for further guidance or attend an annual workshop for reviewers, but they are not required to undergo training before submitting reviews. External peer reviewers are asked to self-select their interests and expertise, and this information is used in selecting reviewers for articles. Reviewers may have personal biases for or against particular types of research that may influence their recommendations, and they may possess varying degrees of knowledge in the area. In addition, the decision to accept an article includes other factors that may not be fully captured by our data, such as timeliness or importance of the topic to the journal's parent organization, the Society of General Internal Medicine.
Despite these limitations, peer review appears to be useful. Article selection by journals based on peer review may be important as journals compete for higher impact ratings, as measured by the ISI citation index. A journal's calculated ISI score affects journal prestige, influences authors' decisions about where to submit their best work, and may affect advertising revenue. It was also identified by the Cochrane Collaboration as the best surrogate marker for article importance. However, the ISI impact factor measures just one aspect of article quality: the extent to which other researchers cite the manuscript. It does not capture how often the information is read (let alone used) by practitioners, read by the public, disseminated in the media, or used to make policy decisions. (Suitable surrogate metrics for these outcomes might include eigenfactors, article downloads, web searches, mentions in the popular press, or citations in public speeches, respectively.) Additionally, article type can affect impact. Important health policy topics have a shorter half-life of interest and may receive fewer citations. Medical education topics have a relatively narrow audience (primarily medical educators), even when well done and useful. Thus, the mix of article topics within a journal will profoundly affect a journal's perceived value, even with rigorous peer review. It is thus not surprising that studies that use the citation index as the only measure of “usefulness” of an article may find only weak correlations with the final decision or with individual rater recommendations.
In summary, this study shows that peer review in combination with editorial judgment at JGIM is reasonably good at picking future “winners”. While the individual reviewers have good internal consistency, they have low agreement with one another. There also does not appear to be a particular quality cut point that will discriminate high from low impact articles. Journal editors take these often conflicting recommendations into account and appear to synthesize them in reaching publication decisions. It also appears that a larger number of reviewers is better, though the ideal number cannot be determined from our data. Nevertheless, the process could be improved, especially with respect to the hidden gems that are rejected by JGIM and then go on to garner many citations. While JGIM is not alone in its imperfections (Nature initially rejected Stephen Hawking's paper on black hole radiation), more work is needed to improve the reliability and validity of the peer review process. Wrong decisions are inevitable; fortunately, there are numerous opportunities for authors to publish medical articles. Hawking did eventually publish his seminal work. It is likely that worthy articles eventually find a place in the published literature.