I’ve been reading about the NIPS Experiment. Calm down at the back there. NIPS stands for Neural Information Processing Systems. It’s all very serious and you can read about the experiment here and here.
In essence, the experiment aimed to examine the process by which papers are accepted or rejected by peer review committees for conference presentation. Obviously, it’s all to do with scientific quality and the scientific community is built around a common understanding of what that means. Or is it?
The NIPS experimenters split their panel of conference peer reviewers into two committees. Most of the papers went to one committee or the other for review, but 10% of them (166 papers) were reviewed by both committees without the members knowing which papers they were. It was then possible to see how similar the two committees were in their evaluation of those papers. A full write-up of the results is still to come, apparently, but Eric Price has revealed the essence.
The committees disagreed in their evaluation of 43 of the 166 papers. Naïvely, you might think that’s not too bad. They disagreed on 25.9% of cases, so they must have agreed on 74.1%. However, Eric Price points out that the committees were tasked with a 22.5% acceptance rate, which for these 166 papers works out to roughly 37 acceptances per committee. The number of disagreements was therefore larger than the number of acceptances each committee was expected to make, which means that most (more than half) of the papers accepted by either committee were rejected by the other.
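To make the arithmetic concrete, here is a rough back-of-the-envelope calculation in Python. It assumes both committees hit the 22.5% target exactly and that the 43 disagreements split evenly between the two committees; the exact committee-by-committee counts haven’t been published, so treat this as an approximation rather than the experiment’s own figures.

```python
# Back-of-the-envelope: what 43 disagreements imply, given the figures above.
# Assumes both committees hit the 22.5% acceptance target and that the
# disagreements split roughly evenly between them (an approximation).

papers = 166
acceptance_rate = 0.225
disagreements = 43

accepted_per_committee = papers * acceptance_rate    # ~37 papers each

# Each disagreement is a paper accepted by exactly one committee,
# so each committee contributes about half of them.
accepted_but_rejected_elsewhere = disagreements / 2  # ~21.5 papers

fraction = accepted_but_rejected_elsewhere / accepted_per_committee
print(f"Accepted per committee: ~{accepted_per_committee:.0f}")
print(f"Of those, rejected by the other committee: ~{accepted_but_rejected_elsewhere:.0f}")
print(f"That is about {fraction:.0%} of each committee's accepted papers")
```

On these assumptions, roughly 21 or 22 of each committee’s ~37 accepted papers would have been rejected by the other committee: well over half.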
Price considers a theoretical model which treats the peer review process as a combination of “certain” and “random” components. He assumes that there will be some papers that every reviewer agrees should be accepted (acceptance is certain) and some that everyone agrees should be rejected (rejection is certain). For the rest, Price’s model assumes that committee members make their decision by (metaphorically at least) flipping a coin. This is the random component, and the level of randomness in peer review is the proportion of papers that get this treatment. The divergence in reviewing committees’ decisions seen in the NIPS experiment implies that there is quite a lot of this coin-flipping randomness in peer review; perhaps more than most people thought.
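To get a feel for this kind of model, here is a toy simulation of the “certain accept / certain reject / coin flip” idea. The parameter values below are illustrative choices of mine, not Price’s published estimates: they are picked only to show that a large random middle, with a suitably biased coin, is compatible with both the 22.5% acceptance rate and the observed level of disagreement.

```python
import random

# Toy simulation of the "certain + coin-flip" model described above.
# Parameter values are illustrative, not fitted: roughly half the papers
# sit in the coin-flip zone, and the coin is biased so that the overall
# acceptance rate comes out near the 22.5% target.

random.seed(0)

N = 166                # papers reviewed by both committees
CERTAIN_ACCEPT = 0.02  # fraction every reviewer would accept
CERTAIN_REJECT = 0.46  # fraction every reviewer would reject
RANDOM_MIDDLE = 1 - CERTAIN_ACCEPT - CERTAIN_REJECT   # the coin-flip zone
COIN_P = (0.225 - CERTAIN_ACCEPT) / RANDOM_MIDDLE     # bias of the coin

def committee_decision(kind):
    """One committee's verdict on a paper under the toy model."""
    if kind == "accept":
        return True
    if kind == "reject":
        return False
    return random.random() < COIN_P  # the metaphorical coin flip

def simulate_once():
    disagreements = 0
    for _ in range(N):
        u = random.random()
        if u < CERTAIN_ACCEPT:
            kind = "accept"
        elif u < CERTAIN_ACCEPT + CERTAIN_REJECT:
            kind = "reject"
        else:
            kind = "coin"
        verdict_a = committee_decision(kind)  # committee A
        verdict_b = committee_decision(kind)  # committee B (independent flip)
        if verdict_a != verdict_b:
            disagreements += 1
    return disagreements

runs = [simulate_once() for _ in range(1000)]
print(f"Mean disagreements over {len(runs)} runs: "
      f"{sum(runs) / len(runs):.1f} (observed: 43 out of {N})")
```

With these made-up parameters the simulation lands in the low forties on average, in the same ballpark as the 43 disagreements actually observed, which is the sense in which the data are consistent with a lot of coin flipping.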
Is this “randomness” in reviewers’ assessments a cause for concern? Price points out that “consistency is not the only goal” and, indeed, it can arise for reasons that are not necessarily welcome. For instance, unanimously accepted papers may simply be feeling the benefit of appearing under the names of well-connected authors whom reviewers favour for reputational reasons. Conversely, papers that reviewers unanimously reject may just be suffering the penalty of pursuing unfashionable research topics that reviewers see as a drain on funding for more popular topics. It may well be that it is precisely in the “random middle” – between the certain acceptances and certain rejections – that we see peer review at its best.
But how can it be any good if it’s random? The truth is, it’s pretty implausible that it really is random. I don’t see much reason to believe that peer reviewers actually flip coins, and as humans are not good random number generators, it seems unlikely that conceptual flipping of imaginary coins would produce genuinely random results. What really goes on in this middle zone is not random at all. Rather, it’s a process of deliberation in which each reviewer considers a variety of factors and makes a decision by balancing those factors. Even having made the decision, the reviewer probably still feels a fair degree of uncertainty as to whether it was the right one.
Because reviewers are usually allowed to decide for themselves which factors to consider in their deliberations, there is a good deal of variation between reviewers as to which factors they consider. Putting it more formally, the weight they give to each factor is not prescribed. What’s more, there’s no guarantee that even individual reviewers will attach the same weights every time: the same reviewer could reach different conclusions about the same paper considered under different circumstances. The toy example below illustrates the point.
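Here is a purely illustrative sketch of two reviewers scoring the same paper with different, unprescribed weights on the same factors. The factors, scores, weights and acceptance threshold are all invented for the sake of the example; nothing here comes from the NIPS data.

```python
# Purely illustrative: two reviewers weighing the same factor scores
# differently and reaching opposite verdicts on the same paper.

paper_scores = {"novelty": 7, "rigour": 4, "clarity": 8, "significance": 5}

reviewer_weights = {
    "Reviewer 1": {"novelty": 0.5, "rigour": 0.2, "clarity": 0.1, "significance": 0.2},
    "Reviewer 2": {"novelty": 0.1, "rigour": 0.5, "clarity": 0.1, "significance": 0.3},
}

THRESHOLD = 6.0  # an arbitrary acceptance bar, also invented

for name, weights in reviewer_weights.items():
    score = sum(weights[f] * paper_scores[f] for f in paper_scores)
    verdict = "accept" if score >= THRESHOLD else "reject"
    print(f"{name}: weighted score {score:.1f} -> {verdict}")
```

With these numbers, Reviewer 1 scores the paper at 6.1 and accepts; Reviewer 2 scores it at 5.0 and rejects. Neither is flipping a coin; they are simply weighting the same evidence differently.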
In short, the degree of “randomness” seen in the NIPS experiment undermines one of the cornerstone assumptions of the peer review process – that reviewers share a coherent common notion of what qualities to value in a paper. Instead, it suggests that the criteria that reviewers use in practice are quite divergent. If this is the case, it is hard to see how peer review could possibly be “fair”. Certainly, steps such as making reviewers’ comments and identities open to authors would seem to miss the point. What is more in order is a dialogue over the criteria used to evaluate research in the first place, and over whether traditional peer review has any useful role to play in this.