Monday, 27 December 2010

Does Peer Review Work?

Scientific peer review is based on the idea that some papers deserve to get published and others don't.

By asking a hand-picked team of 3 or 4 experts in the field (the "peers"), journals hope to accept the good stuff, filter out the rubbish, and improve the not-quite-good-enough papers.

This all assumes that the reviewers, being experts, are able to make a more or less objective judgement. In other words, when a reviewer says that a paper's good or bad, they're reporting something about the paper, not just giving their own personal opinion.

If that's true, reviewers ought to agree with each other about the merits of each paper. On the other hand, if it turns out that they don't agree any more often than we'd expect if they were assigning ratings entirely at random, that would suggest that there's a problem somewhere.

Guess what? Bornmann et al have just reported that reviewers are only slightly more likely to agree than they would be if they were just flipping coins: A Reliability-Generalization Study of Journal Peer Reviews.

The study is a meta-analysis of 48 studies published since 1966, looking at peer review of either journal papers or conference presentations. In total, almost 20,000 submissions were studied. Bornmann et al calculated the mean inter-rater reliability (IRR), a measure of how well different judges agree with each other.

Overall, they found a reliability coefficient (r^2) of 0.23, or 0.34 under a different statistical model. This is pretty low, given that 0 is random chance, while a perfect correlation would be 1.0. Using another measure of IRR, Cohen's kappa, they found a reliability of 0.17. Roughly speaking, that means reviewers' agreement was only 17% of the way from what chance alone would produce to perfect agreement.
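
To get a feel for what a kappa of that size means, here's a rough sketch of the calculation (the reviewer decisions below are invented for illustration, not taken from the paper):

# Cohen's kappa: agreement between two reviewers, corrected for the
# agreement you'd expect if each just applied their own accept rate at random.
# The decisions below are made up for illustration, not data from Bornmann et al.

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: proportion of manuscripts where both reviewers agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability of agreeing by luck, given each
    # reviewer's own base rates of accepting and rejecting.
    chance = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                 for c in categories)
    return (observed - chance) / (1 - chance)

reviewer_1 = ["accept", "reject", "accept", "accept", "reject",
              "accept", "accept", "reject", "accept", "reject"]
reviewer_2 = ["accept", "reject", "reject", "accept", "accept",
              "accept", "reject", "reject", "accept", "accept"]

print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # kappa = 0.17

In this made-up example the two reviewers agree on 6 out of 10 manuscripts, but their base rates alone would produce agreement on about 5.2 of them, so kappa works out at roughly 0.17 - the same ballpark as the meta-analysis.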

Worse still, the bigger the study, the worse the reliability it reported. On the other hand, the subject - economics/law, natural sciences, medical sciences, or social sciences - had no effect, arguing against the common sense idea that reviews must be more objective in the "harder" sciences.

So what? Does this mean that peer review is a bad thing? Maybe it's like the police. The police are there to prevent and punish crime. They don't always succeed: crime happens. But only a fool would argue that, because the police fail to prevent some crimes, we ought to abolish them. The fact that we have police, even imperfect ones, acts as a deterrent.

Likewise, I suspect that peer review, for all its flaws (and poor reliability is just one of them), does prevent many "bad" papers from getting written, or getting submitted, even if a lot do still make it through, and even if the vetting process is itself not very efficient. The very fact that peer review is there at all makes people write their papers in a certain way.

Peer review surely does "work", to some extent - but is the work it does actually useful? Does it really filter out bad papers or does it on the contrary act to stifle originality? There are lots of things to say about this, but I will just say this for now: it's important to distinguish between whether peer review is good for science as a whole, and whether it's good for journals.

Every respectable journal relies on peer review to decide which papers to publish: even if the reviewers achieve nothing else, they certainly save the Editor time, and hence money (reviewers generally work for free). It's very hard to see how the current system of scientific publication in journals would survive without peer review. But that doesn't mean it's good for science. That's an entirely different question.

Bornmann L, Mutz R, & Daniel HD (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis of inter-rater reliability and its determinants. PLoS ONE, 5(12). PMID: 21179459

25 comments:

bsci said...

It's hard to dig into a meta-analysis, but, like you say, the basic assumption of inter-rater reliability is that each reviewer is supposed to bring the same objective perspective to an article. I'm not sure this is the case. At least from my reviewing experience, it seems that editors try to select reviewers that bring slightly different perspectives (i.e. on a paper that discusses scientific topics A & B using method C, one reviewer may dig into the reasoning behind topic A, another into B, while a third focuses on how C is applied). This type of system would maximize the chance that at least one reviewer would hold up a bad paper. It would also cause low IRR, but that would be a good thing.

I don't know this literature at all, but one way to test this would be to see if there's higher IRR in subfield journals where it's more likely that reviewers share the same perspective.

Anonymous said...

I agree that if there is more than one reviewer (which in my field, astronomy, is very rare), they do not necessarily have to agree with each other, because they will judge your paper (either objectively or not) from another perspective. What would be the use of more than one reviewer if they all agree anyway?

Further, on the remark that people write and submit carefully because there is something like a referee process: I agree, but it is not necessarily a good thing. First, I know that there are people who deliberately leave obvious things open in the paper. They hope the referee will catch them, do the work while the referee is reviewing the paper, and after his/her report the paper will be out very quickly. This is a way of misleading the referee that will not always work, but it certainly often does. Secondly, many good ideas will get lost, or wrongly credited, because of the refereeing process. A paper cannot just be an idea; it has to be worked out to some level of satisfaction (of at least the reviewer). A human being can only do so many things at a time, but can have ideas for more than that. These will (or will not, which is even worse) be worked out by others, and the origin of the idea is not always clear anymore. This leads to frustration and puts personal relations to the test.

A last thing, which is especially bad when there is only one reviewer, is that personal opinions play too strong a role. Not only opinions about the work to be reviewed, but also opinions about the authors. I would plead, in the case of papers _and_ proposals of any kind, for anonymous submission.

JRQ said...

To follow up what bsci said, we have to ask: what is the nature of the construct being measured? Does it make sense to say that there is a true score for paper quality, and that every reviewer is tapping this true score to some degree? From the perspective of classical test theory, reliability is the ratio of true-score variance to total observed variance. Thus, the application and interpretation of standard reliability models to a particular problem relies on the assumption that the true-score/error-score model applies to the problem. It's not clear that this is always the case.

Typically we take multiple measures because we want to estimate the true score with greater precision, i.e., to reduce the contribution of random error to our estimate. But if, for example, the purpose of taking multiple measures is not to increase precision of a true-score estimate, but to increase COVERAGE across many facets of a multifaceted construct, we might actually expect --or even desire-- less inter-rater agreement.
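
To make that concrete, here's a toy simulation (all numbers invented): suppose a paper's quality has two independent facets, say methodological rigour and novelty, and each reviewer mostly rates a different one. Both reviewers are doing their job well, yet the inter-rater correlation comes out low.

# Toy simulation: two independent facets of "quality", two accurate reviewers
# who weight the facets differently. All weights and noise levels are made up.
import random
import statistics

random.seed(1)

def pearson_r(xs, ys):
    # Plain Pearson correlation, no external libraries needed.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

n_papers = 10_000
rigour = [random.gauss(0, 1) for _ in range(n_papers)]   # facet 1
novelty = [random.gauss(0, 1) for _ in range(n_papers)]  # facet 2

# Each reviewer weights the facets differently, plus a little personal noise.
reviewer_1 = [0.9 * r + 0.1 * n + random.gauss(0, 0.3)
              for r, n in zip(rigour, novelty)]
reviewer_2 = [0.1 * r + 0.9 * n + random.gauss(0, 0.3)
              for r, n in zip(rigour, novelty)]

print(f"inter-rater correlation = {pearson_r(reviewer_1, reviewer_2):.2f}")  # ~0.2

With weights like these the correlation comes out around 0.2, right in the range the meta-analysis reports, even though neither reviewer is adding much random error. Low reliability, in other words, is exactly what you'd expect if reviewers are sampling different facets of the construct.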

I think surely it's undesirable to have reviewer disagreement where reviewers are telling you opposite things. But I'm not sure it's undesirable to have reviewer disagreement where, say, one reviewer gives a negative review pointing out that the analysis was not sufficient to support the conclusion the authors are trying to draw, and another reviewer gives a positive review because the method and results are novel anyway.

In any event, low agreement among reviews is a more tolerable situation than agreement inflated by systematic bias.

passionlessDrone said...

Hello friends -

I wonder what the peer review for this paper looked like?

- pD

[Note: stolen joke from comments section at In the Pipeline for commentary on same study. My peers determined it was a joke worth sharing.]

Tiel Aisha Ansari said...

I'd also wonder about the number of papers that were rejected overall. It could be that the papers in many samples were both acceptable and of fairly uniform quality, so there wasn't much to choose between them (consistent with your suggestion that the _existence_ of peer review enforces a certain standard independent of the quality of the review process).

There's also more to peer reviewing than just "accept/don't accept" (though that's the measurable outcome). Peer review sometimes results in editorial feedback that can improve a paper's presentation, or cause the authors to resubmit to a different journal.

GamesWithWords said...

I have concerns similar to JRQ. What are we measuring? To the best of my knowledge, most of the variance in reviews is in interpretation and importance, not in analysis of the data. That is, if it was the case that reviewers couldn't agree on whether the data was bullshit or not, that would be disturbing. If what they're disagreeing about is whether the question being studied is worthy of study, well then beauty is in the eye of the beholder and you want a good sample of eyes (in order to best predict whether readers will care). If the disagreement is about how the data are interpreted, well, then that's very complicated. Some of those disagreements are important, some are not, and it's really hard to tell from the outside which is which.

Maybe we should turn the question around: What is it we wanted peer review to accomplish? Without answering that question, how can we know if it is succeeding? I suspect there is little agreement about the answer, beyond some vacuous sense of "identifying good papers."

GamesWithWords said...

PS I wrote about some of these and related issues previously, here: http://gameswithwords.fieldofscience.com/2010/07/honestly-research-blogging-get-over.html

petrossa said...

Given the following, it is quite certain peer review isn't worth much in general.

Why Most Published Research Findings Are False

Summary

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124
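
To put rough numbers on that argument: the core of the paper is a positive predictive value calculation, PPV = (1 - beta)*R / (R - beta*R + alpha), where R is the pre-study odds that a probed relationship is real, beta the type II error rate and alpha the significance threshold. Plugging in some plausible but invented values shows how easily "significant" findings tip towards being mostly false:

# Positive predictive value of a "significant" finding, following the framework
# in Ioannidis (2005). The scenario values below are invented for illustration.

def ppv(prior_odds, power, alpha=0.05):
    # prior_odds (R): pre-study odds that a tested relationship is real
    # power (1 - beta): probability of detecting a real effect
    beta = 1 - power
    return (power * prior_odds) / (prior_odds - beta * prior_odds + alpha)

# Well-powered study in a field where 1 in 10 tested hypotheses is real:
print(f"{ppv(prior_odds=0.1, power=0.8):.2f}")   # ~0.62

# Underpowered study in a more exploratory field (1 in 20 real):
print(f"{ppv(prior_odds=0.05, power=0.2):.2f}")  # well under 0.5: more likely false than true

Bias, flexible analyses and many teams chasing the same question push these numbers down further, which is the core of Ioannidis's argument.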

Neuroskeptic said...

The point first made by bsci about different kinds of reviews is certainly a valid one, but I'm not sure whether it applies to the studies in this meta-analysis. I guess it applies to some but not others.

For example, I know of some journals which do adopt this policy, and others which don't; they just assign the paper to the "leading experts", whatever their skills may be.

So it would be interesting to break down the results by that variable but probably impossible because it's hard to measure (and might vary with the Editor's whim).

Neuroskeptic said...

petrossa: That's the kind of thing I meant when I talked about the importance of distinguishing value to science vs. value to journals.

Most published findings may be false, but that hasn't stopped journals publishing them, and benefiting from it in terms of Impact Factor, etc.

Kapitano said...

There is the question of what peer review would ideally accomplish, as opposed to what it does in practice.

Ideally, I think, reviewers exist to weed out
(a) Papers written by cranks and incompetents
(b) Papers that may have merit but really belong in a different publication
(c) Papers which are almost good enough, but need a redraft and resubmission.

In practice, they may not so much deselect as select - papers which
* Have findings they find amenable, or are in fields they're interested in
* Are in a fashionable area
* Might increase the prestige of the publication
* Fit any space limitations
* Might broaden the readership, possibly by being contentious
etc.

I'd be intrigued to know whether there's a correlation between a reviewer's likelihood of accepting a paper, and whether they personally know the submitter.

petrossa said...

@neuro
The problem is caused mostly by the dilution of what constitutes Science and by the shift of scientific endeavor towards being goal-oriented.

First of all, anything is called Science nowadays, from Economics to Climatology. The margins for manipulation in these non-empirical sciences are so wide it's hard not to do so.

Second of all in ye olden days Scientists tried to solve a riddle, something that puzzled them. Nowadays Scientists are told to attain goal X.

So Science tends to work towards a fixed goal which might or might not exist. Being paid/recognized/ego-boosted by your rate of attaining said goals, it would take a Vulcan to be objective enough not to color your findings.

Confirmation bias is by now the main driving force of most studies.

Like a stopped clock that is correct twice a day, the goals do get attained somewhat, reinforcing the mechanism.

GamesWithWords said...

@Petrossa:

Have you ever read any climatology or economics papers? You do realize that they pose hypotheses and test them experimentally, right? True, it's pretty hard for climatologists to independently manipulate variables of interest, but then that's true for astronomy or astrophysics as well: do you deny, e.g., that Stephen Hawking is a scientist? Or evolutionary biologists?

Confirmation bias is no doubt present in some circumstances. Do you have any evidence as to just how important a factor it is? My own anecdotal evidence is that most of my colleagues complain about how they always make the wrong predictions about how an experiment will come out. So in my non-representative anecdotal sample, there's little evidence that confirmation bias is an important force.

petrossa said...

@GamesWithWords

Evidently I did, otherwise I couldn't comment on it. Ad hominem is about the lamest argument ever.

We could set up a nice sidetrack discussing the scientific quality of said experiments but that would not add much to the discussion.

Let me just say that an economic experiment with valid empirical results is impossible for obvious reasons: you can't put the economy in a test tube. You can only model it. And a model is only as good as its maker/user.

Climate, a looped non-linear chaotic system, isn't scienceable. Not by us anyway.
Proxy temperatures are not in any way, shape or form scientific; they are extrapolated assumptions, the main one being that nothing ever changes. Pretty unscientific and unrealistic if you ask me. As such there is no valid climate change timeline, so a valid model based on past results is impossible.

And experiments with climate are impossible as well for even more obvious reasons.

You can't put the climate in a test tube and experiment. You can model it, and again the model will be as good (or, in this case, as bad) as its creators/users.

Quote myself (a systems analyst/programmer with experience writing models):
The quality of the results of any model is inversely proportional to the complexity of the data.

deevybee said...

I agree with other commentators that reliability between reviewers is a bit of a red herring.
In my experience (neuropsychology), the value of the review process is that it forces you to confront alternative perspectives, or even just explain more clearly what you mean. If we didn’t have reviewers, it would be very easy to write papers in a wholly egocentric way, ignoring viewpoints and facts we didn’t like.
If you are very unlucky, reviewers may indicate your work is fatally flawed or needs more experiments, but if that is really the case, then you can be grateful not to have published work that would damage your reputation. I’ve had numerous instances, especially when moving into a new area, where reviewer comments have led me to read a literature I was unaware of, or to learn to do new analyses that were difficult but relevant. Of course, reviewers can be wrong/stupid/biased/etc, but that is rare in my experience, and you should be able to argue your corner if this is the case. I was told many years ago by a wise mentor that if the reviewer is too stupid to understand what you said, it’s your fault for not saying it more clearly.
So I reckon it’s broadly a positive experience, though it certainly doesn’t feel like that when you first receive the comments.
The real problem is not with reviewers, but with certain journal editors. I’ve given my reflections on this topic in a blog http://tinyurl.com/33lzsvp. I avoid journals whose editors treat the review process as a vote. I also avoid the very high impact journals which won’t publish your work unless all reviewers agree that it is not only good work but also ‘newsworthy’. Unless you have discovered life on Mars, this just adds an unnecessary delay to the publication process.

pj said...

Ioannidis may well be right (although he has a very naive, almost Popperian, view of how scientists view p-values) but I love the way that his paper is now quoted as gospel when it is simply some cod-Bayesian reasoning multiplying together some numbers he's plucked out of the air.

Neuroskeptic said...

deevybee: It's certainly true that peer reviewers can improve papers. However they can also make them worse.

For example, in my experience, peer reviewers will often notice bits where I was unclear, or where I assumed something and didn't reference it properly, etc., and by making the required changes the paper became better for the reader, which is all that matters.

On the other hand though, a lot of reviewer recommendations, while they satisfy the reviewer and while they may be scientifically solid, make the end result worse to read, e.g. if they ask you to insert a paragraph discussing a certain point, which breaks the flow of the paper.

Quite often when reading papers, I come across bits which leave me puzzled as to why exactly it's been put there, and I tend to attribute these to reviewers.

More broadly I think reviewers almost always end up making papers longer rather than shorter, and I think this is a bad thing, because most experimental papers are too long.

deevybee said...

I know what you mean - you can sometimes spot the paragraphs that have been shoe-horned in just to please a reviewer. But I think that is the fault of the editor for letting it happen. Too many editors just don't have the energy, interest or courage to tell authors when they can ignore reviewers.
But numerous reviewers have saved me from making egregious errors, or from publishing something I thought was great but nobody could understand.

eastgatesystems@mac.com said...

The most important role for peer review in the physical sciences is detecting blunders that are subtle, unexpected, or far from obvious.

For example, a paper in chemistry might present an original synthesis of a novel compound, together with spectroscopic evidence that the compound is indeed what the investigators expected. A particularly valuable review might argue that the compound might alternatively be something different, and completely unexpected.

That's extremely valuable -- there's a famous Frank Westheimer anecdote along these lines -- and in such cases it is very far from likely that all reviewers would anticipate the same objection.

SleepRunning said...

If it's done by humans, it's going to be really imperfect. Horrible? However, it still seems the best approach we have and has produced a few results worth having -- maybe.

petrossa said...

Professor Higgs couldn't get his paper on the Higgs boson published at CERN. It didn't get past the reviewers.

He had to turn to another journal to get it published.

I'm quite sure there are untold interesting papers out there not getting past the starting gate, because the gate is jealously guarded by scientific dinos rejecting anything that could threaten their nice status quo.

Nothing human is strange to a scientist, so if someone offers a paper that puts your views out to pasture it's pretty likely you are going to stop that from happening.

Imo the size of an ego is directly related to the number of papers published. Since reviewers are 'respected' (i.e. well-published) scientists, their egos are enormous.

If anything, peer review holds things back, and doesn't do much for the quality of the material.

Jim Birch said...

At one time, in the not so distant past, the publication space for scientific papers was very expensive, so needed to be treated as a precious resource with a high level of gate keeping. This no longer applies. It would be quite possible to publish everything now, and have a different open process for review. (Peer review has always continued post publication.)

What the ideal form for this would be is uncertain, but the impact of publication bias against null results and unfashionable research is undeniable and undesirable. Peer review does weed out nutty and incompetent papers, but so would being torn apart in an online comments section.

Neuroskeptic said...

jim: My thoughts exactly. Peer review is a 20th century solution to the technical problem that there was only limited space for publication, and also, that it was very awkward to discuss and criticize papers post-publication (Letters to the Editor were the only way but they were, and are, incredibly clunky).

It's no longer technically necessary. I don't think we have a viable alternative in place yet, but I think we're groping towards one with open access online journals which allow comments, like PLoS, and stuff like arxiv.

Tiel Aisha Ansari said...

neuro: of course, the problem with that is someone would have to review the comments to separate the substantive, educated ones from the dinosaur-herding loony-toon ones... The space limitation may have gone away but we still have time and effort limitations.

John Harpur said...

The question is, perhaps, better rephrased as what causes peer review to break down. When I was an academic I had a look at this as part of a wider investigation into the impact of research commercialization policies on higher education. Meta-analyses going back to the 80s fingered a bias for 'big name' institutions. Social bias towards the work of heavy hitters is also not unknown. Moreover, as fields narrow, the number of true specialists that can be wheeled into play also diminishes. Almost any area that has commercial backing is rife with conflicts of interest. Controlling them is a leaky undertaking. Curiously, in the case of medical research, it appears that ghost-written papers often rank higher than might be expected statistically. Peer review often doesn't pick up significant fraud. Short of standing over an investigator's shoulder and demanding a complete replication of experiments, it simply can't. Fundamentally, the review process makes an act of faith in the data and outcomes reported.
From what we now know of the many frauds in science, and thankfully evidence for fraud here is almost always in published form, often co-authors are either not really in the loop or are not unhappy to have their names tagged to a rising star. A different take on 'don't ask, don't tell'. The medical journals have tightened up on authorship accountability considerably but the determined fraudster is unlikely to be deterred. That's the ingenuity of human nature. Even the most secure high street shops lose product to the shoplifters. The financial and professional benefits from 'misconduct' are far from intangible. Perhaps, and it is only a suggestion, if more whistle-blowing was encouraged and if whistle-blowers (I don't like that term either) were reassured that revealing evidence of fraud would not damage their careers, more fraud would be caught earlier in the review process.