Saturday, 26 November 2011

Beware Dead Fish Statistics

An editorial in the Journal of Physiology offers some important notes on statistics.


But even more importantly, it refers to a certain blog in the process:
The Student’s t-test merely quantifies the ‘Lack of support’ for no effect. It is left to the user of the test to decide how convincing this lack might be. A further difficulty is evident in the repeated samples we show in Figure 2: one of those samples was quite improbable because the P-value was 0.03, which suggests a substantial lack of support, but that’s chance for you! A parody of this effect of multiple sampling, taken to extremes, can be found at http://neuroskeptic.blogspot.com/2009/09/fmri-gets-slap-in-face-with-dead-fish.html
This makes it the second academic paper to refer to this blog so far. Although I feel rather bad about this one, since the citation ought to have been to the original dead salmon brain scanning study by Craig Bennett, which I just wrote about.

Actually, though, this editorial was published in six separate journals: The Journal of Physiology, Experimental Physiology, the British Journal of Pharmacology, Advances in Physiology Education, Microcirculation, and Clinical and Experimental Pharmacology and Physiology. Phew.

In fact, you could say that this makes not two but seven citations for Neuroskeptic now. Yes. Let's go with that.

Anyway, after discussing the history of the ubiquitous Student's t-test - which was invented in a brewery - the editorial reminds us that the p value you get from such a test doesn't tell you how likely it is that your results are "real".

Rather, it tells you how often you'd get the result you did, if there was no effect and it was just random chance. That's a big difference. A p value of 0.01 doesn't mean your results are 99% likely to be real. It means that there's a 1% chance that you'd get them by chance. But if you did, say, 100 experiments - or, more likely, 100 statistical tests on the same data - then you'd expect to get at least one result with a p value of 0.01 purely by chance.
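To put a number on that: across 100 independent tests of a true null, the chance of seeing at least one p of 0.01 or less is 1 - 0.99^100, which is about 63%, and on average you'll get one such hit. Here's a minimal simulation sketching the point - my own illustration, not from the editorial, and the group sizes and seed are arbitrary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

# Run 100 two-sample t-tests where the null is true by construction:
# both groups are drawn from the same normal distribution.
n_tests, n_per_group = 100, 20
hits = 0
smallest_p = 1.0
for _ in range(n_tests):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    t, p = stats.ttest_ind(a, b)
    smallest_p = min(smallest_p, p)
    if p <= 0.01:
        hits += 1

print(f"Tests with p <= 0.01: {hits} out of {n_tests}")
print(f"Smallest p-value: {smallest_p:.4f}")
# Despite zero real effects, roughly one test per run comes out
# "significant" at 0.01; over many seeds, about 63% of runs have
# at least one, since 1 - 0.99**100 is roughly 0.63.

Re-run it with different seeds and the "significant" hits move around, which is exactly what you'd expect of flukes.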

In that case it would be silly to think that the finding was only 1% likely to be a fluke. Of course it could be true. But we'd have no particular reason to think so until we get some more data.

This is what the dead salmon study was all about. The multiple comparisons issue is very old, but very important. Arguably the biggest problem in science today is that we're running too many comparisons and only reporting the significant ones.
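The dead salmon paper made exactly this point: once you correct the threshold for the number of comparisons, the "activation" disappears. The crudest such correction is Bonferroni - divide your significance threshold by the number of tests. A quick sketch with made-up p-values (mine, purely illustrative):

# Bonferroni correction: to keep the overall (family-wise) false
# positive rate at alpha across n tests, require p <= alpha / n.
p_values = [0.03, 0.20, 0.008, 0.55]  # hypothetical results from 4 tests
alpha = 0.05
threshold = alpha / len(p_values)     # 0.0125 here

for p in p_values:
    verdict = "significant" if p <= threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} (corrected threshold {threshold:.4f})")

# Note: p = 0.03 would pass an uncorrected 0.05 cut-off,
# but fails once the correction is applied.

Bonferroni is deliberately blunt; fMRI studies typically use less conservative familywise error or false discovery rate methods, but the principle is the same.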

Drummond GB & Tom BD (2011). Statistics, probability, significance, likelihood: words mean what we define them to mean. British Journal of Pharmacology, 164(6), 1573-6. PMID: 22022804

9 comments:

Noah Motion said...

Rather, it tells you how often you'd get the result you did, if there was no effect and it was just random chance.

Actually, it tells you how often you'd get the result you did or a more extreme result under the assumption that the null hypothesis is true.

Neuroskeptic said...

True.

Peter Hildebrand said...

Great post, thanks. Do you have any recommendations on where else I can go to brush up on my statistics?

David said...

Wow, congrats on your blog citation!

Marcel Falkiewicz said...

I'd like to recommend reading http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf . Classical p values are less intuitive than they seem...

Eric Charles said...

In fairness to Gosset... and not to imply that psychologists are doing statistics correctly... but Gosset never proposed modern null hypothesis testing, with its simple cut-off p-value. His proposal, as with Neyman and the younger Pearson, was that people could calculate the probability, and then use that as a piece of logical information for determining future action. To make an informed decision you would need to know the potential risks of being wrong, and the potential benefits of being correct, meaning that in some situations you might tolerate a very high probability of error, and in others demand near certainty. I've never understood why Fisher (who I otherwise think extremely highly of) felt the need to create the p = .05 cut-off.

Laura E. Mariani said...

Nice! Can you calculate the h-index of Neuroskeptic?

Re: statistics, I've always enjoyed this XKCD comic on the same subject: http://xkcd.com/882/

DS said...

Hidden, I think, in this latest topic and at its core, when we look beyond the faulty application of conventional inductive reasoning, is the question of the role of inductive reasoning in learning. Inductive reasoning cannot be deductively justified. So what compels us to do it? Are we hardwired to do it? This is not to raise deduction to some apical position. In fact we could ask the same questions of deduction. Are deduction and the unjustifiable "self evident" rules of conventional logic hardwired as well?

Neuroskeptic said...

I think the h-index is one, because I don't know of any posts with more than 1 cite (unless you count this one as 6, but then it's still h=1).