Last month, neuroscientists were warned about potential biases in SPM8, a popular software tool for analysis of fMRI data.
Now a paper highlights another software pitfall: "The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements".
FreeSurfer is one of the major image analysis packages and amongst other things, you can use it to measure the size of different parts of the brain.
The authors found substantial differences in regional volumes, depending upon the version of FreeSurfer used. Running the same version of the software on a Mac vs a PC also created differences, and even the version of Mac OS had an impact.
How much of a difference it made varied by brain location. The differences were 5-15% with version changes. For Mac vs PC and Mac OS updates it was less bad, 2-5% mostly, but in the worst regions - the parahippocampal and entorhinal cortex - it was still almost 15% different. Why those regions are so variable is unclear.
The paper goes into lots more detail, but the lesson for researchers is extremely simple: don't cross the streams of data-analysis. Set up your analysis stream and then use it on all of your data. Same hardware, same software, same settings.
Imagine you're doing a study comparing brain structure in two groups. Halfway through analyzing your data, you upgrade your Mac OS. All of the brains you analyze after that will be, say, 5% "bigger". That'll certainly make your data much noisier, and if you happen to analyze most of Group A before Group B, it'll give you a false positive finding.
Sometimes you just can't avoid changes in hardware or software - IT techs have a habit of upgrading things without asking - but in these cases, you should run the same data under the old and the new regime to see if it's making a difference.
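To make that concrete, here's a minimal sketch of such a check, assuming the outputs are FreeSurfer-style aseg.stats tables; the file paths, column positions, and the 2% flagging threshold are purely illustrative and would need adjusting for a real pipeline.

```python
# Sketch: re-run the same scans under the old and new setup, then compare the
# regional volumes. Assumes aseg.stats-style tables where column 4 is
# Volume_mm3 and column 5 is the structure name; adjust for your outputs.

def read_volumes(stats_path):
    """Return {structure_name: volume_mm3} from an aseg.stats-like file."""
    volumes = {}
    with open(stats_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comment/header lines
            fields = line.split()
            volumes[fields[4]] = float(fields[3])
    return volumes

old = read_volumes("subject01_old_setup/stats/aseg.stats")  # hypothetical paths
new = read_volumes("subject01_new_setup/stats/aseg.stats")

for region in sorted(set(old) & set(new)):
    pct = 100.0 * (new[region] - old[region]) / old[region]
    flag = "  <-- check" if abs(pct) > 2.0 else ""           # arbitrary 2% threshold
    print(f"{region:30s} {old[region]:10.1f} {new[region]:10.1f} {pct:+7.2f}%{flag}")
```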
Finally, it would be wrong to blame FreeSurfer for this. I'd be surprised if they were any worse than the other software packages. Mixing and matching versions is something that the FreeSurfer developers specifically warn against. This paper shows why.

46 comments:
It's always a good idea to provide the version number and OS of the analysis software used to allow replication of results - algorithms change from version to version, sometimes without notification.
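For what it's worth, recording that information only takes a few lines; here's a rough sketch (the build-stamp.txt location under $FREESURFER_HOME is an assumption about how the installation records its version, so adapt as needed):

```python
# Sketch: save the software/OS details next to the results so they can be
# reported in the paper and checked later.

import json, os, platform, sys
from pathlib import Path

def freesurfer_build():
    # Assumed location of the version string; adjust for your installation.
    stamp = Path(os.environ.get("FREESURFER_HOME", "")) / "build-stamp.txt"
    return stamp.read_text().strip() if stamp.exists() else "unknown"

provenance = {
    "freesurfer_build": freesurfer_build(),
    "os": platform.platform(),       # e.g. "Darwin-10.8.0-..." or "Linux-..."
    "machine": platform.machine(),   # hardware architecture
    "python": sys.version.split()[0],
}

Path("provenance.json").write_text(json.dumps(provenance, indent=2))
print(provenance)
```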
Doesn't this suggest that perhaps some versions of the software have coding errors?
You really *missed* the point about this paper. The peer review of this paper was awkward, to say the least.
The editor decided to publish it against the majority of the opinions AFAIK.
Anonymous: Hmm. Why, what was the problem?
Whatever the peer review said, we always see a lot of people writing to software mailing lists because they used a newer version of the software and found a difference in their results. This is not a surprise.
People should include more information in their papers about what software/hardware they used to do their analysis, it is really easy.
This is true with computational neuroscience too. A colleague once ran the exact same code on the exact same software, but on a different computer and the results were slightly off. The code accessed random numbers generated by the computer, and the two computers used different algorithms to generate random numbers. Lesson: You can never be too careful or too consistent.
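If the analysis itself consumes random numbers, the usual fix is to stop relying on whatever generator the machine happens to provide and seed an explicit one; a minimal sketch in Python/numpy (the seed value is arbitrary, just record it with the results):

```python
# Sketch: use an explicitly seeded generator rather than the machine's default,
# so the same simulation produces the same numbers on any machine running the
# same numpy version.

import numpy as np

rng = np.random.default_rng(seed=20120601)      # arbitrary seed, recorded with the results
noise = rng.normal(loc=0.0, scale=1.0, size=5)
print(noise)
```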
Actually, given that it is apparently known, and accepted, that different versions and combinations of software and OS produce different results, you should run the analyses on as many different setups as possible, and record all the results with the technical details.
Isn't this a bit of a problem for analysis with a computer cluster? Anatomical studies seem often to be run on clusters. If the computer cluster consists of heterogeneous servers I guess you have a problem.
Does it mean you need to rerun the analysis if the system administrator does a security update from 2.6.32-41-generic to 2.6.32-42-generic kernel?
I would guess security updates in kernels shouldn't do much, but you can't be too careful... I guess it depends on how the analysis works as to whether you'd need to be careful on a cluster. If local machines are generating random numbers themselves, for instance, then yes, it could have quantitative implications for reproducibility (though hopefully not qualitative).
@The CellularScale: There was a good paper with some recommendations about model reporting in neuronal network modelling a couple of years ago, which mentioned including information on software versions etc used, but didn't emphasise it greatly. I guess this should apply anywhere computational tools are used, though: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000456
I think PLoS One will publish any rubbish it gets its hands on if the authors provide enough supplemental data.
I'm willing to throw sh*t on the fan and publish the internal discussion among editors and reviewers about this paper.
Anonymous: Please do let us in on the discussion. This could be revealing both to the quality of the paper, and to possible politics of reviewers trying to block bad (or inconvenient?) papers
I think the comment area has limited space to publish the lengthy review. Maybe someone can come up with a better idea on how to leak it.
I believe www.scribd.com is the traditional way...
yes! leak it!
anyhow, @neuroskeptic, why do you say "it would be wrong to blame FreeSurfer for this"
who else *is* to blame?
Different results for the same version, same architecture, but a different OS version are odd. APIs can change internally, but they are supposed to return the same values. Getting matching results between architectures may take some tricks, but standard libraries should solve that. Sometimes you need to be extra sure, which is why developers put up with things like Java.
What is troubling is that the authors went to some trouble to prove that there is a bug, but don't seem to care where or what is going on. Also, as noted, results are expected to vary between FreeSurfer versions, so why even test that unless you are trying to see which version is "better." Just feels awkward to me to read someone using neuroscience on a computer science problem.
It seems like this article should have been a bug report, or discussion on a mailing list.
Also, geeks prefer pastebin.com for leaking dox. :P
neuromusic: What I mean is, you can hardly fault FreeSurfer for behaving in this way, given that the developers specifically warn against doing this, and also, because (I assume) other software also behaves the same way (given that other software warns against doing this.)
FreeSurfer may be the worst for this, but there's no evidence for that (AFAIK), so it would be wrong to assume that they're worse at this stage.
@neuroskeptic - gotcha. I would agree that there's no evidence to suggest that FreeSurfer is any worse than other software, so I would argue that "it would be wrong to blame FreeSurfer *differently* for this". and it is ultimately the responsibility of the scientist to ensure that their results are robust to the particular software package chosen. but it is totally OK to criticize ANY publisher of scientific analytic software for yielding inconsistent results across versions.
seems like a straightforward example of chaos notions... slight variations in initial settings/inputs lead to much bigger fluctuations downline. Can happen a lot in certain computational applications.
Sorry, but if software used in scientific analysis is getting results that change by a measurable factor when varying the operating system (not the version of the analysis software) then it's shonky.
Yes, the scientist should control all factors possible, but so should the makers of scientific software!
neuromusic - I see your point, but what if they improve their algorithm to get better estimates, what should they do? Not release it because it would give different results from the old version? Then there'd be no progress.
Anonymous: OK. That's interesting but I don't see any problem here, from what you've said.
PLoS explicitly will publish anything that's not fundamentally flawed; leaving it to readers to decide if it's important or good... which I think is a good system.
Were there fundamental flaws pointed out?
Neuroskeptic: If the algorithm is changed, it should be implemented as a *different* version, be it different function call, command line switch, whatever.
It really will not do for @Anonymous to hint darkly about fundamental flaws while giving no hint about what he/she thinks they are, and without giving us a hint about who he/she is. That is not the way to do post-publication peer review.
The problem seems to me to arise from blind use of programs that you didn't write yourself, programs which you have only a rough idea of what they are doing to your data.
David Colquhoun: "It really will not do for @Anonymous to hint darkly about fundamental flaws while giving no hint about what he/she thinks they are"
I agree. If Anonymous has specifics on the criticisms of this paper, then please share them; without doing so we have nothing to go on.
I am not a user of FreeSurfer or SPM. Do the measurements in question come with a propagated error analysis? If they do not then the problem is deeper than you think. No error estimate on a measurement is next to worthless.
Somehow neuroscience gets away with not providing error estimates and propagating these estimates through whatever analysis. It's how the snake oil is sold.
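For readers wondering what a propagated error would even look like here, a minimal first-order (delta-method) example for a volume normalized by intracranial volume; the numbers are invented purely for illustration:

```python
# Sketch: first-order error propagation for r = volume / ICV with independent
# uncertainties. All numbers below are made up for illustration.

import math

vol, vol_sd = 3500.0, 120.0    # regional volume and its uncertainty (mm^3)
icv, icv_sd = 1.45e6, 3.0e4    # intracranial volume and its uncertainty (mm^3)

ratio = vol / icv
# For r = a/b with independent errors: (sd_r/r)^2 ~ (sd_a/a)^2 + (sd_b/b)^2
ratio_sd = ratio * math.sqrt((vol_sd / vol) ** 2 + (icv_sd / icv) ** 2)

print(f"normalized volume = {ratio:.5f} +/- {ratio_sd:.5f}")
```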
For starters: Using which device? MRI, CT, PET?
Images are calculated from the raw statistical data and have nothing to do with the OS.
It depends on the company, but for some equipment they provide the hardware as well.
You can't let the FreeSurfer developers off the hook for this.
Changes from version to version due to algo tweaks are acceptable, but if you're going to release precision software on multiple platforms, it's your responsibility to ensure that you properly encapsulate the underlying system, and achieve results as close to identical as possible across platforms. They are definitely not doing that right now. If you can't achieve the same results on a new platform, you should either a) not release yet, or b) release, but make it very, *very* clear to everyone that results are OS dependent.
That this discussion was brought on by a third party investigation shows that the FreeSurfer developers went with option c) irresponsibility.
My bet: somewhere they're using floating point numbers without thinking carefully about how those numbers are represented.
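A two-line illustration of that guess: decimal values like 0.1 have no exact binary floating-point form, and how you accumulate them changes the last digits of the answer, which iterative algorithms can then amplify:

```python
# Floating-point representation and accumulation order both matter.
import math

print(0.1 + 0.2 == 0.3)    # False: neither operand is exactly representable

vals = [0.1] * 10
print(sum(vals))           # 0.9999999999999999 (naive sequential accumulation)
print(math.fsum(vals))     # 1.0 (correctly rounded summation)
```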
b: I quite agree with you.
I have nothing to do with neuroscience but am a software developer. If you write a program that uses a specific formula / algorithm applied to a set of data to produce a result then that result should never change. I don't care what OS or version of the OS you are using if the result changes then you are doing something wrong. There are ways of isolating code from these issues.
The fact that this is happening in medical software, an area where details could be life changing, is very wrong!
Hi. I read the paper and it was pretty obvious that the neuroscientists were out of their depth trying to analyse a computer problem.
It is pretty obvious that different versions of the software could have modifications to the algorithm; that's probably the reason for releasing a new version!
But the difference between architectures is of course worrying. Floating point arithmetic and maths libraries should be old news and use well validated algorithms to prevent exactly this kind of inconsistency. Apart from the famous Pentium floating point bug, this sort of variation in result should only be caused by very subtle differences in handling of rounding, or possibly storing intermediate results at too low precision. (The processor can store intermediate results to 80 bits, truncating them to 64 bits when stored in memory.)
Something that does occur to me is that modern systems often use the GPU on the graphics card to handle arithmetic, and I would not be surprised if the algorithms in the GPU are not as well tested for conformity to the IEEE floating point spec as the libraries run on the host processor. No, this doesn't explain the difference between minor OS versions unless the newer versions differ in the precise operations they hand over to the GPU, for example. For a GPU, speed is more important than accuracy, as if a pixel is placed wrongly in a game, no one really cares.
One variation implied in the paper was that earlier versions of OSX ran in 32bit mode whereas the later version ran more processes in 64bit mode, but to the best of my knowledge, the difference between 32 and 64 bit mode is only to do with addressing and integer handling, specifically NOT floating point operations.
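One way such tiny precision and rounding differences become visible is when an algorithm then subtracts nearly equal large numbers (catastrophic cancellation); a toy float32 example:

```python
# Sketch: in 32-bit floats, values near one million are only representable to
# about 0.0625, so the difference of two nearly equal values keeps almost no
# meaningful digits - any library-level difference in the inputs gets magnified.

import numpy as np

a = np.float32(1_000_000.1)
b = np.float32(1_000_000.0)
print(a - b)   # 0.125 rather than 0.1: most of the precision is already gone
```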
This does seem to hint at some sort of numerical instability (due to looking at small differences between large numbers, for example). Differences in the math libraries between OS versions may be the problem. The developers should know that this can happen and ensure that it doesn't.
That said, this is a free software package written by many neuroscientists over many years to do their own work, which they then share with the community. The people writing the code are for the most part not computer scientists, but they've still done a rather impressive job of creating a pretty bulletproof suite of tools to do some very complicated analyses. Now that they know of the problem, I'm sure they are more motivated than anyone to get to the bottom of it. But not everybody who writes scientific analysis software is well versed in unit testing and the like, so I don't think people should be so harsh.
Also, regarding the comment:
"The fact that this is happening in medical software, an area where details could be life changing, is very wrong!"
This is absolutely wrong. This is research software, NOT medical software. You SHOULD NOT be using it for diagnoses. It doesn't have FDA clearance (which would require validation) and nobody ever said it did.
I know a guy, with a PhD in Mathematics who used to work for Cray, optimizing the inner-loops of client software to make the most of their hardware. These days, he works for an aerospace company, where, as best as I can understand, his job is to make sure that the people using the computational fluid dynamics software they license from NASA are using it in a way that gives valid results.
In areas of computational science where lives and/or millions of dollars of machinery are on the line, there is clearly some understanding that the tools are fallible and need to be validated and used properly. It sounds like those practices need to spread more widely.
I'm disturbed by how easily people are willing to allow FreeSurfer to escape responsibility. The reality is that FreeSurfer is accountable for the consistency of the results provided by their software; simply using the system-provided random number generator and then saying "results are inconsistent across platforms" is a defect and is not remotely acceptable. This kind of poor quality software wouldn't be tolerated in a multiplayer game, and the idea that we would hold software that has a medical use to a LOWER standard is ludicrous.
Oh, I should also add, this sounds like something that could be helped by running a test-suite whenever the application detects that it is running on a new machine.
From a superficial perusal of the source code, it looks like there are at least some tests already written.
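Something like the following would be a cheap version of that idea: keep a reference scan plus the volumes it produced on a trusted setup, and compare against them whenever the software lands on a new machine. The file names and the 0.5% tolerance are invented for the sketch; both files are assumed to be plain {region: volume} JSON dumps.

```python
# Sketch: compare this machine's output on a reference scan against stored
# results from a trusted setup, and warn if any region drifts too far.

import json

TOLERANCE_PCT = 0.5   # invented threshold

reference = json.load(open("reference_volumes.json"))  # from the trusted setup
current = json.load(open("current_volumes.json"))      # same scan, this machine

failures = []
for region, ref in reference.items():
    pct = 100.0 * abs(current[region] - ref) / ref
    if pct > TOLERANCE_PCT:
        failures.append((region, pct))

if failures:
    print("WARNING: this machine does not reproduce the reference results:")
    for region, pct in failures:
        print(f"  {region}: {pct:.2f}% off")
else:
    print("Reference results reproduced within tolerance.")
```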
bannedinboston: "This is research software, NOT medical software. You SHOULD NOT be using it for diagnoses. It doesn't have FDA clearance (which would require validation) and nobody ever said it did."
You're absolutely right, and it appears that the idea that FreeSurfer has a medical use originated on Gizmodo, which links to this post and says
"That means that not only might different hospitals choose to treat patients differently, but the same hospital could in theory change its diagnosis as a result of an IT upgrade."
Perhaps I should have made it clearer that FreeSurfer was purely of research interest; I assumed it was implicit but I guess not.
I'll tell Gizmodo to sort it out.
Don't forget NeuroQuant is FreeSurfer-derived and has FDA clearance.
Hmm. I didn't know that. That's rather interesting... but this wasn't a study of NeuroQuant and can't be assumed to apply to it.
Why has nobody from FreeSurfer commented on this article? Are they afraid?
Yes, we're sending a query to NeuroQuant about this (since we've been interested in their service). Since they run the analyses on their servers (or on yours if you get a very tightly specced configuration) they may not get a lot of variability, but unless they've identified and fixed the source of the variability, they may just be reliably converging on the wrong answer...
Ehm, Ed Gronenschild and his colleagues are Dutch, not German...
Yes they are. Silly me.
Last time I checked, FreeSurfer is FREE SOFTWARE! If Brain Voyager behaved like that, which it may well do, then I would want my $5,000+ back or I would sue them. But FreeSurfer is another story. Feel free to contribute your time and expertise to help them fix the code instead of complaining!
For all we know, commercial software such as Brainvoyager has similar problems. It's a shame the authors focused just on one package, and not its competitors.
Let me guess: the thing measures volume by tessellating. The volume, of course, has a dependence on the granularity of the tessellation. The granularity of the tessellation may depend on the CPU speed ... it may depend on the order in which multi-processing threads are scheduled by the operating system. The tessellation may depend on rounding errors in the least significant digit of the input data. And so on.
Precisely speaking, none of these differences are due to a "bug in the software"; they are more-or-less legitimate behaviors of the software. If so, then perhaps the failure is on the part of the users, the user interface, or even the documentation: a legitimate tool performing legitimate algorithms nonetheless has a systemic variability in it that is perhaps catching the users by surprise.
Anyway, just speculation...
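The granularity point is easy to demonstrate with a toy example: estimate the volume of a unit sphere by counting grid cells inside it, and the answer changes with the grid resolution even though the object never does (the resolutions below are arbitrary):

```python
# Toy illustration: a discretized volume estimate depends on the granularity
# of the discretization, not just on the object being measured.

import numpy as np

def sphere_volume_estimate(n):
    """Estimate unit-sphere volume by counting cells of an n^3 grid over [-1, 1]^3."""
    coords = (np.arange(n) + 0.5) / n * 2.0 - 1.0             # cell-centre coordinates
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    inside = x**2 + y**2 + z**2 <= 1.0
    return inside.sum() * (2.0 / n) ** 3                      # cells inside * cell volume

for n in (10, 20, 50, 100):
    print(n, sphere_volume_estimate(n))   # drifts towards 4*pi/3 ~ 4.18879 as n grows
```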
The authors' view
Dear Bruce, Doug, Nick, and other co-workers,
As authors of the recent paper “The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements” we would like to take the opportunity to respond to the numerous e-mails and reactions written on mailing lists and websites. We have never expected that our paper would have such an adverse impact. We have noticed that several so-called “journalists” have misinterpreted and erroneously extrapolated our conclusions. Even though this is not in our hands, we want to express our disappointment of these events and that we feel very uncomfortable with the created harassment.
We want to stress that it has never been our intention to put FreeSurfer in a bad light or to blame its developers; rather we wanted to quantify the effects that you frequently warned the users about. The results of our study confirm your recommendations and increase the awareness of such effects to (novice) users.
It is of course unavoidable that modifications to algorithms will produce different results; we, nevertheless, must admit that we were surprised that some effects were rather large.
We are using FreeSurfer frequently for our MRI data and we definitely intend to do so in the future. We are impressed by the efforts of your team to improve the algorithms and to extend the capabilities of FreeSurfer. We are most grateful to you and the other developers because these continuing efforts enabled and will enable us to perform our research in a valid and proper manner.
Unfortunately, the media attention gave the impression that our conclusions would only be directed to FreeSurfer. We have written in our discussion section that some of the conclusions may apply to other packages in the field of neuroimaging as well. We hope that you can acknowledge our sincere intention to examine and increase awareness of uncontrolled variation in MRI data analyses in general and not to pinpoint FreeSurfer. Therefore, we have also informed the users and developers of FSL about possible similar effects.
Ed Gronenschild,
Petra Habets,
Heidi Jacobs,
Ron Mengelers,
Nico Rozendaal,
Jim van Os,
Machteld Marcelis
I feel some responsibility for the negative coverage, as many of the early inaccurate articles linked to this blog post. I believe the story "went viral" after Gawker.com wrote a piece about this post.
I feel that my post was accurate and balanced - I emphasized that it was not necessarily a FreeSurfer specific issue, and that the authors of FreeSurfer should not be blamed because they specifically warn against mixing and matching versions.
However what I didn't do was to make it clear that FreeSurfer is purely research software and not for clinical use.
I assumed that this was implicit, and for most readers it was, but unfortunately some (namely Gawker) wrongly concluded that this was of direct medical relevance.
I actually wrote to Gawker as soon as I realized what was happening and asked them to correct it; they did, but by then it had spread far and wide.
In future however I will try to make caveats such as this one explicit to avoid misunderstandings.
@Finn, it might be a "bit of a problem for analysis with a [heterogeneous] computer cluster", but only a real problem if the job scheduling software is a malicious demon that knows which aspects of the data are of interest to you (e.g. which scans come from patients, which from controls) and then distributes the jobs to different hardware/software-versions so as to confound this. Otherwise, it's just a bit of extra variability (probably no worse than whether a subject was scanned in the morning or the afternoon, or how much they had to drink before).
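In that spirit, a small precaution on a heterogeneous cluster is simply to log which node (and OS) handled each subject, so you can later check that node assignment isn't confounded with group, or include it as a covariate. A rough sketch, where run_freesurfer is a placeholder for the real job:

```python
# Sketch: record the node and OS that processed each subject alongside the results.

import csv, platform, socket
from pathlib import Path

def process_subject(subject_id):
    # run_freesurfer(subject_id)   # placeholder for the actual analysis call
    return {
        "subject": subject_id,
        "node": socket.gethostname(),
        "os": platform.platform(),
    }

log = Path("node_log.csv")
new_file = not log.exists()
with log.open("a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject", "node", "os"])
    if new_file:
        writer.writeheader()                       # header only for a fresh log
    writer.writerow(process_subject("sub-001"))    # hypothetical subject ID
```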