Thursday, November 04, 2010

The real reason (climate) scientists don't want to release their code

Recently there have been three articles discussing the release of scientific software. Nature had a piece called Computational science: ...Error, the bloggers at RealClimate wrote about Climate code archiving: an open and shut case?, and Communications of the ACM has an article entitled Should code be released?.

Nestled in amongst the arguments about the scientific method requiring independent verification is what I believe is the real human motivation. Here's RealClimate:

Very often, novel methodologies applied to one set of data to gain insight can be applied to others as well. And so an individual scientist with such a methodology might understandably feel that providing all the details to make duplication of their type of analysis ‘too simple’ (that is, providing the code rather carefully describing the mathematical algorithm) will undercut their own ability to get future funding to do similar work. There are certainly no shortage of people happy to use someone else’s ideas to analyse data or model output (and in truth, there is no shortage of analyses that need to be done).

And here's Communications of the ACM:

"There are downsides [to releasing code]", says Alan T. DeKok, a former physicist who now serves as CTO of Mancala Networks, a computer security company. "You may look like a fool for publishing something that's blatantly wrong. You may be unable to exploit new 'secret' knowledge and technology if you publish. You may have better-known people market your idea better than you can and be credited with the work. [...]"


Let me make two observations about those fears.

1. When a scientist's results are later invalidated by others because the code was blatantly wrong (to the point of papers being retracted; see the story in the Nature article), they are going to look like much more of a fool. And, frankly, if they think their code is that bad, one has to wonder how they can think their paper is worth publishing.

2. The argument about others using your code seems bogus because if everyone released code then there would be (a) an improvement in code quality and (b) an 'all boats rise' situation as others could build on reliable code.

It's tragic that there's a conflict between science and scientific careers. But I think you can put aside the high-minded arguments about the integrity of the scientific method, and see the real reason (climate) scientists don't want to release their code: management of a scientific career and fear of looking foolish.

17 comments:

RBerenguel said...

Another reason not to release code is that "code is ugly". I have released a few pieces of code I use, but the code to generate some images in my only published paper (I'm a mathematician; the paper is about complex dynamics, and the programs mostly draw fractals) will not be released (any time soon, at least) for two reasons:

1) They are simple to replicate. Any undergrad could write the same code I did (maybe without some of the hacks I added, but a working version) in one or two afternoons; see the sketch after this list.

2) They are ugly as hell. I wanted the code to work, but didn't bother commenting or following any programming advice that I didn't find useful at that moment. Releasing such a piece of code is like self-publishing a maths paper written in Word... Not really good for your resumé.
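
To give a rough idea of what "simple to replicate" means here, a minimal sketch (in Python, invented purely for illustration; it is not my actual code) of the classic escape-time rendering that such fractal-drawing programs are built around:

    # Minimal escape-time Mandelbrot sketch, rendered as ASCII art.
    # Invented for illustration; real plotting code would write an image
    # instead of printing characters, but the core loop is the same.
    for row in range(24):
        line = ""
        for col in range(78):
            c = complex(-2.2 + 3.2 * col / 77, -1.2 + 2.4 * row / 23)
            z, n = 0j, 0
            while abs(z) <= 2 and n < 40:
                z = z * z + c  # the escape-time iteration z -> z^2 + c
                n += 1
            line += " " if n == 40 else ".:-=+*#%@"[n % 9]
        print(line)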

A lot of mathematicians' programs are like this: quick hacks to get the computations done, but with big ugly pieces.

But in mathematics the program is usually used to show what the proof was about (or to get some insight into what you are able to prove). There is something consistent in the back room.

If your code is your paper, you have to polish it and publish it (at least, that's what I think; YMMV). If your code is just used to show how your theory is applied, or was used just to find out what you could and could not do (with proofs), I don't think publishing it is really necessary; but of course, releasing it helps, even in a small way, to improve research as a whole.

Cheers,

Ruben

F said...

I find it an interesting observation that scientists in the climate field seem to make it as difficult as possible, or even impossible, to replicate their work.

This might be a problem in more scientific fields than just climate. While I was getting my BS in biochemistry, I worked on making a GUI for a molecular mechanics package that is notoriously difficult to use.

My professor and I were going to publish a methods paper. He then contacted the author of the software, a Harvard professor, who informed him that the software was deliberately difficult to use and that it should stay that way. So we didn't publish the methods paper.

ses said...

The problem with not releasing code is that it can suggest a lack of transparency, leading to fears about whether the author has actually done what they've claimed, and how.

While I think publishing code with a scientific report is unnecessary, I do think it should be available somewhere, or on request, unless it cannot be divulged for commercial/IP reasons.

The problem is everyone's too precious about their code and scared of it being picked apart. Well, here's a newsflash: most scientific papers never actually get read, so the chances of someone bothering to take the time to figure out that you've not implemented some obscure piece of code as efficiently as possible are pretty low.

I'd actually be quite flattered if someone took the time to take my code to pieces.

Jon Hendry said...

Eh. You can't reuse another scientist's mice, either. You have to get your own and follow the described procedures.

Jon Hendry said...

"I find it interesting the observation that scientists in the climate field seem to make it as difficult as possible, or impossible to replicate their work"

I think part of the problem has been that they're sometimes working with data which is collected and controlled by various agencies and organizations, who have varied policies on data sharing.

The scientists may not be able to share the data they worked from, but they can tell you who to contact to request your own copy of the data. Who knows if you'll get it, though.

BruceA said...

This reminds me of the jealousies and tactics revealed in James Watson's book The Double Helix. Watson and Crick were racing with Linus Pauling to discover the structure of DNA, and they went to great lengths to hide their research from him.

Rather than share the knowledge and possibly reach the answer quicker, Watson and Crick preferred to keep the glory to themselves. Of course, if Pauling had published first, Watson and Crick's secrecy would have worked against them.

sore_ron said...

I think that if the work you are doing has implications for people's well-being (drug testing, etc.) and the code is not available, then how do we know whether the conclusions drawn are valid?

If I say my wonder drug/product has been simulated in a computer and provides a 100% accurate result... need I say more?

davidshipley said...

Jon,
I think you have taken their explanations at face value, and on the face of it they are perfectly reasonable. However, the "Team" (their terminology) are damned by their selective approach to the distribution of their data: some were granted access and some were denied, depending on whether they were viewed as friendly or hostile. The confidentiality part was just a post hoc rationalisation. Also, the big problem between Jones and McIntyre was that McIntyre was trying to understand what process Jones had applied to the raw data, and which subset he had chosen, so a pass-through to the original source data was of no help.

John Graham-Cumming said...

Yes, I have taken them at face value. I'd rather argue with the argument than with something that I only imagine to be the argument.

sore_ron said...

I think that if your code is used as 'proof' ("my drug/product has undergone simulations and is 100% effective"), then it should be published with the paper.

Peer review, though, is a farce... see how Climatic Change was used to allow certain papers to be 'published' so they could be used by the IPCC.
Do a search for 'CASPER and the Jesus Paper' if you don't believe it.

rossmeissl said...

I think John is exactly right about this. I gave a talk about this back in September, trying to argue that, especially in climate science, where methodologies are constantly evolving, the key assurance of quality is transparency. My company, Brighter Planet, does real-time carbon calculations via web services, so this is an ongoing concern for us. The steps we've taken so far:

* We release all of our carbon models as open-source code under the AGPL.

* We provide custom-generated methodology documentation like this for every one of our calculations.

* We're using Rocco to do literate-style documentation of our methodologies, with specific focus on compliance with standard carbon accounting protocols.

The sad truth is that if you enter the same input data into each of the hundreds of carbon calculation software packages available now, you're going to get a different answer each time. To be fair, all science is uncertain, so this is somewhat to be expected. But with transparency--especially documentation and open-source code--at least we'll know why.

ZT said...

Bogus arguments in favor of 'hiding' code abound. For example there is a claim in the Mann inquiry that...

‘Moreover, because he [Mann] developed his source codes using a specific programming language (FORTRAN 77), these codes were not likely to compile and run on computer systems different from the ones on which they were developed (e.g., different processor makes/models, different operating systems, different compilers, different compiler optimizations).’

Which is utter nonsense.

Scientific Fortran 77 programs are portable – hence people publish books like 'Numerical Recipes in Fortran 77': http://www.nrbook.com/a/bookfpdf.php

For reference, the full Mann whitewashing is available here:
http://live.psu.edu/fullimg/userpics/10026/Final_Investigation_Report.pdf

Mike said...

The problem with any program code of significant size and complexity is that it probably has bugs. This is something that the computer science fraternity have been grappling with for some time.

It concerns me that many scientific papers now rely on computer programs - "simulations", "models". The published science is going to be no better than the quality of the programs.

Yet I have not often seen scientists using the sorts of techniques that computing professionals use to combat buggy code, e.g. test-driven development and thorough test suites at the unit and functional levels. Even open source, with open scrutiny of all code, is another good technique.
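
To make that concrete, here is a minimal sketch (in Python, using the standard unittest module; the trend routine and the test cases are invented for illustration) of the kind of unit test I mean:

    # A hypothetical analysis routine plus unit tests that pin down its
    # behaviour on inputs with known answers.
    import unittest

    def linear_trend(values):
        """Least-squares slope of values against their indices 0..n-1."""
        n = len(values)
        mean_x = (n - 1) / 2
        mean_y = sum(values) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
        var = sum((x - mean_x) ** 2 for x in range(n))
        return cov / var

    class TestLinearTrend(unittest.TestCase):
        def test_recovers_known_slope(self):
            # y = 2x + 1 must give a slope of exactly 2
            self.assertAlmostEqual(linear_trend([2 * x + 1 for x in range(10)]), 2.0)

        def test_constant_series_has_zero_slope(self):
            self.assertAlmostEqual(linear_trend([5.0] * 10), 0.0)

    if __name__ == "__main__":
        unittest.main()

Even a handful of tests like these catch the sign errors and off-by-one mistakes that otherwise end up in published results.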

Add to that the use of programming languages that have lower productivity and higher bug rates than some of the more "modern" languages created to make code development easier and more reliable (Fortran comes to mind as one of the older, less reliable languages...).

It seems to me as if scientists need to get more advice and help from their brethren in the computer science field...

jeremyrainman said...

The argument that others might simply copy your code to replicate your work is utter nonsense.

When you publish, you are both opening your methods to public scrutiny AND you are putting your original work in print with a date on it so that it becomes clear who came up with what method first. Science has *always* worked this way. The algorithms used in data-crunching software are no different.

For the argument of protecting one's own methods from the unscrupulous to work, we would have to ensure that NO scientists share their code. This in turn means that no scientists who deal primarily in code-driven research could accurately share their methods with others. I think you should be able to see where I'm going by now. If scientists cannot fully share their methods with others, SCIENCE DOESN'T ADVANCE.

Most scientists are (rightfully) trying to find reasons to dislike the methods of other scientists. In fact the arguments over method are how science advances in the first place.

It's sad just how deep this idea of intellectual property has sunk its evil roots. Now you have scientists, the very definition of the open-method, creative-commons, advance-human-understanding side of humanity, decrying having to share their code.

I wonder what human understanding would be like now if Newton had withheld his full explanation of calculus due to intellectual property concerns...

quickly-now said...

I read the article in the ACM mag and was pretty disappointed.

Some of the arguments presented for not releasing were completely spurious, and rather silly. For example, the Manhattan project.

As a general rule, nothing to do with national security ever gets released - the whole point of secrets is to keep them.

However, the state of ACADEMIC knowledge is a different matter. Obscurity obtained by publishing results without methods is not advancing the state of knowledge; it is only advancing the state of a point of view. Until something can be verified or disproved, it's merely an assertion.

Even ugly code should be released. And if the paper, or knowledge, or whatever, is good enough for publication, then part of the process can always be to clean up the ugly code.

All else is to rationalise the indefensible.

j said...

"It's tragic that there's a conflict between science and scientific careers"

Somehow. However, this is more critical for people who do not care about the quality of their work and just want to publish it. On a level playing field this would not be the case.

Just my 0.02 euro.

Janne Morén said...

There are two sensible arguments against publishing your code (apart from various legal obstacles that may - and sometimes do - apply):

* One-shot code for doing some analysis is often very difficult to understand and to use. It has lots of small gotchas, user-interface oddities and corner cases that you need to be aware of. The author (and original user) is well aware of them, while a naive user is not. Only the original author is likely to be able to run it as intended.

Also, the code is often very hard to understand; much harder than the underlying principles that have been published. A few equations and an elegant abstract algorithm in the paper may be a mess of thousands of lines of spaghetti code, relying on very specific versions of other buggy in-house or rare libraries and environments.

In other words, releasing the code would often be much less useful than you'd think. You are normally better off understanding the system by reading the published descriptions than by reading the largely incomprehensible source code of the analysis tool.

* The second argument is that you do want to encourage new, separate implementations. All non-trivial code is buggy, and if everyone uses the same piece of software you risk everyone being bitten by the same bugs.

If people implement their own tools based on the published system, you get real independent verification that the analysis really is correct.
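
As a sketch of the kind of cross-check this enables (in Python; both routines and the tolerance are invented for illustration), two independently written implementations of the same statistic can be run against each other:

    # Two independent implementations of the sample variance. Agreement
    # doesn't prove both are right, but disagreement proves at least one
    # is wrong - which is exactly the verification being described.
    import random

    def variance_two_pass(xs):
        """Textbook two-pass sample variance."""
        n = len(xs)
        mean = sum(xs) / n
        return sum((x - mean) ** 2 for x in xs) / (n - 1)

    def variance_welford(xs):
        """Welford's single-pass algorithm, written independently."""
        mean, m2 = 0.0, 0.0
        for k, x in enumerate(xs, start=1):
            delta = x - mean
            mean += delta / k
            m2 += delta * (x - mean)
        return m2 / (len(xs) - 1)

    random.seed(42)
    data = [random.gauss(10.0, 3.0) for _ in range(10000)]
    a, b = variance_two_pass(data), variance_welford(data)
    assert abs(a - b) < 1e-9 * max(abs(a), abs(b)), (a, b)
    print("implementations agree:", a)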


And really, a second, independent implementation is much easier than the first, hideous, messy one. The reason a lot of academic code is so bad is that the development is exploratory - you really don't know exactly how you'll solve your problem when you start, so you try various approaches over time, and the resulting code will reflect that. A second implementation would be written with much more specific, well-defined aims, and would be much faster to create and much cleaner and easier to use too.