Thursday, April 18, 2013

The importance of open code

Last February, Professor Darrel Ince, Professor Les Hatton and I had a paper published in Nature arguing for openness in the code used for scientific papers. The paper is called The Case for Open Computer Programs.

In a coda to that piece Darrel wrote the following:
Our intent was not to criticise; indeed we have admiration for scientists who have to cope with the difficult problems we describe. One thesis of the article is that errors occur within many systems and does not arise from incompetence or lack of care. Developing a computer system is a complex process and problems will, almost invariably, occur. By providing the ability for code to be easily perused improvement will happen. This is the result detailed in both the boxes in the article: the Met Office data is more accurate, admittedly by a small amount, and because of feedback to developers the geophysical software was considerably improved.
Recently, an important paper in economics has been in the news because its conclusions turn out to be inaccurate for a number of reasons. One of those reasons is a programming error using the popular Microsoft Excel program. This error, in an unreleased spreadsheet, highlights just how easy it is to make a mistake in a 'simple' program and how closed programs make reproducing results difficult.

The original paper by Reinhart and Rogoff is Growth in a Time of Debt and it concludes the following:
[...] the relationship between government debt and real GDP growth is weak for debt/GDP ratios below a threshold of 90 percent of GDP. Above 90 percent, median growth rates fall by one percent, and average growth falls considerably more.
They point to a serious problem with growth rates once the debt/GDP ratio is above 90%. As this is an important economic topic at the moment, other economists have attempted to replicate their findings from the original data. One such reproduction is Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff, which finds:
Herndon, Ash and Pollin replicate Reinhart and Rogoff and find that coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period. They find that when properly calculated, the average real GDP growth rate for countries carrying a public-debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not -0.1 percent as published in Reinhart and Rogoff. That is, contrary to RR, average GDP growth at public debt/GDP ratios over 90 percent is not dramatically different than when debt/GDP ratios are lower.
The coding error referred to there is a mistake in an Excel spreadsheet that excluded data for certain countries. And the original authors have admitted that this reproduction is correct:

On the first point, we reiterate that Herndon, Ash and Pollin accurately point out the coding error that omits several countries from the averages in figure 2.  Full stop.   HAP are on point.   The authors show our accidental omission has a fairly marginal effect on the 0-90% buckets in figure 2.  However, it leads to a notable change in the average growth rate for the over 90% debt group.
All this brought to mind my own discovery of errors in code (first error, second error) written by the Met Office. Code that was not released publicly.

There's a striking similarity between the two situations. The errors made by the Met Office and by Reinhart and Rogoff were trivial and in the same type of code. The Met Office made mistakes calculating averages, as did Reinhart and Rogoff. Here's the latter's spreadsheet with the error:

The reality of programming is that it is very easy to make mistakes like this. I'll repeat that: very easy. Professional programmers do it all the time (their defense against this type of mistake is to have suites of tests that double check what they are doing). We should expect errors like this to be occurring all the time.
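To make that concrete, here is a minimal sketch in Python of the kind of mistake involved and the kind of check that catches it. Everything in it is hypothetical: the country labels, the growth figures and the function names are invented for illustration and are not taken from the Reinhart and Rogoff spreadsheet or from the Met Office code. It only assumes that the error has the general shape described above, an average computed over a range that stops short of the full data.

```python
# Hypothetical growth figures (percent) for five made-up countries.
# These are NOT the Reinhart and Rogoff data.
growth_by_country = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 4.0, "E": 2.5}


def average_of_first_rows(values, last_row):
    """Average only the first `last_row` values, like a spreadsheet
    AVERAGE over a fixed range that may stop too early."""
    selected = values[:last_row]
    return sum(selected) / len(selected)


def checked_average(values, expected_count):
    """Average all values, but refuse to run if the number of values
    is not what the caller expects."""
    if len(values) != expected_count:
        raise ValueError(
            f"expected {expected_count} values, got {len(values)}"
        )
    return sum(values) / len(values)


values = list(growth_by_country.values())

# The range stops at row 3, so countries D and E are silently dropped.
truncated = average_of_first_rows(values, 3)

# The checked version is told how many countries there should be.
full = checked_average(values, expected_count=len(growth_by_country))

print(truncated, full)
```

The truncated average produces a plausible-looking number with no warning at all, which is what makes this class of error so easy to miss. A test as simple as asserting that the number of rows averaged equals the number of countries in the data set would flag it immediately.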

What's vital is that scientists (including the dismal kind) consider their code (be it in Excel or another language) as an important product of their work. Publishing of data and code must become the norm for the simple reason that it makes spotting errors like this very, very quick.

If Herndon, Ash and Pollin had had access to the original Excel spreadsheet along with the data, they would have very quickly been able to see the original authors' error. In this case Excel even highlights for you the cells involved in the average calculation. Without the spreadsheet they were forced to do a ground-up reproduction. In this particular case they couldn't get the same results as Reinhart and Rogoff and had to ask them for the original code.

An argument against openness in code is that bad code may propagate. I call this the 'scientists protecting other scientists from themselves' argument and believe it is a bad argument. It is certainly the case that it's possible to take existing code and copy it and in doing so copy its errors, but I believe that the net result of open code will be better science not worse. Errors like those created by the Met Office and Reinhart and Rogoff can be quickly seen and stamped out while others are reproducing their work.

A good scientist will do their own reproduction of a result (including writing new code); if they can't reproduce a result then, with open code, they can quickly find out why (if the reason is a coding error). With closed code they cannot and science is slowed.

It is vital that papers be published with data and code for the simple reason that even the best organizations and scientists make rudimentary errors in code that are hard to track down when the code is closed.

PS It's a pity that, one year after the Met Office argued for open data and code, the code to reproduce CRUTEM4 is yet to be released. I hope that, one day, when papers are published the code and data will be available at the same time. We have the networking technology and storage space to do this.

