Thursday, November 24, 2011

Anatomy of a CRUTEM3 bug

Back in February 2010 I spotted (with the help of a reader) a bug in the software used to generate the CRUTEM3 data from the Met Office. This was subsequently confirmed by the Met Office as having a small effect of overestimating the error bounds on the temperature trend.

Back then I wrote:
I have no idea why the correction given in this blog post by Ilya and I works: perhaps it indicates a genuine bug in the software used to generate CRUTEM3, perhaps it means Ilya and I have failed to understand something, or perhaps it indicates a missing explanation from Brohan et al. I also don't understand why when there are less than 30 years of data the number 30 appears to still be used.
I had no idea, because until today I had never seen the source code used for the generation of CRUTEM3. But the recent release of another batch of emails from CRU led to an intriguing web site that had been hidden from view, but publicly accessible if you knew the URL. In one of the emails a Met Office scientist called Philip Brogan proposes (on October 8, 2007) to release the CRUTEM3 source code having made a copy of it onto his personal site:
We could also put the source code online - I've copied the internal software and documentation to a temporary site so you can see it (http://xxxxxxxx/xxxxxx/xxxxxx/crutem3/xxxxxxx/ - please keep this URL secret) - this could be added to if that would help to damp the conspiracy theories. What do you think?
Amazingly, as reported here the 'secret URL' is still working four years later with a complete archive (apparently) of the CRUTEM3 code.

One particular file called contains the function that performs the 'station error' calculation I had looked at way back:
# Calculate the station error for a grid cell
sub station_error {

    my $Climatology = shift;
    my $Sd          = shift;

    # Climatology error
    my $Clim_error = 0;
    for ( my $i = 0 ; $i < scalar(@$Climatology) ; $i++ ) {
        if ( $Climatology->[$i] eq 'D' ) {
            $Clim_error += ( $Sd->[$i]**2 ) / 30;
        }    # Small error in data-based climatology
        elsif ( $Climatology->[$i] eq 'E' ) {
            $Clim_error += $Sd->[$i]**2 / 15;
        }    # Modest error in extrapolated climatology
        elsif ( $Climatology->[$i] eq 'W' ) {
            $Clim_error += 0.3 * $Sd->[$i]**2;
        }    # Largest error in WMO climatology
        else { die "Dud value for climatology flag: $Climatology->[$i]"; }
    $Clim_error /= scalar(@$Climatology);

    # Observations error
    my $Obs_error =
      ( 0.03**2 ) / scalar(@$Climatology)
      ;      # 0.03C per station observational error

    # Homogeneity error
    my $Hom_error =
      ( 0.4**2 ) / scalar(@$Climatology);   # 0.4C per station homogeneity error

    # Return the total error
    my $Total_error = sqrt( $Obs_error + $Clim_error + $Hom_error );

    return $Total_error;
The code is written in Perl (as was the code that the Met Office released for calculating partial CRUTEM3 results). It's satisfying to see this file because it answers my questions above and is a simple example of how errors happen in real code. I can easily imagine myself making a similar error.

First, the answer to my question "I also don't understand why when there are less than 30 years of data the number 30 appears to still be used." is simple. The number 30 is hard-coded. That's a small difference with the published paper.

The answer to why the correction (dividing the number 30 by the number of stations in the grid square) works requires following through the code and its related mathematics. Brace yourself from some forumlae. What I believe happened was that the programmer sensibly realized that there was no point doing all these square roots and then squaring the result and the forumla could be simplified as it was being turned into code. This sort of simplification is a good idea because it helps reduce floating point errors that might get introduced. But in the simplification process a term got lost.

Here's a review of the key terminology and the relationship with the code:

And here's a simplification of the formula that shows how to go from what's in the paper to what's in the code:

The $Hom_error and the $Obs_error are correctly calculated by the code, but the $Clim_error is not. The critical line is $Clim_error /= scalar(@$Climatology);. The line either needs to be repeated or changed to $Clim_error /= (scalar(@$Climatology)*scalar(@$Climatology));.

The effect is that rather than dividing by n^2 the formula is divided by n. And that explains what I saw all that time ago. It's as if the value of m (30 in the code) is divided by the number of stations n. If that's done the 1/n cancels one of the ns in n^2.

What had puzzled me at the time was why the programmer would have introduced 'dividing the number of stations by n'. He didn't, it's just that the effect is the same.

1 comment:

Francis Turner said...

Just in case they decide to remove that archive I've mirrored it to and added that tree to my FOIA search tool (which also includes all the new emails)

I note that the email breaking this discovery (2955) is a reply to 2024 and both have extracts from an Ian Harris project log that doesn't make it into HARRY_READ_ME (though harry step 22 is clearly related).

Also 2955 clearly demonstrates the abysmal lack of reproducability that seems to permeate this whole project:

" I'm sorry to admit that I can't find the 4349 list, or the program I
used to find that number. Attempting to reproduce it now, I get the same
count as you: 4138 stations provide at least 1 observation to CRUTEM3.
I can't think of any set of 4349 stations that I might have got by
mistake, so I suspect that it's a typo - It's certainly an error, and
(unless you object) I'll write to Steve McIntyre to say so (and add a
note to the CRUTEM3 web page)."