Wednesday, September 08, 2010

On the 5 sigma 'problem' in CRUTEM

Over on the Watts Up With That blog there was a story entitled Analysis: CRU tosses valid 5 sigma climate data. When I saw the headline I thought, "Yes, that's right, it's specifically mentioned in 'Uncertainty estimates in regional and global observed temperature changes: A new data set from 1850' that outlier anomalies above (and below) 5 sigma are removed. What's the big deal?"

Of course, to the readers of WUWT the big deal is the term 'valid' data. It's a tenet of climate skeptics that CRU was up to all sorts of shenanigans with data, and that no stone must be left unturned in searching for them. For WUWT readers the fact that the blog post highlighted a year when it was exceptionally cold adds fuel to the fire: since 1936 was really, really cold, its anomaly exceeded 5 sigma and so was excluded. It's not hard to see how someone who's a climate skeptic could think that implies the temperature trend comes out incorrectly warm.

But let's ignore the subtext and look at some of the claims. The blog post states: "When they toss 5 sigma events it appears that the tossing happens November through February". That's not hard to check: let's use the Met Office's own program to verify it. I made a small modification to the code available here.
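To make the exclusion rule concrete, here is a minimal Python sketch of the kind of check involved, assuming a per-station, per-month standard deviation against which each monthly anomaly is compared. The function name and data layout are my own illustration, not the Met Office's actual code.

```python
# Sketch of a 5-sigma exclusion rule: a monthly anomaly is dropped when
# it lies more than 5 of that month's standard deviations from zero.
# Names and data shapes here are illustrative, not the real program's.

def exclude_outliers(anomalies, sd, limit=5.0):
    """Split (year, month, anomaly) records into kept and dropped lists.

    anomalies : list of (year, month, anomaly) tuples for one station
    sd        : dict mapping month (1-12) to that month's standard deviation
    """
    kept, dropped = [], []
    for year, month, anom in anomalies:
        if abs(anom) > limit * sd[month]:
            dropped.append((year, month, anom))
        else:
            kept.append((year, month, anom))
    return kept, dropped

# Example: a station whose January standard deviation is 2.0 C.
# A -11.0 C anomaly exceeds 5 * 2.0 = 10.0 and is excluded.
kept, dropped = exclude_outliers(
    [(1936, 1, -11.0), (1937, 1, 1.5)], {1: 2.0})
```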

Here's a chart showing the number of anomalies dropped by month.


For November through February there are a total of 219 anomalies removed. But many more are dropped in the summer. So the claim seems odd, but it rests on a further observation: that during those months the magnitude of the removed anomalies is greater (oddly, the analysis only looks at the top 100 months; I'm not sure why it couldn't look at them all).
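Tallying exclusions by month is a one-liner once you have the dropped records. A sketch, with made-up records standing in for the output of the modified Met Office program:

```python
from collections import Counter

# Hypothetical dropped records as (year, month, anomaly); in the real
# run these would come from the modified Met Office program.
dropped = [(1936, 1, -11.0), (1936, 2, -9.5), (1954, 7, 8.2)]

# Count exclusions per calendar month.
by_month = Counter(month for _, month, _ in dropped)

# How many fall in November through February?
winter = sum(by_month[m] for m in (11, 12, 1, 2))
```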

Here's a chart that shows the average anomaly excluded by month.


So the excluded anomalies are indeed largest in the winter. But how much difference do these exclusions make? There are 805 removed anomalies and 3,274,355 used, so only about 0.02% of the data is not used.
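The arithmetic behind that percentage:

```python
# 805 anomalies removed out of 805 + 3,274,355 total records.
removed = 805
used = 3_274_355
fraction = removed / (removed + used)
print(f"{fraction:.4%}")  # roughly 0.0246%
```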

Now using the same program it's possible to plot a graph showing two different pieces of information to see whether dumping this tiny fraction of data makes any real difference.

1. The original trend, with the 5 sigma removal.
2. The trend with all the data included (i.e. nothing excluded).
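The mechanics of that comparison can be sketched as follows: build the annual series twice, once from filtered data and once from everything, then difference them. The data below is made up (and deliberately exaggerates the effect) just to show the machinery; the real calculation is the Met Office program's.

```python
def annual_mean(records):
    """records: list of (year, anomaly); returns {year: mean anomaly}."""
    totals = {}
    for year, anom in records:
        totals.setdefault(year, []).append(anom)
    return {y: sum(v) / len(v) for y, v in totals.items()}

# Toy data: one extreme anomaly in 1936 that a cut at 10.0 would remove.
all_data = [(1936, -1.0), (1936, -11.0), (1937, 0.5)]
filtered = [(y, a) for y, a in all_data if abs(a) <= 10.0]

mo_all = annual_mean(all_data)        # nothing excluded
mo_original = annual_mean(filtered)   # with the cut applied
diff_1936 = mo_all[1936] - mo_original[1936]
```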

Here's that picture.


mo_original is the untouched output of the Met Office's program; mo_all is the same run with the 805 excluded data points put back in. So, as might be expected from such a small amount of data, including them makes almost no difference at all.

1 comment:

Bob said...

If you exclude 5 sigma events, that should happen about once in 3M samples. Much more likely it is a data entry error (about once in 1k samples for manually entered data).

If you are excluding many more events, and you know that they are not data entry errors, then the assumption that the underlying data is Gaussian has been proven wrong.

The distribution is much more fat-tailed.

This seems worth looking into, not passed over.
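Bob's figures are easy to check with the standard library. The sketch below computes the Gaussian tail probability beyond 5 sigma and the number of exclusions one would then expect across the roughly 3.27M CRUTEM records (the total counts come from the post above; the Gaussian assumption is exactly what Bob is questioning).

```python
import math

# One-sided probability of a standard normal exceeding 5 sigma,
# via the complementary error function: P(Z > 5) = erfc(5/sqrt(2))/2.
p_one_tail = 0.5 * math.erfc(5 / math.sqrt(2))   # about 2.9e-7, i.e. ~1 in 3.5M
p_two_tail = 2 * p_one_tail

# If all 805 + 3,274,355 anomalies were exactly Gaussian, the expected
# number of |anomaly| > 5 sigma events would be only about 2 -- far
# fewer than the 805 actually removed, which is Bob's fat-tails point.
expected = 3_275_160 * p_two_tail
```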