## Monday, November 30, 2009

### Bugs in the software flash the message 'Something's out there'

The more I look at the software used by the folks at CRU, the more I think: "these guys seriously need to hire a professional programmer." The code is mostly an undocumented, untested tangled mess of little programs. Ugh.

Oh, and it's buggy.

My old colleague Francis Turner found a lovely example of something that's the work of either a genius or a fool (or perhaps a mad genius). To calculate information about the influence of one weather station on another (which requires working out how far apart they are by the great circle route between them) the code draws little coloured circles (actually little 32-sided polygons) on a virtual white screen and then goes looking for non-white pixels to identify areas for which data is missing.

Here's a snippet (full source):
` for i=0.0,nel do begin  x=cos(a)*(xkm/110.0)*(1.0/cos(!pi*pts1(i,0)/180.0))+pts1(i,1)  x=x<179.9 & x=x>(-179.9)  y=sin(a)*(xkm/110.0)+pts1(i,0)  y=y>(-89.9) & y=y<89.9  catch,error_value   ; avoids a bug in IDL that throws out an occasional   ;  plot error in virtual window  if error_value ne 0 then begin   error_value=0   i=i+1   goto,skip_poly  endif     polyfill,x,y,color=160  skip_poly: endfor`

The first bug appears to be in IDL itself. Sometimes the polyfill function will throw an error. This error is caught by the catch part and enters the little if there.

Inside the if there's a bug, it's the line i=i+1. This is adding 1 to the loop counter i whenever there's an error. This means that when an error occurs one set of data is not plotted (because the polyfill failed) and then another one is skipped because of the i=i+1.

Given the presence of two bugs in that code (one which was known about and ignored), I wonder how much other crud there is in the code.

To test that I was right about the bug I wrote a simple IDL program in IDL Workbench. Here's a screen shot of the (overly commented!) code and output. It should have output 100, 102, 103 but the bug caused it to skip 102.

Also, and this is a really small thing, the code error_value=0 is not necessary because the catch resets the error_value.

BTW Does anyone know if these guys use source code management tools? Looking through the released code I don't see any reference to SCM.

Notes.

If, like me, you are trying to actually read and understand the code you might like to know the following:

1. The value 110.0 there is, I believe, the number of km in a degree of longitude at the equator (40,000km of circumference / 360 = 111.1). It is used to convert the 'decay' distance in xkm to degrees.

2. The (1.0/cos(!pi*pts1(i,0)/180.0)) is used to deal with the fact that km 'go further' in terms of degrees longitude when you are not on the equator. This value elongates the polygon so that it 'correctly' covers the stations.

3. The entire thing is plotted on a 144 x 72 display because there are 2.5 degrees in each square grid and so 360 degrees of longitude / 2.5 = 144 and 180 degrees of latitude / 2.5 = 72.

In the HARRY_READ_ME.TXT file there's commentary about a problem where the sum of squares (which should always be positive) suddenly goes negative. Here's what it says:
`17. Inserted debug statements into anomdtb.f90, discovered thata sum-of-squared variable is becoming very, very negative! Keyoutput from the debug statements:OpEn= 16.00, OpTotSq= 4142182.00, OpTot= 7126.00DataA val = 561, OpTotSq= 2976589.00DataA val = 49920, OpTotSq=-1799984256.00DataA val = 547, OpTotSq=-1799684992.00DataA val = 672, OpTotSq=-1799233408.00DataA val = 710, OpTotSq=-1798729344.00DataA val = 211, OpTotSq=-1798684800.00DataA val = 403, OpTotSq=-1798522368.00OpEn= 16.00, OpTotSq=-1798522368.00, OpTot=56946.00forrtl: error (75): floating point exceptionIOT trap (core dumped)..so the data value is unbfeasibly large, but why does thesum-of-squares parameter OpTotSq go negative?!!Probable answer: the high value is pushing beyond the single-precision default for Fortran reals?`

The 'probable answer' is actually incorrect. If you examine the code you'll find that DataA is declared as integer type which is a signed 4 byte number in most Fortran. That means its maximum positive value is 2147483647 but the program is doing 49920-squared which is 2492006400 (which is 344522753 bigger than the maximum value).

When this happens the number wraps around and goes negative. It then gets added to OpTotSq (which is a real) and the entire thing goes negative.

Here's the code:
`    do XAYear = 1, NAYear          if (DataA(XAYear,XMonth,XAStn).NE.DataMissVal) then            OpEn=OpEn+1            OpTot=OpTot+DataA(XAYear,XMonth,XAStn)            OpTotSq=OpTotSq+(DataA(XAYear,XMonth,XAStn)**2)          end if    end do`

Oddly, the solution given is to change the incoming data file to eliminate the value that causes the algorithm to go wrong. i.e. the actual algorithm was not fixed.
`Action: value replaced with -9999 and file renamed:`

Dustin Mitchell said...

I *love* your commentary on this code, by the way - keep it up!

dirty dingus said...

I incorpoated this critique in my follow up post - http://di2.nu/200912/01.htm

I also took the time to add searching of the HARRY_READ_ME file and shortly I'll also add searching of the code files as well

mrsean said...

John,

I came the same conclusion WRT change comments in the code.

I my own code and systems I explicitly caution against leaving a trail of revision history in the code as, IMO, the comments in the code should reflect only the current state of the code: the history of the code should reside in checkin comments etc. on the tool used to manage the code.

So it isn't a given that this management doesn't exist (as I'm sure you realise). But we have no indication that any formal method of control is used. I might expect to see at least an automatically updated date, author, and latest revision for instance.

My gut feeling would be that there is no formal method, perhaps multiple copies with slightly different date-based filenames in the manner of 20 year old practices.

johnjackson said...

You said

Also, and this is a really small thing, the code error_value=0 is not necessary because the catch resets the error_value.

But if catch resets error_value (to 0), then the block beginning with

if error_value ne 0 then begin

will never happen. I think that error_value=0 is needed.