Monday, November 30, 2009

Bugs in the software flash the message 'Something's out there'

The more I look at the software used by the folks at CRU, the more I think: "these guys seriously need to hire a professional programmer." The code is mostly an undocumented, untested tangled mess of little programs. Ugh.

Oh, and it's buggy.

My old colleague Francis Turner found a lovely example of something that's the work of either a genius or a fool (or perhaps a mad genius). To calculate information about the influence of one weather station on another (which requires working out how far apart they are by the great circle route between them) the code draws little coloured circles (actually little 32-sided polygons) on a virtual white screen and then goes looking for non-white pixels to identify areas for which data is missing.

Here's a snippet (full source):
 for i=0.0,nel do begin
x=x<179.9 & x=x>(-179.9)
y=y>(-89.9) & y=y<89.9
; avoids a bug in IDL that throws out an occasional
; plot error in virtual window
if error_value ne 0 then begin


The first bug appears to be in IDL itself. Sometimes the polyfill function will throw an error. This error is caught by the catch part and enters the little if there.

Inside the if there's a bug, it's the line i=i+1. This is adding 1 to the loop counter i whenever there's an error. This means that when an error occurs one set of data is not plotted (because the polyfill failed) and then another one is skipped because of the i=i+1.

Given the presence of two bugs in that code (one which was known about and ignored), I wonder how much other crud there is in the code.

To test that I was right about the bug I wrote a simple IDL program in IDL Workbench. Here's a screen shot of the (overly commented!) code and output. It should have output 100, 102, 103 but the bug caused it to skip 102.

Also, and this is a really small thing, the code error_value=0 is not necessary because the catch resets the error_value.

BTW Does anyone know if these guys use source code management tools? Looking through the released code I don't see any reference to SCM.


If, like me, you are trying to actually read and understand the code you might like to know the following:

1. The value 110.0 there is, I believe, the number of km in a degree of longitude at the equator (40,000km of circumference / 360 = 111.1). It is used to convert the 'decay' distance in xkm to degrees.

2. The (1.0/cos(!pi*pts1(i,0)/180.0)) is used to deal with the fact that km 'go further' in terms of degrees longitude when you are not on the equator. This value elongates the polygon so that it 'correctly' covers the stations.

3. The entire thing is plotted on a 144 x 72 display because there are 2.5 degrees in each square grid and so 360 degrees of longitude / 2.5 = 144 and 180 degrees of latitude / 2.5 = 72.

In the HARRY_READ_ME.TXT file there's commentary about a problem where the sum of squares (which should always be positive) suddenly goes negative. Here's what it says:
17. Inserted debug statements into anomdtb.f90, discovered that
a sum-of-squared variable is becoming very, very negative! Key
output from the debug statements:

OpEn= 16.00, OpTotSq= 4142182.00, OpTot= 7126.00
DataA val = 561, OpTotSq= 2976589.00
DataA val = 49920, OpTotSq=-1799984256.00
DataA val = 547, OpTotSq=-1799684992.00
DataA val = 672, OpTotSq=-1799233408.00
DataA val = 710, OpTotSq=-1798729344.00
DataA val = 211, OpTotSq=-1798684800.00
DataA val = 403, OpTotSq=-1798522368.00
OpEn= 16.00, OpTotSq=-1798522368.00, OpTot=56946.00
forrtl: error (75): floating point exception
IOT trap (core dumped) the data value is unbfeasibly large, but why does the
sum-of-squares parameter OpTotSq go negative?!!

Probable answer: the high value is pushing beyond the single-
precision default for Fortran reals?

The 'probable answer' is actually incorrect. If you examine the code you'll find that DataA is declared as integer type which is a signed 4 byte number in most Fortran. That means its maximum positive value is 2147483647 but the program is doing 49920-squared which is 2492006400 (which is 344522753 bigger than the maximum value).

When this happens the number wraps around and goes negative. It then gets added to OpTotSq (which is a real) and the entire thing goes negative.

Here's the code:
    do XAYear = 1, NAYear
if (DataA(XAYear,XMonth,XAStn).NE.DataMissVal) then
end if
end do

Oddly, the solution given is to change the incoming data file to eliminate the value that causes the algorithm to go wrong. i.e. the actual algorithm was not fixed.

Action: value replaced with -9999 and file renamed:

The 'very artificial correction' flap looks like much ado about nothing to me

The other day I posted that some code from the CRU Hack that people were shouting about was actually dead code. It was subsequently pointed out to me that there was another version in which the code was not dead.

So, I figured I'd have a look at it and see what was going on and try to read the scientific papers behind this. I've come to the conclusion "move along, there's nothing to see here".

Firstly, here's couple of segments of the code that does the 'artificial correction':
2.5,2.6,2.6,2.6,2.6,2.6]*0.75 ; fudge factor
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'

Since I don't have the datasets that underlie this code I can't actually execute it, but I can follow it. Here's a line-by-line explanation of what's happening:

yrloc=[1400,findgen(19)*5.+1904]: this creates a list of numbers which are stored in yrloc. The list starts with the number 1400 and then consists of 1904, 1909, 1914, 1919, etc. all the way up to 1994. findgen(19) creates the list 0, 1, 2, up to 18. The *5. multiplies these numbers by 5 to obtain the list 0, 5, 10, up to 90. The +1904 adds 1904 to them to get the list 1904, 1909, 1914 up to 1994.

So the final value of yrloc is [1400, 1904, 1909, 1914, 1919, 1924, 1929, 1934, 1939, 1944, 1949, 1954, 1959, 1964, 1969, 1974, 1979, 1984, 1989, 1994]. The square brackets around the numbers make it into a list (which is called a vector in this computer language).

valadj=[0., 0., 0., 0., 0., -0.1, -0.25, -0.3 ,0. ,-0.1 ,0.3 ,0.8 , 1.2, 1.7, 2.5, 2.6, 2.6, 2.6, 2.6, 2.6]*0.75 creates a list called valadj where each element of the list in square brackets is multipled by 0.75 (that's the *0.75 bit).

So the final value stored in valadj is [0, 0, 0, 0, 0, -0.075, -0.1875, -0.225, 0, -0.075, 0.225, 0.6, 0.9, 1.275, 1.875, 1.95, 1.95, 1.95, 1.95, 1.95]

if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!' This is checking that yrloc and valadj have the same number of elements (i.e. the two lists are the same length). This is important because the rest of the code relies on lining these two lists up and treating them as pairs.

For example, the 1400 in yrloc corresponds to the first 0 in valadj and 1994 in yrloc corresponds to the final 1.95 in valadj. Here's a simple plot of the data with yrloc on the horizontal (x-axis) and valadj:

Now it gets a little bit more complicated.

yearlyadj=interpol(valadj,yrloc,x). The interpol function is doing a linear interpolation of the data in valadj and yrloc and it uses that to look up the adjustment necessary for the years that are mention in x. You'll have to go read the original code to see where x comes from but basically there's code to read maximum latewood density (MXD) values from a file and transform that into two lists: x (which contains the years) and densall. As above these have the same number of entries and there's a correspondence between the entries similar to the yrloc and valadj explained above.

What the interpol function is doing is saying given the linear interpolation of the datapoints we know about (the 'aritificial correction') make a list of the corrections necessary for the years in x. It works like this; here's a graph of the linear interpolation from 1400 to 1994.

Now, suppose that x contains data for 1653. To find the adjustment necessary the program essentially looks at the graph above, finds 1653 on the x-axis and then the corresponding value on the y-axis (which, in this case, is 0 since there's no adjustment between 1400 and 1924). Or if it needs a value for 1992 it finds 1992 on the x-axis and find the corresponding point on the y-axis (which in this case will be between the datapoints for 1989 and 1992 along the line that joins them: in this case it will find an adjustment of 1.95 since the graph is flat at that point. Final example, suppose x contains data for the year 1975. Follow the x-axis to 1965 and then up to the line between the datapoints at 1964 and 1969. Here the line is a slope and the value at the intersection is 1.395.

So the program ends up with yearlyadj filled with the adjustments necessary for the data in densall corresponding to the years in x. Finally, the program makes the adjustment with the line densall=densall+yearlyadj. On a year-by-year basis the values in yearlyadj are added to the values in densall. The first value of yearlyadj is added to the first value of densall, the second value of yearlyadj is added to the second value of densall and so on.

Now this program is dated at September 1998 by Ian Harris at CRU. And if we look there's a paper by Harris (with others) published in Philosophical Transactions of the Royal Society B. The paper is Trees tell of past climates: but are they speaking less clearly today?. The entire paper is about how the record in tree-rings doesn't match up in recent decades with actual temperatures, but it did in the past.

There's a nice picture of this in the paper:

Ignore the dotted line. The thin line shows that average summer temperatures, the thick line the predicted temperature from the maximum latewood density. They are going along fine until around 1930 when the actual temperature starts to exceed the predicted temperature and the gap gets (mostly) greater and greater. Real divergence occurs in the mid-1960s.

Now pop back to my first graph. There's a rough correspondence between the correction being made and the gap in CRU's graph. It's certainly not perfect but it looks to me like what's happening here is a scientist trying to make two datasets correspond to each other starting with a rough estimate of what it would take to make them work correctly.

And the paper in fact cautions that such divergence is a cause for concern about relying on the historical tree-ring record until we understand why there is such divergence. To quote the paper:
Long-term alteration in the response of tree growth to climate forcing must, at least to some extent, negate the underlying assumption of uniformitarianism which underlies the use of twentieth century- derived tree growth climate equations for retrodiction of earlier climates.

And the conclusion calls for understanding what's happening here.

So, if there's a smoking gun it was published in Philosophical Transactions of the Royal Society B where the authors point out clearly their concerns about the data and the need to understand why. In doing so they are bound to want to remove the divergence by understanding what happens. A first start is a stab at that with this 'artificial correction'.

In later years, they went looking for the divergence using principal component analysis. this can be seen in this file and this file). Their code tells you what it's doing:
; Reads in site-by-site MXD and temperature series in
; 5 yr blocks, all correctly normalised etc. Rotated PCA
; is performed to obtain the 'decline' signal!

So, they appear to have replaced their hacked-up adjustment above with actual analysis to try to understand what part of the MXD data is caused by some unknown 'decline' causing the difference. The 1998 paper speculates on what might be causing this difference, but the PCA is done just to find it statistically without knowing why.

So, given that I'm a total climate change newbie and haven't been involved in what looks like a great deal of political back and forth I'm going to take this with a grain of salt and say this looks OK to me and not like a conspiracy.

But all the talk of not releasing data and hiding from FOIA requests does make me uneasy. Science is better when other scientists can reproduce what you are doing, just as I once did.

Friday, November 27, 2009

How to fail at data visualization

Today the BBC News made me aware of an upcoming book and associated blog about data visualization. Flicking through the blog I was instantly repelled by the horrible data visualizations presented as something to be proud of.

For example, consider the following snippet:

It purports to show you which countries are the most dangerous to fly to by charting the 'density' of fatal aircraft accidents by destination country. This visualization is wrong on so many levels that it's hard not to laugh:

1. The term 'density' is not defined and in fact he isn't showing a density at all, just the raw total numbers by country.

2. The underlying data is incorrect. The diagram specifically says that it uses 'fatal accidents' drawn from this database. Unfortunately, the person doing the visualization has used the total number of 'incidents' (fatal and non-fatal accidents). For example, he gives a value of 75 for the number of fatal accidents in Ecuador, whereas the database gives 38. The same applies for all the other countries.

3. He uses circles with dots in the middle to represent the size. So is the value being plotted proportional to the radius of these circles or the area? Not clear. For example, try to compare Russia and Canada; they look about the same. Now look at Russia and India, India looks smaller to me. So what's the truth?

Using his incorrect figures we have Russia with 626 accidents, India with 456 and Canada with 452. So Russia is a 1.38x more dangerous destination than Canada. Can you spot that from the diagram?

4. Take a look at Europe. Can you figure out which countries he's indicating in the diagram? It's almost impossible.

5. The US comes out as the top country in terms of fatal accidents with 2613 (actually, it's 1088) but that fails to take into account the one important thing: how many flights are there, or even how many people actually fly? The US could be the most dangerous if there are few flights and lots are fatal, but it could even be the safest if there are lots and lots of flights. What you actually want to answer is 'what's the probability of me dying if I fly to the US?' One way to calculate that would be total number of flights with a fatality / total number of flights; another way would be total number of fatalities / total number of passengers. Either way just knowing the total number of crashes doesn't tell you much.

So the diagram charts the wrong statistics, uses the wrong underlying data, and then presents it in a way that's hard to interpret. And this is what gets a book published?

Thursday, November 26, 2009

About that CRU Hack

In all the fuss around the supposed smoking gun inside one of the source code files from the CRU Hack no one seems to be pointing something out... the so called fudge factor that is in the program isn't used.

Here's the full listing of osborn-tree6/ (I've used bold type to show the bit that everyone is talking about):

; Now prepare for plotting
if ! eq 'X' then begin
endif else begin
/ystyle,yrange=[-3,3],ytitle='Normalised anomalies',$
; title='Northern Hemisphere temperatures, MXD and corrected MXD'
title='Northern Hemisphere temperatures and MXD reconstruction'
; Apply a VERY ARTIFICAL correction for decline!!
2.6,2.6,2.6]*0.75 ; fudge factor
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'
;legend,['Northern Hemisphere April-September instrumental temperature',$
; 'Northern Hemisphere MXD',$
; 'Northern Hemisphere MXD corrected for decline'],$
; colors=[22,21,20],thick=[3,3,3],margin=0.6,spacing=1.5
legend,['Northern Hemisphere April-September instrumental temperature',$
'Northern Hemisphere MXD'],$

This code is written in a language called IDL and IDL uses a semicolon for a comment. So any line beginning with ; is ignored. Here's the same code without those ignored lines:

if ! eq 'X' then begin
endif else begin
/ystyle,yrange=[-3,3],ytitle='Normalised anomalies',$
title='Northern Hemisphere temperatures and MXD reconstruction'
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'
legend,['Northern Hemisphere April-September instrumental temperature',$
'Northern Hemisphere MXD'],$

The critical thing to look for is yearlyadj which is the magic value that everyone is so excited about. But guess what? It's never referenced in the rest of the program. So much for a smoking gun.

Update: Read the comments below. It's been pointed out to me that there's a later version of code in the archive in which similar correction code is not commented out. Details and link below.

Friday, November 20, 2009

Parsing a JSON document and applying it to an HTML template in Google Go

Here's some simple code to parse a JSON document and the transform it into an HTML document using the Google Go packages json and template.

If you've done anything in a scripting language then you'll probably be surprised by the generation of fixed struct types that have to match the parsed JSON document (or at least match some subset of it). Also because of the way reflection works in Google Go the struct member names need to be in uppercase (and for that reason I've used uppercase everywhere).

import (

type Row struct {
Column1 string;
Column2 string;

type Document struct {
Title string;
Rows []Row;

const a_document = `
"Title" : "This is the title",
"Rows" : [ { "Column1" : "A1", "Column2" : "B1" },
{ "Column1" : "A2", "Column2" : "B2" }

const a_template = `
{.repeated section Rows}

func main() {

// The following code reads the JSON document in
// a_document and turns it into the Document structure
// stored in d

var d Document;
ok, e := json.Unmarshal( a_document, &d );

if ok {

// This code parses the template in a_template places
// it in t then it applies the parsed JSON document in
// d to the template and prints it out

t, e := template.Parse( a_template, nil );
if e == nil {
t.Execute( d, os.Stdout );
} else {
fmt.Printf( e.String() );
} else {
fmt.Printf( e );

All the real work is done my main() by calling json.Unmarshal, template.Parse and then Execute.

Here's the Makefile and output:

$ cat Makefile
P := template
all: $P

$P: $P.6
6l -o [email protected] $^

%.6: %.go
6g $<
$ make
6g template.go
6l -o template template.6

<head><title>This is the title</title></head>

Thursday, November 19, 2009

Installing Google Go on Mac OS X

I decided to have a go with Google Go since I'm an old fogey C/C++ programmer. Any new innovation in the C/C++ family gets me excited and Google Go has quite a few nice features (garbage collection is really nice to have and channels make me think of all the work I did in CSP).

I decided to go with the 6g compiler since gccgo doesn't have garbage collection implemented yet and hence there's no way to free memory. The only way to get 6g is to mirror its Mercurial repository. So...

Step 1: Install Mercurial

For that I used prebuilt packages from here and got Mercurial 1.4 for Mac OS X 1.5 (no, I haven't upgraded to Snow Leopard yet).

Step 2. Set GOROOT

I just did a quick cd ; mkdir go ; export GOROOT=$HOME/go to get me started.

Step 3. Clone the 6g repository

That was a quick hg clone -r $GOROOT followed by the hard part: compiling it. You need to have gcc, make, bison and ed installed (whcih I do since I do development work on my Mac).

Step 5. Set GOBIN

This points to where the binaries will go, for me that's $HOME/bin since I'll be doing local development using Go. And I updated PATH to include $GOBIN.

Step 4. Compile 6g

You first need to set GOARCH and GOOS. For me that's amd64 for the architecture (the Intel Core 2 Duo in my Macbook Air is a 64-bit processor) and darwin for the OS (since this is a Mac).

$ export GOARCH=amd64
$ export GOOS=darwin

Then you can actually do the compile:

$ cd $GOROOT/src
$ ./all.bash

This does a build and test of 6g and it was very fast to build (although I'm used to building gcc which is a bit of a monster).

Step 5. Write a Hello, World! program

Here's my first little Google Go program (filename: hw.go) just to test the 6g compiler.

package main

import "fmt"

func main() {
fmt.Printf( "Hello, World\n" );

To simplify building I made a minimal Makefile:

all: hw
hw: hw.6 ; 6l -o [email protected] $^
%.6: %.go ; 6g $<

And then the magic moment:

$ make
6g hw.go
6l -o hw hw.6
$ ./hw
Hello, World!

And now for a real project... get SQLite to interface to it.

Thursday, November 12, 2009

Geek Weekend (Paris Edition), Day 4: Institut Pasteur

Leaving my SO in bed at the hotel with a nasty bacterial infection and some antibiotics, I went with timely irony to visit the home and laboratory of Louis Pasteur at the Institut Pasteur. (It's pretty easy to find since it has a conveniently named stop on the Paris metro: Pasteur).

At the Institut Pasteur there's a wonderful museum that covers the life and work of Louis Pasteur (and his wife). It's housed in the building (above) where the Pasteurs lived. There's a single room of Pasteur's science and the rest of the house is Pasteur's home; so a visit is partly scienfitic and partly like visiting any old home. I was mostly interested in the laboratory (although seeing how he lived---pretty darn well!---was also worth it).

Pasteur wrote standing up at a raised table (much like old bank clerks used to use) and his lab is full of specimens that he worked on. There's a nice display about chirality which Pasteur had initially worked on while study tartaric acid in wine. (Pasteur determined that there were two forms of tartaric acid by painstakingly sorting tiny crystals by hand).

The rest of the lab covers immunization, pasteurization and the germ theory of disease. There was a nice display of Pasteur's bottles of chicken broth that he used to demonstrate the germ theory of disease. The bottles contain boiled broth and have a long tapering curved neck. Although the neck is open the shape prevents dust from entering and the broth sits undisturbed (as it has for 150 years).

In the same room there's also a big bottle of horse's blood that looks fresh despite its age, and there are detailed displays about immunization (and especially Pasteur's rabies vaccine).

The museum also has a lot of equipment used by Pasteur, such as vacuum pumps and autoclaves. It all has that lovely Victorian feel of wrought iron and brass.

The oddest part of the museum is the Pasteurs' burial chamber built beneath the house and in a totally over the top Byzantine style.

Note that the museum is only open in the afternoons during the week and that you must bring photo ID with you to get in since it is inside the Institut Pasteur.

Tuesday, November 10, 2009

Geek Weekend (Paris Edition), Day 3: The Arago Medallions

The old Paris Meridian (which was in use up until 1914) passes not far from The Pantheon which I visited to see Foucault's Pendulum. It's actual longitude today is 2°20′14.025″.

To mark the old meridian the French decided to install some art work and they commissioned an artist called Jan Dibbets to build something appropriate. What he did was embed brass disks in the streets of Paris marking the meridian and turning the whole city into a sort of treasure hunt.

These Arago medallions (which celebrate the meridian and the life of François Arago) cut through the very heart of Paris. They make a wonderful way to see Paris at going on a treasure hunt. And the meridian goes to the very heart of something important: the meter. The original definition of a meter was based on the length of the Paris meridian from the north pole to the equator. Arago surveyed the meridian and came up with a very precise definition for this fundamental unit of measure.

Here's a photo I took of one on Boulevard Saint-Germain:

There's a full list of the medallions (in French) here. And here's my English translation of the list (the numbers in parentheses give the number of medallions to be found there):

Position of the medallions along the meridian from north to south

  • XVIIIe arrondissement

    • 18 av. de la Porte de Montmartre, in front of the municipal library (1)

    • Intersection of rue René Binet and av. de la Porte de Montmartre (1)

    • 45/47 av. Junot (1)

    • 15 rue S. Dereure (1)

    • 3 and 10 av. Junot (2)

    • Mire du Nord, 1 av. Junot, in a private courtyard with controlled access (1)

    • 79 rue Lepic (1)

  • IXe arrondissement

    • 21 boulevard de Clichy, on the pavement (2)

    • 5 rue Duperré (1)

    • 69/71 rue Pigalle (2)

    • 34 rue de Châteaudun, inside the courtyard of the Ministry for National Education (2)

    • 34 rue de Châteuadun (1)

    • 18/16 and 9/11 boulevard Haussmann, in front of the restaurant (2)

    • Intersection of rue Taitbout, in front of the restaurant and 24 boulevard des Italiens (2)

  • IIe arrondissement

    • 16 rue du 4 septembre (1)

    • 15 rue saint Augustin

  • Ie arrondissement

    • 24 rue de Richelieu (1)

    • 9 rue de Montpensier (1)

    • At the Palais Royal: Montpensier and Chartres Colonnades, Nemours Gallery, passageway on place Colette and place Colette in front of the café (7)

    • Intersection of place Colette and Conseil d'État, rue saint Honoré (1)

    • place du Palais royal, on the rue de Rivoli side (1)

    • rue de Rivoli, at the entrance of the passageway (1)

    • At the Louvre, Richelieu Wing: French sculpture room and in front of the escalator (3)

    • At the Louvre, Napoléon Courtyard, behind the pyramid (5)

    • At the Louvre, Denon Wing: Roman antiquity room, stairs and corridor (3)

    • Quai du Louvre, near the entrance to the Daru pavillion (1)

    • port du Louvre, not far from the Pont des Arts (1)

  • VIe arrondissement

    • port des Saints-Pères (1)

    • quai Conti, near the place de l'Institut (2)

    • place de l'institut and rue de Seine (1)

    • 3 and 12 rue de Seine (4)

    • Intersection of rue de Seine and rue des Beaux-Arts (1)

    • 152 and 125-127 boulevard Saint-Germain (2)

    • 28 rue de Vaugirard, on the Sénat side (1)

    • In the Jardin de Luxembourg, on asphalt and cement surfaces (10)

    • rue Auguste Comte, at the entrance to the garden(1)

    • av. de l'Observatoire on the pavement near the garden (2)

    • Intersection of av. de l'Observatoire and rue Michelet (1)

    • jardin Marco Polo (3)

    • Intersection of av. de l'Observatoire and rue d'Assas (1)

    • place Camille Jullian (2)

    • On the ground at the intersection of av. Denfert Rochereau and av. de l'Observatoire, on the Observatoire side (1)

    • av. de l'Observatoire (2)

  • XIVe arrondissement

    • Courtyard of the Observatoire de Paris (2)

    • Inside the Observatoire (1)

    • Terrace and garden in the private area of the Observatoire (7)

    • boulevard Arago and place de l'Ile de Sein (6)

    • 81 rue du faubourg Saint Jacques (1)

    • place Saint Jacques (1)

    • parc Montsouris (9)

    • boulevard Jourdan (2)

    • Cité universitaire, on the axis from the pavillon Canadien to the pavillon Cambodgien, the final one is behind the pavillion (10)

This special Google Map has many of them on it, the rest you'll have find by wandering:

View Paris Meridian in a larger map

Monday, November 09, 2009

Parsing HTML in Python with BeautifulSoup

I got into a spat with Eric Raymond the other day about some code he's written called ForgePlucker. I took a look at the source code and posted saying it looks like a total hack job by a poor programmer.

Raymond replied by posting a blog entry in which he called me a poor fool and snotty kid.

So far so good. However, he hadn't actually fixed the problems I was talking about (and which I still think are the work of a poor programmer). This morning I checked and he's removed two offending lines that I was talking about and done some code rearrangement. The function that had caught my eye initially was one to parse data from an HTML table which he does with this code:

def walk_table(text):
"Parse out the rows of an HTML table."
rows = []
while True:
oldtext = text
# First, strip out all attributes for easier parsing
text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
# Case-smash all the relevant HTML tags, we won't be keeping them.
text = text.replace("</table>", "</TABLE>")
text = text.replace("<td>", "<TD>").replace("</td>", "</TD>")
text = text.replace("<tr>", "<TR>").replace("</tr>", "</TR>")
text = text.replace("<br>", "<BR>")
# Yes, Berlios generated \r<BR> sequences with no \n
text = text.replace("\r<BR>", "\r\n")
# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")
# Now that the HTML table structure is canonicalized, parse it.
if text == oldtext:
end = text.find("</TABLE>")
if end > -1:
text = text[:end]
while True:
m ="<TR>\w*", text)
if not m:
start_row = m.end(0)
end_row = start_row + text[start_row:].find("</TR>")
rowtxt = text[start_row:end_row]
rowtxt = rowtxt.strip()
if rowtxt:
rowtxt = rowtxt[4:-5]# Strip off <TD> and </TD>
rows.append(re.split(r"</TD>\s*<TD>", rowtxt))
text = text[end_row+5:]
return rows

The problem with writing code like that is maintenance. It's got all sorts of little assumptions and special cases. Notice how it can't cope with a mixed case <TD> tag? Or how there's a special case for handling a doubled </TD>?

A much better approach is to use an HTML parser than knows all about the foibles of real HTML in the real world (Raymond's main argument in his blog posting is that you can't rely on the HTML structure to give you semantic information---I actually agree with that, but don't agree that throwing the baby out with the bath water is the right approach). If you use such an HTML parser you eliminate all the hassles you had maintaining regular expressions for all sorts of weird HTML situations, dealing with case, dealing with HTML attributes.

Here's the equivalent function written using the BeautifulSoup parser:

def walk_table2(text):
"Parse out the rows of an HTML table."
soup = BeautifulSoup(text)
return [ [ col.renderContents() for col in row.findAll('td') ]
for row in soup.find('table').findAll('tr') ]

In Raymond's code above he includes a little jab at this style saying:

# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")

But that doesn't actually stand up to scrutiny. Try it and see. BeautifulSoup handles the extra </TD> without any special cases.

Bottom line: parsing HTML is hard, don't make it harder on yourself by deciding to do it yourself.

Disclaimer: I am not an experienced Python programmer, there could be a nicer way to write my walk_table2 function above, although I think it's pretty clear what it's doing.

Friday, November 06, 2009

Geek Weekend (Paris Edition), Day 2: Foucault's Pendulum

Not very far from The Curie Museum is the former church and now burial place for the great and good men (and one woman) of France: The Pantheon. Inside the Pantheon is the original Foucault's Pendulum.

The pendulum was first mounted in the Pantheon in 1851 to demonstrate that the Earth is rotating. The pendulum swings back and forth in the same plane, but the Earth moves. Relative to the floor (and to the convenient hour scale provided) the pendulum appears to rotate.

The pendulum is on a 67m long cable hanging from the roof of the Pantheon. The bob at the end of the cable weight 27kg. In the Pantheon the pendulum appears to rotate at 11 degrees per hour (which means it takes more than a day to return to its original position). If it were mounted at the North Pole it would 'rotate' once every 24 hours, the pendulum's period of rotation depends on the latitude diminishing to 0 degrees per hour at the equator (i.e. it doesn't 'rotate' at all).

If you take a look at the photograph above you can see that I was there just after 1200. The scale shows the current time measured by the pendulum.

The actual movement of the pendulum is only hard to understand because the common sense assumption is that the floor is not moving, but of course it is. It appears that what we are observing is a pendulum swinging above a fixed floor.

But the floor is actually moving because of the rotation of the Earth. That makes understanding the pendulum's motion harder. The important factor is the Coriolis Effect (sometimes erroneously called the Coriolis Force).

The simplest way to visualize the Coriolis Effect is to imagine firing a gun at the Equator straight northwards along a meridian. Because the Earth rotates the bullet will not land on the meridian, the Earth will have moved and the bullet will land to the west of the meridian. It looks as though a force has acted on the bullet to push it sideways. Of course, there's no actual force, it's just that the frame of reference (i.e. where the observer is) is not stationary.

Essentially the same thing happens with Foucault's Pendulum. The observer and the floor are not stationary and so the pendulum has an apparent motion.

Security Now #221

I was a guest on Security Now this week and the podcast has now been released (as has a transcript). Steve Gibson and some other people asked me to provide the presentation in some relatively readable format.

The original presentation is here, but it, ironically, requires JavaScript and Adobe Flash. So here are two additional formats: old style Microsoft PowerPoint and PDF.

Tuesday, November 03, 2009

Geek Weekend (Paris Edition), Day 1: The Curie Museum

So, it was off to Paris for the weekend via Eurotunnel and I managed to fit in four places from The Geek Atlas in four days. I was staying in a hotel in the Latin Quarter which is a stone's throw from... The Curie Museum.

Here's Marie Curie's laboratory:

The museum covers the lives and works of two Nobel Prize-winning couples: Pierre and Marie Curie (they discovered Radium and Polonium) and their daughter Irene and her husband Frederic Joliot (they discovered artificial radioactivity: you could make a substance radioactive by bombarding it with alpha particles).

Their Nobel Prizes are on display as is the equipment that they used (including the apparatus for measuring radiation by measuring ionization of air---which itself had been discovered by Becquerel).

Here are the Nobel Prizes:

Although I love the science section of the museum (including the laboratory where they worked with a piece of paper from one of their notebooks with its radioactive thumb print---they weren't too careful about handling radioactive elements), the best bit is the section on the craze for radium products in the 1920s and 1930s.

Here's an ad for a beauty cream that contains radium and thorium. Gives you that special glow!

Here you'll find make up that contains thorium and radium, special radium wool to keep babies warm, a radium dispenser so you could have a radioactive soak in the bath and more...

Seems stupid now, but back then the dangers were either ignored or unknown, and radioactivity seemed like a wondrous thing (especially since it was discovered early on that it would kill or reduce tumors). I wonder what products we are feeding ourselves that in 70 years we'll consider down right dangerous.

There's a nice web site of radioactive quack cures which make my skin crawl. Yes, I'm going to take a radioactive suppository to boost my sex life tonight! Move over Viagra, here's Vita Radium.