Saturday, July 31, 2010

Eating your own dog food in the 1960s

It's common in the software industry to hear the phrase "Eat your own dog food". It's so common that Wikipedia has a page dedicated to the phrase. The idea is that a software developer should also be a user of the software they are developing. By analogy someone making dog food should make it good enough that they'll eat it themselves.

The phrase has been around since at least the 1980s, but I recently came across a much earlier expression of the same idea. And this came from one of the X-15 test pilots: Milton Thompson.

Some models of the X-15 contained a device that mixed the aircraft controls of the X-15 with the rocket controls. Since the X-15 would fly like a plane and a spacecraft it had both conventional airfoils and thrusters. These were originally controlled by different joysticks. The mixing device was the MH-96 and enabled the pilot to fly the X-15 in the atmosphere or in space using a single joystick.

But that meant that the MH-96 was controlling the X-15 with the input of the pilot. And so the pilot had to trust the MH-96 itself. Thompson wrote "As a pilot, you hope the guy who designed this electronic control system knew what he was doing. In fact, you would like him to be in the airplane with you to be exposed to any adverse results." Now, that's eating your own dog food.

More recently I had the pleasure of chatting with Sven Strohban who was the lead engineer on Stanley, the car that won the 2005 DARPA Grand Challenge.

He told me that he had a simple technique for ensuring high quality. When a software engineer told him that their module was ready for testing in the autonomous car, he would say: "OK, we'll load your code and you can sit in the car while it drives you around". Invariably, the engineer would need just a little bit more time before declaring his code ready.

Monday, July 19, 2010

The bandwidth of a fully laden 747

Never underestimate the speed of shipping physical stuff when you want to move large amounts of data. The Internet is actually horribly slow even at 'high' speeds. That's why Amazon Web Services offers an Import/Export service that involves shipping physical disks around.

Back in 1999 I wrote an article for The Guardian explaining latency and bandwidth for modem users. In it I used a jet plane full of people flying across the Atlantic to illustrate the difference.

The analogy still works today, and brings me to the farcical question: what's the bandwidth of a fully laden 747?

Assuming I fill it with DAT 320 cartridges, each of which can contain 160GB of uncompressed data and each of which weigh about 50g then I can fit about 2.8m cartridges in the plane (a single 747 can lift 140 tonnes). That's about 427 PB of storage in the plane. (I'm not sure that many cartridges will actually fit inside the 747, but you get the idea).

Now assume it flies from San Francisco to London in 10 hours. That's a bandwidth of about 12 TB/s.

Which brings me to today's announcement from RackSpace about OpenStack. One of the goals of OpenStack is to remove lock in to specific providers. That's a noble goal, but if you store a lot of data in the cloud you might find yourself needing a 747 (or at least FedEx) if you decide to change providers.

PS Many people have commented that it would take a while to fill up the DAT tapes. Clearly the solution, as one commenter suggests, is to use a 747 as your data centre and fly it where you need.

Sunday, July 18, 2010

Monte Carlo simulation of One Banana, Two Banana to develop a card counting strategy

The children's game One Banana, Two Banana is a high stakes game of probability theory in action. Or something like that. Actually, it's a fun game where you have to take probability into account when deciding what to do.

Conceptually, the game is simple. There are 54 cards of five types. Three of the card types have bananas on them (one, two or three bananas), one card type has a banana skin on it and one card type has a swamp on it. All the cards are placed face down on the table and shuffled about. Each player takes turns at turning over as many cards as they want.

The total distance they move on the board (winning is a simple first past the post system) is the sum of the number of bananas on the banana cards. But if they pick a banana skin card then they stop picking cards and must reverse direction by the sum of the number of bananas on the banana cards. The swamp card simply stops a turn without a penalty.

There are six banana skin cards to start with and as they are removed from the pack they are placed on a special card for all to see. At every stage everyone knows the number of banana skin cards remaining. Thus you can adjust your strategy based on the number of banana skins that have caused others to go back.

So I wrote a little program that simulates One Banana, Two Banana games (or at least board positions) and see what the expected score is depending on the number of cards that the player chooses to pick. I assume a very simple strategy of deciding up front how many cards to pick. A more complexity strategy would be to look at the score as the player turns over cards and decide when to quit.

First, here's the code:

# Monte Carlo simulation of One Banana, Two Banana
# Copyright (c) 2010 John Graham-Cumming
# For game details see
# In the game there are a total of 54 cards of five types:
# 14 cards with three bananas on them
# 14 cards with two bananas on them
# 14 cards with one banana on them
# 6 cards with a swamp on them
# 6 cards with a banana skin
# The cards are placed face down on the table. At each turn the
# player can take as many cards as they wish. If they turn over a
# swamp or banana skin card they must stop; otherwise they stop when
# they choose.
# The total number of bananas is added up (call it X). If the banana
# skin card was turned over then the player moves BACK X spaces, if
# not the player moves FORWARD X spaces.
# Players have perfect knowledge of the state of the cards as they are
# placed face up once drawn. Once all six banana skin cards have be
# drawn the cards are all placed face down on the table again.

# The simulation runs through all the possible card positions and
# plays a large number of random draws for each possible number of
# cards a player might draw. It then keeps track of the score for a
# number of different card counting scenarios.
# Card counting scenarios:
# Number of skins: the player only keeps track of the number of banana
# skins that are on the table.
# Number banana cards: the player calculates the number of banana
# cards on the table.
# Number of three bananas: the player keeps track of the number of
# high scoring three banana cards on the table.

use strict;
use warnings;

# This is a two dimensional array. The first dimension is the number
# of skins (6, 5, 4, 3, 2, 1). The second dimension is the number of
# cards the player will pick. Each entry is a hash with keys score
# (the total score of all trials) and count (the total number of
# trials).

my @skins;

# Similar arrays to @skins, but here the first dimension is the number
# of cards on the table of the specific type.

my @bananas;
my @threes;

# Generate all possible valid board positions using nested loops.
# Note that the skin cards can never be zero because that's when the
# board resets.
# There are 6 x 7 x 15 x 15 x 15 possible board positions (115,248).

for my $sk (1..6) {
for my $sw (0..6) {
for my $on (0..14) {
for my $tw (0..14) {
for my $th (0..14) {

# Arbitrarily chosen run of 100 plays

for my $sim (0..99) {

# Allow the player to pick up to 10 cards

for my $to_pick (1..10) {

# The state of the board at any time can be represented by
# a 5-tuple (Skin, Swamp, One, Two, Three) giving the
# number of cards of each type present on the playing
# board. The initial state is (6, 6, 14, 14, 14).

my @cards = ( $sk, $sw, $on, $tw, $th );

my $score = 0;
for my $pick (1..$to_pick) {
my $total = $cards[0] + $cards[1] + $cards[2] + $cards[3]
+ $cards[4];
my $picked = int(rand($total));
if ( $picked < $cards[0] ) {
$score = -$score;
} elsif ( $picked < ( $cards[0] + $cards[1] ) ) {
} elsif ( $picked < ( $cards[0] + $cards[1] + $cards[2] ) ) {
$score += 1;
} elsif ( $picked < ( $cards[0] + $cards[1] + $cards[2] +
$cards[3] ) ) {
$score += 2;
} else {
$score += 3;

$skins[$sk][$to_pick]{score} += $score;

$bananas[$on+$tw+$th][$to_pick]{score} += $score;

$threes[$th][$to_pick]{score} += $score;
print "$sk\n";

if ( open F, ">skins.csv" ) {
for my $i ( 1..6 ) {
for my $k (1..10) {
print F "$i,$k,$skins[$i][$k]{score},$skins[$i][$k]{count}\n";
close F;

And here's a chart showing the average score from the Monte Carlo simulation for each of the six possible numbers of banana skins on the board. The X axis shows the number of cards picked (or desired to be picked), and the Y axis the average score.

If you look just at the maximum scores for each of the banana skin counts, then a simple pattern emerges.

Banana SkinsOptimum Number of CardsExpected Score

So, just memorize that pattern and you'll be a better player. And also, if you are player and have exceeded the expected score with fewer than the recommended cards you'll know it's best to stop. I've removed that sentence because it is not correct. That's a different strategy based on stopping based on the cards you have already turned over; that deserves a separate analysis to find the best strategy.

Watch out Vegas, here I come.

Saturday, July 17, 2010

GAGA-1: Some initial investigations

I've been investigating bits and bobs to go into GAGA-1 and have run one freeze test.

1. Components

I've now got the Telit GM862 module up and running with Arduino Duemilanove and have uploaded a simple Python script that SMSes me the device's location once a minute. The module is on the right in the picture and a SparkFun board for it is on the left.

Here's an example SMS received on my phone from the device giving latitude and longitude and GPS status. This format is by no means final, it's just a test.

I've also started working with the Trimble Lassen IQ module which is tiny:

It will eventually be interfaced directly to the Arduino which will connect to the radio to send down telemetry.

2. Freeze test

In my first post on the subject, I said that I'd be doing a freeze test of everything. My first test was to build a simple temperature sensor using an LM35 and then put the Arduino in my home freezer and allow its temperature to drop to -18C for a hour. This test was successful. I was able to read the temperature sensor throughout via a USB cable coming out of the freezer door.

Here's the breadboarded temperature sensor connected to the Arduino in the freezer at the start of the test.

Because the Arduino's analog ports can only handle a positive voltage the negative input to the LM35 is biased (as in the datasheet) using a couple of diodes.

The diodes in series give a voltage drop of about 2V. The LM35 would normally give -550mv at -55C (that temperature is likely to be experienced by GAGA-1 in the stratosphere): with the bias -55C will correspond to about 1.5V. The LM35 is powered by the Arduino's 5V regulated output (it only draws 60µa). The top of the range, 150C would correspond to 3.5V output.

The Arduino's ADC divides the 5V range into 1024 steps or 4.8mV per step. The LM35 gives 10mV per C. So we'll only be able to measure within 0.5C (which is fine for GAGA-1).

An added complexity is that the diodes' forward voltage drop is temperature dependent and so to get a good reading the output from the LM35 is connected to one ADC port and the voltage at the 'top' of the two diodes is connected to another. Reading the two ports and taking the difference gives a more accurate measure of the temperature as it compensates for the drift in the diode forward voltage.

Here's the test code I used.

int temperaturePin = 0;
int biasPin = 1;

void setup() {

void loop() {
while(1) {
int temperature = analogRead(temperaturePin);
int bias = analogRead(biasPin);
int unbiased = temperature - bias;
float celsius = (float)unbiased * 500.0 / 1024.0;

Serial.print( "Temperature: " );
Serial.println( temperature, DEC );
Serial.print( "Bias: " );
Serial.println( bias, DEC );
Serial.print( "Celsius: " );
Serial.println( celsius, 2 );

And here are two excerpts from the log file output (which was just sent by the Arduino down the USB connection to the Wiring application on my Mac).

Temperature: 326 Bias: 279 Celsius: 22.95
Temperature: 255 Bias: 291 Celsius: -17.58

In my fridge test the diodes drifted by about 0.06V.

For the actual project I am going to shift to the GCC toolchain rather than use their IDE. I'll be much happier with emacs, make and gcc.

3. Circuit board

I was initially going to hand build small circuit board for each of the parts of the computer (radio, GPS, backup GPS, temperature sensors), but I've come to the conclusion that between Eagle CAD and BatchPCB that the right thing to do is build a custom board.

So, to simplify my life I'm going to build an Arduino Shield that will slot right onto the Arduino Duemilanove and handle the two GPSes, the radio and temperature sensors. That'll minimize wiring inside GAGA-1 (will just need antenna wiring) and likely be more robust.

Here are the beginnings of the schematic.

Next update: in a few weeks once the board has come together.

Tuesday, July 13, 2010

The Geek Atlas Companion

The Geek Atlas has been out for a year now, and there's been an iPhone version for a while.

But now there's a special companion application for the book: The Geek Atlas Companion. O'Reilly developed this as a counterpart to the book.

My favorite part of the application is the ability to find places from the book that are near you. The application has all 128 places in it with their coordinates (just like the book) and using the iPhone's GPS it can give you a list of places near you. Each place has a small summary of what you'll see there, but not the complete book text. The application is designed as an add on and not a replacement for the book itself.

A fun part that O'Reilly developed is a quiz based on the book. Multiple choice questions lead you through some of the science that's covered in The Geek Atlas.

Finally, there's a strong community component where users are encouraged to upload their photographs of locations in the book. The application gives rapid access to other people's photographs of the locations.

Saturday, July 10, 2010

BallastHalo 5 launch afternoon out

The early bird truly catches the worm. I woke up at 0530 this morning and couldn't go back to sleep. After dealing with email etc. I was able to get to the Apple store to see a Genius, browse around in Maplin, go to the Post Office and be back home by 1000. So, then I started browsing around a bit for a project I'm slowly working on: sending a balloon up into the stratosphere to photograph the Earth and space (more on that in a separate blog post).

I wandered over to the UK High Altitude Society web site and idly clicked on Current Launch. I had to read the date 10/07/10 about three times before I realized that it was today, that I could make it to Cambridge for the launch if I moved it.

I quickly Twittered to James Coxon to see if I'd be welcome. I was. I grabbed the Jack Pack and stuffed it with journey essentials: iPhone, Kindle, Oyster Card, Camera and Power Monkey and raced to Cambridge via Kings Cross.

I arrived with enough time to sit down with Nathan Chong of Newstilt for a chat about really cool stuff that he's working on. And then it was off to Churchill College to met James, Ed and cohorts. They were incredibly welcoming (especially, since I had invited myself along) and answered all my questions about the launch of BallastHalo 5.

BallastHalo 5 was a test launch as part of a plan to send a balloon across the Atlantic:

BallastHalo 5 is the most recent payload in a series of launches to test ballast tank concepts in anticipation of a trans-atlantic balloon launch. A successful trans-atlantic launch would require the ability to drop ballast at night to counteract the loss of helium from the ZP balloon during the day.

Here's the primary payload that was launched. It contains two main components: a microcontroller that talks to a GPS to get location information and then relays that information via a low power and slow data link to the ground; a system for dumping ballast (surgical spirit) to lighten the balloon as needed. The first picture shows the internals, the second shows the assembled unit hanging from a tripod. The payload circuits are inside a polystyrene box that's covered in reflective foil to keep the heat in: it's cold up in the stratosphere.

There's also a second transmitter sending out CW (Morse) and Hellschreiber in case the main computer fails. It can be used to find the entire device if lost (and has been in the past when previous flights have taken the box to the Netherlands and France).

After checking all the electronics and the receivers we headed outside to fill the balloon, attach the parachute and launch. My role was helping to drag the 100kg helium bottle to the launch location and standing around staying out of the way. To get the balloon filled to the right lift a plastic bag was filled with Coke cans equaling the total payload weight (including the cord and parachute) plus the desired lift, and attached to the balloon. Once the combination had achieved neutral buoyancy there was sufficient helium for the flight.

Once the balloon was filled the payload and beacon were attached along with the parachute and we moved off to the launch site about 200m away. And then the balloon was off.

And the balloon ascended beautifully downlinking telemetry to multiple radio operators around the UK.

We returned to the college to watch the telemetry come in and follow the flight on the tracker. The balloon reached a maximum altitude of about 31km and flew for about 5 hours sending a clear signal all the way. Unfortunately, calamity struck when the balloon burst while the payload was over the North Sea and BallastHalo 5 descended gracefully into the water. On the way the emergency ballast routine was activated automatically to dump the surgical spirit as fast as possible to slow the descent.

There are lots more photos on Flickr.

Thursday, July 08, 2010


In April 1961, Yuri Gagarin became the first human into outer space and the first to orbit the Earth. In honor of Gagarin I'm planning to launch my own 'space' mission which I'm code naming: GAGA-1.

The code name serves a triple purpose: it sounds funny, it's the first four letters of Gagarin's name and it stands for "Geek Atlas Goes Airborne". None of the places in The Geek Atlas are quite as high as GAGA-1 will travel; I'll have to come up with a mission patch that includes The Geek Atlas in some way.

My mission won't be quite as high as Gagarin's: I plan to launch a helium-filled balloon into the stratosphere in attempt to photograph and film the curvature of the Earth with the blackness of space visible. Also unlike Gagarin, others have gone before me: it's quite possible to launch a 'weather' balloon carrying a payload with cameras and take photographs at an altitude of 30km or greater. See, for example, this post about BallastHalo 5.

Since it's going to take me a while to assemble all the gear, write all the code and perform ground based testing, my initial flight might coincide with the 50 year anniversary of Gagarin's flight.

To make this interesting for others I'm going to document my progress and the eventual flight here using the tag gaga. Follow that if you are interested in knowing how it goes.

Here are my initial notes:

1. Legality

Yes, I can legally do this in the UK. There are restrictions on which airspace I can launch in and how heavy the payload container can be, but it's ok for me to launch a helium-balloon with a recovery parachute and with a radar reflector as long as I get permission from the CAA and NOTAM (notice to airmen) about the balloon launch. These are somewhat blanket notices. For example, here's one used by CU SpaceFlight:

ID H1510/10
OS Grid Ref TL434597
Start Date 201005170001 End Date 201011172359
Lower Limit 000 Upper Limit 999
Validity Period
Q Line EGTT/QWLLW/IV/M/AW/000/999/5213N00006E002
07789 933271. AUS 10-05-0361/AS5.

2. Insurance

It doesn't appear that I'm required to have insurance, but I need to look into this more.

3. Balloon components

There are three parts to the balloon assembly: a latex weather balloon (the preferred manufacturer appears to be Kaymont, lots of helium (probably from BOC) and a parachute.

Essentially, the larger the balloon the more helium it can contain and the greater the lift it will achieve. The goal of the project will be to minimize the payload weight (currently I have calculated my ideal payload at around 700g) and maximize lift (while controlling costs) so that the balloon rises quickly. The quick rise is important for a number of reasons: it minimizes the time I have to wait for it to come down, it minimizes the low temperature exposure of the electronics, and the it minimizes how far the balloon will drift from the launch site.

It looks like a 1500g balloon (the KCI 1500) would be a good idea. With a 1050g payload it'll give a 5m/s. With a burst altitude of 34km that means the rise to burst would take about 2 hours.

The parachute would be attached to the cord attaching the balloon to the payload and would be open all the time. On the way up it does nothing, once the balloon bursts it comes into play as the payload falls and reenters thick air. The size of the parachute will modulate the descent rate, but larger parachutes also weigh more. So, some research here on good parachute sizes.

4. Balloon trajectory and flight time

Flight time is going to be largely determined by the balloon and parachute combination, but the trajectory is up to the wind. The guys at CU Spaceflight have a balloon trajectory calculator that uses weather information to predict a landing location.

5. Temperature and pressure considerations

Using the NASA Earth Atmosphere Model it's possible to predict the likely temperatures and pressure that GAGA-1 will experience. Since many of the temperatures are very low the capsule will need to be insulated to keep the electronics warm and the electronics will need to be rated to work at low temperatures. Possibly I will need to provide an internal heat source to keep the electronics in working order.

The NASA model divides the stratosphere into lower, upper regions. The balloon will likely be operating in the lower stratosphere (up to 25km) and a small portion of the upper (up to around 30km). The model gives a temperatures of -56C for the lower stratosphere and a temperature of between -56C and -41C between 25km and 30km. Note that the temperature increases in the upper stratosphere because of warming by the sun at that height.

Pressure at 25km will be around 2.5KPa; at the ground it's around 101Kpa. At 30km the pressure is about 2.6KPa.

In order to make sure that temperature doesn't hurt the capsule I plan to run freeze tests first in my home freezer (-18C) and then in a commercial freezer (plan to ask the local butcher to lend me a corner of his deep freeze). This will test the electronics, any warmers and also the capsule structure.

6. Flight computer

The main flight computer will be an Arduino Duemilanove. This is primarily because I have some lying around at home (thus reducing costs), I know how to program them, and they are very flexible little beasts.

The flight computer will be responsible for reading GPS coordinates and altitude, relaying those and status messages to the radiotelemetry system and via GSM SMS messages. It will also, possibly, read the internal temperature of GAGA-1 for monitoring purposes and barometric pressure. And it should be used to activate a video camera once the capsule is high enough.

7. Cameras and Speech Synthesis

For redundancy I want the main camera to be totally independent of the rest of the electronics, and so I am going to use a second-hand Canon A560 that I got off eBay hacked using CHDK to take still pictures and short videos by scripting it internally.

The camera will also run on its own power supply (internal AA batteries) so even if there was a total flight computer failure the camera would run.

One thought I have is using an allophone speech synthesizer to speak the latitude, longitude and altitude of the payload so that the camera will pick up the sound when filming video. That way there'd be a nice record of where we were. Not sure how well this would work in the thin air at 30km.

8. Radiotelemetry for tracking

This is an area where the CU Spaceflight guys and others have got things well worked out. The radio will be a Radiometrix NTX2 running in SSB transmitting RTTY at 50 baud.

They also have software that can decode the RTTY and upload to a tracker web site. That web site can take feeds from multiple radio operators thus giving lots of coverage for tracking.

9. Launch site and time of year

Haven't really investigated this, but clearly want a nice clear day for the launch and would assume I'll do it at one of the two Cambridge launch sites that have been used by others.

10. GPS ballistic missile restrictions

One of the problems with GPS is that there are legal limits on their operation. Modules won't work about some altitude (typically 18km) or speed (515 kph). Some modules implement this restriction as an AND (i.e. you are high and fast). I have a Trimble Lassen IQ that I used for something else that will work at this altitude at slow speed. It will be the main GPS connected to the Arduino for radio telemetry.

The other GPS module I have is a Telit GSM862 which has integrated GSM. It doesn't work about 24km. So I am going to run it separately from the main flight computer providing back up telemetry via SMS for recovery. It has an embedded Python interpreter so I will program it to send out SMS messages giving its location when it is below 24km and in GSM range.

11. Batteries

The system is likely to need both 5V and 3.3V power to run the flight computer, the GPS and the radiotelemetry system. And the batteries are likely to be exposed to very low temperatures. Ordinary household batteries are not going to be good enough because they likely won't provide the power for the length of the flight at the temperatures seen.

The consensus seems to be that some specialized batteries are likely to be needed. A good example seems to be Energizer Lithium Ultimate non-rechargeable batteries.

They can operate down to -40C and have a nice flat discharge curve maintaining a pretty stable voltage as they discharge. There's a nice technical document from Energizer on these here. Looks like they cost about £5 for a pack of four.

Like everything I'll want to run a full test of these batteries at room temperature to see how long they can power the electronics once GAGA-1 is constructed. It's probably worth also doing a freeze test on them as well.

12. Casing

For the core structure I'm thinking of using a plastic box with a screwed on lid. All the electronics can be screwed down inside this box and the box will only have holes for the camera lenses and antenna mountings. Plastic should give a good combination of strength and lightness, but I want to do a freeze test on the box to make sure that is won't crack.

The outside of the case will then be clad in extruded polystyrene panels. These will be glued on. Extruded polystyrene should give a nice combination of being easy to work with, light, insulating and smooth. The smooth is important because I want to spray paint the entire thing bright orange to ease recovery. In addition, that type of polystyrene is easy to obtain because it's used in model building and house insulation.

Another possibility is to use a ready made polystyrene box such as these. That would save on construction costs and they are very, very light.

Then I'll stencil on the outside the following:


There's a ton of other details to worry about (the cord used for the balloon, temperature sensors, ...) and I'll update with pictures, code and circuits as we go.

PS Don't expect rapid updates to this, this is a strictly spare time when I can get spare time activity.

Wednesday, July 07, 2010

The CCE Review report

Today, the Independent Climate Change Email Review report was issued. As a bit of background I'll point you to some of my old blog posts on this:

1. Met Office confirms that the station errors in CRUTEM3 are incorrectly calculated

2. Something a bit confusing from UEA/CRU

3. Something odd in CRUTEM3 station errors

4. New version of CRUTEM3 and HADCRUT3

5. Well, I was right about one thing

6. Bugs in the software flash the message "Something's out there"

7. Whoops. There's a third bug in that code

And here's what one of the people involved in the report wrote:

First, climategate reveals the urgent demand by a new breed of citizen-scientist for access to the raw data scientists use to do their work. Simply accepting a scientist's assurance that data are accurate and reliable is no longer enough. Scientists will have to make their data available for independent audit.

Second, climategate shows that science must change its idea of accountability. Traditionally, scientists have been accountable only to one another. But with the advent of new critical public voices in science – the birth of the blogosphere, for example – scientists must redefine who is a legitimate critic and who isn't. It is easy to brand the blogosphere as universally damaging and defamatory. But climategate has shown that while some critics do enjoy abusing scientists, others ask tough and illuminating questions, exposing important errors and elisions. These critics have an important part to play in shaping scientific debate and dialogue.

The report itself says:

37. Making source code publicly available. We believe that, at the point of publication, enough information should be available to reconstruct the process of analysis. This may be a full description of algorithms and/or software programs where appropriate. We note the action of NASA‘s Goddard Institute for Space Science in making the source code used to generate the GISTEMP gridded dataset publically available. We also note the recommendation of the US National Academy of Sciences in its report ―Ensuring the Integrity, Accessibility and Stewardship of Research Data in the Digital Age that: “…the default assumption should be that research data, methods (including the techniques, procedures and tools that have been used to collect, generate or analyze data, such as models, computer code and input data) and other information integral to a publically reported result will be publically accessible when results are reported." We commend this approach to CRU.


Interestingly, Sir Muir Russell's review did what I did: they wrote code (in their case in C++) to reproduce CRUTEM3 (see Appendix 7). Unfortunately, they didn't go all the way to check the error ranges and find the bug that Ilya Goz and I found.

PS I asked the review if they would give me a copy of the C++ code (in the spirit of openness :-)

Saturday, July 03, 2010

Bayes, Bletchley, JN-25 and a 'modern' optimization

In the current edition (June 2010) of significance (the magazine of the Royal Statistical Society) there's an article titled Edward Simpson: Bayes at Bletchley Park. Simpson is the man behind Simpson's Paradox (which, if you have not heard of it, you should immediately go and read about because it's... weird). He also worked at Bletchley Park during the Second World War.

I found the article fascinating for three reasons: it's about Bletchley Park, it's about Bayes Theorem (which I've spent quite a lot of time dealing with) and it highlights an optimization that was done to enable WRENs to work faster, yet is still in use today.

The article talks about life at Bletchley Park, and the breaking of two codes: Enigma and the Japanese JN-25. So much has been written about Enigma that I have little more to say about it, but JN-25 is a different matter. Its breaking involved a bit of Bayes and ended up helping the Allied victory at the Battle of Midway.


This Japanese Naval code works as follows. The Japanese word to be enciphered is looked up in a book where a corresponding 5 digit number is found. This number is the input to the encipherment. For example, the Japanese word for weather might be encoded as 30525. To help prevent transcription errors the Japanese code numbers were all divisible by three. Thus it would be easy to check a message before enciphering it (especially since it's easy to check if a number is divisible by 3).

So the message would consist of blocks of five-digit numbers. Next a book containing random five digit numbers was used to select a sequence of numbers that would be used to encipher the message. For each five digit group in the message, a five digit number from the book (in sequence) would be chosen.

Simpson's article has the following example:

Corresponding code numbers: 70863 34131 30525
Secret sequence from code book: 58304 68035 91107

The original five digit code representing a word would be added to the secret five digit number digit by digit with no carry (a form of modulo arithmetic). For example, "weather" might be 30525 and the chosen secret number 91107 and so the enciphered word would be 21622. The recipient would reverse the addition to get at the message.

That's a very secure system if the secret numbers are truly random and if they are not reused. In fact the Japanese secret numbers were not randomly used. The Japanese sent multiple messages using the same sequences of random numbers.

In addition, by cracking messages, Bletchley Park was able to build up a table of code numbers and their frequency of occurrence. That frequency of occurrence made an application of Bayes theorem possible.

Here's how it worked. A code breaker would be given a set of messages all known to have been enciphered with the same stretch of secret numbers. They might have between 2 and 20 messages. Suppose that in one message there's "good" (34131) enciphered as 92166 using the secret number 68035 and in another there's "stop" (50418) enciphered with the same secret number as 18443. The code breaker has in front of them 18443 and 92166 and by lining up messages on the table has columns of numbers enciphered with the same, unknown secret numbers.

The difference between 18443 and 92166 (using the modulo arithmetic) will be 26387. This is in fact the difference between the original code words 34131 and 50418: the secret number 68035 has been eliminated.

What the folks at Bletchley Park did was build themselves a big table of differences between the code words. The code breaker would look up 26387 in the book to see if this difference was known (indicating that the two underlying words had been identified and they would have been present in the book as 34131 and 50418). If 26387 appeared in the book then the actual code numbers (34131 and 50418) could be subtracted from the enciphered numbers 18443 and 92166 to reveal the secret number 68035 that was used to encipher the message.

Since they had many messages to work with they would actually check all pairs of numbers. For each pair, a difference is calculated and the resulting number looked up in the book to see if it exists.

Now for Bayes

On the table in front of the code breaker are a set of messages. They would take the number 68035 and subtract it from each enciphered number in the same column to reveal the underlying code number. If any of the numbers was not divisible by three then the decipherment had failed, but if not the next stage was undertaken.

Since the code numbers represented actual words they had a pattern of frequency of occurrence. So, the code breaker would take the list of numbers from the column they had deciphered and look up their frequency in a book. Suppose the code breaker has numbers a, b, c, ... and finds that a and c are in the book but b is not (perhaps because it's wrong or perhaps because the five digit code has never been seen before). The book gives P(a) and P(c) (the probability of those words occurring based on their frequency), the probability that those five digit groups occurred randomnly (actually 3/100000) was called P. Thus the Bayesian evidence given by a occurring is P(a)/P (and similarly for c).

For the word b that hasn't been seen before (or may be wrong) a frequency of occurrence of 1 was set which meant that P(b) was tiny and hence b was essentially ignored.

Then Bayes Theorem was used to multiply together P(a)/P and P(c)/P to get a probability that the number 68035 was the real underlying secret word used to encipher the numbers. If that probability was above a certain threshold the number 68035 was taken as genuine and the decipherment could continue.


There are three things that struck me in Simpson's explanation. Firstly, the prior probability is ignored in the Bayesian calculation. This sort of ignoring the priors happens in Naive Bayesian text classifiers where the prior probability of a 'document' or 'text' is irrelevant: it just acts as a scale factor that can be ignored. I did exactly that in the implementation of POPFile.

Secondly, the handling of the 'unknown' code numbers (b in the example above) is very similar to the way in which POPFile handles unknown words. A very small (close to 0, but non-zero) probability is chosen indicating that this word probability doesn't exist, but just might do.

The other optimization that occurs in POPFile is that rather than multiplying Bayes factors together, the logs of Bayes factors are added together. This is fast and prevents arithmetic underflow when dealing with tiny probabilities. Recall that log AB = log A + log B and log A/B = log A - log B. This is exactly the same trick that makes slide rules work (sliding the rule is 'adding' on a logrithmic scale).

This technique is common (I certainly didn't invent it) and was used at Bletchley Park. Instead of having WRENs do laborious multiplications and divisions of Bayes factors a book of scores was created consisting of the log of the Bayes factor for each code number. Checking whether the code numbers passed the Bayes test to reveal a useable secret number was thus a matter of looking up and adding these factors and seeing if the sum passed a threshold value.

PS If you have access to significance the rest of the article is well worth reading.