Saturday, July 31, 2010

Eating your own dog food in the 1960s

It's common in the software industry to hear the phrase "Eat your own dog food". It's so common that Wikipedia has a page dedicated to the phrase. The idea is that a software developer should also be a user of the software they are developing. By analogy someone making dog food should make it good enough that they'll eat it themselves.

The phrase has been around since at least the 1980s, but I recently came across a much earlier expression of the same idea. And this came from one of the X-15 test pilots: Milton Thompson.

Some models of the X-15 contained a device that blended the aircraft's aerodynamic controls with its reaction controls. Since the X-15 flew both like a plane and like a spacecraft it had conventional control surfaces as well as thrusters, and these were originally operated by separate joysticks. The blending device was the MH-96, and it enabled the pilot to fly the X-15 in the atmosphere or in space using a single joystick.

But that meant that the MH-96 was controlling the X-15 with the input of the pilot. And so the pilot had to trust the MH-96 itself. Thompson wrote "As a pilot, you hope the guy who designed this electronic control system knew what he was doing. In fact, you would like him to be in the airplane with you to be exposed to any adverse results." Now, that's eating your own dog food.

More recently I had the pleasure of chatting with Sven Strohband, who was the lead engineer on Stanley, the car that won the 2005 DARPA Grand Challenge.

He told me that he had a simple technique for ensuring high quality. When a software engineer told him that their module was ready for testing in the autonomous car, he would say: "OK, we'll load your code and you can sit in the car while it drives you around". Invariably, the engineer would need just a little bit more time before declaring his code ready.

Tuesday, July 27, 2010

Maps don't have to be oriented North/South

If you have a paper map then commonly it'll have a north pointer on it and it'll be oriented such that north is up. This works great if you don't know where you'll be standing when you look at the map: you figure out where north is, turn the map, and away you go. But fixed maps don't have to be oriented like that. In fact, they're better off oriented so that the person looking at the map is presented with a map oriented based on the direction they are facing.

Here's the map outside my office for the new London bike hire scheme:

The You Are Here arrow points in the direction the viewer is facing along the road they are standing next to. The north arrow is still present for people who want that piece of information. To my mind this is far more sensible because it eliminates worrying about which direction you are facing.

Here's another from central London with the same system. This time north is in a different direction.

And here's a section of the New York subway map. All of Manhattan has been falsely oriented north/south so that the designations uptown and downtown make more sense:

Of course, some mobile phones with maps include a compass to orient the GPS-based map the correct way. Here's my iPhone map application orienting itself down Dover Street in London:

There are probably lots of other examples of maps that are oriented in a more ergonomic fashion than north/south. Can you point me to examples?

Sunday, July 25, 2010

How to buy O'Reilly print + e-book combos in the UK

If you like books from O'Reilly then you might have noticed that in the US you can buy a book + e-book combo, but in the UK you can only buy the book.

O'Reilly tells me that this situation will change sometime soon, but there is a workaround: buy the UK book and, when you receive it, register it with O'Reilly; you can then upgrade your book and get the e-book version.

Saturday, July 24, 2010

Lost song lyrics

It's funny how songs can haunt you even when they are just a bare memory. For years I have searched for a song for which I only knew a couple of phrases. I thought, wrongly, it was Karen Carpenter, but thanks to the magic of Google and YouTube, I am saved.

It's Amoureuse and the singer is Kiki Dee. I heard it a few times on the radio about 30 years ago and here it is:

I should have known that the original was French:

Friday, July 23, 2010

Welcome childless men under 35

According to Alexa, this blog is very popular in the US (sometimes bouncing up into the top 15,000 sites, not just blogs, by traffic in the US), but it's particularly popular with childless men under 35 who browse from work and have no college education. The final part surprises me; you would have thought that a college education was a prerequisite for coming here.

If you're a visitor to the site who has no college education, but enjoys the content here, please post a comment on what you like. I'm fascinated to hear.

And if it isn't obvious: welcome!

Monday, July 19, 2010

RIP Stephen Schneider

I learnt tonight that Stanford Professor Stephen Schneider has died. This is a great pity.

In July 2009 I was looking into the use of language to describe statistical concepts and wrote a blog post about it. I was using as a reference the language used in climate change documents and I wrote to Professor Schneider with a query (he's the person referenced in the blog post). He was courteous and helpful, so by way of remembrance I reproduce our exchange. He had no idea who I was and treated me with great respect.

Re: Query about IPCC language used to express confidence
From: Stephen H Schneider to John, Richard
Date: 7/14/09

Thanks John, good question. It was negotiated by several rounds of email peer review, and the guidance paper was no more than that--guidance in the TAR. Some working groups evolved their own tweaks to it in the TAR, and in AR 4 a formal group got together and redid it again so the margins evolved. Still evolving. Richard, anything to add?

Cheers, Steve

Stephen H. Schneider
Melvin and Joan Lane Professor for Interdisciplinary Environmental Studies,
Professor, Department of Biology and
Senior Fellow, Woods Institute for the Environment
Mailing address:
Yang & Yamazaki Environment & Energy Building - MC 4205
473 Via Ortega
Ph: 650 725 9978
F: 650 725 4387

----- Original Message -----
From: "John Graham-Cumming"
To: [email protected]
Sent: Monday, July 13, 2009 8:44:24 AM GMT -08:00 US/Canada Pacific
Subject: Query about IPCC language used to express confidence


I'm writing to ask a quick question about an apparent change in the confidence intervals used by the IPCC. Specifically, in the 2000 "Uncertainties in the IPCC TAR" paper that you authored with Moss you refer to Morgan and Henrion, 1990 with five levels of confidence. I'm particularly interested in the top level which is "Very High Confidence" being >95%. In the "Guidance Notes for Lead Authors of the IPCC Fourth Assessment Report on Addressing Uncertainties" the recommendation is that "Very High Confidence" be >90%.

Do you know where I'd go to look to find how that language was decided upon? I'm interested in why the confidence interval was widened.

BTW I'm not looking for ammunition for climate change denial... I'm interested in the interpretation of statistics and wanted to blog a little about them.


The exchange continued:

From: Stephen H Schneider to John
Date: 7/14/09

sure John. I forgot to ask if you had the original guidance paper--in case not attached--have fun with your blog, Steve

Stephen H. Schneider
Melvin and Joan Lane Professor for Interdisciplinary Environmental Studies,
Professor, Department of Biology and
Senior Fellow, Woods Institute for the Environment
Mailing address:
Yang & Yamazaki Environment & Energy Building - MC 4205
473 Via Ortega
Ph: 650 725 9978
F: 650 725 4387

----- Original Message -----
From: "John Graham-Cumming" [email protected]
- Hide quoted text -
To: "Stephen H Schneider" [email protected]
Cc: "Richard Moss" [email protected]
Sent: Tuesday, July 14, 2009 1:31:24 AM GMT -08:00 US/Canada Pacific
Subject: Re: Query about IPCC language used to express confidence

Thank you for taking the time to reply. I'll try to write up a short
blog post about this. I thought it was rather interesting because
it's difficult for people to interpret statistics and the attempt to
come up with clear language is worth talking about.


Note how he sent me a document he thought I would find useful without me asking.

Top bloke.

Roomba Torture

I own a Roomba and it's great. But like a 'teacher' in Milgram's famous experiment I just can't help picking on the little guy.

Here are three tortures I devised for Roomba. See how well Roomba handled them. (In all these videos you can see flashes of blue light coming from Roomba and the docking station; these are in fact the infrared sensors on Roomba which are picked up by the camera and which are not normally visible).

1. Escape from the legs of three chairs when placed in the middle of them.

2. Don't fall off the table top, you might get hurt!

3. You want to go home? Oops, there's an obstacle.

In the end I couldn't leave Roomba rotating on the spot and had to help out. Maybe I'm not so cruel after all.

The bandwidth of a fully laden 747

Never underestimate the speed of shipping physical stuff when you want to move large amounts of data. The Internet is actually horribly slow even at 'high' speeds. That's why Amazon Web Services offers an Import/Export service that involves shipping physical disks around.

Back in 1999 I wrote an article for The Guardian explaining latency and bandwidth for modem users. In it I used a jet plane full of people flying across the Atlantic to illustrate the difference.

The analogy still works today, and brings me to the farcical question: what's the bandwidth of a fully laden 747?

Assuming I fill it with DAT 320 cartridges, each of which can hold 160GB of uncompressed data and each of which weighs about 50g, I can fit about 2.8 million cartridges in the plane (a single 747 can lift 140 tonnes). That's about 427 PB of storage in the plane. (I'm not sure that many cartridges would actually fit inside the 747, but you get the idea.)

Now assume it flies from San Francisco to London in 10 hours. That's a bandwidth of about 12 TB/s.
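The arithmetic behind those figures can be sketched in a few lines (capacity, weight and flight time as above; the PB figure uses binary units, which is why it comes out at 427 rather than 448):

```python
# Back-of-the-envelope bandwidth of a 747 full of DAT 320 cartridges.
# Figures from the text: 160 GB per cartridge, 50 g each, and a
# 140 tonne payload flown from San Francisco to London in 10 hours.
cartridge_gb = 160
cartridge_kg = 0.050
payload_kg = 140_000
flight_hours = 10

cartridges = payload_kg / cartridge_kg            # ~2.8 million
total_pb = cartridges * cartridge_gb / 1024 ** 2  # GB -> PB (binary)
tb_per_second = total_pb * 1024 / (flight_hours * 3600)

print(f"{cartridges:,.0f} cartridges, {total_pb:.0f} PB, {tb_per_second:.1f} TB/s")
# -> 2,800,000 cartridges, 427 PB, 12.2 TB/s
```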

Which brings me to today's announcement from RackSpace about OpenStack. One of the goals of OpenStack is to remove lock-in to specific providers. That's a noble goal, but if you store a lot of data in the cloud you might find yourself needing a 747 (or at least FedEx) if you decide to change providers.

PS Many people have commented that it would take a while to fill up the DAT tapes. Clearly the solution, as one commenter suggests, is to use a 747 as your data centre and fly it where you need.

Sunday, July 18, 2010

Monte Carlo simulation of One Banana, Two Banana to develop a card counting strategy

The children's game One Banana, Two Banana is a high stakes game of probability theory in action. Or something like that. Actually, it's a fun game where you have to take probability into account when deciding what to do.

Conceptually, the game is simple. There are 54 cards of five types. Three of the card types have bananas on them (one, two or three bananas), one card type has a banana skin on it and one card type has a swamp on it. All the cards are placed face down on the table and shuffled about. Each player takes turns at turning over as many cards as they want.

The total distance they move on the board (winning is a simple first past the post system) is the sum of the number of bananas on the banana cards. But if they pick a banana skin card then they stop picking cards and must reverse direction by the sum of the number of bananas on the banana cards. The swamp card simply stops a turn without a penalty.

There are six banana skin cards to start with and as they are removed from the pack they are placed on a special card for all to see. At every stage everyone knows the number of banana skin cards remaining. Thus you can adjust your strategy based on the number of banana skins that have caused others to go back.
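The core trade-off is visible even without a full simulation. Here's a minimal sketch of my own (not part of the game's rules) of the expected change in a turn's score from turning exactly one more card: a skin flips a banked score of s to -s, a swing of -2s, while a swamp changes nothing.

```python
def one_more_card_ev(skins, swamps, ones, twos, threes, turn_score):
    """Expected change in this turn's score from turning one more
    card, given the counts of face-down cards left on the table."""
    total = skins + swamps + ones + twos + threes
    gain = ones * 1 + twos * 2 + threes * 3  # banana cards add bananas
    loss = skins * 2 * turn_score            # a skin turns +s into -s
    return (gain - loss) / total

# Fresh board (6 skins, 6 swamps, 14 of each banana card), 5 bananas banked:
print(one_more_card_ev(6, 6, 14, 14, 14, 5))  # positive: keep going
```

On a fresh board the break-even turn score works out to 7 bananas, and it rises as skins leave the table, which matches the intuition that fewer skins make pressing on safer.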

So I wrote a little program that simulates One Banana, Two Banana games (or at least board positions) to see what the expected score is depending on the number of cards the player chooses to pick. I assume a very simple strategy of deciding up front how many cards to pick. A more complex strategy would be to look at the score as the player turns over cards and decide when to quit.

First, here's the code:

# Monte Carlo simulation of One Banana, Two Banana
# Copyright (c) 2010 John Graham-Cumming
# For game details see
# In the game there are a total of 54 cards of five types:
# 14 cards with three bananas on them
# 14 cards with two bananas on them
# 14 cards with one banana on them
# 6 cards with a swamp on them
# 6 cards with a banana skin
# The cards are placed face down on the table. At each turn the
# player can take as many cards as they wish. If they turn over a
# swamp or banana skin card they must stop; otherwise they stop when
# they choose.
# The total number of bananas is added up (call it X). If the banana
# skin card was turned over then the player moves BACK X spaces, if
# not the player moves FORWARD X spaces.
# Players have perfect knowledge of the state of the cards as they are
# placed face up once drawn. Once all six banana skin cards have been
# drawn the cards are all placed face down on the table again.

# The simulation runs through all the possible card positions and
# plays a large number of random draws for each possible number of
# cards a player might draw. It then keeps track of the score for a
# number of different card counting scenarios.
# Card counting scenarios:
# Number of skins: the player only keeps track of the number of banana
# skins that are on the table.
# Number banana cards: the player calculates the number of banana
# cards on the table.
# Number of three bananas: the player keeps track of the number of
# high scoring three banana cards on the table.

use strict;
use warnings;

# This is a two dimensional array. The first dimension is the number
# of skins (6, 5, 4, 3, 2, 1). The second dimension is the number of
# cards the player will pick. Each entry is a hash with keys score
# (the total score of all trials) and count (the total number of
# trials).

my @skins;

# Similar arrays to @skins, but here the first dimension is the number
# of cards on the table of the specific type.

my @bananas;
my @threes;

# Generate all possible valid board positions using nested loops.
# Note that the skin cards can never be zero because that's when the
# board resets.
# There are 6 x 7 x 15 x 15 x 15 possible board positions (141,750).

for my $sk (1..6) {
for my $sw (0..6) {
for my $on (0..14) {
for my $tw (0..14) {
for my $th (0..14) {

# Arbitrarily chosen run of 100 plays

for my $sim (0..99) {

# Allow the player to pick up to 10 cards

for my $to_pick (1..10) {

# The state of the board at any time can be represented by
# a 5-tuple (Skin, Swamp, One, Two, Three) giving the
# number of cards of each type present on the playing
# board. The initial state is (6, 6, 14, 14, 14).

my @cards = ( $sk, $sw, $on, $tw, $th );

my $score = 0;
for my $pick (1..$to_pick) {
  my $total = $cards[0] + $cards[1] + $cards[2] + $cards[3]
      + $cards[4];
  my $picked = int(rand($total));
  if ( $picked < $cards[0] ) {
    $score = -$score;   # Banana skin: reverse direction, stop picking
    last;
  } elsif ( $picked < ( $cards[0] + $cards[1] ) ) {
    last;               # Swamp: turn ends with no penalty
  } elsif ( $picked < ( $cards[0] + $cards[1] + $cards[2] ) ) {
    $score += 1;
    $cards[2]--;
  } elsif ( $picked < ( $cards[0] + $cards[1] + $cards[2] +
                        $cards[3] ) ) {
    $score += 2;
    $cards[3]--;
  } else {
    $score += 3;
    $cards[4]--;
  }
}

$skins[$sk][$to_pick]{score} += $score;
$skins[$sk][$to_pick]{count}++;

$bananas[$on+$tw+$th][$to_pick]{score} += $score;
$bananas[$on+$tw+$th][$to_pick]{count}++;

$threes[$th][$to_pick]{score} += $score;
$threes[$th][$to_pick]{count}++;
}
}
}
}
}
}
print "$sk\n";   # Progress indicator: one line per skin count
}

if ( open F, ">skins.csv" ) {
  for my $i ( 1..6 ) {
    for my $k (1..10) {
      print F "$i,$k,$skins[$i][$k]{score},$skins[$i][$k]{count}\n";
    }
  }
  close F;
}

And here's a chart showing the average score from the Monte Carlo simulation for each of the six possible numbers of banana skins on the board. The X axis shows the number of cards picked (or desired to be picked), and the Y axis the average score.

If you look just at the maximum scores for each of the banana skin counts, then a simple pattern emerges.

Banana Skins | Optimum Number of Cards | Expected Score

So, just memorize that pattern and you'll be a better player. (I originally added that if you are a player who has exceeded the expected score with fewer than the recommended cards you'll know it's best to stop, but I've removed that sentence because it is not correct. That's a different strategy, based on stopping given the cards you have already turned over, and it deserves a separate analysis to find the best strategy.)

Watch out Vegas, here I come.

Saturday, July 17, 2010

There must be a name for this disease

Lining up all the toiletries on one side of the sink.

And then using them in order and lining them up on the other side of the sink.

What's that called?

PS Apparently this runs in the family. My father writes:

My father would position a slide rule exactly equidistant from each side of a cabin drawer and I saw him adjust by a single mm with parallel pencil and carefully positioned planimeter. He also counted rivets - that is the next stage...

Hmm. On the counting rivets part, I guess that's like counting squares on the back of an airline seat.

GAGA-1: Some initial investigations

I've been investigating bits and bobs to go into GAGA-1 and have run one freeze test.

1. Components

I've now got the Telit GM862 module up and running with Arduino Duemilanove and have uploaded a simple Python script that SMSes me the device's location once a minute. The module is on the right in the picture and a SparkFun board for it is on the left.

Here's an example SMS received on my phone from the device giving latitude and longitude and GPS status. This format is by no means final, it's just a test.

I've also started working with the Trimble Lassen IQ module which is tiny:

It will eventually be interfaced directly to the Arduino which will connect to the radio to send down telemetry.

2. Freeze test

In my first post on the subject, I said that I'd be doing a freeze test of everything. My first test was to build a simple temperature sensor using an LM35, put the Arduino in my home freezer and allow its temperature to drop to -18C for an hour. This test was successful. I was able to read the temperature sensor throughout via a USB cable coming out of the freezer door.

Here's the breadboarded temperature sensor connected to the Arduino in the freezer at the start of the test.

Because the Arduino's analog ports can only handle a positive voltage the negative input to the LM35 is biased (as in the datasheet) using a couple of diodes.

The diodes in series give a voltage drop of about 2V. The LM35 would normally output -550mV at -55C (a temperature likely to be experienced by GAGA-1 in the stratosphere); with the bias, -55C corresponds to about 1.5V. The LM35 is powered by the Arduino's 5V regulated output (it draws only 60µA). The top of the range, 150C, corresponds to a 3.5V output.

The Arduino's ADC divides the 0-5V range into 1024 steps, or about 4.9mV per step. The LM35 gives 10mV per C, so we'll only be able to measure to within about 0.5C (which is fine for GAGA-1).

An added complexity is that the diodes' forward voltage drop is temperature dependent and so to get a good reading the output from the LM35 is connected to one ADC port and the voltage at the 'top' of the two diodes is connected to another. Reading the two ports and taking the difference gives a more accurate measure of the temperature as it compensates for the drift in the diode forward voltage.

Here's the test code I used.

int temperaturePin = 0;  // LM35 output
int biasPin = 1;         // Voltage at the 'top' of the bias diodes

void setup() {
  Serial.begin(9600);    // Assumed baud rate for the USB serial link
}

void loop() {
  int temperature = analogRead(temperaturePin);
  int bias = analogRead(biasPin);
  int unbiased = temperature - bias;

  // 4.88mV per ADC step, 10mV per degree C => x 500 / 1024
  float celsius = (float)unbiased * 500.0 / 1024.0;

  Serial.print( "Temperature: " );
  Serial.println( temperature, DEC );
  Serial.print( "Bias: " );
  Serial.println( bias, DEC );
  Serial.print( "Celsius: " );
  Serial.println( celsius, 2 );

  delay(1000);
}

And here are two excerpts from the log file output (which was just sent by the Arduino down the USB connection to the Wiring application on my Mac).

Temperature: 326
Bias: 279
Celsius: 22.95

Temperature: 255
Bias: 291
Celsius: -17.58

In my fridge test the diodes drifted by about 0.06V.

For the actual project I am going to shift to the GCC toolchain rather than use the Arduino IDE. I'll be much happier with emacs, make and gcc.

3. Circuit board

I was initially going to hand-build a small circuit board for each part of the computer (radio, GPS, backup GPS, temperature sensors), but I've come to the conclusion, between Eagle CAD and BatchPCB, that the right thing to do is build a custom board.

So, to simplify my life I'm going to build an Arduino Shield that will slot right onto the Arduino Duemilanove and handle the two GPSes, the radio and temperature sensors. That'll minimize wiring inside GAGA-1 (will just need antenna wiring) and likely be more robust.

Here are the beginnings of the schematic.

Next update: in a few weeks once the board has come together.

Friday, July 16, 2010

The Lazy Programmer

In a comment on my The thing I hate about HTML post, a reader writes:

For me, this boils down to the fact that you're lazy.

He was objecting to me not wanting to write HTML end tags. If you ignore the tone, which might be read as snotty or offensive, then I think there's an important lesson.

Yes, I am a lazy programmer.

I am lazy in the sense that I don't want to spend my time doing hard, tedious work that a computer can do for me. In the HTML example a computer is ideally suited to figuring out which tag I was thinking of closing. And it can tell me when I've missed something. And it will go horribly wrong if I don't get the nesting right.

Here's the thing: getting the nesting right is the thing I need to do. That's what I need to be using my brain for, not for something the computer can do. And if you give me the 'freedom' to mess up the tag closing then you create the mess we have today in HTML: all sorts of documents that require someone else to spend enormous effort writing a parser to cope with my mess. Giving me the luxury of being lazy about correctly balancing the tags means someone else has to undertake the Herculean task of being compatible with all the web pages out there.

It's also a bit like garbage collection. Do I want to remember to call free() for every malloc()? Not really, I just want to get memory and use it and discard it when I'm done. And sometimes I want to use C and get into the guts of memory management for speed: that's when I've swapped laziness for hard work because I need to do that hard work.

And there's another sort of lazy that I want to be: lazily evaluated. Don't compute things until you need them, and don't make me figure out when they are going to be needed.

There's great freedom in laziness.

Thursday, July 15, 2010

The thing I hate about HTML

Is end tags. They are the source of HTML evil.

Consider this:

<div class="red">
Hello, World!
</div>

Now, ask yourself: do we really need the end tags to say what they are ending? I mean, what's the difference with this?

<div class="red">
Hello, World!

The problem with having named end tags and browsers that are kind to bad developers is that it lets you do something like this and get away with it. The browser magically says: "Oh, he must have meant to close that <p> tag so I'll do it for him."

<div class="red">
<p>Hello, World!
</div>

HTML really ought to look like this:

(div class="red"
(Hello, World!))

Given that the tags are meant to be balanced there's no reason to force the author to figure out which tags they want to balance and force them to write it down. It adds nothing. If they get the balancing wrong then they should be punished for it and clean up the page. Writing nice end tags is busy work that satisfies the mediocre. Without them we'd have nice, parseable pages.
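To illustrate the "nice, parseable" claim, here's a toy parser for the paren syntax (my sketch, not something from the post: the whitespace tokenization is deliberately crude, and the nested (p ...) element in the example input is assumed for illustration). Because closers are anonymous, the parser never has to match an end tag's name against the element it closes.

```python
import re

def parse(text):
    """Parse paren-style markup into a nested dict of tag + children."""
    # Tokens are '(', ')', or any run of non-paren, non-space characters.
    tokens = re.findall(r'\(|\)|[^()\s]+', text)
    pos = 0

    def element():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        node = {'tag': tokens[pos], 'children': []}
        pos += 1
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                node['children'].append(element())
            else:
                node['children'].append(tokens[pos])
                pos += 1
        pos += 1  # consume the anonymous ')'
        return node

    return element()

tree = parse('(div class="red" (p Hello, World!))')
print(tree)
```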

But what do I know, I think LISP is cool.

PS Yes, I know there are some tags that don't need end tags (e.g. <img>). But that's just messed up. Also, <p/> would simply be (p) in my scheme.

PPS Since people were bitching about me using font and center tags I've changed them.

Programmer Gore

I recently attended an invitation-only, private event where a small number of venture-backed startups that are doing exciting things in scaling web applications opened the kimono and talked about technical challenges. One of the startups was a major social network whose engineers talked about their backend.

When they have a failure they like to truly understand the root cause of the failure and then they do a postmortem with their team. And they love to get into the gory details. They made the point that programmers really love a horrid story about why something failed because it's fascinating and they can learn from the situation.

If they can't get to the root cause they'll fix the problem and then leave a small percentage of their servers running the old buggy code just so they can reproduce the bug and poke at it until they really know why it broke.

It's like telling horror stories around the camp fire:

We shut down all our Apache servers and the bad traffic was still coming.

We pulled the plug on the DNS server and the traffic kept coming.

And then we realized: the traffic was coming from inside the building! (scream)

I think this is a fantastic idea: truly get to the source of a failure and then reveal the gory details. It's a great engineering culture if you can tell those stories and learn from them.

Once, at Electric Cloud years ago, I was sent in to deploy ElectricAccelerator at a large client. Instead of doing what it normally does, speeding up a build 20x, the build became slow, really slow. In fact, the product was locking up, but it was a livelock: it was doing something for a very, very long time.

After literally days of debugging I found the problem. ElectricAccelerator has to walk the tree of dependencies in a build making various calculations along the way. The structure of this particular build was such that one particular tree node was highly connected to the rest of the build. Essentially because of the particular shape of the build tree a function was being called on that node to calculate a fixed value over and over again. The simple fix was to memoize the function and the build worked fast.
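The fix described above is the standard memoization pattern. A hedged sketch (not the actual ElectricAccelerator code; the node type and the calculation are stand-ins) of caching a pure, per-node function:

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive, repeatedly-invoked
# calculation on a highly connected build-tree node.
@lru_cache(maxsize=None)
def node_value(node_id: int) -> int:
    # ...an expensive walk of the node's dependencies would go here...
    return node_id * 2

node_value(42)  # computed once
node_value(42)  # every later call is served from the cache
print(node_value.cache_info().hits)  # -> 1
```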

Later I calculated how long the build would have taken to run without the fix. The answer was that the sun would have died long, long before. Of course, it seems obvious looking back that we should have been caching the value in that function, but distributed or parallel systems have these sorts of problems all the time. It's not a priori obvious what the 'shape' of your data or traffic is going to be.

And when distributed or parallel systems go wrong, they go wrong in weird and wonderful ways that make great programmer gore stories.

Tuesday, July 13, 2010

The Geek Atlas Companion

The Geek Atlas has been out for a year now, and there's been an iPhone version for a while.

But now there's a special companion application for the book: The Geek Atlas Companion. O'Reilly developed this as a counterpart to the book.

My favorite part of the application is the ability to find places from the book that are near you. The application has all 128 places in it with their coordinates (just like the book) and using the iPhone's GPS it can give you a list of places near you. Each place has a small summary of what you'll see there, but not the complete book text. The application is designed as an add on and not a replacement for the book itself.

A fun part that O'Reilly developed is a quiz based on the book. Multiple choice questions lead you through some of the science that's covered in The Geek Atlas.

Finally, there's a strong community component where users are encouraged to upload their photographs of locations in the book. The application gives rapid access to other people's photographs of the locations.

Saturday, July 10, 2010

BallastHalo 5 launch afternoon out

The early bird truly catches the worm. I woke up at 0530 this morning and couldn't go back to sleep. After dealing with email etc. I was able to get to the Apple store to see a Genius, browse around in Maplin, go to the Post Office and be back home by 1000. So, then I started browsing around a bit for a project I'm slowly working on: sending a balloon up into the stratosphere to photograph the Earth and space (more on that in a separate blog post).

I wandered over to the UK High Altitude Society web site and idly clicked on Current Launch. I had to read the date 10/07/10 about three times before I realized that it was today, and that I could make it to Cambridge for the launch if I got moving.

I quickly Twittered to James Coxon to see if I'd be welcome. I was. I grabbed the Jack Pack and stuffed it with journey essentials: iPhone, Kindle, Oyster Card, Camera and Power Monkey and raced to Cambridge via Kings Cross.

I arrived with enough time to sit down with Nathan Chong of Newstilt for a chat about really cool stuff that he's working on. And then it was off to Churchill College to meet James, Ed and cohorts. They were incredibly welcoming (especially since I had invited myself along) and answered all my questions about the launch of BallastHalo 5.

BallastHalo 5 was a test launch as part of a plan to send a balloon across the Atlantic:

BallastHalo 5 is the most recent payload in a series of launches to test ballast tank concepts in anticipation of a trans-atlantic balloon launch. A successful trans-atlantic launch would require the ability to drop ballast at night to counteract the loss of helium from the ZP balloon during the day.

Here's the primary payload that was launched. It contains two main components: a microcontroller that talks to a GPS to get location information and then relays that information via a low power and slow data link to the ground; a system for dumping ballast (surgical spirit) to lighten the balloon as needed. The first picture shows the internals, the second shows the assembled unit hanging from a tripod. The payload circuits are inside a polystyrene box that's covered in reflective foil to keep the heat in: it's cold up in the stratosphere.

There's also a second transmitter sending out CW (Morse) and Hellschreiber in case the main computer fails. It can be used to find the entire device if lost (and has been in the past when previous flights have taken the box to the Netherlands and France).

After checking all the electronics and the receivers we headed outside to fill the balloon, attach the parachute and launch. My role was helping to drag the 100kg helium bottle to the launch location and standing around staying out of the way. To get the balloon filled to the right lift a plastic bag was filled with Coke cans equaling the total payload weight (including the cord and parachute) plus the desired lift, and attached to the balloon. Once the combination had achieved neutral buoyancy there was sufficient helium for the flight.

Once the balloon was filled the payload and beacon were attached along with the parachute and we moved off to the launch site about 200m away. And then the balloon was off.

And the balloon ascended beautifully downlinking telemetry to multiple radio operators around the UK.

We returned to the college to watch the telemetry come in and follow the flight on the tracker. The balloon reached a maximum altitude of about 31km and flew for about 5 hours sending a clear signal all the way. Unfortunately, calamity struck when the balloon burst while the payload was over the North Sea and BallastHalo 5 descended gracefully into the water. On the way the emergency ballast routine was activated automatically to dump the surgical spirit as fast as possible to slow the descent.

There are lots more photos on Flickr.

UseTheSource / O'Reilly book giveaway

O'Reilly kindly contacted me about UseTheSource and offered me free e-copies of the book Beautiful Code to give away to UseTheSource readers. I thought this was a great idea and so O'Reilly has made available 5 copies of Beautiful Code.

Here's the giveaway.

Thursday, July 08, 2010

Top nerdy parody music videos

I love a good song parody (and tried to write a few of my own) and since I've laughed at many YouTube parodies in the past I figured I'd share.

The first video was tweeted by geekcalendar yesterday and was new to me. It's a parody of Katy Perry's "I kissed a girl" called "I kissed a nerd":

It immediately made me think of Venetian Princess's parody of Britney Spears' Womanizer.

Then, of course, there's a master of nerd rap: MC Hawking:

And what collection would be complete without Weird Al's White and Nerdy.

Finally, there's the smutty-mouthed Monzy.

I cry at the following words:

Your mom circulates like a public key,
Servicing more requests than HTTP.
She keeps all her ports open like Windows ME


In April 1961, Yuri Gagarin became the first human to travel into outer space and the first to orbit the Earth. In honor of Gagarin I'm planning to launch my own 'space' mission which I'm code naming: GAGA-1.

The code name serves a triple purpose: it sounds funny, it's the first four letters of Gagarin's name and it stands for "Geek Atlas Goes Airborne". None of the places in The Geek Atlas are quite as high as GAGA-1 will travel; I'll have to come up with a mission patch that includes The Geek Atlas in some way.

My mission won't be quite as high as Gagarin's: I plan to launch a helium-filled balloon into the stratosphere in an attempt to photograph and film the curvature of the Earth with the blackness of space visible. Also unlike Gagarin, others have gone before me: it's quite possible to launch a 'weather' balloon carrying a payload with cameras and take photographs at an altitude of 30km or greater. See, for example, this post about BallastHalo 5.

Since it's going to take me a while to assemble all the gear, write all the code and perform ground based testing, my initial flight might coincide with the 50 year anniversary of Gagarin's flight.

To make this interesting for others I'm going to document my progress and the eventual flight here using the tag gaga. Follow that if you are interested in knowing how it goes.

Here are my initial notes:

1. Legality

Yes, I can legally do this in the UK. There are restrictions on which airspace I can launch in and how heavy the payload container can be, but it's ok for me to launch a helium balloon with a recovery parachute and with a radar reflector as long as I get permission from the CAA and a NOTAM (notice to airmen) is issued about the balloon launch. These are somewhat blanket notices. For example, here's one used by CU SpaceFlight:

ID H1510/10
OS Grid Ref TL434597
Start Date 201005170001 End Date 201011172359
Lower Limit 000 Upper Limit 999
Validity Period
Q Line EGTT/QWLLW/IV/M/AW/000/999/5213N00006E002
07789 933271. AUS 10-05-0361/AS5.

2. Insurance

It doesn't appear that I'm required to have insurance, but I need to look into this more.

3. Balloon components

There are three parts to the balloon assembly: a latex weather balloon (the preferred manufacturer appears to be Kaymont), lots of helium (probably from BOC) and a parachute.

Essentially, the larger the balloon the more helium it can contain and the greater the lift it will achieve. The goal of the project will be to minimize the payload weight (currently I have calculated my ideal payload at around 700g) and maximize lift (while controlling costs) so that the balloon rises quickly. The quick rise is important for a number of reasons: it minimizes the time I have to wait for it to come down, it minimizes the low temperature exposure of the electronics, and it minimizes how far the balloon will drift from the launch site.

It looks like a 1500g balloon (the KCI 1500) would be a good idea. With a 1050g payload it'll give an ascent rate of about 5m/s. With a burst altitude of 34km that means the rise to burst would take about 2 hours.
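That "about 2 hours" is easy to sanity check with a back-of-the-envelope calculation (the figures are the ones quoted above):

```python
# Back-of-the-envelope check on the ascent figures quoted above.
BURST_ALTITUDE_M = 34_000  # predicted burst altitude of the KCI 1500
ASCENT_RATE_MS = 5.0       # predicted ascent rate with a 1050g payload

ascent_hours = BURST_ALTITUDE_M / ASCENT_RATE_MS / 3600
print("Time to burst: %.1f hours" % ascent_hours)  # about 1.9 hours
```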

The parachute would be attached to the cord attaching the balloon to the payload and would be open all the time. On the way up it does nothing, once the balloon bursts it comes into play as the payload falls and reenters thick air. The size of the parachute will modulate the descent rate, but larger parachutes also weigh more. So, some research here on good parachute sizes.

4. Balloon trajectory and flight time

Flight time is going to be largely determined by the balloon and parachute combination, but the trajectory is up to the wind. The guys at CU Spaceflight have a balloon trajectory calculator that uses weather information to predict a landing location.

5. Temperature and pressure considerations

Using the NASA Earth Atmosphere Model it's possible to predict the likely temperatures and pressure that GAGA-1 will experience. Since many of the temperatures are very low the capsule will need to be insulated to keep the electronics warm and the electronics will need to be rated to work at low temperatures. Possibly I will need to provide an internal heat source to keep the electronics in working order.

The NASA model divides the stratosphere into lower and upper regions. The balloon will likely be operating in the lower stratosphere (up to 25km) and a small portion of the upper (up to around 30km). The model gives a temperature of -56C for the lower stratosphere and a temperature of between -56C and -41C between 25km and 30km. Note that the temperature increases in the upper stratosphere because of warming by the sun at that height.

Pressure at 25km will be around 2.5kPa; at the ground it's around 101kPa. At 30km the pressure is down to about 1.2kPa.
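The NASA model itself is just three piecewise formulas (this is a sketch of the metric version of NASA Glenn's Earth Atmosphere Model; temperatures come out in Celsius and pressures in kPa):

```python
import math

def nasa_atmosphere(h):
    """Temperature (C) and pressure (kPa) at altitude h metres, from the
    metric version of NASA Glenn's Earth Atmosphere Model."""
    if h < 11_000:    # troposphere
        t = 15.04 - 0.00649 * h
        p = 101.29 * ((t + 273.1) / 288.08) ** 5.256
    elif h < 25_000:  # lower stratosphere: temperature is constant
        t = -56.46
        p = 22.65 * math.exp(1.73 - 0.000157 * h)
    else:             # upper stratosphere: sun-warmed, so temperature rises
        t = -131.21 + 0.00299 * h
        p = 2.488 * ((t + 273.1) / 216.6) ** -11.388
    return t, p

for alt in (0, 25_000, 30_000):
    t, p = nasa_atmosphere(alt)
    print("%6dm: %6.1fC %7.2fkPa" % (alt, t, p))
```

Running it reproduces the numbers above: roughly -56C and 2.5kPa at 25km, and about -42C and 1.2kPa at 30km.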

In order to make sure that temperature doesn't hurt the capsule I plan to run freeze tests first in my home freezer (-18C) and then in a commercial freezer (plan to ask the local butcher to lend me a corner of his deep freeze). This will test the electronics, any warmers and also the capsule structure.

6. Flight computer

The main flight computer will be an Arduino Duemilanove. This is primarily because I have some lying around at home (thus reducing costs), I know how to program them, and they are very flexible little beasts.

The flight computer will be responsible for reading GPS coordinates and altitude, relaying those and status messages to the radiotelemetry system and via GSM SMS messages. It will also, possibly, read the internal temperature of GAGA-1 for monitoring purposes and barometric pressure. And it should be used to activate a video camera once the capsule is high enough.

7. Cameras and Speech Synthesis

For redundancy I want the main camera to be totally independent of the rest of the electronics, and so I am going to use a second-hand Canon A560 that I got off eBay hacked using CHDK to take still pictures and short videos by scripting it internally.

The camera will also run on its own power supply (internal AA batteries) so even if there was a total flight computer failure the camera would run.

One thought I have is using an allophone speech synthesizer to speak the latitude, longitude and altitude of the payload so that the camera will pick up the sound when filming video. That way there'd be a nice record of where we were. Not sure how well this would work in the thin air at 30km.

8. Radiotelemetry for tracking

This is an area where the CU Spaceflight guys and others have got things well worked out. The radio will be a Radiometrix NTX2 running in SSB transmitting RTTY at 50 baud.

They also have software that can decode the RTTY and upload to a tracker web site. That web site can take feeds from multiple radio operators thus giving lots of coverage for tracking.
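As I understand it, the convention for these balloon telemetry sentences is a $$-prefixed, comma-separated line ending in a CRC16-CCITT checksum. Here's a sketch of what the flight computer will need to produce; the GAGA1 callsign and the exact field layout are my placeholders, and the real format will be whatever the tracker software expects:

```python
def crc16_ccitt(s):
    """CRC16-CCITT (poly 0x1021, initial value 0xFFFF), the checksum
    variant commonly used on balloon telemetry sentences."""
    crc = 0xFFFF
    for byte in s.encode("ascii"):
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def telemetry_sentence(callsign, seq, time_s, lat, lon, alt):
    # Checksum covers the body between the $$ prefix and the * separator.
    body = "%s,%d,%s,%.5f,%.5f,%d" % (callsign, seq, time_s, lat, lon, alt)
    return "$$%s*%04X\n" % (body, crc16_ccitt(body))

print(telemetry_sentence("GAGA1", 42, "12:30:05", 52.21350, 0.09640, 29500))
```

The bytes of each sentence would then be clocked out over the NTX2 as 50 baud RTTY.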

9. Launch site and time of year

Haven't really investigated this, but clearly want a nice clear day for the launch and would assume I'll do it at one of the two Cambridge launch sites that have been used by others.

10. GPS ballistic missile restrictions

One of the problems with GPS is that there are legal limits on their operation. Modules won't work above some altitude (typically 18km) or speed (515 m/s). Some modules implement this restriction as an AND (i.e. you are high and fast). I have a Trimble Lassen IQ that I used for something else that will work at this altitude at slow speed. It will be the main GPS connected to the Arduino for radio telemetry.

The other GPS module I have is a Telit GSM862 which has integrated GSM. It doesn't work above 24km. So I am going to run it separately from the main flight computer providing backup telemetry via SMS for recovery. It has an embedded Python interpreter so I will program it to send out SMS messages giving its location when it is below 24km and in GSM range.
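Since the GM862 runs an embedded Python interpreter, the gating logic can be written as plain Python. This is just a sketch of that logic: the ceiling constant and message format are my own placeholders, and the actual Telit radio and GPS calls (which would hand the message body to the modem) are deliberately left out:

```python
GSM_GPS_CEILING_M = 24_000  # altitude above which the module stops working

def should_send_sms(altitude_m, gsm_registered):
    """Only attempt backup SMS telemetry when we're below the module's
    ceiling (so the fix is trustworthy) and registered on a GSM network."""
    return altitude_m < GSM_GPS_CEILING_M and gsm_registered

def sms_body(lat, lon, altitude_m):
    # Placeholder message format; on the module this body would be sent
    # out via the modem as a text message.
    return "GAGA-1 at %.5f,%.5f alt %dm" % (lat, lon, altitude_m)

print(should_send_sms(12_000, True))   # below ceiling, in coverage
print(should_send_sms(29_000, True))   # too high: GPS can't be trusted
```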

11. Batteries

The system is likely to need both 5V and 3.3V power to run the flight computer, the GPS and the radiotelemetry system. And the batteries are likely to be exposed to very low temperatures. Ordinary household batteries are not going to be good enough because they likely won't provide the power for the length of the flight at the temperatures seen.

The consensus seems to be that some specialized batteries are likely to be needed. A good example seems to be Energizer Lithium Ultimate non-rechargeable batteries.

They can operate down to -40C and have a nice flat discharge curve maintaining a pretty stable voltage as they discharge. There's a nice technical document from Energizer on these here. Looks like they cost about £5 for a pack of four.

Like everything I'll want to run a full test of these batteries at room temperature to see how long they can power the electronics once GAGA-1 is constructed. It's probably worth also doing a freeze test on them as well.

12. Casing

For the core structure I'm thinking of using a plastic box with a screwed on lid. All the electronics can be screwed down inside this box and the box will only have holes for the camera lenses and antenna mountings. Plastic should give a good combination of strength and lightness, but I want to do a freeze test on the box to make sure that it won't crack.

The outside of the case will then be clad in extruded polystyrene panels. These will be glued on. Extruded polystyrene should give a nice combination of being easy to work with, light, insulating and smooth. The smoothness is important because I want to spray paint the entire thing bright orange to ease recovery. In addition, that type of polystyrene is easy to obtain because it's used in model building and house insulation.

Another possibility is to use a ready made polystyrene box such as these. That would save on construction costs and they are very, very light.

Then I'll stencil on the outside the following:


There's a ton of other details to worry about (the cord used for the balloon, temperature sensors, ...) and I'll update with pictures, code and circuits as we go.

PS Don't expect rapid updates to this; it's strictly a spare-time activity, worked on when I can find the spare time.

Wednesday, July 07, 2010

The CCE Review report

Today, the Independent Climate Change Email Review report was issued. As a bit of background I'll point you to some of my old blog posts on this:

1. Met Office confirms that the station errors in CRUTEM3 are incorrectly calculated

2. Something a bit confusing from UEA/CRU

3. Something odd in CRUTEM3 station errors

4. New version of CRUTEM3 and HADCRUT3

5. Well, I was right about one thing

6. Bugs in the software flash the message "Something's out there"

7. Whoops. There's a third bug in that code

And here's what one of the people involved in the report wrote:

First, climategate reveals the urgent demand by a new breed of citizen-scientist for access to the raw data scientists use to do their work. Simply accepting a scientist's assurance that data are accurate and reliable is no longer enough. Scientists will have to make their data available for independent audit.

Second, climategate shows that science must change its idea of accountability. Traditionally, scientists have been accountable only to one another. But with the advent of new critical public voices in science – the birth of the blogosphere, for example – scientists must redefine who is a legitimate critic and who isn't. It is easy to brand the blogosphere as universally damaging and defamatory. But climategate has shown that while some critics do enjoy abusing scientists, others ask tough and illuminating questions, exposing important errors and elisions. These critics have an important part to play in shaping scientific debate and dialogue.

The report itself says:

37. Making source code publicly available. We believe that, at the point of publication, enough information should be available to reconstruct the process of analysis. This may be a full description of algorithms and/or software programs where appropriate. We note the action of NASA's Goddard Institute for Space Science in making the source code used to generate the GISTEMP gridded dataset publically available. We also note the recommendation of the US National Academy of Sciences in its report "Ensuring the Integrity, Accessibility and Stewardship of Research Data in the Digital Age" that: "…the default assumption should be that research data, methods (including the techniques, procedures and tools that have been used to collect, generate or analyze data, such as models, computer code and input data) and other information integral to a publically reported result will be publically accessible when results are reported." We commend this approach to CRU.


Interestingly, Sir Muir Russell's review did what I did: they wrote code (in their case in C++) to reproduce CRUTEM3 (see Appendix 7). Unfortunately, they didn't go all the way to check the error ranges and find the bug that Ilya Goz and I found.

PS I asked the review if they would give me a copy of the C++ code (in the spirit of openness :-)

Saturday, July 03, 2010

Bayes, Bletchley, JN-25 and a 'modern' optimization

In the current edition (June 2010) of Significance (the magazine of the Royal Statistical Society) there's an article titled Edward Simpson: Bayes at Bletchley Park. Simpson is the man behind Simpson's Paradox (which, if you have not heard of it, you should immediately go and read about because it's... weird). He also worked at Bletchley Park during the Second World War.

I found the article fascinating for three reasons: it's about Bletchley Park, it's about Bayes Theorem (which I've spent quite a lot of time dealing with) and it highlights an optimization that was done to enable WRENs to work faster, yet is still in use today.

The article talks about life at Bletchley Park, and the breaking of two codes: Enigma and the Japanese JN-25. So much has been written about Enigma that I have little more to say about it, but JN-25 is a different matter. Its breaking involved a bit of Bayes and ended up helping the Allied victory at the Battle of Midway.


This Japanese Naval code works as follows. The Japanese word to be enciphered is looked up in a book where a corresponding 5 digit number is found. This number is the input to the encipherment. For example, the Japanese word for weather might be encoded as 30525. To help prevent transcription errors the Japanese code numbers were all divisible by three. Thus it would be easy to check a message before enciphering it (especially since it's easy to check if a number is divisible by 3).

So the message would consist of blocks of five-digit numbers. Next a book containing random five digit numbers was used to select a sequence of numbers that would be used to encipher the message. For each five digit group in the message, a five digit number from the book (in sequence) would be chosen.

Simpson's article has the following example:

Corresponding code numbers: 70863 34131 30525
Secret sequence from code book: 58304 68035 91107

The original five digit code representing a word would be added to the secret five digit number digit by digit with no carry (a form of modulo arithmetic). For example, "weather" might be 30525 and the chosen secret number 91107 and so the enciphered word would be 21622. The recipient would reverse the addition to get at the message.
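This 'false addition' is trivial to express in code, and doing so reproduces the worked example above (30525 plus the secret number 91107 giving 21622):

```python
def false_add(code, key):
    """Digit-by-digit addition modulo 10 with no carry."""
    return "".join(str((int(a) + int(b)) % 10) for a, b in zip(code, key))

def false_subtract(cipher, key):
    """The inverse: digit-by-digit subtraction modulo 10 with no borrow."""
    return "".join(str((int(a) - int(b)) % 10) for a, b in zip(cipher, key))

print(false_add("30525", "91107"))       # '21622', as in the example
print(false_subtract("21622", "91107"))  # '30525': what the recipient does
```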

That's a very secure system if the secret numbers are truly random and if they are not reused. In fact the Japanese secret numbers were reused: the Japanese sent multiple messages using the same sequences of random numbers.

In addition, by cracking messages, Bletchley Park was able to build up a table of code numbers and their frequency of occurrence. That frequency of occurrence made an application of Bayes theorem possible.

Here's how it worked. A code breaker would be given a set of messages all known to have been enciphered with the same stretch of secret numbers. They might have between 2 and 20 messages. Suppose that in one message there's "good" (34131) enciphered as 92166 using the secret number 68035 and in another there's "stop" (50418) enciphered with the same secret number as 18443. The code breaker has in front of them 18443 and 92166 and by lining up messages on the table has columns of numbers enciphered with the same, unknown secret numbers.

The difference between 18443 and 92166 (using the modulo arithmetic) will be 26387. This is in fact the difference between the original code words 34131 and 50418: the secret number 68035 has been eliminated.

What the folks at Bletchley Park did was build themselves a big table of differences between the code words. The code breaker would look up 26387 in the book to see if this difference was known (indicating that the two underlying words had been identified and they would have been present in the book as 34131 and 50418). If 26387 appeared in the book then the actual code numbers (34131 and 50418) could be subtracted from the enciphered numbers 18443 and 92166 to reveal the secret number 68035 that was used to encipher the message.
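Using the same no-borrow subtraction as in the encipherment, the numbers from Simpson's example check out: differencing the two enciphered groups cancels the secret number, and once the difference is found in the book the secret number itself falls out.

```python
def false_subtract(a, b):
    """Digit-by-digit subtraction modulo 10 with no borrow."""
    return "".join(str((int(x) - int(y)) % 10) for x, y in zip(a, b))

# Two groups enciphered with the same secret number 68035:
#   "stop" 50418 -> 18443, "good" 34131 -> 92166.
# The difference of the cipher groups equals the difference of the
# underlying code words: the secret number has been eliminated.
print(false_subtract("18443", "92166"))  # '26387'
print(false_subtract("50418", "34131"))  # '26387' again

# Looking 26387 up in the book identifies the code words, so subtracting
# a known code word from its cipher group reveals the secret number:
print(false_subtract("18443", "50418"))  # '68035'
```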

Since they had many messages to work with they would actually check all pairs of numbers. For each pair, a difference is calculated and the resulting number looked up in the book to see if it exists.

Now for Bayes

On the table in front of the code breaker are a set of messages. They would take the number 68035 and subtract it from each enciphered number in the same column to reveal the underlying code number. If any of the numbers was not divisible by three then the decipherment had failed, but if not the next stage was undertaken.

Since the code numbers represented actual words they had a pattern of frequency of occurrence. So, the code breaker would take the list of numbers from the column they had deciphered and look up their frequency in a book. Suppose the code breaker has numbers a, b, c, ... and finds that a and c are in the book but b is not (perhaps because it's wrong or perhaps because the five digit code has never been seen before). The book gives P(a) and P(c) (the probability of those words occurring based on their frequency); the probability that those five digit groups occurred randomly (actually 3/100000) was called P. Thus the Bayesian evidence given by a occurring is P(a)/P (and similarly for c).

For the word b that hasn't been seen before (or may be wrong) a frequency of occurrence of 1 was set which meant that P(b) was tiny and hence b was essentially ignored.

Then Bayes Theorem was used to multiply together P(a)/P and P(c)/P to get a probability that the number 68035 was the real underlying secret word used to encipher the numbers. If that probability was above a certain threshold the number 68035 was taken as genuine and the decipherment could continue.


There are three things that struck me in Simpson's explanation. Firstly, the prior probability is ignored in the Bayesian calculation. This sort of ignoring the priors happens in Naive Bayesian text classifiers where the prior probability of a 'document' or 'text' is irrelevant: it just acts as a scale factor that can be ignored. I did exactly that in the implementation of POPFile.

Secondly, the handling of the 'unknown' code numbers (b in the example above) is very similar to the way in which POPFile handles unknown words. A very small (close to 0, but non-zero) probability is chosen indicating that this word probability doesn't exist, but just might do.

The other optimization that occurs in POPFile is that rather than multiplying Bayes factors together, the logs of Bayes factors are added together. This is fast and prevents arithmetic underflow when dealing with tiny probabilities. Recall that log AB = log A + log B and log A/B = log A - log B. This is exactly the same trick that makes slide rules work (sliding the rule is 'adding' on a logarithmic scale).

This technique is common (I certainly didn't invent it) and was used at Bletchley Park. Instead of having WRENs do laborious multiplications and divisions of Bayes factors a book of scores was created consisting of the log of the Bayes factor for each code number. Checking whether the code numbers passed the Bayes test to reveal a useable secret number was thus a matter of looking up and adding these factors and seeing if the sum passed a threshold value.
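Here's a sketch of that scoring scheme. The word frequencies and the acceptance threshold are made-up illustrative numbers (Simpson's article doesn't give the actual values); the structure is the point: unseen groups get frequency 1, and adding log scores is the same as multiplying Bayes factors.

```python
import math

P_RANDOM = 3 / 100_000  # probability a random five digit group is a code number

# Hypothetical frequencies (per 100,000 groups) of deciphered code numbers.
frequency = {"34131": 120, "50418": 45}

def log_score(group):
    """Log Bayes factor for one deciphered group. Unseen groups get
    frequency 1, so they score slightly against but are mostly ignored."""
    p = frequency.get(group, 1) / 100_000
    return math.log(p / P_RANDOM)

groups = ["34131", "50418", "12345"]       # '12345' has never been seen
total = sum(log_score(g) for g in groups)  # adding logs = multiplying factors
THRESHOLD = math.log(100)                  # illustrative acceptance threshold
print("accept secret number" if total > THRESHOLD else "reject")
```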

PS If you have access to significance the rest of the article is well worth reading.

Friday, July 02, 2010

Update to UseTheSource

Thanks to everyone who visited UseTheSource yesterday on its first day of operation. I've received lots of feedback on things you'd like to see and I've implemented almost everything. Thank you to everyone who suggested improvements or submitted beautiful code.

1. An RSS feed (and a suitable META tag to make it easy to find).

2. A favicon. Thanks to jimjim for making it.

3. Search. I've integrated Google Custom Search. It's easy to find, it's at the top of the page.

4. The ability to add code. When submitting you can choose between submitting a URL, or some text with associated code. The code will have formatting preserved and is syntax highlighted automatically. Here's an example of C and here's one of elisp.

5. There's a new 'hn' option on the user page. If your Hacker News name is the same as your UseTheSource name then set this to 'yes' and UseTheSource will make a link between the two.

Left to implement is filtering by language. This is coming. Stay tuned.

Thursday, July 01, 2010

UseTheSource: a sort of "Hacker News" that's about code

Yesterday on Hacker News someone called a piece of my code beautiful. It occurred to me that it would be nice to have a place where people talk about beautiful or interesting code or algorithms.

So, I've resurrected my old domain name and created UseTheSource: a place where you can read about (and submit) great source.

I've seeded it with some recent, interesting pieces (and a couple of pieces of my own code). Let's try to make it a place where people come to read about interesting software (and not the business side of software).

There's a short introduction here.