Tuesday, June 30, 2009

The 1944 US Presidential Election was fraudulent

OK, it wasn't really, but I thought I'd run the Scacco/Beber analysis on that election and see what it comes up with. Guess what.

If you look at the non-adjacent, non-repeated digits in the last two places in the vote counts by state for Roosevelt and Dewey you discover that 59.38% of the digit pairs are non-adjacent, non-repeated. If the numbers were truly random you'd expect 70%. That's way worse than the 62.07% in the Iranian election.

If you then do the old Z-Test you get a Z value of -2.49 with a p-value of 0.013. That's well below the 0.05 critical value so you can reject the null hypothesis. The final digits are not random.

Is this fraud?

Is there any suggestion that the state-level numbers in the 1944 US election were invented by people?

If not, how can anyone claim that this test indicates fraud in the Iranian election?

Now run the other bit of their test looking at the frequencies of the last digit. You get 'too many' 7s (expected 10%, got 16%) and 'too few' 1s (expected 10%, got 5%).

I'm telling you, man, what's the chance of that happening, and the non-adjacent, non-repeating digits thing? (It's about 0.17% according to simulation) I mean, come on, that's gotta be fraud.

Oh, wait, it's not.

The Iranian Election Detector

OK, I thought I was done criticizing the Washington Post Op-Ed about how statistics leave 'little room for reasonable doubt' that the Iranian election was fraudulent. But then Hannah Devlin at The Times did her own analysis and it got me thinking about the errors in that article again.

Firstly, my previous post talks about the right way to determine whether the digits are random or not. I'm not going to go over that again, but I am going to go back over some of the actual figures that are presented in the article.

So begin with this quote:

But that's not all. Psychologists have also found that humans have trouble generating non-adjacent digits (such as 64 or 17, as opposed to 23) as frequently as one would expect in a sequence of random numbers. To check for deviations of this type, we examined the pairs of last and second-to-last digits in Iran's vote counts. On average, if the results had not been manipulated, 70 percent of these pairs should consist of distinct, non-adjacent digits.

Not so in the data from Iran: Only 62 percent of the pairs contain non-adjacent digits. This may not sound so different from 70 percent, but the probability that a fair election would produce a difference this large is less than 4.2 percent.

And there's a footnote:

This is a corollary of the fact that last digits should occur with equal frequency. For an arbitrary second-to-last numeral, there are seven out of ten equally likely last digits that will produce a non-adjacent pair. Note that we treat both 09 and 10 as adjacent.

Firstly, I believe they mean to say that they treat 09 and 90 as adjacent (not 09 and 10). That means that for any digit there are two possible adjacent digits out of ten; in other words 20% of digit pairs are adjacent, so 80% of digit pairs are non-adjacent.

In their article they say 70% 'distinct, non-adjacent'. OK, so their definition of non-adjacent means that you need to exclude repeats as well (so 23, 32 and 33 are all to be excluded).
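A quick enumeration makes the 70% figure concrete. This is a minimal Python sketch, not the authors' code: the function name is my own, and it assumes the wrap-around reading of the footnote in which 0 and 9 count as adjacent.

```python
def is_distinct_non_adjacent(second_to_last: int, last: int) -> bool:
    """True if the two digits are neither equal nor neighbours (0 and 9 wrap)."""
    diff = (second_to_last - last) % 10
    # diff == 0 is a repeat; diff == 1 or diff == 9 means the digits are adjacent.
    return diff not in (0, 1, 9)


# Count the qualifying pairs among all 100 equally likely two-digit endings.
qualifying = sum(
    is_distinct_non_adjacent(a, b) for a in range(10) for b in range(10)
)
print(f"{qualifying} of 100 equally likely pairs qualify")
```

Of the 100 pairs, 10 are repeats and 20 are adjacent, which is where the 70 comes from.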

They then present the argument that a figure of 62% or less will only happen in 4.2% of fair elections. Nowhere do they explain how they derived this figure, so I decided to run a simulation. (Hannah Devlin argues that this number is incorrect in her article, worth a read)

I ran a simulation of 1,000,000 elections that generate 116 counts of votes and I looked at the adjacent pairs of numbers in the vote counts and then I calculated the percentage of fair elections that would result in the same 62% or less as seen in the Iranian election. The figure is 2.66%. 2.66% of fair elections would produce the result (or 'worse') seen in Iran.

The difference, 4.2% vs 2.66%, comes about because the figure that they must have used is not 62%, but 62.07%. That is the actual number, to two decimal places, that comes from analyzing the digit distribution in the Iranian election results.
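If you'd rather not simulate, the same comparison can be sketched with an exact binomial tail. This is my own Python illustration rather than the simulation code mentioned here, and it assumes each of the 116 counts independently produces a qualifying pair with probability 0.7.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


N_COUNTS = 116  # vote counts in the data set
P_PAIR = 0.7    # chance a random pair is distinct and non-adjacent

# "62% or less" means at most 71 of 116 pairs; 62.07% is exactly 72 of 116.
p_at_most_71 = binomial_cdf(71, N_COUNTS, P_PAIR)
p_at_most_72 = binomial_cdf(72, N_COUNTS, P_PAIR)
print(f"P(<= 71 of 116) = {p_at_most_71:.4f}")
print(f"P(<= 72 of 116) = {p_at_most_72:.4f}")
```

The single extra pair between the two cut-offs is what separates the two published percentages.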

(Email me if you want my source code)

So, what does that tell you? That in almost 3 in 100 fair elections we would have seen the result in Iran. Or if you use their numbers 4 in 100. Either way that's pretty darn often. In the 20th century there were 26 general elections in the UK. Given their 4/100 number is 1/25 we shouldn't be at all surprised if one of those general elections looked fraudulent!
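To put a number on "shouldn't be at all surprised", here is a small Python sketch of the arithmetic. It assumes the 26 elections are independent and that each has the authors' roughly 1-in-25 chance of tripping the test; the variable names are mine.

```python
FALSE_POSITIVE_RATE = 1 / 25  # the authors' roughly 4-in-100 figure
ELECTIONS = 26                # UK general elections in the 20th century

# Chance that at least one fair election trips the test,
# assuming the elections are independent of each other.
p_at_least_one = 1 - (1 - FALSE_POSITIVE_RATE) ** ELECTIONS
print(f"P(at least one false alarm) = {p_at_least_one:.3f}")
```

Under those assumptions a false alarm somewhere in the century is more likely than not.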

Now, we expect that the percentage of non-adjacent digits is normally distributed. And, in fact, my little simulation shows a nice little normal distribution centered on 70 with a standard deviation of 4.27.

So, we've got normally distributed data, a mean and a standard deviation and a sample (62.07%). Hey, time for a Z-test!

For this situation the Z value is -1.86 which yields a p-value of 0.063 for a two-tailed test (I'm doing two-tailed here because what I'm interested in is the deviation away from the mean, not the specific direction it went in). That's above the 0.05 value typically used for statistical significance and so we can't from this sample determine that there's statistical significance in the 62.07% figure.
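For anyone who wants to check the arithmetic, the Z-test can be sketched in a few lines of Python. The function name is my own; the mean and standard deviation are the ones from the simulation described above.

```python
from math import erfc, sqrt


def z_test_two_tailed(sample: float, mean: float, sd: float) -> tuple[float, float]:
    """Return the Z value and the two-tailed p-value under a normal model."""
    z = (sample - mean) / sd
    # For a standard normal, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


z, p = z_test_two_tailed(sample=62.07, mean=70.0, sd=4.27)
print(f"Z = {z:.2f}, two-tailed p = {p:.3f}")
```

Swapping in a different sample percentage is all it takes to run the same check on another election.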

So, I'd say that based on the figures given I can't find statistical significance. So I don't learn anything from that about the Iranian election.

Given that the Z-test on their 'non-adjacent, non-repeated' digits test doesn't find statistical significance, and my previous piece showed that the chi-squared test on the other claim in their paper didn't find statistical significance either (that was on the randomness of the last two digits), you might be scratching your head wondering how the authors made the claim that this was definitely fraud (their words: 'But taken together, they leave very little room for reasonable doubt.')

Well, what they do is take the probability of seeing the 62% or less number in a fair election (4.2%) and multiply it by the probability of seeing the specific variance they see in the digits 7 and 5 in a fair election (4%) to come up with a 0.14% likelihood of this happening in a fair election:

More specifically, the probability is .0014 that a fair election (with 116 vote counts) has the characteristics that (a) 62% or fewer of last and second-to-last digits are non-adjacent, and (b) has at least one numeral occurring in 17% or more of last digits and another numeral occurring in 4% or fewer of last digits.

That's a very specific test. In fact, it's so specific that I'm going to name it the "Iranian Election Detector". It's a test that's been crafted from the data in the Iranian election results; it's not the test that they started with (which is all about randomness of digits, and adjacency).

So, let's accept their .0014 figure and delve into it... that's 1.4 in 1,000 elections. That's roughly 1 in 714. So, they are saying that their test would give a false positive in 1 in 714 elections.

How is that 'leaving little room for reasonable doubt'?

Friday, June 26, 2009

Running the numbers on the BBC executives' expenses

So, another lovely data set appeared, the expenses of senior BBC executives. And the papers went a little wild highlighting the spending that they don't like.

Me, I just punched the numbers into a little program and took a look at how well they fit Benford's Law. I wanted to see if there were any interesting anomalies to look at. And there are. No smoking guns, though. Just some fun on a Number 22 bus with some numbers.

First, here are the chi-squared values for the fit of first digits of the expenses for 2007/2008 by executive. The critical value (for p = 0.05) is 15.51, so most of the expenses do not fit Benford's Law. The fun is in finding out why.

Ashley Highfield,3.06874517880956
John Smith,4.61752636949634
Mark Byford,15.5817014457229
Jana Bennett,15.6178982350545
Zarin Patel,17.5034731417114
Jennifer Abramsky,20.8803339214804
Mark Thompson,22.4588511455346
Caroline Thomson,37.666988157616
Timothy Davie,143.433388695613
Stephen Kelly,178.662639451409
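For the curious, a Benford's Law chi-squared figure like the ones above can be computed along these lines. This is a minimal Python sketch, not the program I used: the function names are mine and the claim amounts are invented purely for illustration.

```python
from collections import Counter
from math import log10

CRITICAL_8_DF = 15.51  # chi-squared critical value, 8 degrees of freedom, p = 0.05


def first_digit(amount: float) -> int:
    """Leading non-zero digit of a positive amount, via scientific notation."""
    return int(f"{abs(amount):e}"[0])


def benford_chi_squared(amounts: list[float]) -> float:
    """Chi-squared statistic of first digits against Benford's expected counts."""
    observed = Counter(first_digit(a) for a in amounts if a > 0)
    total = sum(observed.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = total * log10(1 + 1 / d)  # Benford: P(d) = log10(1 + 1/d)
        chi2 += (observed[d] - expected) ** 2 / expected
    return chi2


# Hypothetical claims, heavy on 8.00 tolls, purely for illustration.
claims = [8.00] * 30 + [12.50, 23.10, 45.00, 110.00, 3.20, 17.80, 64.00, 9.99]
score = benford_chi_squared(claims)
print(f"chi-squared = {score:.2f}; fits at p = 0.05: {score <= CRITICAL_8_DF}")
```

A pile of identical £8.00 claims is enough on its own to push the statistic far past the critical value.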

The best fit is Ashley Highfield's lovely curve:

We'll come back to Mr Highfield later, but let's go to the other end of the spectrum and look at the extremes that don't match the expected. The 'worst' offender is Stephen Kelly (he's not with the BBC anymore). Here's his curve.

Whoa. What happened with all those 8s? Delve into the data and you find lots of £8.00 claims for "Road/Bridge Tolls". My guess is that Mr Kelly passed through the London Congestion Charging Zone in his own car. That's enough to skew the data. And if you match up his £8.00 charges and his mileage claims it all makes sense.

Now to Timothy Davie and here's his curve:

So, he's like Stephen Kelly and sure enough there are lots of £8.00 charges for the same "Road/Bridge Tolls".

Next on the list comes a different pattern created by Caroline Thomson. An excess of 1s:

She's got a ton of taxi trips in the £10 to £19 range. According to Transport for London you'd see those fares on a weekday when traveling around 4 miles in central London. Given the location of BBC Television Centre it's pretty easy to imagine the need for these trips. Also, she doesn't claim any mileage or congestion charge so she's not using her own car.

Next up is the Big Kahuna Mark Thompson. His curve shows an excess of numbers 6, 7 and 8.

Why is that? Well, if you look at his expense claims line by line (and if you do you're a total nerd) you'll see that Mr Thompson takes people out to lunch a lot and spends a lot of money on lunches under £100. You can imagine this being totally legitimate. He probably has to do that for his job, there could be a BBC guideline about how much to spend on lunch, or Mr Thompson could simply have a moral compass that says he shouldn't go totally wild on lunch costs.

He doesn't have lots of taxis or tolls, but then again there's a note in his expense report where he did take a taxi that says "Driver not available" so I'm guessing he has a chauffeur.

And so it goes on. You can carry on down the list and look for little anomalies, but there's nothing glaring.

So, how come Ashley Highfield has such a perfect curve? Well, he doesn't take a lot of cabs (so no excess of 1s), drives his own car (lots of little mileage claims) and doesn't seem to claim the congestion charge (no excess of 8s). Did he forget to claim the congestion charge, or does he drive an electric car?

He was, after all, the BBC's Director of the Future Media and Technology.

Thursday, June 25, 2009

The Scacco/Beber analysis of the Iranian election is bogus

OK, I wasn't going to write another blog entry about the 2009 Iranian election, but the article in the Washington Post that supposedly gives statistical evidence for vote fraud just won't die in the blogosphere and just got a boost from a tweet by Tim O'Reilly.

The trouble is the analysis is bogus.

The authors propose a simple hypothesis: the last and second-to-last digits of vote counts should be random. In statistical terms this is often called uniformly distributed, which just means that they are each equally likely. So you'd expect to see 10% 0s, 10% 1s, 10% 2s, and so on.

Of course, you only expect to see that if you had an infinite number of vote counts because the point about random processes is that they only 'even out' to the expected probabilities in the long run. So if you've got a short run of numbers you have to be careful because they won't actually be exactly uniform.

To confirm that try tossing a coin six times. Did it come up with exactly 3 heads and 3 tails? Probably not, but that doesn't mean it's unfair.
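The "probably not" can be checked directly. Here's a tiny Python sketch (my own illustration) that counts the ways six fair tosses can land on exactly three heads.

```python
from math import comb

tosses = 6
# Number of sequences with exactly 3 heads, divided by all 2**6 sequences.
p_exactly_half = comb(tosses, tosses // 2) / 2**tosses
print(f"P(exactly 3 heads in 6 fair tosses) = {p_exactly_half}")
```

So a perfectly fair coin gives the 'expected' even split less than a third of the time.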

Now, given some run of numbers (vote counts for example), the right thing to do is ask the statistical question "Could these numbers have occurred from a random process?" If they couldn't then you can go looking for some other reason (e.g. fraud).

The question "Could these numbers have occurred from a random process?" is given the ugly name the 'null hypothesis' by stats-heads. That just means the thing you are testing.

More concretely, the Scacco/Beber null hypothesis is "the last and second-to-last digits in the vote counts are random". What you want to know is with what confidence can you reject this, and for Scacco/Beber rejecting means fraud.

Now, what you don't do is go count the last and second-to-last digits, look for some that have counts that deviate from what you expect (the exactly 10% figure) and then try to work out how often that happens. That's like tossing a coin a few times, noticing that heads has come up more than 50% of the time and then starting to think the coin is biased.

Unfortunately, that's essentially what Scacco/Beber did. They picked on two numbers that lay outside their expected value and went off to calculate how frequently that would occur. That's cherrypicking the data.

What you do do is apply a chi-square test to figure out whether the numbers you are seeing could have been generated by a random process. And you use that test because it gives you the probability with which you can reject your null hypothesis.

To prevent you, dear reader, from having to run the test I've done it for you. I took their data and wrote a little program to do the calculation against the last and second-to-last digits. Here's the program:

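use strict;
use warnings;

use Text::CSV;
my $csv = Text::CSV->new();

my %la;    # occurrences of each digit in the last place
my %sl;    # occurrences of each digit in the second-to-last place

foreach my $i (0..9) {
    $la{$i} = 0;
    $sl{$i} = 0;
}

my $count = 0;

open my $in, '<', 'i.csv' or die "Cannot open i.csv: $!";
while (my $line = <$in>) {
    chomp $line;
    next unless $csv->parse($line);
    my @cols = $csv->fields();
    # Columns 1 to 4 hold the vote counts (column 0 is presumably a label)
    for my $i (@cols[1..4]) {
        next unless defined $i && $i =~ /^\d{2,}$/;    # skip headers and blanks
        my @d = reverse split( //, $i );
        # The counting lines were missing from this listing; they are
        # reconstructed from how %la, %sl and $count are used below.
        $la{$d[0]}++;
        $sl{$d[1]}++;
        $count++;
    }
}
close $in;

print "Count: $count\n";
die "No vote counts read\n" unless $count;

my $e = $count/10;    # expected count per digit if the digits are uniform

my $slchi = 0;
my $lachi = 0;

foreach my $i (0..9) {
    print "$i,$e,$sl{$i},$la{$i}\n";

    $slchi += ( $sl{$i} - $e ) * ( $sl{$i} - $e ) / $e;
    $lachi += ( $la{$i} - $e ) * ( $la{$i} - $e ) / $e;
}

print "slchi: $slchi\n";
print "lachi: $lachi\n";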

Here's a little CSV table that you can steal to do your own analysis:

Digit,Expected Count,Second-to-last Count,Last Count

And true enough I get the same figures as Scacco/Beber. The number 7 does occur 17% of the time in the last digit, and the number 5 only occurs 4% of the time. But, I don't care. What I want to know is, is the null hypothesis wrong. Could these results have occurred from a random process? And with what likelihood.

So here's where I avoid staring at the numbers (which can get to be borderline numerology) and do the chi-square test.

For the last digit the magic chi-square number is (drum roll, please): 15.55 and for the second-to-last digit it's 9.33. Then I go to my chi-square table and I look at the row for 9 degrees of freedom (that corresponds to the 10 possible digits; if you want to know why it's 9 and not 10 go read up on the subject) and I see that the critical value is 16.92.

If either of my numbers exceeded 16.92 then I'd have high confidence (greater than 95%) that the digit counts were not random. But neither do. I cannot with confidence reject the null hypothesis, I cannot with confidence say that these numbers are not random, and I cannot with confidence, therefore, conclude that the vote counts are fraudulent.
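The decision rule is simple enough to sketch in Python. This is an illustration only: the function names are mine, and the two statistics fed in at the end are just the values reported above.

```python
CRITICAL_9_DF = 16.92  # chi-squared critical value, 9 degrees of freedom, p = 0.05


def chi_squared_uniform(digit_counts: list[int]) -> float:
    """Chi-squared statistic for digit counts against a uniform expectation."""
    expected = sum(digit_counts) / len(digit_counts)
    return sum((observed - expected) ** 2 / expected for observed in digit_counts)


def can_reject_uniform(statistic: float) -> bool:
    """True only if the statistic exceeds the 95% critical value."""
    return statistic > CRITICAL_9_DF


for statistic in (15.55, 9.33):  # last and second-to-last digit values above
    verdict = "reject" if can_reject_uniform(statistic) else "cannot reject"
    print(f"{statistic}: {verdict} the null hypothesis")
```

Feed `chi_squared_uniform` your own ten digit counts to get the statistic for a different election.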

What this means is that there is no 'statistically significant' difference between the Iranian results and randomness. So, what we learn is that this statistical analysis tells us nothing.

It doesn't mean that the numbers weren't fiddled, it just means that we haven't found evidence of fiddling.

PS In the notes added to their annotated version of the article Scacco/Beber mention that they did the chi-square test and got a p-value of 0.077. This is above the 'statistical significance' cut off of 0.05 and so their results are (as I find) not statistically significant.

To put 0.077 in context it means that there's a 7.7% chance that the digits are random. Sounds small but 7.7 is approximately 8 in 100 or 4 in 50 or 2 in 25 or ... 1 in 12.5. i.e. in 1 in every 12.5 fair elections we shouldn't be surprised to see the sort of figures we saw in Iran. That's pretty often! That's why chi-square tells us not to find non-randomness in the Iranian results.

30 June 2009 Update: I've removed that paragraph because that interpretation of the p-value is arguably inaccurate and if you are a statistician you'd probably shout at me about it. Doesn't change the fact that the data says the Iranian result is not statistically significant; it just says that my attempt to do a 'layman's version' is faulty.

To come up with a better layman's version I ran a little simulation to find out how often you'd expect to see one digit occurring more than 17% of the time with another occurring less than 4% of the time (as in the Iranian election). The answer is about 1.48% of the time, or in about 1 in 67 fair elections.
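A simulation of this kind might look like the following Python sketch. It is not my original code: the function name, trial count and seed are arbitrary, and it reads the 17% and 4% thresholds literally as fractions of 116 counts, so don't expect it to reproduce the figure above exactly.

```python
import random


def extreme_digit_rate(trials: int = 20_000, counts: int = 116,
                       high: float = 0.17, low: float = 0.04,
                       seed: int = 2009) -> float:
    """Fraction of simulated fair elections with one last digit at or above
    `high` of the counts and another at or below `low` of the counts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tally = [0] * 10
        # In a fair election each last digit is equally likely.
        for digit in rng.choices(range(10), k=counts):
            tally[digit] += 1
        if max(tally) >= high * counts and min(tally) <= low * counts:
            hits += 1
    return hits / trials


rate = extreme_digit_rate()
print(f"{rate:.2%} of simulated fair elections show both extremes")
```

Raising `trials` tightens the estimate at the cost of a longer run.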

Britannica.com makes me want to weep

I got a marketing mail from Britannica.com trying to entice me back after I canceled my subscription. So, I figured I'd just go take a quick look at a random Britannica entry and remind myself of what I was missing. Nightmare.

On the Britannica.com home page they were mentioning that their article about the US Voyager program was featured and I could see it for free. So I clicked.

This featured article contains 503 words that give the briefest of introductions to Voyager. The related articles are all about the planets that Voyager passed, and there's a connection to a general article about space exploration. There's absolutely no drill down to explore Voyager in any depth.

Of course, I whizzed over to Wikipedia and looked up the same subject. The main article contains 2,009 words and links to in-depth articles about Voyager 1 and Voyager 2. And there are links to interesting articles about their voyages, their power systems, the Voyager Golden Record and more.

And Wikipedia links you straight to the definitive source for Voyager information: NASA's Voyager Program page. Britannica doesn't; it chooses instead to link to a small collection of images of the Voyager craft from NASA's web site.

So, basically Britannica.com's article is close to useless because it's a dead-end and a short dead-end at that. In contrast, Wikipedia's article is rich, links to even more information and lets me get to source material.

And if that's not enough Britannica.com's page is infested with distracting ads. The worst of these are the weird keyword-linked ads buried right inside the article itself.

It looks like you might be able to click on, say, solar system in the article to drill down. Far from it! Hover over solar system and you get the following irrelevant, useless, pop-up ad.

Pure genius, Britannica.com. Pure, pure genius.

Now, Britannica.com's article does contain some drill down, but some of it is useless. For example, the Voyagers each contain a phonograph record with a recording of sounds from Earth (language, music, etc.). On the Britannica.com page the words phonograph record are a link. Click through and they will tell you what a phonograph record is, not about the ones on board the Voyagers. Thanks, I'm old enough to know what a phonograph record is.

So, Britannica.com, now you know why I donate money to Wikipedia, and don't buy your service.

Michael Faraday criticizes 'security theatre' from beyond the grave

I was reading David Knight's book Humphry Davy and at one point he describes the arrival of Davy and Faraday in France in 1813:

On arrival, Faraday reported, they were searched, an unusual experience for a true-born Englishman: 'he then felt in my pockets, my breast, my clothes, and lastly, desired to look into my shoes; after which I was permitted to pass', and could hardly help 'laughing at the ridiculous nature of their precautions'.

Lucky he doesn't have to fly anywhere.

Wednesday, June 24, 2009

The Turing Test and Prejudice

Yesterday, I blogged about how I believed that Alan Turing deserves an apology for the way he was treated. A few people asked how we should apologize. Here's what I would say.


I am sorry for the way you were treated. I am sorry that Britain treated a man of your genius, a hero of the Second World War, so despicably.

You laid the foundations of computer science, you helped break the Nazi Enigma code, and you were cut down in your prime because of prejudice. In your death we lost a great man who no doubt had much more to offer the world.

In 1950 you published a paper in which you proposed a way to determine if a machine was intelligent or not. This has become known as the Turing Test, and it has wider implications than just machine intelligence.

Your test involved asking a person to distinguish between a human intelligence and a machine intelligence by removing prejudice. You limit the judge to communicating with the human and machine via an intermediary (you proposed a teleprinter) so that the judge is unable to see or hear who they are communicating with. The judge is limited to judging intelligence alone. If the judge cannot tell the difference then the machine is deemed intelligent.

But replace the machine in your test with another human with some supposedly undesirable characteristic. Your test can pit a straight man against a gay man, a white man against a black man, a man against a woman. I imagine if you and I were hidden behind the teleprinters, that you would be determined to be more intelligent.

Without the prejudice of knowing your sexuality, skin colour or sex, only your true values come through in the Turing Test.

At the end of that paper you write "We can only see a short distance ahead, but we can see plenty there that needs to be done." You took your own life four years later, after being prosecuted for homosexuality. I cannot fathom how much we lost when we lost you.

I am sorry that these things happened to you. But you may ask me, "What good is being sorry?"

An apology is really an atonement for the past, wrapped around a promise for the future. My promise (and I know there are others who will agree with me) is that we won't let prejudice prevent us from applying our own Turing Test to the people we deal with.

You may also ask me whether I write this because of an agenda. Because I want to take your death and use it to promote gay-rights, to hold you up as an example of how prejudice against gays harms the world.

I would be lying to you if I didn't tell you that I am personally uncomfortable with the implications of the acceptance of homosexuality. I suffer great internal conflict about questions such as "Should gay couples be allowed to have or adopt children?". I cannot overcome these feelings because they are grounded in an upbringing and they are in conflict with rationality and my own experiences.

But I see clearly that my feelings conflict with a simple truth: if I allow irrational opinions to guide my actions I lose my way. In allowing irrationality around homosexuality to guide our collective actions we lost you.

What we all need is to apply the Turing Test daily. I know that despite my own misguided feelings, I apply your test to those I encounter, and that is my way of apologizing to you.


Tuesday, June 23, 2009

Alan Turing deserves an apology from the British Government

When I started writing The Geek Atlas there was one name that was getting in the book no matter what: Alan Turing.

Alan Turing matters on many levels because he was, in the words of the memorial in Manchester:

Father of computer science, mathematician, logician, wartime codebreaker, victim of prejudice

Turing's work has affected us all. He's best known for his involvement in Second World War code breaking (especially for helping to break Enigma) and if all he had done was that we would be grateful.

But Turing was also a critical pioneer of computer science. He defined a theoretical model of computers (at a time when 'computer' meant a person, often a woman, who computed numbers) that holds true today. He suggested how we might determine whether a computer was sentient (with the Turing Test).

Turing's death should remind us how prejudice ruins and degrades.

Alan Turing was gay. And he was prosecuted for 'indecent acts' and eventually took his own life aged 41. This man, younger than me, killed himself because at the time homosexuality was illegal and having been prosecuted he was chemically castrated in an attempt to 'cure' him. He had been stripped of his security clearance.

For years, his legacy was largely ignored outside the computer community. To quote Wikipedia:

In 1994 a stretch of the A6010 road (the Manchester city intermediate ring road) was named Alan Turing Way. A bridge carrying this road was widened, and carries the name 'Alan Turing Bridge'.

A frikkin' Ring Road!

It wasn't until 2001 that a statue was erected.

Today is Alan Turing's 97th birthday. Or at least it could have been if it were not for his prosecution and untimely death.

Isn't it time the British Government apologized for the way he was treated? We shouldn't let this anniversary of his birth go by without recognizing the great works this man did and the ignominious way in which he was treated.

I need your help: I need iPhone 3GS serial numbers

Do you own an iPhone 3GS? Would you be willing to give me its serial number?

I don't need to know who you are, what your phone number or IMEI number is, just the serial number.

"Why?", you ask.

Because I want to do an order statistics analysis of the serial number to attempt to calculate the number of iPhone 3GS models being sold.

If you click on Settings then General and then About you can find the serial number. Just email it to me. I'll publish the results later.

"How are you going to estimate the number of iPhone 3GS phones sold from serial numbers?", you cry.

Just like the Allies estimated the number of Nazi German tanks being produced.
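The classic form of that estimate is short enough to sketch in Python. This is an illustration under loud assumptions: it treats serial numbers as a simple counter running from 1 to N, sampled at random, which real iPhone serial numbers almost certainly are not without some decoding first. The function name and sample values are made up.

```python
def estimate_population(serials: list[int]) -> float:
    """Estimate the largest serial number issued from a random sample.

    Uses the standard 'German tank' estimator m + m/k - 1, where m is the
    largest serial seen and k is the sample size. Assumes serials run 1..N
    and were sampled uniformly without replacement.
    """
    k = len(serials)
    m = max(serials)
    return m + m / k - 1


sample = [19, 40, 42, 60]  # hypothetical serial numbers
print(f"Estimated number produced: {estimate_population(sample):.0f}")
```

The idea is that the largest serial seen undershoots the true total by roughly the average gap between sampled serials, so you add that gap back on.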

Monday, June 22, 2009

Last digits analysis of UK and Iranian elections

After all the Benford's Law posts I've made, I read an interesting article by Alexandra Scacco and Bernd Beber in the Washington Post about analyzing the last two digits of election results.

Their thesis is that the last two digits should be random (i.e. equally likely) in genuine results, and would be non-random in faked results because people have biases about which numbers they come up with when thinking of 'random' numbers.

Scacco and Beber have an annotated version of the article available and their actual paper.

I thought this was pretty interesting, so I decided to run my own analysis on the 2001 UK General Election and the 2009 Iranian Presidential Election. Scacco and Beber used a smaller set of Iranian data (the first set that I used) and so I ran my reanalysis against the larger set from per-county returns.

Start with the UK. The following chart shows the expected distribution of last and second-to-last digits (i.e. a uniform distribution: all digits are equally likely) and the actual counts. A quick application of the chi-squared test shows that there's a good fit: we can't reject the hypothesis that the UK digits are uniformly distributed (i.e. random).

Now switch to Iran. Once again I show the exact same analysis of the vote counts across all candidates across the country. And once again the chi-squared test shows that these are random.

These two show random distribution. The chi-squared test confirms that: the actual values are 11.125 for the last digit and 4.875 for the second-to-last digit. With 9 degrees of freedom the critical cut off point is 16.92, and neither of these exceeds that, so we cannot reject the hypothesis that the Iranian digits are uniformly distributed.

It's an intriguing idea that just by looking at the numbers it would be possible to detect election fraud, but it equally seems to me that you could cherry pick your data to come up with your viewpoint.

For example, in my analysis the UK election is not Benford's Law distributed but the Iranian one is. Which is fraudulent? Either, both, neither?

Also, my analysis shows that both the UK and the Iranian election have randomly distributed last digits. Is either fraudulent? Or neither?

I think what's needed is a large scale analysis of election results to see where and when different mathematical tests work. Otherwise the correct preconditions aren't established (e.g. in the UK election Benford's Law probably fails because of the redistribution of constituencies) and you can end up finding your favorite conclusion in the data.

A week without Google Search (Bing there done that)

I decided as an experiment to switch my default search engine in Firefox from Google to Microsoft's new Bing. That was last Monday; today I switched back to Google.

I made the switch using the Official Bing plug-in for Firefox and from then on I was binging instead of Googling.

The one-line summary: I found what I was looking for with Bing but I'm going back to Google.

Overall, Bing was an adequate substitute for Google, but didn't provide anything better. And in a few cases, it was worse.

I was able to find things faster with Google because the Google interface is cleaner. Although there's not much on the Bing page there's a distracting (to me) image at the top, the top menu fades into the colored background and there's an annoying left navigation thing that distracts from the main results. Visually these things sent my eyes searching around the page a bit more. (I think the 'related searches' on the LHS in Bing is a total distraction).

Here's a vanity search on Google (which I find easy to navigate):

And the same vanity search on Bing (lots of wasted space and a horizontal scroll bar):

However, the actual links that Bing was finding were just fine and in general I could easily find what I was looking for within the first five on either search engine. I didn't find Bing's cute popup page summary very helpful. I can usually figure out from the short summary below the link whether it's worth clicking on.

Bing really fails when it comes to news. It has no equivalent of a Google News search. Obviously, I'm interested in news stories about my book, The Geek Atlas so here's the search I do on Google News:

And here's the equivalent for Bing:

Also, Bing's image search is poorer than Google's. For example, when searching for images of a chi-square table on Google I get this:

With Bing I get irrelevant results and a horizontal scroll bar (yuck):

Overall, Bing is a worthy competitor for Google, but it's not worth switching to. Perhaps it will be one day.

Friday, June 19, 2009

Hazel Blears doesn't deal in pennies

The other amusing thing I came across late in the night was the Additional Cost Allowance for Hazel Blears. It seems that she doesn't claim pennies. In fact every single line item claimed for in that PDF file is a whole number of pounds (naturally there are no receipts because they are all either food or below the £250 threshold).

And they repeat month after month the same items with the same amounts. Here's a typical month:

Now, I need to admit that I've never been to Salford. Perhaps they don't use pennies there.

It's probably worth testing MPs' allowances with Benford's Law

Last night I couldn't sleep because I was thinking about all the juicy data in the newly released MPs' allowances. Unfortunately, the government chose to release these as scanned PDFs which makes them open but hard to work with.

Nevertheless, I figured it would be fun to grab a couple of MPs' allowances, type them into a spreadsheet and then run a little test with them. I chose two MPs: Alistair Darling (because he's the Chancellor and keeps tabs on everyone's money) and Harriet Harman (because she is listed as one of the cheapest MPs of all).

I obtained the PDF files for 2007-2008 here and here and entered into a spreadsheet the amounts listed on the expense claim forms line by line (excluding totals and not delving into receipts for additional breakdowns).

Then I wrote a little program to do the Benford's Law analysis of the first digits of these line items. From that I was able to plot the expected frequency of each digit and the actual.

Here's the data for Harriet Harman:

You can see that the actual figures and the predicted follow along together rather nicely, and if you do a chi-squared test to check how well the two fit you get a figure of 6.78, which indicates that it's not possible to reject the notion that "the figures in Harman's expense claims follow the Benford's Law distribution".
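For anyone who wants to try this at home, here's a minimal sketch of the sort of first-digit test described above. It is not the program I actually used: the function names are made up for illustration, the sample amounts are invented, and I'm assuming the usual cut-off of 15.51 (8 degrees of freedom, P=0.05) for the nine possible first digits.

```python
import math
from collections import Counter

# Critical chi-squared value for 8 degrees of freedom at P=0.05 (assumed cut-off).
CRITICAL_8_DF = 15.51


def benford_first_digit():
    """Expected probability of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(amount):
    """Return the first non-zero digit of a positive amount, e.g. 45.0 -> 4."""
    for ch in str(amount):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no non-zero digit in {amount!r}")


def chi_squared_vs_benford(amounts):
    """Pearson chi-squared statistic of the first digits against Benford's Law."""
    observed = Counter(first_digit(a) for a in amounts)
    n = sum(observed.values())
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_first_digit().items()
    )


if __name__ == "__main__":
    # Hypothetical line items, not real claims: lots of exact 300s and 45s.
    sample = [300, 300, 300, 45, 45, 45, 120.50, 18.99, 72.10, 9.25]
    stat = chi_squared_vs_benford(sample)
    verdict = "reject" if stat > CRITICAL_8_DF else "cannot reject"
    print(f"chi-squared = {stat:.2f}: {verdict} the Benford's Law hypothesis")
```

A sample this small is far too short for the chi-squared test to mean much; it's only there to show the mechanics. With a real list of line items you'd pass the whole column to chi_squared_vs_benford and compare the result against the cut-off.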

So, then I turned to the Chancellor and got a rather different chart:

And the chi-squared test comes out at 18.19 indicating that his expenses aren't following Benford's Law.

So then you have to ask yourself why? And in particular why does the Chancellor have "too many" 3s and 4s. The answer may lie in the expense claims themselves.

If you pop into the Chancellor's Additional Cost Allowance PDF you'll see that he claimed exactly £300 for food in May, July, September, October, November and December 2007.

The figures are exact and there are no receipts because they are not required for food bills. So, it's probably the case that these extra six number 3s account for the fact that he's got six more 3s than expected.

Now what about the fours, there are eleven more of those than expected. Note that these expected numbers don't have to match exactly the actual numbers. The idea is just to follow Benford's Law fairly closely.

So what accounts for the extra 4s? Perhaps the six occurrences of an exactly £45 telephone bill that he's claimed for.

Now, I'm not trying to claim that Mr Darling's expense claim is fraudulent, but Benford's Law is interesting because it can tell you where to go look for oddities. And the oddest thing I see is that the parliamentary rules allow an MP to make expenses claims without receipts. I know that I couldn't get away with an expense claim without receipts in my company.

There are precisely three receipts in the Chancellor's Additional Cost Allowance claim; contrast that with Gordon Brown's which is a veritable receipt mountain.

And now, I know, you are dying to see the PM's chart. Well, here it is. It's a bit like Harriet Harman's. The figures are close to what is expected.

And Brown's chi-squared is 9.37 (meaning his figures follow the Benford's Law distribution).

Thursday, June 18, 2009

Does Benford's Law apply to election results?

Yesterday I blogged about an analysis I did of the Iranian Presidential election results and showed that they seemed to match closely with Benford's Law in the first and second digits.

A few people asked me if it would be expected that election results meet Benford's Law and in reading up on the research in this area I wasn't very convinced that they would always work. I read a number of papers, including this one and thought its justification was weak.

I figured the best way to test this was to go get some real election data from a country that I assume is not electorally corrupt and run exactly the same test. I obtained the per-constituency election results for the 2001 UK General Election and ran the same test against the Labour and Conservative party candidates.

The result is that these results do not match Benford's Law at all in either first or second digits. Here are the relevant graphs.

For the first digit we get chi-squared values well above the critical level for both Labour and Conservative which indicates that there is no correlation and these are not Benford's Law distributed. The same applies to the second digits.

I suspect this is because of the way the size of constituencies is constrained in the UK. They are subject to resizing and try to balance local geography (e.g. borough boundaries) and keep to an average size. In fact, there's not a lot of variation in constituency size. This point is actually made by the paper that I refer to above, although it claims that the second-digit would be significant.

So in this case Benford's Law seems to tell us nothing because of the underlying shape of the data.

So, I'm left with the curious conundrum of the Iranian results which seem to follow Benford's Law rather nicely and the British ones that don't. Could this simply be to do with the way in which the voting areas (British constituencies and Iranian counties) are organized?

At any rate it doesn't seem to me that you can just take election data, apply Benford's Law and come to any useful conclusion. Looks like the Carter Center agrees.

Wednesday, June 17, 2009

Benford's Law and the Iranian Election

Yesterday I saw a tweet from The Guardian Datastore saying that the Iranian election results were available as a spreadsheet. My immediate thought was "Ooh! Benford's Law".

Benford's Law is a lovely piece of mathematics which allows you to predict the distribution of the digits in lots of numbers. It gets used for spotting fraud quite frequently because people think that the digits in numbers appearing in, say, financial reports are random. In fact, they follow Benford's Law with the number 1 occurring most frequently at the start of a number.

The same sort of analysis can be applied to election results. Unfortunately, The Guardian's spreadsheet only has results by region (and there are only 30 regions) so there's not as much data as I'd like. It would be great to have votes per polling place.

So I got the data and wrote a little Perl script to do the analysis. This chart shows the expected frequency of the first digit of the vote results per region and the actual numbers for Ahmadinejad and Mousavi. Just looking at it, it doesn't look like there's anything fishy going on.

Applying the Pearson chi-squared test to this data shows that these results are in line with Benford's Law (Ahmadinejad gets 5.44 and Mousavi 7.22).

Looking at the second digit is harder because there's not much data and the curve in Benford's Law is flatter. But here's the chart.

But once again applying the statistical test says that these values are in line with expectations (Ahmadinejad gets 10.86 and Mousavi 5.41).

So this analysis doesn't provide a smoking gun.


A couple of people (the first was a gentleman called Ali Hashemi) pointed me to per-county data for the Iranian election which provides a much greater level of granularity. It was trivial to modify my script to use this data. Once again here's the graph for Ahmadinejad and Mousavi looking at the first digit only by county.

I don't even need to do the statistical test to see how close those actual results are to the predicted result.

Now to the second digit.

To look at that one in detail, recall that the null hypothesis being tested is "the election results for candidate X match the Benford's Law distribution for the second digit". There are 9 degrees of freedom for the second digit, so the critical cut-off value for statistical significance (at P=0.05) is 16.92.

Running the test for the data above we get Ahmadinejad with 6.96 and Mousavi with 14.14. Since neither of those numbers is greater than 16.92 we'd conclude that the null hypothesis is not invalidated and so the results do match Benford's Law.
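Here's a rough sketch of how the second-digit version of the test can be put together. It isn't my Perl script; the function names are invented for illustration and the only numbers taken from above are the 16.92 cut-off and the two chi-squared values. The expected frequency of a second digit d comes from summing the Benford's Law probabilities over every possible first digit.

```python
import math
from collections import Counter

# Critical chi-squared value for 9 degrees of freedom at P=0.05, as quoted above.
CRITICAL_9_DF = 16.92


def benford_second_digit():
    """Expected probability of each second digit 0-9 under Benford's Law."""
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }


def second_digit(count):
    """Second digit of a whole-number vote count of at least 10."""
    text = str(int(count))
    if len(text) < 2:
        raise ValueError(f"{count!r} has no second digit")
    return int(text[1])


def chi_squared_second_digit(counts):
    """Pearson chi-squared statistic of the second digits against Benford's Law."""
    observed = Counter(second_digit(c) for c in counts)
    n = sum(observed.values())
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_second_digit().items()
    )


def matches_benford(statistic, critical=CRITICAL_9_DF):
    """True when the statistic is too small to reject the null hypothesis."""
    return statistic <= critical
```

The curve is much flatter than for the first digit (0 is expected about 12% of the time and 9 about 8.5%), which is why the second digit needs a lot more data before the test can say anything. Plugging in the figures above, matches_benford(6.96) and matches_benford(14.14) both come out true.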

It's true that Ahmadinejad's results seem to match the distribution more closely than Mousavi's (you can see this just by looking at the chart and how Mousavi's numbers 'bounce around'), but neither deviation is statistically significant.

I wonder what caused both of them to have a lot of 9s at the start?

Tuesday, June 16, 2009

If you build it, they will ignore it

The other day I came across a post about being an iPhone developer. It's an interesting look inside the mind of someone who made some good money for a small amount of effort with a non-original idea.

I think the key part comes when he says:

Blocked hit the App Store in early September 2008. I had a modest amount of downloads at first. And then, right after Christmas, sales jumped. I’m not sure what led to Blocked’s being chosen as a staff favorite, but I know that once I started actively promoting it through advertising, Web forums, YouTube, and Twitter, I saw an increase in activity.

In many ways the hard part comes after you've created something.

Take my book, The Geek Atlas, for example. I started writing it in May 2008 and spent 6 months full-time writing and researching. I stopped working and just worked on the book.

In November 2008, I handed the book to O'Reilly and in June of this year the book was published. It was a lot of work by me (and others) to get the book out. Now, do you think anyone cares how much work it was? Does any of that mean it'll sell?


What matters is promotion. Just because I wrote it doesn't mean people will buy it. And the main reason they won't buy it is they've never heard of it. The same applies to the sales of iPhone apps.

There are over 25,000 iPhone apps and a small number of slots on the iTunes main page where you can see top apps or recommendations. This is analogous to Amazon.com. My book's one of many, many books and it sure gets a boost when it's on one of Amazon's top-N lists, but what really matters is promotion.

If you've made the effort to create something, make the effort to promote it.

For example, my book was given lots of airtime by Leo Laporte and Steve Gibson on the Security Now #199 podcast. Last week there was an article in the San Francisco Chronicle about it and the same day there was another article in The Times (of London). All of these articles came about because I went out and promoted the book. You can do the same for an iPhone app.

To get my book mentioned in as many high impact places as possible I made a list of every newspaper, magazine, blog and podcast I'd like to see it in. Then I went and found the relevant editor, blog owner, podcaster, etc. I obtained their email addresses. I wrote individual emails to every single one of them, tailored to their publication. Those that responded got free review copies of the book.

In one case, the San Francisco Chronicle, they asked me to write an article for them about the book.

It was a lot of work to make those things happen, but the only way to make a creative work successful is for people to hear about it.

And it doesn't end there. The promotion continues to make sure that people actually received review copies. To see if they have questions and know how to contact me (I handed out my mobile phone number). And I've recorded videos promoting the book.

At some point I realized that O'Reilly wasn't going to have the time to exploit the geekatlas.com domain name that I'd persuaded them to buy. So, I got it from them (thank you!) and set up my own site there. The site is all part of the promotion process.

Most recently I've been asking people who've read the book if they'd be willing to write reviews on Amazon. That's an important part of getting the book into people's hands since I (and everyone else) read the reviews. I've been doing this by asking people who mention the book on Twitter if they'd write a review. You can search geek atlas and see exactly what I and others have said. (And, BTW, if you've read it please consider writing an honest review on Amazon.)

It's not enough to create, you have to market.

PS I've ignored all the wonderful efforts made by the staff of O'Reilly in promoting my book. I didn't mean to downplay your work... I'm just making a point about the effort it takes to get people to hear about The Geek Atlas.

The Billionaire Donation

Yesterday there was a post on Hacker News about how little money people who make donation-ware WordPress plugins actually end up getting.

Almost nine years ago I released my own donation-ware project called POPFile. It's a GNU GPL licensed email sorting program that uses Naive Bayes to do automatic sorting and spam removal. During 2003 and 2004 it was very popular.

One way of supporting POPFile was to make donations to my PayPal account and over the years people did make donations: 353 in all. The average donation size was $16.39 and I received a total of $5,784.95 (which works out to $74.17 per month). The following chart shows the donations received per month.

If I take SourceForge's numbers as accurate and representative of the total number of POPFile downloads then we have 928,800 downloads which means that 0.038% of downloads resulted in a donation. Or, put differently, a single download was worth $0.006.
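If you want to check my arithmetic, here's a quick sketch using only the figures quoted above (the variable names are just for illustration):

```python
# Figures quoted in the post.
donations = 353
total_usd = 5784.95
downloads = 928_800
usd_per_month = 74.17

average_donation = total_usd / donations      # dollars per donation
donation_rate = donations / downloads         # fraction of downloads that donated
usd_per_download = total_usd / downloads      # dollars earned per download
months_covered = total_usd / usd_per_month    # period implied by the monthly figure

print(f"average donation: ${average_donation:.2f}")
print(f"donation rate:    {donation_rate:.3%}")
print(f"per download:     ${usd_per_download:.3f}")
print(f"months covered:   {months_covered:.0f}")
```

The monthly figure implies the donations were spread over roughly six and a half years.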

One day in 2003 I received a donation from a billionaire. This person, who I'll call simply J. Doe, sent me $25 via PayPal:

You've Got Cash!

Dear John Graham-Cumming,

J. Doe just sent you money with PayPal.
J. Doe is a Verified buyer.

Payment Details

Amount: $25.00
Subject: POPfile donation
Note: Thanks for a great product, keep up the good work!

As I did for every single donation I received I replied with thanks:

Thanks for the donation. Glad to hear that POPFile is working out for you; are you just using it for spam filtering or something more?

And J. Doe replied:

Actually, I have 20 buckets for various topics I receive e-mail related to. One of them is spam, obviously. And I run multiple e-mail accounts through the system.

I'm also doing something potentially interesting, but a major hack: some of my accounts use APOP, so I'm using the hacked version I found in the forums. But the non-APOP accounts then don't work with the same instance of POPfile, using the Mac's mail program -- it always uses APOP if the greeter gives an APOP timestamp, even if you tell it not to. So I run a second instance of POPfile, and symlink the corpus to the first instance.

Kind of strange and bizarre, but it works for now.

I know it isn't trivial, but are you planning on adding support for SSL?

And we bounced emails back and forth for a while. J. Doe ended up telling me that POPFile had 'saved' an email address that had been public for years and that J. Doe wanted to continue to use.

But J. Doe only sent me $25. J. Doe probably could have afforded to send $250 or $2,500. But J. Doe sent $25.

This is entirely because I set the price of POPFile at $0. It's free. Donations are purely altruistic. J. Doe got nothing more from me than anyone else who's emailed me about POPFile over the years. And J. Doe even understood that he'd got a large amount of value from POPFile.

If you choose to do donation-ware you need to realize that almost no one donates. You are making a choice to give away your software and need to treat every donation as what it is: an unexpected gift.

If you want to make a living forget about donations and sell your software. Sell support for your software. Make it your living.

If I really wanted to get J. Doe's money I could have made POPFile closed source, I could have gone and sold the product. I could have made the case for how much saving that email address was worth and I could have charged J. Doe a lot more than $25.

But that's a whole different ball game; that's business.

Monday, June 08, 2009

The Geek Atlas helping to save Bletchley Park

When I was creating the original list of places in The Geek Atlas there were a few places that caused my heart to beat a little faster just recalling visits to them. One of those places was Bletchley Park.

Bletchley Park combines cryptography, the Second World War and the out-and-out genius Alan Turing. How could I not include Bletchley Park? It's the place Enigma and Lorenz were broken. Enigma is most famous, but I find Lorenz more fascinating because of its use of binary, pseudorandom numbers and the XOR operation (but that's another story---told in my book).

And now it's been restored to its wartime glory. Yet it is in severe financial difficulty and has recently been denied further funding by the British government. Some months ago I donated as much money as I could afford to help save Bletchley Park.

So, when O'Reilly (the publisher of my book) suggested that they would donate 50p for every copy of my book sold in the UK in the next 12 months I jumped at the chance to help.

So, if you buy The Geek Atlas in the UK you are getting a fascinating book which describes Bletchley Park (and 127 other great places to visit), the Enigma and Lorenz codes and Alan Turing's life and work.

And you are helping save this unique site.