Yesterday I saw a tweet from The Guardian Datastore saying that the Iranian election results were available as a spreadsheet. My immediate thought was "Ooh! Benford's Law".
Benford's Law is a lovely piece of mathematics that predicts the distribution of leading digits in many naturally occurring sets of numbers. It gets used for spotting fraud quite frequently because people tend to assume that the digits in numbers appearing in, say, financial reports are uniformly random. In fact, they follow Benford's Law, with the digit 1 occurring most frequently at the start of a number (about 30% of the time).
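To make that concrete, here is a minimal Python sketch of the first-digit distribution, using the standard formula P(d) = log10(1 + 1/d). It is not the Perl script mentioned in this post, and the function name `benford_first_digit` is just an illustrative choice.

```python
import math

def benford_first_digit():
    """Expected probability of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit()
for digit, p in probs.items():
    # Probabilities fall steadily from digit 1 down to digit 9.
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to 1, and a leading 1 is roughly six and a half times as likely as a leading 9.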
The same sort of analysis can be applied to election results. Unfortunately, The Guardian's spreadsheet only has results by region (and there are only 30 regions) so there's not as much data as I'd like. It would be great to have votes per polling place.
So I got the data and wrote a little Perl script to do the analysis. This chart shows the expected frequency of the first digit of the vote results per region and the actual numbers for Ahmadinejad and Mousavi. Just looking at it, it doesn't look like there's anything fishy going on.
Applying the Pearson chi-squared test to this data shows that these results are in line with Benford's Law (Ahmadinejad gets 5.44 and Mousavi 7.22, both well under the critical value of 15.51 for 8 degrees of freedom at P=0.05).
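For anyone who wants to try this at home, the first-digit test can be sketched in Python roughly as follows. This is an illustration rather than the script I actually ran: the helper names are made up, and the `sample_votes` list is invented purely to exercise the code (it is not the Iranian data).

```python
import math
from collections import Counter

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def benford_chi_squared(counts):
    """Pearson chi-squared statistic of first-digit counts against Benford's Law."""
    observed = Counter(first_digit(n) for n in counts if n > 0)
    total = sum(observed.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (observed[d] - expected) ** 2 / expected
    return stat

# Hypothetical vote counts, invented for illustration only.
sample_votes = [1204, 1893, 2310, 145, 3077, 1120, 987, 4521, 1675, 2208,
                612, 1349, 7190, 1011, 2950, 5340, 1788, 860, 3999, 1402]

CRITICAL_8_DF = 15.51  # chi-squared critical value, 8 degrees of freedom, P = 0.05
stat = benford_chi_squared(sample_votes)
print(f"chi-squared = {stat:.2f}, reject null = {stat > CRITICAL_8_DF}")
```

With only a few dozen numbers the expected count for the higher digits is small, so the chi-squared approximation is rough; that is part of why more granular data would be better.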
Looking at the second digit is harder because there's not much data and the curve in Benford's Law is flatter. But here's the chart.
Once again, applying the statistical test says that these values are in line with expectations (Ahmadinejad gets 10.86 and Mousavi 5.41).
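To see why the second-digit curve is flatter, here is a small Python sketch of the expected second-digit distribution. It uses the standard generalisation of Benford's Law, summing log10(1 + 1/(10*d1 + d2)) over every possible first digit d1; the function name is illustrative only.

```python
import math

def benford_second_digit():
    """Expected probability of each second digit 0..9 under Benford's Law."""
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

second = benford_second_digit()
for digit, p in second.items():
    # The values only drift from about 12% (digit 0) down to about 8.5% (digit 9).
    print(f"{digit}: {p:.4f}")
```

Because the ten probabilities are so close together, a small sample has a hard time showing a meaningful departure from them.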
So this analysis doesn't provide a smoking gun.
A couple of people (the first was a gentleman called Ali Hashemi) pointed me to per-county data for the Iranian election which provides a much greater level of granularity. It was trivial to modify my script to use this data. Once again here's the graph for Ahmadinejad and Mousavi looking at the first digit only by county.
I don't even need to do the statistical test to see how close those actual results are to the predicted result.
Now to the second digit.
To look at that one in detail, recall that the null hypothesis being tested is "the election results for candidate X match the Benford's Law distribution for the second digit". The second digit can take ten values (0 to 9), which gives 9 degrees of freedom, and so the critical cutoff value for statistical significance (at P=0.05) is 16.92.
Running the test for the data above we get Ahmadinejad with 6.96 and Mousavi with 14.14. Since neither of those numbers is greater than 16.92, the null hypothesis is not rejected, and so the results are consistent with Benford's Law.
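The decision rule is simple enough to write down. In this Python sketch the statistics and the critical value are the ones quoted in this post; the names `CRITICAL_9_DF` and `rejects_null` are just illustrative.

```python
CRITICAL_9_DF = 16.92  # critical value quoted above: 9 degrees of freedom, P = 0.05

# Second-digit chi-squared statistics quoted above (per-county data).
statistics = {"Ahmadinejad": 6.96, "Mousavi": 14.14}

def rejects_null(stat, critical=CRITICAL_9_DF):
    """True when the statistic exceeds the critical value, i.e. the fit is rejected."""
    return stat > critical

for candidate, stat in statistics.items():
    verdict = "reject" if rejects_null(stat) else "fail to reject"
    print(f"{candidate}: {stat:.2f} vs {CRITICAL_9_DF} -> {verdict} the null hypothesis")
```

Failing to reject is weaker than proving a match: it only says the data gives no statistically significant evidence of a departure from Benford's Law.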
It's true that Ahmadinejad's results seem to match the distribution more closely than Mousavi's (you can see this just by looking at the chart and how Mousavi's numbers 'bounce around'), but neither deviation is statistically significant.
I wonder what caused both of them to have a lot of 9s at the start?