Saturday, July 03, 2010

Bayes, Bletchley, JN-25 and a 'modern' optimization

In the current edition (June 2010) of Significance (the magazine of the Royal Statistical Society) there's an article titled Edward Simpson: Bayes at Bletchley Park. Simpson is the man behind Simpson's Paradox (which, if you have not heard of it, you should immediately go and read about because it's... weird). He also worked at Bletchley Park during the Second World War.

I found the article fascinating for three reasons: it's about Bletchley Park, it's about Bayes' theorem (which I've spent quite a lot of time dealing with) and it highlights an optimization that was done to enable Wrens to work faster, yet is still in use today.

The article talks about life at Bletchley Park, and the breaking of two codes: Enigma and the Japanese JN-25. So much has been written about Enigma that I have little more to say about it, but JN-25 is a different matter. Its breaking involved a bit of Bayes and ended up helping the Allied victory at the Battle of Midway.

JN-25

This Japanese naval code works as follows. The Japanese word to be enciphered is looked up in a book where a corresponding five-digit number is found. This number is the input to the encipherment. For example, the Japanese word for weather might be encoded as 30525. To help catch transcription errors the Japanese code numbers were all divisible by three, so a message could easily be checked for garbles: a number is divisible by three exactly when the sum of its digits is.
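That check is simple enough to sketch in a few lines of Python (the function name is my own invention; the 30525 example is Simpson's):

def is_valid_group(group: str) -> bool:
    # A group is divisible by three exactly when its digit sum is.
    return sum(int(d) for d in group) % 3 == 0

print(is_valid_group("30525"))  # True: 3+0+5+2+5 = 15
print(is_valid_group("30526"))  # False: a garbled digit breaks the check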

So the message would consist of blocks of five-digit numbers. Next, a book containing random five-digit numbers was used to select a sequence of numbers with which to encipher the message: for each five-digit group in the message, the next five-digit number from the book was taken in sequence.

Simpson's article has the following example:

Words: MARU GOOD WEATHER
Corresponding code numbers: 70863 34131 30525
Secret sequence from code book: 58304 68035 91107

The original five-digit code representing a word would be added to the secret five-digit number digit by digit with no carry (that is, each pair of digits is added modulo 10). For example, "weather" might be 30525 and the chosen secret number 91107, so the enciphered group would be 21622. The recipient would reverse the addition to recover the message.
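Here's a minimal sketch of that non-carrying arithmetic in Python; the helper names are mine, and the numbers come from Simpson's example:

def mod10_add(code: str, key: str) -> str:
    # Add two five-digit groups digit by digit with no carry (mod 10).
    return "".join(str((int(a) + int(b)) % 10) for a, b in zip(code, key))

def mod10_sub(cipher: str, key: str) -> str:
    # Reverse the addition: subtract digit by digit with no borrow.
    return "".join(str((int(a) - int(b)) % 10) for a, b in zip(cipher, key))

print(mod10_add("30525", "91107"))  # 21622: "weather" enciphered
print(mod10_sub("21622", "91107"))  # 30525: the recipient's step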

That's a very secure system if the secret numbers are truly random and are never reused. In fact the Japanese secret numbers were reused: multiple messages were sent using the same stretches of the 'random' sequence.

In addition, by cracking messages, Bletchley Park was able to build up a table of code numbers and their frequencies of occurrence. Those frequencies made an application of Bayes' theorem possible.

Here's how it worked. A code breaker would be given a set of messages, all known to have been enciphered with the same stretch of secret numbers. They might have between 2 and 20 messages. Suppose that in one message there's "good" (34131) enciphered as 92166 using the secret number 68035, and in another there's "stop" (50418) enciphered with the same secret number as 18443. The code breaker has 18443 and 92166 in front of them and, by lining the messages up on the table, has columns of numbers all enciphered with the same, unknown secret numbers.

The difference between 18443 and 92166 (using the same modulo arithmetic) will be 26387. This is in fact the difference between the original code numbers 50418 and 34131: the secret number 68035 has been eliminated.

What the folks at Bletchley Park did was build themselves a big table of differences between code numbers. The code breaker would look up 26387 in the book to see if this difference was known (indicating that the two underlying code numbers, 50418 and 34131, had been identified and recorded against it). If 26387 appeared in the book then the actual code numbers (50418 and 34131) could be subtracted from the enciphered numbers 18443 and 92166 respectively to reveal the secret number 68035 that was used to encipher the message.

Since they had many messages to work with they would actually check all pairs of numbers in a column: for each pair, the difference was calculated and looked up in the book to see if it was known.
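Under my assumptions about how such a difference book might be represented (the single entry below is reconstructed from Simpson's worked example), the search looks something like this in Python:

from itertools import combinations

def mod10_sub(a: str, b: str) -> str:
    return "".join(str((int(x) - int(y)) % 10) for x, y in zip(a, b))

# Hypothetical difference book: difference -> the pair of code numbers.
difference_book = {"26387": ("50418", "34131")}

# One column of enciphered groups, all under the same unknown secret number.
column = ["18443", "92166"]

for x, y in combinations(column, 2):
    # In practice both subtraction orders would be tried; this checks one.
    diff = mod10_sub(x, y)
    if diff in difference_book:
        code_x, code_y = difference_book[diff]
        # Subtracting the known code numbers reveals the secret number.
        print(mod10_sub(x, code_x), mod10_sub(y, code_y))  # 68035 68035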

Now for Bayes

On the table in front of the code breaker is a set of messages. They would take the number 68035 and subtract it from each enciphered number in the same column to reveal the underlying code numbers. If any of the results was not divisible by three then the decipherment had failed; if not, the next stage was undertaken.
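A sketch of that stage (again the helper names are mine; the numbers are from the worked example):

def mod10_sub(a: str, b: str) -> str:
    return "".join(str((int(x) - int(y)) % 10) for x, y in zip(a, b))

def strip_secret(column, secret):
    # Subtract a candidate secret number from every group in a column.
    decoded = [mod10_sub(group, secret) for group in column]
    if all(int(d) % 3 == 0 for d in decoded):
        return decoded
    return None  # any non-multiple of three means the candidate failed

print(strip_secret(["92166", "18443"], "68035"))  # ['34131', '50418']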

Since the code numbers represented actual words they had a pattern of frequency of occurrence. So the code breaker would take the list of numbers from the column they had deciphered and look up their frequencies in a book. Suppose the code breaker has numbers a, b, c, ... and finds that a and c are in the book but b is not (perhaps because it's wrong, or perhaps because that five-digit group has never been seen before). The book gives P(a) and P(c) (the probabilities of those words occurring, based on their frequencies); the probability that a five-digit group would occur at random (actually 3/100000) was called P. Thus the Bayesian evidence given by the occurrence of a is P(a)/P (and similarly for c).

For the word b that hasn't been seen before (or may be wrong), a frequency of occurrence of 1 was assumed, which meant that P(b) was tiny and hence b was essentially ignored.

Then Bayes' theorem was used to multiply together P(a)/P and P(c)/P to get a probability that the number 68035 was the real underlying secret number used to encipher the column. If that probability was above a certain threshold, 68035 was taken as genuine and the decipherment could continue.
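Here's how that scoring might look in Python. Only the 3/100000 figure comes from the article; the frequency book entries, the unseen-group probability and the threshold are invented purely for illustration:

P = 3 / 100000  # chance of a given five-digit group arising at random

# Hypothetical frequency book: code number -> probability of occurrence.
frequency_book = {"34131": 0.004, "50418": 0.009}

def bayes_score(decoded_groups, threshold=1000.0):
    # Multiply the Bayes factors P(x)/P for each group in the column.
    score = 1.0
    for group in decoded_groups:
        p = frequency_book.get(group, 1 / 100000)  # unseen: tiny, non-zero
        score *= p / P
    return score > threshold, score

print(bayes_score(["34131", "50418"]))  # (True, ~40000.0)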

Streamlining

There are three things that struck me in Simpson's explanation. Firstly, the prior probability is ignored in the Bayesian calculation. The same sort of thing happens in naive Bayesian text classifiers, where the probability of the 'document' or 'text' itself is irrelevant: it just acts as a scale factor common to every outcome and can be ignored. I did exactly that in the implementation of POPFile.

Secondly, the handling of the 'unknown' code numbers (b in the example above) is very similar to the way in which POPFile handles unknown words. A very small (close to 0, but non-zero) probability is chosen, indicating that the word has probably never been seen, but just might turn up.

Thirdly, rather than multiplying Bayes factors together, POPFile adds the logs of the Bayes factors. This is fast and prevents arithmetic underflow when dealing with tiny probabilities. Recall that log AB = log A + log B and log A/B = log A - log B. This is exactly the same trick that makes slide rules work (sliding the rule is 'adding' on a logarithmic scale).

This technique is common (I certainly didn't invent it) and was used at Bletchley Park. Instead of having Wrens do laborious multiplications and divisions of Bayes factors, a book of scores was created consisting of the log of the Bayes factor for each code number. Checking whether the code numbers passed the Bayes test to reveal a usable secret number was thus a matter of looking up and adding these scores and seeing if the sum passed a threshold value.
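In Python the log version of the earlier sketch would look like this (the frequencies and threshold are the same invented ones as before):

import math

P = 3 / 100000

# The same hypothetical frequencies, pre-converted to log Bayes factors;
# a table of such scores is essentially what the Bletchley book contained.
log_scores = {"34131": math.log(0.004 / P), "50418": math.log(0.009 / P)}

def log_bayes_score(decoded_groups, log_threshold=math.log(1000.0)):
    # Add log Bayes factors instead of multiplying the factors themselves.
    total = sum(log_scores.get(g, math.log((1 / 100000) / P))
                for g in decoded_groups)
    return total > log_threshold, total

print(log_bayes_score(["34131", "50418"]))  # (True, ~10.6), i.e. log(40000)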

PS If you have access to Significance, the rest of the article is well worth reading.
