Friday, June 24, 2011

Mean and standard deviation of the distances between letters

The following table was computed by calculating the distance between successive occurrences of each letter in Jane Austen's Pride and Prejudice (for example, the distances between e's in the sentence "Hello, my name is Jeremy" are 9, 4 and 2; all punctuation and spaces are ignored).

Letter    Mean     Standard Deviation
e         7.7      6.3
t         11.5     10.0
a         12.9     11.1
o         13.3     12.0
i         14.2     12.3
n         14.2     12.5
h         16.0     14.2
s         16.3     15.6
r         16.5     14.4
d         24.1     22.2
l         25.0     25.9
u         35.6     34.5
m         36.5     35.1
c         39.2     39.2
y         42.3     42.3
w         43.9     41.9
f         44.6     43.9
g         52.8     50.4
b         58.9     55.6
p         63.5     68.3
v         94.4     92.0
k         165.1    172.8
j         569.2    638.9
z         582.8    834.4
x         635.0    669.6
q         859.4    873.9
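
For anyone who wants to try this on another text, here is a minimal sketch of the calculation described above. It is not the code used to produce the table: the function names `letter_gaps` and `gap_stats` are made up for illustration, and the use of the population standard deviation is an assumption, since the post does not say which variant was used.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev


def letter_gaps(text):
    """Map each letter to the distances between its successive occurrences."""
    # Lowercase the text and drop everything that is not a-z,
    # so punctuation, spaces and digits are ignored.
    letters = re.sub(r"[^a-z]", "", text.lower())
    last_seen = {}
    gaps = defaultdict(list)
    for pos, ch in enumerate(letters):
        if ch in last_seen:
            gaps[ch].append(pos - last_seen[ch])
        last_seen[ch] = pos
    return gaps


def gap_stats(text):
    """Return {letter: (mean gap, standard deviation of gaps)}."""
    # Assumption: population standard deviation (pstdev), not sample (stdev).
    return {ch: (mean(d), pstdev(d)) for ch, d in sorted(letter_gaps(text).items())}


# Tiny check: in "banana" the a's sit at positions 1, 3 and 5.
print(letter_gaps("banana")["a"])  # [2, 2]
```

Passing the full text of a book to `gap_stats` gives one (mean, standard deviation) pair per letter that occurs at least twice.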

4 comments:

martijn said...

I take it this is quite useful for cracking simple codes. Did you try other works? More recent ones? Non-fiction ones? Do they differ significantly? How big a sample text would you need for these numbers to approach their 'real' values?

Questions, questions...

idpage said...

Presumably there's some kind of mathematical relationship between these distances and the relative frequencies of the letters, possibly distorted by English n-graph frequencies?

RBerenguel said...

An odd post, John. Why were you looking at this? Some cryptographic thing, I guess? Or just out of boredom?

Cheers,

Ruben @mostlymaths.net

sesqu said...

Those measurements seem a lot like the actual letters might fit Poisson distributions. I guess that means she has a wide vocabulary?