Saturday, June 30, 2012

Meet Brad Bradstone, CEO of Yellow Yellow

Last week I wrote a parody post on Hacker News that made fun of the life hacker movement and of what I saw as the sort of oversharing that some startup founders and CEOs indulge in. To my surprise the post got almost 600 points and received 125 comments. Many of the comments expressed how funny people found it (especially the concept of double stealth mode) and some asked for more.

Hacker News isn't the right place for a continuing stream of such long posts, so I've created a blog for Brad Bradstone, the fictional CEO of New York, New York based startup Yellow Yellow. That blog is called Double Stealth.

This is a bit of an experiment because I may run out of time or inspiration in writing as Brad, but I hope people find it funny.  I would love to hear feedback from readers who enjoy it, or who have suggestions for things Brad might write about.

So far, he's written the following posts:

Tuesday, June 26, 2012

Waiting around for something to happen in Go

I was faced with what turned out to be a synchronization problem with a simple solution using Go.  How do you get multiple goroutines to wait for a single event to occur?  The answer is... close a channel.
package main

import (
    "fmt"
)

func main() {
    count := make(chan int)
    hold  := make(chan int)

    num := 16

    for i := 0; i < num; i++ {
        go func(i int) {
            fmt.Printf("%d ", i)
            count <- 1

            <-hold

            fmt.Printf("%d ", i)
            count <- 1
        }(i)
    }

    for reported := 0; reported != num; reported += <-count {
    }

    fmt.Printf("\n")

    close(hold)

    for reported := 0; reported != num; reported += <-count {
    }

    fmt.Printf("\n")
}
This program outputs something like:
0 2 1 4 5 3 8 9 7 6 12 13 14 11 10 15 
1 5 4 3 2 0 8 6 13 14 12 7 9 11 10 15 
The exact order will depend on the order in which the goroutines are executed.  The program uses two channels called count and hold. The count channel is used by the 16 goroutines to each inform the main program that they have performed a printing task.  The main routine counts the number of messages it has seen on the count channel so it knows when all the goroutines have completed their first print.

The main routine then closes the hold channel.  All of the goroutines are blocked receiving on that channel.  When a channel is closed, every pending receive completes immediately (yielding the zero value), so closing hold releases all of the goroutines at once.

The goroutines then all print out their id number again and terminate.  The program waits for them to finish by receiving another 16 messages on the count channel.

Monday, June 25, 2012

What the final Turing Machine Google Doodle does

If you solved the puzzles in the Google Doodle commemorating Alan Turing's 100th birthday and clicked on the rabbit you'd get to see the machine performing this program starting with a blank tape.


Translating that into pseudo-code you obtain the following:
one

do
  if one
    blank

    do
      right
    while not blank

    one
    right
    zero

    while not blank
      left
    end

    one
    right
  else
    blank

    do
      right
    while not blank

    one

    while not blank
      left
    end

    zero
    right
  end
end
Here one, zero and blank each write the corresponding symbol to the current square of the tape; when used in an if or while they test for the presence of that symbol.  right and left move the head one square right or left.

The two branches of the main if are doing very similar things.  They both blank out the current square, then move right until they find a blank square and write "1 0" or "1" depending on whether the original square examined in the if was a 1 or a 0.  Then the head moves back to the blank left behind and the original 1 or 0 is restored.

So, each iteration extends the sequence on the tape by "1 0" or "1" depending on whether the symbol it reads is a 1 or a 0.

Each time around the outer loop the tape changes as follows:
1
1 1 0
1 1 0 1 0
1 1 0 1 0 1
1 1 0 1 0 1 1 0
1 1 0 1 0 1 1 0 1
1 1 0 1 0 1 1 0 1 1 0
1 1 0 1 0 1 1 0 1 1 0 1 0
1 1 0 1 0 1 1 0 1 1 0 1 0 1
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 1
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1
1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0
This program appears to be calculating the rabbit sequence 1, 10, 101, 10110, ... and writing it out onto the tape.
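That per-square rule is easy to check directly: each time around the outer loop the machine reads one symbol, appends "1 0" for a 1 or "1" for a 0, and moves the head right.  A little sketch of mine (not part of the Doodle) reproduces the tape states above:

```go
package main

import "fmt"

// spaced renders the tape with spaces between symbols, as in the post.
func spaced(t []byte) string {
	out := ""
	for i, c := range t {
		if i > 0 {
			out += " "
		}
		out += string(c)
	}
	return out
}

func main() {
	tape := []byte{'1'}
	head := 0

	// Each outer-loop iteration: read the symbol under the head,
	// append "10" for a 1 or "1" for a 0, then move the head right.
	for step := 0; step < 8; step++ {
		fmt.Println(spaced(tape))
		if tape[head] == '1' {
			tape = append(tape, '1', '0')
		} else {
			tape = append(tape, '1')
		}
		head++
	}
}
```

The first few lines printed are "1", "1 1 0", "1 1 0 1 0", matching the tape evolution listed above.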

And here's a little Perl program that simulates the Turing machine in the Google Doodle:
use strict;
use warnings;

my @tape;
my $pos = 0;

one();

do {
  dump_tape();

  if ( is_one() ) {
    blank();
    do { right() } while ( !is_blank() );

    one(); right(); zero();

    while ( !is_blank() ) { left() }

    one();
  } else {
    blank();
    do { right() } while ( !is_blank() );

    one();

    while ( !is_blank() ) { left() }

    zero();
  }

  right();
} while(1);

sub zero  { x('0') }
sub one   { x('1') }
sub blank { x(' ') }
sub x     { $tape[$pos] = $_[0] }

sub is_blank { is(' ') }
sub is_one   { is('1') }
sub is       { $tape[$pos] eq $_[0] }

sub right
{
  if ( $pos == $#tape ) {
    push @tape, (' ');
  }
  $pos += 1;
}
sub left  { $pos -= 1 }

sub dump_tape
{
  print join( ' ', @tape ), "\n";
  sleep(1);
}

Friday, June 22, 2012

Is there a 'Moore's Law' for web pages?

I came across an article I wrote in 1999 for The Guardian entitled Cut your modem's flying time which mentions that at the time the HTML of The Guardian's home page was 18kB.  Today the home page is more like 250kB.

I was curious about the growth pattern so using the Internet Archive I downloaded the home page of The Guardian for every year available from 1996 to 2011 (plus the current page) and compared the sizes of the HTML of the front page.  Here's the raw data:

  Year  Bytes
  ----  -----
  1996   5381
  1997  11140
  1998  10435
  1999  39013
  2000  97746
  2001  70933
  2002  92995
  2003  81833
  2004  92637
  2005  92078
  2006 108445
  2007 118300
  2008 186670
  2009 184271
  2010 181221
  2011 192592
  2012 253748

This excludes anything other than the raw HTML of / on The Guardian.  Clearly, it's grown a lot, but, curious about the pattern, I decided to see what curve would fit using Wolfram Alpha.  Linear, quadratic and cubic fits all had an R^2 of about 0.90, but an exponential fitted with an R^2 of 0.97.

The exponential that generates that curve is 28985.6 * (1.134292^x) (x being the year counting 1996 as 0).  For comparison, Moore's Law is n * 1.414213^x (doubling every two years; I don't have an estimate for n).
For that exponential doubling takes a bit more than 5 years.
I wonder if there's a 'Moore's Law' for web sites.  Are we seeing exponential growth in the HTML used?  And what happens if we take into account the other assets?  And what's the relationship between this and bandwidth consumed on the Internet?
Discuss.

New Scientist's Instant Expert on Alan Turing now free to non-subscribers

Tomorrow is the 100th anniversary of the birth of Alan Turing and earlier this year I was asked to write a special supplement for New Scientist covering his life and work.  The supplement was in the June 1 edition of the physical magazine and available on line to subscribers only.

This morning a gentle nudge to New Scientist persuaded them to make the entire thing open to anyone who registers on their site.  You don't have to pay, just register.  The complete thing is here.

The individual articles in the special are:
A big thank you to New Scientist for making this available to all.

Wednesday, June 13, 2012

Anagrams

The following is a list of words that can be anagrammed into other words, ordered by the number of words that share the same letters (e.g. dog and god have the same letters; trap, part and prat have the same letters).  It shows all the sets with 6 or more mutual anagrams.

I particularly like that canter can turn into trance, recant, nectar, cretan, tanrec, and creant.

10 angor argon goran grano groan nagor orang organ rogan ronga
10 elaps lapse lepas pales salep saple sepal slape spale speal
9 ester estre reest reset steer stere stree terse tsere
9 caret carte cater crate creat creta react recta trace
8 asteer easter eastre reseat saeter seater staree teaser
8 laster lastre rastle relast resalt salter slater stelar
8 armet mater metra ramet tamer terma trame trema
8 leapt palet patel pelta petal plate pleat tepal
8 arist astir sitar stair stria tarsi tisar trias
7 canter creant cretan nectar recant tanrec trance
7 alem alme lame leam male meal mela
7 acinar arnica canari carian carina crania narica
7 argel ergal garle glare lager large regal
7 ante aten etna neat taen tane tean
7 least setal slate stale steal stela tales
7 aril lair lari liar lira rail rial
7 alert alter artel later ratel taler telar
7 abel able albe bale beal bela blae
7 dater derat detar drate rated trade tread
7 easting gainset genista ingesta seating signate teasing
7 arpent enrapt entrap panter parent pretan trepan
7 entrail latiner latrine ratline reliant retinal trenail
7 darter dartre redart retard retrad tarred trader
7 albeit albite baltei belait betail bletia libate
7 alien aline anile elain elian laine linea
7 atle laet late leat tael tale teal
7 aldern darnel enlard lander lenard randle reland
7 lepra paler parel parle pearl perla relap
7 emir imer mire reim remi riem rime
6 aitch chait chati chita taich tchai
6 acher arche chare chera rache reach
6 reins resin rinse risen serin siren
6 cerotin cointer cotrine cretion noticer rection
6 gater grate great greta retag targe
6 degrain deraign deringa gradine grained reading
6 angler arleng garnel largen rangle regnal
6 alban balan banal laban nabal nabla
6 caliver caviler claiver clavier valeric velaric
6 aster serta stare strae tarse teras
6 alerse leaser reales resale reseal sealer
6 agnel angel angle genal glean lagen
6 amen enam mane mean name nema
6 apert pater peart prate taper terap
6 petrous posture proetus proteus septuor spouter
6 angrite granite ingrate tangier tearing tigrean
6 pearlet pleater prelate ptereal replate repleat
6 amil amli lima mail mali mila
6 alevin alvine valine veinal venial vineal
6 altair atrail atrial lariat latria talari
6 sawt staw swat taws twas wast
6 balder bardel bedlar bedral belard blader
6 halse leash selah shale sheal shela
6 danuri diurna dunair durain durani durian
6 asper parse prase spaer spare spear
6 grein inger nigre regin reign ringe
6 corset cortes coster escort scoter sector
6 ates east eats sate seat seta
6 beround bounder rebound unbored unorbed unrobed
6 actor corta croat rocta taroc troca
6 derange enraged gardeen gerenda grandee grenade
6 asterin eranist restain stainer starnie stearin
6 clethra latcher ratchel relatch talcher trachle
6 esker keres reesk seker skeer skere
6 acrolein arecolin colinear cornelia creolian lonicera
6 agroan angora anogra arango argoan onagra
6 inset neist snite stein stine tsine
6 evil levi live veil vile vlei
6 anis nais nasi nias sain sina
6 palster persalt plaster psalter spartle stapler
6 abord bardo board broad dobra dorab
6 aspen panse snape sneap spane spean
6 mate meat meta tame team tema
6 roset rotse soter stero store torse
6 poter prote repot tepor toper trope
6 resaw sawer seraw sware swear warse
6 earthen enheart hearten naether teheran traheen
6 owser resow serow sower swore worse

Here's the code that I used to generate the list
use strict;
use warnings;

my %proper;

open W, '<', '/usr/share/dict/propernames' or die "propernames: $!";
while (<W>) {
    chomp;
    $proper{lc($_)} = 1;
}
close W;

my %words;

open W, '<', '/usr/share/dict/words' or die "words: $!";
while (<W>) {
    chomp;
    my $w = lc($_);
    if ( !defined( $proper{$w} ) ) {
        my $l = join('', sort split('', $w));
        if ( !defined( $words{$l}{words}{$w} ) ) {
            $words{$l}{count} += 1;
            $words{$l}{words}{$w} = 1;
        }
    }
}
close W;

foreach my $w (sort { $words{$b}{count} <=> $words{$a}{count} } keys %words) {
    print "$words{$w}{count} " . join(' ', sort keys %{$words{$w}{words}}), "\n";
}
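The letter-sorting trick at the heart of that script translates directly to other languages: the key for each word is its letters in sorted order, and anagrams share a key.  A sketch in Go over a tiny embedded word list (the real run used /usr/share/dict/words):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// key returns the word's letters in sorted order; anagrams share a key.
func key(w string) string {
	letters := strings.Split(strings.ToLower(w), "")
	sort.Strings(letters)
	return strings.Join(letters, "")
}

func main() {
	// Tiny stand-in for the dictionary file.
	words := []string{"trap", "part", "dog", "god", "prat", "tarp"}

	groups := map[string][]string{}
	for _, w := range words {
		groups[key(w)] = append(groups[key(w)], w)
	}

	// Print each anagram set with its size, as in the list above.
	for k, g := range groups {
		sort.Strings(g)
		fmt.Println(len(g), k, strings.Join(g, " "))
	}
}
```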

Friday, June 08, 2012

One way to fix your rubbish password database

Suppose for a moment that you are running a web site that has a poorly implemented password database that stores password hashes using something like SHA1(password) or MD5(password), or versions of that with some salt.  (This is what LinkedIn, last.fm and eHarmony were all doing, and undoubtedly many, many other companies are in the same position.)

And suppose you realize that if your password database gets disclosed your users are in danger and you'd really like to change the system used to something much more secure like bcrypt or scrypt.

Here's how you can do that in one go and secure your password database.

1. Suppose you have a database that contains password hashes for the n users of your site, and for each user i you have salt s_i and hash h_i (where h_i was computed with some algorithm such as SHA1 or MD5).  (The rest of these instructions work if there's no salt; just ignore it.)

2. Suppose that you've chosen to use scrypt.  For each user you first create a new random salt value s'_i and then compute a new hash h'_i using the formula scrypt(s'_i, h_i) and store the new salt and hash in the database.  You throw away the old weak hash h_i and forget it ever existed, leaving two salt values and a new hash.  (I've ignored the other scrypt parameters, which are left to the reader to determine.)

3. When user i logs in and presents password p you use your old weak hash algorithm (let's suppose it was md5(salt, password)) to compute a hash for comparison as follows: scrypt(s'_i, md5(s_i, p)), and compare that with the h'_i stored in the database.

4. If, like last.fm, you were also allowing third parties to authenticate users by presenting the old hash value instead of the password then you can still use this scheme.  When the third party presents hash h for user i you calculate scrypt(s'_i, h) and do the comparison.

5. If step 4 is not needed then you can go further when a user logs in.  Once user i has logged in successfully with password p you can completely eliminate any trace of the old weak hash by choosing a new random salt value s''_i, computing scrypt(s''_i, p) and storing that in the database.
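To make the scheme concrete, here's a sketch in Go.  Since scrypt isn't in the Go standard library, I've stood in a deliberately iterated SHA-256 for the 'strong' step and plain MD5 for the old weak hash; strongHash, upgrade and verify are my names, not anything from the steps above, and a real system should use scrypt or bcrypt with properly chosen parameters:

```go
package main

import (
	"bytes"
	"crypto/md5"
	"crypto/sha256"
	"fmt"
)

// strongHash stands in for scrypt: many rounds of salted SHA-256.
// (Illustration only; use a real KDF in production.)
func strongHash(salt, data []byte) []byte {
	h := append(append([]byte{}, salt...), data...)
	for i := 0; i < 10000; i++ {
		sum := sha256.Sum256(h)
		h = sum[:]
	}
	return h
}

// weakHash is the old scheme: md5(salt, password).
func weakHash(salt, password []byte) []byte {
	sum := md5.Sum(append(append([]byte{}, salt...), password...))
	return sum[:]
}

// upgrade wraps an existing weak hash in the strong one (step 2).
func upgrade(newSalt, oldHash []byte) []byte {
	return strongHash(newSalt, oldHash)
}

// verify recomputes the layered hash at login time (step 3).
func verify(stored, newSalt, oldSalt, password []byte) bool {
	return bytes.Equal(stored, strongHash(newSalt, weakHash(oldSalt, password)))
}

func main() {
	oldSalt := []byte("s1")
	newSalt := []byte("s2")
	password := []byte("correct horse")

	// One-off migration: all we had in the database was md5(oldSalt, password).
	stored := upgrade(newSalt, weakHash(oldSalt, password))

	fmt.Println(verify(stored, newSalt, oldSalt, password))        // true
	fmt.Println(verify(stored, newSalt, oldSalt, []byte("wrong"))) // false
}
```

The point of the sketch is that the migration in upgrade needs only the stored weak hash, never the password itself.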

This has the effect of immediately making your password database more secure if stolen without any effort on the part of your users.

Wednesday, June 06, 2012

Don't be reckless with other people's hearts, And don't put up with people that are reckless with yours

And that goes for passwords also.

Today, LinkedIn has been the subject of a disclosure of 6.4 million password hashes likely corresponding to at least that many LinkedIn accounts.  A file containing 6.4 million hashed passwords was released onto the Internet and most (though not all) of those passwords were easily cracked.  Given that each hash in the file was unique, and that many people use the same passwords (such as password, secret and, in this case, l1nked1n), it's likely that many more than 6.4 million customers of LinkedIn are affected.

After many hours of 'we're looking into it', LinkedIn posted a somewhat detail-free blog post entitled An Update on LinkedIn Member Passwords Compromised.   It begins:
We want to provide you with an update on this morning’s reports of stolen passwords. We can confirm that some of the passwords that were compromised correspond to LinkedIn accounts. We are continuing to investigate this situation and here is what we are pursuing as far as next steps for the compromised accounts:
There's no data here on how many, just a vague 'some'.  Pity that they missed an opportunity to simply come clean.  Further on they say:
These affected members will receive a second email from our Customer Support team providing a bit more context on this situation and why they are being asked to change their passwords.
Pity again that they missed the opportunity to be open here.  Why is it that only the affected people get 'more context'?  But they continue:
It is worth noting that the affected members who update their passwords and members whose passwords have not been compromised benefit from the enhanced security we just recently put in place, which includes hashing and salting of our current password databases. 
This paragraph is extremely worrying.  Firstly, it's an admission that LinkedIn were not using anything close to a best practice in storing passwords.  The leaked passwords were all hashed with SHA1, an almost useless technique given the speed with which SHA1 based passwords can be tested.  Worse, the 'enhanced solution' of  'hashing and salting' is literally 1970s technology.

Of course, they could have implemented something strong like PBKDF2 or scrypt.  And given Kerckhoffs's Principle it wouldn't have hurt them to have come out publicly and said precisely what they were doing.  But they didn't.

Also, there's no news on when 'recently' was.  Worst case, it was just before that blog post was written. I hope that's not true.

And then...
We sincerely apologize for the inconvenience this has caused our members. We take the security of our members very seriously.
Inconvenience?  It's hard to take seriously the final sentence given that it appears that the storage of passwords was very, very weak.

What they should do now is simply tell the truth (the whole truth).  Admit their failings, tell us how many people were compromised, tell us how they are storing passwords now.  In security openness breeds trust.

PS Just before LinkedIn posted their admission that passwords had been disclosed, they put up a blog post on password security which contains terrible advice such as "Substitute numbers for letters that look similar (for example, substitute '0' for 'o' or '3' for 'E')."  In a situation like today's LinkedIn disclosure that does absolutely nothing: all the good password cracking software can make those substitutions.

They also didn't recommend using a password manager of any type, but did recommend not writing your passwords down and using a different password on each site.  How, exactly, is someone who is not Rain Man meant to remember a different complex password for every site and, following LinkedIn's advice, change them every three months?

Either write your passwords down or use a password manager.

LinkedIn's entire handling of the situation makes me even happier that I don't have a LinkedIn account.



Monday, June 04, 2012

Animated solution to the "Never gonna give you up" program problem

I mentioned the fun "smallest program to output the lyrics of 'Never gonna give you up'" challenge the other day.  It being a bank holiday weekend, I've come up with my own solution, which is 589 bytes of Perl.

It works by building a dictionary with which to compress the lyrics, plus a simple decompressor.  Like many of the solutions, it has the dictionary indexed by single bytes and built into a single string that expands to the result.

Here's the compressor.
use strict;
use warnings;

my $n =<<EOF;
We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy
I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

We've known each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
And if you ask me how I'm feeling
Don't tell me you're too blind to see

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

(Ooh, give you up)
(Ooh, give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)

We've know each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it

I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
EOF

my @b = map(chr, (reverse 129..254));

foreach my $i (reverse 2..173) {
    my $found = 1;
    while($found) {
        my %saving;
        $found = 0;

        foreach my $j (0..length($n)-$i) {
            my $s = substr($n, $j, $i);
            my $t = $n;
            $t =~ s/\Q$s\E/0/g;
            $saving{$s} = length($n) - length($t) - $i - 1;
        }

        my $top = (sort { $saving{$b} <=> $saving{$a} } keys %saving)[0];
        if ($saving{$top} > 0) {
            my $nb = shift(@b);
            $n =~ s/\Q$top\E/$nb/g;
            $n = $top . "$nb$n";
            $found = 1;
        }
    }
}

print $n;
It works by considering all substrings, starting at length 173 (which I determined by looping; it corresponds to the entire chorus) and working downwards. For each substring it computes the space saved by extracting it into a dictionary entry, and replaces the substring that saves the most space with a dictionary byte. When no substring of a given length saves any space it moves on to the next (smaller) length. The result is a 514-byte string containing a bunch of high-ASCII characters like this:
 gÐngÑveÒe ÓthÔouÕo Önd× yÕØÖloÙiÑ Út's Û(OohÜe'reÝ
YÕÞ knowß,ÐivàÚoá I'm â tÖsãtell ä a× åØ æeåçay it
è äértØêß ëÐonna ì oÔer íæupîmakeæïÛbeen ðëÔÓñÕ'rÓtoÖòeî)
óeÒrìô
We'Òßõ
NôöóÜ÷ôgiÒø howâfeeliÑ
ù
Nøú÷)ú, nø
(Givû
I just wannaéyÕùGotta ïu×ersta×
þü eachífor sÙÑÞr hearðachÚbut
YòshyãèInsidÓwÓboÔëwhaðgoán
WeñgamçwÝìplèýúîöletædownörun arÕ×ådeseêöïcryösayÐoodbyeöäa liçhuê
þWÝ nÖstraÑers tÙÒÞñrulesåsÖdÖI
A full commitmenÛwhatâÔinkáfÞ wÕldn'tÐet Ôis from anyíguyüõnýA× 
ifæask meùDon'témÓyòbli×ãee
þþ
Üà÷àûûóõýüþþ
To understand the compression look at the start. The first string to be decompressed will be " g", then "ng", then "ve" and so on. The high-ASCII characters are both the dictionary separators (between entries) and the characters to be replaced with the corresponding entry in the dictionary. Thus each occurrence of "Ð" is replaced with " g". A complete program to decompress to the lyrics is as follows:
$b=<<E;
 gÐngÑveÒe ÓthÔouÕo Önd× yÕØÖloÙiÑ Út's Û(OohÜe'reÝ
YÕÞ knowß,ÐivàÚoá I'm â tÖsãtell ä a× åØ æeåçay it
è äértØêß ëÐonna ì oÔer íæupîmakeæïÛbeen ðëÔÓñÕ'rÓtoÖòeî)
óeÒrìô
We'Òßõ
NôöóÜ÷ôgiÒø howâfeeliÑ
ù
Nøú÷)ú, nø
(Givû
I just wannaéyÕùGotta ïu×ersta×
þü eachífor sÙÑÞr hearðachÚbut
YòshyãèInsidÓwÓboÔëwhaðgoán
WeñgamçwÝìplèýúîöletædownörun arÕ×ådeseêöïcryösayÐoodbyeöäa liçhuê
þWÝ nÖstraÑers tÙÒÞñrulesåsÖdÖI
A full commitmenÛwhatâÔinkáfÞ wÕldn'tÐet Ôis from anyíguyüõnýA× 
ifæask meùDon'témÓyòbli×ãee
þþ
Üà÷àûûóõýüþþ
E
map{($a,$b)=split($c=chr,$b,2);$b=~s/$c/$a/g}(208..254);print$b
The final line does all the work, progressively splitting off each dictionary entry and then using a regular expression substitution to expand the string. There's a bunch of Perl abuse going on. Notice there's no ; at the end of the program or inside the { } block. The $c=chr works because chr defaults to reading the $_ variable set by map, and the assignment both sets $c and returns the value as the first argument to split. Also notice how it's valid Perl to write print$b with no space. Since it's fairly likely that some of those characters got messed up in the web version, here's a Base64 encoded version of the program for people to test with.
JGI9PDxFOwogZ9BuZ9F2ZdJlINN0aNRvddVvINZu
ZNcgedXY1mxv2WnRINp0J3Mg2yhPb2jcZSdyZd0K
WdXeIGtub3ffLNBpduDab+EgSSdtIOIgdNZz43Rl
bGwg5CBh1yDl2CDmZeXnYXkgaXQK6CDk6XJ02Orf
IOvQb25uYSDsIG/UZXIg7eZ1cO5tYWtl5u/bYmVl
biDw69TT8dUnctN0b9byZe4pCvNl0nLs9ApXZSfS
3/UKTvT289z39Gdp0vggaG934mZlZWxp0Qr5Ck74
+vcp+iwgbvgKKEdpdvsKSSBqdXN0IHdhbm5h6XnV
+UdvdHRhIO9112Vyc3Rh1wr+/CBlYWNo7WZvciBz
2dHeciBoZWFy8GFjaNpidXQKWfJzaHnj6Eluc2lk
03fTYm/U63doYfBnb+FuCldl8Wdhbed33exwbOj9
+u72bGV05mRvd272cnVuIGFy1dflZGVzZer272Ny
efZzYXnQb29kYnll9uRhIGxp52h16gr+V90gbtZz
dHJh0WVycyB02dLe8XJ1bGVz5XPWZNZJCkEgZnVs
bCBjb21taXRtZW7bd2hhdOLUaW5r4WbeIHfVbGRu
J3TQZXQg1GlzIGZyb20gYW557Wd1efz1bv1B1yBp
ZuZhc2sgbWX5RG9uJ3TpbdN58mJsadfjZWUK/v4K
3OD34Pv78/X9/P7+CkUKbWFweygkYSwkYik9c3Bs
aXQoJGM9Y2hyLCRiLDIpOyRiPX5zLyRjLyRhL2d9
KDIwOC4uMjU0KTtwcmludCRiCg==
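The decompression loop in that final line is easy to port.  Here's a sketch in Go of the same scheme (the code byte and sample data are mine, not the 589-byte solution): for each code byte in order, split the buffer at its first occurrence into a dictionary entry and the remainder, then expand every later occurrence of that code:

```go
package main

import (
	"bytes"
	"fmt"
)

// expand applies the dictionary scheme from the Perl one-liner: for each
// code byte, the buffer starts with "entry<code>rest"; every further
// occurrence of the code in rest is replaced by the entry.
func expand(data []byte, codes []byte) []byte {
	for _, c := range codes {
		parts := bytes.SplitN(data, []byte{c}, 2)
		if len(parts) != 2 {
			continue // code not present in the buffer
		}
		entry, rest := parts[0], parts[1]
		data = bytes.ReplaceAll(rest, []byte{c}, entry)
	}
	return data
}

func main() {
	// Toy example: 0xD0 expands to "llo", so the data below decodes
	// to "hello hello".
	data := []byte("llo\xD0he\xD0 he\xD0")
	fmt.Printf("%s\n", expand(data, []byte{0xD0})) // prints "hello hello"
}
```

Because later entries are split off only after earlier codes have been expanded, entries can themselves contain earlier codes, exactly as in the real 514-byte string.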
And to make it totally clear what's happening, here's a little animation of the decompression process.

Friday, June 01, 2012

Compression of the lyrics of "Never gonna give you up" using the Bentley/McIlroy algorithm

There's a fun StackExchange post about the smallest program to output the lyrics to Never gonna give you up making the rounds today.  I'm not entering the competition, but I thought it would be interesting to look at from a compression perspective.  I've been working a lot with the Bentley/McIlroy algorithm for long string compression and this is a good example to illustrate its output.

Here are the original lyrics (1871 bytes):
We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy
I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
We've known each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
And if you ask me how I'm feeling
Don't tell me you're too blind to see
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
(Ooh, give you up)
(Ooh, give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
(Ooh)
Never gonna give, never gonna give
(Give you up)
We've know each other for so long
Your heart's been aching but
You're too shy to say it
Inside we both know what's been going on
We know the game and we're gonna play it
I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
And here's the compressed version (note that the actual compressed file is a 611-byte binary file; here I've altered it for readability).  You see either literal text, or a reference to preceding text (which has potentially itself been expanded first).  The references are in the form <p,n> where p is the position (counting from zero at the start of the decompressed text) and n is the number of characters: it means 'copy n characters from position p to here'.
We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy
I just wanna tell you how I'm feeling
Gotta make you understand
Never gonna give you up<204,13>let you down<204,13>run around<45,5>desert you<20
4,13><184,9>cry<204,13>say goodbye<204,13>tell a lie<45,5>hu<285,7>
We've<30,5>n each<129,7>for so long
Your heart's been aching but
You're too shy to say it
Inside we both<30,6>wha<422,9>going on
We<30,10>game<45,5>we're<210,7>play it
And if you ask me<161,17>Don't tell m<220,5>'re too blind to see
<340,13><217,161><229,12><634,161>(Ooh,<633,12>)<967,24>)<340,13>give, n<230,11>
give
(G<635,10><1004,57><377,11><389,160><139,239><229,12><634,333>
The first <204,13> is the reference back to "Never gonna " which is used many times.  The <45,5> is a reference to the first instance of " and ".  Notice how "Never gonna make you cry" is made up from a reference to "Never gonna " and then "make you ".
But it's towards the end where large blocks of text are repeated in the lyrics that the reference scheme really comes into its own.   The end references are to large repeated blocks, but notice how those blocks themselves first have to be decompressed in the early parts of the lyrics.
On this short text we get compression to 33% of the original size, but gzip still wins, getting the text down to 402 bytes (22%).  Bentley/McIlroy works better on longer texts where long strings repeat over long distances.
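The copy step itself is only a few lines.  Here's a toy decoder of mine for the <p,n> notation (the Op type and names are my own; the actual file is a 611-byte binary format, not this): a reference appends n bytes starting at position p of the output built so far:

```go
package main

import "fmt"

// Op is either a literal string or a copy of N bytes from position P
// of the output produced so far.
type Op struct {
	Literal string
	P, N    int
}

func decode(ops []Op) string {
	out := []byte{}
	for _, op := range ops {
		if op.Literal != "" {
			out = append(out, op.Literal...)
		} else {
			// Copy byte-by-byte so a reference may overlap the
			// bytes it is itself producing.
			for i := 0; i < op.N; i++ {
				out = append(out, out[op.P+i])
			}
		}
	}
	return string(out)
}

func main() {
	// "Never gonna " is written once as a literal, then reused by
	// reference, as the <204,13>-style copies do in the lyrics.
	ops := []Op{
		{Literal: "Never gonna give you up\n"},
		{P: 0, N: 12}, // copy "Never gonna "
		{Literal: "let you down\n"},
	}
	fmt.Print(decode(ops))
}
```

The byte-by-byte copy matters: a reference whose length extends past the current end of the output repeats its own freshly copied bytes, which is how long runs compress well.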