Wednesday, December 19, 2007

Is Digg really worth $300m?

There's a rumor going around that Digg is trying to sell itself to somebody (anybody?) for $300m. That seems like a lot of money to me so I went searching for comparative statistics. Prior to taking their series B financing, Digg was rumoured to be looking for (and failing to find) a buyer at $150m.

Back in 2005 News Corp. bought Intermix for $580m. Intermix spent $69m of that money to acquire the remaining 47% of MySpace that it didn't already own. The company had 27 million unique users per month at the time of acquisition. So they paid roughly $22 per user.

Now take a look at Digg today with about 17.8m unique visitors. Using that same metric Digg would be valued at $392m.

But now look at the attention span of those people on the site.

Visitors to look at an an average of 37 pages per visit whereas Digg users look at 5. In terms of time Digg users spend 0.04% of their time on Digg, whereas MySpace engages users for 7.3% of their time. So > 7 times the pages viewed, plus 180x more time spent on MySpace than Digg.

Now, the average stay at Digg is 2m25s and MySpace is 24m26s. So people spend 12 times longer on MySpace than on Digg.

So MySpace is way more engaging that Digg, which isn't surprising given that Digg is place you go to go someplace else. You're unlikely to spend long there.

So how do you discount the $392m based on engagement..? By % of time spent there Digg is worth $2m, by length of stay it's $32m, by pages viewed it's $56m.

Now, Digg has taken about $11m in VC money so assuming participating preferred etc. the VCs aren't going to be happy below that $32m number.

My guess is that Digg is worth somewhere between $50m and $100m. Of course, I don't know what their revenues are and if they are making a lot on advertising then those numbers could change. The rumour has always been that Digg users don't click on ads so their revenue might not be spectacular.

Another telling statistic on Digg is its Alexa rank which dropping fast.

And if they are out there looking to be sold the price will be depressed.

$300m: too high IMHO.

Wednesday, December 05, 2007

The Seven Point Scale

For some time now I've noticed the scoring things on a scale of 1 to 7 seems to be a good way of evaluating some analogue or continuous phenomenon. I was first introduced to 7 point scales by John Ousterhout (who's best known as the Tcl instigator). John likes to use a 7 point scale to evaluate interviewees as follows:

1Worst candidate imaginable. I will quit if you hire them.
2Very negative. Will argue strongly against hiring.
4Totally ambivalent about this candidate.
6Enthusiastic. Will argue strongly to hire.
7Best candidate imaginable. I will quit if you don't hire them.

And we used to look for candidates with 5 and above votes from all interviewers and at least one 6. We did once have someone vote 7 on a candidate, but it's very rare to see 1 or 7.

Turns out that seven point scales are not that uncommon. Kinsey used one in defining types of sexuality:

1Exclusively heterosexual.
2Predominantly heterosexual, only incidentally homosexual.
3Predominantly heterosexual, but more than incidentally homosexual.
4Equally heterosexual and homosexual.
5Predominantly homosexual, but more than incidentally heterosexual
6Predominantly homosexual, only incidentally heterosexual
7Exclusively homosexual.

And there's actually research into the accuracy of 7 points scales. See for example this report that indicates that 7 point scales give as much information as 10 point scales when rating happiness. And here's another paper recommending seven point scales for measurement. And then there's the Likert scale which has been in use since the 1930s which has 5 and 7 point variants.

Seven point scales are neat because they have a clear middle point and between the middle and end points there are just two choices. That gives them to capture variations in opinions without presenting too many choices (leading to vacillation) or too few (meaning that too much data is lost). I'm using them lots of different places: most recently in the votes on books I've recently read.

Monday, December 03, 2007

Transitive decay in social networks

When I was doing research in computer security we often used to say "trust isn't transitive". What we meant was that if Alice trusts Bob and Bob trusts Charlie then we can't assume that Alice trusts Charlie. Another way to think of this is to look at your own friends and friends of friends; do you trust the friends of friends? It's likely that you do not (if you did then it's likely that they would actually already be a friend and not a FOAF).

Clearly, trust is not a constant value across all friends, so each of your N friends will have a trust value ("how much you trust them"), which I'll call Ti, assigned to them. A friend you'd trust with your life has Ti = 1 (perhaps they're a candidate for a BFF), and a friend you don't trust at all has Ti = 0. (I'll ignore the question of why you even have friends with Ti = 0, but in the context of computer social networks you probably do have some).

In social situations we are only exposed to this FOAF trust problem occasionally, but with 'social networking' a current web buzzword we see social networks, or social graphs, and can traverse them. Many web sites are trying to use this graph traversal to build services (e.g. LinkedIn allows you to send requests across the network, or ask questions; Facebook is hoping that graph traversal will be the new application distribution method).

But any graph traversal suffers from the FOAF trust problem. In a social network online this gets expressed by statements like "Just because Alice likes the Werewolf application and shares it with Bob and Charlie is friends with Bob, that doesn't mean that Charlie wants to be a Werewolf", or "A message crossing between more than one hop won't get passed all the time".

I dare say that LinkedIn, Facebook and others could actually characterize the rate at which the FOAF attenuates messages (be they actual messages, application installations, or any other meme) passing through the network.

I'm going to posit that the amount of trust a user would place in a FOAF (and a FOAFOAF, a FOAFOAFOAF, ... ) decays rapidly with the number of FOAF hops traversed.

Intuitively, if Alice trusts Bob with Talice,bob and Bob trusts Charlie with Tbob,charlie then how much does Alice trust Charlie? Less than she trusts Bob because Charlie is not her friend, and she can only evaluate Charlie based on Bob's recommendation. The more she trusts Bob the more she should trust Charlie. So some sort of estimate Talice,charlie is created from Talice,bob and Tbob,charlie taking into account these trust estimates.

A simple combination would be Talice,charlie = Talice,bob * Tbob,charlie (this assumes quite the opposite of the original declaration above: here trust is transitive to a certain degree).

The problem with this is that it treats all trust relationships as having equal weight, no matter how far they are from the original person (in this case, Alice). Imagine the case where Alice trusts Bob with Talice,bob = 1, Bob trusts Charlie Tbob,charlie = 1. This formula gives Talice,charlie = 1 which would probably not reflect most people's intuitive grasp of trust. If in addition, Charlie trusts Dave with Tcharlie,dave = 1 then we get Talice,dave = 1 which seems even more unlikely.

What's needed is a way to decay trust the further apart people are.

One way to to this is for each person to have their own damping factor the encodes how much they trust another person's trust. So Alice might trust other people's recommendations with factor Dalice (in the range [0,1]). The formula would be updated to have

Talice,charlie = Dalice * Talice,bob * Tbob,charlie

Talice,dave = Dalice * Talice,charlie = Dalice * Talice,bob * Dbob * Tbob,charlie * Tcharlie,dave

But that's still essentially linear. I think trust looks more like an inverse square law so that distance is explicitly encoded. With that

Talice,charlie = Dalice * Talice,bob * Tbob,charlie / 1^2

Talice,dave = Dalice * Talice,charlie / 2^2 = Dalice * Talice,bob * Dbob * Tbob,charlie * Tcharlie,dave / 4

This seems to fit better intuition because trust of distant people drops away very rapidly. Now, since this is only a hypothesis it would be interesting to measure the reach of messages passing inside a social network to look at the actual 'pass it on' rates to see if they match intuition.

Anyone out there got lots of social network data I could analyze? Perhaps there's a Facebook application developer who's tracked enough invite/install data that this could be verified.

Saturday, December 01, 2007

Double-checking Dawkins

I've been reading through Richard Dawkins' books and am currently half way through The Blind Watchmaker (2006 paperback edition) and on page 119 he writes:

In my computer's ROM, location numbers 64489, 64490 and 64491, taken together, contain a particular pattern of contents---1s and 0s which---when interpreted as instructions, result in the computer's little loudspeaker uttering a blip sound. This bit pattern is 10101101 00110000 11000000.

Of course, this piqued my curiosity. Did Dawkins just make that up, or is this really the contents of a specific bit of memory on a specific computer?

The book was first published in 1986, so I just had to figure out what it was. Starting with the instructions and converting to hex we have AD 30 C0. Now, considering the main processors around at the time there are three possible interpretations of these three bytes:

Z-80/8080 XOR L ; JR NC C0

6502: LDA C030

6809: JSR 30C0

The first didn't look at all plausible, but both the other two do. The 6809 looks like it could be calling a sub-routine at location 30C0 and the 6502 would work if C030 were actually memory-mapped I/O and a read was necessary to cause the blip.

I couldn't find any machine with a meaningful interpretation of the 30C0 location on a 6809 machine, but 6502 turned out to be a different story.

On an Apple ][ memory in the range C000-C0FF is mapped to various bits of I/O:

The memory space on the Apple II between $C000 and $CFFF was assigned to handle input and output. From $C000 to $C0FF the space was reserved for various soft-switches used to control the display, and various built-in I/O devices, such as the keyboard, paddles, annunciators, and the cassette port.

Poking around further I discovered that a read from C030 (known as SPKR) will blip the speaker on an Apple ][. So Dawkins is telling the truth and he's using an Apple ][.

So returning to the memory locations what program is he talking about? The three memory locations are FBE9, FBEA and FBEB. That turns out to be inside the Monitor ROM on an Apple ][ (the stuff that Woz wrote) and happily a complete ROM listing was provided with the Apple ][.

So looking in a scanned Apple ][ manual we find that those locations are inside the BELL2 routine. Specifically on page 163 of the manual we find the following:

FBE6: 20 A8 FC 598 JSR WAIT 1 KHZ FOR .1 SEC
FBE9: AD 30 C0 599 LDA SPKR
FBEC: 88 600 DEY
FBEF: 60 602 RTS2B RTS

The line in bold is exactly the code that Dawkins is referrring to.

So, while writing this famous text on evolution, Dawkins had time to poke around inside the Apple ][ ROM and write about it in his book.


Friday, November 30, 2007

Windows just likes to mess with me

I don't use Windows very much, but when I do my trusty Windows 2000 SP4 VM comes in handy (I guess now that Vista is out I should upgrade to XP). This morning I installed a big chunk of Windows updates that needed dealing with and Windows said:

Just why is the Restart Now button greyed? :-)

Tuesday, November 27, 2007

Facebook meets the Birthday Paradox

Logging in to Facebook this morning there was a great demonstration of the Birthday Paradox (which isn't actually a paradox, it's just that people get surprised by it).

I have 95 'friends' on Facebook, and this Thursday three of them have a birthday. Wow! Or not, wow, in fact since the calculation in the birthday paradox shows us that this is very likely to happen.

Once you reach 57 friends there's a 99% likelihood that two share he same birthday, with 95 friends your getting very close to 100%. So the fact that three people have the same birthday is not at all unlikely. In fact the birthday paradox can be generalized to cover more than two birthday's being the same. My three birthday example with 95 friends would happen with probability well over 50% (which happens with 88 friends).

Thursday, November 15, 2007

Steve Gibson's PPP... new version 3 in Java and C

If you've been following along you'll know that I've implemented Steve Gibson's PPP system in Java and C. The Java code is in the form of a CLDC/MIDP project for cell phones and the C code is a command-line tool.

Steve's made a major change to the algorithm which he's calling PPPv3. My updated code is here:

The C source code is available in

The Java source code is available in

The compiled Java is available in ppp3.jar.

Read my original blog posts for details.

Tuesday, November 13, 2007

Cryptographically and Constantly Changing Port Opening (or C3PO)

In another forum I was just talking about a little technique that I came up with for securing a server that I want on the Internet, but to be hard for hackers to get into. I've done all the right things with firewalling and shutting down services so that only SSH is available. But that still leaves port 22 sitting there open for someone to bang on.

So what I wanted was something like port knocking (for an introduction to that you can read my DDJ article Practical Secure Port Knocking). To avoid doing the classic port knocking where you have to knock the right way to open port 22 I came up with a different scheme which I call Cryptographically and Constantly Changing Port Opening or C3PO.

The server and any SSH client who wish to connect to it share a common secret. In my case that's just a long passphrase that we both know. Both bits of software hash this secret with the current UTC time in minutes (using SHA256) to get 256 bits of random data. This data changes every minute.

The 256 bits gives me 16 words of random data. Those 16 words can be interpreted as 16 different port numbers. Once a minute the server reconfigures iptables to open those 16 ports forwarding one of them (which corresponds to word[0] in the hash) to SSH and the other 15 to a blacklist service.

At any one time 16 ports are open (i.e. respond to a SYN) with only one being SSH and the other 15 being a trap to be sprung by an attacker. The 16 ports change once a minute.

Since both sides can compute the hash the client is able to compute where the SSH server is residing at that moment and contact it. Once contact is established the connection remains open for the duration of the session. New sessions, of course, will need to recompute the hash once a minute.

The blacklist service serves to tarpit an attacker. Any connection attempt to one of the other 15 sockets causes the IP address of the attacker to be blacklisted (again an iptables change) which means that hitting any of the 15 ports causes the attacker to shut off their access to the SSH server for the next 15 minutes.

A casual NMAP of my machine gets your IP address blacklisted and shows up a random selection of open ports. A real user connects to the SSH server first time because they know where it resides.

Of course, this doesn't replace making sure that the SSH server is up to date, and that passwords are generated carefully, but it seriously frustrates a potential attacker.

If this is of interest to others I'd be happy to release my code (which is currently in Perl) as a GPL2 project.

The effect of trying to save electricity

Over the last couple of years I've tried to cut down electricity use in two simple ways:

1. Gradually replace incandescent light bulbs with low power fluorescents. I only do this when an incandescent bulb dies, currently I'd estimate that half the bulbs in the house are converted.

2. Switching off appliances that are not in use. This has been achieved by grouping appliances on power strips with physical on/off switches. When not in use I can kill off the TV and related appliances, the Internet and all networking, computers and peripherals.

I've not got a couple of years of data from electricity bills. For the first 10 months of 2006 I used 2,821 kWh of electricity, for the same period on 2007 (during which I've implemented 1 and 2 above) I've used 2,345. That's 83% of the amount used in 2006. And that's despite an increase in electricity use during the winter of 2007.

Here's a chart which shows the trend.

Sunday, November 04, 2007

PPPv2 in Java and C

Recently, I released C and Java versions of Steve Gibson's PPP system for password generation. Steve updated the algorithm to PPPv2 which uses a different hash (SHA256 instead of SHA384) and a slightly different plain text generation algorithm (see the PPP pages for details).

I've now updated my code to PPPv2 and am releasing it here.

The C source code is available in

The Java source code is available in

The compiled Java is available in ppp.jar.

Read my original blog posts for details. All source code is released under the BSD License.

Tuesday, October 30, 2007

Write once, debug everywhere

I basically never program in Java. A while back I was forced to write a JNI wrapper for polymail, but just recently I got to write a complete, if little, Java program.

You can read about the program here.

It worked great in Sun's phone emulator, it worked great on my Motorola RAZR that's running Sun's JVM. And then someone else tried it out. And it failed. It failed in a weird way... the text labels on UI elements were invisible. The program itself ran, but there's no text.

This is on a Motorola Q with Windows Mobile 5.0 and the IBM WebSphere Everywhere Micro Environment. That environment is supposedly compatible with the CLDC 1.0/MIDP 2.0 that I built the JAR for. That's the same environment running on my RAZR.

So much for write once, run everywhere. 12 years, let me just repeat that: 12 years, after the introduction of this technology I'm reduced to downloading IBM's entire environment just to be able to debug an application because it doesn't work.

And people wonder why I shy away from Java. Every time I touch it, it bites me.

Monday, October 29, 2007

A Java client implementation of Steve Gibson's PPP

I recently produced an open source implementation of Steve Gibson's Perfect Paper Passwords system in C. It occurred to me that a better implementation would be a Java client for my mobile phone (thus eliminating the need for printing and carrying the paper passwords).

Here's my PPP client implementation running on my Motorola RAZR. It's written in Java using CLDC 1.0 and MIDP 2.0.

You can download and install the JAR file. The current version is 1.0.0.

Thursday, October 18, 2007

Times Square: a fun spammer GIF

Nick FitzGerald reported a neat spammer image trick to me the other day. It's entered in The Spammers' Compendium that involves using animation to display the word Viagra emulating a flashing neon sign.

Since many OCR systems merge the layers together before OCR this image is actually in the 'wrong' order. Once merged the letters are in the order VIRAAG.

Monday, October 15, 2007

SOC Update and Google Maps integration

After receiving some feedback on my Simple code for entering latitudes and longitudes I've made a couple of changes:

1. Replace the letter V with the symbol @ in the alphabet to remove confusion between U and V. Implementations should automatically map V to U if entered.

2. Changed the checksum to the following calculation:

C = p0 + p1 * 2 + p2 * 3 + p3 * 4 + p4 * 5 + p5 * 6 + p6 * 7 + p7 * 8 + p8 * 9 mod 29

To make it a bit easier to visualize here's an integration of SOC with Google Maps. You can either type in an address to navigate to that address and see the SOC, or type in a SOC to navigate to that location.

Enter address to find:

Or, enter a SOC:

Friday, October 12, 2007

An open source implementation of Steve Gibson's PPP algorithm

Steve Gibson has come up with a simple two-factor password scheme that relies on printed cards of passcodes generated using a combination of SHA-384 and Rijndael. The idea is that a system could prompt the user for one of the passcodes in addition to their normal password.

Steve calls this his Perfect Paper Passwords system and has given a detailed description of the algorithm.

As usual he's released code written in assembly language as a DLL for Windows. He hasn't released his source code (he never does), so I thought it would be interesting to write my own implementation of his algorithm. Here's the C code:

#include <sys/time.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#include "rijndael.h"
#include "sha2.h"

#pragma pack(1)

typedef unsigned char Byte;

typedef union __Passcode {
unsigned long as_long;
struct {
Byte byte[4];
} bytes;
} Passcode;

typedef struct __PasscodeString {
char character[5];
} PasscodeString;

typedef unsigned long long SixtyFour;

typedef union __OneTwoEight {
struct {
SixtyFour low;
SixtyFour high;
} sixtyfour;
Byte byte[16];
} OneTwoEight;

typedef struct __SequenceKey {
Byte byte[SHA384_DIGEST_SIZE];
} SequenceKey;

typedef unsigned long DWord;

const char * alphabet = "23456789!@#%+=:?abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPRSTUVWXYZ";

void inc( OneTwoEight * i )
if ( i->sixtyfour.low == 0 ) {

void add( OneTwoEight * to, OneTwoEight * addend )
SixtyFour low = to->sixtyfour.low;

low += addend->sixtyfour.low;

if ( ( low < to->sixtyfour.low ) || ( low < addend->sixtyfour.low ) ) {

to->sixtyfour.low = low;

void ConvertPasscodeToString( PasscodeString * passcodeString,
Passcode passcodeValue )
Byte bytes[4];

bytes[0] = passcodeValue.bytes.byte[0] & 0x3f;
bytes[1] = ( ( passcodeValue.bytes.byte[0] & 0xc0 ) >> 6 ) +
( ( passcodeValue.bytes.byte[1] & 0x0f ) << 2 );
bytes[2] = ( ( passcodeValue.bytes.byte[1] & 0xf0 ) >> 4 ) +
( ( passcodeValue.bytes.byte[2] & 0x03 ) << 4 );
bytes[3] = ( ( passcodeValue.bytes.byte[2] & 0xfc ) >> 2 );

int i;
for ( i = 0; i < 4; ++i ) {
passcodeString->character[i] = alphabet[bytes[i]];

passcodeString->character[4] = '\0';

void RetrievePasscodes( Passcode passcodeListBuffer[],
OneTwoEight firstPasscodeNumber,
int passcodeCount,
SequenceKey * sequenceKey )
int i;

#define KEY_BITS (int)256

for ( i = 0; i < KEYLENGTH(KEY_BITS); ++i ) {
key[i] = sequenceKey->byte[i+16];

unsigned long rk[RKLENGTH(KEY_BITS)];
OneTwoEight plain;

for ( i = 0; i < 16; ++i ) {
plain.byte[i] = sequenceKey->byte[i];

OneTwoEight block = firstPasscodeNumber;
unsigned int skip = (unsigned int)(block.sixtyfour.low & 0xF);

SixtyFour carry = block.sixtyfour.high & 0xF;
block.sixtyfour.high >>= 4;
block.sixtyfour.low >>= 4;
block.sixtyfour.low |= (carry << 60);

OneTwoEight temp = block;
add( &block, &temp );
add( &block, &temp );
add( &plain, &block );

int nrounds = rijndaelSetupEncrypt( rk, key, KEY_BITS );
Byte cipher[16*3];

int c = 0;

while ( passcodeCount > 0 ) {
rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[0] );
inc( &plain );
rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[16] );
inc( &plain );
rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[32] );
inc( &plain );

for ( i = skip; ( i < 16 ) && ( passcodeCount > 0 ); ++i ) {
passcodeListBuffer[c].bytes.byte[0] = cipher[i*3];
passcodeListBuffer[c].bytes.byte[1] = cipher[i*3+1];
passcodeListBuffer[c].bytes.byte[2] = cipher[i*3+2];

skip = 0;

void GenerateSequenceKeyFromString( char * string,
SequenceKey * sequenceKey )
sha384( (const unsigned char *)string, strlen( string ),
(unsigned char *)sequenceKey );

void GenerateRandomSequenceKey( SequenceKey * sequenceKey ) {
struct timeval t;
gettimeofday( &t, 0 );

char t_buffer[61];
strftime( t_buffer, 60, "%c%d%e%H%I%j%m", localtime( &t.tv_sec ) );

char msecs_buffer[32];
sprintf( msecs_buffer, "%ld", t.tv_usec );

char hostname_buffer[256];
gethostname( hostname_buffer, 255 );

char pointer_buffer[16];
sprintf( pointer_buffer, "%p", sequenceKey );

char loadavg_buffer[256];
double samples[3];
getloadavg( samples, 3 );
sprintf( loadavg_buffer, "%f%f%f", samples[0], samples[1], samples[2] );

char buffer[1024];
sprintf( buffer, "%s-%s-%s-%s-%s", t_buffer, msecs_buffer, hostname_buffer,
pointer_buffer, loadavg_buffer );

GenerateSequenceKeyFromString( buffer, sequenceKey );

int ConvertHexToKey( char * hex, SequenceKey * key )
int i, j;

for ( i = 0, j = 0; i < 96; i += 2, ++j ) {
char pair[3];
sprintf( pair, "%c%c", hex[i], hex[i+1] );
int x;
sscanf( pair, "%x", &x );
key->byte[j] = (Byte)x;

int main( int argc, char * argv[] )
if ( argc == 1 ) {
printf( "Error: You must provide the passphrase or sequence key as the first parameter\n" );
return 1;

SequenceKey key;

if ( strlen( argv[1] ) == 0 ) {
printf( "Generating random sequence key\n" );
GenerateRandomSequenceKey( &key );
} else {
if ( ( strlen( argv[1] ) == 96 ) && ( ConvertHexToKey( argv[1], &key ) ) ) {
printf( "Using entered sequence key\n" );
} else {
printf( "Generating sequence key from passphrase\n" );
GenerateSequenceKeyFromString( argv[1], &key );

printf( "Sequence Key: " );
int i;
for ( i = 0; i < SHA384_DIGEST_SIZE; ++i ) {
printf( "%2.2x", key.byte[i] );
printf( "\n" );

if ( argc == 4 ) {
OneTwoEight firstPasscode;

// Warning! This only uses the bottom 64-bits of argv[2] and hence
// can't convert a much higher number

firstPasscode.sixtyfour.low = atoi( argv[2] );
firstPasscode.sixtyfour.high = 0;

int count = atoi( argv[3] );

Passcode * pcl = malloc( sizeof( Passcode ) * count );

RetrievePasscodes( pcl, firstPasscode, count, &key );

for ( i = 0; i < count; ++i ) {
PasscodeString str;
ConvertPasscodeToString( &str, pcl[i] );
printf( "%s ", &str.character[0] );

printf( "\n" );

return 0;

Now that's a little less flexible than all the options given in Steve's ppp.exe implementation, but it does compute the correct output and can easily be modified if you want your own implementation with source of PPP.

It uses this SHA-384 implementation and this Rijndael implementation.

Here's the output of my ppp program producing the first 70 passcodes:

$ ./ppp 53303f97ddcf91ed74391fc5c3661246
fb94f8fbeefa5f1e9263c12878e0a95e 0 70
Using entered sequence key
Sequence Key: 53303f97ddcf91ed74391fc5c3661246
VJNV gHoF PaRp T8FS tGw2 s%iT u7rp
ZvN@ MWGb %574 ?DVF btRq PLTA DDtm
C2TP Yin8 zMF@ a8%H zHvq Uwxc qkF7
YuUk 8Ca? :ZvZ T9:? wki+ KiHq d?9b
GY%5 !igR [email protected] [email protected] eyVm 5PwY CAVs
oKzK 43Mc nR%? [email protected] oZUs Tbec xn6B
9bVA UvJt DfAX =Gqp 7Abj M:6Y ENRs
49?Z Z?Hk G+A? zoK5 :Z8N z8NU WpM!
=AB% RrSq %7:Y %=P8 RKXr di#5 4T3L

Feel free to take my code and use it under the BSD license.

Thursday, October 11, 2007

More spammer crossword creativity

Nick FitzGerald writes in with a variant of the "1 across, 3 down" spammer content trick which looks like this:

The neat thing is that the crossword is created using HTML in a way that prevents a simple HTML-stripping spam filter from reading the brand names. To a simple spam filter this looks like:


The actual HTML (cleaned up by Nick) is:

<DIV align=right>
<DIV align=center>
<DIV align=center>

Friday, September 07, 2007

Wildfire has launched

I wrote the other day about what I saw as the problems with social news web sites and my proposed alternative. That alternative, the Facebook application called Wildfire has now launched.

Based on feedback from users a couple of changes have been made:

1. The standard page you see is a random selection of stories, you can pick any of these and forward them to your friends and display them in your profile.

2. Wildfire will now pull down RSS feeds you specify and build your "My Fires" list so that you can choose to forward some or all of the stories from RSS. RSS feeds are pulled down twice a day.


Tuesday, September 04, 2007

The problems with social news

I recently launched my own 'social news' application, call Wildfire. I did this because looking around at existing social news sites like Digg and Reddit I felt like they left a lot to be desired.

Before telling you about Wildfire, let me tell you what I think it wrong with social news as it stands:

1. The Wisdom of Crowds incorrectly implemented

It seems to me that most people who use the phrase "Wisdom of Crowds" haven't read the book, or if they have read the book they haven't understood the concept. For me, social news sites fail to produce interesting links because they rely more on herd mentality than WoC.

If you visit the front page of Digg or Reddit you are being presented with already filtered news and you are offered the opportunity to 'digg' stories you like. This is on the on-line equivalent of going to the same restaurant as everyone else, just because it's full. Digging a front page story is pointless.

The WoC partly relies on not knowing what other people think. Galton's ox-weight-guessing relied on averaging the opinions people gave of an ox's weight without collaboration. All current social news sites fail because they expose other people's votes to readers.

If social news sites really wanted the WoC they'd present users with a random selection of news stories and ask them to vote. Then they'd get to see the top stories without a chance to vote. Of course, no site is likely to do that, what people want is an instant page of links.

2. Misplaced downvoting

Reddit has the ability to downvote a story. This has a two-fold effect: it reduces the number of votes on a story (and thus its position on Reddit) and it trains the collaborative filter that Reddit uses to give recommended use. This is a mistake.

Either downvoting is a way of telling the world you think a story is unworthy, or it's for personalization. You might think they are the same thing: surely if enough people don't want a story then their personalization is the same as the global view. Wrong. Only if you can assume that my world view is your world view. Which leads me to...

3. No social aspects

There isn't one society. Society is covered (in a topological sense :-) by many overlapping societies. Yet social news sites have a 'one size fits all' mentality.

Each social news site has a personality and you have to choose the site that best fits your views. Reddit is very left learning and, frankly, a bit snobbish. Digg is younger, more raw, and more low brow.

You shouldn't have to choose the site that fits your personality. The site should tailor to you.

4. No homophily

We tend to associate with people who have similar tastes and views to us. This is known as homophily. Current social news sites fail to deliver on the power of homophily. The reality is that my interests are closely aligned with my friends' interests, and not with some one size fits all view of the web.

Currently with their badly implemented WoC, missing social aspects and lack of homophily social news doesn't deliver.

Sure, social news is hot: any filter is better than having none, but that doesn't make it good. Which brings me to Wildfire.


Wildfire is a social news application built inside Facebook. It works as follows:

  • Wildfire users submit stories by giving a URL and a title (just like any news site)
  • Submitted stories are automatically spread to the submitter's friends (and appear in the friends' profiles)
  • Friends can choose to further spread news they receive to their friends. When the do this the original submitter (and every up the chain who spread the news) get a point of karma. This means that karma is only gained when your friends think a story you spread (or submitted) is worth spreading further.
  • You see two views of stories: those that were spread to you (and hence should be relevant because of homophily) and the 'global view' (called Major Fires) which are the top stories across all users.

Wildfire is designed to deliver you a small number of stories that you'll care about. These appear in your Facebook profile. Currently, I have not submitted Wildfire to Facebook for approval because it needs testing. If you use Facebook and want to try it out please add Wildfire and give me your feedback.

Wednesday, August 29, 2007

Why you don't want to code for a government department

Back in the mists of time, straight after my doctorate, I worked for a UK start-up called Madge Networks initially maintaining device drivers that implemented LLC, NetBIOS, IPX/SPX protocols and then writing a TCP/IP stack. Most of this work was done in C and assembler (x86 and TMS380).

When I first joined the company I was sent on an x86 assembly training course run by QA Training. (It rained on the first day and we were locked out so one of the company big cheeses ran over with QA Training umbrellas; to this day I use that umbrella).

During the course we were asked to write a simple function in C. I've forgotten what it was, but let's say it was a classic factorial function. I wrote something like:

unsigned int fac( unsigned int i )
if ( i < 2 ) {
return 1;

return i * fac( i - 1 );

Later we looked at an assembly equivalent of the function, but before that I took at look at the person sitting next to me. His function looked like this:

unsigned int aq456( unsigned int aq457 )
if ( aq457 == 0 )
return 1;

if ( aq457 == 1 )
return 1;

return aq457 * aq456( aq457 - 1 );

So, naturally I asked him why he used such horrible names for functions and variables .

It turned out that he worked for a government department and all identifiers were allocated and documented before the code was written. Hence, somewhere a document would tell you that aq456 was a factorial function with excruciating detail of its parameters and return values. And also you could discover that aq457 was the parameter for aq456.

He recounted how each project had a unique two letter code (he'd chosen aq as being qa backwards) followed by a sequence number.

He'd chosen aq because all the projects had worked on were aa, ab, ac, ... the department had apparently never got past 26 projects.

I wonder why?

Wednesday, July 18, 2007

Adding application ratings to Facebook Applications

I have a few Facebook applications which have a modest number of users (Easter Egg, Four Secrets, One Lie and My Days) and it started to annoy me that Facebook does not have a way for users to rate applications.

Well, now they do.

A Free Service

I've put together a little service that allows any Facebook application developer to have their app rated by users and display the rating on their application's About Page. To see how this works take a look at the About Page for My Days. At the bottom there's a section where the current application rating is shown, and the user is given the chance to vote on the app.

All the voting takes place on my server, and I've managed to create HTML that the About Page will accept that dynamically updates the voting as people vote.

Adding rating to an application's About Page

If you have a Facebook application and want to have voting on your app., then simply add the follow HTML to the app. About Page text at the bottom:

Rating: <div style="height: 16px; width: 85px; background-image: url(; background-repeat: no-repeat"></div>
Rate this app: <a href="">1 star</a>, <a href="">2 stars</a>, <a href="">3 stars</a>, <a href="">4 stars</a>, <a href="">5 stars</a>

Here's what that looks like when the About Page is shown:

Adding rating to an application's Canvas Page

To add app rating to the Canvas Page of your application (so that users can rate from within the app) you add the following HTML:

<hr />
Rating: <div style="height: 16px; width: 85px; background-image: url(; background-repeat: no-repeat"></div>
Rate this app: <a href="">1 star</a>, <a href="">2 stars</a>, <a href="">3 stars</a>, <a href="">4 stars</a>, <a href="">5 stars</a>

Replacing YOUR_APP_ID with the application ID (i.e. the number on the app's About Page) of your application.

View the top applications

Anyone can view the top applications by clicking this link.

Warning to fraudsters

A note for fraudsters: behind the scenes I'm monitoring how the voting is taking place and where the links are coming from, even how they were inserted. Don't mess with this, or risk having your application blacklisted. It would be a pity if trying to game the system resulted in your application's rating disappearing completely :-)

Saturday, July 14, 2007

Last push for POPFile voting

POPFile has been nominated for a SourceForge Community Choice Award due to the efforts of many people.

Voting closes on July 20 and there's strong competition in POPFile's category from the likes of Pidgin (formerly GAIM) and phpBB.

If POPFile wins SourceForge will be making a donation to a charity that I picked: Doctors without Borders.

If you feel like voting for POPFile, please vote in the Best Project for Communications category here:

Thursday, July 05, 2007

Stop me before I code again: another Facebook application

I woke up with a brainstorm and I just had to implement it. You may remember the Five Things meme that was going around the blogosphere a while back. I thought it would be fun to have something similar on Facebook, but with a little twist: one of the things is a lie.

So, I created the Four Secrets, One Lie application which lets you post five things people don't know about you, one of which is untrue. The application randomizes the order and places them on your Facebook profile page.

To make it interesting for the visitor you can click on each secret to discover whether it's true or not.

Here's my current set of Four Secrets, One Lie:

* John is a qualified SCUBA diver.
* John is a closet Meatloaf fan.
* John plays the drums in French Meatloaf tribute band "Pain de Veau"
* John speaks French very well.
* John thinks Neve Campbell is just perfect.

I have just one more idea for a Facebook application and then I'd better stop coding these things, it's getting addictive.

Tuesday, July 03, 2007

So, I built a Facebook application: Easter Egg

I've been playing with Facebook for a little over a week and I couldn't resist trying out the Facebook Platform. To be honest despite the missing documentation and changing APIs, they've done a nice job.

My first attempt at an application took a little over 3 hours and is called Easter Egg. It lets you leave Easter Eggs on your profile that only certain visitors can see. So, for example, you could leave a little love note for your girlfriend, or details of a secret party for your buddies, or just say Hi! when your Mom visits. And here's a screen shot:

If you are a Facebook user and like that sort of thing you can add Easter Egg.

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users.

Now it's time to vote.

If you love POPFile, please vote for it in the Best Project for Communications category.

Monday, July 02, 2007

The Long Tail of Facebook Applications

Lately, I've been developing a little application using the Facebook Platform (my Easter Egg and Four Secrets, One Lie application) and I started wondering about the statistics surrounding Facebook applications. So, a little WWW::Mechanize application later here's some facts that are true today:

1. There are 1,225 Facebook applications available in the Facebook directory.

2. There are 73,109,074 installed applications in Facebook users' profiles.

3. Hence the average application has about 59700 users.

The problem with the average is that it hides a savage reality of Facebook applications: some have many users, most have almost none. To analyze that I broke the data logarithmically counting the number of applications with greater than 1,000,000, 100,000, 10,000, 1,000, 100, 10 and 0 users.

At least X usersTotal applicationsPercent of total

So, 86% of applications on Facebook have less than 10,000 users, with 62% having less than 1,000 users. The 'big applications' (with over 100,000 users) account for 5% of all the applications, and only 2% make it over 1,000,000 users. So for most people developing a Facebook applications means very, very few users.

Graphing that you get a classic Long Tail picture:

The top 21 applications (over 1,000,000 users are):

ApplicationTotal users
Top Friends7,490,104
Fortune Cookie4,107,451
X Me3,628,523
Free Gifts3,106,403
Honesty Box1,904,742
My Questions1,556,722
Favorite Peeps!1,459,334
Food Fight!1,387,109
HOT or NOT1,138,625
Where I've Been1,027,373

Another way to examine the Facebook data is to ask who the most prolific authors are. The top most profilic authors are:

AuthorUser count
Goowy Media106,248
Sze Tan83,670
Patrick Shyu827,621
Kevin Koder8230,164
Pete Crucian7503
Lucy Silverman71,180
Sari Williams6599
Zach Smith619,397
James Tyler6487

Only Kevin Koder (with 230,164 users spread of 8 applications) is anywhere near being a 'top application' on Facebook. Kevin's top application is Family Guy Quotes.

The top 10 authors by number of users are:

AuthorTotal user
Jia Shen4583310
R. Tyler Ballance4556234
Ted Suzman4085317
Nikil Gandhy3351734
Zachary Allia3107219

Unsurprisingly Facebook is one of the most popular authors given their advantage of actually owning the site. Ignoring the companies on the list there are a few individuals who've hit the big time on Facebook. Whether that turns in to any money for them is another thing... but Facebook benefits from this Long Tail because the cost to them of the vast underbelly of barely used applications is close to 0: they don't even host those applications, you do.

Tuesday, June 26, 2007

Pretty Darn Fancy: Even More Fancy Spam

Looks like the PDF wave of spam is proceeding with a totally different tack. Sorin Mustaca sent me a PDF file that was attached to a pump and dump spam. Unlike the previous incarnation of PDF spams, this one looks a lot like a 'classic' image-spam.

It's some text (which has been misaligned to fool OCR systems), but there's a little twist. Zooming in on a portion of the spam shows that the letters are actually made up of many different colors (a look inside the PDF it's actually an image.

I assume that the colors and the font misalignement is all there to make it difficult to OCR the text (and wrapping it in a PDF will slow down some filters).

Extracting the image and passing it through gocr gave the following result:

_ Ti_ I_ F_ _ Clih! %

%_ % _c. (sR%)
t0.42 N %x

g0 _j_ __ h_ climb mis _k
d% % %g g nle_ Frj%.
_irgs__Dw_s _ rei_ � a
f%%d __. n_ia une jg gtill
oo_i_. Ib _ _ _ _ _ _ 90I

And a run through tesseract resulted in:

9{Eg Takes I:-west.tY`S Far Gawd Climb! UP
fatued Sta=:khlauih. This we 1: still

Close, but no cigar.

Thursday, June 21, 2007

Pretty Darn Fancy: Stock spammers using PDF files

Joe Chongq sent me a fascinating spam that is the first I've seen that's using a PDF file to send its information. I've long predicted that we'll see a wave of complex formats used for spam as broadband penetration increases and sending large spams becomes possible.

This particular spam has a couple of interesting attributes:

1. The PDF file itself is a really nicely formatted report about a particular stock that's being pumped'n'dumped.

2. The file name of the PDF was chosen to entice the user further by using their first name. In this case it was called joe_report.pdf.

3. The PDF is compressed using the standard PDF 'Flate' algorithm and totals 84,398 bytes. That's fairly large, but we've certainly seen image spams that were larger. Use of compression here means that a spam filter that's not aware of PDF formats would be unable to read the message content.

Here's what the actual PDF looks like (click for a larger view):


Wednesday, June 20, 2007

jeaig (jgc's email address image generator) launches

Today is launch day for a simple little web site called jgc's email address image generator. It's a web service that enables anyone to generate a CAPTCHA like image containing their email address for insertion on their web site.

Since web crawlers are currently unable to look inside images to scrape email address this means that your email address will not be scraped from a web site, but can be written down by a human.

Here's an example email address:

Made using jeaig

and the code that generates that is:

<img src="
<br />
<font size="-2">Made using <a href="">jeaig</a></font>

The server does not store the email address: when a user enters an email address on the site it is padded with a random amount of random data (both before and after the address) with randomness being supplied by /dev/urandom. Then the address is encrypted using Blowfish using a secret key known only to the jeaig server (this key was also generated from /dev/urandom) and a random IV is chosen.

The encrypted data is then modified base-64 encoded so that it can be used in the URL.

When the image is requested using the special base-64 filename the email address is decrypted and then rendered using CAPTCHA code to produce the image. The image changes each time the email address is loaded.

And, yes, this is a free service.

Monday, June 18, 2007

Escaping comma and space in GNU Make

Sometimes you need to hide a comma or a space from GNU Make's parser because GNU Make might strip it (if it's a space) or interpret it as an argument separator (for example, in a function invocation).

First the problem. If you wanted to change every , into a ; in a string in GNU Make you'd probably head for the $(subst) function and do the following:

$(subst ,,;,$(string))

See the problem? The argument separator for functions in GNU Make is , and hence the first , (the search text) is considered to be separator. Hence the search text in the above is actually the empty string, the replacement text is also the empty string and the ;, is just preprended to whatever is in $(string).

A similar problem occurs with spaces. Suppose you want to replace all spaces with ; in a string. You get a similar problem with $(subst), this time because the leading space is stripped:

$(subst ,;,$(string))

That extra space isn't an argument it's just extraneous whitespace and hence it is ignored. GNU Make just ends up appending ; to the $(string).

So, how do you solve this?

The answer is define variables that contain just a comma and just a space and use them. Because the argument splitting is done before variable expansion it's possible to have an argument that's a comma or a space.

For a comma you just do:

comma := ,
$(subst $(comma),;,$(string))

And everything works.

For a space you need to get a space into a string, I find the easiest way is like this:

space :=
space +=
$(subst $(space),;,$(string))

That works because += always space separates the value of the variable with the appended text.

Now, GNU Make has really liberal variable naming rules. Pretty much anything goes, so it's possible to define a variable with the name , or even having the name consisting of a space character.

First, here's how to define them:

, := ,
space :=
space +=
$(space) :=
$(space) +=

The first line is clear, it does an immediate define of a , to the variable named ,. The second one is a little more complex. First, I define a variable called space which contains a space character and then I use it to define a variable whose name is that space character.

You can verify that these work using $(warning) (I like to wrap the variable being printed in square brackets for absolute certainty of the content):

$(warning [$(,)])
$(warning [$( )])

$ make
Makefile:1: [,]
Makefile:2: [ ]

Yes, that's pretty odd looking, but it gets stranger. Since GNU Make will interpret $ followed by a single character as a variable expansion you can drop the braces and write:

$(warning [$,])
$(warning [$ ])

Now that starts to look like escaping. In the examples above you can use these variables to make things a little clearer:

$(subst $(,),;,$(string))
$(subst $ ,;,$(string))

Note that you have to use the $(,) form because function argument splitting occurs before the expansion and GNU Make gets confused. In the second line the space is 'escaped' with the $ sign.

You might be wondering about other crazy variable names: here are a few that's possible with GNU Make:

# Defining the $= or $(=) variable which has the value =
equals := =
$(equals) := =

# Define the $# or $(#) variable which has the value #
hash := \#
$(hash) := \#

# Define the $: or $(:) variable which has the value :
colon := :
$(colon) := :

# Define the $($$) variable which has the value $
dollar := $$
$(dollar) := $$

; := ;
% := %

You probably don't need any of those, but you never know...

POPFile v0.22.5 Released

Here are the details:

Welcome to POPFile v0.22.5

This version is a bug fix and minor feature release that's been over a
year in the making (mostly due to me being preoccupied by other


SourceForge has announced their 'Community Choice Awards' for 2007 and
is looking for nominations. If you feel that POPFile deserves such an
honour please visit the following link and nominate POPFile in the
'Best Project for Communications' category. POPFile requires multiple
nominations (i.e. as many people as possible) to get into the list of



1. POPFile now defaults to using SQLite2 (the Windows installer will
convert existing installations to use SQLite2).

2. Various improvements to the handling of Japanese messages and
improvements for the 'Nihongo' environment:

Performance enhancement for converting character encoding by
skipping conversion which was not needed. Fix a bug where the
wrong character set was sometimes used when the charset was not
defined in the mail header. Fix a bug where several HTML
entities caused misclassification. Avoid a warning 'uninitialized
value'. Fix a bug that the word's links in the bucket's detail
page were not url-encoded.

3. Single Message View now has a link to 'download' the message from
the message history folder.

4. Log file now indicates when an SSL connection is being made to the
mail server.

5. A number of small bug fixes to the POPFile IMAP interface.

6. Installer improvements:

Email reconfiguration no longer assumes Outlook Express is
present. Add/Remove Programs entries for POPFile now show a more
realistic estimate of the program and data size. Better support
for proxies when downloading the SSL Support files. The SSL
patches are no longer 'hard-coded', they are downloaded at
install time. This will make it easier to respond to future
changes to the SSL Support files. The Message Capture Utility
now has a Start Menu shortcut to make it easier to use the
utility. The minimal Perl has been updated to the latest
version. The installer package has been updated to make it work
better on Windows Vista (but further improvements are still



An introduction to installing and using POPFile can be found in the
QuickStart guide:


SSL Support is offered as one of the optional components by the
installer. If the SSL Support option is selected the installer will
download the necessary files during installation.

If SSL support is not selected when installing (or upgrading) POPFile
or if the installer was unable to download all of the SSL files then
the command

setup.exe /SSL

can be used to run the installer again in a special mode which will
only add SSL support to an existing installation.


The current version of SQLite (v3.x) is not compatible with POPFile.
You must use DBD:SQLite2 to access the database.

Users of SSL on non-Windows platforms should NOT use IO::Socket::SSL
v0.97 or v0.99. They are known to be incompatible with POPFile; v1.07
is the most recent release of IO::Socket::SSL that works correctly.


If you are upgrading from pre-v0.22.0 please read the v0.22.0 release
notes for much more information:


Thank you to everyone who has clicked the Donate! button and donated
their hard earned cash to me in support of POPFile. Thank you also to
the people who have contributed their time through patches, feature
requests, bug reports, user support and translations.


Big thanks to all who've contributed to POPFile over the last year.


Friday, June 15, 2007

Is this me?

A friend forwarded me this image with the subject line: Is this you? Looks like an old publicity for Byte magazine, and sure enough it does look rather like me:

So, all I can say is (Austin Powers accent): "Groovy, man. Yeah, baby, that's me."

BTW. The machine looks a lot like a Versatile 2, or perhaps it's a Vector 3. That would date this picture to around 1977.

Monday, June 11, 2007

Measuring my inbox depth

Some time ago I wrote about how I manage the flow of email hitting me. Back then I didn't talk about what happens to the email after it's been filtered and automatically sorted.

Here's a quick picture of my email folders:

POPFile does the automatic sorting into the sub-folders when I download mail (read the article linked above for details on that) and I then manually move mail that I can't respond to immediately to the ACTION folder (to make this quick I use the lovely TB QuickMove Extension which makes all my mail moving a CTRL-# click away).

Anything that's in ACTION needs to be dealt with. Sometimes that means those messages will be gone in a day, sometimes in weeks, just depends on the contents. There are also a couple of other 'action' type folders called Next Newsletter (where I store interesting stuff to mention in my next spam and anti-spam newsletter) and Next GNU Make (where I store stuff that'll go into next month's CM Basics Mr Make column).

It occurred to me that the depth of my inbox might be an interesting thing to track so I wrote a quick Perl script that parses the mbox file associated with the ACTION folder looking for non-deleted messages (I look at the X-Mozilla-Status header and check that the 0x8 bit is not set) to get a count of the messages in ACTION.

Then I pump that data into rrdtool using the RRD::Simple interface from CPAN and use it to create graphs of the number of waiting items. All this runs off a cron job every five minutes (and once an hour for graph creation).

You can see the 'live' graphs here (these will update hourly when my machine is on):

Tuesday, June 05, 2007

A little light relief in an 'enlargement' spam

Jason Steer from IronPort sent me over an image taken from a recent spam for an 'enlargement' product. Here's the image:

Since the text is large I thought it would be fun to run this through gocr to OCR out the text and URL. Here's the output of gocr on the image (I removed a few blanks lines for clarity here):

__ ___

_ ____

So, that worked out well :-) Nevertheless, the domain is listed in the SURBL:

$ dig
; IN A


So, if you can extract the domain name from the image it's possible to check it against the SURBL and blacklist the message. Switching over to Google's Tesseract OCR system revealed the following:

(I LIT inl 3
IE\)' :
Q ii ,g @1

H ; ik
s$i` wg `i?! J

Friday, May 25, 2007

Back from the EU Spam Symposium; here's my talk

So I'm back home from the 2007 EU Spam Symposium which was held in Vienna in Austria and you can grab my presentation here. You'll notice that the presentation template is from MailChannels. They very kindly sponsored my trip to Vienna and so I did a little publicity for them. There's only one slide, however, that's actually anything to do with MailChannels in the entire presentation, so don't expect a product pitch!

One thing I didn't mention in my talk was that as the number of Internet hosts expands and the number of broadband subscribers grows the number of competing botnets can also grow. That means I'd expect to see the price of botnet rental dropping as the Internet grows leading to lower costs for spammers.

I'll give a complete round up of the conference in my newsletter next week, but overall there were some interesting talks, and meeting some people like Richard Cox from SpamHaus and Richard Clayton was very useful.

Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)

The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processus handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into Internet, the IP address is mapped to the network that manages the IP address, fingerprints of the message are taken and all URLs inside the message are extracted.

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue using VERP for bounce or reponse handling.

The response process dequeues IDs and responsed to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.

The use of queues and a shared database to handle the queues, plus a simple locking strategy means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running and handling all messages in the delay required). It also means that the processus do not need to be on the same machine and the system can scale by adding processus or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue has the knowledge about the work to be done processus can be stopped and upgraded as needed without taking the system off line.

To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/ in the POPFile code, but it looks the same in Signal Spam):

The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).

Perhaps OCRing image spams really is working?

I've previously been skeptical of the idea that OCRing image spams was a worthwhile effort because of the variety of image-obfuscation techniques that spammers had taken to using.

But Nick FitzGerald has recently sent me an example of an image spam that seems to indicate that spammers are concerned about the effectiveness of OCR. Here's the image:

What's striking is that the spammer has used the same content-obscuring tricks that we've seen with text (e.g. Viagra has become [email protected]@), perhaps out of fear that the OCRing of images is working and revealing the text within the images.

Or perhaps this spammer is just really paranoid.

Thursday, April 19, 2007

UseTheSource (re-)launches: social code snippet web site

Almost 8 years ago I registered the domain name (as in Use the Source, Luke!) and for a long time ran it as a blog (back when people were first starting to say the word 'weblog'). It even got me an appearance with Leo Laporte on The Screen Savers (if you are old enough to remember that show :-)

Well, UseTheSource is back, but not as a blog, it's back as a Reddit/Digg style social news web site with a twist: it's only intended to accept links to pieces of code. It's unashamedly for programmers who want to see working code.

So if you see a cool hack, a great explanation of an API, an neat algorithm think about submitting it to UseTheSource. You can also submit articles that are relevant to the site, but bear in mind that I'm trying to encourage people to submit things with working code!

The code snippets are categorized by language (I'll add languages on demand), and the site supports Digg-style voting plus user-defined tags. To get things going I've submitted a bunch of suitable items (most from my own writing).

Of course, it's very much in a 'beta' state and I'll be happy to receive feedback on bugs you encounter or suggestions for improvements.

Tuesday, April 17, 2007

No newsletter for April 15, 2007

This blog post is really for people who read my spam and anti-spam newsletter. There won't be one this week (it was due on April 15) because I wanted to spend time in the newsletter reviewing the MIT Spam Conference 2007.

Unfortunately, the videos of the presentations on the web site are either of very poor quality or have no sound at all. This makes reviewing the presentations very hard. Bill Yerazunis has promised better videos 'coming soon', and I'm waiting for them.

While I'm complaining I also find that the 'download an ISO' to get the papers to be ridiculous. I've asked Bill to provide individual links to each of the papers and presentations so that people can just click and download what they want. He says...

We *could* have individual links. I assume the
individual authors would do so in any case.

And I'll probably do that later.

But right now, I'm sorta trying to get people to
at least glance at all the papers, and put the CDROMs into
their local libraries; that's the motivation.

So, my next newsletter will be out on April 30 with a review of the MIT Spam Conference and other regular news.

Monday, April 16, 2007

Debugging: Solaris bus error caused by taking pointer to structure member

Take a look at this sample program that fails horribly when compiled on Solaris using gcc (I haven't tried other compilers, and I'm not pointing my finger at gcc here, this is a Sun gotcha).

Here's an example program (simplified for something much more complex that I was debugging), that illustrates how memory alignment on SPARC systems can bite you if you are doing low-level things in C. In the example the program allocates space for a thing structure which will be prepended with a header. The header structure has a dummy byte array called data which will be used to reference the start of the thing.

struct thing {
int an_int;

struct header {
short id;
char data[0];

struct header * maker( int size ) {
return (struct header *)malloc( sizeof( struct header ) + size );

int main( void ) {
struct header * a_headered_thing = maker( sizeof( struct thing ) );

struct thing * a_thing = (struct thing *)&(a_headered_thing->data[0]);

a_thing->an_int = 42;

If you build this on a SPARC machine you'll get the following error when you run it:

Bus Error (core dumped)

Annoyingly, if you build a debugging version of this program the problem magically goes away and doesn't dump core in the debugger. So you either resort to printf-style debugging or going into gdb and looking at the assembly output.

Here's what happens when you run this in gdb (non-debug code):

(gdb) run

Program received signal SIGSEGV, Segmentation fault.
0x000106d8 in main ()

Since you can't get back to the source we're forced to do a little disassembly:

(gdb) disassemble
Dump of assembler code for function main:
0x000106b0 : save %sp, -120, %sp
0x000106b4 : mov 4, %o0
0x000106b8 : call 0x10688
0x000106bc : nop
0x000106c0 : st %o0, [ %fp + -20 ]
0x000106c4 : ld [ %fp + -20 ], %o0
0x000106c8 : add %o0, 2, %o0
0x000106cc : st %o0, [ %fp + -24 ]
0x000106d0 : ld [ %fp + -24 ], %o1
0x000106d4 : mov 0x2a, %o0
0x000106d8 : st %o0, [ %o1 ]
0x000106dc : mov %o0, %i0
0x000106e0 : nop
0x000106e4 : ret
0x000106e8 : restore
0x000106ec : retl
0x000106f0 : add %o7, %l7, %l7
End of assembler dump.

I've highlighted the offending instruction. From the code you can clearly see that the o0 register contains the value 0x2a (which is, of course, 42) and hence we are looking at code corresponding to the line a_thing->an_int = 42;. The st instruction is going to write the 42 into the an_int field of thing. The address of an_int is stored in o1.

Asking gdb for o1's value shows us:

(gdb) info registers o1
o1 0x2094a 133450

An int is 4 bytes and you can easily see that the address of an_int stored in o1 is not 4 byte aligned (133450 mod 4 = 2, or just stare at the bottom nybble). The SPARC architecture insists that the data accesses by correctly aligned for the size of the access. In this case we need 4 byte assignment (note that malloc will make sure that things are correctly aligned and the compiler will pack structures to the correct alignment while minimizing space).

In this case the code fails because the data member is byte aligned (since we declared it as a character array), but then we case a pointer to it and treat it as structure with an integer member. Oops. Bus error.

(Note you could have discovered this with printf and %p to get the pointer values without going into the debugger and poking around in the assembly code).

There are a couple of ways to fix it. The first is to pad the header structure so that data is correctly aligned: adding 4 bytes of padding in the form of a short while make the problem go away:

struct header {
short id;
short padding;
char data[0];

That's ugly and requires careful commenting and could be a maintenance problem if maker is used to make things requiring a different alignment, or the header structure is modified.

It's slightly cleaner to not have padding but change the type of data to something like the alignment you want:

struct header {
short id;
int data[0];

(Or even double data[0] to get 8 byte alignment). With gcc you could even make this really clear by using the aligned attribute to create a special type:

typedef char aligned_data __attribute__ ((aligned (8)));

struct header {
short id;
aligned_data data[0];

I think that's the clearest option of all. With a little documentation around this it should be maintainable.

Friday, April 06, 2007

Ego Blogging: My Portrait

I was looking for a portrait to place on my web site as part of the overhaul and after having unsuccessfully worked with various providers at Elance I finally asked Leo Laporte who had done his picture.

Turns out it's Snaggy at GeekCulture so I put a small number of $ on the table and received the following:

The first person to see this said, "It makes you look about 20!". Even if I do look a little younger than I am, it's a very good likeness. Now, I just need to find a way to insert it into my home page.

Wednesday, March 28, 2007

Some statistics about usage

A regular user of my service wrote in to ask if I had any statistics about its use. I didn't until I grep/cut/wced a few log files, and now I do. Since its inception:

Unique URLs monitored2,256
Unique email addresses (i.e. # of users)586
URLs that were still down after a week of monitoring205
# of URLs monitored by the most active user763
# of rules used to identify non-working sites67
Total ad revenue$13.37 now uses 67 separate rules to determine whether a site is available or not. That number is growing as is improved. The more users the better it gets.

It's interesting to look at the distribution of 'dead times' (how long a site remained dead from it being reported to for monitoring and then actually being available.

The big initial bar are people who get a response in under 10 minutes; many of those people are no doubt kicking the tires to see if works at all! Once you pass 10 minutes you can see that the average web site comes back on line after 1h45m.

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl.

Since someone else might find this handy, here it is:

use strict;
use warnings;

my $address = $ARGV[0] || '/';

my ( $ip, $bits ) = split( /\//, $address );
my @octets = split( /\./, $ip );

if ( ( $address eq '' ) ||
( $bits eq '' ) ||
( $#octets != 3 ) ) {
die "Usage: ip-decode a.b.c.d/x\n";

my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 +
$octets[2] ) * 256 + $octets[3];

my $remaining = 32 - $bits;

for my $i (0..(2 ** $remaining) - 1) {
print_ip( $base + $i );

sub print_ip
my ( $address ) = @_;

my @octets;

for my $i (0..3) {
push @octets, ($address % 256);
$address >>= 8;

print "$octets[3].$octets[2].$octets[1].$octets[0]\n";

For example, if this script is called ip-decode and you want to decode the address prefix you type ip-decode and you'll get the output:

This script has undergone no testing at all...

Single 330ml can, take away, Coca Cola pricing within a two block radius of my home

The lowest price is 1/4 of the highest for no additional work.

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user they are on the Advanced tab and called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored; they are just too common to provide much information.

My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere and probabilities will be equal for each category of classification. I had a 'why bother' attitude.

Finally, I got around to testing this assumption. And I can give some actual numbers. Taking the standard 20 Newsgroups test and using a (complement) Naive Bayes text classifier as described here I can give you some numbers.

The bottom line is that stopwords did improve classification accuracy for 20 Newsgroups (and for other test datasets that I have), but the improvement was by 0.3%. The stopword list used was the top 100 most frequently occurring English words.

So, I was wrong... they do make a difference, but not by much.

Monday, March 26, 2007

Introducing Usman's Law

Back at Electric Cloud I worked with a smart guy named Usman Muzaffar. As part of his job he spent a lot of time dealing with our customers, many of whom used GNU Make other other Make tools to build their software.

One of the constant problems that Usman encountered was that most people had no way to get back to a truly clean build. No matter what they'd put in place for doing a totally scratch, clean build it was hard for everyone because their process often accidentally ommitted to delete something.

I've observed this problem in my own code. Like many people I have a 'make clean' option which deletes all the output files: in my case by rm -rfing an obj directory:

.PHONY: clean
@rm -rf $(OUT)/*

And I make sure that generated things only go under $(OUT). But it's easy to screw up. Consider a program like yacc or bison which'll create temporary source code files in the same place as the source code being analyzed. The truth is you have to be very careful to ensure that everything goes in one deletable place. (Not to mention the difficulties involved if the Makefile output different versions of objects for, say, different processor targets or platforms).

That leads me to Usman's Law: make clean doesn't.

Live by it and you'll be on the look out for poorly coded Makefiles that leave generated files in places they should not.

Saturday, March 24, 2007

Whoops! Hotel room menu system fscked

A few days ago I had to be in London for business so I stayed at the Gatwick Airport Sofitel. Nice hotel, and right inside the terminal... very convenient. When I got to my room the TV was switched with a short ad looping. It was inviting me to press the INFO button on the remote for the hotel menu.

There was no INFO button, but there was a button called MENU. I pressed that button... oops:

Looks like the hotel menu system was running RedHat Linux and the disk was fscked. (Sorry about the poor quality, the camera in my Motorola RAZR sucks). You can just make out the hda2 is screwed.

The last part says that it was dropping into a shell. Unfortunately, there was no way that I could type anything :-(

Friday, March 23, 2007

Electric Cloud wins a Jolt Productivity Award

Back in 2005 POPFile (which is now in desperate need of an updated version) won a Productivity Award at the 15th Annual Jolt awards. This week the company I co-founded, Electric Cloud, won the exact same award for its product ElectricCommander.

OK, I should stop bragging now.

And show a little humility.

Truth be told, the glow from the second award is strictly reflected... I didn't design, code, or do anything to make ElectricCommander :-) But being a company founder is a good thing; you get to pretend you had all the smart ideas.

Wednesday, March 21, 2007 gets a major overhaul

Yesterday I pushed out a completely revamped with a new look and feel and much better organization. Here are some of the highlights:

RSS Feeds Everywhere

The site now offers a total of 34 RSS feeds (all of which are listed on the main page). Through the site RSS feeds for specific sections are highlighted using the symbol; click to subscribe.

This large number of feeds means that users can subscribe to parts of the content of the site that they care about. For example, to follow the evolution of The Spammers' Compendium there's a feed for that, you could just subscribe to find out when I update any of the Free Software I maintain, or if you are interested in GNU Make you can follow my writings on that topic here.

Of course, if you want to know when anything changes there's this feed.

Better Organization

Pages have be rearranged so that writing, talks and other content can be accessed by the publication or venue, or by the topic. Clicking on a subject expands all the content relevant to that subject.

In addition, the pages The Spammers' Compendium and ASTLT have been totally rewritten for a clearer presentation. The Spammers' Compendium is also now presenting the SPUTR names for each of the tricks for better categorization.

It's All Automatic

I've built my own little content-management system that lets me update and maintain the site from a single script. The pages you see are static but behind the scenes they are generated when things change.

Please let me know if you have trouble with the site. I've tested it on recent browsers, but haven't been testing everywhere...