Wednesday, May 17, 2006

What the Slashdot effect looks like

On May 15 my site was mentioned in a story on the front page of Slashdot. Up until the moment the link appeared, people had been visiting the site and classifying about 100 messages per hour (that's period A in the chart below).

All that changed at 1530 GMT on May 15. In the first half hour (until 1600) around 3,300 classifications were made (a 33x increase). Point B on the chart is the peak of 9,803 classifications in an hour, which occurred during the hour starting 2000 GMT. By this time the story was no longer number one on Slashdot (it was demoted from the top spot at 1628 GMT).

At point C the site dropped from its peak classification rate; this occurred nine hours after the story first hit Slashdot. Those nine hours correspond to 0700 to 1600 in California and 1000 to 1900 in New York: peak working hours in the US.

The second significant peak, at point D, occurred 22 hours after the story hit Slashdot; that corresponds to 1100 GMT on May 16. During this period the US is sleeping, but Europe is awake and working.

There's a smaller peak that I've labeled E, which occurs 16 hours after the story hit the front page. That's 0500 GMT, which just happens to be 1500 in eastern Australia and similarly the middle of the afternoon in Asia.

Happily, although the site is no longer on Slashdot, classifications remain 5x higher than before it was mentioned.

Monday, May 15, 2006

There's one born every minute: spam and phishing

It's been a little while since I launched the site where people perform a spam filtering task and their results are compared against best-of-breed spam filters. I set out to make sure that the spam filters were doing a good job, on the assumption that people would be able to spot errors that the filters were making.

Bad assumption. It turns out, based on preliminary data, that people suck at spam filtering. Here are some initial figures: people agree with 89.1% of the classifications that they've examined. Now that could mean that the original spam filter sucked, but guess again!

Ignoring all the emails that have been voted on only once, and looking at the emails that have been seen by multiple people (all of whom agreed on whether the message is a ham or a spam), there are some really surprising results:
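The filtering step described above can be sketched in a few lines. This is a hypothetical reconstruction, not the site's actual code: the vote records and field names are illustrative, and I'm assuming each vote is simply an (email id, label) pair.

```python
from collections import defaultdict

# Illustrative vote records: (email_id, label) pairs.
votes = [
    ("msg-1", "spam"),
    ("msg-1", "spam"),
    ("msg-2", "ham"),   # seen only once: ignored
    ("msg-3", "spam"),
    ("msg-3", "ham"),   # voters disagree: ignored
]

# Group the votes by email.
by_email = defaultdict(list)
for email_id, label in votes:
    by_email[email_id].append(label)

# Keep only emails with more than one vote where every voter agreed.
unanimous = {
    email_id: labels[0]
    for email_id, labels in by_email.items()
    if len(labels) > 1 and len(set(labels)) == 1
}

print(unanimous)  # {'msg-1': 'spam'}
```

Only `msg-1` survives: `msg-2` was seen once and `msg-3` split the voters, so neither counts toward the surprising results below.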

Here's one that people think is a spam:

and this one too:

and many people think this US Airways message is spam:

Now for the prize-winning classification. To the people who thought the following phish was a genuine message: could you please forward your bank account details and PIN to me so that I can deposit your prize in your account:

Happily, people are finding genuine errors that the spam filter made. For example, this really is a genuine message from Travelocity and not a spam:

Thursday, May 11, 2006

Interest in spam declining

Yesterday, Google released their Google Trends tool that lets you examine the trend in search terms over the last couple of years. I was able to use it to examine something I'd been suspecting for a while: the general public is losing interest in spam and spam filtering.

Here's a first graph that shows the trend in people searching for 'spam filter' (click the graph to go to the Google Trends site for more details):

A similar trend is seen when people are just searching for the term 'spam':

And all the major spam filters are showing declines in interest (I chose three here, but you'll see similar trends for any filter you ask for). POPFile is in blue, SpamBayes in orange and DSPAM in red.

'Phishing' isn't declining, but it seems to have been fairly stable for the last year. So if you are doing a startup in the phishing world you'd better get moving.

It's time I moved on from spam; here's a trend I think I'd like to follow. My guess is that there's still some upside here.

Monday, May 01, 2006

First hacking attempt?

When I built the backend code for the site I was sure that some malicious people would attempt to mess up the results by deliberately voting incorrectly. And I was worried that others might try to automate this messing with the system.

I now have the first evidence that someone (within just two days of launch) attempted to subvert the controls that I put in place. Here's part of the security log that my system generates (I've removed the IP address used, but it was in Canada):

Epoch Time    Time Since Page Served    Error

1146387238    94                        captcha wrong
1146387295    52                        captcha wrong
1146387997    42                        captcha wrong
1146392018    36                        captcha wrong
1146392675    24                        captcha wrong
1146394141    25                        captcha wrong
1146416725    9309                      hash doesn't match, captcha wrong
1146419961    205                       hash doesn't match

This person (or robot) got the CAPTCHA wrong repeatedly (which would have caused their connection to be tarpitted) and then the hash match failed. A hash match failure means that the fields hidden in the page were tampered with, or that the user's IP address suddenly changed in the middle of a session (which is possible if their Internet connection went down and came back up between pages). The hidden fields are used to track the exact message being examined, timeout information, and the random data used to generate the CAPTCHA. Any tampering is detected automatically without maintaining server-side state (I blogged about this system earlier).
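A minimal sketch of this kind of stateless tamper detection, assuming an HMAC over the hidden fields keyed with a server-side secret. The field names, the separator, and the inclusion of the client IP are my assumptions for illustration, not the site's actual scheme:

```python
import hmac
import hashlib

# Assumed server-side secret; it is never sent to the client.
SECRET = b"server-side secret key"

def sign(message_id, deadline, captcha_seed, client_ip):
    """Compute the hash that travels with the hidden form fields."""
    payload = "|".join([message_id, deadline, captcha_seed, client_ip])
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()

def verify(message_id, deadline, captcha_seed, client_ip, submitted_hash):
    """True only if no field (and no IP address) changed in transit."""
    expected = sign(message_id, deadline, captcha_seed, client_ip)
    return hmac.compare_digest(expected, submitted_hash)

# A round trip with untouched fields verifies...
h = sign("msg-42", "1146394141", "seed123", "1.2.3.4")
print(verify("msg-42", "1146394141", "seed123", "1.2.3.4", h))  # True

# ...but tampering with any hidden field is detected.
print(verify("msg-99", "1146394141", "seed123", "1.2.3.4", h))  # False
```

Because the hash can be recomputed from the submitted fields alone, the server keeps no per-session state: a mismatch is the "hash doesn't match" line in the log above.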

So it's possible that this happened for some innocent reason, but it looks to me like someone tried to see whether the controls in place were subvertible, and gave up.