Wednesday, March 28, 2007

Some statistics about usage

A regular user of my service wrote in to ask if I had any statistics about its use. I didn't until I grep/cut/wced a few log files, and now I do. Since its inception:

Unique URLs monitored: 2,256
Unique email addresses (i.e. # of users): 586
URLs that were still down after a week of monitoring: 205
# of URLs monitored by the most active user: 763
# of rules used to identify non-working sites: 67
Total ad revenue: $13.37

The service now uses 67 separate rules to determine whether a site is available or not. That number is growing as the service is improved: the more users, the better it gets.

It's interesting to look at the distribution of 'dead times': how long a site remained dead from the time it was reported for monitoring to the time it actually became available again.

The big initial bar is people who get a response in under 10 minutes; many of those people are no doubt kicking the tires to see if the service works at all! Once you pass 10 minutes you can see that the average web site comes back on line after 1h45m.

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl.

Since someone else might find this handy, here it is:

#!/usr/bin/perl
use strict;
use warnings;

# Expect a single argument in a.b.c.d/x form
my $address = $ARGV[0] || '/';

my ( $ip, $bits ) = split( /\//, $address );
my @octets = defined( $ip ) ? split( /\./, $ip ) : ();

if ( ( !defined( $bits ) ) ||
     ( $bits eq '' )       ||
     ( $#octets != 3 ) ) {
    die "Usage: ip-decode a.b.c.d/x\n";
}

# Pack the four octets into a single 32-bit integer
my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 +
             $octets[2] ) * 256 + $octets[3];

# Number of host bits left over after the /x prefix
my $remaining = 32 - $bits;

# Enumerate and print every address in the range
for my $i (0..(2 ** $remaining) - 1) {
    print_ip( $base + $i );
}

# Convert a 32-bit integer back to dotted-quad form and print it
sub print_ip
{
    my ( $address ) = @_;

    my @octets;

    for my $i (0..3) {
        push @octets, ($address % 256);
        $address >>= 8;
    }

    print "$octets[3].$octets[2].$octets[1].$octets[0]\n";
}

For example, if this script is called ip-decode and you want to decode an address prefix, you type ip-decode followed by the prefix and you'll get the list of addresses in that range as output.
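To make that concrete, here's an illustrative run on a small /30 prefix (my own example values, not the ones from the original post); the script just prints every address covered by the prefix:

ip-decode 192.168.1.0/30
192.168.1.0
192.168.1.1
192.168.1.2
192.168.1.3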

This script has undergone no testing at all...

Single 330ml can, take away, Coca Cola pricing within a two block radius of my home

The lowest price is 1/4 of the highest for no additional work.

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user they are on the Advanced tab and called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored; they are just too common to provide much information.
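To make the idea concrete, here's a minimal sketch (my own illustration, not POPFile's actual code) of what a stopword list does during tokenization: the listed words are simply dropped, so the classifier's word counts never see them.

use strict;
use warnings;

# A tiny, hypothetical stopword list; POPFile's real 'Ignored Words'
# list is much longer.
my %stopwords = map { $_ => 1 } qw( the a an and or of to in is it );

# Split text into lowercase words and throw away the stopwords so
# that the classifier's word counts never include them.
sub tokenize {
    my ( $text ) = @_;
    return grep { !$stopwords{$_} }
           map  { lc }
           ( $text =~ /([a-z]+)/ig );
}

my @tokens = tokenize( 'The quick brown fox and the lazy dog' );
print "@tokens\n";    # prints: quick brown fox lazy dog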

My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere and probabilities will be equal for each category of classification. I had a 'why bother' attitude.

Finally, I got around to testing this assumption, and I can give some actual numbers based on the standard 20 Newsgroups test and a (complement) Naive Bayes text classifier as described here.

The bottom line is that stopwords did improve classification accuracy for 20 Newsgroups (and for the other test datasets that I have), but the improvement was only 0.3%. The stopword list used was the top 100 most frequently occurring English words.
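As an aside, building such a top-N stopword list is straightforward; the sketch below is my own illustration of the general approach (not the exact code used in the test): count word occurrences across a training corpus and keep the most frequent words.

use strict;
use warnings;

# Count word occurrences across a set of documents and return the
# N most frequent words for use as a stopword list.
sub top_stopwords {
    my ( $limit, @documents ) = @_;

    my %count;
    for my $doc ( @documents ) {
        $count{ lc $1 }++ while $doc =~ /([a-z]+)/ig;
    }

    my @sorted = sort { $count{$b} <=> $count{$a} } keys %count;
    $limit = @sorted if $limit > @sorted;
    return @sorted[ 0 .. $limit - 1 ];
}

# Hypothetical two-document corpus; in the real test this would be
# the 20 Newsgroups training data.
my @docs = ( 'the cat sat on the mat', 'the dog and the cat played' );
print join( ' ', top_stopwords( 3, @docs ) ), "\n";    # 'the' comes first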

So, I was wrong... they do make a difference, but not by much.

Monday, March 26, 2007

Introducing Usman's Law

Back at Electric Cloud I worked with a smart guy named Usman Muzaffar. As part of his job he spent a lot of time dealing with our customers, many of whom used GNU Make or other Make tools to build their software.

One of the constant problems that Usman encountered was that most people had no way to get back to a truly clean build. No matter what they'd put in place for doing a totally-from-scratch, clean build, it was hard for everyone because their process often accidentally omitted to delete something.

I've observed this problem in my own code. Like many people I have a 'make clean' option which deletes all the output files: in my case by rm -rfing an obj directory:

.PHONY: clean
clean:
	@rm -rf $(OUT)/*

And I make sure that generated things only go under $(OUT). But it's easy to screw up. Consider a program like yacc or bison, which will create temporary source code files in the same place as the source code being analyzed. The truth is you have to be very careful to ensure that everything goes in one deletable place. (Not to mention the difficulties involved if the Makefile outputs different versions of objects for, say, different processor targets or platforms.)

That leads me to Usman's Law: make clean doesn't.

Live by it and you'll be on the look out for poorly coded Makefiles that leave generated files in places they should not.

Saturday, March 24, 2007

Whoops! Hotel room menu system fscked

A few days ago I had to be in London for business so I stayed at the Gatwick Airport Sofitel. Nice hotel, and right inside the terminal... very convenient. When I got to my room the TV was switched on with a short ad looping. It was inviting me to press the INFO button on the remote for the hotel menu.

There was no INFO button, but there was a button called MENU. I pressed that button... oops:

Looks like the hotel menu system was running RedHat Linux and the disk was fscked. (Sorry about the poor quality; the camera in my Motorola RAZR sucks.) You can just make out that hda2 is screwed.

The last part says that it was dropping into a shell. Unfortunately, there was no way that I could type anything :-(

Friday, March 23, 2007

Electric Cloud wins a Jolt Productivity Award

Back in 2005 POPFile (which is now in desperate need of an updated version) won a Productivity Award at the 15th Annual Jolt awards. This week the company I co-founded, Electric Cloud, won the exact same award for its product ElectricCommander.

OK, I should stop bragging now.

And show a little humility.

Truth be told, the glow from the second award is strictly reflected... I didn't design, code, or do anything to make ElectricCommander :-) But being a company founder is a good thing; you get to pretend you had all the smart ideas.

Wednesday, March 21, 2007

My web site gets a major overhaul

Yesterday I pushed out a completely revamped site with a new look and feel and much better organization. Here are some of the highlights:

RSS Feeds Everywhere

The site now offers a total of 34 RSS feeds (all of which are listed on the main page). Throughout the site, RSS feeds for specific sections are highlighted using the RSS symbol; click it to subscribe.

This large number of feeds means that users can subscribe to just the parts of the site's content that they care about. For example, there's a feed to follow the evolution of The Spammers' Compendium, you can subscribe to find out when I update any of the Free Software I maintain, and if you are interested in GNU Make you can follow my writings on that topic here.

Of course, if you want to know when anything changes there's this feed.

Better Organization

Pages have been rearranged so that writing, talks and other content can be accessed by publication or venue, or by topic. Clicking on a subject expands all the content relevant to that subject.

In addition, the pages for The Spammers' Compendium and ASTLT have been totally rewritten for a clearer presentation. The Spammers' Compendium also now presents the SPUTR names for each of the tricks for better categorization.

It's All Automatic

I've built my own little content-management system that lets me update and maintain the site from a single script. The pages you see are static but behind the scenes they are generated when things change.

Please let me know if you have trouble with the site. I've tested it on recent browsers, but I haven't tested everywhere...

Thursday, March 15, 2007

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained it's possible to calibrate it by classifying known documents, recording the output scores, and then, for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on), counting the number of documents in each class to get a fraction that represents the probability within that range. Then when you classify an unknown document you look at the range in which its score falls to get probability estimates.

For spam filtering each range would record the number of emails in it that are ham and the number that are spam. The ham probability then just comes out to (# of ham emails in the range) / (total # of emails in the range).
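Here's a minimal sketch of that binning idea (my own illustration, assuming you already have a list of (score, label) pairs from classifying known messages): it builds ten bins over [0,1) and then estimates the ham probability for a new message by looking up the bin its score falls into.

use strict;
use warnings;

# Build a 10-bin calibration table from (score, label) pairs, where
# label is 'ham' or 'spam'. Returns one hash of counts per bin.
sub calibrate {
    my ( @scored ) = @_;
    my @bins = map { +{ ham => 0, spam => 0 } } ( 0 .. 9 );
    for my $pair ( @scored ) {
        my ( $score, $label ) = @$pair;
        my $b = int( $score * 10 );
        $b = 9 if $b > 9;            # a score of exactly 1.0 goes in the last bin
        $bins[$b]{$label}++;
    }
    return @bins;
}

# Estimate P(ham) for a new score by looking up its bin.
sub ham_probability {
    my ( $score, @bins ) = @_;
    my $b = int( $score * 10 );
    $b = 9 if $b > 9;
    my $total = $bins[$b]{ham} + $bins[$b]{spam};
    return undef if $total == 0;     # too little data in this bin
    return $bins[$b]{ham} / $total;
}

# Hypothetical calibration data: [ classifier score, known label ]
my @scored = ( [ 0.05, 'spam' ], [ 0.12, 'spam' ], [ 0.15, 'ham' ],
               [ 0.85, 'ham'  ], [ 0.91, 'ham'  ], [ 0.95, 'ham' ] );
my @bins = calibrate( @scored );
printf "P(ham | score=0.13) = %.2f\n", ham_probability( 0.13, @bins );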

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:

The x-axis is the ham score in the range [0.0,0.9), with each square representing a single 0.1-width bin. The left-most square means that the classifier had a very low score in terms of hamminess, the right-most square that it had a very high score.

The y-axis is the spam score, also in the range [0.0,0.9), so that the bottom means a low spamminess score and the top a high one. Each square is colored red and green: red is the percentage of messages in that bin that are spam, and green the percentage of messages in that bin that are ham.

So as you'd expect the top left corner (low ham score, high spam score) is generally colored red indicating that these are spam messages, and the bottom right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were less than 5 messages in the bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) as the way to determine whether a message is spam or ham (or undetermined). For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18) is 57% ham, 43% spam, so it looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer:

Nick calls it a 'self aware' spam.