Skip to main content

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained it's possible to calibrating it by classifing known documents, recording the output scores and then for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on) count the number of documents in each class, and get a fraction the represents the probability within that range. Then when you classify an unknown document you look at the range in which its score falls to get probability estimates.

For spam filtering each range would consist of the number of emails that are ham in the range, and the number that are spam. The ham probability then just comes out to # of ham emails in the range / total number of emails in the range.

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:

The x-axis is the ham score in the range [0.0,0.9) with each square representing a single 0.1-width bin. The left-most square means that the classifier had a very low score in terms of hamminess, right-most square means that the classifier had a very high score.

The y-axis is the spam score in the [0.0,0.9), so that the bottom means a low spamminess score and the top a high one. Each square is colored red and green. Red is the percentage of messages in that bin that are spam, and green the percentage of messages in that bin that are ham.

So as you'd expect the top left corner (low ham score, high spam score) is generally colored red indicating that these are spam messages, and the bottom right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were less than 5 messages in the bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) as the way to determine whether a message is spam or ham (or undetermined). For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18 is 57% ham, 43% spam so looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).


Justin Mason said…
Interesting to note your use of 'Gary Robinson style spam filter' -- I thought I was the only one using that clunky style of phrasing ;) It's true, though -- many of the so-called "Bayesian" algorithms are really based on Gary's algorithms. Including SpamAssassin's ;)

I wonder if the preponderance of spam results in the "unsure bucket" at [0.9,0.9] is a result of the use of "Bayes poisoning" in the spams?

Also, this is the results from a very heavily-trained filter. It would be interesting to see what happens if you train on a much smaller subset (like 10%)?

good post!

Popular posts from this blog

Your last name contains invalid characters

My last name is "Graham-Cumming". But here's a typical form response when I enter it: Does the web site have any idea how rude it is to claim that my last name contains invalid characters? Clearly not. What they actually meant is: our web site will not accept that hyphen in your last name. But do they say that? No, of course not. They decide to shove in my face the claim that there's something wrong with my name. There's nothing wrong with my name, just as there's nothing wrong with someone whose first name is Jean-Marie, or someone whose last name is O'Reilly. What is wrong is that way this is being handled. If the system can't cope with non-letters and spaces it needs to say that. How about the following error message: Our system is unable to process last names that contain non-letters, please replace them with spaces. Don't blame me for having a last name that your system doesn't like, whose fault is that? Saying "Your

All the symmetrical watch faces (and code to generate them)

If you ever look at pictures of clocks and watches in advertising they are set to roughly 10:10 which is meant to be the most attractive (smiling!) position for the hands . They are actually set to 10:09.14 if the hands are truly symmetrical. CC BY 2.0 image by Shinji I wanted to know what all the possible symmetrical watch faces are and so I wrote some code using Processing. Here's the output (there's one watch face missing, 00:00 or 12:00, because it's very boring): The key to writing this is to figure out the relationship between the hour and minute hands when the watch face is symmetrical. In an hour the minute hand moves through 360° and the hour hand moves through 30° (12 hours are shown on the watch face and 360/12 = 30). The core loop inside the program is this:   for (int h = 0; h <= 12; h++) {     float m = (360-30*float(h))*2/13;     int s = round(60*(m-floor(m)));     int col = h%6;     int row = floor(h/6);     draw_clock((r+f)*(2*col+1), (r+f)*(row*2+1),

The Elevator Button Problem

User interface design is hard. It's hard because people perceive apparently simple things very differently. For example, take a look at this interface to an elevator: From flickr Now imagine the following situation. You are on the third floor of this building and you wish to go to the tenth. The elevator is on the fifth floor and there's an indicator that tells you where it is. Which button do you press? Most people probably say: "press up" since they want to go up. Not long ago I watched someone do the opposite and questioned them about their behavior. They said: "well the elevator is on the fifth floor and I am on the third, so I want it to come down to me". Much can be learnt about the design of user interfaces by considering this, apparently, simple interface. If you think about the elevator button problem you'll find that something so simple has hidden depths. How do people learn about elevator calling? What's the right amount of