

Showing posts from March, 2007

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl. Since someone else might find this handy, here it is:

```perl
use strict;
use warnings;

my $address = $ARGV[0] || '/';
my ( $ip, $bits ) = split( /\//, $address );
my @octets = split( /\./, $ip );

if ( ( $address eq '' ) || ( $bits eq '' ) || ( $#octets != 3 ) ) {
    die "Usage: ip-decode a.b.c.d/x\n";
}

my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 + $octets[2] ) * 256 + $octets[3];
my $remaining = 32 - $bits;

for my $i ( 0 .. ( 2 ** $remaining ) - 1 ) {
    print_ip( $base + $i );
}

sub print_ip {
    my ( $address ) = @_;
    my @octets;

    for my $i ( 0 .. 3 ) {
        push @octets, ( $address % 256 );
        $address >>= 8;
    }

    print "$octets[3].$octets[2].$octets[1].$octets[0]\n";
}
```
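For anyone more comfortable in Python, the same arithmetic can be sketched like this (a hypothetical `decode` helper that mirrors the Perl above, not part of the original script):

```python
def decode(cidr):
    """Expand an a.b.c.d/x prefix into the list of addresses it covers,
    using the same base-address arithmetic as the Perl version."""
    ip, bits = cidr.split("/")
    octets = [int(o) for o in ip.split(".")]
    # Pack the four octets into one 32-bit integer.
    base = ((octets[0] * 256 + octets[1]) * 256 + octets[2]) * 256 + octets[3]
    remaining = 32 - int(bits)
    out = []
    for i in range(2 ** remaining):
        a = base + i
        # Unpack the 32-bit integer back into dotted-quad form.
        out.append(".".join(str((a >> shift) & 255) for shift in (24, 16, 8, 0)))
    return out

print(decode("10.0.0.0/30"))
```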

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user, they are on the Advanced tab, called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored: they are just too common to provide much information. My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere, and the probabilities will be roughly equal for each category of classification. I had a 'why bother' attitude. Finally, I got around to testing this assumption, and I can give some actual numbers, using the standard 20 Newsgroups test and a (complement) Naive Bayes text classifier as described here. The bottom line is that stopwords did
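The effect being tested boils down to a preprocessing step: drop stopwords before they reach the per-category word counts. A minimal sketch (the stopword list and tokenizer here are illustrative placeholders, not POPFile's actual Ignored Words list):

```python
# Illustrative subset only -- a real stopword list is much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "it"}

def tokenize(text, ignore_stopwords=True):
    """Lower-case and split on whitespace, optionally dropping stopwords
    so they contribute nothing to the classifier's word counts."""
    tokens = text.lower().split()
    if ignore_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(tokenize("The cat sat in the hat"))
```

Running the same corpus through the classifier with `ignore_stopwords` on and off is exactly the comparison described above.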

Introducing Usman's Law

Back at Electric Cloud I worked with a smart guy named Usman Muzaffar. As part of his job he spent a lot of time dealing with our customers, many of whom used GNU Make or other Make tools to build their software. One of the constant problems Usman encountered was that most people had no way to get back to a truly clean build. No matter what they'd put in place for doing a totally scratch, clean build, it was hard for everyone because their process often accidentally omitted to delete something. I've observed this problem in my own code. Like many people I have a 'make clean' option which deletes all the output files, in my case by rm -rf'ing an obj directory:

```make
.PHONY: clean
clean:
	@rm -rf $(OUT)/*
	@$(MAKE_DIRECTORIES)
```

And I make sure that generated things only go under $(OUT). But it's easy to screw up. Consider a program like yacc or bison, which'll create temporary source code files in the same place as the source code being analyzed. The

Electric Cloud wins a Jolt Productivity Award

Back in 2005 POPFile (which is now in desperate need of an updated version) won a Productivity Award at the 15th Annual Jolt Awards. This week the company I co-founded, Electric Cloud, won the exact same award for its product ElectricCommander. OK, I should stop bragging now. And show a little humility. Truth be told, the glow from the second award is strictly reflected... I didn't design, code, or do anything to make ElectricCommander :-) But being a company founder is a good thing; you get to pretend you had all the smart ideas.

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

Transforming Classifier Scores into Accurate Multiclass Probability Estimates, Zadrozny/Elkan, SIGKDD 2002
Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, Zadrozny/Elkan
Predicting Good Probabilities With Supervised Learning, Niculescu-Mizil/Caruana

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. Specifically, if you want to interpret a score as the probability that a document falls into a particular category, calibration gives you a way to estimate that probability from the scores. The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained, it's possible to calibrate it by classifying known documents, recording the output scores and
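Binning can be sketched as follows (an illustrative toy, not the exact procedure from the papers): classify a labelled calibration set, drop each score into a fixed-width bin over [0,1), and use the observed fraction of true positives in each bin as the calibrated probability for any future score landing there.

```python
def fit_bins(scores, labels, n_bins=10):
    """For each fixed-width bin over [0,1), record the fraction of
    calibration documents in that bin that truly belong to the category."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        counts[b] += 1
        positives[b] += y
    # Bins that saw no calibration data get None; a real system
    # would smooth or merge them instead.
    return [positives[b] / counts[b] if counts[b] else None for b in range(n_bins)]

def calibrated(score, bin_probs, n_bins=10):
    """Map a raw classifier score to the empirical probability of its bin."""
    return bin_probs[min(int(score * n_bins), n_bins - 1)]

# Toy calibration set: raw scores plus true 0/1 category labels.
bins = fit_bins([0.05, 0.08, 0.55, 0.58, 0.95, 0.97],
                [0,    0,    1,    0,    1,    1])
print(calibrated(0.56, bins))
```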

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer: Nick calls it a 'self aware' spam.