Wednesday, March 28, 2007

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored when classifying messages) that I put together by hand years ago; if you are a POPFile user, they are on the Advanced tab under Ignored Words. The basic idea is that stopwords are useless from the perspective of classification and should be ignored: they are simply too common to carry much information.
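To make that concrete, here is a minimal sketch of how a stopword filter slots into tokenization; the tiny STOPWORDS set is illustrative, not POPFile's actual list:

```python
# Illustrative stopword set -- not POPFile's real Ignored Words list.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "it"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords before counting."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(tokenize("The meeting is in the morning"))  # -> ['meeting', 'morning']
```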

My commercial machine learning tools did not, until recently, have a stopwords facility. That was based on my belief that stopwords don't make much difference: a truly common word appears in messages of every category, so its probability is roughly equal across categories and it contributes almost nothing to the classification. I had a 'why bother' attitude.
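To make the 'why bother' intuition concrete: the standard multinomial Naive Bayes decision rule scores a document $d$ against a category $c$ as

$$\log P(c \mid d) \propto \log P(c) + \sum_{w \in d} n_w \log P(w \mid c)$$

where $n_w$ is the number of times word $w$ occurs in $d$. If a stopword has roughly the same $P(w \mid c)$ in every category, its term shifts every category's score by the same amount and cannot change which category wins.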

Finally, I got around to testing this assumption, so I can give some actual numbers, using the standard 20 Newsgroups test set and a (complement) Naive Bayes text classifier as described here.

The bottom line is that stopwords did improve classification accuracy for 20 Newsgroups (and for the other test datasets that I have), but only by 0.3%. The stopword list used was the 100 most frequently occurring English words.
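If you want to reproduce something like this experiment, here is a sketch using scikit-learn's ComplementNB as a stand-in for the tools I used; note that it relies on scikit-learn's built-in English stopword list rather than the top-100 list above, so the exact numbers will differ:

```python
# Rough re-creation of the experiment: complement Naive Bayes on
# 20 Newsgroups, with and without a stopword list. Uses scikit-learn's
# built-in 'english' list, not the top-100 list from the post.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import ComplementNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

for stop_words in (None, "english"):
    vec = CountVectorizer(stop_words=stop_words)
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)
    clf = ComplementNB().fit(X_train, train.target)
    acc = accuracy_score(test.target, clf.predict(X_test))
    print(f"stop_words={stop_words!r}: accuracy={acc:.4f}")
```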

So, I was wrong... they do make a difference, but not by much.

1 comment:

Justin Mason said...

Where'd you get the list of stopwords?

What I did for the SpamAssassin classifier was to train on a sample spam/ham corpus, then select the tokens scoring between 0.4 and 0.6: the wishy-washy tokens that SpamAssassin would ignore anyway (we ignore tokens that are not "strong" enough for classification). So, in other words, the stopword list was derived from training rather than from a "500 common English words" list.
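For illustration, here is a sketch of the derive-from-training approach Justin describes; the probability estimate is a simple frequency ratio and the thresholds are illustrative (SpamAssassin's actual token scoring is more involved):

```python
# Derive a stopword list from training data: keep the "wishy-washy"
# tokens whose spam probability falls in a band around 0.5.
from collections import Counter

def derived_stopwords(spam_docs, ham_docs, lo=0.4, hi=0.6, min_count=10):
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    stopwords = set()
    for token in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[token], ham_counts[token]
        if s + h < min_count:
            continue  # too rare to judge either way
        if lo <= s / (s + h) <= hi:  # carries little spam/ham signal
            stopwords.add(token)
    return stopwords
```

Tokens below min_count are skipped because a handful of occurrences can't separate signal from noise.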