Saturday, April 29, 2006

spamorham.org launched

Today I'm officially launching spamorham.org. It's Hot or Not for email.

The basic idea is to get humans (that means you) to read a small number of messages (some are ham; some are spam) and decide what they are. I'm doing this because there are currently two usable corpora of spam and ham: the SpamAssassin Public Corpus (which was hand-sorted) and the TREC 2005 Public Corpus (which was machine-sorted).

The TREC 2005 Public Corpus is the first target of spamorham.org. With your help we can all verify that the machine-sorted messages in the corpus were correctly identified as spam or ham. Given that there are so few public test resources, it's essential that the ones that do exist are accurate.

Starting today you can visit spamorham.org and be shown rendered and unrendered emails from the TREC 2005 Public Corpus. Once I've got enough human decisions (I'd love to get 10 per message; that means almost 1,000,000 human classifications) I'll make all the data public. And specifically I'll highlight any emails where people disagree with the current classification published by Gordon Cormack.
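To make the comparison concrete, here is a minimal sketch of how the human decisions could be tallied against the published labels. It is not the code behind spamorham.org: the function names, the message IDs, the "spam"/"ham" strings and the simple majority rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement) for a list of 'spam'/'ham' votes.

    label is None when there are no votes or the top two labels tie.
    agreement is the fraction of votes that went to the leading label.
    """
    counts = Counter(votes)
    if not counts:
        return None, 0.0
    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, top_count / len(votes)  # tie: no human consensus
    return top_label, top_count / len(votes)


def find_disagreements(corpus_labels, human_votes, min_votes=10):
    """List messages where the human majority contradicts the corpus label.

    corpus_labels: dict of message id -> 'spam' or 'ham' (hypothetical format)
    human_votes:   dict of message id -> list of human decisions
    min_votes:     skip messages with fewer decisions (10 is the target above)
    """
    flagged = []
    for msg_id, corpus_label in corpus_labels.items():
        votes = human_votes.get(msg_id, [])
        if len(votes) < min_votes:
            continue  # not enough human decisions yet
        label, agreement = majority_label(votes)
        if label is not None and label != corpus_label:
            flagged.append((msg_id, corpus_label, label, agreement))
    return flagged


if __name__ == "__main__":
    # Made-up example data, not taken from the real corpus.
    corpus = {"m1": "spam", "m2": "ham"}
    votes = {
        "m1": ["spam"] * 9 + ["ham"],
        "m2": ["spam"] * 7 + ["ham"] * 3,
    }
    for row in find_disagreements(corpus, votes):
        print(row)
```

A flagged message is only a candidate for review. A low agreement figure suggests the humans themselves were unsure, while a high one points more strongly at an error in the corpus label.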

Hopefully, this effort will ensure that the TREC 2005 Public Corpus is rock solid. And I expect it'll throw up some interesting data... for example, just how good are humans at sorting spam? Since we'll be able to look at where the corpus and the humans disagree, we'll be able to spot both machine errors and human errors.

3 comments:

JoeChongq said...

This is a very interesting project. Hopefully it will be useful. I just worry that you have found yet another thing I can kill time on.

My other concern is that the corpus is already showing its age (what I saw was all 2001-2002 I think). As you know spam mutates over time so testing filters on data this old is not the best. But otherwise this is a very good and large public corpus so at least it will provide a good basis for comparison even if not totally up to date.

Bill Brown said...

I'll echo that it is an interesting project. I ran through about a dozen messages, and two of them I couldn't read - one in Russian and one in German.

A third looked like a legit conversation: two replies to a brief message discussing the price of a stock. Probably not spam, but I couldn't be absolutely sure either way. It could have been a very, very subtle spam if the recipient was not a participant in the conversation.

JoeChongq said...

Bill, that is exactly the reason machines are sometimes better at classifying email than humans. And the result of this project will certainly shed more light on that topic.

I found many I was unsure of as well, and I found a few that I later realized were not spam when I got to other emails from the conversation. It is a good thing John is looking for at least 10 opinions on each mail.