The other day a user posted an anonymous comment on Hacker News. Regular Hacker News participant jacquesm wanted to know who wrote it and posted a challenge to unmask the user.
He also emailed me because he thought I might be the anonymous user. I agreed to help him with a little bit of text mining. Jacques had a nice database of all Hacker News submissions and comments and gave me a 250MB SQL file suitable for loading into MySQL. Unfortunately, Jacques' database only had comments up until September 2009, so if the user who wrote the anonymous comment joined recently it won't be possible to identify them.
I quickly whipped up a naive Bayesian text classifier in Perl (similar to this one that I wrote about in Dr. Dobb's five years ago) with one category per Hacker News user. The classifier used the complete set of comments from each user to train that user's category. I then fed it the text of the anonymous comment, which it scored against every category.
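For readers who want to see the idea in code, here is a minimal sketch of a one-category-per-user naive Bayes classifier. It is written in Python rather than Perl and is not the code I actually ran: the tokenizer, the add-one smoothing, the uniform prior over users, and the toy users "alice" and "bob" are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesAuthorClassifier:
    """One category per user; every user is assumed equally likely a priori."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # user -> word -> count
        self.total_words = Counter()             # user -> number of tokens seen
        self.vocabulary = set()                  # all distinct words seen in training

    def train(self, user, comment):
        """Add one comment to the training data for this user's category."""
        tokens = tokenize(comment)
        self.word_counts[user].update(tokens)
        self.total_words[user] += len(tokens)
        self.vocabulary.update(tokens)

    def score(self, user, comment):
        """Log-likelihood of the comment under the user's word distribution.

        Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
        """
        counts = self.word_counts[user]
        denominator = self.total_words[user] + len(self.vocabulary)
        return sum(
            math.log((counts[token] + 1) / denominator)
            for token in tokenize(comment)
        )

    def rank(self, comment):
        """Return all trained users, most likely author first."""
        return sorted(
            self.word_counts,
            key=lambda user: self.score(user, comment),
            reverse=True,
        )


# Toy data with hypothetical users, standing in for the real comment database.
clf = NaiveBayesAuthorClassifier()
clf.train("alice", "I think the startup should ship the product early")
clf.train("alice", "ship early and iterate on the product")
clf.train("bob", "the compiler optimizes the loop with vectorization")
clf.train("bob", "profile the loop before you blame the compiler")

ranking = clf.rank("you should ship the product")
print(ranking)
```

Working in log space avoids numeric underflow when multiplying many small word probabilities, which matters once comments run to hundreds of words.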
Culling the list to users who have commented recently, the 10 most likely users (according to the text classifier) are, in order of likelihood: sh1mmer, stcredzero, gojomo, patio11, andreyf, anigbrowl, teej, physcab, thorax, run4yourlives.
Now, none of those users actually commented on the thread in question. Assuming it was someone who was commenting on the thread and then switched to a different account, the most likely candidates are (again in order): petercooper and jacquesm.
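Both culling steps (recent commenters, then thread participants) amount to filtering the ranked list while preserving its order. A small Python sketch, assuming the ranking is a list of usernames with the most likely author first; the function name and the usernames are hypothetical:

```python
def restrict_to(ranked_users, allowed_users):
    """Keep only the allowed users, preserving the classifier's ordering."""
    allowed = set(allowed_users)
    return [user for user in ranked_users if user in allowed]


# Hypothetical ranking and thread participants, for illustration only.
full_ranking = ["user_a", "user_b", "user_c", "user_d", "user_e"]
thread_participants = {"user_d", "user_b", "user_z"}

candidates = restrict_to(full_ranking, thread_participants)
print(candidates)
```

Users who appear in the allowed set but were never scored (such as "user_z" here) simply drop out, which mirrors the limitation that accounts missing from the database cannot be ranked.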
So did this approach work?
PS As a quick test of my classifier, I ran it against one of jacquesm's own comments and got the following people in order of likelihood: jacquesm, geoscripting, kunqiana, jerf, mixmax.