Saturday, April 29, 2006

HOWTO: iPod nano, iTunes, Windows 2000, Ubuntu, VMWare

I recently bought an iPod nano and wanted to use it with my laptop. The laptop is running a fully updated Ubuntu 5.10, and I boot into Windows 2000 (Service Pack 4, also fully updated) using VMWare 5.5.1 (build 19175), which is quite capable of virtualizing the CD-ROM drive (for ripping) and USB (for syncing to the iPod). I have iTunes installed.

But the syncing didn't work.

When I plugged in the iPod nano while VMWare had focus, I would get a warning that VMWare was going to have to take the iPod nano away from usb_storage (the Linux module that will happily mount an iPod as a hard disk). Even though VMWare claimed to do this (and the iPod was no longer mounted in Ubuntu), it didn't work. The iPod was not visible in iTunes or in My Computer.

The solution is very simple:

1. Start VMWare, boot Windows, load iTunes
2. In an Ubuntu shell type 'sudo rmmod usb_storage'
3. Go straight back to VMWare and full screen the VM
4. Plug iPod into USB port

This works for me every time. As long as usb_storage doesn't get to see the iPod it'll sync without a problem.

launched

Today I'm officially launching It's Hot or Not for email.

The basic idea is to get humans (that means you) to read a small number of messages (some are ham; some are spam) and decide what they are. I'm doing this because there are currently two usable corpuses of spam and ham: the SpamAssassin Public Corpus (which was hand sorted) and the TREC 2005 Public Corpus (which was machine sorted).

The TREC 2005 Public Corpus is the first target. With your help we can all verify that the machine sorted messages in the corpus were correctly identified as spam or ham. Given that there are so few public test resources, it's essential that those that are out there are accurate.

Starting today you can visit the site and be shown rendered and unrendered emails from the TREC 2005 Public Corpus. Once I've got enough human decisions (I'd love to get 10 per message; that means almost 1,000,000 human classifications) I'll make all the data public. And specifically I'll highlight any emails where people disagree with the current classification published by Gordon Cormack.

Hopefully, this effort will ensure that the TREC 2005 Public Corpus is rock solid. And I expect it'll throw up some interesting data... for example, just how good are humans at sorting spam? Since we'll be able to look at where the corpus and the humans disagree we'll be able to spot machine errors and human errors.

Thursday, April 27, 2006

Free GNU Make documentation

If, like me, you use GNU Make a lot then you should be aware of two really important pieces of documentation for GNU Make that are totally free (speech and beer):

1. The GNU Make Manual. This is the standard manual that comes with GNU Make and is available on the web here:

2. Robert Mecklenburg's "Managing Projects with GNU Make". This is the book published by O'Reilly but released under the Free Documentation License. PDFs of each of the sections are available here:

It's great that these two resources are freely available, but don't let that stop you buying them. Supporting the FSF and Mecklenburg with a little cash is a good way of keeping free documents free.

Monday, April 24, 2006

Bayesian Poisoning paper pointers

Over in the POPFile forums there's a lively discussion going on about a potential way of getting spam through POPFile and then corrupting the POPFile user's database to increase the false positive rate.

This particular attack uses common English words and relies on an implementation detail of POPFile: the fact that POPFile counts the number of times a word appears in an email. That detail is a little different from most spam filters that might consider a restricted range of hammy or spammy words. However, the attack described by POPFile user Olivier Guillion probably would work. On the other hand, I don't think we're likely to see it in the field primarily because it only affects POPFile and not other spam filters.

During the discussion I mentioned a number of papers that I thought Olivier should read concerning attacks on Bayesian filters. I then realized that they are not necessarily easily available. So, for the sake of everyone being able to read up quickly on this area, here's a quick bibliography:

Any I've missed?

Friday, April 21, 2006

Are Citibank crazy?

I blogged a while ago about Thunderbird's phishing filter trapping a seemingly innocent mail. Now, a reader has forwarded to me a genuine email from Citibank that he says was trapped by Thunderbird. I'm not going to reproduce the email here because it contains private details of the user, but it is a valid Citibank message.

Thunderbird thinks it's a scam because Citibank uses one of the oldest phishing tricks in the book. They have a URL displayed in the message that, when clicked, goes to a totally different URL. Here's the offending HTML:

If you do not wish to receive future account-related email,
select the last option at the following link:
<a href="">

So the geniuses send out a message that disguises the link with the link


Shortly after the disguised link there's the following text, which links to various sites with information about protecting yourself online. The first link takes you to a Citibank page which has a sub page about email security.

There are simple steps you can take to protect yourself from fraud while online, such as never sending personal or financial information by email. (We'll never ask for it.) For more information, please review the recommendations of the U.S. Government and others at the following sites:

On the email security page there are some examples of actual Citibank phish mails that almost certainly used the same technique of URL hiding that Citibank is employing!

Thursday, April 20, 2006

Do you have third party cookies enabled?

There's been a discussion over at GRC about a new service that Steve Gibson is going to offer that automatically tests whether visitors have third party cookies enabled. These are probably the most annoying types of cookies because they are used by Internet marketers to secretly tag and follow your surfing habits and potentially aggregate all sorts of information about you.

While we wait for Steve to come up with a polished solution, here's a little thing that I hacked together with Javascript and an IFRAME to do third party cookie testing.

Of course, if you block Javascript or the IFRAME tag then it's not going to work for you (the space above will remain blank), otherwise it should detect whether you have third party cookies enabled. It works by having the IFRAME request a page from my site and then looking for the cookie and writing out an appropriate message.

The IFRAME is just the following:

<iframe src=""
height="40" width="300"></iframe>

and the page that's loaded by the IFRAME has a small piece of Javascript:

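// Try to set a session cookie from inside the IFRAME (a third party context)
document.cookie = "TestForThirdPartyCookie=yes;";
// If the cookie cannot be read back, the browser refused to store it
if ( document.cookie.indexOf( "TestForThirdPartyCookie=" ) == -1 ) {
  document.write( "<b>You do not have 3rd party cookies enabled</b>" );
} else {
  document.write( "<b>You have 3rd party cookies enabled</b>" );
}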

The cookie that's set is session-only, so it won't be kept hanging around, and it contains nothing nasty. It's just TestForThirdPartyCookie=yes.

Stateless web pages with hashes

Recently I've been working on a web application that requires some state to be passed between pages. I really didn't want to keep server side state and then give the user a cookie or some other token that I'd have to track in the server side application, age out if discarded etc.

I hit upon the idea of keeping everything in hidden fields passed between page transitions by form POSTs. Of course, the problem with hidden fields is that someone could fake the information and submit a form with their notion of state. For example, if this were a commerce application, someone could alter the contents of their own shopping cart and perhaps even the prices they have to pay.

To get around this problem I include two extra pieces of information in the form: the Unix epoch time when the form was delivered to the user and a hash that covers all the contents of the form. For example, a typical form might look like:

<form action="" method=POST>
<input type=hidden name=hash value=ff9f5c4a0d10d7ab384ad0f95ff3727f>
<input type=hidden name=now value=1145516605>
<input type=hidden name=cart value=agwiji8973cnwiei938943>
<input type=submit value="Checkout" name=checkout>
</form>

Here the form contains a cart value that is just an encoded version of the contents of the user's cart (note I say encoded and not encrypted; there's no protection inherent in the string encoding the cart contents: it's just safe to be passed in a form).

The time the form was sent to the user is in the now field and is just the Unix epoch time when the page was generated on the server side.

The hash is an MD5 hash of the now value, the cart, the IP address of the person who requested the page and a salt value known only to the web server. The salt prevents an attacker from generating their own hashes and hence faking the form values, while the web server, which knows the salt, can still verify the validity of the form.

The now value means that old forms can be timed out just by checking the epoch time against the value in the form. The hashing of the IP address means that only the person for whom the form was generated can submit it.
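
Here is a minimal sketch of the scheme in JavaScript for Node.js, not the code from my application. The salt value, the 15 minute lifetime, the "|" separator between fields and the function names signForm and verifyForm are all assumptions made for illustration.

```javascript
const crypto = require("crypto");

// Hypothetical server-side secret; in practice load it from configuration.
const SALT = "replace-with-a-long-random-secret";
// Hypothetical lifetime of a form, in seconds.
const MAX_AGE_SECONDS = 15 * 60;

// MD5 over now, cart, client IP and the salt, as described above.
// The "|" separator is an assumption that keeps fields from running together.
function formHash(now, cart, ip) {
  return crypto
    .createHash("md5")
    .update([now, cart, ip, SALT].join("|"))
    .digest("hex");
}

// Values to place in the hidden fields when the page is generated.
function signForm(cart, ip, now = Math.floor(Date.now() / 1000)) {
  return { now, cart, hash: formHash(now, cart, ip) };
}

// Check a submitted form: it must not be too old and the hash must match.
function verifyForm(form, ip, currentTime = Math.floor(Date.now() / 1000)) {
  const now = Number(form.now);
  if (!Number.isInteger(now)) return false;
  const age = currentTime - now;
  if (age < 0 || age > MAX_AGE_SECONDS) return false;
  const expected = Buffer.from(formHash(now, form.cart, ip), "hex");
  const supplied = Buffer.from(String(form.hash), "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return supplied.length === expected.length &&
    crypto.timingSafeEqual(supplied, expected);
}
```

The server would call signForm when rendering the page and verifyForm on the POST, rejecting the request if it returns false. Changing the cart, the now value or the submitting IP address changes the expected hash, so the submitted hash no longer matches.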

I'm sure this isn't new to anyone who's written web applications. And it appears that Steve Gibson over at GRC is doing something similar with his e-commerce system and there's apparently something called View State in ASP.

Anyone who is a web expert care to comment?

Wednesday, April 19, 2006

Would you buy a "GNU Make Cookbook" e-book?

I've been thinking about taking all the recipes for GNU Make things that I've written over the years as articles, or blog entries, or answers to people's questions on help-make and writing them up as an e-book for purchase and download from this web site.

Here's a sample recipe in PDF format so that you can see what I'm talking about.

So, the critical questions:

1. Would you buy such a book?
2. If so, how much would you pay for it?
3. What format would be best? PDF?


A small bug fix to my keep state shoehorning

A while back I wrote about a way to shoehorn Sun Make's "keep state" functionality into GNU Make. With a fairly simple Makefile it's possible to get GNU Make to rebuild targets when the targets' commands have changed. I blogged this here and wrote it up for Ask Mr Make here.

One reader was having trouble with the system because every single Make he did caused a certain target to be built. It turned out this was because he'd done something like:

$(call do, commands)

The space after the , and before the commands was messing up my signature system's comparison and causing it to think that the commands changed every time. This is easily fixed by stripping the commands.

Here's the updated signature file (the change is the addition of $(strip ...) around $1 in the do macro):

include gmsl

last_target :=

dump_var = \$$(eval $1 := $($1))

define new_rule
@echo "$(call map,dump_var,@ % < ? ^ + *)" > $S
@$(if $(wildcard $F),,touch $F)
@echo $@: $F >> $S
endef

define do
$(eval S := $*.sig)$(eval F := $*.force)$(eval C := $(strip $1))
$(if $(call sne,$@,$(last_target)),$(call new_rule),$(eval last_target := $@))
@echo "$(subst $$,\$$,$$(if $$(call sne,$(strip $1),$C),$$(shell touch $F)))" >> $S
$C
endef

Another common thing people have asked for is that the signature system rebuild targets when the Makefile has changed. Currently the signature system cannot spot an edit to the Makefile that changes the commands. It's pretty simple to make this happen (although this will cause all targets to be built when the Makefile is updated) by adding the following line in new_rule above:

define new_rule
@echo "$(call map,dump_var,@ % < ? ^ + *)" > $S
@$(if $(wildcard $F),,touch $F)
@echo $@: $F >> $S
@echo $F: Makefile >> $S
endef

It's an exercise for the reader to replace Makefile with the actual name of the Makefile that is active when new_rule is called.

Tuesday, April 18, 2006

Rebuilding when the hash has changed, not the timestamp

GNU Make decides whether to rebuild a file based on whether any of its prerequisites are newer or if the file is missing. But sometimes this isn't desirable: when using GNU Make with a source code control system the time on a prerequisite might be updated by the source code system when the files are checked out, even though the file itself hasn't changed.

It's desirable, in fact, to change GNU Make to check a hash of the file contents and only rebuild if the file has actually changed (and ignore the timestamp).

You can hack this into GNU Make using md5sum (I'm assuming you're on a system with Unix-like commands). Here's a little example that builds foo.o from foo.c but only updates foo.o when foo.h has changed... and changed means that its checksum has changed:

.PHONY: all

to-md5 = $(patsubst %,%.md5,$1)
from-md5 = $(patsubst %.md5,%,$1)

all: foo.o

foo.o: foo.c
foo.o: $(call to-md5,foo.h)

%.md5: FORCE
	@$(if $(filter-out $(shell cat $@ 2>/dev/null),$(shell md5sum $*)),md5sum $* > $@)

FORCE:


This works because when foo.h was mentioned in the prerequisite list of foo.o it was changed to foo.h.md5 by the to-md5 function. So GNU Make sees the prerequisites of foo.o to be foo.c and foo.h.md5.

Then there's a pattern rule to build foo.h.md5 (the %.md5 rule) that will only update the .md5 file if the checksum has changed. Thus if and only if the checksum has changed does the .md5 file get changed and foo.o rebuilt.

The %.md5 rule is forced to run by having a dummy prereq called FORCE so that every MD5 hash is checked for every prerequisite that GNU Make needs to examine.

First the rule uses a filter-out/if combination to check to see if the MD5 hash has changed. If it has then the %.md5 rule will run md5sum $* > $@ (in the example md5sum foo.h > foo.h.md5). This will both update the hash in the file and change the .md5 file's timestamp and force foo.o to build.

If the rule for foo.o used $?, $^ or other automatic variables that work on the prerequisite list, they would need to be passed through from-md5 to remove the .md5 extension so that the real prerequisite is used in the commands that build foo.o.

In the example this isn't necessary.

If the foo.h.md5 file does not exist then the %.md5 rule will create it and force foo.o to get built.
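
To make the decision inside the %.md5 rule explicit, here is a minimal sketch of the same logic in JavaScript for Node.js. It is an illustration, not a replacement for the Makefile. The function names md5OfFile and updateStamp are made up for the example, and unlike md5sum the stamp file here stores only the hex digest rather than the digest followed by the file name.

```javascript
const crypto = require("crypto");
const fs = require("fs");

// Hex MD5 digest of a file's contents.
function md5OfFile(file) {
  return crypto.createHash("md5").update(fs.readFileSync(file)).digest("hex");
}

// Rewrite <file>.md5 only when the digest differs from the stored one
// (or when no stamp exists yet). Returns true if the stamp was rewritten,
// which is the case where a dependent target would be rebuilt.
function updateStamp(file) {
  const stamp = file + ".md5";
  const current = md5OfFile(file);
  let stored = null;
  try {
    stored = fs.readFileSync(stamp, "utf8").trim();
  } catch (err) {
    if (err.code !== "ENOENT") throw err; // a missing stamp is expected
  }
  if (stored === current) return false; // unchanged: leave the timestamp alone
  fs.writeFileSync(stamp, current + "\n");
  return true;
}
```

Calling updateStamp("foo.h") twice in a row returns true and then false. Rewriting foo.h with identical contents (which changes its timestamp) still returns false, and that is the case the Makefile is designed to ignore.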

You can also adapt this tip to work with different definitions of 'changed'. For example, the .md5 file could store the version number of a file from the source control system and rebuilds would only happen when the version had changed.

Wednesday, April 05, 2006

Really bad day for this email

Take a look at this message from ALM Expo. It's a message that I wasn't expecting, but I was glad to receive because I write for CM Crossroads and do expect to get mail from them. And I'm talking at the ALM Expo. Looks perfectly ok, right?

It had three strikes against it: first GMail stuck it in the spam folder so I fished it out by clicking "Not Spam", then POPFile thought it was spam and stuck it in my spam folder and finally Thunderbird reported that it thought the message was an email scam (i.e. phishing).

I don't know what GMail saw that it didn't like. I do know that POPFile saw some suspicious words (like unsubscribe, unlimited and event) and that the email used font size 1 (a favorite of spammers).

According to this post Thunderbird's scam filter looks for forms in email, URLs that don't go where they say they do and IP address-only URLs. The email doesn't appear to contain any of those things. Anyone know if it's possible to get Thunderbird to give its reasoning?
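
For anyone wanting to run the same three checks on a message by hand, here is a rough sketch in JavaScript for Node.js. It is not Thunderbird's code and makes no claim to match its behavior. The function names and the regular expressions are my own assumptions, it only handles double-quoted href attributes, and it only recognizes dotted IPv4 addresses.

```javascript
// Dotted-quad IPv4 host, e.g. 203.0.113.5
const IPV4 = /^\d{1,3}(\.\d{1,3}){3}$/;

// Parse a string as an http(s) URL, or return null if it does not look like one.
function parseHttpUrl(text) {
  let candidate = text.trim();
  if (/^www\./i.test(candidate)) {
    candidate = "http://" + candidate;
  } else if (!/^https?:\/\//i.test(candidate)) {
    return null;
  }
  try {
    return new URL(candidate);
  } catch (err) {
    return null;
  }
}

// True when the link's host is a bare IPv4 address.
function isIpOnlyLink(href) {
  const url = parseHttpUrl(href);
  return url !== null && IPV4.test(url.hostname);
}

// True when the visible text is itself a URL whose host differs from the href's.
function textMismatchesHref(text, href) {
  const shown = parseHttpUrl(text);
  const target = parseHttpUrl(href);
  return shown !== null && target !== null && shown.hostname !== target.hostname;
}

// Very rough check for an embedded form.
function containsForm(html) {
  return /<form[\s>]/i.test(html);
}

// Collect a human-readable reason for each check that fires.
function scamReasons(html) {
  const reasons = [];
  if (containsForm(html)) reasons.push("contains a form");
  const anchor = /<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
  for (const [, href, body] of html.matchAll(anchor)) {
    const text = body.replace(/<[^>]*>/g, ""); // drop nested tags
    if (isIpOnlyLink(href)) reasons.push("IP-address link: " + href);
    if (textMismatchesHref(text, href)) reasons.push("text/href mismatch: " + href);
  }
  return reasons;
}
```

Passing the HTML body of a message to scamReasons returns a list of reasons, and an empty list means none of the three checks fired. A link whose visible text is not itself a URL (for example "click here") is never reported as a mismatch.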