Thursday, December 07, 2006

Bug fix: 12 Tasty Make Recipes Part II

Back in May, 2006 I gave a talk called 12 Tasty Make Recipes Part II in which I talked about a user-defined GNU Make function to recursively search from a directory for a file or set of files.

The function was written like this:

search = $(foreach d,$(wildcard $1/*),$(call search,$d)$(filter $(subst *,%,$2),$d))

Unfortunately, there was a small mistake in the code that appeared in the presentation. However, two bugs collided to cause the mistake to have no effect. In the above function I wrote $(call search,$d) when I should have written $(call search,$d,$2). But since GNU Make had a bug that caused it to not reset $2 (or any other arguments in the form $n) on a nested $(call) and since I was reusing $2 as the second argument, my function worked.

That is, until GNU Make 3.81 was released and fixed bug 1744.

The correct version of search which will work with all versions of GNU Make is:

search = $(foreach d,$(wildcard $1/*),$(call search,$d,$2)$(filter $(subst *,%,$2),$d))

Thanks for Lou Iacoponi for writing in and pointing out my error.

Sunday, December 03, 2006

Two weeks of image spam innovation

Since I last blogged about image spam I've received numerous image spams myself, and reports from Nick FitzGerald, Sorin Mustaca and Nicholas Johnston about interesting image spams that I've just got to see.

On November 14, I received a report of image spams where the background noise had been updated from dots or small lines to polygons. Here's a sample:



A day later it was clear that the same spammer was randomizing the image size, the background noise and the color palette used:



Then on November 17 I was shown this interesting pump'n'dump technique:


83622874400056543 047183602660 41478311028418 100278 84807407350 05087016712772
78810870435635016 71651855827222 4725576405300038 84252840 12157351038885 630188325737443
23414 23133 41104 4312 7131 6341874402 48244 02522 1428 3224
55263 16021 42114 6654 7782 6583 2673 53280 03201 8323 6565
65882 58041 62412 6086607534050781 3826 0641 83176 85045 35427866406418
18506 15283 12474 33388136436533 643542 80456 87628 13156 506577110786230
31645 55602 81036 816525232877 301217585451283 01540 76265 0748 3312
06035 63450 43040 5224 1553 5786434504117661 52402 78534 8836 414
58510 75013 38325 3877 04146 76870 2788 34488 42803 77840 2737 5470
11372 62751178216028 2715 7266 17812 3668 10033 28833425366310 360215874450816
31403 486020746152 7723 4552 46471 6721 77207 683820875035 23460162005106

Nice, but that didn't last very long, but on November 22 the innovaters were at it again with smaller images containing polygons, lines, random colors, jagged text:



A day later the noise element changed to pixels:



Two days later spammers were trying a little 'old school' image spam with some fonts that they hoped would be hard to OCR.



Strangely the following image spam appeared on November 27 with a perfectly filterable URL. Oops:



And right before the month ended the noise around the border had turned into something like little fireworks:

Thursday, November 16, 2006

Yet more spammer image optimization; this time it's pretty

These something new in the image-spam wave: pretty colours! This spammer is working hard to randomize his images and avoid OCR. Here's a sample:



And to give you an idea of the randomization here's another:



Thanks to Nick FitzGerald and Sorin Mustaca for samples. Notice how the letters are misaligned both vertically and horizontally to try to avoid OCR, and the background polygons are randomized. Also the aspect ratio and size of the messages have been changed for each image.

Wednesday, November 08, 2006

Ransom note spam

Back in January I added a trick called The Small Picture to The Spammers' Compendium, and in August I updated The tURLing Test trick with an example of its use in image-based spam.

The Small Picture consists of sending individual letter images attached to a message. These letter images are then used to display a message and break up words that the spammer might think a spam filter would find suspicious. Here's an example of The Small Picture where certain letters (look carefully!) are formed using images rather than text:



The tURLing Test consists of disguising a URL by breaking it up and then explaining to the user how to type in the URL, thus proving that a human is reading the spam not a spam filter. This is done with URLs so that URL blacklists are bypassed. Here's an example of that from an image-based spam:



Now comes a combination of the two, that deserves the name 'Ransom Note Spam': it combines both The Small Picture (the letters are individual images attached to the spam) and The tURLing Test (the URL is made up of letters in the images):


Friday, October 20, 2006

Why OCRing spam images is useless

Nick FitzGerald forwards me another animated GIF spam that takes the animation plus transparency trick I outlined in the blog post A spam image that slowly builds to reveal its message to a new level. And it shows why spammers will work around OCR as fast as they can.

Here's what you see in the spam image:



Looks simple enough until you take a look at the GIF file that actually generated what you see. It's animated and it has three frames:





The first image is the GIF's background and is displayed for 10ms then the second image is layered on top with a transparent background so that the two images merge together and the image the spammer wants you to see appears. That image remains on screen for 100,000 ms (or 1 minute 40 seconds). After that the image is completely blanked out by the third frame.

My favourite touch is that it's not the entire image that's transparent, not even the white background, but just those pixels necessary to make the black pixels underneath show through. If you look carefully above you can see that some of pixels appear yellow (which is the background color of this site) indicating where the transparency is.

That is darn clever.

Monday, October 16, 2006

A spam image that slowly builds to reveal its message

Nick FitzGerald sent me a stunning example of lateral thinking on the part of a spammer. The spammer has taken a standard stock pump-and-dump spam image and split it horizontally into strips.



Each of the 17 horizontal strips cuts fairly randomly through the text making OCR on each strip not very useful. The spammer has then mounted each strip in its correct position on a transparent background and put each strip into an animated GIF. Here, for example, are a couple of strips:




The end result is that only once the entire image animation has completed is the complete spam visible making this a challenge for spam filters. And the spammer has thrown in a couple of frames at the end of the image, that get displayed after such a long delay (8 minutes) that they essentially never get shown. But those final frames are there just to throw off a spam filter trying to find the actual image.

Here's what gets displayed:



and here's the final image in the animation:



Very clever! (I'm calling this 'Strip Mining')

Monday, October 02, 2006

Ye Olde OCR Buster

Regular spam-correspondent Nick FitzGerald writes with an example of a spam that he believes is trying to get around both hash busting and OCR in an image.



The image has random dots in the bottom left hand corner to mess up hashing of the GIF itself, and the fonts used are badly rendered unusual fonts.

Friday, September 22, 2006

A Test::Class gotcha

I'm working on a project that involves building a prototype application in Perl. I've made extensive use of Perl's OO features and have a collection of classes that implement the mathematical calculations necessary to drive the web site running the application. Naturally, as I've been building the classes I've been building a unit test suite.

Since Test::Class is the closest thing Perl has to junit or cppunit I'm using it to test all the class methods in my Perl classes. Everything was looking good until I told the guy writing the server to integrate with my code. His code died with an error like this:
Can't locate object method "new" via package "Class::A" (perhaps you
forgot to load "Class::A" at Class/B.pm line 147.
Taking a quick look inside Class::B revealed that it did try to create a new Class::A object and that, sure enough, there was no use Class::A; anywhere in Class::B. Easy enough bug to fix, but what left me scratching my head was why the unit test suite didn't show this.

For each class I have an equivalent test class (so there's Class::A::Test and Class::B::Test) which are loaded using a .t file which in turn is loaded with prove. The test classes all use Test::Class.

The classes are tested with a Makefile that does the following:
test:
@prove classes.t
And classes.t consists of:
use strict;
use warnings;

use Class::A::Test;
use Class::B::Test;

Test::Class->runtest;
Since the test suite for Class::A does a use Class::A; and the test suite for Class::B does a use Class::B; and the two test suites are loaded using use in classes.t, both Class::A and Class::B are loaded before running the tests. This means that the fact that use Class::A; was missing from Class::B is masked in the test suite.

The solution is to have two .t files one for each class so that only the class being tested is loaded. So I dumped classes.t and created class_a.t and class_b.t as follows:
use strict;
use warnings;

use Class::A::Test;

Test::Class->runtest;
and
use strict;
use warnings;

use Class::B::Test;

Test::Class->runtest;
and the Makefile is changed to do:
test:
@prove class_a.t class_b.t
This now works correctly. The missing use Class::A; causes a fatal error in the test suite.

Friday, September 15, 2006

Image spam filtering BOF at Virus Bulletin 2006 Montreal

I'm leading a BOF meeting at Virus Bulletin 2006 in Montreal next month. The idea is to get together in one room for a practical, tactical meeting to share experiences on how people are currently filtering image spam and what might be done in future (and what we expect spammers to do). I've already got commitments from major anti-spam vendors to be there and talk (as much as they are permitted) about their approach and I'll try to cover what the Bayesian guys are doing.

If you are interested please email me, or post a comment here. If you represent a vendor and want to be involved I'm especially interested to hear from you as I want to get all experiences out on the table (as much as is practical).


Date and Time Confirmed: Thursday, October 12. 17:40 to 18:40 in the Press Room.

Downloadable PDF flyer.

Thursday, September 14, 2006

A C implementation of my simple GPS code

Reader Chris Kuethe wrote in with a version of my simple code for entering latitude and longitude to GPS devices written in C (my demonstration code was in Perl).

Seems Chris is a bit of a GPS fanatic and maintains a page on GPS hackery.

He ported my Perl code to C and is releasing the code freely. He gave me the choice of releasing under two clause BSD license or making it public domain. I think the most generous is public domain (especially since the Perl code was public domain).

Here's the code to compute a SOC:

#include <sys/types.h>
#include <stdio.>

int
main(int argc, char **argv){
int i, j;
unsigned long long lat, lon, c, p, soc_num;
char soc[11], *alpha = "ABCDEFGHJKLMNPQRTUVWXY0123456789";
int primes[] = { 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 };
float f;

if (argc != 3){
printf("Usage: %s <lat> <lon>\n", argv[0]);
exit(1);
}

sscanf(argv[1], "%f", &f);
lat = (int)((f + 90.0) * 10000.0);

sscanf(argv[2], "%f", &f);
lon = (int)((f +180.0) * 10000.0);

p = lat * 3600000 + lon;
soc_num = p * 128;

c = 0;
for(i = 0; i < (sizeof(primes)/sizeof(primes[0])); i++){
c += ((p % 32) * primes[i]);
p /= 32;
}

c %= 127;
soc_num += c;

for(i = 9; i >= 0; i--){
j = soc_num % 32;
soc[i] = alpha[j];
soc_num /= 32;
}
soc[10] = '\0';

printf("%s\n", soc);
}

And to compute latitude and longitude from a SOC:

#include <sys/types.h>
#include <stdio.h>

int
main(int argc, char **argv){
int i, j, c, k;
unsigned long long x, y, p, soc_num;
char soc[11], *alpha = "ABCDEFGHJKLMNPQRTUVWXY0123456789";
int primes[] = { 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 };
float lat, lon;

if ((argc != 2 )|| (strlen(argv[1]) != 10)){
printf("Usage: %s <10-digit-SOC>\n", argv[0]);
exit(1);
}

soc_num = 0;
for (i = 0; i < 10; i++){
c = (char)argv[1][i];
c = c & 0xff;
c = toupper(c);
switch(c){
case 'I': c = '1'; break;
case 'O': c = '0'; break;
case 'S': c = '5'; break;
case 'Z': c = '2'; break;
default: ;
}
for (j = 0; j < strlen(alpha); j++)
if (c == alpha[j]){
soc_num = (soc_num * 32 + j);
}
}

p = soc_num / 128;
k = soc_num % 128;

lon = ((p % 3600000) / 10000.0) -180.0;
lat = ((p / 3600000) / 10000.0) - 90.0;

c = 0;
for (i = 0; i < (sizeof(primes)/sizeof(primes[0])); i++){
c += ((p % 32) * primes[i]);
p /= 32;
}

c %= 127;
if (c != k)
printf("warning: checksum mismatch - %d %d\n", c, k);
printf("%0.4f %0.4f\n", lat, lon);
}

Thanks Chris!

Update: Chris writes to say that B1NLADEN02 can be found in Antarctica: -76.7847/-106.0187 and JIMMYHOFFA is here: -23.3433/-61.6087.

Monday, September 04, 2006

Optimal SMS keyboard layouts

One of the things I find very frustrating about typing SMS messages on my phone is that I often find that the next letter I want to type is actually on the same key that I just pressed. And that slows me down because either I wait for the timeout, or I click the right arrow key to move on.

For example, here's a standard keyboard on cell phones:

abc def
1 2 3

ghi jkl mno
4 5 6

pqrs tuv wxyz
7 8 9

Very common English letter pairs such as 'ed' and 'on' appear on the same key meaning that if you need to type one of these you are going to incur the cost of dealing with the 'next letter is on same key' problem. In addition, the most common English letters are more than one click away; the most common English letter 'e' is two clicks, 'o' is three, 'n' is two, etc.

What you really want is a keyboard layout that means that most common letters are as few clicks away as possible, and that the common letter pairs are on different keys so that you can maximize typing speed. And if possible make the layout as close as the current one so that it's easy to learn.

There are some people who propose squeezing QWERTY into the the current keyboard. This ends up with a key that starts with 'q' and another that starts with 'z': two of the least common keys are given pride of place on the keyboard.

Other propose using dictionaries. I think the fastest typing would be on an intelligently laid out keyboard without the need for a dictionary.

I took the 1000 most common words in English as a test set, and performed three keyboard optimizations: one by hand and two using different sets of common letter pairs and tested them against the 1000 most common words. Each set received a score equal to the number of clicks required to type all 1000 words. A single click on a key was worth one click (so typing 'a' is one click, typing 'q' is two, etc. on the standard keyboard), and the cost of handling the 'next letter is on same key' was set at the same time as two clicks.

I used letter and letter pair frequency information from the excellent book Cryptanalysis. And of course I wrote some code to perform the layout of the keyboards optimizing for the least number of clicks per common letter and the least number of same key clicks for letter pairs.

The standard keyboard layout gets a score of 12,447 clicks.

The following machine generated layout can be used to type the same words in 8757 clicks (70.35% of the clicks of the standard keyboard):

acb euwj
1 2 3

ipg hmx olv
4 5 6

sfq tdyz nrk
7 8 9

This keyboard doesn't look anything like the original keyboard so I then used a shorter list of letter pairs and hand optimized the keyboard to balance clicks and similarity. The result is 8,912 clicks (or 71.6% of the standard keyboard) and a nice layout:

adc efb
1 2 3

igj hlk omz
4 5 6

srpq tuv nwyx
7 8 9

Now, if there was only a way to get that on my RAZR I could save 30% of my typing time.

Subliminal advertising in spam?

Nick FitzGerald sent me a great example of subliminal advertising in a spam message. At least that's what he thinks the spammer might have been up to. The spam contains an animated GIF with four frames. One of the frames (which contains the actual spam message) remains visible for 17 seconds. The other three frames are displayed for 10ms or 40ms, and each of those contains a little random noise and the word BUY in random positions.

Was the spammer really hoping to make us fall for his pump and dump scam with a quick flash of BUY on screen?

Here's the actual GIF with the animation in place (watch out you might be forced to BUY :-)



And there are the four separate frames:

10ms
17s
40ms
40ms

Monday, July 10, 2006

A simple code for entering latitude and longitude to GPS devices

This post proposes a coding system for entering any location on earth with 10m of accuracy using a 10 character code that includes features to prevent errors in entering the code.

The idea is that any one could publish their location by writing something like VUF DDC F8UG. This short code could be entered into a GPS device giving you any spot on the globe.

I'm calling it the SOC: Simple Orientation Code.

Some example uses:
  1. I could print my company's SOC on my business cards and visitors could punch it into their car navigation system and come visit
  2. A restaurant could publish its SOC along with its phone number (after all it's the same length as a phone number so it's something people can easily grok) making the restaurant easy to find
  3. Geocachers could publish SOC trails for people hunting down caches
  4. SCUBA divers could refer to dive sites by their SOC (10m of accuracy is enough surface accuracy for most people)
Here's how the code works.

First you need the latitude and longitude of the location you are talking about to 4 decimal places of accuracy. 4 decimal places gives about 10m of accuracy. So treating latitude as ranging from 0 to 180 degrees (basically change it from -90 to 90 degrees by adding 90) and longitude as from 0 to 360 degrees (ignoring east/west or +/- values) and then treating the two numbers as integers (i.e. take the 4 decimal place latitude or longitude and multiply by 10000) you get two numbers: La and Lo.

La varies from 0 to 1,799,999 and Lo from 0 to 3,599,999. These two numbers can be combined to form a single number that I call P (your position) like this:

P = La * 3600000 + Lo

Extracting the La and Lo from P is simply a matter of dividing P by 3,600,000 (to get La) and calculating the remainder (to get Lo).

P varies from 0 to 6,479,998,200,000 which can be stored in 43 bits.

Now encoding P in some form typeable by a human requires an alphabet. The SOC alphabet consists of the following 32 characters:

ABCDEFGHJKLMNPQRTUVWXY0123456789

This is the standard English alphabet plus Arabic numerals 0 through 9 with the following letters removed: I, O, S, and Z. These are removed because I is easily confused with both 1 and J; O is easily confused with 0; S is easily confused with 5 and Z is easily confused with 2. These characters are removed to ensure that the code is minimally affected by bad handwriting.

Moreover an implementation using the SOC should silently perform the following translations: I becomes 1; O becomes 0; S becomes 5 and Z becomes 2. This way the user will not have to correct a poorly written SOC.

Each character in the alphabet represents a number between 0 and 31.

A(0) B(1) C(2) D(3) E(4) F(5) G(6) H(7) J(8) K(9) L(10) M(11) N(12) P(13)
Q(14) R(15) T(16) U(17) V(18) W(19) X(20) Y(21) 0(22) 1(23) 2(24) 3(25)
4(26) 5(27) 6(28) 7(29) 8(30) 9(31)

P can be encoded using 10 characters from this alphabet. Since each character contains 5 bits of information and only 43 bits are needed for the position that leaves 7 bits for an error checking code. The algorithm used to generate the check digit is a variant of the scheme used for ISBNs.

The 43 bit P is broken into 11 4 bit numbers with a zero padded on the left of P. The 11 numbers are p0 through p10. A check digit C is calculated as follows:

C = ( p0 * 37 + p1 * 31 + p2 * 29 + p3 * 23 + p4 * 17 + p5 * 13 + p6 * 11 + p7 * 7 + p8 * 5 + p9 * 3 + p10 * 2 ) mod 127

C is then appended to P to create the SOC.

Now for some Perl code that implements the coding and encoding of SOCs.

Converting a latitude and longitude to a SOC:

use strict;

if ( $#ARGV != 1 ) {
die "Usage: to-soc ";
}

my ( $lat, $lon ) = @ARGV;

my $alpha = 'ABCDEFGHJKLMNPQRTUVWXY0123456789';
my @alphabet = split(//,$alpha);

$lat += 90;
$lon += 180;

$lat *= 10000;
$lon *= 10000;

my $p = $lat * 3600000 + $lon;

my $soc_num = $p * 128;

my @primes = ( 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 );

my $c = 0;

foreach my $prime (@primes) {
$c += ($p % 32) * $prime;
$p = int($p / 32);
}

$c %= 127;

$soc_num += $c;

my $digits = 10;

my $soc = '';

while ( $digits > 0 ) {
my $d = $soc_num % 32;
$soc = $alphabet[$d] . $soc;
$soc_num = int($soc_num/32);
--$digits;
}

print "$soc\n";

Converting a SOC back to a latitude and longitude:

use strict;

if ( $#ARGV != 0 ) {
die "Usage: from-soc <10-digit-soc>";
}

my $soc = uc($ARGV[0]);
$soc =~ tr/IOSZ/1052/;

my $alphabet = 'ABCDEFGHJKLMNPQRTUVWXY0123456789';

my $soc_num = 0;

foreach my $letter (split(//, $soc)) {
$soc_num *= 32;
$alphabet =~ /(.*)$letter/;
$soc_num += length($1);
}

my $p = int($soc_num / 128);
my $check = $soc_num % 128;

my $lon = $p % 3600000;
my $lat = int($p / 3600000);

$lat /= 10000;
$lon /= 10000;

$lat -= 90;
$lon -= 180;

my @primes = ( 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 );

my $c = 0;

foreach my $prime (@primes) {
$c += ($p % 32) * $prime;
$p = int($p / 32);
}

$c %= 127;

if ( $check != $c ) {
die "Incorrect SOC";
} else {
print "$lat $lon\n";
}

This idea and code is being released by me into the public domain.

Those of you with a twisted mind like to try to find points on the globe that have human-readable SOCs. For example, by picking coordinates that contain a word in the SOC. Challenge: find a location on the blog that's something along the lines of TREASURE or STARTHERE.

Thursday, June 22, 2006

How I love my HP-16C

A while ago I bought an HP-16C calculator on eBay. It wasn't cheap and there was no manual; the calculator itself works fine and is in almost mint condition. Since then I've fallen in love with the device.



You probably think I'm nuts to be using a calculator that was discontinued in 1989 and only 203 bytes of memory. And I had to pay extra to get a PDF version of the scanned original manual.

Perhaps I am crazy, but here's why I love this little machine:

1. RPN. You either love this or hate it. This is my first RPN calculator and for me RPN is the right way to use a calculator. I read a short introduction to RPN tricks (of which there are very few, but filling the stack for repeated operations is one and using LST x to prevent the stack from moving is another).

2. The industrial design of HP calculators is pure art. They are the right size for your hand, the keyboard is clearly marked, keys are spaced far apart (which avoids fat fingers like mine) and the keys give good feedback on being pressed. And the calculator is slightly slanted so that when it's on the desk it's easy to type on.

3. Floating point with fixed display of decimal places. Just right for balancing your check book.

4. Hex/Dec/Oct/Bin modes plus the nice 'show' feature which can display a number in one of the other bases for a few seconds without changing base. Very handy when debugging.

5. And my favorite thing... the HP 16C is 128 mm wide and 79 mm deep. Notice anything interesting? 128 ENTER 79 / is... 1.62. Or the Golden Ratio. No wonder I love that thing so much.

Thursday, April 27, 2006

Free GNU Make documentation

If, like me, you use GNU Make a lot then you should be aware of two really important pieces of documentation for GNU Make that are totally free (speech and beer):

1. The GNU Make Manual. This is the standard manual that comes with GNU Make and is available on the web here: http://www.gnu.org/software/make/manual/make.html

2. Robert Mecklenburg's "Managing Projects with GNU Make". This is the book published by O'Reilly but released under the Free Documentation License. PDFs of each of the sections are available here: http://www.oreilly.com/catalog/make3/book/index.csp

It's great that these two resources are freely available, but don't let that stop you buying them. Supporting the FSF and Mecklenburg with a little cash is a good way of keeping free documents free.

Monday, April 24, 2006

Bayesian Poisoning paper pointers

Over in the POPFile forums there's a lively discussion going on about a potential way of getting spam through POPFile and then corrupting the POPFile user's database to increase the false positive rate.

This particular attack uses common English words and relies on an implementation detail of POPFile: the fact that POPFile counts the number of times a word appears in an email. That detail is a little different from most spam filters that might consider a restricted range of hammy or spammy words. However, the attack described by POPFile user Olivier Guillion probably would work. On the other hand, I don't think we're likely to see it in the field primarily because it only affects POPFile and not other spam filters.

During the discussion I mentioned a number of papers that I thought Olivier should read concerning attacks on Bayesian filters. I then realized that they are not necessarily easily available. So, for the sake of everyone being able to read up quickly on this areas, here's a quick bibliography:

Any I've missed?

Thursday, April 20, 2006

Stateless web pages with hashes

Recently I've been working on a web application that requires some state to be passed between pages. I really didn't want to keep server side state and then give the user a cookie or some other token that I'd have to track in the server side application, age out if discarded etc.

I hit upon the idea of keeping everything in hidden fields passed between page transitions by form POSTs. Of course, the problem with hidden fields is that someone could fake the information and submit a form with their notion of state. For example, if this were a commerce application, someone could alter the contents of their own shopping cart and perhaps even the prices they have to pay.

To get around this problem I include two extra pieces of information in the form: the Unix epoch time when the form was delivered to the user and a hash that covers all the contents of the form. For example, a typical form might look like:
<form action=http://www.mywebsite.com/form.cgi method=POST>
<input type=hidden name=hash value=ff9f5c4a0d10d7ab384ad0f95ff3727f>
<input type=hidden name=now value=1145516605>
<input type=hidden name=cart value=agwiji8973cnwiei938943>
<input type=submit value="Checkout" name=checkout>
</form>
Here the form contains a cart value that is just an encoded version of the contents of the user's cart (note I say encoded and not encrypted; there's no protection inherent in the string encoding the cart contents: it's just safe to be passed in a form).

The time the form was sent to the user is in the now field and is just the Unix epoch time when the page was generated on the server side.

The hash is an MD5 hash of the now, the cart, the IP address of the person who requested the page and a salt value known only to the web server. The salt prevents an attacker from generating their own hashes and hence faking the form values, but it means that the web server can verify the validity of the form.

The now value means that old forms can be timed out just be checking the epoch time against the value in the form. The hashing of the IP address means that only the person for whom the form was generated can submit it.

I'm sure this isn't new to anyone who's written web applications. And it appears that Steve Gibson over at GRC is doing something similar with his e-commerce system and there's apparently something called View State in ASP.

Anyone who is a web expert care to comment?

Wednesday, April 19, 2006

Would you buy a "GNU Make Cookbook" e-book?

I've been thinking about taking all the recipes for GNU Make things that I've written over the years as articles, or blog entries, or answers to people's questions on help-make and writing them up as an e-book for purchase and download from this web site.

Here's a sample recipe in PDF format so that you can see what I'm taking about.



So, the critical questions:

1. Would you buy such a book?
2. If so, how much would you pay for it?
3. What format would be best? PDF?

John.

A small bug fix to my keep state shoehorning

A while back I wrote about a way to shoehorn Sun Make's "keep state" functionality into GNU Make. With a fairly simple Makefile it's possible to get GNU Make to rebuild targets when the targets' commands have changed. I blogged this here and wrote it up for Ask Mr Make here.

One reader was having trouble with the system because every single Make he did caused a certain target to be built. It turned out this was because he'd done something like:

target:
$(call do, commands)

The space after the , and before the commands was messing up my signature system's comparison and causing it to think that the commands changed every time. This is easily fixed by stripping the commands.

Here's the updated signature file (with the changed parts highlighted in blue):

include gmsl

last_target :=

dump_var = \$$(eval $1 := $($1))

define new_rule
@echo "$(call map,dump_var,@ % < ? ^ + *)" > $S
@$(if $(wildcard $F),,touch $F)
@echo [email protected]: $F >> $S
endef

define do
$(eval S := $*.sig)$(eval F := $*.force)$(eval C := $(strip $1))
$(if $(call sne,[email protected],$(last_target)),$(call new_rule),$(eval last_target := [email protected]))
@echo "$(subst $$,\$$,$$(if $$(call sne,$(strip $1),$C),$$(shell touch $F)))" >> $S
$C
endef

Another common thing people have asked for is that the signature system rebuild targets when the Makefile has changed. Currently the signature system cannot spot an edit to the Makefile that changes the commands. It's pretty simple to Make this happen (although this will cause all targets to be built when the Makefile is updated) by adding the following line in new_rule above:

define new_rule
@echo "$(call map,dump_var,@ % < ? ^ + *)" > $S
@$(if $(wildcard $F),,touch $F)
@echo [email protected]: $F >> $S
@echo $F: Makefile >> $S
endef

It's an exercise or the reader to replace Makefile with the actual name of the Makefile that is including active when new_rule is called.

Tuesday, April 18, 2006

Rebuilding when the hash has changed, not the timestamp

GNU Make decides whether to rebuild a file based on whether any of its prerequisites are newer or if the file is missing. But sometimes this isn't desirable: when using GNU Make with a source code control system the time on a prerequisite might be updated by the source code system when the files are checked out, even though the file itself hasn't changed.

It's desirable, in fact, to change GNU Make to check a hash of the file contents and only rebuild if the file has actually changed (and ignore the timestamp).

You can hack this into GNU Make using md5sum (I'm assuming you're on a system with Unix-like commands). Here's a little example that builds foo.o from foo.c but only updates foo.o when foo.h has changed... and changed means that its checksum has changed:

.PHONY: all

to-md5 = $(patsubst %,%.md5,$1)
from-md5 = $(patsubst %.md5,%,$1)

all: foo.o

foo.o: foo.c
foo.o: $(call to-md5,foo.h)

%.md5: FORCE
@$(if $(filter-out $(shell cat [email protected] 2>/dev/null),
$(shell md5sum $*)),md5sum $* > [email protected])

FORCE:

This works because when foo.h was mentioned in the prerequisite list of foo.o it was changed to foo.h.md5 by the to-md5 function. So GNU Make sees the prerequisites of foo.o to be foo.c and foo.h.md5.

Then there's a pattern rule to build foo.h.md5 (the %.md5 rule) that will only update the .md5 file if the checksum has changed. Thus if and only if the checksum has changed does the .md5 file get changed and foo.o rebuilt.

The %.md5 rule is forced to run by having a dummy prereq called FORCE so that every MD5 hash is checked for every prerequisite that GNU Make needs to examine.

First the rule uses a filter-out/if combination to check to see if the MD5 hash has changed. If it has then the %.md5 rule will run md5sum $* > [email protected] (in the example md5sum foo.h > foo.h.md5). This will both update the hash in the file and change the .md5 file's timestamp and force foo.o to build.

If within the rule for foo.o $?, $^ or other automatics that work on the prerequisite list were used these need to be passed through from-md5 to remove the .md5 extension so that the real prerequisite is used in the commands to build foo.o.

In the example this isn't necessary.

If the foo.h.md5 file does not exist then the %.md5 rule will create it and force foo.o to get built.

You can also adapt this tip to work with different definitions of 'changed'. For example, the .md5 file could store the version number of a file from the source control system and rebuilds would only happen when the version had changed.

Friday, February 24, 2006

Shoehorning Keep State into GNU Make

Sun Make has a lovely feature called Keep State: if the commands used to build a target change from build to build the target is rebuilt, even if looking at file time stamps shows that the target is "up to date". Why is this a lovely feature? Because it means that make followed by make DEBUG=1 will do that right thing. In Make's that only check time stamps the make DEBUG=1 would probably report that there was no work to do.

Of course, you can get round these problems if you really try (e.g. for the DEBUG case you could encode the fact that the objects are debug objects in either the name or path and then Make would do the right thing).

A recent post on the GNU Make mailing list got me thinking about this problem again and I've come up with a very simple solution that shoehorns Keep State into GNU Make. There's no code change to GNU Make at all; it's all done with existing GNU Make functions.

Here's an example Makefile that I've modified to rebuild foo.o and bar.o if their commands change

include signature

all: foo.o bar.o

FOO=$(FOOEY)

foo.o: foo.c
$(call do,$$(COMPILE.C) -DFOO=$$(FOO)$$(@F) -o [email protected] $$<)

bar.o: bar.c
$(call do,$$(COMPILE.C) -DBAR=$$(BAR) -o [email protected] $$<)

-include foo.sig bar.sig
There are three modifications from a standard Makefile: firstly there's 'include signature' at the start. (You'll see the definition of signature below), then the commands for each rule have been wrapped in $(call do,...) and any $'s in the commands have been quoted with an extra $. Lastly the Makefile includes a .sig file for each .o being created (if the .sig exists, hence the -include instead of include).

The .sig file is generated by code in signature when a rule is run and is used to perform the 'command has changed' checking that you need. Here, for example, is the contents of bar.sig after make has been run for the first time:

$(eval @ := bar.o)
$(eval % := )
$(eval < := bar.c)
$(eval ? := bar.c)
$(eval ^ := bar.c)
$(eval + := bar.c)
$(eval * := bar)

bar.o: bar.force

$(if $(call sne,$(COMPILE.C) -DBAR=$(BAR) -o [email protected] $<,
g++ -c -DBAR= -o bar.o bar.c),$(shell
touch bar.force))
The first set of lines captures the state of the automatic variables within the rule to make bar.o, the next line says that bar.o depends on a special file called bar.force and lastly there's a rather complex $(if ...) that uses the GMSL (see GNU Make Standard Library) string-not-equal (sne) function to check the current expansion of the commands to make bar.o against the previous expansion. It's this $(if ...) that can detect a change in the commands to run a rule. If such a change is detected bar.force is touched and hence bar.o will be rebuilt because bar.force is newer.

The signature include is where the work is done:

include gmsl

last_target :=

dump_var = \$$(eval $1 := $($1))

define new_rule
@echo "$(call map,dump_var,@ % < ? ^ + *)" > $S
@$(if $(wildcard $F),,touch $F)
@echo [email protected]: $F >> $S
endef

define do
$(eval S := $*.sig)$(eval F := $*.force)$(eval C := $1)
$(if $(call sne,[email protected],$(last_target)),$(call new_rule),$(eval
last_target := [email protected]))
@echo "$(subst $$,\$$,$$(if $$(call sne,$1,$C),
$$(shell touch $F)))" >> $S
$C
endef
I won't go into all the details of how signature works, but essentially the do macro is responsible for updating the .sig files as needed. I'll write this up for my column on CM Crossroads in March, but you can play around with the code (you need the GMSL and GNU Make 3.80 for this to work) and you'll see that changing a parameter does work.

Here's an example of starting from scratch and then changing the values of FOO and BAR in the Makefile above:

$ make
g++ -c -DFOO=foo.o -o foo.o foo.c
g++ -c -DBAR= -o bar.o bar.c
$ make
make: Nothing to be done for `all'.
$ make BAR=bar
g++ -c -DBAR=bar -o bar.o bar.c
$ make BAR=bar
make: Nothing to be done for `all'.
$ make BAR=baz
g++ -c -DBAR=baz -o bar.o bar.c
$ make BAR=baz FOO=foo
g++ -c -DFOO=foofoo.o -o foo.o foo.c
$ make BAR=bar FOO=foo
g++ -c -DBAR=bar -o bar.o bar.c
$ make
g++ -c -DFOO=foo.o -o foo.o foo.c
g++ -c -DBAR= -o bar.o bar.c
The only limitation of this scheme is that if you change the commands in a rule by editing the Makefile you need to do a clean build or at least delete the corresponding .sig file so that it gets remade. (Of course, even that could be worked around by making foo.o and bar.o depend on Makefile)

Friday, January 20, 2006

Deconstructing Sundance with POPFile

The brave folks at Unspam have taken POPFile and years of data surrounding the Sundance Film Festival in attempt to predict the outcome of the 2006 edition. Specifically, they want to get invited to parties, and they've chosen to geek out on data, bend POPFile to their use (boy, that must have been hard work) and try to predict the hits from the festival.

They claim an 81% accuracy over the past festivals and have a nice web site giving their predictions for this year. The predictions also include a breakdown of how the decisions are made indicating the most important (and worst) words to appear in a review of a movie.

There's also movie metadata like the type of film it was shot on, or who reviewed it. That's a very interesting use of POPFile and if they get it right and are invited to all the cool parties I hope they fulfill my wish and put a good word in for me with Neve Campbell.

Tuesday, January 17, 2006

The WMF SetAbortProc problem is not a backdoor

Microsoft recently fixed a problem wherein a Windows MetaFile (WMF) could contain arbitrary code that would be executed when the WMF was "played". The problem was particularly bad because a WMF could be played automatically in Internet Explorer by referencing it in an IFRAME if the Windows Picture and Fax Viewer was installed and registered (which by default on recent versions of Windows it was), because that program would automatically handle WMFs and play them to display them.

Steve Gibson suggested in a podcast, which then became a big news story, that he was convinced that this functionality was in fact an intentional backdoor inserted by Microsoft for their own purposes (or at last for the purpose of a rogue engineer inside the company).

Microsoft responded to that accusation with a blog posting by Stephen Toulouse in which he gave some more details of the problem and essentially said that Gibson was wrong (without naming him).

I've looked carefully at this and am convinced that Gibson is wrong. This is nothing more than a bug caused by the reimplementation of a legacy API. The same bug appears in WINE and was recently fixed.

Gibson based his backdoor claim on two things (both of which he now admits were incorrect): that a special incorrect value was required in one of the metafile fields to get the backdoor to activate and that the code ran in its own specially created thread. In fact, this bug occurs with correct or incorrect values in the field and uses the same thread.

Secondly, it's pretty clear from the code that this occurs because metafiles can contains a special ESCAPE record that allowed them to call into the Windows API called Escape which has a function to perform SetAbortProc. If you take a look at the WINE code you can see how this exploit works there and I bet (because it's the same API) that the same or very similar thing is happening in Windows:

First the file dlls/gdi/metafile.c contains a function called PlayMetaFileRecord with the following signature:

BOOL WINAPI PlayMetaFileRecord( HDC hdc, HANDLETABLE *ht,
METARECORD *mr, UINT handles )

Which is simply WINE's implementation of the same Win32 API (which is documented here: http://msdn.microsoft.com/library/default.asp?url= /library/en-us/gdi/metafile_1yec.asp)

The third parameter (mr) is a METARECORD pointer (a METARECORD is just an entry in the metafile and is detailed here: http://msdn.microsoft.com/library/default.asp?url= /library/en-us/gdi/metafile_8j1u.asp) and is the all important header with the following definition:

typedef struct tagMETARECORD {
DWORD rdSize;
WORD rdFunction;
WORD rdParm[1];
} METARECORD, *PMETARECORD;

With the rdSize being the size of the record in words, the rdFunction being the function and the rdParm the data (which in the case of an exploit would be executable code). PlayMetaFileRecord handles META_ESCAPE like this:

case META_ESCAPE:
Escape( hdc, mr->rdParm[0], mr->rdParm[1],
(LPCSTR)&mr->rdParm[2], NULL);
break;

You'll note that parameter 3 is a pointer into the metafile parameter block, i.e. if executed parameter 3 would execute code in the metafile. Now Escape has implemented like this (dlls/gdi/driver.c):

INT WINAPI Escape( HDC hdc, INT escape, INT in_count,
LPCSTR in_data, LPVOID out_data )

and the SETABORTPROC is handled with the following code:

case SETABORTPROC:
return SetAbortProc( hdc, (ABORTPROC)in_data );

So if you have an ESCAPE/SETABORTPROC record in a metafile then under WINE the AbortProc is set to point into the metafile (since in_data is corresponds to &mr->rdParm[2]).

So it's quite clear from the WINE implementation that this is a way to set a pointer into the metafile for execution. All it would take is that the metafile's AbortProc is called and arbitrary code could be executed.

In WINE at least this looks nothing like an intentional backdoor. It looks more like a bug caused by the fact that Escape is rather powerful and can set a pointer to code.

Now it's possible in WINE (I believe) to force the AbortProc to execute with another ESCAPE record that has NEWFRAME as the function. Again looking at the Escape code you'll see that NEWFRAME has handled like this:

case NEWFRAME:
return EndPage( hdc );

EndPage is a standard GDI function (see here for documentation: http://msdn.microsoft.com/library/default.asp?url= /library/en-us/gdi/prntspol_0d6b.asp). If you take a look at the implementation in WINE you see the following code (dlls/gdi/printdrv.c):

INT WINAPI EndPage(HDC hdc)
{
ABORTPROC abort_proc;
INT ret = 0;
DC *dc = DC_GetDCPtr( hdc );
if(!dc) return SP_ERROR;

if (dc->funcs->pEndPage)
ret = dc->funcs->pEndPage( dc->physDev );
abort_proc = dc->pAbortProc;
GDI_ReleaseObj( hdc );
if (abort_proc &&amp;amp;amp;amp;amp; !abort_proc( hdc, 0 ))
{
EndDoc( hdc );
ret = 0;
}
return ret;
}

Note that this function always called the AbortProc of the DC. So I think a metafile with an ESCAPE/SETABORTPROC followed by ESCAPE/NEWFRAME would in WINE causes arbitrary code execution.

Now if you read this article from MSDN: http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dnargdi/html/msdn_print.asp) you learn the following about the AbortProc
The SetAbortProc function (and the SETABORTPROC escape) sets up what is known as the AbortProc. This AbortProc function resides in the application; GDI calls it during a print job to inform the application of spooler errors and to allow the application to abort the job when desired. GDI calls the AbortProc function with information about why it is being called; this value is either an error code from the spooler or zero, which indicates that the function is being called simply to allow an abort.

The AbortProc function is called routinely during several steps of the printing process:

* After every write to the printer port when printing directly to the printer (no spooling)
* After every write to a file when printing directly to a file (no spooling)
* After every write to the spooler file when spooling
* Periodically when out of disk space for spooling as a result of other spool jobs
* Before playing every metafile record when GDI is simulating banding
* Occasionally from some older printer drivers

When GDI calls the AbortProc function, the application can continue the print job by returning a nonzero value or abort the print job by returning zero.

In there it mentions that the AbortProc will get called when playing a metafile record. Stephen Toulouse all but said that when he said: "The way this functionality works is by registering the callback to be called after the next metafile record is played.".

So my take is that Gibson is wrong, very wrong. This is no backdoor. It's just a side effect of the ESCAPE/SETABORTPROC handling in a metafile and the fact that the metafile processing is calling the AbortProc for you.

I've used Steve Gibson's WMF_dbg.exe and WinDBG to step through the implementation of PlayMetaFileRecord and especially how it handles the Escape function and it appears to be implemented in exactly the same fashion as the equivalent WINE function. Here's my commented disassembly:

77f493fd 8d4b0a lea ecx,[ebx+0xa]

;
; ecx now contains &mr->Param[2]
;
; When entering this structure edi is a pointer to the out parameters
; for the Escape and seems always to be null. So the next instruction
; pushes the last parameter of Escape (LPVOID lpvOutData) as NULL.
;

77f49400 57 push edi

;
; Pushes the LPCSTR lpvInData parameter from ecx which is pointing to
; &mr->Param[2] which is inside the metafile the is being played
;

77f49401 51 push ecx

;
; Now load ecx and eax with the size of the input structure and the
; function number for the escape (int cbInput and int nEscape). In the
; case of a SETABORTPROC nEscape/eax is 9 and is taken from the
; mr->Param[0] and cbInput/ecx is from mr->Param[1]. Note that it
; doesn't matter if cbInput is correct of not for SETABORTPROC.
;

77f49402 0fb74b08 movzx ecx,word ptr [ebx+0x8]
77f49406 0fb7c0 movzx eax,ax
77f49409 51 push ecx
77f4940a 50 push eax

; The final parameter is the original DC passed to PlayMetaFileRecord

77f4940b ff7508 push dword ptr [ebp+0x8] ss:0023:0006ff0c=2621029e
77f4940e e8113c0000 call GDI32!Escape (77f4d024)
77f49413 e981060000 jmp GDI32!PlayMetaFileRecord+0xd19 (77f49a99)

By varying the metafile function and parameters I've verified that all that's happening in PlayMetaFileRecord is that when it encounters an ESCAPE it does the same thing as WINE and extracts cbInput, nEscape and lpvInData from the METARECORD and calls GDI Escape.

Sunday, January 15, 2006

Do spammers fear OCR?

Nick FitzGerald recently sent me two sample spams that seem to indicate that some spammers fear that using images in place of words isn't enough. They've started to obscure their messages to prevent optical character recognition.

The first spam appears to be a scan of a document that's been skewed slightly. Now this could be a simple and bad scan.



But the second is even more interesting. It appears to be perfectly normal:



Until you look at the fact that this was actually constructed using <DIV> tags for layout and the breaks between the lines are in the middle of words. Here Nick has kindly inserted borders showing that the words are broken horizontally and then put back in the right position:



But is anyone doing OCR, or are spam filters getting good enough that the spammers are being really paranoid about what they are sending?

The funny think about the second example is that the URL they include is not obfuscated, is clickable and appears in the SURBL :-) So despite the effort to obscure the content a simple check of the spamminess of the URL gets this email canned.

Monday, January 09, 2006

Python is the new Tcl; Ruby is the new Perl

Given how bad I am at predictions I'd like to give you the following (which you'd better take with a grain of salt): Python is the new Tcl and Ruby is the new Perl. My prediction for 2006 is that most Perl 5 programmers will decide that Ruby looks cool and learn the language and that Python will be a niche language that a die hard core continues to love and embellish.

I think Ruby's the new Perl because of the confluence of three things: Perl 6, RubyGems and Ruby's own Perliness. Firstly, given how hideous Perl is if you're a Perl person (like me) why learn Perl 6 when you could learn Ruby? Ruby's core language is clean, regular expressions are strong and there are many Perl influences (which Ruby is slowly removing to ensure that language doesn't become Perl). Secondly, RubyGems will do for Ruby what CPAN did for Perl.

Python's lack of good package management (if you ignore ActiveState's PyPPM and the recently started Cheese Shop) and idiosyncratic language constraints ("hey we liked signficant white space in Make so much we used it ourselves") make it look like Tcl to me. And where's the killer app? Arguably dynamic web pages were Perl's killer app, and perhaps Rails is Ruby's. Python has... Zope?

All of which leaves me wondering how to complete "PHP is the new..."

(Naturally LISP is the new LISP).