Monday, December 11, 2006

The Midas Number (or why divide by zero?)

A recent BBC news article has led to a storm of commentary on sites like Digg, Reddit and Slashdot about a Reading University lecturer's claim to have 'solved the problem of dividing by zero'.

He makes this claim in a pair of papers: Perspex Machine VIII: Axioms of Transreal Arithmetic and Perspex Machine IX: Transreal Analysis on his personal web site. The papers were or will be published in the Proceedings of the Society of Photo-Optical Engineers, which, I must admit, was not one of the mathematical journals with which I am familiar.

The first paper introduces a formal system which is vaguely related to the standard axioms of arithmetic with which everyone is familiar. However, it would need to be studied as a totally separate system since it has a number of significant differences which I think make its usefulness a little doubtful. For an excellent overview of the major problems read: Open Letter to James Anderson.

Even without delving into which mathematical structures Dr Anderson's formal system fails to be, there are, IMHO, some major problems.

The first and most striking problem is that the 'what you do to one side you do to the other' rule taught to school children does not work. This is caused by a special non-number that Dr Anderson introduces called nullity and written Φ.

For example, it is not the case that if a + b = a + c then b = c. Any school child will be familiar with the idea that you could subtract a from both sides of that equation to reveal that b = c.

Unfortunately, the introduction of Φ means that if a = Φ, b and c can be absolutely anything at all. This occurs because one of the axioms of Dr Anderson's system is that Φ + a = Φ (and addition is commutative).

So if you are going to subtract something from both sides of an equation you need to make sure that it's not Φ. That's a little like the following case in regular arithmetic where you have to ensure that a is not 0: if b / a = c / a then b = c. Under Dr Anderson's system things are even more complex: a must not be 0 or Φ or positive or negative infinity (also non-numbers that Dr Anderson introduces in the first paper with related axioms).

So just regular work with equations gets a little tricky.

The "problem" that Dr Anderson wishes to "solve", it appears, is that nobody (though he seems particularly worried about computers) can divide by 0. This is a well-known fact, or you might think of it as an axiom: you can't divide by 0 or, put another way, the result of dividing by 0 is undefined.

People "solve" this problem when they see a divide by zero by saying "That doesn't work then"; computers "solve" this problem with an exception, or error, and by assigning the result of a divide by zero the special name NaN (which stands for Not a Number). The computer program has to be designed to either never divide by zero (many programs specifically check to see if they are about to and signal an error), or deal with the exception, or sometimes they'll crash.

Dr Anderson's solution is that a computer should be allowed to divide by zero and instead of having an exception, or crashing, etc. you'll get the result Φ. In the BBC article he says:

"Imagine you're landing on an aeroplane and the automatic pilot's working," he suggests. "If it divides by zero and the computer stops working - you're in big trouble. If your heart pacemaker divides by zero, you're dead."

OK, Dr A. so how does Φ solve this?

I can give you the answer right here: it doesn't. And that's because Dr. A's Φ is cancerous. As soon as a variable becomes Φ everything it touches becomes Φ. It's the number equivalent of King Midas: everything it touches turns to Φ.

That's because of two axioms in the first paper: Φ + a = Φ and Φ × a = Φ.

So, basically, instead of getting an exception, or error, Dr. Anderson's arithmetic gets rid of the problem and replaces it with Φ. From a programming perspective that makes no difference: if your auto-pilot suddenly computes that the required speed is Φ or your pacemaker wants Φ beats per minute, it's useless.
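The Midas effect is easy to see in practice. As a minimal sketch, assume IEEE 754 NaN as a stand-in for Φ (Dr Anderson's nullity is not NaN, but NaN follows the same two propagation rules for addition and multiplication). The required_speed function and its numbers are made up for illustration:

```python
import math

PHI = float("nan")  # stand-in for nullity; an assumption, not Anderson's Φ

def required_speed(distance, time_left):
    # Hypothetical autopilot calculation: a "total" division that
    # returns PHI instead of raising when time_left is zero.
    if time_left == 0:
        return PHI
    return distance / time_left

speed = required_speed(1000.0, 0.0)
throttle = 0.5 * speed + 10.0   # PHI × a = PHI, then PHI + a = PHI

print(math.isnan(speed))     # True
print(math.isnan(throttle))  # True
```

Nothing crashed, but every value downstream of the division is just as unusable as the exception would have been.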

The answer is simple: don't divide by zero. It's undefined!

Thursday, December 07, 2006

Bug fix: 12 Tasty Make Recipes Part II

Back in May, 2006 I gave a talk called 12 Tasty Make Recipes Part II in which I talked about a user-defined GNU Make function to recursively search from a directory for a file or set of files.

The function was written like this:

search = $(foreach d,$(wildcard $1/*),$(call search,$d)$(filter $(subst *,%,$2),$d))

Unfortunately, there was a small mistake in the code that appeared in the presentation. However, two bugs collided to cause the mistake to have no effect. In the above function I wrote $(call search,$d) when I should have written $(call search,$d,$2). But since GNU Make had a bug that caused it to not reset $2 (or any other arguments in the form $n) on a nested $(call) and since I was reusing $2 as the second argument, my function worked.

That is, until GNU Make 3.81 was released and fixed bug 1744.

The correct version of search which will work with all versions of GNU Make is:

search = $(foreach d,$(wildcard $1/*),$(call search,$d,$2)$(filter $(subst *,%,$2),$d))
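For readers who don't think in GNU Make, here is a rough Python analogue of the corrected function. It is only a sketch: the name search mirrors the Make function, and the standard-library calls are an assumption about the closest equivalents of $(wildcard) and $(filter). The point to notice is that the pattern is passed explicitly on the recursive call, which is what $(call search,$d,$2) does:

```python
import fnmatch
import glob
import os

def search(directory, pattern):
    """Return paths under directory that match pattern, depth first."""
    found = []
    # Like $(wildcard $1/*): one level of entries, dotfiles skipped.
    for entry in sorted(glob.glob(os.path.join(directory, "*"))):
        # Recurse first, passing the pattern along explicitly.
        found.extend(search(entry, pattern))
        # Like $(filter ...): match the pattern against the whole path.
        if fnmatch.fnmatch(entry, pattern):
            found.append(entry)
    return found
```

Calling search("src", "*.c") would then return every path below src that ends in .c.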

Thanks to Lou Iacoponi for writing in and pointing out my error.

Sunday, December 03, 2006

Two weeks of image spam innovation

Since I last blogged about image spam I've received numerous image spams myself, and reports from Nick FitzGerald, Sorin Mustaca and Nicholas Johnston about interesting image spams that I've just got to see.

On November 14, I received a report of image spams where the background noise had been updated from dots or small lines to polygons. Here's a sample:



A day later it was clear that the same spammer was randomizing the image size, the background noise and the color palette used:



Then on November 17 I was shown this interesting pump'n'dump technique:


83622874400056543 047183602660 41478311028418 100278 84807407350 05087016712772
78810870435635016 71651855827222 4725576405300038 84252840 12157351038885 630188325737443
23414 23133 41104 4312 7131 6341874402 48244 02522 1428 3224
55263 16021 42114 6654 7782 6583 2673 53280 03201 8323 6565
65882 58041 62412 6086607534050781 3826 0641 83176 85045 35427866406418
18506 15283 12474 33388136436533 643542 80456 87628 13156 506577110786230
31645 55602 81036 816525232877 301217585451283 01540 76265 0748 3312
06035 63450 43040 5224 1553 5786434504117661 52402 78534 8836 414
58510 75013 38325 3877 04146 76870 2788 34488 42803 77840 2737 5470
11372 62751178216028 2715 7266 17812 3668 10033 28833425366310 360215874450816
31403 486020746152 7723 4552 46471 6721 77207 683820875035 23460162005106

Nice, but that didn't last very long. On November 22 the innovators were at it again with smaller images containing polygons, lines, random colors and jagged text:



A day later the noise element changed to pixels:



Two days later spammers were trying a little 'old school' image spam with some fonts that they hoped would be hard to OCR.



Strangely the following image spam appeared on November 27 with a perfectly filterable URL. Oops:



And right before the month ended the noise around the border had turned into something like little fireworks:

Thursday, November 16, 2006

Yet more spammer image optimization; this time it's pretty

There's something new in the image-spam wave: pretty colours! This spammer is working hard to randomize his images and avoid OCR. Here's a sample:



And to give you an idea of the randomization here's another:



Thanks to Nick FitzGerald and Sorin Mustaca for samples. Notice how the letters are misaligned both vertically and horizontally to try to avoid OCR, and the background polygons are randomized. Also the aspect ratio and size of the messages have been changed for each image.

Wednesday, November 08, 2006

Ransom note spam

Back in January I added a trick called The Small Picture to The Spammers' Compendium, and in August I updated The tURLing Test trick with an example of its use in image-based spam.

The Small Picture consists of sending individual letter images attached to a message. These letter images are then used to display a message and break up words that the spammer might think a spam filter would find suspicious. Here's an example of The Small Picture where certain letters (look carefully!) are formed using images rather than text:



The tURLing Test consists of disguising a URL by breaking it up and then explaining to the user how to type in the URL, thus proving that a human is reading the spam not a spam filter. This is done with URLs so that URL blacklists are bypassed. Here's an example of that from an image-based spam:



Now comes a combination of the two that deserves the name 'Ransom Note Spam': it combines both The Small Picture (the letters are individual images attached to the spam) and The tURLing Test (the URL is made up of letters in the images):


Monday, October 23, 2006

l8tr.org gets an upgrade

My 'tell me later when this web site is available' server l8tr.org got an upgrade today. There are three things that are officially being released:

1. There's a l8tr.org bookmarklet which you can drag and drop to your toolbar. Just monitor a URL with l8tr.org and you'll be offered the bookmark customized to your email address. (Thanks to Iain Wallace for the code).

2. There's a l8tr.org Firefox extension that makes using l8tr.org a breeze. It's on the main l8tr.org page: click it to install it. Once configured with your email address, a simple right-click on a link you want to monitor gives the option Monitor with l8tr. Click that and l8tr.org starts monitoring the link. (Thanks to Barrett for the code; he gets the $50 bounty).

3. l8tr.org's cache is now working. As well as checking whether a site is available, l8tr.org caches the site's content and offers users both the original URL and the cached version.

In addition much has happened behind the scenes to make sure that site availability is correctly recognized.

Friday, October 20, 2006

Why OCRing spam images is useless

Nick FitzGerald forwards me another animated GIF spam that takes the animation plus transparency trick I outlined in the blog post A spam image that slowly builds to reveal its message to a new level. And it shows why spammers will work around OCR as fast as they can.

Here's what you see in the spam image:



Looks simple enough until you take a look at the GIF file that actually generated what you see. It's animated and it has three frames:





The first image is the GIF's background and is displayed for 10ms then the second image is layered on top with a transparent background so that the two images merge together and the image the spammer wants you to see appears. That image remains on screen for 100,000 ms (or 1 minute 40 seconds). After that the image is completely blanked out by the third frame.

My favourite touch is that it's not the entire image that's transparent, not even the white background, but just those pixels necessary to make the black pixels underneath show through. If you look carefully above you can see that some of the pixels appear yellow (which is the background color of this site) indicating where the transparency is.
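To make the layering concrete, here is a toy model of how a viewer combines the first two frames. It is a sketch with made-up 1-bit "images": None stands for a transparent pixel, and real GIF decoding (disposal methods, palettes, delays) is more involved. Neither frame on its own holds the readable shape; only the composite does:

```python
def composite(background, overlay):
    """Lay overlay on background; None in overlay means transparent."""
    return [
        [bg if ov is None else ov for bg, ov in zip(bg_row, ov_row)]
        for bg_row, ov_row in zip(background, overlay)
    ]

# Hypothetical 3x4 frames: 1 = black pixel, 0 = white, None = transparent.
frame1 = [[1, 1, 1, 1],
          [1, 1, 1, 1],
          [1, 1, 1, 1]]
frame2 = [[0, None, 0, None],
          [0, None, None, 0],
          [0, None, 0, None]]

for row in composite(frame1, frame2):
    print("".join("#" if px else "." for px in row))
```

An OCR engine handed frame1 or frame2 alone sees a solid block or a white sheet with holes, which is why a filter would have to render the animation before trying to read it.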

That is darn clever.

Monday, October 16, 2006

A spam image that slowly builds to reveal its message

Nick FitzGerald sent me a stunning example of lateral thinking on the part of a spammer. The spammer has taken a standard stock pump-and-dump spam image and split it horizontally into strips.



Each of the 17 horizontal strips cuts fairly randomly through the text making OCR on each strip not very useful. The spammer has then mounted each strip in its correct position on a transparent background and put each strip into an animated GIF. Here, for example, are a couple of strips:




The end result is that only once the entire image animation has completed is the complete spam visible making this a challenge for spam filters. And the spammer has thrown in a couple of frames at the end of the image, that get displayed after such a long delay (8 minutes) that they essentially never get shown. But those final frames are there just to throw off a spam filter trying to find the actual image.

Here's what gets displayed:



and here's the final image in the animation:



Very clever! (I'm calling this 'Strip Mining')

Wednesday, October 04, 2006

A peek inside ReadNotify

Recently the service ReadNotify has been in the news as it was used to track emails and documents sent during the recent HP spying scandal. I'd heard of ReadNotify before but never played with it, but since they offer free accounts I signed up and sent myself some emails. Here's what I found inside those messages.

Using ReadNotify couldn't be simpler. Once you've registered your From address with the service you can send email through it by appending .readnotify.com to the email of the person you are writing to. For example, to send a tracked email to me ([email protected]) you'd send it to [email protected]. ReadNotify will add their tracking features to the message and forward it to the real recipient.

To test the service I sent the following email to an email address on Hotmail. The email was sent from my regular email address via ReadNotify. The email was composed in Mozilla Thunderbird which I have configured to send only plain text email. (Throughout this blog post I have obscured details in the messages by replacing private information with XXX or 123).

Original message:

Date: Tue, 03 Oct 2006 13:20:03 +0200
From: John Graham-Cumming <[email protected]>
Reply-To: [email protected]
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040208
Thunderbird/0.5 Mnenhy/0.6.0.104
MIME-Version: 1.0
To: [email protected]
Subject: A test of this email tracking service to a hotmail account
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

I'd like to see how this works.

John.

What Hotmail received:

Received: from esmtp.emsvr.com ([208.185.251.19]) by
bay0-mc3-f7.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2444);
Tue, 3 Oct 2006 04:21:24 -0700
Received: from esmtp.emsvr.com (localhost.localdomain [127.0.0.1])
by esmtp.emsvr.com (8.13.6/8.12.11) with ESMTP id k93BKLB1030009
(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
for <[email protected]>; Tue, 3 Oct 2006 11:20:22 GMT
Received: (from [email protected])
by esmtp.emsvr.com (8.13.6/8.12.11/Submit) id k93BKLoY030003
for [email protected]; Tue, 3 Oct 2006 11:20:21 GMT
Resent-Date: Tue, 3 Oct 2006 11:20:21 GMT
Resent-Message-Id: <[email protected]>
Resent-From: [email protected]
Received: from [66.249.92.168] by emsvr.com [208.185.251.19]
for <[email protected]>
on-behalf-of [email protected]; Tue Oct 3 11:20:19 2006
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.168])
by esmtp (8.13.6/8.12.11) with ESMTP id k93BKDi8029929
for <[email protected]>; Tue, 3 Oct 2006 11:20:14 GMT
Received: by ug-out-1314.google.com with SMTP id t30so548551ugc
for <[email protected]>; Tue, 03 Oct 2006 04:20:07 -0700 (PDT)
Received: by 10.67.121.15 with SMTP id y15mr3639480ugm;
Tue, 03 Oct 2006 04:20:07 -0700 (PDT)
Received: from ?192.168.1.2? ( [10.254.8.232])
by mx.gmail.com with ESMTP id e33sm6037799ugd.2006.10.03.04.20.05;
Tue, 03 Oct 2006 04:20:06 -0700 (PDT)
Message-ID: <[email protected]>
Date: Tue, 03 Oct 2006 13:20:03 +0200
From: John Graham-Cumming <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Usr-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040208
Thunderbird/0.5 Mnenhy/0.6.0.104
To: [email protected]
Subject: A test of this email tracking service to a hotmail account
Sender: John Graham-Cumming <[email protected]>
MIME-Version: 1.0
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Disposition-Notification-To: "them"
<[email protected]>
X-Confirm-Reading-To: [email protected]
Return-Receipt-To: [email protected]
Notice-Requested-Upon-Delivery-To: [email protected]
Errors-To: [email protected]
X-Read-Notification: Courtesy of ReadNotify.com -
http://www.r7vkv5yav10gu1.ReadNotify.com
Return-Path: [email protected]
X-OriginalArrivalTime: 03 Oct 2006 11:21:24.0793 (UTC)
FILETIME=[0FBED290:01C6E6DE]

<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</HEAD><BODY><DIV></DIV><DIV>I'd like to see
how this works.
</DIV><DIV>
</DIV><DIV>John.
</DIV>
<div alt="r7vkv5yav10gu1."><pre> </pre><pre>
<br><Img moz-do-not-send="true" border=0 height=1 width=3 alt=""
lowsrc=""
Src=http://www.r7vkv5yav10gu8.ReadNotify.com/nocache/r7vkv5yav10gu9/footer0.gif>
<Img moz-do-not-send="true" Border=0 Height=1 Width=2 Alt=""
Lowsrc=http://www.readnotify.com/ca/rspr47.gif ><BgSound volume=-10000
Alt='' Lowsrc=""
Src=https://tssls.r7vkv5yav10guv.ReadNotify.com/nocache/r7vkv5yav10guv/rspr47.wav>
</pre><table height=1 width=3 border=0><tr><td
background
=http://0320.185.64275/nocache/r7vkv5yav10guP/rspr47.gif> </td>
</tr></table>
<BODY bgColor="#ffffff;background-image:
url(http://www.r7vkv5yav10gum.ReadNotify.com/lis/r7vkv5yav10guq/rspr74.gif)" bgColor="#FFFFFF">
</div><div><title> A test of this email tracking service to
a hotmail account </title>
<title>&rlm;‏‌‌‎‎‍‍‏‎‏‎

[snipped 10s of lines like this]

&rlm;‎‌‌‎‎‏‏‌‎‏‎‎
<title> A test of this email tracking service to a hotmail account
</title>
</div alt="r7vkv5yav10gu1."></BODY></HTML>

Not only has my little plain text email become an HTML mail but there's a whole lot of additional stuff in the message that enables ReadNotify to track my receipt and opening of the message.
  1. The message headers contain no fewer than six different requests that receipt of the message be reported back to ReadNotify. Specifically, they include the headers Disposition-Notification-To, X-Confirm-Reading-To, Return-Receipt-To, Notice-Requested-Upon-Delivery-To, Errors-To and X-Read-Notification. All of these go to the address [email protected] where the [email protected] is my obscured email address and the ddntqqiabybpiic is a unique string generated for just this message.

  2. That same unique address also appears in the Return-Path and Resent-From headers. All these headers mean that ReadNotify can watch the progress of my message as it passes from server to server, because the servers will be checking information from these headers, thus acting as beacons showing which IP addresses looked at the message.

  3. The message body contains four separate web bugs using a standard image, a background sound, a background image on a table and a background image on the body using CSS.

    The standard image is <Img moz-do-not-send="true" border=0 height=1 width=3 alt="" lowsrc="" Src=http://www.r7vkv5yav10gu8.ReadNotify.com/nocache/r7vkv5yav10gu9/footer0.gif> where the r7vkv5yav10gu8 is unique to this message.

    The background sound is <BgSound volume=-10000 Alt='' Lowsrc="" Src=https://tssls.r7vkv5yav10guv.ReadNotify.com/nocache/r7vkv5yav10guv/rspr47.wav>. Notice the volume being set to -10000 so that there's no sound at all and the same unique string in the path to get the sound.

    The table contains a <td> tag with a background image using the same unique string: <td background=http://0320.185.64275/nocache/r7vkv5yav10guP/rspr47.gif>

    Finally, the same unique string appears in the <body> tag using CSS: <BODY bgColor="#ffffff;background-image:url(http://www.r7vkv5yav10gum.ReadNotify.com/lis/r7vkv5yav10guq/rspr74.gif)" bgColor="#FFFFFF">

  4. Finally, there's that large block of stuff at the end written using HTML entities. In fact it consists of precisely four different invisible HTML entities repeated over and over again: &rlm; (right-to-left mark), &lrm; (left-to-right mark), &zwnj; (zero-width non-joiner) and &zwj; (zero-width joiner). There's clearly a pattern there, but I'm not sure of its purpose; perhaps it's yet another unique identifier on the message.
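One detail in the table web bug is worth decoding: the host 0320.185.64275 is not a hostname but an IPv4 address written in the old inet_aton style, where a leading 0 means octal and the final part can cover more than one byte. A minimal sketch of the decoding follows; the function name is made up for illustration:

```python
def decode_legacy_ipv4(text):
    """Decode an inet_aton-style IPv4 address into dotted-quad form."""
    def parse(part):
        if part.lower().startswith("0x"):
            return int(part, 16)      # hexadecimal part
        if len(part) > 1 and part.startswith("0"):
            return int(part, 8)       # leading zero means octal
        return int(part)              # plain decimal

    parts = [parse(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four parts")
    *head, last = parts
    spare = 4 - len(head)             # bytes covered by the last part
    if any(not 0 <= h <= 255 for h in head) or not 0 <= last < 256 ** spare:
        raise ValueError("part out of range")
    value = last
    for i, h in enumerate(head):
        value |= h << (8 * (3 - i))
    return ".".join(str((value >> shift) & 255) for shift in (24, 16, 8, 0))

print(decode_legacy_ipv4("0320.185.64275"))  # 208.185.251.19
```

That is the same address as esmtp.emsvr.com in the Received headers above, so this web bug reports to the same server while dodging any filter that only looks for the ReadNotify domain name.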

It's also possible to send the message via .silent.readnotify.com. I tried that too, with the same message. The only differences are that the return receipt headers are missing (which means that the person receiving the message will not be notified by their mail client of a return receipt) and that the entire message had been base64 encoded (I wonder why? I assume ReadNotify is trying to hide something from either a mail server or mail client). Decoding the message revealed that it contained essentially the same HTML as above with a different unique string (since this was a different message).

Going over to the ReadNotify UI shows the two messages that I sent and when they were last opened.



Clicking on one of the messages gives details of when and where the message was opened. The physical location was absolutely correct.



The company can also track attachments such as Microsoft Word documents and PDF files with similar accuracy.

Introducing l8tr.org

My latest little venture is a free service called l8tr.org. It's for all those times when you want to visit a web page but can't because the web page is running too slowly, or is completely overloaded, or you are in the middle of your work day and the page is NSFW.

Just type in the URL of the web page, and your email address. l8tr.org will check the availability of the web site and once it becomes available you'll receive an email.

That means there's no need to be frustrated when you can't get to a web site. l8tr.org will watch it for you and send you a simple email reminder.

Monday, October 02, 2006

Ye Olde OCR Buster

Regular spam-correspondent Nick FitzGerald writes with an example of a spam that he believes is trying to get around both hash busting and OCR in an image.



The image has random dots in the bottom left hand corner to mess up hashing of the GIF itself, and the fonts used are badly rendered unusual fonts.

Friday, September 22, 2006

A Test::Class gotcha

I'm working on a project that involves building a prototype application in Perl. I've made extensive use of Perl's OO features and have a collection of classes that implement the mathematical calculations necessary to drive the web site running the application. Naturally, as I've been building the classes I've been building a unit test suite.

Since Test::Class is the closest thing Perl has to junit or cppunit I'm using it to test all the class methods in my Perl classes. Everything was looking good until I told the guy writing the server to integrate with my code. His code died with an error like this:
Can't locate object method "new" via package "Class::A" (perhaps you
forgot to load "Class::A"?) at Class/B.pm line 147.
Taking a quick look inside Class::B revealed that it did try to create a new Class::A object and that, sure enough, there was no use Class::A; anywhere in Class::B. Easy enough bug to fix, but what left me scratching my head was why the unit test suite didn't show this.

For each class I have an equivalent test class (so there's Class::A::Test and Class::B::Test) which are loaded using a .t file which in turn is loaded with prove. The test classes all use Test::Class.

The classes are tested with a Makefile that does the following:
test:
	@prove classes.t
And classes.t consists of:
use strict;
use warnings;

use Class::A::Test;
use Class::B::Test;

Test::Class->runtests;
Since the test suite for Class::A does a use Class::A; and the test suite for Class::B does a use Class::B; and the two test suites are loaded using use in classes.t, both Class::A and Class::B are loaded before running the tests. This means that the fact that use Class::A; was missing from Class::B is masked in the test suite.

The solution is to have two .t files, one for each class, so that only the class being tested is loaded. So I dumped classes.t and created class_a.t and class_b.t as follows:
use strict;
use warnings;

use Class::A::Test;

Test::Class->runtests;
and
use strict;
use warnings;

use Class::B::Test;

Test::Class->runtests;
and the Makefile is changed to do:
test:
	@prove class_a.t class_b.t
This now works correctly. The missing use Class::A; causes a fatal error in the test suite.

Thursday, September 21, 2006

BOUTS: Best of UseTheSource

Long, long ago, OK, 1999, I registered the domain name usethesource.com (as in Use the source, Luke!) and used it to start a web site which would these days be called a blog. The site was powered by Slashdot's code, Slashcode, and featured a mix of my commentary on the news and original articles. You can still read the old site at archive.org. The site even got me an appearance with Leo Laporte on The Screen Savers.

Most of the articles published are irrelevant today. The commentary is often on start ups that have fizzled out long ago, and I shut down the site in 2004. But some of the articles are worth repeating. So, from time to time, I'll be republishing original pieces from UseTheSource as BOUTS entries in this blog.

To get things rolling here's an article I wrote back in 2002 about calculating the area of an annulus based on the length of a tangent.

Originally published June 12, 2002

Take a look at the following shape. It's an annulus: two concentric circles, something like a simple washer or donut.



Imagine that you know only one fact about this shape, the length of a tangent of the inner circle where it touches the edges of the outer circle. Call that length x. Can you calculate the area of the yellow shaded part?



This problem was presented on the NPR radio show CarTalk a few weeks back and after I solved it I realized that there were a couple of interesting ways of calculating the area. Both require knowledge of the formula for the area of a circle: πr², where r is the radius of the circle. One requires remembering Pythagoras' Theorem, the other a little logical reasoning.

Solution by logical reasoning

Insight: there must be many such concentric circles where it's possible that the tangent has length x.

In fact if I start with the small circle in the middle it must always be possible to choose the size of the outer circle so that the tangent is x.



So what if I make the inner circle have size zero? Then all I need is an outer circle with diameter x.



Since we know there's only one solution (surely the person posing this question knew that there was only one solution), then we can just calculate the area of the outer circle when the inner circle has zero radius.

The outer circle in that case has diameter x or radius x/2 and so the area is π(x/2)² or πx²/4.

Solution by Pythagoras

To calculate the area of the annulus we need to calculate the area of the big circle and subtract the area of the small circle. If we name the radius of the big circle r and the radius of the small circle s then we need to calculate πr² - πs² or π(r² - s²). Hmm. That r² - s² bit looks a lot like something we might get from Pythagoras' Theorem (the square on the hypotenuse is equal to the sum of the squares on the other two sides).



For Pythagoras we need a right-angled triangle. Lo and behold, we have one. Since we have a tangent we know it's at right angles to a radius of the inner circle. The complete triangle has sides r, s and x/2.



So run through Pythagoras on this triangle and we get r² = (x/2)² + s². Subtract s² from both sides and you've got r² - s² = (x/2)². Now we know how to calculate r² - s²: it's (x/2)², and so the area of the annulus is π(x/2)² or πx²/4.
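Both solutions say the area depends only on x, and that is easy to check numerically. The sketch below (the function name and sample values are made up for illustration) fixes x, tries several inner radii s, derives r from Pythagoras and computes the area the long way:

```python
import math

def annulus_area(x, s):
    """Area of the annulus with inner radius s and tangent chord length x."""
    r = math.hypot(s, x / 2)          # Pythagoras: r² = s² + (x/2)²
    return math.pi * (r * r - s * s)  # big circle minus small circle

x = 6.0
for s in (0.0, 1.0, 10.0, 250.0):
    print(s, annulus_area(x, s))

print("pi * x^2 / 4 =", math.pi * x * x / 4)
```

Every line should agree with πx²/4 up to floating-point rounding, including the s = 0 case used in the logical reasoning solution.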

iPod nano in diagnostic mode

This morning I was running out to do some shopping. I grabbed my iPod nano... when I slipped it out of its case it was somehow in diagnostic mode. Here's what it looked like:



This text interface includes a text scroll bar on the right hand side with a $ sign indicating the current position. The menu is scrolled with the << and >> keys. Clicking on the FiveInOne test reveals:



Great, SDRAM is OK and I have a 3.81GB HD. Asking for HD information from the menu reveals that there's a FAT-32 partition on the device, and shows the ID number of the drive:



And then I got just too excited on seeing the screen pattern test and couldn't hold the camera still:



More on this here.

Wednesday, September 20, 2006

Watching a phishing attack live

Yesterday a phishing mail for a community bank in a US east coast state (throughout this blog post I have obscured many details including names, domains and IP addresses) slipped through GMail's spam/phish filter and then right through POPFile. Only Thunderbird bothered to warn me that it might be a scam.

The message itself was sent from an ADSL connected machine in China.

Of course, since I don't have an account with this bank it was an obvious phish, but I was curious about it so I followed the link in the message.

The link appeared to go to https://*****bank.com/Common/SignOn/Start.asp but actually went to http://***.***.164.158:82/*****bank.com/Common/SignOn/Start.html. Clearly a phish running on a compromised host.
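This mismatch between what a link shows and where it goes is something a mail client can test for, and is presumably the kind of check behind a scam warning. Here is a minimal sketch using only the Python standard library. The class and function names are made up, and since the real details are obscured in this post, bank.example and 192.0.2.158 are placeholder hosts:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkMismatchFinder(HTMLParser):
    """Collect links whose visible text is a URL on a different host."""
    def __init__(self):
        super().__init__()
        self.mismatches = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            shown = "".join(self._text).strip()
            # Only compare when the visible text itself looks like a URL.
            if shown.startswith(("http://", "https://")):
                if urlsplit(shown).hostname != urlsplit(self._href).hostname:
                    self.mismatches.append((shown, self._href))
            self._href = None

def find_mismatches(html):
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    return finder.mismatches

# Placeholder link shaped like the one in the phishing mail.
sample = ('<a href="http://192.0.2.158:82/bank.example/Common/SignOn/Start.html">'
          'https://bank.example/Common/SignOn/Start.asp</a>')
print(find_mismatches(sample))
```

A link whose text and target share a host produces no entry, so the check stays quiet on ordinary mail.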

A reverse DNS lookup on the IP address of the host revealed that the phish was being handled by a web server installed in a school in a small central Californian town. The machine appeared to be running IIS, but the phishing server identified itself (on port 82) as Apache/2.0.55 (Win32) Server.

The Start.html page was identical to the actual sign on page used by the bank. In fact taking a screen shot of the real page and doing a screen shot of the phishing page revealed that they were identical. Even the MD5 checksum of the images was the same. Naturally, not everything was the same in the HTML.

Although almost all the HTML was identical (with the phishing site even pulling its images off the real bank's site), the name of the script that handled validation of the user name and password had been changed from SignOn.asp (the actual bank uses ASP) to verify.php (the phisher used PHP).

The only significant diff between the phisher site and the real site is:

272c272
< <form action="verify.php" method="post" id="form1" name="form1">
---
> <form action="SignOn.asp" method="post" id="form1" name="form1">

Once a username and password were entered, the phishee was taken to a page asking for name, email address, credit card number, CVV2 number and PIN (with the PIN asked for a second time for validation). After that the user was thanked for verifying their details.

The user name, password, credit card number, CVV2 number and PINs were saved to a file called red.txt in the same directory as the HTML and PHP files used to make the phishing site. How do I know that? Simple, by popping up one level in the phishing URL to http://***.***.164.158:82/*****bank.com/Common/SignOn/ I was able to get a directory listing. In the directory there were three HTML files, two PHP scripts and red.txt. Clicking on that file gave me access to the phished details as they came in.

I quickly informed the bank and US CERT of the phishing site. I tried to figure out how to contact the school, but it was 0500 in California.

Here's a sample entry from the actual log file.

###################################
Tue Sep 19, 2006 5:33 am
Username: youare
Password: stuipd
***.***.118.70
###################################
Tue Sep 19, 2006 5:34 am
cc: 4111111111111111
expm: 10
expy: 2006
cvv: 321
pin: 1122
pin2: 1122
***.***.118.70
###################################

The time is local to California and you can see the details that the person entered. Here clearly a vigilante has decided to mess with the phisher by entering bogus details. In fact, the last time I was able to access the site (before it was pulled down) there were 33 entries in the log file. Of these 32 contained nothing, or offensive user names and passwords.

But one seemed to contain legitimate information.

The log file had a first entry at 0454 California time from a machine owned by MessageLabs (I assume that they are doing some automated testing of phishing sites); the last entry was at 1226 California time.

The one legitimate entry contained a valid Visa card number (valid in the sense that the number validated against the standard Luhn check digit algorithm). Also the user name and password looked legitimate and a quick Google search revealed that the username was also used as part of the email address of a small business in the same town as one of this small bank's branches. It looked very likely that this entry was legitimate and the person had given away their real card number and PIN.
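For reference, the Luhn check is only a few lines. This is a minimal sketch of the standard algorithm (the function name is made up): starting from the rightmost digit, double every second digit, subtract 9 from any result above 9, and require the total to be a multiple of 10.

```python
def luhn_valid(number):
    """Return True if the digits in number pass the Luhn check."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:        # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0

print(luhn_valid("4111111111111111"))  # True
print(luhn_valid("4111111111111112"))  # False
```

Note that the vigilante's 4111111111111111 from the sample log entry passes too. It is a well-known test number, so passing Luhn only says a number is well formed, not that it belongs to a real card; the matching user name and town were what made the one entry look legitimate.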

US CERT quickly responded with an auto-response assigning me an incident number and I received an email from the bank's IT Ops Manager Jack. Jack told me that he was already aware of the site and that this was the third time this little bank had been phished from machines in California and Germany. I gave Jack the name of the school in California, and he said he'd get in contact with them (he'd already called the FBI). I also told Jack about the one card number that looked totally legitimate; he told me he was in charge of all card operations at the bank and had the power to deal with it.

Some hours after that the site went offline.

Friday, September 15, 2006

Image spam filtering BOF at Virus Bulletin 2006 Montreal

I'm leading a BOF meeting at Virus Bulletin 2006 in Montreal next month. The idea is to get together in one room for a practical, tactical meeting to share experiences on how people are currently filtering image spam and what might be done in future (and what we expect spammers to do). I've already got commitments from major anti-spam vendors to be there and talk (as much as they are permitted) about their approach and I'll try to cover what the Bayesian guys are doing.

If you are interested please email me, or post a comment here. If you represent a vendor and want to be involved I'm especially interested to hear from you as I want to get all experiences out on the table (as much as is practical).


Date and Time Confirmed: Thursday, October 12. 17:40 to 18:40 in the Press Room.

Downloadable PDF flyer.

Thursday, September 14, 2006

A C implementation of my simple GPS code

Reader Chris Kuethe wrote in with a version of my simple code for entering latitude and longitude to GPS devices written in C (my demonstration code was in Perl).

Seems Chris is a bit of a GPS fanatic and maintains a page on GPS hackery.

He ported my Perl code to C and is releasing the code freely. He gave me the choice of releasing it under a two-clause BSD license or making it public domain. I think the most generous option is public domain (especially since the Perl code was public domain).

Here's the code to compute a SOC:

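#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv){
    int i, j;
    unsigned long long lat, lon, c, p, soc_num;
    char soc[11];
    const char *alpha = "ABCDEFGHJKLMNPQRTUVWXY0123456789";
    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 };
    double f;

    if (argc != 3){
        printf("Usage: %s <lat> <lon>\n", argv[0]);
        exit(1);
    }

    /* shift latitude to 0..180 and scale to units of 0.0001 degrees;
       a double plus rounding avoids off-by-one results from truncation */
    sscanf(argv[1], "%lf", &f);
    lat = (unsigned long long)((f + 90.0) * 10000.0 + 0.5);

    /* same for longitude, shifted to 0..360 */
    sscanf(argv[2], "%lf", &f);
    lon = (unsigned long long)((f + 180.0) * 10000.0 + 0.5);

    /* combine both into the single position number P */
    p = lat * 3600000 + lon;
    soc_num = p * 128;    /* leave the low 7 bits free for the check value */

    /* weighted sum over the base-32 digits of P */
    c = 0;
    for (i = 0; i < (int)(sizeof(primes)/sizeof(primes[0])); i++){
        c += ((p % 32) * primes[i]);
        p /= 32;
    }

    c %= 127;
    soc_num += c;

    /* emit ten base-32 digits, most significant first */
    for (i = 9; i >= 0; i--){
        j = soc_num % 32;
        soc[i] = alpha[j];
        soc_num /= 32;
    }
    soc[10] = '\0';

    printf("%s\n", soc);
    return 0;
}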

And to compute latitude and longitude from a SOC:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv){
    int i, j, c, k;
    unsigned long long p, soc_num;
    const char *alpha = "ABCDEFGHJKLMNPQRTUVWXY0123456789";
    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 };
    double lat, lon;

    if ((argc != 2) || (strlen(argv[1]) != 10)){
        printf("Usage: %s <10-digit-SOC>\n", argv[0]);
        exit(1);
    }

    /* rebuild the 50 bit number from the ten base-32 digits */
    soc_num = 0;
    for (i = 0; i < 10; i++){
        c = (char)argv[1][i];
        c = c & 0xff;
        c = toupper(c);
        /* silently translate the letters that are not in the alphabet */
        switch(c){
        case 'I': c = '1'; break;
        case 'O': c = '0'; break;
        case 'S': c = '5'; break;
        case 'Z': c = '2'; break;
        default: ;
        }
        for (j = 0; j < (int)strlen(alpha); j++)
            if (c == alpha[j]){
                soc_num = (soc_num * 32 + j);
            }
    }

    /* split off the position P and the 7 bit check value */
    p = soc_num / 128;
    k = soc_num % 128;

    lon = ((p % 3600000) / 10000.0) - 180.0;
    lat = ((p / 3600000) / 10000.0) - 90.0;

    /* recompute the check value from P and compare */
    c = 0;
    for (i = 0; i < (int)(sizeof(primes)/sizeof(primes[0])); i++){
        c += ((p % 32) * primes[i]);
        p /= 32;
    }

    c %= 127;
    if (c != k)
        printf("warning: checksum mismatch - %d %d\n", c, k);
    printf("%0.4f %0.4f\n", lat, lon);
    return 0;
}

Thanks Chris!

Update: Chris writes to say that B1NLADEN02 can be found in Antarctica: -76.7847/-106.0187 and JIMMYHOFFA is here: -23.3433/-61.6087.

Wednesday, September 13, 2006

Apologia: Sophos and SoftScan

After reading all the blog posts, mailing list and personal mail concerning my post yesterday (Did SoftScan, Sophos and Panda rip off my blog?) I think I need to apologize to two of the companies involved.

As I mention in the updated post, both SoftScan and Sophos explain that it's a coincidence, and since I have no evidence that they copied stuff from this blog (even though it appeared on the front page of Slashdot before their PR), I think I owe them an apology. It probably would have been prudent of me to restrict yesterday's posting to just Panda and ignore SoftScan and Sophos.

*sigh*

*bows head in shame*

Tuesday, September 12, 2006

Did SoftScan, Sophos and Panda rip off my blog? (Update: SoftScan and Sophos say 'no')

This morning I saw a news article about subliminal spam messages on ZDNet. I was intrigued to read about it because a few days ago Nick FitzGerald wrote to me with an example spam that he dubbed 'subliminal'. I wrote back and told him I was going to blog about it and he said go ahead.

The blog post is Subliminal advertising in spam? and was posted on Monday, September 4, 2006. That same day Slashdot picked up my blog post here. Later it was also picked up by Digg.

So I was a little surprised that the ZDNet article didn't mention Nick, me, my blog, Slashdot, or Digg. In fact, the article contains a link to Panda's press release on the subject: PandaLabs detects a new spam technique in which they state "PandaLabs has detected a spam message that uses subliminal advertising techniques.". No mention of this blog anywhere there either, but there are two images of such a spam, both of which I believe were lifted directly from my blog without attribution. The press release is dated the day after my post/Slashdot headline: Tuesday, September 5, 2006.

Here are the images side by side for comparison


Image from my blog post


Image from Panda's press release (local archive of the image)

And I named my image sub2.gif when I extracted it from the spam, and Panda named the same image sub2.gif. The MD5 checksum of my image is 9cace353b2d8b2db1d8868c07986f768 and the Panda image has the checksum 9cace353b2d8b2db1d8868c07986f768. I also thought the original was a bit large for my blog so I reduced it from 603x451 to 302x226; the Panda image has the same reduced dimensions. Hmm. Exactly the same image.

The other image in the press release is also, I believe, from my blog:


Image from my blog post


Image from Panda's press release (local archive of the image)

Once again, I named my image sub3.gif when I extracted it from the spam, and Panda named the same image sub3.gif. The MD5 checksum of my image is 6e16df2d3b67a7578ca7b09f0ccb9fc1 and the Panda image has the checksum 6e16df2d3b67a7578ca7b09f0ccb9fc1. Again I thought the original was a bit large for my blog so I reduced it from 603x451 to 302x226; the Panda image has the same reduced dimensions. Hmm. Exactly the same image, again.

So it looks a lot to me like Panda heard about my blog post (perhaps through Slashdot) and then passed Nick's example off as their own research. Of course, it's possible that Panda, the day after my blog post, independently found the same thing, named it subliminal spam, named the frames within the gif the same thing as me, extracted them from exactly the same spam image (which they managed to capture even though spammers are adding random noise so that hashing is impossible) and issued their press release.

On Wednesday, September 6, 2006 (two days after my blog post/Slashdot headline) Sophos put out a press release Spammers use subliminal messages in latest pump-and-dump scams in which they state: "Experts at SophosLabs™, Sophos's global network of virus, spyware and spam analysis centers, have identified a "pump-and-dump" stock spam campaign which uses an animated graphic to display a "subliminal" message to potential investors."

Once again the release doesn't mention me, Nick, this blog, Slashdot, Digg, ... It too includes an image that appears to be from the same spam campaign I was blogging about (a pump and dump for the stock TMXO), but there's no image borrowing here. The image is from the same campaign but different, and they no doubt didn't borrow any images from me.

Clearly, Sophos could have seen the same spam campaign as Nick and me, come to the same conclusion and called it 'subliminal' spam.

On Thursday, September 7, 2006 it appears that SoftScan got into the game too. They are mentioned in this article where it's written: "SoftScan's analysis of the latest pump-and-dump scam has discovered that an image appears for a split second every so often in the email with the word 'buy' repeated several times."

Disclaimer: I can't prove that any of these companies saw my blog post on Slashdot and then issued press releases, but the timing is interesting: my blog post comes first, followed by press releases and articles using either the same image or the same campaign, and all calling it 'subliminal spam'. Perhaps 'subliminal' spam was an obvious name, and I'm crazy, but...

An offer: on the other hand, if any company would like free rein to pass off things on my blog as their own work I have a simple offer for you: give me a small stock option in your company, call me a 'technical advisor' or similar, and feel free to take what you want from here.

UPDATE: SoftScan's Corporate Communications Manager Bo Engelbrechtsen comments below (see comments section) that they independently found this, and had never heard of this blog before.

UPDATE: In a private email a Sophos employee I know well says: "I personally alerted Sophos's PR team about this spammer trick [...] The word "subliminal" was the first thing that came to my mind when I saw it. [...] I don't read John's blog and am very disappointed with this insinuation. We receive millions of spam e-mails to our traps every day, many of which get analyzed and looked at by spam analysts around the world. We don't need to steal someone else's story..."

Wednesday, September 06, 2006

Slashdot effect = 3.5 * Digg effect

On Monday a post on this blog was on the front page of Slashdot and then on Tuesday the same link made it to the front page of Digg. Since my blog has Google Analytics enabled this gives me an unprecedented opportunity to measure the number of visitors from each site for the same story.

Here are the referrer stats for the period:

slashdot.org (45,473)
digg.com (13,009)
webwereld.nl (1,975)
fayerwayer.com (1,197)
fark.com (988)
sensibleerection.com (248)

So Slashdot brought in 45,473 unique visitors and Digg 13,009. That means the posting on Slashdot was worth 3.5 times as many visitors as Digg.

There's one big caveat, which means that the Slashdot effect might be bigger than stated here. Monday was Labor Day in the US with a lot of people taking time off. Perhaps Slashdot's readership was lower on Monday than normal, meaning that the Slashdot effect is more than 3.5 times the Digg effect.

Monday, September 04, 2006

Optimal SMS keyboard layouts

One of the things I find very frustrating about typing SMS messages on my phone is that I often find that the next letter I want to type is actually on the same key that I just pressed. And that slows me down because either I wait for the timeout, or I click the right arrow key to move on.

For example, here's a standard keyboard on cell phones:

      abc   def
 1     2     3

ghi   jkl   mno
 4     5     6

pqrs  tuv   wxyz
 7     8     9

Very common English letter pairs such as 'ed' and 'on' appear on the same key meaning that if you need to type one of these you are going to incur the cost of dealing with the 'next letter is on same key' problem. In addition, the most common English letters are more than one click away; the most common English letter 'e' is two clicks, 'o' is three, 'n' is two, etc.

What you really want is a keyboard layout where the most common letters are as few clicks away as possible, and where the common letter pairs are on different keys so that you can maximize typing speed. And, if possible, the layout should be as close as possible to the current one so that it's easy to learn.

There are some people who propose squeezing QWERTY into the current keyboard. This ends up with a key that starts with 'q' and another that starts with 'z': two of the least common letters are given pride of place on the keyboard.

Others propose using dictionaries. I think the fastest typing would be on an intelligently laid out keyboard without the need for a dictionary.

I took the 1000 most common words in English as a test set and performed three keyboard optimizations: one by hand and two using different sets of common letter pairs. Each layout received a score equal to the number of clicks required to type all 1000 words. A single press of a key was worth one click (so typing 'a' is one click, typing 'q' is two, etc. on the standard keyboard), and the cost of handling the 'next letter is on same key' case was set to the equivalent of two clicks.
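The scoring rule can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the program used for the figures in this post: the names word_clicks, find_key, STANDARD_KEYS and HAND_KEYS are made up for the example, each word is scored on its own, and characters that are not on the keypad are simply skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SAME_KEY_PENALTY 2  /* cost of the 'next letter is on same key' case */

/* Letters on keys 2 to 9 of the standard layout and of the hand-optimized one. */
const char *const STANDARD_KEYS[] =
    { "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz" };
const char *const HAND_KEYS[] =
    { "adc", "efb", "igj", "hlk", "omz", "srpq", "tuv", "nwyx" };

/* Finds the key that carries ch. Returns the key index and stores the number
   of presses needed in *clicks, or returns -1 if no key carries ch. */
static int find_key(const char *const keys[], size_t nkeys, char ch, int *clicks)
{
    size_t k;

    for (k = 0; k < nkeys; k++) {
        const char *pos = strchr(keys[k], ch);
        if (pos != NULL) {
            *clicks = (int)(pos - keys[k]) + 1;
            return (int)k;
        }
    }
    return -1;
}

/* Number of clicks needed to type one word on the given layout. */
long word_clicks(const char *const keys[], size_t nkeys, const char *word)
{
    long total = 0;
    int prev_key = -1;
    const char *p;

    assert(keys != NULL && word != NULL);
    for (p = word; *p != '\0'; p++) {
        int clicks = 0;
        int key = find_key(keys, nkeys, (char)tolower((unsigned char)*p), &clicks);

        if (key < 0) {          /* not a letter on the keypad: skip it */
            prev_key = -1;
            continue;
        }
        if (key == prev_key)    /* must wait or press the arrow key first */
            total += SAME_KEY_PENALTY;
        total += clicks;
        prev_key = key;
    }
    return total;
}
```

With these rules 'on' costs 7 clicks on the standard layout (three for 'o', two for 'n' and two for the same-key penalty) but only 2 on the hand-optimized layout shown further down.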

I used letter and letter pair frequency information from the excellent book Cryptanalysis. And of course I wrote some code to perform the layout of the keyboards optimizing for the least number of clicks per common letter and the least number of same key clicks for letter pairs.

The standard keyboard layout gets a score of 12,447 clicks.

The following machine generated layout can be used to type the same words in 8757 clicks (70.35% of the clicks of the standard keyboard):

      acb   euwj
 1     2     3

ipg   hmx   olv
 4     5     6

sfq   tdyz  nrk
 7     8     9

This keyboard doesn't look anything like the original keyboard so I then used a shorter list of letter pairs and hand optimized the keyboard to balance clicks and similarity. The result is 8,912 clicks (or 71.6% of the standard keyboard) and a nice layout:

      adc   efb
 1     2     3

igj   hlk   omz
 4     5     6

srpq  tuv   nwyx
 7     8     9

Now, if only there were a way to get that on my RAZR I could save nearly 30% of my typing time.

Subliminal advertising in spam?

Nick FitzGerald sent me a great example of subliminal advertising in a spam message. At least that's what he thinks the spammer might have been up to. The spam contains an animated GIF with four frames. One of the frames (which contains the actual spam message) remains visible for 17 seconds. The other three frames are displayed for 10ms or 40ms, and each of those contains a little random noise and the word BUY in random positions.

Was the spammer really hoping to make us fall for his pump and dump scam with a quick flash of BUY on screen?

Here's the actual GIF with the animation in place (watch out you might be forced to BUY :-)



And here are the four separate frames:

10ms
17s
40ms
40ms

Friday, September 01, 2006

The hell of Dell France

Last October I started a company in France. The French government kindly supplied my details to various companies without me asking. I suspect this happened because information about companies is a public record and certain marketing-savvy companies slurped up my information and sent me 'useful' junk mail: catalogs for office equipment for example.

One of the companies that felt the sudden urge to write to me was Dell. For a while I owned Dell computers and for various reasons (mostly to do with their terrible support for small business and their weird 'you need to buy Dell racks for your gear') I stopped buying anything from them.

So as each piece of junk mail came in I would unsubscribe. Sometimes this was a phone call, sometimes a fax and sometimes it was necessary to return the item with 'Désinscription' or similar written on it.

And it worked great, except for Dell.

For 10 months I've tried to unsubscribe.

I've emailed them at [email protected] as requested. I've faxed them on 0825 004 682 as they also suggest and I've mailed them at Koba D/03-F, ZI de Chevreuil, F-60490 Ressons Sur Metz. And still their junk keeps coming.

Aargh!

Friday, July 28, 2006

Unbanned from Digg

After two emails to Digg's abuse email address, and an intervention by Leo Laporte (thanks Leo!) direct to Kevin Rose, my Digg account has been unbanned.

Here's the official word I got from Digg:

Your account has been unbanned. Your account was banned for violating digg Terms Of Use, submitting the exact same story in less then 3 minutes time, that's what spammers usually do on digg. As a consequence, we banned your account. Your account was NOT banned for linking to reddit.com or for submitting a joke.

-The Digg Watch Team.

So I'm letting this go now. They say they were OK with the joke, and just thought (erroneously) that I was spamming. I never believed that it was because of some reddit/digg rivalry.

So, it's over.

But a note to Digg: please alter the way Bury Story works so that it's obvious when a story has been buried, that the reason it was buried is clear, and that the list of people who buried the story is given (just like you do for who dugg a story). Just doing that one thing would end a lot of confusion.

Update on August 1, 2006: I just came across a blog entry that makes some untrue claims about me concerning this:

1. The problem was, then he submitted the story multiple times. Actually, I submitted the same story twice, not multiple times and as I've explained this was because I screwed up the URL on the Reddit side and had to start again. You can check for yourself how many times I submitted the story by looking at my Digg account submission history.

You can also see that the two stories have different URLs, and that one story has just one digg. That's partly because I buried that story myself once I realized I had screwed up.

2. and then created multiple fake accounts and dugg his own stories. Once I discovered that my account was banned I did two things: I emailed [email protected] asking why and I created a new account for myself. So multiple here is 1, and I only did that because Digg killed my regular account.

Thursday, July 27, 2006

Badges of Honor

So after yesterday's little recursion hack I've received a few badges of honor that I'm very proud of:
  1. I've been banned from Digg and the Digg folks don't have the decency to answer my polite email asking for an explanation of my banning. At this point, I consider this to be a badge of honor: if a technical web site bans you for submitting a clever link that demonstrates a well known programming paradigm and harms no one, and the founder of the web site claims to be such a cool hax0r, then your submission revealed something very important: that site is run by fools.

  2. The reddit folks honored me with a special reddit logo for the day celebrating my never ending recursion between Digg and reddit. I'm probably violating reddit's terms of service by publishing this, but here's a copy of the logo:


  3. My story was also #1 on reddit yesterday and so I've been awarded a Golden Reddit in recognition. Cool, thanks guys!


  4. And I guess the reddit guys really liked it because they've offered me a free shirt from their store.
So, it seems like the reddit people and community have a sense of humor, and the Digg people don't. Not only did their staff ban me, but it looks like part of their community buried the story and since Digg has no feedback on when stories are buried it just magically disappeared.

Wednesday, July 26, 2006

Sense of humor failure at Digg

This afternoon I decided to play a little joke on Digg and Reddit by submitting a story about recursion that pointed from the Reddit story to the Digg story and back. Since Digg URLs are totally predictable it was possible to set this up by writing the Reddit story first (and they don't verify that URLs work) and then posting the correct Digg story with the Reddit link (which I couldn't predict).

It's a classic programmer joke (it's even in the Digg Programming section).

And for that my Digg account was cancelled without warning.

You can start the recursion at Digg here.

But Digg folks, get a life, ok? It was a joke, a classic programming joke. And this from a guy who's a 'dark tipper' (scary music).

A well informed reader writes to me:
Your amusing recursion hack has just had a practical consequence:
you inadvertantly caused the smoking gun proving Digg censors
stories on the front page. Your link actually made the digg
front page, according to the rss feed used by sites like popurls:

http://popurls.com/

but it's been censored from the actual front page.

This also shows an unexpected consequence of feeds: you can't
get away with censoring your site, unless you have some kind of
delay in the feed.
So I looked into this and sure enough the story is buried. If you try searching on Digg for 'reddit recursion' you get no results, but if you look at the main RSS feed my story is there as #2 on the front page. Yet on the actual page the story is not there.

In the RSS feed it appears between "Bug or Conspiracy? Digg users go CRAZY modding down comments!" and "Metallica finally joins iTMS". I'll email anyone a copy of the RSS feed showing my story.

And finally the number of stories promoted to the front page for user jgrahamc (my now dead account) just jumped from 1 to 2... I guess my silly recursion story should be on the front page but it isn't. This appears to be because members of the Digg community buried the story because they didn't like it, but Digg has no mechanism for showing when a story is buried.

Tuesday, July 25, 2006

The Open Source Elephant in the room: the source sucks

Firstly, since I'm about to slam Free and Open Source Software let me point out that I am not a Microsoft fan boy, that I use Firefox, Thunderbird, Ubuntu, EMACS, all the GNU tools, the GIMP, Apache, TRAC, OpenOffice, etc. on a daily basis. I live and work on FOSS.

But there's a real problem with most FOSS: for something that prides itself on the source being readable by everyone, and even comes up with 'laws' like 'given enough eyeballs all bugs are shallow', the actual source code of most FOSS is horrible, unreadable garbage. Actually, I wonder if 'Linus' Law' shouldn't actually be something like 'Linus' Necessity': given that the source is so horrible we need lots of people so that one of them will be able to figure out what the hell it was we wrote.

When I started my well-known open source project I decided that I'd better make the code readable for two reasons: firstly, I was sure that I wouldn't get to work on it often so I'd have to come back and read old code, and comments and coding standards would make that easier; secondly, I was sure that other people were going to read my code.

The second thing turned out to be really important for two reasons: firstly, other people were able to read my code and contribute and I kept them to a similar coding standard and style and hence the code is (reasonably, I'm not claiming I'm perfect) readable. Perhaps more importantly one day I was being interviewed for a job and the interviewer said: "Yes, we've all read your code". They'd downloaded my project and checked me out. (I got the job).

Now, I'm not trying to slam all FOSS here and for the purposes of this entry I have not examined some of the most famous projects (e.g. Linux kernel, Apache, Firefox, ...), but I decided to take a look at the top 50 most downloaded projects of all time on SourceForge.

For each project I picked two source files at random (each source file had to be fairly large, i.e. more than 100 lines of code) and scored them, being as generous as possible, using the following categories. I weighted the scores heavily towards doing simple things that have a high benefit (for example, describing the purpose of a file or function):
  • File Description (FD): did the file I opened have some sort of description (near the top) of what the file was for. I wasn't asking for a detailed explanation, but just a little helper so that a new reader could get going on the purpose. Score: +5 (if present), -5 (if not)
  • Function/Interface Description (FID): did any of the functions, or interfaces, in the file have a description. I would have liked to have seen all the arguments specified and return codes and caveats explained, but I was extremely generous: even if one function had a little header with a minimal description of the function it got into this category. Score: +5 (if present), -5 (if not)
  • Useful Comments (UC): did the file contain at least one useful comment. A useful comment points out something that isn't obvious to the reader, or some trap for the unwary. Score: +1 (if present), -1 (if not)
  • Stupid Comments (SC): did the file contain at least one stupid comment like 'increment i' or 'loop through records'. Score: -1 (if present), +1 (if not)
  • Understandable (U): did I feel like I would be able to understand most of the code given 30 minutes of reading the file and browsing the rest of the source. This was very subjective, but was used to take into account things like clearly named functions, or really well named member variables. Score: +5 (if understandable), -5 (if not)
  • Commented out code (COC): people, we have source code control systems. Don't // out your code, or #if 0 it, OK? Score: -1 (if present), +1 (if not)
  • Bonus (B): I had a special bonus category which I could hand out if I felt like it. A positive score here was for particularly well documented, and written code, neutral for most code and negative for really hideous stuff. Score: +10 (loved it), -10 (yuck), 0 (in general)
Of the top 50 projects one (XAMPP) was excluded because it's a distribution of other code and not new code.
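To make the arithmetic concrete, here is a minimal sketch in C of how the categories above combine into a single score. The struct and function names (file_review, review_score) are assumptions made for this example, and the bonus is stored as -1, 0 or +1 and multiplied by 10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One reviewed project, using the categories from the list above.
   (Hypothetical type, defined only for this sketch.) */
struct file_review {
    bool file_description;      /* FD  */
    bool function_description;  /* FID */
    bool useful_comment;        /* UC  */
    bool stupid_comment;        /* SC  */
    bool understandable;        /* U   */
    bool commented_out_code;    /* COC */
    int bonus;                  /* B: -1 (yuck), 0 (in general), +1 (loved it) */
};

int review_score(const struct file_review *r)
{
    int score = 0;

    assert(r != NULL && r->bonus >= -1 && r->bonus <= 1);

    score += r->file_description ? 5 : -5;
    score += r->function_description ? 5 : -5;
    score += r->useful_comment ? 1 : -1;
    score += r->stupid_comment ? -1 : 1;      /* present is bad */
    score += r->understandable ? 5 : -5;
    score += r->commented_out_code ? -1 : 1;  /* present is bad */
    score += 10 * r->bonus;
    return score;
}
```

A project with descriptions, a useful comment, understandable code, a stupid comment and no commented-out code scores 5 + 5 + 1 - 1 + 5 + 1 = 16; the best possible score without the bonus is 18.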

What I found was not a pretty picture:
  • 65% don't bother describing even in the most minimal way even one of the functions I saw
  • 60% of the projects don't bother with describing the purpose of a file
  • 59% of the projects scored negatively using my system
  • 53% contained useless comments
  • 40% looked incomprehensible to me without major effort
  • 33% contained commented out or #if 0 code
There was one bright spot: 85% contained at least one useful comment. But given that my percentages underestimate the problems (because I was very generous) these figures are horrible.

The best projects were (in order of score): GNUWin32 (thanks GNU Project!), GTK+ and The GIMP installers for Windows, NASA World Wind, Ghostscript, WINE, Miranda, MinGW (thanks GNU Project!), Eraser, and DC++.

Come on FOSS people. Have some pride in your work! Remember, writing some decent comments is a gift you are giving to people who read your code, and to yourself.

(Note that if you are the author of one of the projects above it's possible that I made a mistake and just happened to pick the wrong files to read. Send me examples of how great your code is and I'll publish a rebuttal here).

Here's a table with all the data:

Key: FD, FID, UC, SC, U and COC are 1 when the item was present (for U, when the code was understandable) and -1 when it was not; B is the bonus (1, 0 or -1); Pop. is the project's download rank.

Project                                    FD  FID  UC  SC   U  COC   B  Score  Pop.
eMule                                      -1   -1   1   1   1    1   0     -6     1
Azureus                                    -1   -1   1   1   1    1   0     -6     2
BitTorrent                                 -1   -1  -1  -1  -1   -1   0    -14     3
DC++                                        1    1   1   1   1   -1   0     16     4
Ares Galaxy                                 1   -1   1  -1  -1   -1   0     -2     5
CDex                                       -1   -1   1   1   1    1   0     -6     6
VirtualDub                                 -1   -1   1  -1  -1   -1   0    -12     7
Shareaza                                   -1   -1   1   1   1    1   0     -6     8
eMule Plus                                  1   -1   1  -1   1    1   0      6     9
GTK+ and The GIMP installers for Windows    1    1   1  -1   1   -1   1     28    10
7-Zip                                      -1   -1   1  -1  -1   -1   0    -12    11
FileZilla                                  -1   -1  -1  -1  -1   -1   0    -14    12
guliverkli                                  1   -1  -1   1  -1   -1   0     -6    13
Gaim                                        1   -1   1  -1   1   -1   0      8    14
Audacity                                    1   -1   1  -1  -1   -1   0     -2    15
phpBB                                      -1   -1   1   1   1   -1   0     -4    16
ZSNES                                      -1   -1   1  -1   1    1   0     -4    17
TightVNC                                   -1   -1   1   1   1    1   0     -6    18
phpMyAdmin                                  1    1   1   1  -1   -1   0      6    19
NASA World Wind                             1    1   1  -1   1   -1   0     18    20
ABC [Yet Another Bittorrent Client]        -1    1   1   1   1   -1   0      6    21
Dev-C++                                    -1   -1   1  -1   1   -1   0     -2    22
aMSN                                       -1    1   1   1   1    1   0      4    23
ffdshow                                    -1   -1  -1  -1  -1   -1  -1    -24    24
WinSCP                                     -1   -1   1  -1  -1   -1   0    -12    25
JBoss.org                                   1    1   1  -1  -1   -1   0      8    26
AC3Filter                                  -1   -1   1   1  -1   -1   0    -14    27
Ghostscript                                 1    1   1   1   1   -1   0     16    28
[email protected]                          -1   -1   1   1  -1    1   0    -16    29
Webmin                                      1    1  -1  -1  -1   -1   0      6    30
PDFCreator                                 -1   -1   1   1  -1   -1   0    -14    31
VisualBoyAdvance                           -1   -1   1  -1  -1    1   0    -14    32
MinGW - Minimalist GNU for Windows          1    1   1   1   1   -1   0     16    33
eMule Morph                                -1   -1   1   1  -1    1   0    -16    34
GnuWin32                                    1    1   1  -1   1   -1   1     28    35
The CvsGui project                          1    1  -1   1   1   -1   0     14    36
FlasKMPEG                                  -1   -1   1  -1   1    1   0     -4    37
VirtualDubMod                              -1   -1   1  -1  -1    1   0    -14    38
Gallery                                     1   -1   1  -1   1    1   0      6    39
DOSBox DOS Emulator                        -1   -1   1   1   1   -1   0     -4    41
Miranda                                     1    1   1   1   1   -1   0     16    42
DScaler Deinterlacer/Scaler                -1    1   1   1   1    1   1     14    43
Celestia                                   -1   -1   1  -1   1   -1   0     -2    44
PeerGuardian                               -1    1   1   1   1   -1   0      6    45
XOOPS Dynamic Web CMS                      -1   -1   1  -1   1   -1   0     -2    46
Eraser                                      1    1   1  -1   1    1   0     16    47
Wine Is Not an Emulator                     1    1   1   1   1   -1   0     16    49
burst!                                     -1   -1  -1  -1  -1   -1   0    -14    50

Monday, July 10, 2006

A simple code for entering latitude and longitude to GPS devices

This post proposes a coding system for entering any location on earth with 10m of accuracy using a 10 character code that includes features to prevent errors in entering the code.

The idea is that anyone could publish their location by writing something like VUF DDC F8UG. This short code could be entered into a GPS device giving you any spot on the globe.

I'm calling it the SOC: Simple Orientation Code.

Some example uses:
  1. I could print my company's SOC on my business cards and visitors could punch it into their car navigation system and come visit
  2. A restaurant could publish its SOC along with its phone number (after all it's the same length as a phone number so it's something people can easily grok) making the restaurant easy to find
  3. Geocachers could publish SOC trails for people hunting down caches
  4. SCUBA divers could refer to dive sites by their SOC (10m of accuracy is enough surface accuracy for most people)
Here's how the code works.

First you need the latitude and longitude of the location you are talking about to 4 decimal places of accuracy. 4 decimal places gives about 10m of accuracy. So treating latitude as ranging from 0 to 180 degrees (basically change it from -90 to 90 degrees by adding 90) and longitude as from 0 to 360 degrees (ignoring east/west or +/- values) and then treating the two numbers as integers (i.e. take the 4 decimal place latitude or longitude and multiply by 10000) you get two numbers: La and Lo.

La varies from 0 to 1,799,999 and Lo from 0 to 3,599,999. These two numbers can be combined to form a single number that I call P (your position) like this:

P = La * 3600000 + Lo

Extracting the La and Lo from P is simply a matter of dividing P by 3,600,000 (to get La) and calculating the remainder (to get Lo).
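The combining and extracting steps can be sketched in C as below. This is only an illustration; the names soc_pack, soc_unpack and SOC_LON_STEPS are made up for the example, and La and Lo are assumed to have already been converted to integers as described above.

```c
#include <assert.h>
#include <stddef.h>

#define SOC_LON_STEPS 3600000ULL  /* 360 degrees in steps of 0.0001 degrees */

/* P = La * 3600000 + Lo */
unsigned long long soc_pack(unsigned long long la, unsigned long long lo)
{
    assert(lo < SOC_LON_STEPS);  /* otherwise Lo would spill into La */
    return la * SOC_LON_STEPS + lo;
}

/* La is the quotient and Lo the remainder of P divided by 3,600,000. */
void soc_unpack(unsigned long long p,
                unsigned long long *la, unsigned long long *lo)
{
    assert(la != NULL && lo != NULL);
    *la = p / SOC_LON_STEPS;
    *lo = p % SOC_LON_STEPS;
}
```

Because Lo is always smaller than 3,600,000, the division recovers exactly the two numbers that went in.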

P varies from 0 to 6,479,999,999,999, which can be stored in 43 bits.

Now encoding P in some form typeable by a human requires an alphabet. The SOC alphabet consists of the following 32 characters:

ABCDEFGHJKLMNPQRTUVWXY0123456789

This is the standard English alphabet plus Arabic numerals 0 through 9 with the following letters removed: I, O, S, and Z. These are removed because I is easily confused with both 1 and J; O is easily confused with 0; S is easily confused with 5 and Z is easily confused with 2. These characters are removed to ensure that the code is minimally affected by bad handwriting.

Moreover an implementation using the SOC should silently perform the following translations: I becomes 1; O becomes 0; S becomes 5 and Z becomes 2. This way the user will not have to correct a poorly written SOC.

Each character in the alphabet represents a number between 0 and 31.

A(0) B(1) C(2) D(3) E(4) F(5) G(6) H(7) J(8) K(9) L(10) M(11) N(12) P(13)
Q(14) R(15) T(16) U(17) V(18) W(19) X(20) Y(21) 0(22) 1(23) 2(24) 3(25)
4(26) 5(27) 6(28) 7(29) 8(30) 9(31)
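A lookup that applies the silent translations and returns the value of a character might look like this in C. It is a sketch only; the name soc_char_value is an assumption of this example, and accepting lower case input is a convenience that the description above does not require.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

static const char SOC_ALPHABET[] = "ABCDEFGHJKLMNPQRTUVWXY0123456789";

/* Returns the value (0 to 31) of one SOC character, or -1 if the
   character cannot be part of a SOC. */
int soc_char_value(char ch)
{
    int c = toupper((unsigned char)ch);
    const char *pos;

    assert(strlen(SOC_ALPHABET) == 32);

    /* silently translate the letters that were removed from the alphabet */
    switch (c) {
    case 'I': c = '1'; break;
    case 'O': c = '0'; break;
    case 'S': c = '5'; break;
    case 'Z': c = '2'; break;
    default: break;
    }

    if (c == '\0')
        return -1;  /* strchr would otherwise match the terminator */
    pos = strchr(SOC_ALPHABET, c);
    return pos != NULL ? (int)(pos - SOC_ALPHABET) : -1;
}
```

So a handwritten O is read as 0 and gets the value 22, exactly as if the digit had been written.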

P can be encoded using 10 characters from this alphabet. Since each character contains 5 bits of information and only 43 bits are needed for the position that leaves 7 bits for an error checking code. The algorithm used to generate the check digit is a variant of the scheme used for ISBNs.

The 43 bit P is broken into 11 4 bit numbers with a zero padded on the left of P. The 11 numbers are p0 through p10. A check digit C is calculated as follows:

C = ( p0 * 37 + p1 * 31 + p2 * 29 + p3 * 23 + p4 * 17 + p5 * 13 + p6 * 11 + p7 * 7 + p8 * 5 + p9 * 3 + p10 * 2 ) mod 127

C is then appended to P to create the SOC.
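Here is a sketch of the check value in C. Note that the description above talks about 4 bit groups, while both published programs (the Perl in this post and Chris's C port) step through P in groups of 5 bits (p % 32), applying the weight 2 to the lowest group; this sketch follows the programs. The names soc_check and soc_append_check are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Weights in the order the published programs apply them: lowest group first. */
static const unsigned SOC_WEIGHTS[] = { 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 };

/* Weighted sum of the base-32 groups of P, reduced modulo 127. */
unsigned soc_check(unsigned long long p)
{
    unsigned long long sum = 0;
    size_t i;

    for (i = 0; i < sizeof SOC_WEIGHTS / sizeof SOC_WEIGHTS[0]; i++) {
        sum += (p % 32) * SOC_WEIGHTS[i];
        p /= 32;
    }
    return (unsigned)(sum % 127);
}

/* Appends the 7 bit check value to the 43 bit position. */
unsigned long long soc_append_check(unsigned long long p)
{
    assert(p < (1ULL << 43));
    return p * 128 + soc_check(p);
}
```

Multiplying by 128 shifts P up by 7 bits, which is exactly the room left over once 43 of the 50 bits are used for the position.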

Now for some Perl code that implements the encoding and decoding of SOCs.

Converting a latitude and longitude to a SOC:

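use strict;
use warnings;

if ( $#ARGV != 1 ) {
    die "Usage: to-soc <latitude> <longitude>";
}

my ( $lat, $lon ) = @ARGV;

my $alpha = 'ABCDEFGHJKLMNPQRTUVWXY0123456789';
my @alphabet = split(//,$alpha);

# Shift both coordinates so that they are non-negative
$lat += 90;
$lon += 180;

# Work in units of 1/10,000 of a degree, rounding away floating point error
$lat = int( $lat * 10000 + 0.5 );
$lon = int( $lon * 10000 + 0.5 );

# Combine the two into the single position number P
my $p = $lat * 3600000 + $lon;

# Leave 7 bits below P for the check value
my $soc_num = $p * 128;

my @primes = ( 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 );

my $c = 0;

# Weighted sum of successive base-32 digits of P, least significant first
foreach my $prime (@primes) {
    $c += ($p % 32) * $prime;
    $p = int($p / 32);
}

$c %= 127;

$soc_num += $c;

my $digits = 10;

my $soc = '';

# Emit 10 characters, least significant first, prepending each one
while ( $digits > 0 ) {
    my $d = $soc_num % 32;
    $soc = $alphabet[$d] . $soc;
    $soc_num = int($soc_num/32);
    --$digits;
}

print "$soc\n";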

Converting a SOC back to a latitude and longitude:

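use strict;
use warnings;

if ( $#ARGV != 0 ) {
    die "Usage: from-soc <10-digit-soc>";
}

# Accept lower case and map the removed look-alike letters to digits
my $soc = uc($ARGV[0]);
$soc =~ tr/IOSZ/1052/;

my $alphabet = 'ABCDEFGHJKLMNPQRTUVWXY0123456789';

my $soc_num = 0;

# Convert the characters to a number, most significant character first
foreach my $letter (split(//, $soc)) {
    my $value = index($alphabet, $letter);
    die "Incorrect SOC" if $value < 0;
    $soc_num *= 32;
    $soc_num += $value;
}

# The low 7 bits hold the check value; the rest is the position P
my $p = int($soc_num / 128);
my $check = $soc_num % 128;

# Split P into longitude (remainder) and latitude (quotient)
my $lon = $p % 3600000;
my $lat = int($p / 3600000);

# Convert from units of 1/10,000 of a degree back to degrees
$lat /= 10000;
$lon /= 10000;

# Undo the shift that made both coordinates non-negative
$lat -= 90;
$lon -= 180;

my @primes = ( 2, 3, 5, 7, 11, 13, 17, 23, 29, 31, 37 );

my $c = 0;

# Recompute the check value from P in the same way as the encoder
foreach my $prime (@primes) {
    $c += ($p % 32) * $prime;
    $p = int($p / 32);
}

$c %= 127;

if ( $check != $c ) {
    die "Incorrect SOC";
} else {
    print "$lat $lon\n";
}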

This idea and code are being released by me into the public domain.

Those of you with a twisted mind might like to try to find points on the globe that have human-readable SOCs, for example by picking coordinates whose SOC contains a word. Challenge: find a location on the globe whose SOC is something along the lines of TREASURE or STARTHERE.

Tuesday, June 27, 2006

Action shot

I don't have many pictures of myself around the web, so I thought I'd share one from the recent EU Spam Symposium. Here I am talking about spam trickery:

Friday, June 23, 2006

Proposed uniform naming scheme for spammer/phisher content trickery

This post is a proposal to rename all the tricks in The Spammers' Compendium to a uniform scheme that means that tricks can be referred to easily by spam filtering products, that includes information about the purpose and technology used in the trick, and preserves unique naming for each trick.

I'd love to hear comments on this.

Each name consists of three !-separated parts: a purpose, a name, and a technology. The purpose is the reason for the trick (for example, the trick is used to obscure a URL, or to insert innocent words). The name is derived from the current TSC pejorative name. The technology identifies the way in which the trick is coded (for example, with HTML or MIME).

For a single name there could be multiple tricks using different technologies (e.g. some tricks might be implemented using HTML or CSS), or for different purposes (words might be inserted to fool a Bayesian filter or break a hash).

I propose the following purposes for a trick:
  • BWO (Bad Word Obfuscation) Making it hard for a filter to parse potentially bad words (e.g. Viagra)
  • GWI (Good Word Insertion) Adding words likely to confuse a statistical filter
  • HB (Hash Busting) Inserting randomness designed to make message hashing hard
  • TA (Tokenization Avoidance) Preventing a filter from tokenizing a message
  • UH (URL Hiding) Hiding a URL so that a user is fooled into clicking an incorrect link
  • UO (URL Obfuscation) Making it hard for a filter to identify a URL and check it against a black list
  • WB (Web Bugs) Inserting a beacon that tells the spammer that a message has been read
The following technologies would be recognized in the naming scheme:
  • CSS Use of CSS
  • HTML Any HTML without using CSS
  • Javascript Use of Javascript for trickery
  • MIME Manipulation of MIME
  • Plain Plain text
For example, the original Invisible Ink trick written using HTML would be referred to as GWI!Invisible!HTML and a CSS variant would be GWI!Invisible!CSS. Names would only be generated for tricks actually seen in the wild.
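As a rough illustration of how a tool might consume these names, here is a small Perl sketch that splits a name into its three parts and checks the purpose and technology against the lists above. The function name parse_trick_name and the error messages are hypothetical; nothing like this exists in TSC today.

```perl
use strict;
use warnings;

my %PURPOSES = (
    BWO => 'Bad Word Obfuscation',
    GWI => 'Good Word Insertion',
    HB  => 'Hash Busting',
    TA  => 'Tokenization Avoidance',
    UH  => 'URL Hiding',
    UO  => 'URL Obfuscation',
    WB  => 'Web Bugs',
);
my %TECHNOLOGIES = map { $_ => 1 } qw(CSS HTML Javascript MIME Plain);

# Split a name of the form purpose!name!technology and validate each part
sub parse_trick_name {
    my ($full) = @_;
    my @parts = split /!/, $full, -1;
    die "expected three !-separated parts in '$full'\n" unless @parts == 3;
    my ( $purpose, $name, $technology ) = @parts;
    die "unknown purpose '$purpose'\n" unless exists $PURPOSES{$purpose};
    die "empty trick name in '$full'\n" unless length $name;
    die "unknown technology '$technology'\n"
        unless exists $TECHNOLOGIES{$technology};
    return ( $purpose, $name, $technology );
}

my ( $purpose, $name, $technology ) = parse_trick_name('GWI!Invisible!CSS');
print "$PURPOSES{$purpose} / $name / $technology\n";
# Good Word Insertion / Invisible / CSS
```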

With such uniform naming it would be possible to analyze spams and phishes (perhaps with a specific Perl recognizer written for each trick) and then build up trends over time to see how individual tricks and individual classes of tricks are changing.

Currently, TSC contains 55 tricks, although I'm not sure that all of them are suitable for renaming. Here's my proposed naming of the current state of TSC:

The Big Picture TA!BigPicture!HTML
Invisible Ink GWI!Invisible!HTML and GWI!Invisible!CSS
The Daily News GWI!BigTag!HTML
Hypertextus Interruptus BWO!Interruptus!HTML
Slice and Dice TA!SliceNDice!HTML
MIME is Money GWI!PlainNotHTML!MIME
Lost in Space BWO!Space!Plain
Enigma UO!Enigma!HTML
Script Writer TA!Script!Javascript
Ze Foreign Accent BWO!Accent!Plain
Speaking in Tongues HB!Tongues!Plain
The Black Hole BWO!BlackHole!HTML
A Numbers Game BWO!Numbers!HTML
Bogus Login UO!BogusLogin!HTML
Honey, I Shrunk the Font GWI!ShrunkFont!HTML
No Whitespace, No Cry TA!NoWhitespace!Plain
Honorary Title GWI!Title!HTML
Camouflage GWI!Camouflage!HTML
And in the right corner HB!RightCorner!Plain
A Form of Desperation GWI!Form!HTML and BWO!Form!HTML
It's Mini Marquee! GWI!Marquee!HTML
You've been framed BWO!Framed!HTML
Control Freak TA!ControlFreak!Plain
Don't Cramp My Style GWI!Style!CSS
The Microdot BWO!Microdot!CSS
WYSI_not_WYG UH!WYSINotWYG!Javascript
Ultra See Enigma
Internet Exploiter UH!InternetExploiter!HTML
Style Wars: Episode 1 Included in other tricks
The tURLing Test UO!TurlingTest!Plain
Flex Hex BWO!FlexHex!CSS
Sound of Silence WB!Silence!HTML
Blankety Blank BWO!BlanketyBlank!HTML
Doing the Splits BWO!Splits!Plain
But is it art? BWO!ASCIIArt!Plain
Absolute Zero Same as Control Freak
Spell Breaker BWO!Splelnig!Plain
About Face BWO!AboutFace!HTML
Catch a Wave TA!Wave!HTML
Treasure Map UH!TreasureMap!HTML
You cannot be serious UO!Mcenroe!HTML
The Matrix TA!Matrix!Plain
Sticky Fingers BWO!StickyFingers!Plain
Floatation Device TA!Floatation!CSS
The Small Picture TA!SmallPicture!HTML
Chop GUI TA!ChopGUI!HTML or perhaps HB!ChopGUI!HTML
Big Header-ed ? Not sure of the purpose of this; perhaps TA?
The Rake BWO!TheRake!CSS
Now you see it; now you don't BWO!Copperfield!CSS
Slick Click Trick UH!Caption!HTML
Whiter shade of Pale TA!Pale!HTML

This list is in order of discovery. It's interesting to see the rise of UH (URL Hiding) tricks as phishing has grown.
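Once names are uniform, counting tricks by purpose is only a few lines of Perl. The sketch below tallies a small sample of names taken from the list above, not the full list, and the function name count_purposes is just for illustration.

```perl
use strict;
use warnings;

# A sample of names from the list above, in order of discovery
my @names = qw(
    TA!BigPicture!HTML
    GWI!Invisible!HTML
    BWO!Interruptus!HTML
    UO!Enigma!HTML
    UH!WYSINotWYG!Javascript
    UH!InternetExploiter!HTML
    UH!TreasureMap!HTML
    UH!Caption!HTML
);

# Count how many names carry each purpose prefix
sub count_purposes {
    my %count;
    for my $full (@_) {
        my ($purpose) = split /!/, $full;
        $count{$purpose}++;
    }
    return %count;
}

my %count = count_purposes(@names);
for my $purpose ( sort keys %count ) {
    print "$purpose $count{$purpose}\n";
}
# BWO 1
# GWI 1
# TA 1
# UH 4
# UO 1
```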

Thursday, June 22, 2006

How I love my HP-16C

A while ago I bought an HP-16C calculator on eBay. It wasn't cheap and there was no manual; the calculator itself works fine and is in almost mint condition. Since then I've fallen in love with the device.



You probably think I'm nuts to be using a calculator that was discontinued in 1989 and has only 203 bytes of memory. And I had to pay extra to get a PDF version of the scanned original manual.

Perhaps I am crazy, but here's why I love this little machine:

1. RPN. You either love this or hate it. This is my first RPN calculator and for me RPN is the right way to use a calculator. I read a short introduction to RPN tricks (of which there are very few, but filling the stack for repeated operations is one and using LST x to prevent the stack from moving is another).

2. The industrial design of HP calculators is pure art. They are the right size for your hand, the keyboard is clearly marked, keys are spaced far apart (which helps with fat fingers like mine) and the keys give good feedback on being pressed. And the calculator is slightly slanted so that when it's on the desk it's easy to type on.

3. Floating point with fixed display of decimal places. Just right for balancing your check book.

4. Hex/Dec/Oct/Bin modes plus the nice 'show' feature which can display a number in one of the other bases for a few seconds without changing base. Very handy when debugging.

5. And my favorite thing... the HP 16C is 128 mm wide and 79 mm deep. Notice anything interesting? 128 ENTER 79 / is... 1.62. Or, to two decimal places, the Golden Ratio. No wonder I love that thing so much.
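For anyone without an RPN calculator to hand, that last key sequence can be checked with a toy stack evaluator in Perl. This is a simplification and not a model of the HP-16C: the function name rpn is mine, and ENTER is treated purely as a separator between two numbers.

```perl
use strict;
use warnings;

# Evaluate a whitespace-separated RPN key sequence with a simple stack.
# Simplification: ENTER only separates two numbers here, so it is skipped.
sub rpn {
    my ($keys) = @_;
    my @stack;
    my %ops = (
        '+' => sub { $_[0] + $_[1] },
        '-' => sub { $_[0] - $_[1] },
        '*' => sub { $_[0] * $_[1] },
        '/' => sub { $_[0] / $_[1] },
    );
    for my $key ( split ' ', $keys ) {
        next if $key eq 'ENTER';
        if ( exists $ops{$key} ) {
            die "stack underflow\n" if @stack < 2;
            my $x = pop @stack;    # most recent entry
            my $y = pop @stack;    # entry before it
            push @stack, $ops{$key}->( $y, $x );
        }
        else {
            push @stack, $key + 0;
        }
    }
    return $stack[-1];
}

my $ratio  = rpn('128 ENTER 79 /');
my $golden = ( 1 + sqrt(5) ) / 2;
printf "%.4f %.4f\n", $ratio, $golden;    # 1.6203 1.6180
```

So the match is good to two decimal places and drifts apart in the third.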