Saturday, December 01, 2007

Double-checking Dawkins

I've been reading through Richard Dawkins' books and am currently halfway through The Blind Watchmaker (2006 paperback edition). On page 119 he writes:

In my computer's ROM, location numbers 64489, 64490 and 64491, taken together, contain a particular pattern of contents---1s and 0s---which, when interpreted as instructions, result in the computer's little loudspeaker uttering a blip sound. This bit pattern is 10101101 00110000 11000000.

Of course, this piqued my curiosity. Did Dawkins just make that up, or is this really the contents of a specific bit of memory on a specific computer?

The book was first published in 1986, so I just had to figure out which machine he was talking about. Starting with the instructions and converting the bits to hex we have AD 30 C0. Now, considering the main processors around at the time, there are three possible interpretations of these three bytes:

Z-80/8080: XOR L ; JR NC, C0

6502: LDA C030

6809: JSR 30C0
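
Just to double-check the binary-to-hex conversion, here's a quick Perl check (my addition, not something from the book):

printf "%02X ", oct( "0b$_" ) for qw(10101101 00110000 11000000);
print "\n";   # prints: AD 30 C0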

The first didn't look at all plausible, but both of the other two did. The 6809 looked like it could be calling a sub-routine at location 30C0 and the 6502 would work if C030 were actually memory-mapped I/O and a read was necessary to cause the blip.

I couldn't find any machine with a meaningful interpretation of the 30C0 location on a 6809 machine, but 6502 turned out to be a different story.

On an Apple ][, memory in the range C000-C0FF is mapped to various bits of I/O:

The memory space on the Apple II between $C000 and $CFFF was assigned to handle input and output. From $C000 to $C0FF the space was reserved for various soft-switches used to control the display, and various built-in I/O devices, such as the keyboard, paddles, annunciators, and the cassette port.

Poking around further I discovered that a read from C030 (known as SPKR) will blip the speaker on an Apple ][. So Dawkins is telling the truth and he's using an Apple ][.

So returning to the memory locations, what program is he talking about? The three memory locations, 64489, 64490 and 64491, are FBE9, FBEA and FBEB in hex. That turns out to be inside the Monitor ROM on an Apple ][ (the stuff that Woz wrote) and happily a complete ROM listing was provided with the Apple ][.

So looking in a scanned Apple ][ manual we find that those locations are inside the BELL2 routine. Specifically on page 163 of the manual we find the following:

FBE4: A9 0C 597 BELL2 LDA #$0C TOGGLE SPEAKER AT
FBE6: 20 A8 FC 598 JSR WAIT 1 KHZ FOR .1 SEC
FBE9: AD 30 C0 599 LDA SPKR
FBEC: 88 600 DEY
FBED: D0 F5 601 BNE BELL2
FBEF: 60 602 RTS2B RTS

The line at FBE9 is exactly the code that Dawkins is referring to.

So, while writing this famous text on evolution, Dawkins had time to poke around inside the Apple ][ ROM and write about it in his book.

Respect.

Thursday, November 15, 2007

Steve Gibson's PPP... new version 3 in Java and C

If you've been following along you'll know that I've implemented Steve Gibson's PPP system in Java and C. The Java code is in the form of a CLDC/MIDP project for cell phones and the C code is a command-line tool.

Steve's made a major change to the algorithm which he's calling PPPv3. My updated code is here:

The C source code is available in ppp3-c.zip.

The Java source code is available in ppp3-java.zip.

The compiled Java is available in ppp3.jar.

Read my original blog posts for details.

Tuesday, November 13, 2007

Cryptographically and Constantly Changing Port Opening (or C3PO)

In another forum I was just talking about a little technique that I came up with for securing a server that I want on the Internet, but that I want to be hard for hackers to get into. I've done all the right things with firewalling and shutting down services so that only SSH is available. But that still leaves port 22 sitting there open for someone to bang on.

So what I wanted was something like port knocking (for an introduction to that you can read my DDJ article Practical Secure Port Knocking). To avoid doing the classic port knocking where you have to knock the right way to open port 22 I came up with a different scheme which I call Cryptographically and Constantly Changing Port Opening or C3PO.

The server and any SSH client who wish to connect to it share a common secret. In my case that's just a long passphrase that we both know. Both bits of software hash this secret with the current UTC time in minutes (using SHA256) to get 256 bits of random data. This data changes every minute.

The 256 bits give me 16 16-bit words of random data. Those 16 words can be interpreted as 16 different port numbers. Once a minute the server reconfigures iptables to open those 16 ports, forwarding one of them (the one corresponding to word[0] in the hash) to SSH and the other 15 to a blacklist service.
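
Here's a minimal sketch of the port derivation in Perl. The real code may differ; in particular, the exact way the passphrase and the minute count are combined before hashing is my assumption:

use strict;
use warnings;
use Digest::SHA qw(sha256);

# Hypothetical shared secret.
my $secret = 'a long passphrase that client and server both know';

# Hash the secret with the current UTC time in minutes.
my $minute = int( time() / 60 );
my $hash   = sha256( $secret . $minute );    # 32 bytes = 256 bits

# Interpret the 256 bits as 16 16-bit words, one port number each.
my @ports = unpack( 'n16', $hash );

# word[0] is the real SSH port; the other 15 are traps.
print "SSH port:   $ports[0]\n";
print "Trap ports: @ports[1..15]\n";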

At any one time 16 ports are open (i.e. respond to a SYN) with only one being SSH and the other 15 being a trap to be sprung by an attacker. The 16 ports change once a minute.

Since both sides can compute the hash the client is able to compute where the SSH server is residing at that moment and contact it. Once contact is established the connection remains open for the duration of the session. New sessions, of course, will need to recompute the hash once a minute.

The blacklist service serves to tarpit an attacker. Any connection attempt to one of the other 15 sockets causes the IP address of the attacker to be blacklisted (again an iptables change) which means that hitting any of the 15 ports causes the attacker to shut off their access to the SSH server for the next 15 minutes.

A casual NMAP of my machine gets your IP address blacklisted and shows up a random selection of open ports. A real user connects to the SSH server first time because they know where it resides.

Of course, this doesn't replace making sure that the SSH server is up to date, and that passwords are generated carefully, but it seriously frustrates a potential attacker.

If this is of interest to others I'd be happy to release my code (which is currently in Perl) as a GPL2 project.

Sunday, November 04, 2007

PPPv2 in Java and C

Recently, I released C and Java versions of Steve Gibson's PPP system for password generation. Steve updated the algorithm to PPPv2 which uses a different hash (SHA256 instead of SHA384) and a slightly different plain text generation algorithm (see the PPP pages for details).

I've now updated my code to PPPv2 and am releasing it here.

The C source code is available in ppp-c.zip.

The Java source code is available in ppp-java.zip.

The compiled Java is available in ppp.jar.

Read my original blog posts for details. All source code is released under the BSD License.

Monday, October 29, 2007

A Java client implementation of Steve Gibson's PPP

I recently produced an open source implementation of Steve Gibson's Perfect Paper Passwords system in C. It occurred to me that a better implementation would be a Java client for my mobile phone (thus eliminating the need for printing and carrying the paper passwords).

Here's my PPP client implementation running on my Motorola RAZR. It's written in Java using CLDC 1.0 and MIDP 2.0.




You can download and install the JAR file. The current version is 1.0.0.

Thursday, October 18, 2007

Times Square: a fun spammer GIF

Nick FitzGerald reported a neat spammer image trick to me the other day, now entered in The Spammers' Compendium: it uses animation to display the word Viagra, emulating a flashing neon sign.

Since many OCR systems merge an animation's frames together before OCRing, this image's frames are actually in the 'wrong' order. Once merged the letters are in the order VIRAAG.

Monday, October 15, 2007

SOC Update and Google Maps integration

After receiving some feedback on my Simple code for entering latitudes and longitudes I've made a couple of changes:

1. Replaced the letter V with the symbol @ in the alphabet to remove confusion between U and V. Implementations should automatically map V to U if entered.

2. Changed the checksum to the following calculation:


C = (p0 + p1*2 + p2*3 + p3*4 + p4*5 + p5*6 + p6*7 + p7*8 + p8*9) mod 29
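
To make the calculation concrete, here's the checksum in a few lines of Perl (the nine index values are invented for illustration; p0 to p8 are the alphabet indices of the nine SOC symbols):

my @p = ( 3, 17, 0, 24, 9, 12, 5, 21, 8 );   # example indices p0..p8
my $c = 0;
$c += $p[$_] * ( $_ + 1 ) for 0 .. 8;
my $checksum = $c % 29;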


To make it a bit easier to visualize here's an integration of SOC with Google Maps. You can either type in an address to navigate to that address and see the SOC, or type in a SOC to navigate to that location.






Friday, October 12, 2007

An open source implementation of Steve Gibson's PPP algorithm

Steve Gibson has come up with a simple two-factor password scheme that relies on printed cards of passcodes generated using a combination of SHA-384 and Rijndael. The idea is that a system could prompt the user for one of the passcodes in addition to their normal password.

Steve calls this his Perfect Paper Passwords system and has given a detailed description of the algorithm.

As usual he's released code written in assembly language as a DLL for Windows. He hasn't released his source code (he never does), so I thought it would be interesting to write my own implementation of his algorithm. Here's the C code:

#include <sys/time.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>    /* strftime, localtime */
#include <unistd.h>  /* gethostname */

#include "rijndael.h"
#include "sha2.h"

#pragma pack(1)

typedef unsigned char Byte;

typedef union __Passcode {
    unsigned long as_long;
    struct {
        Byte byte[4];
    } bytes;
} Passcode;

typedef struct __PasscodeString {
    char character[5];
} PasscodeString;

typedef unsigned long long SixtyFour;

typedef union __OneTwoEight {
    struct {
        SixtyFour low;
        SixtyFour high;
    } sixtyfour;
    Byte byte[16];
} OneTwoEight;

typedef struct __SequenceKey {
    Byte byte[SHA384_DIGEST_SIZE];
} SequenceKey;

typedef unsigned long DWord;

/* The 64-character PPP output alphabet */
const char * alphabet = "23456789!@#%+=:?abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPRSTUVWXYZ";

void inc( OneTwoEight * i )
{
    ++i->sixtyfour.low;
    if ( i->sixtyfour.low == 0 ) {
        ++i->sixtyfour.high;
    }
}

void add( OneTwoEight * to, OneTwoEight * addend )
{
    SixtyFour low = to->sixtyfour.low;

    low += addend->sixtyfour.low;

    if ( ( low < to->sixtyfour.low ) || ( low < addend->sixtyfour.low ) ) {
        ++to->sixtyfour.high;
    }

    to->sixtyfour.low = low;
}

void ConvertPasscodeToString( PasscodeString * passcodeString,
                              Passcode passcodeValue )
{
    Byte bytes[4];

    bytes[0] = passcodeValue.bytes.byte[0] & 0x3f;
    bytes[1] = ( ( passcodeValue.bytes.byte[0] & 0xc0 ) >> 6 ) +
               ( ( passcodeValue.bytes.byte[1] & 0x0f ) << 2 );
    bytes[2] = ( ( passcodeValue.bytes.byte[1] & 0xf0 ) >> 4 ) +
               ( ( passcodeValue.bytes.byte[2] & 0x03 ) << 4 );
    bytes[3] = ( ( passcodeValue.bytes.byte[2] & 0xfc ) >> 2 );

    int i;
    for ( i = 0; i < 4; ++i ) {
        passcodeString->character[i] = alphabet[bytes[i]];
    }

    passcodeString->character[4] = '\0';
}

void RetrievePasscodes( Passcode passcodeListBuffer[],
                        OneTwoEight firstPasscodeNumber,
                        int passcodeCount,
                        SequenceKey * sequenceKey )
{
    int i;

#define KEY_BITS (int)256
    Byte key[KEYLENGTH(KEY_BITS)];

    for ( i = 0; i < KEYLENGTH(KEY_BITS); ++i ) {
        key[i] = sequenceKey->byte[i+16];
    }

    unsigned long rk[RKLENGTH(KEY_BITS)];
    OneTwoEight plain;

    for ( i = 0; i < 16; ++i ) {
        plain.byte[i] = sequenceKey->byte[i];
    }

    OneTwoEight block = firstPasscodeNumber;
    unsigned int skip = (unsigned int)(block.sixtyfour.low & 0xF);

    SixtyFour carry = block.sixtyfour.high & 0xF;
    block.sixtyfour.high >>= 4;
    block.sixtyfour.low >>= 4;
    block.sixtyfour.low |= (carry << 60);

    OneTwoEight temp = block;
    add( &block, &temp );
    add( &block, &temp );
    add( &plain, &block );

    int nrounds = rijndaelSetupEncrypt( rk, key, KEY_BITS );
    Byte cipher[16*3];

    int c = 0;

    while ( passcodeCount > 0 ) {
        rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[0] );
        inc( &plain );
        rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[16] );
        inc( &plain );
        rijndaelEncrypt( rk, nrounds, (Byte *)&plain.byte[0], &cipher[32] );
        inc( &plain );

        for ( i = skip; ( i < 16 ) && ( passcodeCount > 0 ); ++i ) {
            passcodeListBuffer[c].bytes.byte[0] = cipher[i*3];
            passcodeListBuffer[c].bytes.byte[1] = cipher[i*3+1];
            passcodeListBuffer[c].bytes.byte[2] = cipher[i*3+2];
            ++c;
            --passcodeCount;
        }

        skip = 0;
    }
}

void GenerateSequenceKeyFromString( char * string,
                                    SequenceKey * sequenceKey )
{
    sha384( (const unsigned char *)string, strlen( string ),
            (unsigned char *)sequenceKey );
}

void GenerateRandomSequenceKey( SequenceKey * sequenceKey ) {
    struct timeval t;
    gettimeofday( &t, 0 );

    char t_buffer[61];
    strftime( t_buffer, 60, "%c%d%e%H%I%j%m", localtime( &t.tv_sec ) );

    char msecs_buffer[32];
    sprintf( msecs_buffer, "%ld", t.tv_usec );

    char hostname_buffer[256];
    gethostname( hostname_buffer, 255 );

    char pointer_buffer[16];
    sprintf( pointer_buffer, "%p", sequenceKey );

    char loadavg_buffer[256];
    double samples[3];
    getloadavg( samples, 3 );
    sprintf( loadavg_buffer, "%f%f%f", samples[0], samples[1], samples[2] );

    char buffer[1024];
    sprintf( buffer, "%s-%s-%s-%s-%s", t_buffer, msecs_buffer, hostname_buffer,
             pointer_buffer, loadavg_buffer );

    GenerateSequenceKeyFromString( buffer, sequenceKey );
}

int ConvertHexToKey( char * hex, SequenceKey * key )
{
    int i, j;

    for ( i = 0, j = 0; i < 96; i += 2, ++j ) {
        char pair[3];
        sprintf( pair, "%c%c", hex[i], hex[i+1] );
        int x;
        sscanf( pair, "%x", &x );
        key->byte[j] = (Byte)x;
    }

    return 1;   /* report success; the caller tests this result */
}

int main( int argc, char * argv[] )
{
    if ( argc == 1 ) {
        printf( "Error: You must provide the passphrase or sequence key as the first parameter\n" );
        return 1;
    }

    SequenceKey key;

    if ( strlen( argv[1] ) == 0 ) {
        printf( "Generating random sequence key\n" );
        GenerateRandomSequenceKey( &key );
    } else {
        if ( ( strlen( argv[1] ) == 96 ) && ( ConvertHexToKey( argv[1], &key ) ) ) {
            printf( "Using entered sequence key\n" );
        } else {
            printf( "Generating sequence key from passphrase\n" );
            GenerateSequenceKeyFromString( argv[1], &key );
        }
    }

    printf( "Sequence Key: " );
    int i;
    for ( i = 0; i < SHA384_DIGEST_SIZE; ++i ) {
        printf( "%2.2x", key.byte[i] );
    }
    printf( "\n" );

    if ( argc == 4 ) {
        OneTwoEight firstPasscode;

        // Warning! This only uses the bottom 64-bits of argv[2] and hence
        // can't convert a much higher number

        firstPasscode.sixtyfour.low = atoi( argv[2] );
        firstPasscode.sixtyfour.high = 0;

        int count = atoi( argv[3] );

        Passcode * pcl = malloc( sizeof( Passcode ) * count );

        RetrievePasscodes( pcl, firstPasscode, count, &key );

        for ( i = 0; i < count; ++i ) {
            PasscodeString str;
            ConvertPasscodeToString( &str, pcl[i] );
            printf( "%s ", &str.character[0] );
        }

        printf( "\n" );
    }

    return 0;
}

Now that's a little less flexible than all the options given in Steve's ppp.exe implementation, but it does compute the correct output and can easily be modified if you want your own PPP implementation with source.

It uses this SHA-384 implementation and this Rijndael implementation.

Here's the output of my ppp program producing the first 70 passcodes:

$ ./ppp 53303f97ddcf91ed74391fc5c3661246
32427e1c93c1a2e2836d006fa2653dc1
fb94f8fbeefa5f1e9263c12878e0a95e 0 70
Using entered sequence key
Sequence Key: 53303f97ddcf91ed74391fc5c3661246
32427e1c93c1a2e2836d006fa2653dc1
fb94f8fbeefa5f1e9263c12878e0a95e
VJNV gHoF PaRp T8FS tGw2 s%iT u7rp
[email protected] MWGb %574 ?DVF btRq PLTA DDtm
C2TP Yin8 [email protected] a8%H zHvq Uwxc qkF7
YuUk 8Ca? :ZvZ T9:? wki+ KiHq d?9b
GY%5 !igR [email protected] [email protected] eyVm 5PwY CAVs
oKzK 43Mc nR%? [email protected] oZUs Tbec xn6B
9bVA UvJt DfAX =Gqp 7Abj M:6Y ENRs
aXX= Eokx WjTj %MPV McSA GFTK XMdY
49?Z Z?Hk G+A? zoK5 :Z8N z8NU WpM!
=AB% RrSq %7:Y %=P8 RKXr di#5 4T3L


Feel free to take my code and use it under the BSD license.

Thursday, October 11, 2007

More spammer crossword creativity

Nick FitzGerald writes in with a variant of the "1 across, 3 down" spammer content trick which looks like this:



The neat thing is that the crossword is created using HTML in a way that prevents a simple HTML-stripping spam filter from reading the brand names (CARTIER, ROLEX, BREITLING and OMEGA). To a simple spam filter this looks like:

CA
BREIT
OM
R
O
L
E
TIER
ING
GA
X

The actual HTML (cleaned up by Nick) is:

<TABLE>
<TR>
<TD>
<DIV align=right>
CA<BR>
<BR>
BREIT<BR>
OM
</DIV>
</TD>
<TD>
<DIV align=center>
R<BR>
O<BR>
L<BR>
E
</DIV>
</TD>
<TD>
TIER<BR>
<BR>
ING<BR>
GA
</TD>
</TR>
<TR>
<TD>
<DIV align=center>
X
</DIV>
</TD>
</TR>
</TABLE>

Wednesday, August 29, 2007

Why you don't want to code for a government department

Back in the mists of time, straight after my doctorate, I worked for a UK start-up called Madge Networks initially maintaining device drivers that implemented LLC, NetBIOS, IPX/SPX protocols and then writing a TCP/IP stack. Most of this work was done in C and assembler (x86 and TMS380).

When I first joined the company I was sent on an x86 assembly training course run by QA Training. (It rained on the first day and we were locked out so one of the company big cheeses ran over with QA Training umbrellas; to this day I use that umbrella).

During the course we were asked to write a simple function in C. I've forgotten what it was, but let's say it was a classic factorial function. I wrote something like:

unsigned int fac( unsigned int i )
{
  if ( i < 2 ) {
    return 1;
  }

  return i * fac( i - 1 );
}

Later we looked at an assembly equivalent of the function, but before that I took a look at the code written by the person sitting next to me. His function looked like this:

unsigned int aq456( unsigned int aq457 )
{
  if ( aq457 == 0 )
    return 1;

  if ( aq457 == 1 )
    return 1;

  return aq457 * aq456( aq457 - 1 );
}

So, naturally, I asked him why he used such horrible names for functions and variables.

It turned out that he worked for a government department and all identifiers were allocated and documented before the code was written. Hence, somewhere a document would tell you that aq456 was a factorial function with excruciating detail of its parameters and return values. And also you could discover that aq457 was the parameter for aq456.

He recounted how each project had a unique two letter code (he'd chosen aq as being qa backwards) followed by a sequence number.

He'd chosen aq because all the projects he'd worked on were aa, ab, ac, ...: the department had apparently never got past 26 projects.

I wonder why?

Saturday, July 14, 2007

Last push for POPFile voting

POPFile has been nominated for a SourceForge Community Choice Award due to the efforts of many people.

Voting closes on July 20 and there's strong competition in POPFile's category from the likes of Pidgin (formerly GAIM) and phpBB.

If POPFile wins SourceForge will be making a donation to a charity that I picked: Doctors without Borders.

If you feel like voting for POPFile, please vote in the Best Project for Communications category here:

http://sourceforge.net/awards/cca/vote.php

Tuesday, July 03, 2007

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users.

Now it's time to vote.

If you love POPFile, please vote for it in the Best Project for Communications category.

Tuesday, June 26, 2007

Pretty Darn Fancy: Even More Fancy Spam

Looks like the PDF wave of spam is proceeding on a totally different tack. Sorin Mustaca sent me a PDF file that was attached to a pump and dump spam. Unlike the previous incarnation of PDF spams, this one looks a lot like a 'classic' image-spam.



It's some text (which has been misaligned to fool OCR systems), but there's a little twist. Zooming in on a portion of the spam shows that the letters are actually made up of many different colors (a look inside the PDF shows it's actually an image):



I assume that the colors and the font misalignment are all there to make it difficult to OCR the text (and wrapping it in a PDF will slow down some filters).

Extracting the image and passing it through gocr gave the following result:

_ Ti_ I_ F_ _ Clih! %
ogx.

%_ % _c. (sR%)
t0.42 N %x

g0 _j_ __ h_ climb mis _k
d% % %g g nle_ Frj%.
_irgs__Dw_s _ rei_ � a
f%%d __. n_ia une jg gtill
oo_i_. Ib _ _ _ _ _ _ 90I
Tgdyl

And a run through tesseract resulted in:

9{Eg Takes I:-west.tY`S Far Gawd Climb! UP
fatued Sta=:khlauih. This we 1: still

Close, but no cigar.

Thursday, June 21, 2007

Pretty Darn Fancy: Stock spammers using PDF files

Joe Chongq sent me a fascinating spam that is the first I've seen that's using a PDF file to send its information. I've long predicted that we'll see a wave of complex formats used for spam as broadband penetration increases and sending large spams becomes possible.

This particular spam has a couple of interesting attributes:

1. The PDF file itself is a really nicely formatted report about a particular stock that's being pumped'n'dumped.

2. The file name of the PDF was chosen to entice the user further by using their first name. In this case it was called joe_report.pdf.

3. The PDF is compressed using the standard PDF 'Flate' algorithm and totals 84,398 bytes. That's fairly large, but we've certainly seen image spams that were larger. Use of compression here means that a spam filter that's not aware of PDF formats would be unable to read the message content.
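
(For the curious, a PDF-aware filter only needs a few lines of Perl with Compress::Zlib to inflate a Flate stream; the crude regex below stands in for real PDF object parsing, and joe_report.pdf is the sample above:)

use strict;
use warnings;
use Compress::Zlib;

open my $fh, '<:raw', 'joe_report.pdf' or die $!;
my $pdf = do { local $/; <$fh> };

# Pull out each stream...endstream blob and try to inflate it.
while ( $pdf =~ /stream\r?\n(.*?)endstream/gs ) {
    my $inflated = uncompress( $1 );
    print $inflated if defined $inflated;
}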

Here's what the actual PDF looks like (click for a larger view):


Monday, June 18, 2007

Escaping comma and space in GNU Make

Sometimes you need to hide a comma or a space from GNU Make's parser because GNU Make might strip it (if it's a space) or interpret it as an argument separator (for example, in a function invocation).

First the problem. If you wanted to change every , into a ; in a string in GNU Make you'd probably head for the $(subst) function and do the following:

$(subst ,,;,$(string))

See the problem? The argument separator for functions in GNU Make is , and hence the first , (the search text) is considered to be a separator. Hence the search text in the above is actually the empty string, the replacement text is also the empty string and the ;, is just prepended to whatever is in $(string).

A similar problem occurs with spaces. Suppose you want to replace all spaces with ; in a string. You get a similar problem with $(subst), this time because the leading space is stripped:

$(subst ,;,$(string))

That extra space isn't an argument; it's just extraneous whitespace and hence it is ignored. GNU Make just ends up appending ; to the $(string).

So, how do you solve this?

The answer is to define variables that contain just a comma and just a space and use them. Because the argument splitting is done before variable expansion it's possible to have an argument that's a comma or a space.

For a comma you just do:

comma := ,
$(subst $(comma),;,$(string))

And everything works.

For a space you need to get a space into a string, I find the easiest way is like this:

space :=
space +=
$(subst $(space),;,$(string))

That works because += always space separates the value of the variable with the appended text.

Now, GNU Make has really liberal variable naming rules. Pretty much anything goes, so it's possible to define a variable with the name , or even one whose name consists of a space character.

First, here's how to define them:

, := ,
space :=
space +=
$(space) :=
$(space) +=

The first line is clear, it does an immediate define of a , to the variable named ,. The second one is a little more complex. First, I define a variable called space which contains a space character and then I use it to define a variable whose name is that space character.

You can verify that these work using $(warning) (I like to wrap the variable being printed in square brackets for absolute certainty of the content):

$(warning [$(,)])
$(warning [$( )])

$ make
Makefile:1: [,]
Makefile:2: [ ]

Yes, that's pretty odd looking, but it gets stranger. Since GNU Make will interpret $ followed by a single character as a variable expansion you can drop the braces and write:

$(warning [$,])
$(warning [$ ])

Now that starts to look like escaping. In the examples above you can use these variables to make things a little clearer:

$(subst $(,),;,$(string))
$(subst $ ,;,$(string))

Note that you have to use the $(,) form because function argument splitting occurs before the expansion and GNU Make gets confused. In the second line the space is 'escaped' with the $ sign.

You might be wondering about other crazy variable names: here are a few that are possible with GNU Make:

# Defining the $= or $(=) variable which has the value =
equals := =
$(equals) := =

# Define the $# or $(#) variable which has the value #
hash := \#
$(hash) := \#

# Define the $: or $(:) variable which has the value :
colon := :
$(colon) := :

# Define the $($$) variable which has the value $
dollar := $$
$(dollar) := $$

; := ;
% := %

You probably don't need any of those, but you never know...

POPFile v0.22.5 Released

Here are the details:

Welcome to POPFile v0.22.5

This version is a bug fix and minor feature release that's been over a
year in the making (mostly due to me being preoccupied by other
things).


NOMINATING POPFILE FOR AN AWARD

SourceForge has announced their 'Community Choice Awards' for 2007 and
is looking for nominations. If you feel that POPFile deserves such an
honour please visit the following link and nominate POPFile in the
'Best Project for Communications' category. POPFile requires multiple
nominations (i.e. as many people as possible) to get into the list of
finalists.

http://sourceforge.net/awards/cca/nomination.php?group_id=63137

Thanks!


WHAT'S CHANGED SINCE v0.22.4

1. POPFile now defaults to using SQLite2 (the Windows installer will
convert existing installations to use SQLite2).

2. Various improvements to the handling of Japanese messages and
improvements for the 'Nihongo' environment:

Performance enhancement for converting character encoding by
skipping conversion which is not needed. Fixed a bug where the
wrong character set was sometimes used when the charset was not
defined in the mail header. Fixed a bug where several HTML
entities caused misclassification. Avoided an 'uninitialized
value' warning. Fixed a bug where the word links in the bucket's
detail page were not URL-encoded.

3. Single Message View now has a link to 'download' the message from
the message history folder.

4. Log file now indicates when an SSL connection is being made to the
mail server.

5. A number of small bug fixes to the POPFile IMAP interface.

6. Installer improvements:

Email reconfiguration no longer assumes Outlook Express is
present. Add/Remove Programs entries for POPFile now show a more
realistic estimate of the program and data size. Better support
for proxies when downloading the SSL Support files. The SSL
patches are no longer 'hard-coded', they are downloaded at
install time. This will make it easier to respond to future
changes to the SSL Support files. The Message Capture Utility
now has a Start Menu shortcut to make it easier to use the
utility. The minimal Perl has been updated to the latest
version. The installer package has been updated to make it work
better on Windows Vista (but further improvements are still
required).


WHERE TO DOWNLOAD

http://getpopfile.org/wiki/download


GETTING STARTED WITH POPFILE

An introduction to installing and using POPFile can be found in the
QuickStart guide:

http://getpopfile.org/wiki/QuickStart


SSL SUPPORT IN WINDOWS

SSL Support is offered as one of the optional components by the
installer. If the SSL Support option is selected the installer will
download the necessary files during installation.

If SSL support is not selected when installing (or upgrading) POPFile
or if the installer was unable to download all of the SSL files then
the command

setup.exe /SSL

can be used to run the installer again in a special mode which will
only add SSL support to an existing installation.


CROSS PLATFORM VERSION KNOWN ISSUES

The current version of SQLite (v3.x) is not compatible with POPFile.
You must use DBD::SQLite2 to access the database.

Users of SSL on non-Windows platforms should NOT use IO::Socket::SSL
v0.97 or v0.99. They are known to be incompatible with POPFile; v1.07
is the most recent release of IO::Socket::SSL that works correctly.


v0.22.0 RELEASE NOTES

If you are upgrading from pre-v0.22.0 please read the v0.22.0 release
notes for much more information:

http://getpopfile.org/wiki/ReleaseNotes


DONATIONS

Thank you to everyone who has clicked the Donate! button and donated
their hard earned cash to me in support of POPFile. Thank you also to
the people who have contributed their time through patches, feature
requests, bug reports, user support and translations.

http://sourceforge.net/forum/forum.php?forum_id=213876


THANKS

Big thanks to all who've contributed to POPFile over the last year.

John.

Friday, May 25, 2007

Back from the EU Spam Symposium; here's my talk

So I'm back home from the 2007 EU Spam Symposium which was held in Vienna in Austria and you can grab my presentation here. You'll notice that the presentation template is from MailChannels. They very kindly sponsored my trip to Vienna and so I did a little publicity for them. There's only one slide, however, that's actually anything to do with MailChannels in the entire presentation, so don't expect a product pitch!

One thing I didn't mention in my talk was that as the number of Internet hosts expands and the number of broadband subscribers grows the number of competing botnets can also grow. That means I'd expect to see the price of botnet rental dropping as the Internet grows leading to lower costs for spammers.

I'll give a complete round up of the conference in my newsletter next week, but overall there were some interesting talks, and meeting some people like Richard Cox from SpamHaus and Richard Clayton was very useful.

Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)



The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processes handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into the Internet, the IP address is mapped to the network that manages the IP address, fingerprints of the message are taken and all URLs inside the message are extracted.
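
The dequeuing step is simple enough to sketch in a few lines of Perl with DBI (the table and column names here are invented for illustration; the real schema differs):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:signalspam', 'user', 'password',
                        { RaiseError => 1, AutoCommit => 0 } );

sub dequeue {
    # Row-level lock (InnoDB) so that only one process gets each ID.
    my ( $id, $message_id ) = $dbh->selectrow_array(
        'SELECT id, message_id FROM queue ORDER BY id LIMIT 1 FOR UPDATE' );
    return undef if !defined $id;
    $dbh->do( 'DELETE FROM queue WHERE id = ?', undef, $id );
    $dbh->commit;
    return $message_id;
}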

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue, using VERP for bounce and response handling.

The response process dequeues IDs and responds to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.

The use of queues and a shared database to handle the queues, plus a simple locking strategy, means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running, and it handles all messages within the required delay). It also means that the processes do not need to be on the same machine and the system can scale by adding processes or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue has the knowledge about the work to be done, processes can be stopped and upgraded as needed without taking the system off line.

To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives the percentage of lines tested, and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/Bayes.pm in the POPFile code, but it looks the same in Signal Spam):



The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).

Perhaps OCRing image spams really is working?

I've previously been skeptical of the idea that OCRing image spams was a worthwhile effort because of the variety of image-obfuscation techniques that spammers had taken to using.

But Nick FitzGerald has recently sent me an example of an image spam that seems to indicate that spammers are concerned about the effectiveness of OCR. Here's the image:



What's striking is that the spammer has used the same content-obscuring tricks that we've seen with text (e.g. Viagra has become something like 'Vi@gr@'), perhaps out of fear that the OCRing of images is working and revealing the text within the images.

Or perhaps this spammer is just really paranoid.

Monday, April 16, 2007

Debugging: Solaris bus error caused by taking pointer to structure member

Take a look at this sample program that fails horribly when compiled on Solaris using gcc (I haven't tried other compilers, and I'm not pointing my finger at gcc here, this is a Sun gotcha).

Here's an example program (simplified from something much more complex that I was debugging) that illustrates how memory alignment on SPARC systems can bite you if you are doing low-level things in C. In the example the program allocates space for a thing structure which will be prepended with a header. The header structure has a dummy byte array called data which will be used to reference the start of the thing.
#include <stdlib.h>   /* malloc */

struct thing {
  int an_int;
};

struct header {
  short id;
  char data[0];
};

struct header * maker( int size ) {
  return (struct header *)malloc( sizeof( struct header ) + size );
}

int main( void ) {
  struct header * a_headered_thing = maker( sizeof( struct thing ) );

  struct thing * a_thing = (struct thing *)&(a_headered_thing->data[0]);

  a_thing->an_int  = 42;
}

If you build this on a SPARC machine you'll get the following error when you run it:
Bus Error (core dumped)

Annoyingly, if you build a debugging version of this program the problem magically goes away and doesn't dump core in the debugger. So you either resort to printf-style debugging or go into gdb and look at the assembly output.

Here's what happens when you run this in gdb (non-debug code):
(gdb) run

Program received signal SIGSEGV, Segmentation fault.
0x000106d8 in main ()

Since you can't get back to the source we're forced to do a little disassembly:
(gdb) disassemble
Dump of assembler code for function main:
0x000106b0 <main+0>:    save  %sp, -120, %sp
0x000106b4 <main+4>:    mov  4, %o0
0x000106b8 <main+8>:    call  0x10688
0x000106bc <main+12>:   nop
0x000106c0 <main+16>:   st  %o0, [ %fp + -20 ]
0x000106c4 <main+20>:   ld  [ %fp + -20 ], %o0
0x000106c8 <main+24>:   add  %o0, 2, %o0
0x000106cc <main+28>:   st  %o0, [ %fp + -24 ]
0x000106d0 <main+32>:   ld  [ %fp + -24 ], %o1
0x000106d4 <main+36>:   mov  0x2a, %o0
0x000106d8 <main+40>:   st  %o0, [ %o1 ]
0x000106dc <main+44>:   mov  %o0, %i0
0x000106e0 <main+48>:   nop
0x000106e4 <main+52>:   ret
0x000106e8 <main+56>:   restore
0x000106ec <main+60>:   retl
0x000106f0 <main+64>:   add  %o7, %l7, %l7
End of assembler dump.

The offending instruction is the st %o0, [ %o1 ] at 0x000106d8. From the code you can clearly see that the o0 register contains the value 0x2a (which is, of course, 42) and hence we are looking at code corresponding to the line a_thing->an_int = 42;. The st instruction is going to write the 42 into the an_int field of thing. The address of an_int is stored in o1.

Asking gdb for o1's value shows us:
(gdb) info registers o1
o1             0x2094a  133450

An int is 4 bytes and you can easily see that the address of an_int stored in o1 is not 4 byte aligned (133450 mod 4 = 2, or just stare at the bottom nybble). The SPARC architecture insists that data accesses be correctly aligned for the size of the access. In this case we need 4 byte alignment (note that malloc will make sure that things are correctly aligned and the compiler will pack structures to the correct alignment while minimizing space).

In this case, the code fails because the data member is byte aligned (since we declared it as a character array), but then we take a pointer to it and treat it as structure with an integer member. Oops. Bus error.

(Note you could have discovered this with printf and %p to get the pointer values without going into the debugger and poking around in the assembly code).

There are a couple of ways to fix it. The first is to pad the header structure so that data is correctly aligned: adding two bytes of padding in the form of a short will make the problem go away:
struct header {
  short id;
  short padding;
  char data[0];
};

That's ugly and requires careful commenting and could be a maintenance problem if maker is used to make things requiring a different alignment, or the header structure is modified.

It's slightly cleaner to not have padding but change the type of data to something like the alignment you want:
struct header {
  short id;
  int data[0];
};

(Or even double data[0] to get 8 byte alignment). With gcc you could even make this really clear by using the aligned attribute to create a special type:
typedef char aligned_data __attribute__ ((aligned (8)));

struct header {
  short id;
  aligned_data data[0];
};

I think that's the clearest option of all. With a little documentation around this it should be maintainable.

Wednesday, March 28, 2007

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl.

Since someone else might find this handy, here it is:

use strict;
use warnings;

my $address = $ARGV[0] || '/';

my ( $ip, $bits ) = split( /\//, $address );
my @octets = split( /\./, defined( $ip ) ? $ip : '' );

if ( !defined( $bits ) ||
     ( $bits eq '' )   ||
     ( $#octets != 3 ) ) {
  die "Usage: ip-decode a.b.c.d/x\n";
}

my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 +
             $octets[2] ) * 256 + $octets[3];

my $remaining = 32 - $bits;

for my $i ( 0 .. ( 2 ** $remaining ) - 1 ) {
  print_ip( $base + $i );
}

sub print_ip
{
  my ( $address ) = @_;

  my @octets;

  for my $i ( 0 .. 3 ) {
    push @octets, ( $address % 256 );
    $address >>= 8;
  }

  print "$octets[3].$octets[2].$octets[1].$octets[0]\n";
}

For example, if this script is called ip-decode and you want to decode the address prefix 82.15.171.152/29 you type ip-decode 82.15.171.152/29 and you'll get the output:

82.15.171.152
82.15.171.153
82.15.171.154
82.15.171.155
82.15.171.156
82.15.171.157
82.15.171.158
82.15.171.159

This script has undergone no testing at all...

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user they are on the Advanced tab and called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored; they are just too common to provide much information.
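
To make that concrete, here's a toy Perl illustration of what ignoring stopwords looks like (the ten-word list below is just an example, not POPFile's actual list):

use strict;
use warnings;

my %stop = map { $_ => 1 } qw(the of and a to in is it you that);

my $text     = 'The quick brown fox jumps over the lazy dog';
my @features = grep { length && !$stop{$_} } split /\W+/, lc $text;

print "@features\n";   # quick brown fox jumps over lazy dog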

My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere and probabilities will be equal for each category of classification. I had a 'why bother' attitude.

Finally, I got around to testing this assumption. And I can give some actual numbers. Taking the standard 20 Newsgroups test and using a (complement) Naive Bayes text classifier as described here I can give you some numbers.

The bottom line is that stopwords did improve classification accuracy for 20 Newsgroups (and for other test datasets that I have), but only by 0.3%. The stopword list used was the 100 most frequently occurring English words.

So, I was wrong... they do make a difference, but not by much.

Monday, March 26, 2007

Introducing Usman's Law

Back at Electric Cloud I worked with a smart guy named Usman Muzaffar. As part of his job he spent a lot of time dealing with our customers, many of whom used GNU Make or other Make tools to build their software.

One of the constant problems that Usman encountered was that most people had no way to get back to a truly clean build. No matter what they'd put in place for doing a totally scratch, clean build it was hard for everyone because their process often accidentally omitted to delete something.

I've observed this problem in my own code. Like many people I have a 'make clean' option which deletes all the output files: in my case by rm -rfing an obj directory:

.PHONY: clean
clean:
@rm -rf $(OUT)/*
@$(MAKE_DIRECTORIES)

And I make sure that generated things only go under $(OUT). But it's easy to screw up. Consider a program like yacc or bison which'll create temporary source code files in the same place as the source code being analyzed. The truth is you have to be very careful to ensure that everything goes in one deletable place. (Not to mention the difficulties involved if the Makefile outputs different versions of objects for, say, different processor targets or platforms.)

That leads me to Usman's Law: make clean doesn't.

Live by it and you'll be on the look out for poorly coded Makefiles that leave generated files in places they should not.

Friday, March 23, 2007

Electric Cloud wins a Jolt Productivity Award

Back in 2005 POPFile (which is now in desperate need of an updated version) won a Productivity Award at the 15th Annual Jolt awards. This week the company I co-founded, Electric Cloud, won the exact same award for its product ElectricCommander.

OK, I should stop bragging now.

And show a little humility.

Truth be told, the glow from the second award is strictly reflected... I didn't design, code, or do anything to make ElectricCommander :-) But being a company founder is a good thing; you get to pretend you had all the smart ideas.

Thursday, March 15, 2007

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers; there are a few good papers on the subject that will get you started.

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained it's possible to calibrate it by classifying known documents, recording the output scores and then, for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on), counting the number of documents in each class to get a fraction that represents the probability within that range. Then when you classify an unknown document you look at the range in which its score falls to get probability estimates.

For spam filtering each range would consist of the number of emails that are ham in the range, and the number that are spam. The ham probability then just comes out to # of ham emails in the range / total number of emails in the range.
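
Here's a toy Perl sketch of the binning step (the scores and labels are made up; a real run would use the classifier's output on a labeled corpus):

use strict;
use warnings;

# Each element is [ score in [0,1), is_spam flag ].
my @scored = ( [ 0.93, 1 ], [ 0.87, 1 ], [ 0.12, 0 ], [ 0.05, 0 ], [ 0.45, 1 ] );

my ( @spam, @total );
for my $m ( @scored ) {
    my ( $score, $is_spam ) = @$m;
    my $bin = int( $score * 10 );          # 10 bins over [0,1)
    $total[$bin]++;
    $spam[$bin]++ if $is_spam;
}

for my $bin ( 0 .. 9 ) {
    next if !$total[$bin];
    printf "[%.1f,%.1f): P(spam) = %.2f\n",
           $bin / 10, ( $bin + 1 ) / 10, ( $spam[$bin] || 0 ) / $total[$bin];
}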

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:



The x-axis is the ham score in the range [0.0,0.9) with each square representing a single 0.1-width bin. The left-most square means that the classifier had a very low score in terms of hamminess, right-most square means that the classifier had a very high score.

The y-axis is the spam score in the [0.0,0.9), so that the bottom means a low spamminess score and the top a high one. Each square is colored red and green. Red is the percentage of messages in that bin that are spam, and green the percentage of messages in that bin that are ham.

So as you'd expect the top left corner (low ham score, high spam score) is generally colored red indicating that these are spam messages, and the bottom right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were less than 5 messages in the bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) as the way to determine whether a message is spam or ham (or undetermined). For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18) is 57% ham, 43% spam so looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer:



Nick calls it a 'self aware' spam.

Wednesday, February 28, 2007

Image spammers doing the twist

It's been quite a while since I last blogged about ever changing image spam. Anna Vlasova wakens me from my unblogging slumber with some great samples of recent image spams where the spammer has decided to rotate the entire image to try to avoid detection. Take a look at this first one:

The spammer has really gone to town here:
  • There's random speckling all over the images to upset hashing and OCR techniques
  • There's no URL in the message itself (it's in the image)
  • The entire image has been rotated to the left to obscure the text

And, of course, they are not going to be content with just one rotation and can randomize the angle per message:

And they've gone even further by slicing the image up, randomizing the angle and overlaying the elements using animation.

Monday, February 19, 2007

Jack Bauer's Management Secrets #1: I need it!

This is part one of a series of posts unlocking the valuable management secrets and strategies of 24's best agent: Jack Bauer. What is it that makes Jack successful? Sure, he's a great shot, he's been trained in all sorts of combat, sometimes he's lucky, clearly he's very driven.

But what really makes Jack a winner are his management skills. Jack successfully motivates and manages, he handles superiors and subordinates, he gains people's trust, he has high integrity, he's a team player and ultimately he helps his team win time and again.

These posts look into Jack's management secrets. In part one I look out how Jack creates a sense of urgency while at the same time binding his team together towards a common goal. And he does all of that with a simple phrase: 'I need it!'.

I need it!

Jack doesn't say "I want this done" or "You must do this", he tells his team members (especially, Chloe) "I need it!". Why's that so important?

Firstly, by saying "I need it!" Jack makes his request personal. It's not a simple matter of something having to be done, it's that Jack (in his position of authority) needs the task completed for his own success. That's an important technique because it binds the manager's success along with the subordinate's success. If the person Jack is asking successfully completes the task they can see clearly that they've been successful personally, made the manager successful and by extension made the entire team more successful.

It also gives the employee a sense of empowerment and reason for their work. Since they know the manager needs the task completed, instead of merely wanting it done, the employee feels that the task is worth accomplishing.

Contrast that with a classic pointy-haired manager who spits out orders without being personally involved. An employee just sees the manager as a conduit for orders that probably came from above them. There's no sense that the manager is in any way involved in the decision making process, or even cares about the outcome.

Of course, it's not enough for Jack to say that he needs it, he needs to provide justification and clarity. He does this in two ways: he briefly explains why, and he sets a clear time frame. Because he's built trust with the team (a subject of a separate post) he can briefly explain his reasons without fuss, and tell the employee when the task must be completed.

Here's Jack speaking to Chloe in Season 5:

Chloe, listen to me, I have a thumb drive that's going to help us find the sentox. I need you to data mine the files.

He makes a clear request 'data mine the files', he gives a clear reason 'going to help us find the sentox' and states that he needs it. In this case he doesn't state when, because it's clear from the context that 'when' is 'asap'.

So, how can you apply Jack's "I need it!" technique in the workplace? You just need to remember the three-step C T U plan whenever setting a task:

C is for Context: explain the context of the request so it's clear why you are asking for a particular task to be completed.
T is for Timeframe: make sure that the expected deadline or timeframe is clearly stated.
U is for Urgency: communicate your sense of urgency by specifying that this is something you need.

Wednesday, February 07, 2007

Trusted Email Connection Signing (rev 0.2)

IMPORTANT: This blog post deprecates my previous posting on this subject, Proposal for connection signing reputation system for email.


Sign the medium, not the message



The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

If you are a legitimate bulk mailer (an email marketer, for example) then you care deeply that your reputation be recognizable and that mail sent from you be delivered. Currently, you have to carefully tend your IP addresses to make sure that they don't appear on blacklists, and you have to ensure that new IP addresses are clean.

If you are running a large email service (e.g. Yahoo! Mail) then you are currently trying to build white and blacklists of IP addresses, when what you really want are white and black lists of entities.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisting with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring on line new servers, with new IP addresses they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Connection Signing

TECS is an extension to the existing SMTP AUTH mechanism (see RFC 2554) and implements an authentication mechanism that I'll refer to as TECS-1 (the 1 here acts as a version number on the protocol). TECS-1 would need to be registered as a SASL (see RFC 2222) authentication mechanism.

When a mail sender wishing to sign its connection connects to an SMTP server, it issues the EHLO command; if the SMTP server is capable of handling AUTH, the mail sender then signs the connection using the AUTH command with the TECS-1 mechanism, followed by an initial response (which contains the TECS signature) as defined in RFC 2554.

Here's an example session:

S: 220 smtp.example.com ESMTP server ready
C: EHLO jgc.example.com
S: 250-smtp.example.com
S: 250 AUTH CRAM-MD5 DIGEST-MD5 TECS-1
C: AUTH TECS-1 PENCeUxFREJoU0Nnbmh1YWVlMDljNDBhZjJiODRhMGMyYjNiYmFlNzg2ZQ==
S: 235 Authentication successful.

The 'initial response' section of the AUTH command is a base-64 encoded string containing the following structure (this is deliberately similar to the DKIM fields):

a=rsa-sha256; q=dns; d=jgc.org; b=oU0Nnbmh1YWVlMDljNDBhZjJiO==

a= is the cryptographic method used (default would be RSA/SHA-256 with suitable padding as described in PKCS#1 version 1.5 RFC 3447).

d= is the name of the domain signing the connection. In the example above I am showing a connection that is being signed (and hence claimed by) jgc.org.

q= is a query type with the default being the use of a DNS TXT record. This query method is used to obtain the public key associated with the signing domain. The public key would be obtained by looking up _tecs.jgc.org and getting the associated TXT record.

b= is the binary signature for the connection generated using the method in a= by the d= domain.

The connecting server signs the tuple consisting of (destination IP/port, source IP/port, epoch); that way it signs the current connection itself and asserts that it is responsible for the mail sent across it.

Each entity has an RSA public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple. The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly, the source IP/port pair is the IP address and port of the connection being used by the mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key.
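
To make the mechanics concrete, here's a minimal sketch of that signing step in Java. The exact byte-level serialization of the tuple isn't pinned down above, so the '|'-joined string below is my own assumption for illustration; "SHA256withRSA" is the standard JCE name for exactly the hash-then-sign (RSASSA-PKCS1-v1_5 with SHA-256) operation just described:

import java.security.PrivateKey;
import java.security.Signature;
import java.util.Base64;

public class TecsSigner {
    // Unix epoch rounded to the nearest 30 seconds, as described above.
    static long roundedEpoch() {
        long now = System.currentTimeMillis() / 1000;
        return Math.round(now / 30.0) * 30;
    }

    // Build the base64 'initial response' for AUTH TECS-1.
    // The '|'-separated tuple layout is an assumption, not part of the spec.
    static String initialResponse(String dstIp, int dstPort,
                                  String srcIp, int srcPort,
                                  String domain, PrivateKey key)
            throws Exception {
        String tuple = dstIp + "|" + dstPort + "|"
                     + srcIp + "|" + srcPort + "|" + roundedEpoch();
        Signature rsa = Signature.getInstance("SHA256withRSA");
        rsa.initSign(key);
        rsa.update(tuple.getBytes("UTF-8"));
        String b = Base64.getEncoder().encodeToString(rsa.sign());
        String fields = "a=rsa-sha256; q=dns; d=" + domain + "; b=" + b;
        return Base64.getEncoder().encodeToString(fields.getBytes("UTF-8"));
    }
}

The mail sender would then issue AUTH TECS-1 followed by the string this returns, exactly as in the example session above.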

Update (February 8, 2007): A number of people have suggested getting the public key from the _domainkey (i.e. DKIM) label of the d= domain. This seems like a good idea since there's no need to reinvent the wheel.

Update (February 9, 2007). A few people pointed me in the direction of the MARID CSV proposal (see CSV). I've addressed this below.
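
Before the questions, here's a sketch of the q=dns key lookup described above, on the JVM using the JNDI DNS provider. It assumes the TXT record at _tecs.<domain> contains just the key material (per the February 8 update you might look at the _domainkey label instead):

import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;
import java.util.Hashtable;

public class TecsKeyLookup {
    // Fetch the TXT record that holds the signer's public key.
    // A sketch only: assumes the record contains the key and nothing else.
    public static String publicKeyFor(String domain) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial",
                "com.sun.jndi.dns.DnsContextFactory");
        InitialDirContext ctx = new InitialDirContext(env);
        Attribute txt = ctx.getAttributes("_tecs." + domain,
                new String[] { "TXT" }).get("TXT");
        return (String) txt.get(); // first string in the TXT record
    }
}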

I don't want to sign my outbound SMTP connections
Well, don't then. If TECS were implemented you'd probably find you had trouble getting your mail delivered, as a non-signing (and hence not-taking-responsibility) connection would be looked upon very poorly.

I'm a big ISP and I don't want to sign as myself; can I sign as a customer?
Sure. For example, jgc.org's mail is actually handled by lists.herald.co.uk. When that server is sending mail for jgc.org it can sign as jgc.org, as long as it has access to jgc.org's private key. It would simply specify d=jgc.org in the TECS data.

Why don't you just use STARTTLS with certificates?
Because that's a very heavyweight system, designed for something else. A SASL extension using SMTP AUTH is simple and clean.

Why do you think connection signing is useful?
Because SMTP server resources are precious. Being able to make a decision about a connection before any mail is delivered is very useful. An SMTP server owner could use reputation data from a public or private source to decide whether to accept or reject a connection, slow it down, subject it to reduced or extra scrutiny, etc. Being able to do this before receiving a ton of mail and tying up a server is very valuable.

Won't spammers just sign their connections?
No doubt, but that's hardly a worry: being able to identify the good senders fast is the most important goal.

Why don't you just sign the destination IP/port pair? That's known before the connection is made and avoids problems with NAT
TECS could just sign the tuple (destination IP, port, epoch) but I think that's a bit weaker than my proposal. Since the destination IP and port are fixed for a given MTA, the signature is really just a signature on the time. An eavesdropper could replay the signature within 30 seconds (or whatever tolerance is allowed on the epoch value) and get an authenticated connection from any source IP address.

What about the MARID CSV proposal?
The CSV proposal is a lightweight (DNS-based), non-cryptographic method of establishing whether a host claiming a certain domain name in HELO/EHLO is authorized to be an SMTP client. Clearly, CSV aims to provide a simple method of determining whether a connecting SMTP client is authorized to be an SMTP client with the claimed name. This seems like a useful extension, but is very different from TECS. TECS operates at the level of a specific connection, and with an entity that is distinct from the domain of the SMTP client. This is valuable for two reasons: it allows the identity to be moved from SMTP service provider to provider, and it means that shared SMTP servers can operate claiming different 'responsible parties' for each connection. The latter point is important for ISPs that provide SMTP services to email marketers, where the same SMTP server may be shared across many clients. Sharing can result in a clean emailer being blacklisted because the IP of the shared server was blacklisted for some other, unrelated misbehaviour.

Whilst CSV is a useful extension which would help with the zombie problem, it does not address the needs at the connection level where I believe the problem needs to be addressed.

CSV also provides specific services for checking domain names against accreditation services. That is outside the scope of TECS, although the assumption is that such services would exist for TECS-signed connections, keyed on the domain name claiming responsibility. The bottom line is that TECS deals with the party responsible for a connection; CSV deals with the party responsible for the server.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal themselves with spam to the list. Since the connection is signed any concern about munging of From: addresses for VERP handling, or adding headers/footers to email are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction. Although, it might be sensible to use the _domainkey record from DKIM to obtain a public key thus sharing the same key between DKIM and TECS.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system, but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.


Many thanks to all members of the REDACTED discussion forum, and to Toby DiPasquale.

Thursday, February 01, 2007

Proposal for connection signing reputation system for email: TECS

IMPORTANT: This blog post is deprecated. Please read Trusted Email Connection Signing (rev 0.2) instead


The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

There are also ideas like SPF, Sender-ID and DomainKeys which all attack the problem of protecting the integrity of the From: portion of a message.

TECS is different. The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisted with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring new servers, with new IP addresses, online they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Now for the how. To work TECS requires two things: a reputation authority and an algorithm. Let's start with the second.

Connection Signing

When a mail sender wishing to sign its connection connects to an SMTP server, it issues the EHLO command; if the SMTP server is capable, a new extension command, TECS, will be available. After the EHLO the mail sender signs the connection using the TECS command.

The TECS command has two parts: an identifier (the unique identifier of the entity signing the connection, and thus taking responsibility for the messages sent across the connection) and a signature.

Each entity has an RSA public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple (destination IP/port, source IP/port, epoch). The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly, the source IP/port pair is the IP address and port of the connection being used by the mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key, turns that into a hex string and uses that string as the second parameter to the new SMTP TECS command.

For example, an entity with the unique identifier 1b46ef3d might sign a particular connection like this:

TECS 1b46ef3d 5dde82a341863c87be1258c02ce7f80bf214192b

to which the receiving server could reply 200 OK if the signature is good (which it verifies by generating the same hash and decrypting the signature using the entity's public key), or with an error if the signature is bad (in which case it should probably drop the connection).
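
For illustration, here's a minimal sketch of that check in Java, assuming the sender signed the same '|'-joined tuple serialization as in the signing sketch earlier (the post doesn't pin down a byte layout); "SHA256withRSA" performs the hash-and-decrypt comparison described above:

import java.security.PublicKey;
import java.security.Signature;

public class TecsVerifier {
    // Decode the hex string from the TECS command's second argument.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(
                    hex.substring(2 * i, 2 * i + 2), 16);
        return out;
    }

    // True if the signature matches the tuple for this connection.
    static boolean verify(String tuple, String hexSig, PublicKey key)
            throws Exception {
        Signature rsa = Signature.getInstance("SHA256withRSA");
        rsa.initVerify(key);
        rsa.update(tuple.getBytes("UTF-8"));
        return rsa.verify(fromHex(hexSig));
    }
}

In practice the server would probably also try the adjacent 30-second epoch values, to allow for clock skew between sender and receiver.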

To get the entity's public key the receiving server needs to query the reputation authority.

Reputation Authority

The TECS reputation authority would be a non-profit organization that sells public/private key pairs and allocates entity IDs to verified entities. Money gathered from selling keys would be used to maintain the database of reputation information for each entity, and to ensure that only reputable entities can obtain keys.

In the example above the receiving server would query the DNS TXT record of the domain name produced by concatenating the identifier given in the TECS command with the name of the authority. Suppose that the authority was tecs.jgc.org; then a DNS TXT query would go to 1b46ef3d.tecs.jgc.org.

The reply would consist of the ASCII-armored public key for that entity and a reputation measure indicating the reliability of that entity. The reputation measure would take one of four states: unknown (a recently issued key would not have any reputation), good (only a small number of complaints against this ID), medium (some complaints), and bad (a large number of complaints, probable spam source). The receiving server can verify the signature and use the reputation information to decide on the handling of the connection.

The authority would accept ARF formatted complaints consisting of abusive messages giving connection information, and the full text of the TECS command. They would then investigate to ensure that the reputation database contained up to date and useful information.

How much is a key pair going to cost?
I think it should be cheap for individuals ($25?), fairly cheap for non-profits and charities ($100?), and then a sliding scale for for-profit companies based on size (say $100 for a small company, $1000 for a big one?). The goal would be to make enough money to run the list.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal themselves with spam to the list. Since the connection is signed any concern about munging of From: addresses for VERP handling, or adding headers/footers to email are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system, but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.

Thursday, January 25, 2007

What Makefile am I in?

A common request when using GNU Make is: "Is there a way to find the name and path of the current Makefile?". By 'current' people usually mean the Makefile that GNU Make is currently parsing. There's no built-in way to quickly get the answer, but there is a way using the GNU Make variable MAKEFILE_LIST.

MAKEFILE_LIST (documented in the manual here) is the list of Makefiles currently loaded or included. Each time a Makefile is loaded or included, its name is appended to the variable. The paths and names in the variable are relative to the current working directory (where GNU Make was started, or where it moved to with the -C or --directory option). The current working directory is stored in the CURDIR variable.

So you can quite easily define a GNU Make function (let's call it where-am-i) that will return the current Makefile (it uses $(word) to get the last Makefile name from the list):
where-am-i = $(CURDIR)/$(word $(words $(MAKEFILE_LIST)),$(MAKEFILE_LIST))

then whenever you want to find out the full path to the current Makefile write the following at the top of the Makefile (the 'at the top' part is important because any include statement in the Makefile will change the value of MAKEFILE_LIST so you want to grab the location of the current Makefile right at the top):
THIS_MAKEFILE := $(call where-am-i)

Example:

Here's the top-level Makefile:
where-am-i = $(CURDIR)/$(word $(words $(MAKEFILE_LIST)),$(MAKEFILE_LIST))

include foo/Makefile

foo/Makefile contains:
THIS_MAKEFILE := $(call where-am-i)
$(warning $(THIS_MAKEFILE))

include foo/bar/Makefile


foo/bar/Makefile contains:
THIS_MAKEFILE := $(call where-am-i)
$(warning $(THIS_MAKEFILE))

Running this on my machine (with the first Makefile in /tmp) gives the output:

foo/Makefile:2: /tmp/foo/Makefile
foo/bar/Makefile:2: /tmp/foo/bar/Makefile

The Tao of Debugging

I hate debuggers.

And not only do I hate them, I rarely use them. For me, a debugger is (almost) always the wrong tool. And people who habitually use debuggers are making a big mistake, because they don't truly understand their code. I suspect that the people who use debuggers all the time are the same people who don't unit test their code.

Any programmer not writing unit tests for their code in 2007 should be considered a pariah (*). The truth is that if you haven't written unit tests for your code then it's unlikely to actually work. Over the years I've become more and more radical about this: an untested line of code is a broken line of code.

Just the other day I came across a wonderful quote from Brian Kernighan:
The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.
He wrote that in 1979, but it's still true today. There are two really important ideas there: careful thought and print statements. If I allow myself to update his quotation then I'd say:
The most effective debugging tools are: your brain, a unit test, and the print statement.
If you want to be a great debugger then those are the only three things you are going to need. To be a great debugger you need to follow these steps (that I jokingly call the Tao of Debugging):

WRITE A UNIT TEST THAT TICKLES THE BUG

1. Once you have a bug in front of you, add a failing unit test to your unit test suite that tickles the bug. This should be the absolute smallest, simplest, and fastest unit test you can write. You want the test to be fast because you are going to have to run it over and over again as you debug.

Once when I was working on a strange problem with Novell's IPX protocol the only way to tickle the bug was to play the Beverly Hills Cop theme tune across the network. After a few runs my coworkers were going insane so I pulled the speaker cable off the motherboard. Nevertheless, a few bars of Axel F were enough to crash the IPX stack and reproduce the bug.

You may have to do a lot of work to write the unit test, since writing the unit test means narrowing down the bug as much as possible. The best way to do this is to write unit tests each time you have a way to reproduce the bug. These tests may start out complex, but as you delve into the code you'll be able to write simpler and simpler tests until you are faced with just the ultimate cause of the bug. Along the way you've written a small test suite that makes sure this bug doesn't happen again.
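
To make the shape of such a test concrete, here's a minimal JUnit 4 sketch. Everything in it (the Parser class and its quoted-comma bug) is invented purely for illustration:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ParserBugTest {
    // Hypothetical bug report: fields containing quoted commas are
    // split incorrectly. This test fails until the bug is fixed, and
    // then guards against a regression forever after.
    @Test
    public void quotedCommaIsNotASeparator() {
        String[] fields = Parser.split("a,\"b,c\",d"); // Parser is hypothetical
        assertEquals(3, fields.length);
        assertEquals("b,c", fields[1]);
    }
}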

START BROAD AND NARROW DOWN

2. Once you've got the unit tests in place (or while you are writing ever simpler tests) you can start the actual debugging. Since the tests point to the place in the code where the bug is located it's time to instrument the code with printf or similar statements. Each time you run the unit test you'll be able to examine, and refine the output of those printf statements to reveal the location and reason for the bug.

Most debugger freaks would at this point start a debugger, set a breakpoint and then start single stepping through the code, examining all sorts of variables, setting up watch statements, and generally using the seductive machinery of debuggers. The problem is that the debugger is an enormous time sink: you have to tell it what you are looking for, which variables to display, and then you single step through code.

The problem with single stepping through code is that it's rarely useful. To narrow down a bug you need to start at the highest level (if I do X, Y happens) and take into account the inputs and outputs of the broken code. As you narrow down the bug you are essentially looking at the inputs (the state of various variables, or bits of memory) and outputs (states of other variables, and memory) of an ever smaller piece of code (until you get to the actual location of the bug). The ideal way to monitor those inputs and outputs is the print statement.
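
As a trivial illustration of that inputs-and-outputs style (the function and the bug here are made up), the instrumentation is just a pair of prints at the boundary of the suspect code, refined on each run of the unit test:

static int balanceAfter(int balance, int withdrawal) {
    // Temporary instrumentation: log the inputs...
    System.err.printf("balanceAfter: in balance=%d withdrawal=%d%n",
                      balance, withdrawal);
    int result = balance - withdrawal; // the suspect code under scrutiny
    // ...and the output, so each test run narrows the search further.
    System.err.printf("balanceAfter: out result=%d%n", result);
    return result;
}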

Single stepping is totally the wrong approach: it starts at the lowest level and the debugger wrangler gets lost in all sorts of fine detail that's totally irrelevant to the debugging process. Even if you step over subroutine calls and loops it's still the wrong level of detail. You need to start wide and narrow down, writing unit tests along the way.

And if you are writing multi-threaded code then I'll wager that the debugger will make things exponentially worse (the base being the number of running threads) than using print statements.

THINK

3. Use your noggin(**). Many bugs can be found (once you've got your unit test telling you where to look) by just staring at the code and rereading it. Or you can 'single step' in your head (this is way more flexible than a debugger single step because you can intelligently ignore irrelevant detail and jump over blocks of code in an arbitrary fashion).

In doing so, you'll often see small improvements that can be made to the existing code. As you narrow down a bug you'll frequently realize that you could write a better comment, or fix a small bug, along the way. If you are in a text editor (and not a debugger), looking at the code, then you can fix those things while narrowing down the bug. The overall effect is that you fix the bug and improve the code.

I find it very helpful to look at my own code. In trying to understand it I'll often find myself saying "Well, that can't happen" (when, of course, it can) and it's usually when I think I know what's happening (but don't) that I find the bug in question.

CAVEATS

There are, of course, some times when a debugger can be used.

When I used to write device drivers the debugger was sometimes helpful to look at a very complex situation related to the handling of interrupts between a network adapter card and the host machine (and for that SoftICE reigned supreme), but even there the printf-style (where in fact each 'printf' was the output of a single character directly to the screen memory in assembly language) was usually enough to capture what was happening.

Some odd memory corruption bugs can benefit from a debugger that can monitor a section of memory and break when the corruption happens.

NOTES

I frequently disagree with Linus Torvalds, but his post on not having a kernel debugger is priceless.

I used to work for John Ousterhout and consider him to be way smarter than me. He uses a debugger all the time; perhaps I'm just not smart enough to use one.

Yes, I know I wrote a debugger for GNU Make. I'm not saying debuggers should be banned, but they should be relegated to the list of tools you break out on rare occasions. And the fact that GNU Make didn't have a debugger at all isn't a good thing because some people just can't live without one.

(*) Yes, I'm aware that POPFile's unit test suite is currently broken :-)
(**) British slang for your head or brain.