Posts

Showing posts from 2007

Double-checking Dawkins

I've been reading through Richard Dawkins' books and am currently halfway through The Blind Watchmaker (2006 paperback edition). On page 119 he writes: In my computer's ROM, location numbers 64489, 64490 and 64491, taken together, contain a particular pattern of contents---1s and 0s which---when interpreted as instructions, result in the computer's little loudspeaker uttering a blip sound. This bit pattern is 10101101 00110000 11000000. Of course, this piqued my curiosity. Did Dawkins just make that up, or is this really the contents of a specific bit of memory on a specific computer? The book was first published in 1986, so I just had to figure out what it was. Starting with the instructions and converting to hex we have AD 30 C0. Now, considering the main processors around at the time, there are three possible interpretations of these three bytes: Z-80/8080: XOR L ; JR NC C0. 6502: LDA C030. 6809: JSR 30C0. The first didn't look at all plausible, b
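The binary-to-hex step in the excerpt above is easy to double-check. Here is a minimal Python sketch; the variable names are mine, and the only interpretation it reproduces is the 6502 reading mentioned in the post, where the two bytes after the opcode form a little-endian address.

```python
# Bit pattern quoted from the book.
bits = "10101101 00110000 11000000"

# Convert each 8-bit group to an integer, then print as two hex digits.
values = [int(group, 2) for group in bits.split()]
hex_bytes = " ".join(f"{v:02X}" for v in values)
print(hex_bytes)

# The 6502 reading: low byte first, then high byte, gives the operand address.
address = values[1] | (values[2] << 8)
print(f"LDA ${address:04X}")
```

The same two lines of arithmetic work for any other three-byte pattern you want to check by hand.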

Steve Gibson's PPP... new version 3 in Java and C

If you've been following along you'll know that I've implemented Steve Gibson's PPP system in Java and C. The Java code is in the form of a CLDC/MIDP project for cell phones and the C code is a command-line tool. Steve's made a major change to the algorithm which he's calling PPPv3. My updated code is here: The C source code is available in ppp3-c.zip . The Java source code is available in ppp3-java.zip . The compiled Java is available in ppp3.jar . Read my original blog posts for details.

Cryptographically and Constantly Changing Port Opening (or C3PO)

In another forum I was just talking about a little technique that I came up with for securing a server that I want on the Internet but that should be hard for attackers to get into. I've done all the right things with firewalling and shutting down services so that only SSH is available. But that still leaves port 22 sitting there open for someone to bang on. So what I wanted was something like port knocking (for an introduction to that you can read my DDJ article Practical Secure Port Knocking ). To avoid doing the classic port knocking, where you have to knock the right way to open port 22, I came up with a different scheme which I call Cryptographically and Constantly Changing Port Opening or C3PO. The server and any SSH client that wishes to connect to it share a common secret. In my case that's just a long passphrase that we both know. Both bits of software hash this secret with the current UTC time in minutes (using SHA256) to get 256 bits of random data. This data changes ev
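The hashing step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the excerpt does not say how the passphrase and the minute counter are combined, so the `secret:minute` text encoding below is hypothetical, as are the function names. It deliberately stops at the 256-bit digest and does not guess at what is done with it afterwards.

```python
import hashlib
import time


def current_minute(now=None):
    """Whole minutes since the Unix epoch (UTC); 'now' is seconds since epoch."""
    if now is None:
        now = time.time()
    return int(now // 60)


def c3po_digest(secret, minute):
    """SHA-256 over the shared secret and the minute counter.

    Assumption: the two are joined as 'secret:minute' in UTF-8. The real
    scheme may combine them differently.
    """
    material = f"{secret}:{minute}".encode("utf-8")
    return hashlib.sha256(material).digest()


# Client and server compute the same 32 bytes as long as their clocks agree
# on the current minute.
digest = c3po_digest("a long shared passphrase", current_minute())
print(digest.hex())
```

Because the minute counter is part of the hashed input, the digest is stable within a minute and different in the next one, which is the property the scheme relies on.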

PPPv2 in Java and C

Recently, I released C and Java versions of Steve Gibson's PPP system for password generation. Steve updated the algorithm to PPPv2 which uses a different hash (SHA256 instead of SHA384) and a slightly different plain text generation algorithm (see the PPP pages for details). I've now updated my code to PPPv2 and am releasing it here. The C source code is available in ppp-c.zip . The Java source code is available in ppp-java.zip . The compiled Java is available in ppp.jar . Read my original blog posts for details. All source code is released under the BSD License .

A Java client implementation of Steve Gibson's PPP

I recently produced an open source implementation of Steve Gibson's Perfect Paper Passwords system in C. It occurred to me that a better implementation would be a Java client for my mobile phone (thus eliminating the need for printing and carrying the paper passwords). Here's my PPP client implementation running on my Motorola RAZR. It's written in Java using CLDC 1.0 and MIDP 2.0 . You can download and install the JAR file . The current version is 1.0.0.

Times Square: a fun spammer GIF

Nick FitzGerald reported a neat spammer image trick to me the other day. It's been entered in The Spammers' Compendium and involves using animation to display the word Viagra, emulating a flashing neon sign. Since many OCR systems merge the layers together before OCR, the layers in this image are actually in the 'wrong' order: once merged, the letters read VIRAAG.

SOC Update and Google Maps integration

After receiving some feedback on my Simple code for entering latitudes and longitudes I've made a couple of changes: 1. Replaced the letter V with the symbol @ in the alphabet to remove confusion between U and V. Implementations should automatically map V to U if entered. 2. Changed the checksum to the following calculation: C = (p0 + p1 * 2 + p2 * 3 + p3 * 4 + p4 * 5 + p5 * 6 + p6 * 7 + p7 * 8 + p8 * 9) mod 29. To make it a bit easier to visualize here's an integration of SOC with Google Maps. You can either type in an address to navigate to that address and see the SOC, or type in a SOC to navigate to that location.
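The revised checksum is a position-weighted sum reduced modulo 29. A minimal Python sketch, assuming p0..p8 are the integer values of the nine code symbols (the mapping from symbols to values is not given in this excerpt, so the function takes the values directly; the function name is mine):

```python
def soc_checksum(values):
    """Weighted checksum: (p0*1 + p1*2 + ... + p8*9) mod 29.

    'values' is assumed to hold the nine integer symbol values p0..p8.
    """
    if len(values) != 9:
        raise ValueError("expected nine symbol values")
    return sum(weight * p for weight, p in enumerate(values, start=1)) % 29


print(soc_checksum([1, 1, 1, 1, 1, 1, 1, 1, 1]))
```

Weighting each position differently means that swapping two unequal adjacent symbols changes the checksum, which a plain sum would not detect.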

An open source implementation of Steve Gibson's PPP algorithm

Steve Gibson has come up with a simple two-factor password scheme that relies on printed cards of passcodes generated using a combination of SHA-384 and Rijndael. The idea is that a system could prompt the user for one of the passcodes in addition to their normal password. Steve calls this his Perfect Paper Passwords system and has given a detailed description of the algorithm. As usual he's released code written in assembly language as a DLL for Windows. He hasn't released his source code (he never does), so I thought it would be interesting to write my own implementation of his algorithm. Here's the C code: #include <sys/time.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include "rijndael.h" #include "sha2.h" #pragma pack(1) typedef unsigned char Byte; typedef union __Passcode { unsigned long as_long; struct { Byte byte[4]; } bytes; } Passcode; typedef struct __PasscodeString { char char

More spammer crossword creativity

Nick FitzGerald writes in with a variant of the "1 across, 3 down" spammer content trick which looks like this: The neat thing is that the crossword is created using HTML in a way that prevents a simple HTML-stripping spam filter from reading the brand names. To a simple spam filter this looks like: CA BREIT OM R O L E TIER ING GA X The actual HTML (cleaned up by Nick) is: <TABLE> <TR> <TD> <DIV align=right> CA<BR> <BR> BREIT<BR> OM </DIV> </TD> <TD> <DIV align=center> R<BR> O<BR> L<BR> E </DIV> </TD> <TD> TIER<BR> <BR> ING<BR> GA </TD> </TR> <TR> <TD> <DIV align=center> X </DIV> </TD> </TR> </TABLE>
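To see why a naive filter is fooled, here is a small Python sketch of an HTML stripper of the simplest kind: it replaces every tag with a space and collapses whitespace. The stripper is my own stand-in, not any particular filter's code, and the list of brand names is my reading of the crossword.

```python
import re

# The cleaned-up HTML from the post.
html = """<TABLE> <TR> <TD> <DIV align=right> CA<BR> <BR> BREIT<BR> OM </DIV> </TD>
<TD> <DIV align=center> R<BR> O<BR> L<BR> E </DIV> </TD>
<TD> TIER<BR> <BR> ING<BR> GA </TD> </TR>
<TR> <TD> <DIV align=center> X </DIV> </TD> </TR> </TABLE>"""


def strip_tags(markup):
    """Replace each tag with a space, then collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", markup)
    return " ".join(text.split())


visible = strip_tags(html)
print(visible)

# Brand names as a human would read them off the rendered table (assumed).
for brand in ("CARTIER", "BREITLING", "OMEGA", "ROLEX"):
    print(brand, brand in visible)
```

The stripper reads the table cell by cell, while the eye reads it row by row, so the fragments never line up into a brand name in the stripped text.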

Why you don't want to code for a government department

Back in the mists of time, straight after my doctorate, I worked for a UK start-up called Madge Networks initially maintaining device drivers that implemented LLC, NetBIOS, IPX/SPX protocols and then writing a TCP/IP stack. Most of this work was done in C and assembler (x86 and TMS380). When I first joined the company I was sent on an x86 assembly training course run by QA Training . (It rained on the first day and we were locked out so one of the company big cheeses ran over with QA Training umbrellas; to this day I use that umbrella). During the course we were asked to write a simple function in C. I've forgotten what it was, but let's say it was a classic factorial function. I wrote something like: unsigned int fac( unsigned int i ) { if ( i < 2 ) { return 1; } return i * fac( i - 1 ); } Later we looked at an assembly equivalent of the function, but before that I took a look at the person sitting next to me. His function looked like this: unsi

Last push for POPFile voting

POPFile has been nominated for a SourceForge Community Choice Award due to the efforts of many people. Voting closes on July 20 and there's strong competition in POPFile's category from the likes of Pidgin (formerly GAIM) and phpBB. If POPFile wins SourceForge will be making a donation to a charity that I picked: Doctors without Borders. If you feel like voting for POPFile, please vote in the Best Project for Communications category here: http://sourceforge.net/awards/cca/vote.php

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users. Now it's time to vote. If you love POPFile, please vote for it in the Best Project for Communications category.

Pretty Darn Fancy: Even More Fancy Spam

Looks like the PDF wave of spam is proceeding with a totally different tack. Sorin Mustaca sent me a PDF file that was attached to a pump and dump spam. Unlike the previous incarnation of PDF spams, this one looks a lot like a 'classic' image-spam. It's some text (which has been misaligned to fool OCR systems), but there's a little twist. Zooming in on a portion of the spam shows that the letters are actually made up of many different colors (a look inside the PDF shows that it's actually an image). I assume that the colors and the font misalignment are all there to make it difficult to OCR the text (and wrapping it in a PDF will slow down some filters). Extracting the image and passing it through gocr gave the following result: _ Ti_ I_ F_ _ Clih! % ogx. %_ % _c. (sR%) t0.42 N %x g0 _j_ __ h_ climb mis _k d% % %g g nle_ Frj%. _irgs__Dw_s _ rei_ � a f%%d __. n_ia une jg gtill oo_i_. Ib _ _ _ _ _ _ 90I Tgdyl And a run through tesseract resulted in: 9{Eg Takes I:

Pretty Darn Fancy: Stock spammers using PDF files

Joe Chongq sent me a fascinating spam that is the first I've seen that's using a PDF file to send its information. I've long predicted that we'll see a wave of complex formats used for spam as broadband penetration increases and sending large spams becomes possible. This particular spam has a couple of interesting attributes: 1. The PDF file itself is a really nicely formatted report about a particular stock that's being pumped'n'dumped. 2. The file name of the PDF was chosen to entice the user further by using their first name. In this case it was called joe_report.pdf. 3. The PDF is compressed using the standard PDF 'Flate' algorithm and totals 84,398 bytes. That's fairly large, but we've certainly seen image spams that were larger. Use of compression here means that a spam filter that's not aware of PDF formats would be unable to read the message content.
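The effect of Flate compression on a content filter can be illustrated with Python's zlib module, which implements the same deflate-based format. This is a sketch only: the sample text is invented, and a real PDF wraps the compressed stream in further structure.

```python
import zlib

# Invented sample text standing in for the body of a stock report.
report = b"Hot stock tip: buy this stock before the price climbs. " * 10

compressed = zlib.compress(report)
print(len(report), len(compressed))

# A filter scanning raw bytes for a keyword finds it in the plain text...
print(b"stock" in report)
# ...but the same scan over the compressed stream comes up empty.
print(b"stock" in compressed)

# A filter that understands the format can inflate the stream first.
print(zlib.decompress(compressed) == report)
```

So a filter has to recognise and inflate the stream before any keyword or statistical test can see the text.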

Escaping comma and space in GNU Make

Sometimes you need to hide a comma or a space from GNU Make's parser because GNU Make might strip it (if it's a space) or interpret it as an argument separator (for example, in a function invocation). First the problem. If you wanted to change every , into a ; in a string in GNU Make you'd probably head for the $(subst) function and do the following: $(subst ,,;,$(string)) See the problem? The argument separator for functions in GNU Make is , and hence the first , (the search text) is considered to be a separator. Hence the search text in the above is actually the empty string, the replacement text is also the empty string and the ;, is just prepended to whatever is in $(string) . A similar problem occurs with spaces. Suppose you want to replace all spaces with ; in a string. You get a similar problem with $(subst) , this time because the leading space is stripped: $(subst ,;,$(string)) That extra space isn't an argument it's just extraneous w

POPFile v0.22.5 Released

Here are the details: Welcome to POPFile v0.22.5 This version is a bug fix and minor feature release that's been over a year in the making (mostly due to me being preoccupied by other things). NOMINATING POPFILE FOR AN AWARD SourceForge has announced their 'Community Choice Awards' for 2007 and is looking for nominations. If you feel that POPFile deserves such an honour please visit the following link and nominate POPFile in the 'Best Project for Communications' category. POPFile requires multiple nominations (i.e. as many people as possible) to get into the list of finalists. http://sourceforge.net/awards/cca/nomination.php?group_id=63137 Thanks! WHAT'S CHANGED SINCE v0.22.4 1. POPFile now defaults to using SQLite2 (the Windows installer will convert existing installations to use SQLite2). 2. Various improvements to the handling of Japanese messages and improvements for the 'Nihongo' environment: Performance enhancement for con

Back from the EU Spam Symposium; here's my talk

So I'm back home from the 2007 EU Spam Symposium which was held in Vienna in Austria and you can grab my presentation here . You'll notice that the presentation template is from MailChannels . They very kindly sponsored my trip to Vienna and so I did a little publicity for them. There's only one slide, however, that's actually anything to do with MailChannels in the entire presentation, so don't expect a product pitch! One thing I didn't mention in my talk was that as the number of Internet hosts expands and the number of broadband subscribers grows the number of competing botnets can also grow. That means I'd expect to see the price of botnet rental dropping as the Internet grows leading to lower costs for spammers. I'll give a complete round up of the conference in my newsletter next week, but overall there were some interesting talks, and meeting some people like Richard Cox from SpamHaus and Richard Clayton was very useful.

Some architectural details of Signal Spam

Finally, Signal Spam , France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here . I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam. Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a Frenc

Perhaps OCRing image spams really is working?

I've previously been skeptical of the idea that OCRing image spams was a worthwhile effort because of the variety of image-obfuscation techniques that spammers had taken to using. But Nick FitzGerald has recently sent me an example of an image spam that seems to indicate that spammers are concerned about the effectiveness of OCR. Here's the image: What's striking is that the spammer has used the same content-obscuring tricks that we've seen with text (e.g. Viagra has become [email protected]@), perhaps out of fear that the OCRing of images is working and revealing the text within the images. Or perhaps this spammer is just really paranoid.

Debugging: Solaris bus error caused by taking pointer to structure member

Take a look at this sample program that fails horribly when compiled on Solaris using gcc (I haven't tried other compilers, and I'm not pointing my finger at gcc here, this is a Sun gotcha). Here's an example program (simplified from something much more complex that I was debugging) that illustrates how memory alignment on SPARC systems can bite you if you are doing low-level things in C. In the example the program allocates space for a thing structure which will be prepended with a header . The header structure has a dummy byte array called data which will be used to reference the start of the thing . struct thing { int an_int; }; struct header { short id; char data[0]; }; struct header * maker( int size ) { return (struct header *)malloc( sizeof( struct header ) + size ); } int main( void ) { struct header * a_headered_thing = maker( sizeof( struct thing ) ); struct thing * a_thing = (struct thing *)&(a_headered_thing->data[0]); a_thi
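The layout behind the alignment problem can be inspected without a SPARC machine. The Python sketch below mirrors the header structure with ctypes (the class name is mine) and reports where data starts relative to the alignment an int wants. It only shows the offsets on whatever machine runs it; it does not reproduce the bus error.

```python
import ctypes


class Header(ctypes.Structure):
    """Mirror of: struct header { short id; char data[0]; };"""
    _fields_ = [
        ("id", ctypes.c_short),
        ("data", ctypes.c_char * 0),  # zero-length array marking the payload
    ]


data_offset = Header.data.offset
int_alignment = ctypes.alignment(ctypes.c_int)

print("data starts at offset", data_offset)
print("int wants alignment", int_alignment)
print("int stored at data is aligned:", data_offset % int_alignment == 0)
```

On typical platforms the payload starts right after the two-byte short, so an int written there sits off its natural boundary; processors that insist on aligned access fault on such a store, while x86 tolerates it.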

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl. Since someone else might find this handy, here it is: use strict; use warnings; my $address = $ARGV[0] || '/'; my ( $ip, $bits ) = split( /\//, $address ); my @octets = split( /\./, $ip ); if ( ( $address eq '' ) || ( $bits eq '' ) || ( $#octets != 3 ) ) { die "Usage: ip-decode a.b.c.d/x\n"; } my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 + $octets[2] ) * 256 + $octets[3]; my $remaining = 32 - $bits; for my $i (0..(2 ** $remaining) - 1) { print_ip( $base + $i ); } sub print_ip { my ( $address ) = @_; my @octets; for my $i (0..3) { push @octets, ($address % 256); $address >>= 8; } print "$octets[3].$octets[2].$octets[1].$octets[0]\n"; }
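For comparison, here is a sketch of the same expansion in Python using the standard ipaddress module for the dotted-quad conversions. The function name is mine. Like the Perl script, it counts up from the address exactly as given rather than masking off the host bits first.

```python
import ipaddress


def decode_range(prefix):
    """Expand 'a.b.c.d/n' into a list of dotted-quad strings.

    Counts 2**(32 - n) addresses upward from a.b.c.d as written,
    matching the Perl script's behaviour.
    """
    ip, bits = prefix.split("/")
    base = int(ipaddress.IPv4Address(ip))
    count = 2 ** (32 - int(bits))
    return [str(ipaddress.IPv4Address(base + i)) for i in range(count)]


for address in decode_range("192.168.1.0/30"):
    print(address)
```

If you want the range normalised to the network boundary instead, `ipaddress.ip_network(prefix, strict=False)` can be iterated directly.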

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user they are on the Advanced tab and called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored; they are just too common to provide much information. My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere and probabilities will be equal for each category of classification. I had a 'why bother' attitude. Finally, I got around to testing this assumption. And I can give some actual numbers. Taking the standard 20 Newsgroups test and using a (complement) Naive Bayes text classifier as described here I can give you some numbers. The bottom line is that stopwords did

Introducing Usman's Law

Back at Electric Cloud I worked with a smart guy named Usman Muzaffar. As part of his job he spent a lot of time dealing with our customers, many of whom used GNU Make or other Make tools to build their software. One of the constant problems that Usman encountered was that most people had no way to get back to a truly clean build. No matter what they'd put in place for doing a totally scratch, clean build it was hard for everyone because their process often accidentally omitted to delete something. I've observed this problem in my own code. Like many people I have a 'make clean' option which deletes all the output files: in my case by rm -rf ing an obj directory: .PHONY: clean clean: @rm -rf $(OUT)/* @$(MAKE_DIRECTORIES) And I make sure that generated things only go under $(OUT) . But it's easy to screw up. Consider a program like yacc or bison which'll create temporary source code files in the same place as the source code being analyzed. The

Electric Cloud wins a Jolt Productivity Award

Back in 2005 POPFile (which is now in desperate need of an updated version) won a Productivity Award at the 15th Annual Jolt awards. This week the company I co-founded, Electric Cloud , won the exact same award for its product ElectricCommander . OK, I should stop bragging now. And show a little humility. Truth be told, the glow from the second award is strictly reflected... I didn't design, code, or do anything to make ElectricCommander :-) But being a company founder is a good thing; you get to pretend you had all the smart ideas.

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started: Transforming Classifier Scores into Accurate Multiclass Probability Estimates , Zadrozny/Elkan, SIGKDD 2002 Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers , Zadrozny/Elkan Predicting Good Probabilities With Supervised Learning , Niculescu-Mizil/Caruana. The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores. The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x . Once a classifier is trained it's possible to calibrate it by classifying known documents, recording the output scores and
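Since the excerpt breaks off mid-description, here is a minimal Python sketch of binning as it is commonly described: split [0,1) into equal-width buckets, and use the fraction of known-positive documents whose scores land in a bucket as the probability estimate for that bucket. The function names, the bucket count and the equal-width choice are my assumptions, not necessarily what the full post does.

```python
def build_calibration_table(scores, labels, bins=10):
    """Return, per bucket, the fraction of documents that were in the category.

    scores: classifier outputs s(x) in [0, 1) for documents with known labels.
    labels: True/1 if the document really is in the category, else False/0.
    Buckets that received no documents are reported as None.
    """
    counts = [0] * bins
    positives = [0] * bins
    for score, label in zip(scores, labels):
        bucket = min(int(score * bins), bins - 1)
        counts[bucket] += 1
        positives[bucket] += 1 if label else 0
    return [positives[b] / counts[b] if counts[b] else None for b in range(bins)]


def calibrated_probability(table, score):
    """Look up the estimate for a new score in the bucket it falls into."""
    bucket = min(int(score * len(table)), len(table) - 1)
    return table[bucket]


table = build_calibration_table([0.05, 0.07, 0.95, 0.91], [0, 1, 1, 1])
print(table)
print(calibrated_probability(table, 0.93))
```

With more buckets the estimates track the scores more finely, but each bucket needs enough known documents for its fraction to mean anything.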

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer: Nick calls it a 'self aware' spam.

Image spammers doing the twist

It's been quite a while since I last blogged about ever-changing image spam. Anna Vlasova wakens me from my unblogging slumber with some great samples of recent image spams where the spammer has decided to rotate the entire image to try to avoid detection. Take a look at this first one: The spammer has really gone to town here: there's random speckling all over the images to upset hashing and OCR techniques; there's no URL in the message itself (it's in the image); and the entire image has been rotated to the left to obscure the text. And, of course, they are not going to be content with just one rotation and can randomize the angle per message: And they've gone even further by slicing the image up, randomizing the angle and overlaying the elements using animation.

Jack Bauer's Management Secrets #1: I need it!

This is part one of a series of posts unlocking the valuable management secrets and strategies of 24 's best agent: Jack Bauer . What is it that makes Jack successful? Sure, he's a great shot, he's been trained in all sorts of combat, sometimes he's lucky, clearly he's very driven. But what really makes Jack a winner are his management skills. Jack successfully motivates and manages, he handles superiors and subordinates, he gains people's trust, he has high integrity, he's a team player and ultimately he helps his team win time and again. These posts look into Jack's management secrets. In part one I look at how Jack creates a sense of urgency while at the same time binding his team together towards a common goal. And he does all of that with a simple phrase: 'I need it!'. I need it! Jack doesn't say "I want this done" or "You must do this", he tells his team members (especially, Chloe ) "I need it!&quo

Trusted Email Connection Signing (rev 0.2)

IMPORTANT: This blog post deprecates my previous posting on this subject. The blog post Proposal for connection signing reputation system for email is deprecated. Sign the medium, not the message The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate. If you are a legitimate bulk mailer (an email marketer, for example) then you care deeply that your reputation is recognizable and that mail se

Proposal for connection signing reputation system for email: TECS

IMPORTANT: This blog post is deprecated. Please read Trusted Email Connection Signing (rev 0.2) instead The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate. Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). W

What Makefile am I in?

A common request when using GNU Make is: "Is there a way to find the name and path of the current Makefile?". By 'current' people usually mean the Makefile that GNU Make is currently parsing. There's no built-in way to quickly get the answer, but there is a way using the GNU Make variable MAKEFILE_LIST . MAKEFILE_LIST (documented in the manual here ) is the list of Makefiles currently loaded or included. Each time a Makefile is loaded or included its name is appended to the variable. The paths and names in the variable are relative to the current working directory (where GNU Make was started or where it moved to with the -C or --directory option). The current working directory is stored in the CURDIR variable. So you can quite easily define a GNU Make function (let's call it where-am-i ) that will return the current Makefile (it uses $(word) to get the last Makefile name from the list): where-am-i = $(CURDIR)/$(word $(words $(MAKEFILE_LIST)),$(MAKEFILE_LIST)

The Tao of Debugging

I hate debuggers. And not only do I hate them, I rarely use them. For me, a debugger is (almost) always the wrong tool. And people who habitually use debuggers are making a big mistake, because they don't truly understand their code. I suspect that the same people who use debuggers all the time, are the same people who don't unit test their code. Any programmer not writing unit tests for their code in 2007 should be considered a pariah (*). The truth is that if you haven't written unit tests for your code then it's unlikely to actually work. Over the years I've become more and more radical about this: an untested line of code is a broken line of code. Just the other day I came across a wonderful quote from Brian Kernighan: The most effective debugging tool is still careful thought, coupled with judiciously placed print statements. He wrote that in 1979, but it's still true today. There are two really important ideas there: careful thought and print sta