Skip to main content

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)

The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processus handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into Internet, the IP address is mapped to the network that manages the IP address, fingerprints of the message are taken and all URLs inside the message are extracted.

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue using VERP for bounce or reponse handling.

The response process dequeues IDs and responsed to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.

The use of queues and a shared database to handle the queues, plus a simple locking strategy means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running and handling all messages in the delay required). It also means that the processus do not need to be on the same machine and the system can scale by adding processus or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue has the knowledge about the work to be done processus can be stopped and upgraded as needed without taking the system off line.

To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/ in the POPFile code, but it looks the same in Signal Spam):

The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).


Unknown said…
Why did you decide to use a database to implement the queue?
It's currently implemented as a queue in the database because the database allows remote access (so that I can scale across many machines) and has useful properties of robustness. (And we already had a database in the mix).

Since the actual queue access is wrapped by the Queue object the actual implementation is irrelevant to the code that needs dequeue/enqueue operations and could easily be changed out.

Do you have recommendation for a better way?

Jeff said…
I very much like the produce/queue/consumer concept. However, in this case your database is a central point of failure. To scale you said you'll just add consumers. I've taken a slightly different approach in my systems. All machines are producers/queues/consumers. This way there is no central point of failure.

I do agree that the database is a single point of failure (which, of course, we have to get around with replication). I'd be interested in knowing more about the architecture of your program.

Always open to learn something new.

Toby DiPasquale said…
If Signal Spam gets popular, you will have a problem queuing the raw messages into a database, even MySQL. We eventually had to take the messages out of the database and just lay them on disk with pointers to their location going into the DB instead.

-- Toby DiPasquale

Part of the motivation for structuring the database as I did (each message has a row in a table which contains crunched data (e.g. the source IP address, From line) and a totally separate table just with the messages themselves) plus the use of an object to disguise the actual storage was that I figured that the table size might cause us a problem and we'd need to go to disk at some point.

Any idea at what size MySQL gave up for you?

Unknown said…
I assume you're using InnoDB tables (considering the row-level locking in your FIFO queue).

I've never used InnoDB transactionnal tables, but my experience with MyISAM tables is that a table becomes slow to write because of the growing index to update for each row (or write), and this is worst with non-fixed width table.

I don't know how InnoDB tables are architectured (on file for the data, and a seperate one for the index ?) but because of the FIFO concept, you can delete any index.

Just do :
SELECT * FROM my_queue_table LIMIT 1; # id = 42
DELETE FROM my_queue_table WHERE id = 42;

Because there aren't any index, MySQL will perform a full table scan for the DELERTE query... but remember, it's a FIFO table, so the scan will only last to the first row, so you can do without the index.

You may also change tune the concurrent_insert param to 2 in order to have a non-blocking INSERT.

Popular posts from this blog

How to write a successful blog post

First, a quick clarification of 'successful'. In this instance, I mean a blog post that receives a large number of page views. For my, little blog the most successful post ever got almost 57,000 page views. Not a lot by some other standards, but I was pretty happy about it. Looking at the top 10 blog posts (by page views) on my site, I've tried to distill some wisdom about what made them successful. Your blog posting mileage may vary. 1. Avoid using the passive voice The Microsoft Word grammar checker has probably been telling you this for years, but the passive voice excludes the people involved in your blog post. And that includes you, the author, and the reader. By using personal pronouns like I, you and we, you will include the reader in your blog post. When I first started this blog I avoid using "I" because I thought I was being narcissistic. But we all like to read about other people, people help anchor a story in reality. Without people your bl

Your last name contains invalid characters

My last name is "Graham-Cumming". But here's a typical form response when I enter it: Does the web site have any idea how rude it is to claim that my last name contains invalid characters? Clearly not. What they actually meant is: our web site will not accept that hyphen in your last name. But do they say that? No, of course not. They decide to shove in my face the claim that there's something wrong with my name. There's nothing wrong with my name, just as there's nothing wrong with someone whose first name is Jean-Marie, or someone whose last name is O'Reilly. What is wrong is that way this is being handled. If the system can't cope with non-letters and spaces it needs to say that. How about the following error message: Our system is unable to process last names that contain non-letters, please replace them with spaces. Don't blame me for having a last name that your system doesn't like, whose fault is that? Saying "Your

The Elevator Button Problem

User interface design is hard. It's hard because people perceive apparently simple things very differently. For example, take a look at this interface to an elevator: From flickr Now imagine the following situation. You are on the third floor of this building and you wish to go to the tenth. The elevator is on the fifth floor and there's an indicator that tells you where it is. Which button do you press? Most people probably say: "press up" since they want to go up. Not long ago I watched someone do the opposite and questioned them about their behavior. They said: "well the elevator is on the fifth floor and I am on the third, so I want it to come down to me". Much can be learnt about the design of user interfaces by considering this, apparently, simple interface. If you think about the elevator button problem you'll find that something so simple has hidden depths. How do people learn about elevator calling? What's the right amount of