Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.
I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.
Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.
The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.
Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)
The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.
To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).
Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processus handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into Internet, the IP address is mapped to the network that manages the IP address, fingerprints of the message are taken and all URLs inside the message are extracted.
Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.
The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue using VERP for bounce or reponse handling.
The response process dequeues IDs and responsed to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.
The use of queues and a shared database to handle the queues, plus a simple locking strategy means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running and handling all messages in the delay required). It also means that the processus do not need to be on the same machine and the system can scale by adding processus or adding hardware.
Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.
Since the queue has the knowledge about the work to be done processus can be stopped and upgraded as needed without taking the system off line.
To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.
Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.
In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).
Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/Bayes.pm in the POPFile code, but it looks the same in Signal Spam):
The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).