Friday, July 26, 2013

OSCON 2013 Keynote: Turing's Curse

O'Reilly have put up a YouTube video of my keynote from today entitled "Turing's Curse". It's a short talk about the history of computing and about how history repeats itself.


Wednesday, July 03, 2013

Your test suite is trying to tell you something

A few weeks ago I started wondering about 'the test that occasionally and randomly breaks' in a large test suite at my job. The test, called 'overloaded origin', tests a situation where a web server becomes overwhelmed with requests and a proxy server (the code being tested) has to handle the situation gracefully.

The test works by having a dummy web server that can randomly decide to (a) return a normal web page for a request, (b) read the HTTP headers and then do nothing for 30 seconds, or (c) read the HTTP headers, wait 30 seconds and then send a valid response. The proxy server is hit by 5,000 clients simultaneously requesting the same URL.
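
As a sketch of the idea (this is my illustration, not the actual test code), the dummy origin might look something like this in Go:
package main

import (
    "math/rand"
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // net/http has already read the request headers by the time
        // the handler runs, so each case below starts from the same state.
        switch rand.Intn(3) {
        case 0: // (a) return a normal web page
            w.Write([]byte("<html>hello</html>"))
        case 1: // (b) go silent: stall for 30 seconds, then drop the connection
            time.Sleep(30 * time.Second)
            panic(http.ErrAbortHandler)
        default: // (c) wait 30 seconds, then send a valid response
            time.Sleep(30 * time.Second)
            w.Write([]byte("<html>sorry I'm late</html>"))
        }
    })
    http.ListenAndServe(":8080", nil)
}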

And sometimes, every now and again, this test failed.

And, like many engineers, I'd ignored it for a long time. But it kept worrying me because it must have meant something: computers are deterministic, after all. I was spurred to action by a colleague suggesting that the test be disabled because it was 'flaky'.

It took me two days of continuous work to find out what was wrong, and the answer explained other occasional problems that had been seen with the code. Fixing it made the test suite 100% stable on all platforms. That 'randomly failing test' was really 'a genuine bug in the code'.

But getting to that point was tricky because this was a system-level test with clients, servers, the proxy and a memcached server in the middle. It turned out that the memcached server was the problem. In the end, I had to implement my own memcached server (a simple one) so that I had complete control over the environment. In doing so, I discovered the root cause of the problem.
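
A stand-in memcached doesn't take much code because the text protocol is so simple. Here's a minimal sketch (illustrative, not the server I actually wrote) that answers every get with a miss; because the test owns this server it can just as easily sleep, stall or drop connections to provoke the proxy's timeout handling:
package main

import (
    "bufio"
    "net"
    "strings"
)

func main() {
    ln, err := net.Listen("tcp", "127.0.0.1:11211")
    if err != nil {
        panic(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            continue
        }
        go func(c net.Conn) {
            defer c.Close()
            r := bufio.NewReader(c)
            for {
                // Commands in the memcached text protocol are single
                // CRLF-terminated lines; this understands just enough.
                line, err := r.ReadString('\n')
                if err != nil {
                    return
                }
                if strings.HasPrefix(line, "get ") {
                    c.Write([]byte("END\r\n")) // a cache miss
                }
            }
        }(conn)
    }
}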

The program has a timeout used to stop it waiting for memcached if it doesn't respond quickly (within hundreds of milliseconds). Here are the lines of code that handle the memcached timeout (this is from inside the proxy server being tested):
var Timeout time.Duration
Timeout = time.Duration(conf("timeout", 100)) * time.Millisecond

cache := memcache.New(Servers...)
cache.Timeout = Timeout * time.Millisecond
The first two lines read the timeout value from a configuration file (with a default of 100) and convert it to a time.Duration in milliseconds. The following lines (later in the code) use that value to set the timeout on the memcached connection.

Oops!

There's another * time.Millisecond there. So 100ms, for example, would become something much larger. To find out what, you just need to know what a time.Duration is: it's a value representing a number of nanoseconds.

So the initial value of Timeout is 100,000,000ns (since 1ms is 1,000,000ns). Then, when the second multiplication happens, Timeout becomes 100,000,000,000,000ns, which is close to 28 hours.
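
You can watch the mistake happen in a couple of lines (time.Duration's String method does the conversion for us):
package main

import (
    "fmt"
    "time"
)

func main() {
    // The intended value: 100ms, stored as 100,000,000ns.
    Timeout := time.Duration(100) * time.Millisecond
    fmt.Println(Timeout)                    // 100ms
    // The accidental second scaling: 100,000,000,000,000ns.
    fmt.Println(Timeout * time.Millisecond) // 27h46m40s
}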

The test would fail because occasionally the connection to memcached would fail, starting the timeout. Instead of gracefully giving up after 100ms, the program was prepared to wait 28 hours.
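
The fix (shown here as a sketch) is simply to drop the extra multiplication, because Timeout is already a fully scaled time.Duration:
cache := memcache.New(Servers...)
cache.Timeout = Timeout // already in nanoseconds; no further scaling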

And examination of the source code control log showed that this bug had always been there, right from the time those lines of code were written.

By me.

Write good commit messages

Over the years I've become more and more verbose in commit messages. For example, here's a recent commit message for something I'm working on at CloudFlare (I've obscured some details). This is actually a one-line change to a Makefile but gives a good example of what I'm aiming for:
commit 6769d6679019623a6749783ea285043d9449d009
Author: John Graham-Cumming
Date:   Mon Jul 1 13:04:05 2013 -0700

    Sort the output of $(wildcard) as it is unsorted in GNU Make 3.82+

    The Makefile was relying on the output of $(wildcard) to be sorted. This is
    important because the XXXXXXXXXXXX rules have files that are numbered and
    must be handled in order. The XXXXXXX relies on this order to build the rules
    in the correct order (and set the order attributes in the JSON files). This
    worked with GNU Make 3.81

    In GNU Make 3.82 the code that globs has been changed to add the GLOB_NOSORT
    option and so the output of $(wildcard) is no longer ordered and the build
    would break. For example,

       make clean-out && make

    would fail because the XXXXXXXXXXXXXXXX (which is used for the XXXXX action)
    which appears in XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX would not have been
    parsed before the XXXXX action was used in some other XXXXX file. That would
    generate a fatal error.

    The solution is simple: wrap the $(wildcard) in $(sort). The actual change
    uses a $(foreach) loop because it's necessary to keep the directories in the
    order specified in the Makefile and the files within the directory are sorted
    by the $(sort $(wildcard ...)). The directory order is important because
    XXXXXXXXXXXX must be processed before the other rule directories because it
    contains XXXXXXXXXXXXXXXXXXXXXXXXXXX which sets the XXXXXXXXXX thresholds.
The first line gives a brief summary of the commit. The rest explains in detail why the change was made (a change in GNU Make 3.82, in this case), why that change caused a problem, how the actual cause was verified, how to reproduce the problem and, finally, a note about the specific implementation. That final note is there so that someone looking at the commit later can understand what I was thinking and the assumptions that went into the change.
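
In Make terms, the change described above looks something like this sketch (the variable names and file layout are invented, since the real ones are obscured):
# Directories must stay in the order given here...
DIRS := base rules extra

# ...but the numbered files within each directory must be handled in
# order, which GNU Make 3.82's $(wildcard) no longer guarantees, so each
# directory's wildcard result is wrapped in $(sort).
FILES := $(foreach d,$(DIRS),$(sort $(wildcard $(d)/*.json)))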

I've come to like long commit messages for a number of reasons.

Firstly, I tend to forget why I changed things. And I certainly forget detailed reasoning about a change.

Secondly, I'm not working alone. Other people are looking at my code (and its changes) and need to be able to understand how the code evolves.

And these long commit messages overcome the problem of code comments that get out of date. Because the commit message is tied to a specific diff (and hence a specific state of the code), it never gets out of date.

There's another interesting effect. These log messages take just a minute or two to write, but they force me to state clearly what I've been doing. Sometimes this causes me to stop and go "Oh wait, I've forgotten X". Something about writing down a description of what I'm doing (for someone else to read) makes my brain apply different neurons to the task.

Here's another example from a different project:
commit 86db749caf52b20c682b3230d2488dad08b7b7fe
Author: John Graham-Cumming
Date:   Mon Jul 1 10:14:49 2013 -0700

    Handle SIGABRT and force a panic

    It can be useful to crash XXXXXXX via a signal to get a stack trace of every
    running goroutine. To make this reliable have added handling of SIGABRT.

    If you do,

       kill -ABRT 

    A panic is generated with message "panic: SIGABRT called" followed by
    a stack trace of every goroutine.
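
The mechanism itself is only a few lines of Go. Here's a sketch of the idea (not the obscured program itself):
package main

import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGABRT)
    go func() {
        <-sig
        // An unrecovered panic prints goroutine stack traces and exits;
        // on modern Go set GOTRACEBACK=all to see every goroutine.
        panic("SIGABRT called")
    }()

    // ... the rest of the program runs normally ...
    select {}
}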

Tuesday, July 02, 2013

The Plain Mail Snail: One way to make people switch to using encrypted email

Due to revelations about access to private email (and other electronic communications) by the NSA and GCHQ, some people have been suggesting that we all need to start using encrypted email. I've had PGP/GPG keys since about 1995 and I have only ever received a handful of encrypted mails.

So, how do you make people send you encrypted mail? I think an 'economic' incentive is necessary.

If you send me an unencrypted email, it will be delayed by 12 hours before it is delivered. Encrypted email will be delivered immediately.

This is actually pretty easy to accomplish. An SMTP server can examine the contents of an incoming email and determine whether it is encrypted. If it's not, the email can be placed in a delay queue and delivered after the appropriate delay; at the same time the server can send a message warning the sender of the delay and perhaps educating them about how to send encrypted mail.
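
Detecting encryption is easy because both inline PGP and PGP/MIME (RFC 3156) messages announce themselves. Here's a sketch of the check and the policy (the function names and the 12-hour constant are illustrative, not a real MTA API):
package main

import (
    "fmt"
    "strings"
    "time"
)

// isEncrypted reports whether a raw message looks PGP-encrypted:
// an inline PGP armour block or a PGP/MIME (RFC 3156) content type.
func isEncrypted(msg string) bool {
    return strings.Contains(msg, "-----BEGIN PGP MESSAGE-----") ||
        strings.Contains(msg, "multipart/encrypted")
}

// deliveryDelay applies the PTT policy: encrypted mail goes straight
// through, plain text waits in the delay queue.
func deliveryDelay(msg string) time.Duration {
    if isEncrypted(msg) {
        return 0
    }
    return 12 * time.Hour
}

func main() {
    fmt.Println(deliveryDelay("-----BEGIN PGP MESSAGE-----")) // 0s
    fmt.Println(deliveryDelay("Just some plain text"))        // 12h0m0s
}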

This scheme could be called the Plain Text Tarpit (PTT) or perhaps the Plain Mail Snail.

If PTT were implemented, mail clients would quickly be upgraded to handle email encryption automatically.

PS What about mailing lists?

Either they accept the 12-hour delay or they find the public keys of the people they are sending to.