Sunday, February 14, 2010

A bad workman blames his tools

One of the most depressing things about being a programmer is the realization that your time is not entirely spent creating new and exciting programs, but is actually spent eliminating all the problems that you yourself introduced.

This process is called debugging. And on a daily basis every programmer must face the fact that as they write code, they write bugs. And when they find that their code doesn't work, they have to go looking for the problems they created for themselves.

To deal with this problem the computer industry has built up an enormous amount of scar tissue around programs to make sure that they do work. Programmers use continuous integration, unit tests, assertions, static code analysis, memory checkers and debuggers to help prevent and help find bugs. But bugs remain and must be eliminated by human reasoning.
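To give a flavour of one of those defences, here's what a simple assertion looks like in C (a minimal sketch; divide and its precondition are invented for the example):

#include <assert.h>

/* A hypothetical function with a precondition: callers must never
   pass b == 0. If the 'impossible' happens anyway, the assert stops
   the program at the point of failure instead of letting the bug
   propagate silently. */
int divide( int a, int b )
{
    assert( b != 0 );

    return a / b;
}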

Some programming languages, such as C, are particularly susceptible to certain types of bugs that appear and disappear at random. These are sometimes called heisenbugs because as soon as you go searching for them they vanish.

These bugs can appear in any programming language (and especially when writing multi-threaded code where small changes in timing can uncover or cover race conditions). But in C there's another problem: memory corruption.
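To make the timing point concrete, here's a minimal sketch of a race condition in C (worker and the iteration counts are invented for illustration). Two threads increment a shared counter without locking; how many updates get lost depends entirely on scheduling, so the symptom comes and goes from run to run:

#include <pthread.h>
#include <stdio.h>

long counter = 0;

/* Each thread does a million unsynchronized read-modify-write
   updates; increments from the two threads can interleave and
   overwrite each other. Build with: gcc race.c -o race -lpthread */
void * worker( void * arg )
{
    int i;

    for ( i = 0; i < 1000000; i++ )
        counter++;

    return NULL;
}

int main( int argc, char * argv[] )
{
    pthread_t t1, t2;

    pthread_create( &t1, NULL, worker, NULL );
    pthread_create( &t2, NULL, worker, NULL );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );

    printf( "counter = %ld (expected 2000000)\n", counter );

    return 0;
}

Wrap the increment in a mutex and the result becomes deterministic, which is exactly why small changes in timing can make such bugs appear and disappear.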

Whatever the cause of a bug, the key steps in finding and eliminating it are:

  1. Find the smallest possible test case that tickles the bug. The aim is to find the smallest and fastest way to reproduce the bug reliably. With heisenbugs this can be hard, but even a fast way to reproduce it some percentage of the time is valuable.

  2. Automate that test case. It's best if the test case can be automated so that it can be run again and again. This also means that the test case can become part of your program's test suite once the bug is eliminated. This'll stop it coming back.

  3. Debug until you find the root cause. The root cause is vital. Unless you fully understand why the bug occurred you can't be sure that you've actually fixed it. It's very easy to get fooled with heisenbugs into thinking that you've eliminated them, when all you've done is covered them up.

  4. Fix it and verify using #2.


Yesterday, a post appeared on Hacker News entitled “When you see a heisenbug in C, suspect your compiler’s optimizer”. This is, simply put, appalling advice.

The compiler you are using is likely used by thousands or hundreds of thousands of people. Your code is likely used by you. Which is more likely to have been shaken out and stabilized?

In fact, it's a sign of a very poor or inexperienced programmer if their first thought on encountering a bug is to blame someone else. It's tempting to blame the compiler, the library, or the operating system. But the best programmers are those who control their ego and are able to face the fact that it's likely their fault.

Of course, bugs in other people's code do exist. There's no doubt that libraries are faulty, operating systems do weird things and compilers do generate odd code. But most of the time the fault is yours, the programmer's. And that applies even if the bug appears to be really weird.

Debugging is often a case of banging your head against your own code, repeating to yourself all the impossible things that can't ever happen in it, until one of those impossible things turns out to be possible and you've found the bug.

The linked article contains an example of exactly what not to conclude:

“OK, set your optimizer to -O0,” I told Jay, “and test. If it fails to segfault, you have an optimizer bug. Walk the optimization level upwards until the bug reproduces, then back off one.”

All you know from changing optimization levels is that optimization changes whether the bug appears or not. That doesn't tell you the optimizer is wrong. You haven't found the root cause of your bug.

Since optimizers perform all sorts of code rearrangement and speed-ups, changing the optimization level is very likely to change the presence or absence of a heisenbug. That doesn't make it the optimizer's fault; it's still almost certainly yours.

Here's a concrete example of a simple C program that contains a bug that appears and disappears when the optimization level is changed, and exhibits other odd behavior. First, here's the program:

#include <stdlib.h>
#include <unistd.h>   /* getpid() is declared here, not in stdlib.h */

void a()
{
    int ar[16];

    ar[20] = (getpid() % 19 == 0);
}

int main( int argc, char * argv[] )
{
    int rc[16];

    rc[0] = 0;

    a();

    return rc[0];
}

Build this with gcc under Mac OS X using the following simple Makefile (I saved the program in a file called odd.c):

CC=gcc
CFLAGS=

odd: odd.o

And here's a simple test script that runs it 21 times and prints the return code:

#!/bin/bash

for i in {0..20}
do
    ./odd ; echo -n "$? "
done
echo

If you run that test program you'd expect a string of zeroes, because rc[0] is never set to anything other than zero in the program. Yet here's sample output:

$ ./test
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If you are an experienced C programmer you'll see how I made that 1 appear (and why it appears at different places), but let's try to debug with a quick printf (adding #include <stdio.h> at the top of odd.c):

[...]
    rc[0] = 0;

    printf( "[%d]", rc[0] );

    a();
[...]

Now when you run the test program the bug is gone:

$ ./test
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0

Weird, so you move the printf:

[...]
    rc[0] = 0;

    a();

    printf( "[%d]", rc[0] );
[...]

and get the same odd result of a disappearing bug. And the same thing happens if you turn the optimizer on even without the printfs (this is the opposite of the situation in the linked article):

$ make CFLAGS=-O3
gcc -O3 -c -o odd.o odd.c
gcc odd.o -o odd
$ ./test
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

This all came about because the function a() allocates a 16 integer array called ar and then promptly writes past the end of it: either a 1 or a 0, depending on whether the PID of the process is divisible by 19. The write ends up landing on top of rc[0] because of the arrangement of the stack.
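You can see the overlap for yourself with a small diagnostic variant of the program (just a sketch, not part of the original test: taking the address of ar[20] is itself undefined behaviour, and the result depends entirely on compiler, optimization level and platform):

#include <stdio.h>

/* Diagnostic sketch: report where a()'s out-of-bounds write would
   land, relative to rc[0] in main()'s stack frame. */
void a_diag( int * rc0 )
{
    int ar[16];

    printf( "&ar[20] = %p, &rc[0] = %p\n", (void *)&ar[20], (void *)rc0 );
}

int main( int argc, char * argv[] )
{
    int rc[16];

    rc[0] = 0;

    a_diag( &rc[0] );

    return rc[0];
}

On a build where the two addresses printed are the same, ar[20] is rc[0], and the stray write becomes the program's exit code.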

Adding printfs or changing the optimization level changes the layout of the code and stack, so the bad write no longer hits rc[0]. But beware! The bug hasn't gone; it's just scribbling on some other bit of memory.

Because C programs are susceptible to this sort of error it's vital that good tools are used to check for problems. For example, the static code checker splint and the memory analyzer valgrind help eliminate tons of nasty C bugs. And you should build your software with the maximum warning level (with gcc, at least -Wall -Wextra; I prefer warnings-as-errors via -Werror) and eliminate them all.

Only once you've done all that should you start to suspect someone else's code. And even when you do, you need to follow the same steps to reproduce the bug and get to the root cause. Most of the time, unfortunately, bugs are your fault.

Saturday, February 13, 2010

If you're searching remember your TF-IDF

Some people seem to be very good at searching the web, others seem to be very poor at it. What differentiates them? I think it's unconscious knowledge of something called TF-IDF (or term frequency-inverse document frequency).

If you clicked through to that Wikipedia link you were probably confronted by a bunch of mathematics, and since you are reading this you probably hit the back button as quickly as possible. But knowing about TF-IDF requires no mathematical knowledge at all. All you need is some common sense.

Put yourself in the shoes of a search engine. Sitting on the hard disks of its vast collection of computers are all the web pages in existence (or almost). Along comes a query from a human.

The first thing the search engine does is discard words that are too common. For example, if the search query contained the word 'the' there's almost no point using it to try to distinguish web pages. All the English ones almost certainly contain the word 'the' (just look at this one and count them).

With the common words removed, the search engine goes looking for pages that match the remaining terms and ranks them in some useful order (a lot of Google's success is based on their ranking algorithm). One thing the search engine can take into account is how common the remaining words in the search query are.

For example, suppose the query was "the first imagineer". The search engine ignores 'the' and looks for pages containing "first imagineer". Obviously the results returned need to contain both words, but 'imagineer' is special: it's a rare word. And relatively rare words are a human's best searching friends.

A rare word allows the search engine to cut down the number of pages it needs to examine enormously, and that ends up giving the user better results. The ideal rare word is one that appears almost only on the sort of pages the end user is looking for, and appears in those pages frequently.

In nerdy terms, 'appears in those pages frequently' is the TF (or term frequency), and 'almost only in the sort of pages the end user is looking for' is the IDF (inverse document frequency).
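If you do want a single line of nerdiness, one common way of combining the two (the numbers below are invented purely for illustration) is:

score(term, page) = tf(term, page) × log( N / df(term) )

where tf(term, page) counts how often the term appears on the page, N is the total number of pages, and df(term) is the number of pages containing the term. With, say, N = 1,000,000 pages, 'imagineer' appearing on 100 of them and 'first' on 900,000, the (natural) log factor is about 9.2 for 'imagineer' but only about 0.1 for 'first': the rare word dominates the score.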

Since 'imagineer' is a rare word, if the search engine finds a page on which that word occurs many times it's more likely to be relevant to the person searching than a page where 'imagineer' appears only a few times.

Since 'first' is fairly common, its contribution to the search results is less clear. If 'first' appears many times on the page but 'imagineer' only once, then it's likely that the page is of lesser interest.

When you are searching, give a few seconds' thought to TF-IDF and ask yourself 'what words are most likely to appear only in the sort of pages I am looking for?' You'll likely get to where you want to go much faster that way.

PS If you find out who the first imagineer was, drop me a line.