Sunday, February 14, 2010

A bad workman blames his tools

One of the most depressing things about being a programmer is the realization that your time is not entirely spent creating new and exciting programs, but is actually spent eliminating all the problems that you yourself introduced.

This process is called debugging. And on a daily basis every programmer must face that fact that as they write code, they write bugs. And when they find that their code doesn't work, they have to go looking for the problems they created for themselves.

To deal with this problem the computer industry has built up an enormous amount of scar tissue around programs to make sure that they do work. Programmers use continuous integration, unit tests, assertions, static code analysis, memory checkers and debuggers to help prevent and help find bugs. But bugs remain and must be eliminated by human reasoning.

Some programming languages, such as C, are particularly susceptible to certain types of bugs that appear and disappear at random, and once you try figuring out what's causing them they disappear. These are sometimes called heisenbugs because as soon as you go searching for them they vanish.

These bugs can appear in any programming language (and especially when writing multi-threaded code where small changes in timing can uncover or cover race conditions). But in C there's another problem: memory corruption.

Whatever the cause of a bug the key steps in finding an eliminating a bug are:

  1. Find the smallest possible test case that tickles the bug. The aim is to find the smallest and fastest way to reproduce the bug reliably. With heisenbugs this can be hard, but even a fast way to reproduce it some percentage of the time is valuable.

  2. Automate that test case. It's best if the test case can be automated so that it can be run again and again. This also means that the test case can become part of your program's test suite once the bug is eliminated. This'll stop it coming back.

  3. Debug until you find the root cause. The root cause is vital. Unless you fully understand why the bug occurred you can't be sure that you've actually fixed it. It's very easy to get fooled with heisenbugs into thinking that you've eliminated them, when all you've done is covered them up.

  4. Fix it and verify using #2.

Yesterday, a post appeared on Hacker News entitled When you see a heisenbug in C, suspect your compiler’s optimizer. This is, simply put, appalling advice.

The compiler you are using is likely used by thousands or hundreds of thousands of people. Your code is likely used by you. Which is more likely to have been shaken out and stabilized?

In fact, it's a sign of a very poor or inexperienced programmer if their first thought on encountering a bug is to blame someone else. It's tempting to blame the compiler, the library, or the operating system. But the best programmers are those who control their ego and are able to face the fact that it's likely their fault.

Of course, bugs in other people's code do exist. There's no doubt that libraries are faulty, operating systems do weird things and compilers do generate odd code. But most of the time, it's you, the programmer's fault. And that applies even if the bug appears to be really weird.

Debugging is often a case of banging your head against your own code repeating to yourself all of the impossible things that can't ever happen in your code until one of those impossible things turns out to be possible and you've got the bug.

The linked article contains an example of exactly what not to conclude:

“OK, set your optimizer to -O0,”, I told Jay, “and test. If it fails to segfault, you have an optimizer bug. Walk the optimization level upwards until the bug reproduces, then back off one.”

All you know from changing optimization levels is that optimization changes whether the bug appears or not. That doesn't tell you the optimizer is wrong. You haven't found the root cause of your bug.

Since optimizers perform all sorts of code rearrangement and speed ups changing optimizer levels is very likely to change the presence or absence of a heisenbug. That doesn't make it the optimizer's fault; it's still almost certainly yours.

Here's a concrete example of a simple C program that contains a bug that appears and disappears when optimization level is changed, and exhibits other odd behavior. First, here's the program:

#include <stdlib.h>

int a()
int ar[16];

ar[20] = (getpid()%19==0);

int main( int argc, char * argv[] )
int rc[16];

rc[0] = 0;


return rc[0];

Build this with gcc under Mac OS X with the following simple Makefile (I saved it in a file called odd.c):


odd: odd.o

And here's a simple test program for run it 20 times and print the return code:


for i in {0..20}
./odd ; echo -n "$? "

If you run that test program you'd expect a string of zeroes, because rc[0] is never set to anything other than zero in the program. Yet here's sample output:

$ ./test
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If you are an experienced C programmer you'll see how I made that 1 appear (and why it appears at different places), but let's try to debug with quick a printf

rc[0] = 0;

printf( "[%d]", rc[0] );


Now when you run the test program the bug is gone:

$ ./test
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0
[0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0 [0]0

Weird, so you move the printf:

rc[0] = 0;


printf( "[%d]", rc[0] );

and get the same odd result of a disappearing bug. And the same thing happens if you turn the optimizer on even without the printfs (this is the opposite of the situation in the linked article):

$ make CFLAGS=-O3
gcc -O3 -c -o odd.o odd.c
gcc odd.o -o odd
$ ./test
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

This all came about because the function a() allocates a 16 integer array called ar and then promptly writes past the end of it either 1 or 0 depending on whether the PID of the process is divisible by 19 or not. It ends up writing on top of rc[0] because of the arrangement of the stack.

Adding printfs or changing optimization level changes the layout of the code and causes the bad write to not hit rc[0]. But beware! The bug hasn't gone, it's just writing on some other bit of memory.

Because C programs are suspectible to this sort of error it's vital that good tools are used to check for problems. For example, the static code check splint and the memory analyzer valgrind help eliminate tons of nasty C bugs. And you should build your software with the maximum warning level (I prefer warn-as-error) and eliminate them all.

Only once you've done all that should you start to suspect someone else's code. And even when you do, you need to follow the same steps to reproduce the bug and get to the root cause. Most of the time, unfortunately, bugs are your fault.


If you enjoyed this blog post, you might enjoy my travel book for people interested in science and technology: The Geek Atlas. Signed copies of The Geek Atlas are available.


<$BlogCommentDateTime$> <$BlogCommentDeleteIcon$>

Post a Comment

Links to this post:

<$BlogBacklinkControl$> <$BlogBacklinkTitle$> <$BlogBacklinkDeleteIcon$>
Create a Link

<< Home