Thursday, July 15, 2010

Programmer Gore

I recently attended an invitation-only private event where a small number of venture-backed startups doing exciting things in scaling web applications opened the kimono and talked about their technical challenges. One of the startups, a major social network, talked about its backend.

When they have a failure they like to truly understand the root cause and then do a postmortem with their team, and they love to get into the gory details. They made the point that programmers really love a horrid story about why something failed: it's fascinating, and they can learn from it.

If they can't get to the root cause, they'll fix the problem and then leave a small percentage of their servers running the old buggy code, just so they can reproduce the bug and poke at it until they really know why it broke.

It's like telling horror stories around the campfire:

We shut down all our Apache servers and the bad traffic was still coming.

We pulled the plug on the DNS server and the traffic kept coming.

And then we realized: the traffic was coming from inside the building! (scream)

I think this is a fantastic idea: truly get to the source of a failure and then reveal the gory details. It's a great engineering culture if you can tell those stories and learn from them.

Once at Electric Cloud, years ago, I was sent in to deploy ElectricAccelerator at a large client. Instead of doing what it normally does, which is speed up a build 20x, it made the build slow, really slow. In fact, the product was locking up, but it was a livelock: it was doing something for a very, very long time.

After literally days of debugging I found the problem. ElectricAccelerator has to walk the tree of dependencies in a build, making various calculations along the way. The structure of this particular build was such that one particular tree node was highly connected to the rest of the build. Essentially, because of the particular shape of the build tree, a function was being called on that node to calculate the same fixed value over and over again. The simple fix was to memoize the function, and the build ran fast.
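To give a feel for how a highly connected node turns into that kind of blowup, here is a minimal sketch in Python. It is not ElectricAccelerator's code: the layered graph, the per-node "value", and the names make_layered_graph and count_calls are all made up for illustration. It only shows the general effect of caching a per-node result during a dependency walk.

```python
def make_layered_graph(layers, width):
    """Hypothetical build graph: every node in one layer depends on
    every node in the next layer, so nodes are heavily shared."""
    deps = {}
    for layer in range(layers):
        for i in range(width):
            if layer + 1 < layers:
                deps[(layer, i)] = [(layer + 1, j) for j in range(width)]
            else:
                deps[(layer, i)] = []  # last layer has no dependencies
    return deps


def count_calls(deps, root, memoize):
    """Compute a fixed per-node value by walking dependencies.
    Returns (value, number of times the value was actually computed)."""
    calls = 0
    cache = {}

    def value(node):
        nonlocal calls
        if memoize and node in cache:
            return cache[node]  # reuse the value computed earlier
        calls += 1
        result = 1 + sum(value(child) for child in deps[node])
        if memoize:
            cache[node] = result
        return result

    return value(root), calls


deps = make_layered_graph(layers=4, width=3)
print(count_calls(deps, (0, 0), memoize=False))  # (40, 40)
print(count_calls(deps, (0, 0), memoize=True))   # (40, 10)
```

Without the cache, the number of computations in this toy graph grows roughly like width to the power of layers, while with the cache each reachable node is computed once. A real build has far more nodes than this, which is how you end up with run times measured in stellar lifetimes.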

Later I calculated how long the build would have taken to run without the fix. The answer was that the sun would have died long, long before. Of course, it seems obvious looking back that we should have been caching the value in that function, but distributed or parallel systems have these sorts of problems all the time. It's not a priori obvious what the 'shape' of your data or traffic is going to be.

And when distributed or parallel systems go wrong, they go wrong in weird and wonderful ways that make great programmer gore stories.


Anonymous said...

For a slightly more formal approach to this idea, try Five Whys.

Anonymous said...

The "stickiness" of stories is covered rather nicely in "Made to Stick."

In fact, one of their examples was a copier repair "gore story."

I agree that learning/teaching can use sticky stories to great effect. In fact, I often tell horror stories about The Cosmic Ray Bug. It's a pretty mundane lesson about agreeing upon a protocol, but I have honed the tale over the last 30 years so that it sticks.

Another of my faves is the Nine Month Bug Fix. It's about walking away from a problem to gain real perspective.