Skip to main content

tacoli: a simple logging format

A post on Hacker News entitled Log Everything As JSON. Make Your Life Easier reminded me of my private logging strategy which has the following properties:

1. Easy to parse and analyze with Unix command-line tools such as grep, cut, sort, uniq, and wc

2. Easy to parse and analyze in code using Perl, Ruby, or Go

3. Compact

4. Easily expandable and lacking the ambiguity of simple delimited log formats

I call it tacoli (which stands for Tabs, Colons and Lines).  Here are the tacoli logging rules: Each log entry is a single line that starts with the date/time; the second entry on the line is a string called the 'generator' which indicates where the log line came from (such as the program or module); all the other entries have the format "key: value"; and entries are tab-delimited and no tabs are allowed in keys, values or the generator name.

That's it.  Here's an example log line from Apache in this format:

22/Apr/2012:06:29:07 +0000      apache  ip: method: GET     uri: /example.html code:301        size:305        referer:        agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.162

Note that it's easy to make Apache output this format just by using tabs and adding the appropriate key: to each field in the LogFormat.  No special logger module required.  In fact, anything that can 'printf' a string can create tacoli lines trivially.

It's trivial to parse in code, all you need is 'split' to break on the tabs, and then split again to break the key name from the value.  No specialized JSON (or other parser) required.

It's trivial to extend without breaking any tools.  Just add a new field (anywhere on the line) with a new key.

It's simple to work with using Unix tools.  Since the format is 'one log entry per line' it works well with wc -l to count instances of anything and it interfaces with all the other Unix tools that expect to work with lines (and even in code the line oriented nature is helpful since getting a complete entry is a single line read).

If you want to extract a single field from each line of the log file then it's easy to do with grep.  Here's an example that extracts all the lines that have an ip entry and just extracts that

grep -Po "\tip: [^\t]+" access.log

The key name can be trivially removed using cut

grep -Po "\tip: [^\t]+" access.log | cut -d: -f2-

and the output can be fed into the other Unix tools.  Also, if you know that your log file format hasn't changed you can still use the positional information to simplify parsing and fall back to cut.
It isn't quite as compact as a log file format that only uses position to indicate meaning, but compression largely overcomes that problem and key names can be chosen to be short and unique.


zdw said…
The example date format given here is nonideal.

You should be using ISO8601 for date/time stamps, unless you require more accuracy, in which case something like TAI64 may be more appropriate.
Anonymous said…
Seems like a good idea at first blush, but this will cause you no end of heartache as soon as your log data contains embedded colons. Those will pop up much more often than you might expect: Windows pathnames, standard format timestamps, even many user-agent strings contain colons (I'm looking at you, Firefox!).

JSON or other marginally structured formats protect you against that, and that's well worth the cost.
Unknown said…
The leading space from the IP address can be removed by changing the cut(1) to

cut -d' ' -f2-

It will leave further leading spaces but then they're significant since they were present in the outputted field.

printf 'a\tip:\tb\n' |
grep -Po "\tip: [^\t]+" |
cut -d' ' -f2-

sed 's/[^ ]* //' would also achieve the same and for some reason I tend to prefer it over cut(1); less counting of fields and more thinking about the change?
Nils said…
Ericm: That isn't the problem. You split on tabs, then on _the first_ colon. Unless you allow colons in the key, that is. Embedded tabs though...
curt504 said…
I tried living in a JSON encoded log file world. I literally got headaches! No gain, I felt pain. This is a good idea. Pair it with Splunk or Kibana log parsing DB / search tool and you'd have a good comprimise between still human readable and machine parsible.
curt504 said…
This is a good compromise between still being human readible and better machine parsibility. Combine with Splunk or Kibana log DB / search engines and you have a good combination.
Rob said…
You could also use a positive lookbehind assertion and get rid of cut all together:

grep -Po '(?<=ip: )[^\t]+'

Popular posts from this blog

Your last name contains invalid characters

My last name is "Graham-Cumming". But here's a typical form response when I enter it: Does the web site have any idea how rude it is to claim that my last name contains invalid characters? Clearly not. What they actually meant is: our web site will not accept that hyphen in your last name. But do they say that? No, of course not. They decide to shove in my face the claim that there's something wrong with my name. There's nothing wrong with my name, just as there's nothing wrong with someone whose first name is Jean-Marie, or someone whose last name is O'Reilly. What is wrong is that way this is being handled. If the system can't cope with non-letters and spaces it needs to say that. How about the following error message: Our system is unable to process last names that contain non-letters, please replace them with spaces. Don't blame me for having a last name that your system doesn't like, whose fault is that? Saying "Your

All the symmetrical watch faces (and code to generate them)

If you ever look at pictures of clocks and watches in advertising they are set to roughly 10:10 which is meant to be the most attractive (smiling!) position for the hands . They are actually set to 10:09.14 if the hands are truly symmetrical. CC BY 2.0 image by Shinji I wanted to know what all the possible symmetrical watch faces are and so I wrote some code using Processing. Here's the output (there's one watch face missing, 00:00 or 12:00, because it's very boring): The key to writing this is to figure out the relationship between the hour and minute hands when the watch face is symmetrical. In an hour the minute hand moves through 360° and the hour hand moves through 30° (12 hours are shown on the watch face and 360/12 = 30). The core loop inside the program is this:   for (int h = 0; h <= 12; h++) {     float m = (360-30*float(h))*2/13;     int s = round(60*(m-floor(m)));     int col = h%6;     int row = floor(h/6);     draw_clock((r+f)*(2*col+1), (r+f)*(row*2+1),

The Elevator Button Problem

User interface design is hard. It's hard because people perceive apparently simple things very differently. For example, take a look at this interface to an elevator: From flickr Now imagine the following situation. You are on the third floor of this building and you wish to go to the tenth. The elevator is on the fifth floor and there's an indicator that tells you where it is. Which button do you press? Most people probably say: "press up" since they want to go up. Not long ago I watched someone do the opposite and questioned them about their behavior. They said: "well the elevator is on the fifth floor and I am on the third, so I want it to come down to me". Much can be learnt about the design of user interfaces by considering this, apparently, simple interface. If you think about the elevator button problem you'll find that something so simple has hidden depths. How do people learn about elevator calling? What's the right amount of