Friday, April 27, 2012

tacoli: a simple logging format

A post on Hacker News entitled "Log Everything As JSON. Make Your Life Easier" reminded me of my private logging strategy, which has the following properties:

1. Easy to parse and analyze with Unix command-line tools such as grep, cut, sort, uniq, and wc

2. Easy to parse and analyze in code using Perl, Ruby, or Go

3. Compact

4. Easily expandable and lacking the ambiguity of simple delimited log formats

I call it tacoli (which stands for Tabs, Colons and Lines).  Here are the tacoli logging rules:

1. Each log entry is a single line that starts with the date/time.

2. The second entry on the line is a string called the 'generator', which indicates where the log line came from (such as the program or module).

3. All the other entries have the format "key: value".

4. Entries are tab-delimited, and no tabs are allowed in keys, values, or the generator name.

That's it.  Here's an example log line from Apache in this format:

22/Apr/2012:06:29:07 +0000	apache	ip:	method: GET	uri: /example.html	code: 301	size: 305	referer:	agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.162

Note that it's easy to make Apache output this format just by using tabs and adding the appropriate key: prefix to each field in the LogFormat; no special logger module is required.  In fact, anything that can 'printf' a string can create tacoli lines trivially.
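
For example, a LogFormat directive along these lines should produce the line above (a sketch only: it assumes Apache 2.x with mod_log_config, and the %{...}t timestamp format string may need adjusting for your setup):

LogFormat "%{%d/%b/%Y:%H:%M:%S %z}t\tapache\tip: %h\tmethod: %m\turi: %U\tcode: %>s\tsize: %b\treferer: %{Referer}i\tagent: %{User-Agent}i" tacoli
CustomLog /var/log/apache2/access.log tacoli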

It's trivial to parse in code: all you need is 'split' to break the line on tabs, and then a second split on the first colon to separate each key name from its value.  No specialized JSON (or other) parser is required.
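
Here's a minimal sketch of that in Go (the function name and the handling of a malformed field are my own choices, not part of the format):

package main

import (
	"fmt"
	"strings"
)

// parseTacoli breaks one tacoli line into its timestamp, generator and
// key/value fields: split on tabs, then split each field on the first
// colon so that colons inside values are left alone.
func parseTacoli(line string) (ts, gen string, fields map[string]string) {
	parts := strings.Split(line, "\t")
	ts, gen = parts[0], parts[1]
	fields = make(map[string]string)
	for _, p := range parts[2:] {
		kv := strings.SplitN(p, ":", 2) // first colon only
		if len(kv) == 2 {
			fields[kv[0]] = strings.TrimPrefix(kv[1], " ")
		}
	}
	return
}

func main() {
	line := "22/Apr/2012:06:29:07 +0000\tapache\tmethod: GET\turi: /example.html\tcode: 301"
	ts, gen, fields := parseTacoli(line)
	fmt.Println(ts, gen, fields["uri"], fields["code"])
}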

It's trivial to extend without breaking any tools.  Just add a new field (anywhere on the line) with a new key.

It's simple to work with using Unix tools.  Since the format is 'one log entry per line', it works well with wc -l for counting instances of anything, and it interfaces with all the other Unix tools that expect to work with lines.  (Even in code the line-oriented nature is helpful, since reading a complete entry is a single line read.)
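
For example, counting the 301 responses in the example log is a one-liner (assuming GNU grep for -P):

grep -P "\tcode: 301" access.log | wc -l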

If you want to extract a single field from each line of the log file, it's easy to do with grep.  Here's an example that finds all the lines that have an ip entry and extracts just that field:

grep -Po "\tip: [^\t]+" access.log

The key name can be trivially removed using cut:

grep -Po "\tip: [^\t]+" access.log | cut -d: -f2-

and the output can be fed into the other Unix tools.  Also, if you know that your log file format hasn't changed, you can still use the positional information to simplify parsing and fall back to cut.
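
For example, if the ip entry is always the third tab-delimited field on the line (cut splits on tabs by default):

cut -f3 access.log
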
It isn't quite as compact as a log file format that uses only position to indicate meaning, but compression largely overcomes that problem, and key names can be chosen to be short and unique.
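
For instance, compressed tacoli logs are still directly searchable (zgrep passes its options through to grep, so this again assumes GNU grep for -P):

gzip access.log
zgrep -Po "\tip: [^\t]+" access.log.gz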


zdw said...

The example date format given here is nonideal.

You should be using ISO8601 for date/time stamps, unless you require more accuracy, in which case something like TAI64 may be more appropriate.
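
(For example, the timestamp from the post would be 2012-04-22T06:29:07+00:00 in ISO 8601 form.)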

ericm said...

Seems like a good idea at first blush, but this will cause you no end of heartache as soon as your log data contains embedded colons. Those will pop up much more often than you might expect: Windows pathnames, standard format timestamps, even many user-agent strings contain colons (I'm looking at you, Firefox!).

JSON or other marginally structured formats protect you against that, and that's well worth the cost.

Ralph Corderoy said...

The leading space from the IP address can be removed by changing the cut(1) to

cut -d' ' -f2-

It will leave any further leading spaces, but then they're significant since they were present in the output field.

printf 'a\tip:  b\n' |
grep -Po "\tip: [^\t]+" |
cut -d' ' -f2-

sed 's/[^ ]* //' would also achieve the same and for some reason I tend to prefer it over cut(1); less counting of fields and more thinking about the change?

Nils said...

Ericm: That isn't the problem. You split on tabs, then on _the first_ colon. Unless you allow colons in the key, that is. Embedded tabs though...

curt504 said...

I tried living in a JSON-encoded log file world. I literally got headaches! No gain, I felt pain. This is a good idea. Pair it with a log parsing DB / search tool like Splunk or Kibana and you'd have a good compromise between still being human-readable and machine-parsable.

Rob said...

You could also use a positive lookbehind assertion and get rid of cut altogether:

grep -Po '(?<=ip: )[^\t]+' access.log