Monday, April 30, 2012

Make your own 'prime factorization' diagram

The Prime Factorization Sweater is a lovely idea and I thought it would be fun to reproduce the same idea electronically so that I could print out a poster version for home.

Enter Processing.

With it I've developed a small program that produces a diagram of the first 100 numbers.  For each number there's a circle broken up into arcs, and each arc is a prime factor.  As in the original sweater, each factor gets a unique color (assigning unique colors is rather complex; I ended up using the color difference method based on CMC l:c and a nice online tool that does the work for you).

Here's the finished product.  The top left corner is the number 1 and the numbers read left to right.  So the first red circle is a prime number (2), the second is the next number (3, which is also prime) and so on.

There's also an option to print the numbers involved.
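The idea behind the diagram can be sketched in a few lines. This is a minimal Python illustration, not the Processing source from the repository: the function names are hypothetical, and giving each prime factor (counted with multiplicity) an equal share of the circle is an assumption made for the example.

```python
import math

def prime_factors(n):
    """Return the prime factors of n in ascending order, with multiplicity."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors

def arcs(n):
    """Split a full circle into one equal arc per prime factor of n.

    Returns (factor, start_angle, end_angle) tuples with angles in radians.
    Equal arc sizes are an assumption, not taken from the original program.
    """
    factors = prime_factors(n)
    if not factors:  # 1 has no prime factors, so there is nothing to draw
        return []
    step = 2 * math.pi / len(factors)
    return [(f, i * step, (i + 1) * step) for i, f in enumerate(factors)]
```

A drawing routine would then loop over the numbers 1 to 100, call arcs(n) for each, and look up a color for every distinct factor.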

The source code is in the pfd repository on GitHub and licensed under GPLv2. Processing is a really nice environment for this sort of rapid hacking of anything graphical. See, for example, how I used it to visualize Ikea Lillabo Train Set layouts.

PS After encouragement in the comments from the person who had the original idea for the prime factorization sweater I've made a CafePress store in which you can buy men's and women's T-shirts printed with the prime factorization diagram.

Friday, April 27, 2012

tacoli: a simple logging format

A post on Hacker News entitled Log Everything As JSON. Make Your Life Easier reminded me of my private logging strategy which has the following properties:

1. Easy to parse and analyze with Unix command-line tools such as grep, cut, sort, uniq, and wc

2. Easy to parse and analyze in code using Perl, Ruby, or Go

3. Compact

4. Easily expandable and lacking the ambiguity of simple delimited log formats

I call it tacoli (which stands for Tabs, Colons and Lines).  Here are the tacoli logging rules:

1. Each log entry is a single line that starts with the date/time.

2. The second entry on the line is a string called the 'generator' which indicates where the log line came from (such as the program or module).

3. All the other entries have the format "key: value".

4. Entries are tab-delimited and no tabs are allowed in keys, values or the generator name.

That's it.  Here's an example log line from Apache in this format:

22/Apr/2012:06:29:07 +0000      apache  ip: method: GET     uri: /example.html code:301        size:305        referer:        agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.162

Note that it's easy to make Apache output this format just by using tabs and adding the appropriate key: to each field in the LogFormat.  No special logger module required.  In fact, anything that can 'printf' a string can create tacoli lines trivially.
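As a rough illustration of how little is needed to emit the format, here is a minimal Python sketch. The function name tacoli_line and its parameters are hypothetical, the timestamp layout simply mimics the Apache example above, and rejecting newlines as well as tabs is an assumption that follows from the one-entry-per-line rule.

```python
from datetime import datetime, timezone

def tacoli_line(generator, fields, when=None):
    """Build one tacoli line: timestamp, generator, then "key: value" entries."""
    if when is None:
        when = datetime.now(timezone.utc)
    parts = [when.strftime("%d/%b/%Y:%H:%M:%S %z"), generator]
    parts += [f"{key}: {value}" for key, value in fields.items()]
    for part in parts:
        # Tabs are the delimiter and each entry must stay on a single line.
        if "\t" in part or "\n" in part:
            raise ValueError("tabs and newlines are not allowed in tacoli entries")
    return "\t".join(parts)
```

Calling tacoli_line("apache", {"method": "GET", "code": 301}) yields a single tab-delimited line that can be written straight to a log file.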

It's trivial to parse in code: all you need is 'split' to break the line on the tabs, and then a second split to separate each key name from its value.  No specialized JSON (or other) parser is required.
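The two-step split might look like this in Python. It is a sketch only: parse_tacoli is a hypothetical name, and it assumes that keys never contain a colon, so each entry is split on its first colon and values such as URLs or timestamps survive intact.

```python
def parse_tacoli(line):
    """Split a tacoli line into (timestamp, generator, fields)."""
    timestamp, generator, *entries = line.rstrip("\n").split("\t")
    fields = {}
    for entry in entries:
        # Split on the first colon only, so values may contain colons.
        key, _, value = entry.partition(":")
        fields[key] = value.strip()
    return timestamp, generator, fields
```

Because the fields end up in a dictionary keyed by name, adding a new field anywhere on the line does not disturb code that looks up the existing keys.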

It's trivial to extend without breaking any tools.  Just add a new field (anywhere on the line) with a new key.

It's simple to work with using Unix tools.  Since the format is 'one log entry per line' it works well with wc -l to count instances of anything and it interfaces with all the other Unix tools that expect to work with lines (and even in code the line oriented nature is helpful since getting a complete entry is a single line read).

If you want to extract a single field from each line of the log file then it's easy to do with grep.  Here's an example that finds all the lines that have an ip entry and outputs just that entry:

grep -Po "\tip: [^\t]+" access.log

The key name can be trivially removed using cut

grep -Po "\tip: [^\t]+" access.log | cut -d: -f2-

and the output can be fed into the other Unix tools.  Also, if you know that your log file format hasn't changed you can still use the positional information to simplify parsing and fall back to cut.

It isn't quite as compact as a log file format that only uses position to indicate meaning, but compression largely overcomes that problem and key names can be chosen to be short and unique.
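The same kind of analysis is just as short in code. The following Python sketch counts how often each value of a given key appears, roughly what piping the grep and cut commands above into sort and uniq -c would give; count_field is a hypothetical helper written for this illustration.

```python
from collections import Counter

def count_field(lines, key):
    """Count how often each value of `key` appears across tacoli lines."""
    prefix = key + ": "
    counts = Counter()
    for line in lines:
        # Skip the timestamp and generator; the rest are "key: value" entries.
        for entry in line.rstrip("\n").split("\t")[2:]:
            if entry.startswith(prefix):
                counts[entry[len(prefix):]] += 1
    return counts
```

Passing an open log file as lines works directly, since iterating over a file yields one entry per line.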

The Greatest Machine That Never Was

I was invited to talk at TEDx Imperial College and gave a talk about Charles Babbage's Analytical Engine called The Greatest Machine That Never Was. Here's the video of that talk:

All the other talks are here. The project to build the Analytical Engine is Plan 28.

Sunday, April 22, 2012

Deglitching a Sparkfun 7-segment Serial Display

The display in my Ambient Bus Arrival Monitor is a Sparkfun 7-segment Serial Display connected to the TTL serial port.  I had noticed that occasionally the display would reset itself to 0000 (or sometimes 0, 00 or 000).  It was even possible to make it do this by touching the body of the bus.  It didn't happen often so I was able to ignore it but then it began to happen more and more.

After a very long and tedious investigation I discovered why.  I started out by blaming my code, my soldering, the cable I was using, the quality of the connectors, ...  Only having eliminated everything that I'd touched did I realize it must be something else.

The display has two input methods: serial (which I am using) and SPI (which I am not).  The SPI interface has a clock signal (which in the case of the display is acting as an input) called SCK.  The manual for the display says "The display is configured using SPI mode 0 (CPOL = 0, CPHA = 0), so the clock line should idle low".

If you take a quick look at the schematic for the display you'll discover that the SCK pin on the microcontroller is connected to nothing but a solder pad.

The upshot is that if you (the user) don't connect that pin to something, Sparkfun isn't doing it for you.  The display glitch I was seeing happened because this floating clock input would sometimes go high; the firmware for the device would then read a byte (all zeroes, of course) and write it to the display.

There were two possible fixes: hack the firmware so that it ignores SPI completely, or force the SCK pin low all the time.  I opted for the latter (since it was a quick fix) and connected a 10k resistor between that pin and ground.  Glitch gone.

Pity that Sparkfun didn't include a pull-down resistor on SCK (and possibly on RX as well).

Monday, April 16, 2012

Getting around the London 2012 branding police

The Guardian reported the other day on the London 2012 Olympics branding police who ensure that words like London, 2012 or Games aren't being used by people who didn't pay to use them:

As well as introducing an additional layer of protection around the word "Olympics", the five-rings symbol and the Games' mottoes, the major change of the legislation is to outlaw unauthorised "association". This bars non-sponsors from employing images or wording that might suggest too close a link with the Games. Expressions likely to be considered a breach of the rules would include any two of the following list: "Games, Two Thousand and Twelve, 2012, Twenty-Twelve".

Using one of those words with London, medals, sponsors, summer, gold, silver or bronze is another likely breach. The two-word rule is not fixed, however: an event called the "Great Exhibition 2012" was threatened with legal action last year under the Act over its use of "2012" (Locog later withdrew its objection).

And today I received a funny email from Novotel where they are forced to use euphemisms for the London 2012 Olympic Games because they are clearly not an official sponsor.  It reminded me of the wonderful world of email penis enlargement spam where filters would spot all the common terms and spammers would insert "male enhancement" and other terms to get through.

Here Novotel is forced to refer to "London's summer of sport" and "London's Big Event".  So, that's how you do it folks, think like a penis enlargement spammer and you can talk about the London 2012 Olympics all you want.

My blog is in no way associated with the Londinium 0x7DC Ολυμπιακοί Demonstrations of World Class Athletic Ability.  But if you do wish to talk about them, may I suggest the official name Londinium MMXII and hash tag #mmxii.  And here's an alternative logo representing the five interlocking benzene rings of benzopyrene:

Why benzopyrene, you ask?  Well, it's because benzopyrene messes up DNA transcription (or copying).  So benzopyrene is a reminder not to copy any of the official DNA of London 2012.

More support for open software in science

In the space of two months both of the world's most famous scientific journals have published pieces arguing for open source code.

Back in February two co-authors and I had a paper in Nature arguing for open software in science.  That paper was entitled The Case for Open Computer Programs.  Last week the US journal Science published a piece entitled Shining Light into Black Boxes arguing the same thing and giving policy recommendations.

Is it not now time for international cooperation on defining standards for code openness and associated policies?  The Science paper lays out suggested policies and could be used as a starting point.

Saturday, April 14, 2012

Brief Plan 28 Update

As of today, people who asked to be kept informed about Plan 28 and the construction of Babbage's Analytical Engine are receiving emails asking them to confirm their subscription to the official mailing list.  People who want to join the mailing list can subscribe here.  The official Twitter account is @plan28.

Finally, Plan 28 is getting moving.

Over the next few weeks expect announcements about initial funding and the general schedule for the project.

Tuesday, April 10, 2012

Bletchley Park is Blooming

Despite the persistent drizzly rain yesterday it was clear that spring time had come to Bletchley Park in more ways than one.  The trees and flowers around the grounds were starting to blossom and bloom and inside the slightly rickety Second World War walls the museum is undergoing its own springtime.

After years of struggle to first save, then preserve and now, finally, improve this precious part of British history, the hard work by staff and volunteers is beginning to become obvious to even the most casual visitor.

By flickr user Draco2008
I've been visiting Bletchley Park for a long time and for a while it was hard to take a non-enthusiast around because the museum itself was a bit of a jumble.  BP simply didn't have the money (or spare time away from fighting for survival) to create a fantastic museum suitable for all.  But now it's really happening and it's easy to see how Bletchley Park's spring time can turn into summer.

It's easy for me to sing the praises of Bletchley Park because I'm so fascinated by the technical history of the place, but it's important to realize that Bletchley Park has something that most museums do not: the place is part of the exhibit.

Bletchley Park doesn't contain a collection of objects or stories of things that happened elsewhere.  When you walk through the front gates you are entering a time warp world.  Your first clue comes in the form of the low-rise buildings hastily constructed during the Second World War that first housed the code breakers and now house the museum itself.

For Bletchley Park is both place and museum, and unlike some stuffily preserved country house, it's full of life.  For as well as having the place and the exhibits, Bletchley Park is filled with the stories of what happened there.  And these stories are brought to life by a continuous stream of enthusiastic volunteers and veterans.

Of course, Bletchley Park is not today at the same level of sophistication as many British museums that have had years to perfect their displays and explanations (and in some cases drive out any enthusiasm that was present in their staff).

But the new things that are happening at Bletchley Park show the route to a glorious future to reflect its glorious past.  The new Alan Turing Exhibit has been deservedly nominated for the Art Fund Prize and puts the rebuilt Bombe in proper context.  Colossus has finally got a proper viewing gallery.  And the Radio Society of Great Britain has opened the National Radio Centre.

Couple that with the constant activities available (yesterday children were following the Easter Bunny around on a children-themed visit) and Bletchley Park is becoming a great day out.  And it's easy to reach.  If you haven't visited Bletchley Park do so now before it becomes so popular that you are forced to apply online for tickets with timed entry!

Of course, Bletchley Park isn't out of the woods yet.  Support is still needed and it still doesn't have any continuous form of government funding.  Donation information is here.

And, one specific project is looking for sponsors.  I've written before about the project to build one of Alan Turing's other inventions: Delilah.  Delilah was a secure speech system (or scrambler) that Turing worked on and thanks to the declassification of documents surrounding it, it is currently being reconstructed by the team that worked on the Bombe.  They are currently funding it out of their own pockets (to the tune of £1,000s) and are looking for sponsors (corporate or personal) to help finish the machine.  Contact me if you are interested.

Tuesday, April 03, 2012

In praise of... text files and protocols

The other night I had to debug a problem where CMYK colors specified in an OmniGraffle file weren't making it into an exported PDF (or at least appeared not to be). At first it looked like it might be a nightmare because what I really wanted to do was ignore the OmniGraffle UI and look inside the .graffle file and the PDF itself.

But salvation was at hand: both .graffle and PDF are text formats. The OmniGraffle file is actually an XML document (in some cases it's a gzipped XML document, but it can be decompressed with gunzip). Here, for example, is part of the Colors.graffle file that's provided as a sample. It's easy to see the RGB colors that are specified and just as easy to modify them by adjusting the text file.

Yes, it's an image of text.  Just like a binary file is an 'image' of something that could have been easy to manipulate.

While fiddling around in the .graffle file looking at the CMYK colors I spotted that some straight lines that had been drawn in OmniGraffle were not quite straight. That's quite tricky to see in the UI, but dead easy in the XML document and you can simply fix the coordinates. Here, for example, are segments of a line drawn from the Nucleobases.graffle sample file:

The text format made it easy to examine what was happening under the hood of the fancy UI, to quickly fix small problems and to manipulate the file using other programs. Similarly, once I'd determined that in the file I was working with the CMYK colors were fine I exported a PDF and decompressed the result using pdftk. It was fairly easy to follow a color specification through from the .graffle file and into the PDF. Here, for example, is an RGB color specified in the Nucleobases.graffle and the corresponding color appearing in the exported PDF of the same file:

And with that I was able to determine that the CMYK colors were correct and that any problem lay with the person I was sending the PDF to.

The deeper story is that human-readable text formats are wonderful: they are easily debugged, they are easily manipulated (with text editors and other tools like awk and sed), and they can be compressed using common compression programs if space is a problem.
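To show how little tooling this takes, here is a minimal Python sketch that reads a text file which may or may not be gzipped, as a .graffle file can be. The helper name read_text_maybe_gzipped is hypothetical and UTF-8 is an assumed default encoding; the check relies only on the two magic bytes that begin every gzip stream.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def read_text_maybe_gzipped(path, encoding="utf-8"):
    """Return the text of a file, transparently decompressing gzip."""
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(GZIP_MAGIC):
        data = gzip.decompress(data)
    return data.decode(encoding)
```

Once the text is in hand, ordinary string searches, regular expressions or an XML parser are enough to find the color values or coordinates of interest.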

Similarly, text-based protocols (such as HTTP, IMAP, SMTP, FTP and POP3) are easy for humans to write, read and debug. One of the things that made POPFile easy to implement was that all the mail protocols are text based (the entire POP3 proxying module is able to use simple string matching and regular expressions to handle POP3). And they are also line oriented (a command is read by reading to the line ending). That makes programs to handle them very easy to implement.
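A small sketch shows why line-oriented text protocols are so easy to handle. This is not POPFile's code (the function and pattern names are hypothetical); it only assumes what RFC 1939 says about POP3 commands: a case-insensitive keyword of three or four characters, optionally followed by arguments, on a line ending in CRLF.

```python
import io
import re

# A POP3 command: a 3-4 letter keyword, then optional arguments after a space.
COMMAND = re.compile(r"^([A-Za-z]{3,4})(?: (.*))?$")

def read_command(stream):
    """Read one command line from a text stream.

    Returns (verb, argument), ("INVALID", raw_line) for anything that does
    not look like a command, or None at end of stream.
    """
    line = stream.readline()
    if not line:
        return None
    line = line.rstrip("\r\n")
    match = COMMAND.match(line)
    if not match:
        return ("INVALID", line)
    return (match.group(1).upper(), match.group(2) or "")
```

For example, feeding it io.StringIO("USER alice\r\nQUIT\r\n") returns ("USER", "alice") on the first call and ("QUIT", "") on the second.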

Recently I used an undocumented API that was entirely text-based (using JSON) to obtain live bus arrival times in London and make an Ambient Bus Arrival Monitor.

Another great example of a text format appears in the code that's behind Hacker News and UseTheSource.  In the Lisp philosophy your program is also data it can consume and the data about users is simply sent to a file as Arc code meaning that any admin tasks that don't (yet) have UI can be performed trivially by hand:

Of course, the downside is that text takes up extra space and for low-level protocols (such as IP) it makes sense to use binary. But for almost everything else it's best to use text. Only use binary protocols where the performance is so sensitive that it's worth the implementation and debugging downside. The upside is that no special tools are needed.

I wonder how much of the success of the Internet can be put down to the decision to use text-based protocols for almost everything that people will need to implement.  And how much we owe the early writers of the RFCs for deciding that text was best.

PS A reader points to Eric Raymond's Art of Unix Programming and specifically the chapter called Textuality.

PPS A commenter over at Hacker News makes the very good point that it's easy to version/diff text files and very hard with binary.

PPPS Another commenter over at Hacker News points out that there's a chapter in The Pragmatic Programmer called The Power of Plain Text.

Making an old USB printer support Apple AirPrint using a Raspberry Pi

There are longer tutorials on how to connect a USB printer to a Raspberry Pi and make it accessible via AirPrint but here's the minimal ...