Posts

Showing posts from February, 2010

A bad workman blames his tools

One of the most depressing things about being a programmer is the realization that your time is not entirely spent creating new and exciting programs, but is actually spent eliminating all the problems that you yourself introduced. This process is called debugging. And on a daily basis every programmer must face the fact that as they write code, they write bugs. And when they find that their code doesn't work, they have to go looking for the problems they created for themselves. To deal with this problem the computer industry has built up an enormous amount of scar tissue around programs to make sure that they do work. Programmers use continuous integration, unit tests, assertions, static code analysis, memory checkers and debuggers to help prevent and help find bugs. But bugs remain and must be eliminated by human reasoning. Some programming languages, such as C, are particularly susceptible to certain types of bugs that appear and disappear at random, and once you t

If you're searching remember your TF-IDF

Some people seem to be very good at searching the web; others seem to be very poor at it. What differentiates them? I think it's unconscious knowledge of something called TF-IDF (or term frequency-inverse document frequency). If you clicked through to that Wikipedia link you were probably confronted by a bunch of mathematics, and since you are reading this you probably hit the back button as quickly as possible. But knowing about TF-IDF requires no mathematical knowledge at all. All you need is some common sense. Put yourself in the shoes of a search engine. Sitting on the hard disks of its vast collection of computers are all the web pages in existence (or almost). Along comes a query from a human. The first thing the search engine does is discard words that are too common. For example, if the search query contained the word 'the' there's almost no point using it to try to distinguish web pages. All the English ones almost certainly contain the word 'the
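
For readers who would rather see the idea than the mathematics, here is a minimal Python sketch of the common-sense version. Everything in it is an assumption made for illustration: the three-page `PAGES` corpus, the helper names (`tokenize`, `idf`, `tf_idf`, `rank`) and the plain logarithmic weighting are not taken from any real search engine. Note also that it does not discard common words outright; it uses the inverse document frequency, which gives a word that appears on every page a weight of zero, so the effect on 'the' is much the same.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for "all the web pages in existence".
PAGES = {
    "page1": "the cat sat on the mat",
    "page2": "the dog chased the cat",
    "page3": "the stock market fell sharply",
}


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def idf(term, pages):
    """Inverse document frequency: the rarer the term, the higher the score."""
    containing = sum(1 for text in pages.values() if term in tokenize(text))
    if containing == 0:
        return 0.0
    return math.log(len(pages) / containing)


def tf_idf(term, text, pages):
    """Term frequency within one page, weighted by how rare the term is overall."""
    words = tokenize(text)
    if not words:
        return 0.0
    tf = Counter(words)[term] / len(words)
    return tf * idf(term, pages)


def rank(query, pages):
    """Return page names ordered from best to worst match for the query."""
    terms = tokenize(query)
    scores = {
        name: sum(tf_idf(term, text, pages) for term in terms)
        for name, text in pages.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(idf("the", PAGES))          # 'the' is on every page, so it carries no weight
    print(rank("the market", PAGES))  # only 'market' helps tell the pages apart
```

In this toy corpus 'the' appears on every page, so it contributes nothing to any score, while a word that appears on a single page does all the work of separating that page from the rest.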