Friday, February 18, 2011

How could I have coded this better?

For work I had to write a small script to scrape Google results to measure the relative rank of my company and others that we like to track. We were looking at various search terms and where we appear in the results.

So, I decided to use Python with BeautifulSoup. I love BeautifulSoup, but I don't use Python very often, so... Python experts: criticize my code! How would you have done this better?
# Script to perform Google searches and extract the domain names of
# the returned results.  The order in which the domain names are
# returned is used to determine a ranking between different companies.

# This is the list of domains to look for in Google searches.  To
# search for more or different domains simply alter this list.  The
# order of this list determines the order in which the results are
# saved.

domains = [ '', '', '' ]

from BeautifulSoup import BeautifulSoup
import urllib, urllib2
from urlparse import urlparse

# Performs a Google search and returns a BeautifulSoup object
# containing the parsed, returned page.
def google(q): # Query string (will be URL quoted by this function)

    # This attempts to ask for 100 results from Google.  This does not
    # always seem to work correctly.  Note that the fake User-Agent is
    # required, otherwise Google will reject the search.

    url = "" % 
    req = urllib2.Request(url, None, {'User-Agent':'Mozilla/5.0 (Macintosh; U; 
             Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.13 (KHTML, like Gecko) 
             Chrome/9.0.597.94 Safari/534.13'} )
    res = urllib2.urlopen(req)
    dat =

    return BeautifulSoup(dat)

# Small helper function to output a single line of a CSV file.  Note
# that this does not quote its arguments and so assumes that no '"'
# or ',' characters are present.
def csv(f, l): # First element to print followed by a list of
               # elements to print
    print "%s," % f, ", ".join(l)

# Cartesian product function (similar to itertools.product but joins
# the elements together as a space separated string rather than
# returning a tuple)
def product(*args): # Lists to cross together
    pools = map(tuple, args)
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield " ".join(prod)

# This uses product to generate all possible search strings.  Other
# search terms can be created by altering the lists (add/delete
# elements, or add/delete lists).  Each of these terms will have
# "whizz bang" appended below

terms = product( [ 'a', 'b', 'c' ],
                 [ 'foo', 'bar', 'baz' ],
                 [ 'one', 'two', 'three', 'four' ] )

csv('term', domains)

for t in terms:

    # All the queries have 'whizz bang' appended to the end of
    # them

    qu = "%s whizz bang" % t

    # This performs a Google query using the helper function and then
    # extracts all the <a> tags that have class "l".  If Google
    # changes the structure of their pages this is where this code
    # will break.  Currently class=l grabs all the appropriate links
    # (displayed in green in the search results).

    so = google(qu)

    # The urlparse(u['href'])[1] works by extracting the href from the
    # <a> tag, parsing it into component parts and extracting the 1th
    # element of the returned tuple which contains the netloc (the
    # domain name)

    a = [ urlparse(u['href'])[1] for u in so.findAll("a", {"class":"l"}) ]
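    # (Equivalently, urlparse(u['href']).netloc extracts the same
    # field by name; urlparse results have supported named attribute
    # access since Python 2.5.)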

    # The rank list stores the lowest position at which a specific
    # domain has been seen in the results from Google.  Each element
    # corresponds to an element of domains.  Initially, these are set
    # to 0 to indicate 'not found' in the results.

    rank = [ "0" for d in domains ]

    # Here we iterate through the returned domain names in a and match
    # them up with the domain names we are looking for.

    for i in range(len(a)):
        for j in range(len(domains)):

            # Note that the comparison here deals with two cases: the
            # returned domain matches one of ours exactly, or it is a
            # subdomain of one of ours (i.e. it ends with '.' followed
            # by one of our domains).

            if ((domains[j] == a[i]) or a[i].endswith(".%s" % domains[j])):
                if rank[j] == "0":             # Keep only the first (lowest) match
                    rank[j] = "%d" % (i+1)     # Count ranks from 1 not 0

    csv(qu, rank)


openid said...

That looks pretty good to me. There are a couple of different idioms I would have used. For constructing URL query strings, I prefer to use urllib.urlencode:

url = "" + urllib.urlencode({
'q': q,
'num': 100,
'hl': 'en',
'start': 0,

Also, where you do a range(len(l)) loop I'd suggest instead using the enumerate function:

for index, item in enumerate(list_of_items): ...

Greg said...

I would make a small style-change to reduce the subscripting:

for i, href in enumerate(a):
    for j, domain in enumerate(domains):
        if ((domain == href) or href.endswith(".%s" % domain)):
            rank[j] = "%d" % (i+1) # Count ranks from 1 not 0

said...

This is my attempt:

Basically I made docstrings from some of your comments, pulled the configurable bits to the top and made some more use of the standard library.
Also the way you wrote the loop to search for the rank was not very pythonic.
Finally I added a main() function and the typical if __name__ == "__main__" idiom.

John Graham-Cumming said...

Thanks to all of you. Fascinating to see how my code would look in the hands of an experienced Python programmer. Such a quick way to get shown some of the idioms and common patterns.


Andy said...

After a quick look through, most of the aspects that strike me as notably un-Pythonic are stylistic/structural in nature.

A few points:

a.) The two variables most likely to change are separated by three function definitions. The 'terms' list is hidden in the middle of everything.

To clean this up, I'd create a tuple of the lists right below 'domains', and then later call product(*terms).
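
Concretely, something like this sketch (reusing the lists from the original script):

domains = [ '', '', '' ]
terms = ( [ 'a', 'b', 'c' ],
          [ 'foo', 'bar', 'baz' ],
          [ 'one', 'two', 'three', 'four' ] )

# ...function definitions...

for t in product(*terms):
    ...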

b.) It is generally considered good practice to keep code you want executed when the script is invoked nested into a conditional that checks the value of '__name__', or in a function called by that conditional.


def main():
    csv('term', domains)
    for t in terms:
        ...

if __name__ == '__main__':
    main()
This helps you in a number of ways. Firstly, importing this file into another python script is now possible without having the code execute on import. Secondly, you could now easily extend this to pass command-line arguments to the main function, allowing you to pass in the domains and/or terms externally, without modifying the script.
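
For instance, a minimal sketch of that second point using optparse (the --domains flag is only an illustration):

import sys
from optparse import OptionParser

def main(domains):
    csv('term', domains)
    # ...rest of the main loop...

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('--domains', default='',
                      help='comma-separated list of domains to rank')
    options, args = parser.parse_args()
    main(options.domains.split(','))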

c.) "Readability counts" and "Explicit is better than implicit" come to mind. In a number of places, variable naming is confusing or ambiguous, with you seeming to err on the side of brevity over clarity.

Your csv function, for example, accepts two parameters: f and l. 'f' is the first element to print, 'l' is the list of elements. Is 'f' the first element from the 'l' list? Nope! But you can't tell that until you look at your print statement.

Try to let the code and variable names do the explaining for you, instead of adding comments to explain what they mean and do. If you run through your whole script keeping this in mind, I think you'll find a number of ways you can make it clearer. Here is a re-written version of your csv function:

def csv(prefix, elements):
    print ', '.join([prefix] + elements)

In your main loop, going over the various term permutations, you've used some short variable names in place of their more easily parsed full words. Calling it 'query' instead of 'qu' lets the code read more fluidly. You've assigned the result of your google function to a variable named 'so'. Calling it 'soup' provides the necessary context for me to understand what the function returns (it also wouldn't hurt to call the google function google_soup instead).

Below that, you have a list of anchor hypertext reference values you've pulled out of the soup. By keeping the names of variables holding lists, tuples, dicts, and so on plural, you help readers infer their type. My first assumption with 'a' would be that it is a single anchor value, but with 'anchors' I'd know it was some sort of list (I did the same with 'elements' in the csv example above).

d.) The code logic itself.

rank = [ "0" for d in domains ]

While harmless with such a small list, your list comprehension for forming a list of strings unnecessarily iterates over every item in 'domains'. Instead, you can multiply a list of the string to be duplicated by the length of the 'domains' list:
rank = ["0"] * len(domains)

Further, your nested for loops are a bit messy, and should ideally be using enumerate to obtain the indexes of the list elements. You can also clean up your string comparison by using the built-in find method.

Ah, a quick refresh of the page shows that you've accepted a number of comments, which cover a few things I've already mentioned.

Building on Greg's comment about the use of enumerate:

for i, href in enumerate(a):
    for j, domain in enumerate(domains):
        if href.find(domain) != -1:
            rank[j] = i+1

Hope this helps. There's a few other things worth mentioning, but I believe I'm nearing the word limit.

Dave Kirby said...

1) Use the csv module to format the output text - this will properly quote fields with commas in them.

2) Use the 'with' statement when opening a file-like object such as that returned by urlopen. This will automatically close the object when it goes out of the block, including when an exception is thrown.

3) as others have said, use enumerate instead of iterating over range(x).
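
Something like this sketch pulls those together (the fetch helper and its name are just an illustration; note that in Python 2 the object returned by urlopen is not itself a context manager, so contextlib.closing supplies the 'with' behaviour):

import csv
import sys
import urllib2
from contextlib import closing

def fetch(url, user_agent='Mozilla/5.0'):
    # closing() guarantees the response is closed when the block
    # exits, even if an exception is thrown.
    req = urllib2.Request(url, None, {'User-Agent': user_agent})
    with closing(urllib2.urlopen(req)) as res:
        return res.read()

writer = csv.writer(sys.stdout)  # quotes fields containing '"' or ',' for us
writer.writerow(['a foo one whizz bang', '3', '0', '12'])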

John Graham-Cumming said...

Wow. Even more of a Python master class. I've been tempted to really get into Python because the functional aspects appeal and your replies have only made me wish to spend more time on it.

Thank you!

RBerenguel said...

Looks clean enough for my taste. I have a similar piece of code, written in Emacs Lisp. I never finished it... So far it scrapes Google and Bing for my domain name in a particular keyword search, and writes the result in a buffer named keywrd.engine.dat. It is not as clean as your code, but the structure is similar enough, seen from a distance.

My plan was to track my most important keywords and pages together, to see improvements, but I arrived at the point where I had to thoroughly plan how the program would need to go, and the fun started to disappear.

I'll probably finish it some day, as I'm enjoying a recent surge in Emacs lispery: I have written a major mode to mark up my blog posts (saving me a lot of time!), and I've realised how much time coding this kind of stuff in a clean Emacs way saves me.



yorksranter said...

It would be pythonic to use the csv module, but it does have some issues with Unicode support.
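
The usual Python 2 workaround is to encode everything to UTF-8 before handing rows to the writer, along the lines of the recipes in the csv module documentation:

import csv, sys

def write_utf8_row(writer, row):
    # csv.writer is not unicode-aware in Python 2, so feed it
    # UTF-8-encoded byte strings instead.
    writer.writerow([unicode(cell).encode('utf-8') for cell in row])

writer = csv.writer(sys.stdout)
write_utf8_row(writer, [u'term', u'caf\xe9.example'])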

pixelbath said...

The code looks good to me, and works reasonably well, though Google seems to only return 10 results, ignoring the "num" parameter.

Testing further, "num" is ignored when Instant search is active, and there appears to be no public way to disable Instant aside from a preferences cookie.