How could I have coded this better?

For work I had to write a small script to scrape Google results to measure the relative rank of my company and others that we track. We were looking at various search terms and where each company appears in the results.

So, I decided to use Python with BeautifulSoup. I love BeautifulSoup, but I don't use Python very often, so... Python experts: criticize my code! How would you have done this better?
# Script to perform Google searches and extract the domain names of
# the returned results.  The order in which the domain names are
# returned is used to determine a ranking between different companies.

# This is the list of domains to look for in Google searches.  To
# search for more or different domains simply alter this list.  The
# order of this list determines the order in which the results are
# saved.

domains = [ 'foo.com', 'bar.com', 'baz.com' ]

from BeautifulSoup import BeautifulSoup
import urllib, urllib2
from urlparse import urlparse

# Performs a Google search and returns a BeautifulSoup object
# containing the parsed, returned page.
def google(q): # Query string (will be URL quoted by this function)

    # This attempts to ask for 100 results from Google.  This does not
    # always seem to work correctly.  Note that the fake User-Agent is
    # required otherwise Google will reject the search

    url = "http://google.com/search?q=%s&num=100&hl=en&start=0" % 
          urllib.quote_plus(q)
    req = urllib2.Request(url, None, {'User-Agent':'Mozilla/5.0 (Macintosh; U; 
             Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.13 (KHTML, like Gecko) 
             Chrome/9.0.597.94 Safari/534.13'} )
    res = urllib2.urlopen(req)
    dat = res.read()
    res.close()

    return BeautifulSoup(dat)

# Small helper function to output a single line of a CSV file.  Note
# that this does not do quoting of arguments so assumes there are
# no " or , present.
def csv(f, l): # First element to print followed by list of elements
               # to print
    print "%s," % f, ", ".join(l)

# Cartesian product function (similar to itertools.product but
# joins the elements together as a space separated string rather
# than returning a tuple)
def product(*args): # Lists to cross together
    pools = map(tuple, args)
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield " ".join(prod)

# This generates all possible search strings using product.  This
# can be modified to create other search terms by altering the
# lists (add/delete elements, or add/delete lists).  Each of these
# terms will have "whizz bang" appended below

terms = product( [ 'a', 'b', 'c' ],
                 [ 'foo', 'bar', 'baz' ],
                 [ 'one', 'two', 'three', 'four' ] )

csv('term', domains)

for t in terms:

    # All the queries have 'whizz bang' appended to the end of
    # them

    qu = "%s whizz bang" % t

    # This performs a Google query using the helper function and then
    # extracts all the <a> tags that have class "l".  If Google
    # changes the structure of their pages this is where this code
    # will break.  Currently class=l grabs all the appropriate links
    # (displayed in green in the search results).

    so = google(qu)

    # The urlparse(u['href'])[1] works by extracting the href from the
    # <a> tag, parsing it into component parts and taking the element
    # at index 1 of the returned tuple, which contains the netloc (the
    # domain name)

    a = [ urlparse(u['href'])[1] for u in so.findAll("a", {"class":"l"}) ]

    # The rank list stores the position at which a specific domain
    # has been seen in the results from Google (if a domain appears
    # more than once, the last occurrence wins).  Each element
    # corresponds to an element of domains.  Initially, these are set
    # to "0" to indicate 'not found' in the results.

    rank = [ "0" for d in domains ]

    # Here we iterate through the returned domain names in a and match
    # them up with the domain names we are looking for.

    for i in range(len(a)):
        for j in range(len(domains)):

            # Note that the comparison here deals with two cases.  The
            # domain is entirely 'foo.com' (for example), or the
            # domain ends with '.foo.com' (for example).

            if ((domains[j] == a[i]) or a[i].endswith(".%s" % domains[j])):
                rank[j] = "%d" % (i+1) # Count ranks from 1 not 0

    csv(qu, rank)
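
When run, the script prints a header row followed by one CSV line per query giving the 1-based position of each domain in the results (0 meaning 'not found'). With made-up numbers purely for illustration, the output looks something like this:

term, foo.com, bar.com, baz.com
a foo one whizz bang, 3, 0, 17
a foo two whizz bang, 1, 42, 0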

Comments

Anonymous said…
That looks pretty good to me. There are a couple of different idioms I would have used. For constructing URL query strings, I prefer to use urllib.urlencode:

url = "http://google.com/search?" + urllib.urlencode({
'q': q,
'num': 100,
'hl': 'en',
'start': 0,
})

Also, where you do a range(len(l)) loop I'd suggest instead using the enumerate function:

for index, item in enumerate(list_of_items): ...
Greg said…
I would make a small style-change to reduce the subscripting:

for i, href in enumerate(a):
    for j, domain in enumerate(domains):

        if ((domain == href) or href.endswith(".%s" % domain)):
            rank[j] = "%d" % (i+1) # Count ranks from 1 not 0
Anonymous said…
This is my attempt: https://gist.github.com/834249

Basically I turned some of your comments into docstrings, pulled the configurable bits to the top and made more use of the standard library.
Also the way you wrote the loop to search for the rank was not very pythonic.
Finally I added a main() function and the typical if __name__ == "__main__" idiom.
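
For instance, the rank search could become something like this sketch (rank_of is a name made up for illustration, not necessarily what the gist uses):

def rank_of(domain, netlocs):
    # Returns the 1-based position of the first result whose netloc
    # equals the domain or is a subdomain of it; 0 means 'not found'
    for position, netloc in enumerate(netlocs, 1):
        if netloc == domain or netloc.endswith("." + domain):
            return position
    return 0

rank = [rank_of(d, a) for d in domains]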
Thanks to all of you. Fascinating to see how my code would look in the hands of an experienced Python programmer. Such a quick way to get shown some of the idioms and common patterns.

Cheers!
Andrew said…
After a quick look through, most of the aspects that strike me as notably un-Pythonic are stylistic/structural in nature.

A few points:

a.) The two variables most likely to change are separated by three function definitions. The 'terms' list is hidden in the middle of everything.

To clean this up, I'd create a tuple of the lists right below 'domains', and then later call product(*terms).
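
E.g. (a sketch using the original script's values):

domains = [ 'foo.com', 'bar.com', 'baz.com' ]

terms = ( [ 'a', 'b', 'c' ],
          [ 'foo', 'bar', 'baz' ],
          [ 'one', 'two', 'three', 'four' ] )

# ... function definitions ...

queries = product(*terms)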

b.) It is generally considered good practice to keep code you want executed when the script is invoked nested into a conditional that checks the value of '__name__', or in a function called by that conditional.

E.g.:

def main():
    csv('term', domains)

    for t in terms:
        ...

if __name__ == '__main__':
    main()

This helps you in a number of ways. Firstly, importing this file into another python script is now possible without having the code execute on import. Secondly, you could now easily extend this to pass command-line arguments to the main function, allowing you to pass in the domains and/or terms externally, without modifying the script.

c.) "Readability counts" and "Explicit is better than implicit" come to mind. In a number of places, variable naming is confusing or ambiguous, with you seeming to err on the side of brevity over clarity.

Your csv function, for example, accepts two parameters: f and l. 'f' is the first element to print, 'l' is the list of elements. Is 'f' the first element from the 'l' list? Nope! But you can't tell that until you look at your print statement.

Try to let the code and variable names do the explaining for you, instead of adding comments to explain what they mean and do. If you run through your whole script keeping this in mind, I think you'll find a number of ways you can make it clearer. Here is a re-written version of your csv function:

def csv(prefix, elements):
    print ', '.join([prefix] + elements)

In your main loop, going over the various term permutations, you've used some short variable names in place of their more easily parsed full words. By calling it 'query' instead of 'qu', the code reads more fluidly. You've assigned the result of your google function to a variable named 'so'. Calling it 'soup' provides the necessary context for me to understand what the function returns (it also wouldn't hurt to call the google function google_soup instead).

Below that, you have a list of anchor hypertext reference values you've pulled out of the soup. By keeping the variables you use for lists, tuples, dicts, and so on plural, you help the reader infer their type. My first assumption with 'a' would be that it is a single anchor value, but with 'anchors' I'd know it was some sort of list (I did the same with 'elements' in the csv example above).
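
Put together, the top of the loop might read like this sketch:

for term in terms:
    query = "%s whizz bang" % term
    soup = google(query)
    anchors = [ urlparse(link['href'])[1]
                for link in soup.findAll("a", {"class": "l"}) ]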

d.) The code logic itself.

Regarding:
rank = [ "0" for d in domains ]

While harmless with such a small list, your list comprehension for forming a list of strings unnecessarily iterates over every item in 'domains'. Instead, you can multiply a list of the string to be duplicated by the length of the 'domains' list:
rank = ["0"] * len(domains)

Further, your nested for loops are a bit messy, and should ideally be using enumerate to obtain the indexes of the list elements. You can also clean up your string comparison by using the built-in find method.

Ah, a quick refresh of the page shows that you've accepted a number of comments, which cover a few things I've already mentioned.

Building on Greg's comment about the use of enumerate:

for i, href in enumerate(a):
    for j, domain in enumerate(domains):
        if href.find(domain) != -1:
            rank[j] = i+1

Hope this helps. There are a few other things worth mentioning, but I believe I'm nearing the word limit.
Dave Kirby said…
1) Use the csv module to format the output text - this will properly quote fields with commas in them (see the sketch after this list).

2) Use the 'with' statement when opening a file-like object such as that returned by urlopen. This will automatically close the object when it goes out of the block, including when an exception is thrown.

3) as others have said, use enumerate instead of iterating over range(x).
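
A minimal sketch of points 1 and 2 (note that in Python 2 the object returned by urlopen is not itself a context manager, hence contextlib.closing):

import csv
import sys
import urllib2
from contextlib import closing

# Point 1: csv.writer takes care of quoting fields containing , or "
writer = csv.writer(sys.stdout)
writer.writerow(['term'] + domains)

# Point 2: closing() guarantees the response is closed when the
# block exits, even if an exception is thrown
with closing(urllib2.urlopen(req)) as res:
    dat = res.read()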
Wow. Even more of a Python master class. I've been tempted to really get into Python because the functional aspects appeal, and your replies have only made me want to spend more time on it.

Thank you!
Ruben Berenguel said…
Looks clean enough for my taste. I have a similar piece of code, done in Emacs. I never finished it... So far it scrapes google and bing for my domain name in a particular keyword search, and writes the result in a buffer named keywrd.engine.dat. It is not as clean as your code, but the structure is similar enough, seen from a distance.

My plan was to track my most important keywords and pages together, to see improvements, but I arrived at the point where I had to thoroughly plan how the program would need to work, and the fun started to disappear.

I'll probably finish it some day, as I'm enjoying a recent surge in Emacs lispery: I have written a major mode to mark up my blog posts (saving me a lot of time!), and I've realised how much time coding this kind of stuff the clean Emacs way saves me.

Cheers,

Ruben
Anonymous said…
It would be pythonic to use the csv module, but it does have some issues with Unicode support.
Michael said…
The code looks good to me, and works reasonably well, though Google seems to return only 10 results, ignoring the "num" parameter.

Testing further, "num" is ignored with Instant search, which there appears to be no public way to disable other than via a preferences cookie.
