Parsing HTML in Python with BeautifulSoup

I got into a spat with Eric Raymond the other day about some code he's written called ForgePlucker. I took a look at the source code and posted saying it looks like a total hack job by a poor programmer.

Raymond replied by posting a blog entry in which he called me a poor fool and snotty kid.

So far so good. However, he hadn't actually fixed the problems I was talking about (and which I still think are the work of a poor programmer). This morning I checked and he's removed two offending lines that I was talking about and done some code rearrangement. The function that had caught my eye initially was one to parse data from an HTML table, which he does with this code:

def walk_table(text):
    "Parse out the rows of an HTML table."
    rows = []
    while True:
        oldtext = text
        # First, strip out all attributes for easier parsing
        text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
        text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
        # Case-smash all the relevant HTML tags, we won't be keeping them.
        text = text.replace("</table>", "</TABLE>")
        text = text.replace("<td>", "<TD>").replace("</td>", "</TD>")
        text = text.replace("<tr>", "<TR>").replace("</tr>", "</TR>")
        text = text.replace("<br>", "<BR>")
        # Yes, Berlios generated \r<BR> sequences with no \n
        text = text.replace("\r<BR>", "\r\n")
        # And Berlios generated doubled </TD>s
        # (This sort of thing is why a structural parse will fail)
        text = text.replace("</TD></TD>", "</TD>")
        # Now that the HTML table structure is canonicalized, parse it.
        if text == oldtext:
            break
    end = text.find("</TABLE>")
    if end > -1:
        text = text[:end]
    while True:
        m = re.search(r"<TR>\w*", text)
        if not m:
            break
        start_row = m.end(0)
        end_row = start_row + text[start_row:].find("</TR>")
        rowtxt = text[start_row:end_row]
        rowtxt = rowtxt.strip()
        if rowtxt:
            rowtxt = rowtxt[4:-5]  # Strip off <TD> and </TD>
            rows.append(re.split(r"</TD>\s*<TD>", rowtxt))
        text = text[end_row+5:]
    return rows

The problem with writing code like that is maintenance. It's got all sorts of little assumptions and special cases. Notice how it can't cope with a mixed case <TD> tag? Or how there's a special case for handling a doubled </TD>?

A much better approach is to use an HTML parser that knows all about the foibles of real HTML in the real world (Raymond's main argument in his blog posting is that you can't rely on the HTML structure to give you semantic information; I actually agree with that, but don't agree that throwing the baby out with the bathwater is the right approach). If you use such an HTML parser you eliminate all the hassle of maintaining regular expressions for all sorts of weird HTML situations, dealing with case, and dealing with HTML attributes.

Here's the equivalent function written using the BeautifulSoup parser:

def walk_table2(text):
    "Parse out the rows of an HTML table."
    soup = BeautifulSoup(text)
    return [ [ col.renderContents() for col in row.findAll('td') ]
             for row in soup.find('table').findAll('tr') ]
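
For example (this assumes the BeautifulSoup 3 import, from BeautifulSoup import BeautifulSoup, and the little table here is just made up for illustration):

from BeautifulSoup import BeautifulSoup

sample = """<table>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>"""

# Each row comes back as a list of the cells' rendered contents.
print(walk_table2(sample))
# expect something like [['Alice', '30'], ['Bob', '25']]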

In Raymond's code above he includes a little jab at this style, saying:

# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")

But that doesn't actually stand up to scrutiny. Try it and see. BeautifulSoup handles the extra </TD> without any special cases.
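
For instance, take a row with the doubled </TD> and compare it with a clean version; walk_table2 should return the same rows for both (the scraps of HTML here are made up for illustration):

# A doubled </TD> and upper-case tags, like the Berlios output Raymond describes.
broken = "<table><TR><TD>a</TD></TD><TD>b</TD></TR></TABLE>"
clean = "<table><tr><td>a</td><td>b</td></tr></table>"

print(walk_table2(broken))  # expect [['a', 'b']]
print(walk_table2(clean))   # expect [['a', 'b']]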

Bottom line: parsing HTML is hard, don't make it harder on yourself by deciding to do it yourself.

Disclaimer: I am not an experienced Python programmer; there could be a nicer way to write my walk_table2 function above, although I think it's pretty clear what it's doing.
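
For what it's worth, if you're on the newer bs4 package rather than the old BeautifulSoup module, a sketch of the same function (walk_table3 is just a name I've picked) would look something like this, using bs4's find_all and decode_contents:

from bs4 import BeautifulSoup

def walk_table3(text):
    "Parse out the rows of an HTML table (bs4 spelling of walk_table2)."
    soup = BeautifulSoup(text, "html.parser")
    return [[col.decode_contents() for col in row.find_all('td')]
            for row in soup.find('table').find_all('tr')]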

Comments

Anonymous said…
Excellent job! I've done things both ways in the past (mostly due to lack of good parsers at the time), and prefer using BeautifulSoup -- I know I can't come up with all the possible exceptions myself.
Jim Robert said…
I agree with your assessment of beautiful soup - less code is usually better code
peterbe said…
I agree with you. Why even be interested in structure when you can get the meaning straight away.

PS. If you use lxml and BeautifulSoup you can use CSS to extract meaning from a broken HTML document. Come on, Eric! Catch up!
Unknown said…
Beautifulsoup is a generalized library. I have used regex for specific matches within strings within tags of HTML code. In large files, repeated parsing slows down considerably. A proper regex string can fit certain specifics better for your unique code.
Anonymous said…
All my projects contain BeautifulSoup at some point; it's fantastically great.
rwenderlich said…
I used Beautiful Soup for the first time last week - loved it, it made what I was trying to do super easy.
Unknown said…
Afraid that for uses such as Google App Engine the overhead of Beautiful Soup is too much.

Find that using RE.VERBOSE and grouping (?P<>) what is required helps with maintainability.
Unknown said…
I've done quite a bit of scraping in Python, the bulk of it using PyParsing, sometimes in combination with BeautifulSoup. If you haven't heard of PyParsing, I suggest you have a try: http://pyparsing.wikispaces.com/
Unknown said…
I agree with the idea that parsing html using regex is bad. However I disagree that it's less code. In this article you're comparing the beautiful soup function call to the full function as written on the blog. I trust that Beautiful Soup is more code.
