Monday, November 09, 2009

Parsing HTML in Python with BeautifulSoup

I got into a spat with Eric Raymond the other day about some code he's written called ForgePlucker. I took a look at the source code and posted saying it looks like a total hack job by a poor programmer.

Raymond replied by posting a blog entry in which he called me a poor fool and snotty kid.

So far so good. However, he hadn't actually fixed the problems I was talking about (and which I still think are the work of a poor programmer). This morning I checked and he's removed two offending lines that I was talking about and done some code rearrangement. The function that had caught my eye initially was one to parse data from an HTML table which he does with this code:

def walk_table(text):
"Parse out the rows of an HTML table."
rows = []
while True:
oldtext = text
# First, strip out all attributes for easier parsing
text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
# Case-smash all the relevant HTML tags, we won't be keeping them.
text = text.replace("</table>", "</TABLE>")
text = text.replace("<td>", "<TD>").replace("</td>", "</TD>")
text = text.replace("<tr>", "<TR>").replace("</tr>", "</TR>")
text = text.replace("<br>", "<BR>")
# Yes, Berlios generated \r<BR> sequences with no \n
text = text.replace("\r<BR>", "\r\n")
# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")
# Now that the HTML table structure is canonicalized, parse it.
if text == oldtext:
break
end = text.find("</TABLE>")
if end > -1:
text = text[:end]
while True:
m = re.search(r"<TR>\w*", text)
if not m:
break
start_row = m.end(0)
end_row = start_row + text[start_row:].find("</TR>")
rowtxt = text[start_row:end_row]
rowtxt = rowtxt.strip()
if rowtxt:
rowtxt = rowtxt[4:-5]# Strip off <TD> and </TD>
rows.append(re.split(r"</TD>\s*<TD>", rowtxt))
text = text[end_row+5:]
return rows

The problem with writing code like that is maintenance. It's got all sorts of little assumptions and special cases. Notice how it can't cope with a mixed case <TD> tag? Or how there's a special case for handling a doubled </TD>?

A much better approach is to use an HTML parser than knows all about the foibles of real HTML in the real world (Raymond's main argument in his blog posting is that you can't rely on the HTML structure to give you semantic information---I actually agree with that, but don't agree that throwing the baby out with the bath water is the right approach). If you use such an HTML parser you eliminate all the hassles you had maintaining regular expressions for all sorts of weird HTML situations, dealing with case, dealing with HTML attributes.

Here's the equivalent function written using the BeautifulSoup parser:

def walk_table2(text):
"Parse out the rows of an HTML table."
soup = BeautifulSoup(text)
return [ [ col.renderContents() for col in row.findAll('td') ]
for row in soup.find('table').findAll('tr') ]

In Raymond's code above he includes a little jab at this style saying:

# And Berlios generated doubled </TD>s
# (This sort of thing is why a structural parse will fail)
text = text.replace("</TD></TD>", "</TD>")

But that doesn't actually stand up to scrutiny. Try it and see. BeautifulSoup handles the extra </TD> without any special cases.

Bottom line: parsing HTML is hard, don't make it harder on yourself by deciding to do it yourself.

Disclaimer: I am not an experienced Python programmer, there could be a nicer way to write my walk_table2 function above, although I think it's pretty clear what it's doing.

Labels:

If you enjoyed this blog post, you might enjoy my travel book for people interested in science and technology: The Geek Atlas. Signed copies of The Geek Atlas are available.

<$BlogCommentBody$>

<$BlogCommentDateTime$> <$BlogCommentDeleteIcon$>

Post a Comment

Links to this post:

<$BlogBacklinkControl$> <$BlogBacklinkTitle$> <$BlogBacklinkDeleteIcon$>
<$BlogBacklinkSnippet$>
Create a Link

<< Home