Parse HTML the Groovy way

Over the last couple of weeks I often had to download a lot of files that had been submitted to a web-based teaching platform. Downloading all these files by hand is very annoying, so I wrote a short Groovy script. Groovy has great support for parsing well-formed XML, but it fails when you try to parse unstructured, messy HTML.

So I searched for a Java library containing an HTML parser and found TagSoup. It is a SAX-compliant HTML parser that specializes in cleaning up and reformatting faulty HTML. For example,

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

will be rewritten to

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
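To see the clean-up from within Groovy, the following minimal sketch feeds the fragment above through TagSoup and inspects the repaired nesting. It assumes Groovy 3 or later (for `groovy.xml.XmlSlurper`) and TagSoup 1.2.1 fetched via `@Grab`; neither version is stated in this post.

```groovy
// Assumption: TagSoup 1.2.1 from Maven Central, Groovy 3 or later.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper
import org.ccil.cowan.tagsoup.Parser

// The malformed fragment from the example above.
def broken = 'This is <B>bold, <I>bold italic, </b>italic, </i>normal text'

// TagSoup acts as the SAX parser behind XmlSlurper and repairs the markup while parsing.
def html = new XmlSlurper(new Parser()).parseText(broken)

// The fragment ends up wrapped in a complete document with lower-cased tag names.
println html.name()

// The <i> inside <b> and the re-opened <i> after </b> are now separate, properly nested elements.
println html.body.b.i.text()
println html.body.i.text()
```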

One advantage of pairing TagSoup with Groovy's XmlSlurper is the XPath-like query mechanism (GPath). XmlSlurper parses the HTML through TagSoup and builds an object structure representing the content, whose individual elements can then be accessed directly. For example:

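// TagSoup's parser is a SAX XMLReader, so XmlSlurper can use it directly
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
// parse(String) treats its argument as a URI
def html = slurper.parse("an_example_file.html")
// first div with id "content", then the table with id "attempts" inside its form
def table = html.body.div.find { it.@id == "content" }.form.table.
        find { it.@id == "attempts" }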

This retrieves the table “attempts” placed inside a form in the div “content”. The method find() returns the first element for which the closure is true, while findAll() returns every matching element. The closure can test attributes as well as child elements.
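The difference between find() and findAll() can be sketched on a small inline page. The markup, ids, class names, and file names below are invented for illustration and are not taken from the teaching platform; the sketch again assumes Groovy 3 or later and TagSoup 1.2.1 via `@Grab`.

```groovy
// Assumption: TagSoup 1.2.1 from Maven Central, Groovy 3 or later.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper
import org.ccil.cowan.tagsoup.Parser

// Hypothetical page: upper-case tags and missing closing tags for BODY and HTML.
def page = '''
<HTML><BODY>
  <DIV id="menu"><A href="home.html">Home</A></DIV>
  <DIV id="content">
    <A class="file" href="first.zip">first</A>
    <A class="file" href="second.zip">second</A>
    <A class="help" href="help.html">help</A>
  </DIV>
'''

def html = new XmlSlurper(new Parser()).parseText(page)

// find() stops at the first div whose id attribute matches.
def content = html.body.div.find { it.@id == 'content' }

// findAll() keeps every link inside that div whose class attribute matches.
def files = content.a.findAll { it.@class == 'file' }

// Collect the href attribute of each matching link as a plain string.
def hrefs = files.collect { it.@href.text() }
println hrefs
```

The collected hrefs are the kind of list a download script could then iterate over.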

In the end I fell in love with TagSoup. It saves a lot of work when you have to access the HTML content of websites, portals, or similar services that are not able to send XHTML 1.x compliant responses. But that is another topic ;)

4 Comments »

  1. Nice!

    Comment by Nils — June 10, 2008 #

  2. Cool!

    saved me from some bad time trying to parse a lot of ugly formatted html pages :)

    Comment by Federico — July 22, 2009 #

  3. I cannot for the life of me figure out how to get text nodes out of the GPathResult. I am trying to flatten the cleaned HTML back into an XHTML document string, but none of the iterators seems to include text nodes.

    Any help?

    Comment by Ben Nadel — September 30, 2009 #

  4. It appears there are even better alternatives than tagsoup, see this blog post

    Comment by duckman — October 16, 2009 #
