Parse HTML the Groovy way
Posted by thomas - 09/06/08 at 05:06:53 pm
In the last couple of weeks I often had to download a lot of files submitted to a web-based teaching platform. Downloading all these files by hand is very annoying, so I wrote a short Groovy script. Although Groovy has great support for parsing well-formed XML-like data, it fails when you want to parse unstructured, nasty HTML code.
So I searched for a Java library containing an HTML parser and found TagSoup. It is a SAX-compliant HTML parser specialized in reformatting and cleaning up faulty HTML code. For example,
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
will be rewritten to
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
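To try this rewrite yourself, here is a minimal sketch. It assumes Groovy 3 or later (for groovy.xml.XmlSlurper; on older versions the import can be dropped because groovy.util.XmlSlurper is imported by default) and TagSoup 1.2.1 fetched from Maven Central via @Grab. The variable names are only illustrative, and the serialized output will additionally contain the html and body wrappers and namespace declarations that TagSoup adds.

```groovy
// Assumption: TagSoup 1.2.1 is resolved from Maven Central by Grape
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper
import groovy.xml.XmlUtil
import org.ccil.cowan.tagsoup.Parser

// The faulty snippet from above: tags in mixed case and closed in the wrong order
def messy = 'This is <B>bold, <I>bold italic, </b>italic, </i>normal text'

// Let TagSoup repair the markup while XmlSlurper builds the object structure
def cleaned = new XmlSlurper(new Parser()).parseText(messy)

// Turn the repaired structure back into a markup string and print it
println XmlUtil.serialize(cleaned)
```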
One advantage of TagSoup is that it plugs into Groovy's XmlSlurper, which gives you an XPath-like query mechanism (GPath). It parses the HTML code and builds an object structure representing the content, so you can access individual elements. For example:
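// TagSoup must be on the classpath, e.g. via @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def html = slurper.parse("an_example_file.html")
// Pick the div with id "content", then the table with id "attempts" inside its form
def table = html.body.div.find { it.@id == "content" }.form.table.
    find { it.@id == "attempts" }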
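This retrieves the table “attempts” placed inside a form in the div “content”. While find() returns only the first match, the method findAll() returns all elements that satisfy a given condition, for example elements with a certain attribute value or with certain child elements.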
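As an illustration of findAll(), here is a minimal sketch. It assumes Groovy 3 or later and TagSoup 1.2.1 fetched via @Grab; the page fragment, the file names and the variable names are made up for this example and are not taken from the teaching platform mentioned above.

```groovy
// Assumption: TagSoup 1.2.1 is resolved from Maven Central by Grape
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper
import org.ccil.cowan.tagsoup.Parser

// Hypothetical, deliberately sloppy page: unquoted attribute, unclosed <li> tags
def page = '''<html><body>
<div id=content>
<ul>
<li><a href="/files/alice.zip">alice</a>
<li><a href="/files/bob.pdf">bob</a>
<li><a href="/files/carol.zip">carol</a>
</ul>
</div>
</body></html>'''

def html = new XmlSlurper(new Parser()).parseText(page)

// findAll on a GPath expression: every div below body with the given id
def contentDivs = html.body.div.findAll { it.@id == 'content' }

// findAll over a depth-first traversal: every link that points to a .zip file,
// no matter how deeply it is nested
def zipLinks = html.depthFirst().findAll {
    it.name() == 'a' && it.@href.text().endsWith('.zip')
}

// Collect the link targets as plain strings
def hrefs = zipLinks.collect { it.@href.text() }
println hrefs
```

The depth-first variant is handy when you do not know exactly which wrapper elements TagSoup inserts while repairing the page.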
In the end I fell in love with TagSoup. It saves a lot of work when you have to access the HTML content of websites, portals or similar that are not able to send XHTML 1.x compliant responses. But that is another topic.
5 Comments »
Nice!
Comment by Nils — June 10, 2008 #
Cool!
saved me from some bad time trying to parse a lot of ugly formatted html pages
Comment by Federico — July 22, 2009 #
I cannot for the life of me figure out how to get text nodes out of the GPathResult. I am trying to flatten the cleaned HTML back into an XHTML document string, but none of the iterators seems to include text nodes.
Any help?
Comment by Ben Nadel — September 30, 2009 #
It appears there are even better alternatives than tagsoup, see this blog post
Comment by duckman — October 16, 2009 #
[...] Soup has a much nicer syntax when used with Groovy, but I decided to try HTMLCleaner because it is reported to have better [...]
Pingback by Parsing HTML with Groovy and HTMLCleaner — May 12, 2010 #