Parse HTML the Groovy way
Posted by thomas - 09/06/08 at 05:06:53 pmIn the last couple of weeks I often had to download a lot of files, submitted to a web-based teaching platform. Downloading all these files by hand is very annoying so I implemented a short Groovy script. Since Groovy has a great support for parsing well-formed XML-like information it fails if you want to parse unstructured and nasty HTML code.
At last I searched for a Java library containing an HTML-parsers and I found TagSoup. This is a SAX-compliant HTML-parser specialized in re-formating and cleaning up faulty HTML code.
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
will be rewritten to
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
One advantage of TagSoup is the Xpath-like query mechanism. It parses the HTML code and generates an object structure representing this content. Now the user can access the single elements. One possible example could be:
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
html = slurper.parse("an_example_file.html")
table = html.body.div.find{ it.@id == "content" }.form.table.
find{ it.@id == "attempts" }
This retrieves the table “attempts” placed inside a form in the div “content”. The method findAll() will retrieve all elements for a given attribute or with given child elements.
After all I fell in love with TagSoup. It saves a lot of work when you have to access HTML content of websites, portals or similar, which are not able to send a XHTML 1.x compliant responses. But this is an other topic
.
1 Comment »
RSS feed for comments on this post. TrackBack URI
Leave a comment
Powered by WordPress with GimpStyle Theme design by Horacio Bella. Get Entries and comments.
Nice!
Comment by Nils — June 10, 2008 #