Clean and convert HTML to XML for BaseX


I would like to run some XQuery commands using BaseX over an HTML source that may be full of <script> and <style> elements that must be removed, and of unclosed tags (<br>, <img>) that must be properly closed (for example, the dirty source of this page).

"Converting HTML to XML" suggests using Tidy, but it doesn't have a GUI and doesn't seem work correctly on my source (it outputs nothing), and I doubt if it removes scripts and other unnecessary tags. It is very old, by the way.

As I couldn't find any existing question that addresses my needs, I am asking it again, and because it is closely related to coding and querying tools, I am asking it here.


1 Answer

Jens Erat (best answer):

BaseX has integration for TagSoup, which will convert HTML to well-formed XHTML.

Most distributions of BaseX already bundle TagSoup; if you installed BaseX from a Linux repository, you might need to add it manually (for example, on Debian and Ubuntu the package is called libtagsoup-java). Further details for the different installation options are given in the documentation linked above.
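For instance, on Debian or Ubuntu the install could look like this (a sketch; the package name comes from the distribution and may change between releases):

sudo apt-get install libtagsoup-java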

Afterwards, either set the TagSoup parser as the default using the command

SET PARSER html

or in the XQuery prologue using

declare option db:parser "html";
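For a local file, the command-based variant could be combined with database creation like this (a sketch; the database name scratch and the file page.html are hypothetical):

SET PARSER html
CREATE DB scratch page.html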

Then simply fetch the document you want. An example for the Amazon page you linked:

declare option db:parser "html";
doc('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&amp;field-keywords=camera')

This should work, but doesn't. I'm asking the main developers why it fails (it seems to be caused by an HTTP redirect) and will update the answer once the issue is resolved (or once I understand why it does not work). Until then, the workaround is to fetch the document as text and parse it as HTML:

html:parse(fetch:text('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&amp;field-keywords=camera'))
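Once the HTML is parsed by either approach, the <script> and <style> elements mentioned in the question can be stripped with a small recursive copy function. A minimal sketch (the function name local:strip and the input file page.html are just placeholders):

declare option db:parser "html";
(: recursively copy the tree, dropping script and style elements :)
declare function local:strip($n as node()) as node()? {
  typeswitch($n)
    case element() return
      if (local-name($n) = ('script', 'style')) then ()
      else element { node-name($n) } {
        $n/@*,
        for $c in $n/node() return local:strip($c)
      }
    default return $n
};
local:strip(doc('page.html')/*)

The same function can be applied to the result of html:parse(...) in the workaround above.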