Is there a Standard Java SE HTML Parser? If so, why use non-standard ones?


I need to parse a simple HTML page with a simple form in it. The answers to similar questions on StackOverflow suggest using one of a large variety of non-standard Java libraries such as TagSoup, JSoup, HTMLParser and many others.

However, a web search revealed that there exists some standard functionality in Java SE via this class: http://docs.oracle.com/javase/7/docs/api/javax/swing/text/html/parser/ParserDelegator.html

My sub-questions are:

  1. Is it really true that the standard ParserDelegator class can handle a use case like mine?
  2. What are the limitations of the standard library that create the need for so many non-standard libraries?
  3. Does the fact that ParserDelegator is within swing preclude using it in a regular EC2 cloud server for a web application? Would I have to jump through a lot of hoops to get around the headless aspect or would it be just a small tweak to the configuration?
  4. If the standard one is not recommended, which non-standard one should I use, given: (a) my desire to not stray far from the standard; (b) my simple use case; (c) my desire for a mature, reliable implementation; and (d) no size or weight limitations, since this is a server application as opposed to an embedded client. The API is a far lower priority, so while I do appreciate JSoup's CSS-selector-like API, concerns (a) through (d) override it.

Thank you.


There is 1 answer

AlexR On BEST ANSWER

The JDK has a built-in HTML parser (the javax.swing.text.html packages) that supports an old HTML dialect, roughly HTML 3.2. It should be able to parse basic text-formatting tags and forms.
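
As a rough illustration of the simple-form case, here is a minimal sketch that feeds a page to ParserDelegator and collects the name attribute of each input tag. The class name FormFieldLister, the helper inputNames and the sample markup are made up for this example; only ParserDelegator, HTMLEditorKit.ParserCallback and the HTML.Tag / HTML.Attribute constants come from the JDK. It assumes the form fields are plain input elements, which the parser reports through handleSimpleTag.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
class FormFieldLister {

    // Returns the "name" attribute of every <input> tag found in the markup.
    static List<String> inputNames(String html) throws IOException {
        final List<String> names = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <input> has no closing tag, so it arrives as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) {
                        names.add(name.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument (ignoreCharSet = true) tells the parser not to
            // abort when a charset declaration in the page differs from the reader's.
            new ParserDelegator().parse(reader, callback, true);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup invented for this sketch.
        String html = "<html><body><form action=\"/login\" method=\"post\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\">"
                + "</form></body></html>";
        System.out.println(inputNames(html));
    }
}
```

Nothing here creates a window or any other GUI component, so I would expect it to run on a server without a display, but please verify that on your own JVM before relying on it.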

The reason to use third-party parsers is the need to support "real" HTML pages: DHTML, JavaScript, etc.

JSoup is one of the popular parsers that can do the job. For more information about other implementations, please take a look at the following discussion:

Pure Java HTML viewer/renderer for use in a Scrollable pane