How to add new tags to JTidy?


I am trying to use JTidy to extract data from real-world HTML, but JTidy doesn't parse custom tags.

<html>
  <body>
    <myCustomTag>some text</myCustomTag>
    <anotherCustom>more text</anotherCustom>
  </body>
</html>

I can't get the text between the custom tags. I have to use JTidy because I will be using XPath.

I tried HTMLCleaner, but it doesn't support the full set of XPath functions.

There are 2 answers

Mark Howard (BEST ANSWER)

You can also set the properties using a Java Properties object, for example:

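import java.util.Properties;
import org.w3c.tidy.Tidy;

// Declare the custom tags as block-level elements (space-separated list).
Properties oProps = new Properties();
oProps.setProperty("new-blocklevel-tags", "header hgroup article footer nav");

// Apply the options to the Tidy instance before parsing.
Tidy tidy = new Tidy();
tidy.setConfigurationFromProps(oProps);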

This should save you having to create and load a configuration file.
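For the tags in the question, the Properties object could be built from a list of tag names. This is only a minimal sketch: the class `TidyTagProps` and its method `forBlockTags` are hypothetical names made up for illustration, and the code uses nothing but `java.util`. It does not call JTidy itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TidyTagProps {
    // Hypothetical helper (not part of JTidy): joins the tag names with
    // spaces and stores them under the "new-blocklevel-tags" option.
    static Properties forBlockTags(List<String> tagNames) {
        Properties props = new Properties();
        props.setProperty("new-blocklevel-tags", String.join(" ", tagNames));
        return props;
    }

    public static void main(String[] args) {
        Properties props = forBlockTags(Arrays.asList("myCustomTag", "anotherCustom"));
        // Print the option value that would be handed to JTidy.
        System.out.println(props.getProperty("new-blocklevel-tags"));
    }
}
```

The returned object would then be passed to tidy.setConfigurationFromProps(...) before parsing.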

Ethan

Check out http://tidy.sourceforge.net/docs/quickref.html#new-blocklevel-tags

The quick and dirty way is to create a configuration file (I named mine jTidyTags) and call:

import org.w3c.tidy.Tidy;
// Load the Tidy options from the file named jTidyTags.
Tidy tidy = new Tidy();
tidy.setConfigurationFromFile("jTidyTags");

After that, it will print a warning saying the document isn't W3C compliant, but who cares. It will still let you parse the file.

An example of jTidyTags for you would be:

new-blocklevel-tags: myCustomTag anotherCustom
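The `key: value` line above also happens to be valid java.util.Properties syntax, so one way to create the file from code and sanity-check it before giving it to JTidy might look like the sketch below. The class `JTidyTagsFile` and its `write` and `read` methods are hypothetical names for illustration, and only the standard library is used.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.Properties;

public class JTidyTagsFile {
    // Hypothetical helper: writes a one-line config file declaring the custom tags.
    static Path write(Path file, String tags) throws IOException {
        String line = "new-blocklevel-tags: " + tags;
        Files.write(file, Collections.singletonList(line));
        return file;
    }

    // Reads the file back with java.util.Properties to check that it parses.
    static Properties read(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path file = write(Paths.get("jTidyTags"), "myCustomTag anotherCustom");
        System.out.println(read(file).getProperty("new-blocklevel-tags"));
    }
}
```

The file written this way is the one that tidy.setConfigurationFromFile("jTidyTags") would then load.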

Hope this helps!