Remove redundant space in HTML in Java


I need to perform some HTML cleansing.

I have HTML with lots of redundant <br> tags. So far I have tried HtmlCleaner and JTidy without any results.

Example:

<br>
<br>
<br>
<br>
...

What I would like is to get just a single <br> back.

Any other ways to accomplish this without manually parsing line by line?


There is 1 answer

Adam Gent On

If you're only trying to remove superfluous <br/> tags, then I recommend a simple parsing state machine that uses Jericho to do the parsing, since Jericho is very good at preserving data.

The state machine would simply keep track of the last tag seen: if the last tag seen is a <br/> tag and the next tag is also a <br/> tag, you simply omit it. It's a pretty simple exercise that I recommend you try. I don't recommend manual text parsing, though (i.e. not using an HTML parser), as it's very error prone.
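
To make the "remember the last tag" idea concrete, here is a minimal sketch of the state machine. It is not Jericho code: so that it runs with only the JDK, a crude regex stands in for the parser's tag stream, which is exactly the kind of manual parsing to avoid in production. The class name BrCollapser and the method collapse are made up for illustration, and only plain <br>, <br/> and <br /> (no attributes) are treated as line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrCollapser {
    // Crude stand-in for a real parser's tag stream (assumption: no '>' inside tags).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Attribute-free <br>, <br/> or <br />, in any letter case.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");

    static String collapse(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        boolean lastWasBr = false; // the only state: was the last emitted tag a <br>?
        int pos = 0;
        while (m.find()) {
            String text = html.substring(pos, m.start()); // text since the previous tag
            String tag = m.group();
            boolean isBr = BR.matcher(tag).matches();
            boolean blank = text.trim().isEmpty();
            if (isBr && lastWasBr && blank) {
                // Repeated <br> separated only by whitespace: omit it,
                // along with the whitespace in front of it.
            } else {
                out.append(text).append(tag);
                lastWasBr = isBr;
            }
            pos = m.end();
        }
        out.append(html.substring(pos)); // trailing text after the last tag
        return out.toString();
    }
}
```

For example, BrCollapser.collapse("a<br>\n<br>\n<br/>b") returns "a<br>b", while "<br>text<br>" comes back unchanged because real text sits between the two tags. With an actual HTML parser the loop would have the same shape, with the tags supplied by the parser instead of the regex.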

I would also like to remind you that, despite how people may use <br/> tags, it is an explicit content tag, so removing the tag changes the content. Perhaps instead of scraping HTML you could get the content from a more structured source such as an XML feed, a REST API, or a database.