I'm using Jericho to sanitize HTML and it works great, except in one situation I can't figure out. I want to completely remove every script element, including its content. Right now the script tags are removed, but the actual script content is kept.
Currently I create a Source object and call fullSequentialParse. Then I create an OutputDocument and loop through each tag.
When I get to a "script" tag, I just want to replace the whole thing (tags and content) with "".
Any ideas?
TIA
I am not familiar with Jericho, but it can work on a tree very similar to a DOM tree, so you can remove the whole script element instead of just its tags. (If you have a huge HTML document, this may not be optimal, though.)
If that doesn't suit you, you can go the SAX way: remember the opening script tag, and when you reach the closing tag, remove everything in between.
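To illustrate the "remember the opening tag, drop everything up to the closing tag" idea, here is a minimal sketch in plain Java with no parser library. It is only an illustration under assumptions: ScriptStripper and stripScripts are made-up names, not Jericho API, and plain string scanning is not a substitute for a real parser when sanitizing untrusted input. With Jericho itself, the equivalent would be to remove the span of the whole script element from the OutputDocument rather than only its start and end tags.

```java
// Hypothetical helper (not part of Jericho): removes every <script>...</script>
// span, including the script content, by scanning for the opening and closing tags.
final class ScriptStripper {

    private static final String OPEN = "<script";
    private static final String CLOSE = "</script";

    static String stripScripts(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int open = findScriptOpen(html, pos);
            if (open < 0) {
                // No further script element: keep the rest unchanged.
                out.append(html, pos, html.length());
                break;
            }
            // Keep everything before the opening tag.
            out.append(html, pos, open);
            // Skip ahead to the end of the matching closing tag.
            int close = indexOfIgnoreCase(html, CLOSE, open);
            if (close < 0) {
                break; // unclosed script: conservatively drop the remainder
            }
            int end = html.indexOf('>', close);
            if (end < 0) {
                break; // malformed closing tag: drop the remainder
            }
            pos = end + 1;
        }
        return out.toString();
    }

    // Finds "<script" followed by '>', '/', whitespace, or end of input,
    // so that tags such as <scripted> are not mistaken for script tags.
    private static int findScriptOpen(String html, int from) {
        int i = indexOfIgnoreCase(html, OPEN, from);
        while (i >= 0) {
            int next = i + OPEN.length();
            if (next == html.length()) {
                return i;
            }
            char c = html.charAt(next);
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                return i;
            }
            i = indexOfIgnoreCase(html, OPEN, next);
        }
        return -1;
    }

    // Case-insensitive indexOf that keeps indices aligned with the original string.
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling ScriptStripper.stripScripts(html) on the sanitized input returns the markup with each script element and its content removed. An unclosed script tag causes the rest of the input to be dropped, which seemed the safer choice for a sanitizer.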