Clean/sanitize HTML, but preserve loose HTML chars, with Ruby/Rails + Nokogiri + Sanitize (?)


We were using a combination of the Sanitize gem and HTMLEntities to clean up user-input HTML. The Sanitize gem used to depend on Hpricot, but now uses Nokogiri. I need to get Hpricot out of the app.

Here are two test strings, each followed by the output I'm expecting:

Test string 1:

"SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>"

expected_text = "SOME TEXT < 'MORE' & TEXT!!!"

Second test string (a slightly different path):

'Support <i>odd</i> chars like " < \' ‽'

expected_text = 'Support <i>odd</i> chars like &quot; &lt; &#39; ‽'
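
One way to read this second expectation: tags on a whitelist pass through untouched, and every other character that is special in HTML gets entity-escaped. Below is a minimal stdlib-only sketch of that reading. The helper name `escape_loose_chars` and the one-tag whitelist are assumptions made for illustration, and the regex tokenizer is not a real HTML parser, so this is not a replacement for Sanitize or Nokogiri. It also does not cover the first test string, where the disallowed `<span>` is expected to be dropped rather than escaped.

```ruby
require 'cgi/escape'

# Hypothetical whitelist: the question only shows <i> being preserved.
ALLOWED_TAGS = %w[i].freeze

# Matches a bare opening or closing whitelisted tag, e.g. <i> or </i>.
# The outer capture group makes String#split keep the matched tags.
# Tags carrying attributes do not match and are escaped as plain text.
ALLOWED_TAG_RE = /(<\/?(?:#{ALLOWED_TAGS.join('|')})>)/i

def escape_loose_chars(input)
  # split yields alternating pieces: text, tag, text, tag, ...
  input.split(ALLOWED_TAG_RE).each_slice(2).map do |text, tag|
    # Escape &, <, >, " and ' in the text; emit the tag unchanged.
    # tag is nil for the final text piece, hence to_s.
    CGI.escapeHTML(text) + tag.to_s
  end.join
end

puts escape_loose_chars(%q(Support <i>odd</i> chars like " < ' ‽))
```

Because anything that is not an exact whitelisted tag is escaped, the sketch fails closed: unknown markup ends up as visible text rather than as live HTML.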

Is this something you've solved? What tools did you use?

1 Answer
Answered by Mike Dalessio

You may want to try the Loofah gem:

Loofah.document("SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>").to_html
=> "SOME TEXT MORE' &amp; TEXT!!!" 

Loofah isn't handling the Unicode character in the second example for some reason, but I'd be happy to look into it if you file a GitHub issue on Loofah (full disclosure: I'm the author of Loofah and co-author of Nokogiri).

Some more links: