I am trying to sanitize Solr search results, because they contain HTML tags:
ActionController::Base.helpers.sanitize( result_string )
It is easy to sanitize a non-highlighted string such as: I know <ul><li>ruby</li> <li>rails</li></ul>.
But when the results are highlighted, there are additional important tags inside, <em> and </em>:
I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>
So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags left in it. And that is bad :)
How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?
I found a way, but it's slow and not pretty:
string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
  string.gsub!( "<<em>#{tag}</em>>", '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end
string = ActionController::Base.helpers.sanitize string, tags: %w(em)
How can I optimize it, or do it with a better solution, e.g. a regex that removes the HTML tags but keeps <em> and </em>?
Please help, thanks.
You could call gsub! to discard every tag except the <em> tags that stand on their own, i.e. those that are not embedded inside another HTML tag.
would do the trick
To explain:
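A minimal sketch of this idea in plain Ruby, assuming the highlighter only wraps bare tag names (as in `<<em>ul</em>>`) and never attributes. The method name `keep_only_em` and both regexes are illustrative assumptions, not the original answer's code. The first pass drops tags whose name was highlighted, the second drops every remaining tag that is not `<em>` or `</em>`.

```ruby
# Hypothetical helper: strips HTML tags from a Solr-highlighted string,
# keeping only the standalone <em>...</em> highlight markers.
def keep_only_em(text)
  # Pass 1: remove tags whose name was wrapped by the highlighter,
  # e.g. "<<em>ul</em>>" or "</<em>li</em>>".
  highlighted_tag = %r{</?<em>\w+</em>[^<>]*>}

  # Pass 2: remove any remaining tag that is not exactly <em> or </em>.
  other_tag = %r{<(?!/?em>)[^>]*>}

  text.gsub(highlighted_tag, '').gsub(other_tag, '')
end

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
         '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

puts keep_only_em(string)
# => I <em>know</em> <em>ruby</em> <em>rails</em>
```

This avoids looping over a fixed list of tag names, but a regex is not a full HTML sanitizer. For untrusted input it is probably still worth passing the result through `ActionController::Base.helpers.sanitize string, tags: %w(em)` as in the question.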