How to sanitize a string with nested HTML tags but keep the <em> tag?


I am trying to sanitize Solr search results because they contain HTML tags:

ActionController::Base.helpers.sanitize( result_string )

It is easy to sanitize a string that is not highlighted, such as: I know <ul><li>ruby</li> <li>rails</li></ul>.

But when the results are highlighted, there are additional, important tags inside, <em> and </em>:

I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.

So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags left in it. And that is bad :)

How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?

I found a way, but it's slow and not pretty:

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag| 
  string.gsub!( "<<em>#{tag}</em>>",  '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end

string = ActionController::Base.helpers.sanitize string, tags: %w(em)

How can I optimize it, or is there a better solution? For example, a regex that removes the HTML tags but keeps <em> and </em>.

Please help, thanks.


There are 3 answers

0
Nermin On BEST ANSWER

You could call gsub! to discard every tag except the standalone <em> and </em> tags, that is, the ones that are not embedded inside another HTML tag.

result_string.gsub!(/(<\/?(?!em\b)\w+[^>]*>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')

would do the trick.

To explain:

# first group (<\/?(?!em\b)\w+[^>]*>)
# finds all plain HTML tags that are not <em> or </em>;
# the negative lookahead (?!em\b) skips em, and \w+ covers tag names
# of any length (a pattern like [^e][^m] would only match two-letter names)

# second group (<<em>\w*<\/em>>)
# finds all opening tags whose name is wrapped in <em> </em>, like:
# <<em>li</em>>   or <<em>ul</em>>

# third group (<\/<em>\w*<\/em>>)
# finds all closing tags whose name is wrapped in <em> </em>:
# </<em>li</em>>   or  </<em>ul</em>>

# gsub! then replaces every match with an empty string
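
As a rough check, here is a minimal sketch of the same idea wrapped in a non-mutating helper and run against the sample string from the question. The method name strip_tags_keep_em is hypothetical, and the pattern assumes the highlighter emits plain <em> tags with no attributes. It uses gsub rather than gsub!, because gsub! returns nil when nothing was replaced.

```ruby
# Hypothetical helper, not part of Rails or Solr: removes every tag except
# standalone <em> and </em>, including tags whose name was wrapped by the
# highlighter. Returns a new string and leaves the argument untouched.
def strip_tags_keep_em(text)
  pattern = %r{
    </?(?!em\b)\w+[^>]*>   # plain tags other than em, with or without attributes
    | </?<em>\w*</em>>     # opening or closing tags whose name is wrapped in em
  }x
  text.gsub(pattern, '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

puts strip_tags_keep_em(highlighted)
# prints: I <em>know</em> <em>ruby</em> <em>rails</em>
```

A regex like this only suits the simple markup Solr returns here; for arbitrary HTML a real parser or sanitize is the safer choice.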
2
gabrielhilal On

I think you can use sanitize:

Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>

So something like this should work:

sanitize result_string, tags: %w(em)
1
Andrea Salicetti On

With an additional parameter to sanitize, you can specify which tags are allowed.

In your example, try:

ActionController::Base.helpers.sanitize( result_string, tags: %w(em) ) 

It should do the trick.
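
One caveat, going by the question: on a highlighted string the tag names themselves are wrapped in <em>, so an allow-list alone may leave the surrounding angle brackets behind. Below is a minimal sketch of a pre-step that unwraps those pieces in a single pass before handing the string to sanitize. The method name unwrap_highlighted_tags is hypothetical, the default tag list is the one from the question, and the sanitize call is left as a comment because it needs a Rails environment.

```ruby
# Hypothetical pre-step, not part of Rails: removes <<em>tag</em>> and
# </<em>tag</em>> for the listed tag names only, in one gsub pass instead of
# two gsub! calls per tag.
def unwrap_highlighted_tags(text, tag_names = %w[p ul li ol span b br])
  names = Regexp.union(tag_names).source # e.g. "p|ul|li|ol|span|b|br"
  text.gsub(%r{</?<em>(?:#{names})</em>>}, '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

cleaned = unwrap_highlighted_tags(highlighted)
puts cleaned
# prints: I <em>know</em> <em>ruby</em> <em>rails</em>

# In a Rails app, any remaining markup could then go through the allow-list:
# ActionController::Base.helpers.sanitize(cleaned, tags: %w(em))
```

Keeping an explicit list of tag names means a highlighted word that merely sits between angle brackets in the text is not removed unless it is one of the listed names.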