How to sanitize a string with nested HTML tags but keep the <em> tag?

Question

776 views Asked by bmalets At 25 November 2014 at 11:49

I am trying to sanitize Solr search results because they contain HTML tags:

ActionController::Base.helpers.sanitize( result_string )

It is easy to sanitize a non-highlighted string like: I know <ul><li>ruby</li> <li>rails</li></ul>.

But when the results are highlighted, I have additional important tags inside, <em> and </em>:

I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>

So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags. And it is bad :)

How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?

I found a way, but it's slow and not pretty:

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag| 
  string.gsub!( "<<em>#{tag}</em>>",  '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end

string = ActionController::Base.helpers.sanitize string, tags: %w(em)

How can I optimize it, or is there a better solution? For example, a regex that removes the HTML tags but keeps <em> and </em>.

Please help, thanks.

There are 3 answers

gabrielhilal On 25 November 2014 at 12:06

I think you can use sanitize:

Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>

So something like this should work:

sanitize result_string, tags: %w(em)

Andrea Salicetti On 25 November 2014 at 12:06

With an additional parameter to sanitize, you can specify which tags are allowed.

In your example, try:

ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )

That should do the trick.
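
Note that whitelisting em alone may not be enough for the highlighted string from the question: there the highlighter has wrapped the tag names themselves, as in <<em>ul</em>>, so the leftover angle brackets are plain text rather than tags. A minimal sketch of a pre-cleaning step that could run before sanitize, assuming the tag list from the question; the names STRIPPED_TAGS, HIGHLIGHTED_TAG and strip_highlighted_tags are hypothetical and not part of Rails:

```ruby
# Tag names taken from the list in the question.
STRIPPED_TAGS = %w[p ul li ol span b br].freeze

# Matches <<em>tag</em>> and </<em>tag</em>> for every listed tag,
# so the whole string is scanned once instead of twice per tag.
HIGHLIGHTED_TAG = %r{</?<em>(?:#{Regexp.union(STRIPPED_TAGS).source})</em>>}

# Hypothetical helper: removes highlighted tags, leaves everything else alone.
def strip_highlighted_tags(text)
  text.gsub(HIGHLIGHTED_TAG, '')
end

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
         '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
cleaned = strip_highlighted_tags(string)
puts cleaned
```

In a Rails app, cleaned could then be passed to ActionController::Base.helpers.sanitize(cleaned, tags: %w(em)) to drop any remaining plain tags.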

Nermin On 25 November 2014 at 15:30 (accepted answer)

You could call gsub! to discard all tags except the <em> tags that stand on their own, i.e. those that are not wrapped inside another HTML tag.

result_string.gsub!(/(<\/?[^e][^m]>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')

would do the trick

To explain:

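# first group (<\/?[^e][^m]>)
# matches plain tags with a two-character name whose first character
# is not "e" and whose second is not "m", e.g. <ul> or </li>;
# this skips <em> and </em>, but also tags of other lengths such as <p> or <span>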

# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>>   or <<em>ul</em>>

# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>>   or  </<em>ul</em>>

# and gsub replaces all of this with empty string
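
The first group of the pattern above only matches two-character tag names, so tags such as <p> or <span class="x"> would survive. A possible variant, sketched here with a negative lookahead so that plain tags of any length are removed; the names WRAPPED_TAG, PLAIN_TAG, TAGS_TO_DROP and keep_only_em are hypothetical, and this is an untuned sketch rather than a full HTML parser:

```ruby
# Tag whose name was wrapped by the highlighter,
# e.g. <<em>ul</em>> or </<em>ul</em>>.
WRAPPED_TAG = %r{</?<em>\w*</em>[^<>]*>}

# Plain opening or closing tag of any length, with optional attributes,
# whose name is not exactly "em" (the lookahead rejects <em> and </em>).
PLAIN_TAG = %r{</?(?!em\b)\w+[^<>]*>}

TAGS_TO_DROP = Regexp.union(WRAPPED_TAG, PLAIN_TAG)

# Hypothetical helper: returns a copy of text with only <em> tags left.
def keep_only_em(text)
  text.gsub(TAGS_TO_DROP, '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
puts keep_only_em(highlighted)
# => I <em>know</em> <em>ruby</em> <em>rails</em>
```

Unlike gsub!, gsub returns a new string and leaves the input untouched, which avoids the nil that gsub! returns when nothing matches.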
