Convert special chars to HTML entities, without changing tags and parameters

I'm using the FreeTextBox editor to collect HTML created by users. The problem is that the editor does not convert special characters to HTML entities, with the exception of "<>". I cannot use theHTML = Server.HtmlEncode(theHTML), because it encodes all of the HTML, including tags and parameters, and I don't want to write an endless list of theHTML.Replace lines.

Is there any other function or method that converts characters to HTML entities, but only outside tags?

There are 3 answers

0
backslash17 (BEST ANSWER)

After a lot of searching, I found that I was overlooking a property of the FreeTextBox component. The property is ConvertHtmlSymbolsToHtmlCodes, which has to be set to true.

It also helps to enable FormatHtmlTagsToXhtml if you need to insert the output into XHTML pages, because XHTML applies strict validation to tag parameters and the quotes surrounding them.

0
David

I would suggest parsing the markup with LINQ to XML, walking each element, and encoding the value of every element and attribute node. I'll try to come up with some code, but hey, it's 5pm on a Friday!
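The answer never got its code, so here is a minimal sketch of the same idea in Python rather than LINQ to XML. It assumes the input is a well-formed XHTML fragment. The function name `reencode_fragment` and the temporary `<wrapper>` element are made up for illustration. The standard-library serializer escapes text and attribute values on output, and with an ASCII target encoding it writes non-ASCII characters as numeric character references while leaving tag and attribute names alone.

```python
import xml.etree.ElementTree as ET


def reencode_fragment(fragment: str) -> str:
    """Parse a well-formed XHTML fragment and re-serialize it as pure ASCII.

    Text and attribute values are escaped by the serializer; tags and
    attribute names pass through unchanged.
    """
    # Hypothetical wrapper element so a fragment with several top-level
    # nodes (or bare text) still parses as a single XML document.
    root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")

    # With an ASCII target encoding, characters that cannot be encoded are
    # written as numeric character references such as &#233;.
    serialized = ET.tostring(root, encoding="us-ascii").decode("ascii")

    if serialized == "<wrapper />":  # empty fragment
        return ""
    return serialized[len("<wrapper>"):-len("</wrapper>")]


print(reencode_fragment('<p title="caf\u00e9">na\u00efve \u00a9</p>'))
```

This raises `xml.etree.ElementTree.ParseError` on input that is not well-formed, so it only helps when the editor already produces valid XHTML. It also emits numeric references (`&#233;`) rather than named ones (`&eacute;`).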

2
bobince

If you've got a mixture of < meaning start a tag and < meaning a literal less-than sign, you can't possibly tell which is ‘a tag’ to ignore and which isn't.

About all you could do would be to detect < usages that weren't a conventionally-formed start or end tag, using a nasty unreliable regex something like:

<(?!\w+(\s+\w+="[^"<]*")*\s*/?>|/\w+\s*>)

and replace them with &lt;. Similarly for & with &amp;:

&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)

(> does not normally have to be escaped.)
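As a rough illustration, the two patterns above could be applied as follows. This is a Python sketch, and the function name `escape_stray` is an assumption for the example. It inherits every limitation of the regex approach.

```python
import re

# "<" that does not begin a conventionally formed start or end tag.
STRAY_LT = re.compile(r'<(?!\w+(\s+\w+="[^"<]*")*\s*/?>|/\w+\s*>)')

# "&" that does not begin a named, decimal, or hexadecimal reference.
STRAY_AMP = re.compile(r'&(?!\w+;|#\d+;|#x[0-9A-Fa-f]+;)')


def escape_stray(html_text: str) -> str:
    """Escape literal '<' and '&' while leaving recognizable tags alone."""
    html_text = STRAY_AMP.sub("&amp;", html_text)
    return STRAY_LT.sub("&lt;", html_text)


print(escape_stray('<p class="x">1 < 2 & 3 &amp; 4</p>'))
# <p class="x">1 &lt; 2 &amp; 3 &amp; 4</p>

# Comments are not recognized as markup, so their "<" gets escaped too.
print(escape_stray('<!-- note -->'))
# &lt;!-- note -->
```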

This won't allow every possible valid way of constructing elements, and it will allow broken mis-nested elements, and non-existent entities, and would mess up non-element constructs like comments. Because regex can't parse HTML, let alone HTML with added crunchy broken bits.

So it's hardly foolproof. If you want proper markup that won't break your page when they accidentally leave a div open, the best first step is to parse it as XHTML and refuse it with an error if it's not well-formed XML.
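A well-formedness check of that kind can be quite small. The sketch below uses Python's standard XML parser. The function name `check_well_formed` and the `<root>` wrapper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(fragment: str):
    """Return (True, None) if the fragment parses as XML, else (False, reason)."""
    try:
        # Hypothetical wrapper so several top-level elements are accepted.
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError as err:
        return False, str(err)
    return True, None


print(check_well_formed("<div><p>ok</p></div>"))  # accepted
print(check_well_formed("<div><p>left open"))     # rejected with a parser message
```

A plain XML parser also rejects HTML-only named entities such as `&nbsp;` unless they are declared, so a real check would need a DTD or a pre-processing step for those.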

If you have a rich text editor component that generates output where a literal < is not escaped, then it's time to replace that component with something less appalling. But in general it's not a good idea to let users create HTML, because they're really rubbish at it. Plus allowing anyone to input HTML gives them complete control over wrecking the site and its security with JavaScript. A simpler text-markup language is often a win.