There is a tough nut to crack.
I have an HTML string that needs to be stripped of certain tags, attributes AND CSS properties.
As far as I can see, there are three approaches to consider:
- String Operations: Iterate through the HTML string and strip it via string operations 'manually'
- Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
- Using a library to strip it (e.g. HTML Agility Pack)
Ideally, I would have whitelists for:
- acceptedTags (e.g. SPAN, DIV, OL, LI)
- acceptedAttributes (e.g. STYLE, SRC)
- acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
I would then pass these lists to the function that strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
- the entire tag Body is stripped (not accepted tag)
- properties margin, font-family and font-size are stripped from DIV-Tag
- properties font-family and font-size are stripped from SPAN-Tag.
What have I tried?
Regex seemed to be the best approach at first glance, but I couldn't get it working properly. Articles on Stack Overflow I had a look at:
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this only removes tags; it does not touch attributes or properties!
I'm definitely not looking for someone to do the whole job for me, but rather for someone who can point me in the right direction.
I'm happy with either C# or VB.NET as answers.
Definitely use a library! (See this)
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
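A minimal sketch of tag whitelisting with HtmlAgilityPack, assuming the NuGet package is referenced. The class name `TagWhitelist` and the method `StripTags` are hypothetical names chosen for illustration. Rejected elements are removed while their children are kept, which matches the question's example where BODY disappears but the DIV and SPAN inside it survive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class TagWhitelist
{
    // Removes every element whose name is not in acceptedTags,
    // but keeps the content (children) of the removed elements.
    public static string StripTags(string html, IEnumerable<string> acceptedTags)
    {
        var accepted = new HashSet<string>(acceptedTags, StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first: the tree is modified while we walk the list.
        var rejected = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !accepted.Contains(n.Name))
            .ToList();

        foreach (var node in rejected)
        {
            // keepGrandChildren = true moves the node's children up to its parent.
            node.ParentNode.RemoveChild(node, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

It could be called as `TagWhitelist.StripTags(html, new[] { "span", "div", "ol", "li" })`. Passing `false` as the second argument of `RemoveChild` would instead drop the element together with everything inside it, which may be preferable for tags such as SCRIPT.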
Remove attributes you don't want & remove properties
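A minimal sketch of attribute and style-property whitelisting, again assuming HtmlAgilityPack is referenced. `AttributeWhitelist`, `StripAttributes` and `FilterStyle` are hypothetical names. The style value is split naively on `;` and `:`, which is an assumption that holds for simple inline styles like those in the question but not for values that themselves contain semicolons (for example data URIs).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class AttributeWhitelist
{
    public static string StripAttributes(string html,
        IEnumerable<string> acceptedAttributes,
        IEnumerable<string> acceptedProperties)
    {
        var attrs = new HashSet<string>(acceptedAttributes, StringComparer.OrdinalIgnoreCase);
        var props = new HashSet<string>(acceptedProperties, StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element);

        foreach (var node in elements)
        {
            // Copy the attribute list because attributes are removed inside the loop.
            foreach (var attr in node.Attributes.ToList())
            {
                if (!attrs.Contains(attr.Name))
                {
                    attr.Remove();
                }
                else if (attr.Name.Equals("style", StringComparison.OrdinalIgnoreCase))
                {
                    var filtered = FilterStyle(attr.Value, props);
                    if (filtered.Length == 0) attr.Remove();   // nothing left worth keeping
                    else attr.Value = filtered;
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    // Keeps only "name:value" declarations whose name is whitelisted.
    // Assumption: declarations are separated by ';' and contain no nested ';'.
    private static string FilterStyle(string style, ISet<string> acceptedProperties)
    {
        var kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => d.Trim())
            .Where(d =>
            {
                int colon = d.IndexOf(':');
                return colon > 0 && acceptedProperties.Contains(d.Substring(0, colon).Trim());
            });

        return string.Concat(kept.Select(d => d + ";"));
    }
}
```

With the lists from the question this would be called as `AttributeWhitelist.StripAttributes(html, new[] { "style", "src" }, new[] { "text-align", "font-weight", "color", "background-color" })`. Running the tag pass first and the attribute pass second keeps each step simple, at the cost of parsing the HTML twice.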