HTML Strip Function


I have a tough nut to crack.

I have an HTML document that needs to be stripped of certain tags, attributes, and CSS properties.

As far as I can see, there are three approaches to consider:

  • String Operations: Iterate through the HTML string and strip it via string operations 'manually'
  • Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
  • Using a library to strip it (e.g. HTML Agility Pack)

Ideally, I would have lists of:

  • acceptedTags (e.g. SPAN, DIV, OL, LI)
  • acceptedAttributes (e.g. STYLE, SRC)
  • acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)

I would pass these lists to the function that strips the HTML.

Example Input:

<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>

Example Output (with parameter lists from above):

<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
  1. The BODY tag is stripped entirely (it is not an accepted tag).
  2. The properties margin, font-family and font-size are stripped from the DIV tag.
  3. The properties font-family and font-size are stripped from the SPAN tag.

What have I tried?

Regex seemed to be the best approach at first glance, but I couldn't get it to work properly. Questions on Stack Overflow that I looked at:

...and many more.

I tried the following regex:

Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
    ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html As String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)

However, this removes only tags, not attributes or properties!

I'm definitely not looking for someone to do the whole job for me, but rather for someone who can point me in the right direction.

I'm happy with answers in either C# or VB.NET.


There are 2 answers

Answer by Tyress (best answer)

Definitely use a library! (See this)

With the HTML Agility Pack you can do pretty much everything you want:

  1. Remove tags you don't want:

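    // Needs: using System.Linq; using HtmlAgilityPack;
    // "doc" is an HtmlDocument that has already been loaded, e.g. via doc.LoadHtml(html).
    string[] allowedTags = {"SPAN", "DIV", "OL", "LI"};
    // "//*" selects elements only; "//node()" would also match text nodes
    // (named "#text"), so the text content would be removed as well.
    // SelectNodes returns null when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//*");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            if (!allowedTags.Contains(node.Name.ToUpperInvariant()))
            {
                // true = keep the removed node's children (they move up to its parent)
                node.ParentNode.RemoveChild(node, true);
            }
        }
    }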
    
  2. Remove attributes you don't want and strip unwanted style properties:

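    // Needs: using System; using System.Collections.Generic; using System.Linq;
    string[] allowedAttributes = { "STYLE", "SRC" };

    // SelectNodes returns null when nothing matches.
    var elements = doc.DocumentNode.SelectNodes("//*");
    if (elements != null)
    {
        foreach (HtmlNode node in elements)
        {
            // Collect first: removing from node.Attributes while enumerating it is not safe.
            List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
            foreach (HtmlAttribute att in node.Attributes)
            {
                if (!allowedAttributes.Contains(att.Name.ToUpperInvariant()))
                {
                    attributesToRemove.Add(att);
                }
                else if (att.Name.Equals("style", StringComparison.OrdinalIgnoreCase))
                {
                    // Only STYLE carries properties; other accepted attributes (e.g. SRC) stay untouched.
                    string newAttrib = string.Empty;
                    // Do the string manipulation here based on your accepted properties.
                    // One way would be to split att.Value by semicolons and check each
                    // declaration against the list, not appending those that don't match.
                    // Maybe use a StringBuilder for this, too.

                    att.Value = newAttrib;
                }
            }
            foreach (HtmlAttribute attribute in attributesToRemove)
            {
                node.Attributes.Remove(attribute);
            }
        }
    }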
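
The property filtering that the second snippet leaves as a comment could look roughly like the following. This is only a sketch and not part of the original answer: `StyleFilter`, `FilterStyle` and `StyleFilterDemo` are hypothetical names, it uses nothing but the standard library, and it assumes that no property value contains a semicolon (a data URI or a quoted string could, and would be cut apart here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Hypothetical helper: keeps only the "name:value" declarations whose
    // property name is in allowedProperties and drops everything else.
    public static string FilterStyle(string style, ISet<string> allowedProperties)
    {
        if (string.IsNullOrEmpty(style)) return string.Empty;

        IEnumerable<string> kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(declaration => declaration.Trim())
            .Where(declaration =>
            {
                // Declarations without a property name before ':' are dropped.
                int colon = declaration.IndexOf(':');
                return colon > 0
                    && allowedProperties.Contains(declaration.Substring(0, colon).Trim());
            });

        // Re-join with ';' and keep the trailing ';' used in the question's output.
        string result = string.Join(";", kept);
        return result.Length == 0 ? result : result + ";";
    }
}

public static class StyleFilterDemo
{
    public static void Main()
    {
        // Case-insensitive set, so "TEXT-ALIGN" also matches "text-align".
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR"
        };

        string divStyle = "margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;";
        Console.WriteLine(StyleFilter.FilterStyle(divStyle, allowed)); // prints "text-align:Left;"
    }
}
```

Inside the attribute loop, the placeholder assignment would then become something like `att.Value = StyleFilter.FilterStyle(att.Value, allowedProperties);`, with `allowedProperties` being a set built as in `Main`.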
    
Answer by Steve Lillis

I would probably just write this myself as a multi-step process:

1) Drop any property-removal rules that apply to tags which are themselves going to be removed (those tags won't be there anyway!).

2) Walk the document, building a copy without the excluded tags. In your example, copy everything up to "<BODY", then wait until you see ">" before continuing to copy. If you are in copy mode and see an excluded attribute name followed by "=", stop copying until you see the closing quotation mark.

You'll probably want to validate the HTML and normalize its formatting before running this process, to avoid broken output.

Oh, and copy in chunks: keep the index where the current copy run starts until you reach its end, then copy the whole chunk at once rather than individual characters!

Hopefully that helps as a starting point.
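
The chunked copying described in this answer might be sketched as follows. This is only an illustration: `TagStripper`, `StripTags` and `TagStripperDemo` are hypothetical names, the sketch handles the tag step only (not attributes or properties), and it assumes that ">" never appears inside an attribute value, which is one of the things the pre-work validation would have to guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: copies html to the output in chunks and skips
    // every opening or closing tag whose name is not in allowedTags.
    public static string StripTags(string html, ISet<string> allowedTags)
    {
        var output = new StringBuilder(html.Length);
        int chunkStart = 0; // index where the current run of kept text starts
        int pos = 0;

        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            if (open < 0) break;
            int close = html.IndexOf('>', open + 1);
            if (close < 0) break; // unterminated tag: the rest is kept verbatim

            // Read the tag name: skip '<' and an optional '/', then letters and digits.
            int nameStart = open + 1;
            if (nameStart < close && html[nameStart] == '/') nameStart++;
            int nameEnd = nameStart;
            while (nameEnd < close && char.IsLetterOrDigit(html[nameEnd])) nameEnd++;
            string name = html.Substring(nameStart, nameEnd - nameStart);

            // Anything without a name (e.g. "<!--" or "< 5") is left alone.
            if (name.Length > 0 && !allowedTags.Contains(name))
            {
                // Flush the whole pending chunk at once, then skip the excluded tag.
                output.Append(html, chunkStart, open - chunkStart);
                chunkStart = close + 1;
            }
            pos = close + 1;
        }

        // Flush whatever is left after the last excluded tag.
        output.Append(html, chunkStart, html.Length - chunkStart);
        return output.ToString();
    }
}

public static class TagStripperDemo
{
    public static void Main()
    {
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "SPAN", "DIV", "OL", "LI"
        };
        Console.WriteLine(TagStripper.StripTags("<BODY><DIV>Hello</DIV></BODY>", allowed));
        // prints "<DIV>Hello</DIV>"
    }
}
```

The content of an excluded tag is kept, matching the question's example where BODY disappears but the DIV and SPAN inside it survive.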