HTML Strip Function

Question


Asked by Fabian Bigler on 29 December 2024 at 17:46

Here is a tough nut to crack.

I have an HTML string that needs to be stripped of certain tags, attributes AND style properties.

There are basically three approaches to consider:

String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)

I would like to have lists for:

acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)

These lists are then passed to the function that strips the HTML.

Example Input:

<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>

Example Output (with parameter lists from above):

<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>

The entire BODY tag is stripped (not an accepted tag).
The properties margin, font-family and font-size are stripped from the DIV tag.
The properties font-family and font-size are stripped from the SPAN tag.

What have I tried?

At first glance, regex seemed to be the best approach, but I couldn't get it working properly. Articles on Stack Overflow I had a look at:

...and many more.

I tried the following regex:

Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
    ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html As String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)

However, this only removes tags, not attributes or properties!

I'm definitely not looking for someone to do the whole job for me, just for someone to point me in the right direction.

I'm happy with either C# or VB.NET as answers.


There are 2 answers

Steve Lillis On 18 November 2014 at 10:11

I would probably just write this myself as a multi-step process:

1) Drop any property-removal rules that apply to tags which are themselves going to be removed (those tags won't be there anyway!)

2) Walk the document, taking a copy of it without the excluded tags. In your example, copy everything up to an excluded tag such as "<BODY", then wait until you see ">" before continuing to copy. While in copy mode, if you see an excluded attribute (e.g. "ExcludedAttribute="), stop copying until you reach its closing quotation mark.

You'll probably want to validate the HTML and normalize its formatting before running this process, to avoid broken output.

Oh, and copy in chunks, i.e. just keep the index of copy start until you reach copy end, then copy the whole chunk, not individual characters!

Hopefully that helps as a starting point.
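The tag-skipping part of step 2, combined with the chunked copying, can be sketched roughly as below. This is only an illustration: the names ManualTagStripper and StripTags are made up, it handles tags only (not attributes or style properties), and it assumes that '<' and '>' never appear inside attribute values or text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ManualTagStripper
{
    // Copies html to the output and skips every tag whose name is not in allowedTags.
    // Text between tags is always kept.
    // Assumption: '<' and '>' occur only as tag delimiters.
    public static string StripTags(string html, ISet<string> allowedTags)
    {
        var output = new StringBuilder(html.Length);
        int copyStart = 0;                       // start of the chunk still waiting to be copied
        int open = html.IndexOf('<');
        while (open >= 0)
        {
            int close = html.IndexOf('>', open);
            if (close < 0) break;                // unterminated tag: treat the rest as text

            // The tag name runs from after '<' (and an optional '/') to the first whitespace, '/' or '>'.
            int nameStart = open + 1;
            if (nameStart < close && html[nameStart] == '/') nameStart++;
            int nameEnd = nameStart;
            while (nameEnd < close && !char.IsWhiteSpace(html[nameEnd]) && html[nameEnd] != '/') nameEnd++;
            string name = html.Substring(nameStart, nameEnd - nameStart);

            if (!allowedTags.Contains(name))
            {
                // Flush everything before the tag as one chunk, then skip the tag itself.
                output.Append(html, copyStart, open - copyStart);
                copyStart = close + 1;
            }
            open = html.IndexOf('<', close + 1);
        }
        output.Append(html, copyStart, html.Length - copyStart);   // trailing chunk
        return output.ToString();
    }
}
```

With a case-insensitive set containing only "DIV" (a HashSet<string> built with StringComparer.OrdinalIgnoreCase), StripTags("<BODY><DIV>Hello</DIV></BODY>", allowed) returns "<DIV>Hello</DIV>". Real-world HTML breaks the delimiter assumption quickly, which is the main argument for a parser-based approach.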

**Tyress** · Accepted Answer · 2014-11-18T11:02:28+00:00

Definitely use a library! (See this)

With HTMLAgilityPack you can do pretty much everything you want:

Remove tags you don't want:

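// requires: using System; using System.Linq; using HtmlAgilityPack;
// doc is an HtmlDocument that has already been loaded, e.g. via doc.LoadHtml(html)
string[] allowedTags = { "SPAN", "DIV", "OL", "LI" };
// Look at elements only: "//node()" would also match text nodes (named "#text") and strip them.
// ToList() takes a snapshot so nodes can be removed while iterating.
foreach (HtmlNode node in doc.DocumentNode.Descendants().Where(n => n.NodeType == HtmlNodeType.Element).ToList())
{
    if (!allowedTags.Contains(node.Name, StringComparer.OrdinalIgnoreCase))
    {
        // true = keep the removed node's children and attach them to its parent
        node.ParentNode.RemoveChild(node, true);
    }
}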

Remove attributes you don't want & remove properties

// requires: using System; using System.Linq; using HtmlAgilityPack;
string[] allowedAttributes = { "STYLE", "SRC" };
string[] allowedProperties = { "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR" };

foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
{
    // ToList() takes a snapshot so attributes can be removed while iterating
    foreach (HtmlAttribute att in node.Attributes.ToList())
    {
        if (!allowedAttributes.Contains(att.Name, StringComparer.OrdinalIgnoreCase))
        {
            node.Attributes.Remove(att);
        }
        else if (att.Name.Equals("style", StringComparison.OrdinalIgnoreCase))
        {
            // Only STYLE is rewritten; other accepted attributes such as SRC keep their value.
            // Split into "name:value" declarations and keep those whose name is accepted.
            var kept = att.Value.Split(';')
                .Where(d => allowedProperties.Contains(d.Split(':')[0].Trim(), StringComparer.OrdinalIgnoreCase));
            att.Value = string.Concat(kept.Select(d => d.Trim() + ";"));
        }
    }
}
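The property step can also be pulled out into a small helper that needs no HTML library at all, which makes it easy to test on its own. A minimal sketch, assuming style values are simple "name:value;" lists with no semicolons inside the values; StyleFilter and Filter are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StyleFilter
{
    // Returns the style value with every declaration removed whose property
    // name is not in acceptedProperties. Hypothetical helper for illustration.
    // Assumption: declarations are separated by ';' and values contain no ';'.
    public static string Filter(string style, ISet<string> acceptedProperties)
    {
        var result = new StringBuilder();
        foreach (string declaration in style.Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue;            // empty or malformed piece, e.g. after the last ';'

            string name = declaration.Substring(0, colon).Trim();
            if (acceptedProperties.Contains(name))
            {
                // Re-emit the declaration in normalized "name:value;" form.
                result.Append(name).Append(':')
                      .Append(declaration.Substring(colon + 1).Trim()).Append(';');
            }
        }
        return result.ToString();
    }
}
```

Passing a HashSet<string> created with StringComparer.OrdinalIgnoreCase keeps the lookup case-insensitive, so the upper-case list from the question matches lower-case property names. For the DIV in the question, Filter("margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;", accepted) returns "text-align:Left;".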
