I am working on finding a good way to make user submitted data, in this case allow HTML and have it be as safe and fast as I can.
I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer here. I do agree partially. htmlpurifier has the best open source code out there for filtering user submitted HTML but there solution is very bulky and is not good for performance on a high traffic site. I might even use there solution someday but for now my goal is to find a more lightweight method.
I have been using the 2 functions below for about 2 and a half years now with no problems yet but I think it is time to take some input from the pro's on here if they will help me.
The first function is called FilterHTML($string) it is ran before user data is saved to a mysql database. The second function is called format_db_value($text, $nl2br = false) and I use it on a page where I plan to show the user submitted data.
Below the 2 functions is a bunch of the XSS codes I found on http://ha.ckers.org/xss.html and I then ran them on these 2 functions to see how affective my code is, I am somewhat pleased with the results, they did block out every code I tried but I know it is still not 100% safe obviously.
Can you guys please look over it and give me any advice for my code itself or even on the whole html filtering concept.
I would like to do a whitelist approach someday but htmlpurifier is the only solution I have found worth using for that and as I mentioned it is not lightweight as I would like.
function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
        $oldstring = $string;
        //bgsound|
        $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}
Below function is used when showing user submitted code on a webpage
function format_db_value($text, $nl2br = false) {
    if (is_array($text)) {
        $tmp_array = array();
        foreach ($text as $key => $value) {
            $tmp_array[$key] = format_db_value($value);
        }
        return $tmp_array;
    } else {
        $text = htmlspecialchars(stripslashes($text));
        if ($nl2br) {
            return nl2br($text);
        } else {
            return $text;
        }
    }
}
The codes below are from ha.ckers.org and they all seem to fail on my functions above
I did not try everyone on that site though there is a lot more, this is just some of them.
The original code is on the top line of each set and the code after running through my functions is on the line below it.
<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii
<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>
<IMG SRC=javascript:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC=javascript:alert('XSS')>
<IMG SRC=F  MLEJNALN !>
<IMG SRC=javascript:alert('XSS')>
<IMG SRC=...alert('XSS')>
<IMG SRC="jav
ascript:alert('XSS');">
<IMG SRC=...alert('XSS');">
perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
<BODY onload!#$%&()*~+-_.,:;?@[/|\]^`=alert("XSS")>
...
<iframe src=http://ha.ckers.org/scriptlet.html <
...
<LAYER SRC="http://ha.ckers.org/scriptlet.html"></LAYER>
......
<META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css>; REL=stylesheet">
...; REL=stylesheet">
<IMG STYLE="xss:...(alert('XSS'))">
<IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">
<XSS STYLE="xss:...(alert('XSS'))">
<XSS STYLE="xss:expression(alert('XSS'))">
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>
<IMG
SRC
=...
a
l
e
r
t
(
'
X
S
S
'
)
"
>
 
                        
Here is four alternatives :