How to use HTML Purifier to allow an entire document to pass through, including html, head, title and body


Given the code below, how do I use HTML Purifier to let the entire contents pass through? I want to allow the whole HTML document, but the html, head, style, title, body and meta tags get stripped out.

I even tried $config->set('Core.ConvertDocumentToFragment', false) but that didn't work.

Any help on where to start would be greatly appreciated.

I tried the example from "HTML Purifier - Change default allowed HTML tags configuration", but it doesn't work: I keep getting exceptions saying that the tags are not allowed. Note: I did add all the tags above to HTML.Allowed, but nothing seems to work.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1" />
<title>Hello World - Email Template</title>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Open+Sans:400,600);
body{-webkit-text-size-adjust: none;-ms-text-size-adjust: none;margin: 0;padding: 0;}
</style>
<body>
<h1>Hi there</h1>
</body>
</html>

There are 2 answers

pinkgothic

HTML Purifier by default only knows tags that are valid within a <body> context, because that's its intended use-case. Basically, it doesn't actually know what a <meta>, <html>, <head> or <title> tag is - and that's a big deal, because most of its security relies on understanding the semantic underpinnings of the HTML!

There are some older Stack Overflow questions on this topic, but they don't currently have very useful answers, so after some contemplation, I think your question still has merit and I am going to answer it here.

Generally, this has been discussed a few times on the HTML Purifier forums (e.g. in "Allow HTML, HEAD, STYLE and BODY tags"). In a nutshell, you can't do this without a significant amount of work, and unfortunately I'm not aware of any snippet of code that solves this problem with a simple copy and paste.

So you're going to have to dig into the guts of HTML Purifier.

You can teach HTML Purifier most tags and their associated behaviour using the instructions on the "Customize!" documentation page. The part most interesting for you is near the bottom: an example in which <form> is taught to HTML Purifier. Quoting from there for posterity:

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
$def = $config->getHTMLDefinition(true);
$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(
  array('_blank','_self','_parent','_top')
));
$form = $def->addElement(
  'form',   // name
  'Block',  // content set
  'Flow', // allowed children
  'Common', // attribute collection
  array( // attributes
    'action*' => 'URI',
    'method' => 'Enum#get|post',
    'name' => 'ID'
  )
);
$form->excludes = array('form' => true);

Each of the parameters corresponds to one of the questions we asked. Notice that we added an asterisk to the end of the action attribute to indicate that it is required. If someone specifies a form without that attribute, the tag will be axed. Also, the extra line at the end is a special extra declaration that prevents forms from being nested within each other.

You would have to do similar things with all tags outside of the <body> tag that you want to support (all the way up to <html>).

Note: Even if you add all these tags to HTML Purifier, the setting Core.ConvertDocumentToFragment that you discovered needs to be set to false (as you have done).

Alternative

If this looks like too much work, and you have other ways to sanitise the header section and body attributes of your document, you can also cut your document into pieces, sanitise the pieces separately, then carefully stick them back together.

(Or, of course, just use that other means of sanitising for the entire document.)
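
To make the "cut it into pieces" idea a bit more concrete, here is a minimal sketch. It is not a vetted implementation. It assumes PHP's bundled DOM extension is available, and the names sanitiseDocument(), innerHtml(), $cleanHead and $cleanBody are hypothetical, made up for this illustration. The two callables shown are stand-ins only: in real use the body callable would wrap HTML Purifier, and the head callable would be whatever whitelist you trust for header content.

```php
<?php
// Sketch only: sanitiseDocument(), innerHtml() and the two callables are
// hypothetical names for illustration; they are not part of HTML Purifier.

/** Serialise the children of a node (its "inner HTML"). */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

/**
 * Split a document into head and body, clean each piece with its own
 * callable, then rebuild a fixed skeleton around the cleaned pieces.
 * Attributes on <html> and <body> are deliberately not carried over.
 */
function sanitiseDocument(string $html, callable $cleanHead, callable $cleanBody): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $head = $doc->getElementsByTagName('head')->item(0);
    $body = $doc->getElementsByTagName('body')->item(0);

    $headHtml = $head ? $cleanHead(innerHtml($head)) : '';
    $bodyHtml = $body ? $cleanBody(innerHtml($body)) : '';

    return "<!DOCTYPE html>\n<html>\n<head>\n" . $headHtml
         . "\n</head>\n<body>\n" . $bodyHtml . "\n</body>\n</html>";
}

// Stand-in head cleaner (assumption): keep the <title> text and nothing else.
$cleanHead = function (string $headHtml): string {
    if (preg_match('~<title>(.*?)</title>~is', $headHtml, $m)) {
        $text = html_entity_decode($m[1], ENT_QUOTES, 'UTF-8');
        return '<title>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</title>';
    }
    return '';
};

// Stand-in body cleaner (assumption): strip_tags() is NOT a real sanitiser.
// With HTML Purifier loaded you would call $purifier->purify($fragment) here.
$cleanBody = function (string $fragment): string {
    return strip_tags($fragment, '<h1><p>');
};

// Like the document in the question, this input never closes its <head>.
$input = '<html><head><title>Hello World</title><script src="x.js"></script>'
       . '<body><h1>Hi there</h1><script src="x.js"></script></body></html>';

$result = sanitiseDocument($input, $cleanHead, $cleanBody);
echo $result, "\n";
```

Rebuilding a fixed skeleton, rather than patching the cleaned pieces back into the original markup, means nothing outside the two cleaned fragments survives unless you add it back on purpose. If you need <style> blocks or <meta> tags from the header, the head callable is where you would whitelist them.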

ky4k0b

Quick workaround: edit the extractBody() function in Lexer.php so that it returns the HTML unchanged.

public function extractBody($html)
{
    return $html;
}