How do I pass a white list to the Caja web standalone script?


I'm using http://caja.appspot.com/html-css-sanitizer-minified.js to sanitize user html, however in some instances I want to restrict the tags used to just a white list.

I've found https://code.google.com/p/google-caja/wiki/CajaWhitelists which describes how to define a white list, but I can't work out how to pass it to the html_sanitize method provided by html-css-sanitizer-minified.js

I've tried calling html.sanitizeWithPolicy(the_html, white_list); but I get an error:

TypeError: a is not a function

Which is hard to debug due to the minification, but it seems likely that html-css-sanitizer-minified.js does not contain everything in the html-sanitizer.js file.

I've tried using html-sanitizer.js combined with cssparser.js instead of the minified version, but I get errors before calling it, presumably because I am missing other dependencies.

How can I make this work?

Edit: sanitizeWithPolicy does exist in the minified file, but something is missing further down the process. This suggests that this file can't be used with a custom white list. I'm now investigating whether it is possible to work out which unminified files I need to include to make my own version.

Edit2: I was missing two files https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html4-defs.js?spec=svn1950&r=1950 and https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/uri.js?r=5170

However, I am now getting an error because sanitizeWithPolicy expects a function, not a white list object. Also, the html4-defs.js file is very old, and according to this I would have to build the Caja project in order to get a more recent one.
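
For reference, the function that sanitizeWithPolicy wants appears to be a tag policy: it is called once per tag with the tag name and an attribute array, and returns a decision. A white list could be wrapped into such a function along these lines. This is only a sketch: makeWhiteListPolicy, basePolicy and stripAllAttributes are hypothetical names, and the return convention (an object with an attribs array to keep the tag, null to drop it) is an assumption that should be checked against the html-sanitizer.js revision in use, since older revisions may expect a different shape.

```javascript
// Hypothetical helper: builds a tag policy function from an array of tag names.
// basePolicy is whatever policy would otherwise be used; it is assumed to take
// (tagName, attribs) and return a decision object, or a falsy value to drop the tag.
function makeWhiteListPolicy(allowedTags, basePolicy) {
  var allowed = {};
  for (var i = 0; i < allowedTags.length; i++) {
    allowed[allowedTags[i].toLowerCase()] = true;
  }
  return function (tagName, attribs) {
    var key = String(tagName).toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(allowed, key)) {
      return null; // not on the white list: drop the tag
    }
    return basePolicy(tagName, attribs);
  };
}

// Stand-in base policy for illustration only: keeps the tag, strips all attributes.
function stripAllAttributes(tagName, attribs) {
  return { attribs: [] };
}

var policy = makeWhiteListPolicy(['p', 'b'], stripAllAttributes);
```

With the real library loaded, the result would be passed as html.sanitizeWithPolicy(the_html, policy), with the library's own tag policy in place of stripAllAttributes, assuming one is exported in your build.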


There are 2 answers

SystemicPlural On

I solved this by downloading the unminified files

https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html-sanitizer.js

https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/uri.js

https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html4-defs.js?spec=svn1950&r=1950 (This last one is from an old revision. This file is built from the Java files, would be great if a more up to date one was available.)

I then added a new function to html-sanitizer.js

/**
* Trims down the element white list to just those passed in whilst still not allowing unsafe elements.
* @param {array} custom_elements An array of elements to include.
*/
function useCustomElements(custom_elements) {
  var length = custom_elements.length;
  var new_elements = {};
  for (var i = 0; i < length; i++) {
      var key = custom_elements[i].toLowerCase();
      if (typeof elements.ELEMENTS[key] !== 'undefined') {
          new_elements[key] = elements.ELEMENTS[key];
      }
  }
  elements.ELEMENTS = new_elements;
};

I then made this function public by adding the following near the end of the file, with the other public function statements.

html.useCustomElements = html['useCustomElements'] = useCustomElements;

Now I can call it like so:

var raw = '<p>This element is kept</p><div>this element is not</div>';
var white_list = ['p', 'b'];
// Trim the element table before sanitizing.
html.useCustomElements(white_list);
var sanitized = html.sanitize(raw);

I then manually added some HTML5 elements to the html4-defs.js file (the ones that just define block elements).

Attribute sanitization was still broken, because the html4-defs.js file is out of date relative to html-sanitizer.js. I changed this in html-sanitizer.js:

if ((attribKey = tagName + '::' + attribName,
     elements.ATTRIBS.hasOwnProperty(attribKey)) ||
    (attribKey = '*::' + attribName,
     elements.ATTRIBS.hasOwnProperty(attribKey))) {
  atype = elements.ATTRIBS[attribKey];
}

to

if (elements.ATTRIBS.hasOwnProperty(attribName)) {
  atype = elements.ATTRIBS[attribName];
}
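
A less destructive variant might keep the qualified lookups and fall back to the bare attribute name only when neither matches, so that the same code works with either generation of the defs file. A minimal sketch, assuming the two key formats shown above; lookupAttribType is a hypothetical name, not part of Caja, and the tables are made-up illustrations:

```javascript
// Hypothetical helper: finds the attribute type in an ATTRIBS-style table,
// trying 'tag::attr', then '*::attr', then the bare 'attr' used by old defs files.
function lookupAttribType(attribsTable, tagName, attribName) {
  var candidates = [
    tagName + '::' + attribName,
    '*::' + attribName,
    attribName
  ];
  for (var i = 0; i < candidates.length; i++) {
    if (Object.prototype.hasOwnProperty.call(attribsTable, candidates[i])) {
      return attribsTable[candidates[i]];
    }
  }
  return null; // unknown attribute: the caller should drop it
}

// Illustrative tables only; the real type values come from the defs file.
var newStyle = { 'a::href': 1, '*::title': 0 };
var oldStyle = { 'href': 1, 'title': 0 };
```

Note that the bare-name fallback still loses the per-tag distinction for old defs files, so it is no stricter than the change above in that case.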

This is far from ideal, but without compiling Caja and generating an up-to-date html-defs.js file I can't see a way around it.

This still leaves CSS sanitization. I would like this as well, but I am missing the CSS def files and can't find any that work via search, so I have turned it off for now.

EDIT: I've managed to extract the html-defs from html-css-sanitizer-minified.js. I've uploaded a copy to here. It includes elements like 'nav' so it has been updated for html5.

I've tried to do the same for the CSS parsing. I managed to extract the defs, but they depend on a bit count, and I can't find any way to calculate which bits were used for which defaults.

SystemicPlural On

I've decided on another approach. I've left the other answer in case I manage to find the bit values for the css definitions as it would be preferable to this one if I could get it to work.

This time I've taken the html-css-sanitizer-minified file and injected a bit of code into it so that the element and attribute white lists can be modified.

Search for:

ka=/^(?:https?|mailto)$/i,m={};

And after it insert the following:

// Keep copies of the default tables so the white lists can be reset later.
var unmodified_elements = {};
for (var property_name in $.ELEMENTS) {
    unmodified_elements[property_name] = $.ELEMENTS[property_name];
}
var unmodified_attributes = {};
for (var property_name in $.ATTRIBS) {
    unmodified_attributes[property_name] = $.ATTRIBS[property_name];
}

var resetElements = function () {
    $.ELEMENTS = {};
    for(var property_name in unmodified_elements) {
        $.ELEMENTS[property_name] = unmodified_elements[property_name];
    }
    $.f = $.ELEMENTS;
};

var resetAttributes = function () {
    $.ATTRIBS = {};
    for(var property_name in unmodified_attributes) {
        $.ATTRIBS[property_name] = unmodified_attributes[property_name];
    }
    $.m = $.ATTRIBS;
};

var resetWhiteLists = function () {
    resetElements();
    resetAttributes();
};

/**
 * Trims down the element white list to just those passed in whilst still not allowing unsafe elements.
 * @param {array} custom_elements An array of elements to include.
 */
var applyElementsWhiteList = function(custom_elements) {
    resetElements();
    var length = custom_elements.length;
    var new_elements = {};
    for (var i = 0; i < length; i++) {
        var key = custom_elements[i].toLowerCase();
        if (typeof $.ELEMENTS[key] !== 'undefined') {
            new_elements[key] = $.ELEMENTS[key];
        }
    }
    $.f = new_elements;
    $.ELEMENTS = new_elements;
};

/**
 * Trims down the attribute white list to just those passed in whilst still not allowing unsafe attributes.
 * @param {array} custom_attributes An array of attributes to include.
 */
var applyAttributesWhiteList = function(custom_attributes) {
    resetAttributes();
    var length = custom_attributes.length;
    var new_attributes = {};
    for (var i = 0; i < length; i++) {
        var key = custom_attributes[i].toLowerCase();
        if (typeof $.ATTRIBS[key] !== 'undefined') {
            new_attributes[key] = $.ATTRIBS[key];
        }
    }
    $.m = new_attributes;
    $.ATTRIBS = new_attributes;
};

m.applyElementsWhiteList = applyElementsWhiteList;
m.applyAttributesWhiteList = applyAttributesWhiteList;
m.resetWhiteLists = resetWhiteLists;

You can now apply a white list with:

var raw = "<a>element tags removed</a><p class='class-removed' style='color:black'>the p tag is kept</p>";
var tag_white_list = [
    'p'
];
var attribute_white_list = [
    '*::style'
];
html.applyElementsWhiteList(tag_white_list);
html.applyAttributesWhiteList(attribute_white_list);
var san = html.sanitize(raw);
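
Because the white lists are stored on the shared sanitizer object, they stay in force for every later call until resetWhiteLists is called. One way to avoid leaking a white list into unrelated calls is a small wrapper that always resets afterwards. A sketch, assuming the three functions injected above are exposed on html; sanitizeWithWhiteLists is a hypothetical name:

```javascript
// Hypothetical wrapper: applies the white lists, sanitizes, and always restores
// the defaults, even if sanitize throws. The sanitizer object is passed in so
// that the function does not depend on a global.
function sanitizeWithWhiteLists(sanitizer, raw, tags, attributes) {
  sanitizer.applyElementsWhiteList(tags);
  sanitizer.applyAttributesWhiteList(attributes);
  try {
    return sanitizer.sanitize(raw);
  } finally {
    sanitizer.resetWhiteLists();
  }
}
```

It would be called as sanitizeWithWhiteLists(html, raw, tag_white_list, attribute_white_list) in place of the three calls above.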

This approach also sanitizes the styles, which I needed. Another white list could be injected for those, but I don't need that, so I haven't written one.