GAS: parse XML - decode HTML entity name fails - decode entity decimal code succeeds


Using Google Apps Script, I would like to decode HTML, so that e.g.:

Some text &#x26; text <br/> &cent;

is stored as:

Some text & text 
¢

This is similar to the question "How to decode HTML entities".

I am posting it as a new question because the answer there does not work with HTML entity names, and because the supported GAS service has changed since then.

I use:

var str = 'Some text &#x26; text <br/> &cent;';
var xml = XmlService.parse('<d>' + str + '</d>');
var strDecoded = xml.getRootElement().getText();
Logger.log(strDecoded);

The GAS error message when parsing:

TypeError: The entity "cent" was referenced, but not declared.

I am using &cent; only as an example; I tested several other HTML entity names, all with the same result.

When I use the decimal character reference instead of the HTML entity name, it works fine (in this case: &#162; instead of &cent;). The old GAS services behave the same way.

Any solution that can parse the above HTML in GAS is appreciated.


There are 3 answers

wivku

It appears to be a known issue: https://code.google.com/p/google-apps-script-issues/issues/detail?id=3565

To avoid the error you can prepend an XHTML doctype to the string, but note that the named entities are then filtered out of the result instead of being decoded:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html>H&auml;</html>

Workarounds are still welcome. At the moment I manually convert some of the frequently used HTML entity names to the decimal equivalent before parsing.
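The manual conversion mentioned above could look roughly like the sketch below. The lookup table `NAMED_ENTITIES` and the function name `namedToDecimal` are made up for illustration, and the table only covers a handful of names; extend it with whatever entities occur in your data.

```javascript
// Hypothetical lookup table (entity name -> Unicode code point).
// Not exhaustive: add the names that actually appear in your input.
var NAMED_ENTITIES = {
  cent: 162,
  auml: 228,
  Auml: 196,
  nbsp: 160,
  euro: 8364
};

// Replace known named entities with decimal character references.
// Names missing from the table (including the XML built-ins such as
// &amp; and &lt;) and numeric references are left untouched.
function namedToDecimal(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
      ? '&#' + NAMED_ENTITIES[name] + ';'
      : match;
  });
}
```

In Apps Script the converted string can then be parsed as in the question, e.g. XmlService.parse('<d>' + namedToDecimal(str) + '</d>').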

0
marcomow On

Old question, but I managed to solve it this way:

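function cleanHTML(html) {
  // Prepending the XHTML doctype avoids the "entity was referenced, but not declared" error.
  var doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
  var decoded = '';
  XmlService
    .parse(doctype + '<html>' + html + '</html>')
    .getRootElement()
    .getChildren()
    .forEach(function (el) {
      // Concatenate the text of each child element.
      // Text placed directly under <html> (outside any element) is not included.
      decoded += el.getValue();
    });
  //Logger.log(decoded)
  return decoded;
}

mal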

You can explicitly declare the entities at the start of the XML document:

<!DOCTYPE html [ <!ENTITY cent "&#x00A2;"> <!ENTITY Auml "&#x00C4;"> ]>
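As a rough sketch of how such declarations could be generated and used from Apps Script: the helper names `buildDoctype` and `decodeWithDeclaredEntities` are hypothetical, the entity map is whatever you choose to pass in, and the sketch assumes XmlService honors entities declared in the internal subset as described above. It uses `d` as the doctype name to match the root element from the question.

```javascript
// Build a DOCTYPE with an internal subset declaring each entity.
// `entities` is a hypothetical map of entity name -> Unicode code point.
function buildDoctype(rootName, entities) {
  var decls = Object.keys(entities).map(function (name) {
    var hex = entities[name].toString(16).toUpperCase();
    return '<!ENTITY ' + name + ' "&#x' + hex + ';">';
  });
  return '<!DOCTYPE ' + rootName + ' [ ' + decls.join(' ') + ' ]>';
}

// Apps Script only (XmlService is not available elsewhere):
// prepend the declarations, wrap the fragment in <d>, and parse it.
function decodeWithDeclaredEntities(html, entities) {
  var xml = buildDoctype('d', entities) + '<d>' + html + '</d>';
  return XmlService.parse(xml).getRootElement().getText();
}
```

For the string from the question, a call such as decodeWithDeclaredEntities(str, {cent: 162}) would be the starting point. Any entity name that is not declared in the map will still raise the "referenced, but not declared" error.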