How to work around Groovy's XmlSlurper refusing to parse HTML due to DOCTYPE and DTD restrictions?

Question

12.4k views Asked by android.weasel At 10 June 2015 at 10:36

I'm trying to copy an element in an HTML coverage report, so the coverage totals appear at the top of the report as well as the bottom.

The HTML starts thus and I believe is well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <link rel="stylesheet" href=".resources/report.css" type="text/css" />
    <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
    <title>Unified coverage</title>
    <script type="text/javascript" src=".resources/sort.js"></script>
  </head>
  <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
[Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Enabling DOCTYPE:

doc = new XmlSlurper(false, false, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(false, true, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.


doc = new XmlSlurper(true, true, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(true, false, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I've covered all the options. There must be a way to get this working without resorting to regexps and risking the wrath of Tony The Pony.

There are 2 answers

ataylor On 10 June 2015 at 15:00

Even though your HTML also happens to be well-formed XML, a more general solution for parsing HTML is to use a true HTML parser. I've used the TagSoup parser in the past, and it handles real-world HTML quite well.

TagSoup provides a parser that implements the org.xml.sax.XMLReader interface, which XmlSlurper accepts in its constructor. Example:

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')

import org.ccil.cowan.tagsoup.Parser

def doc = new XmlSlurper(new Parser()).parse("index.html")

android.weasel On 10 June 2015 at 10:42 BEST ANSWER

Tsk.

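def parser = new XmlSlurper()
// Permit the DOCTYPE declaration itself...
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// ...but don't try to fetch the external DTD it points to
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def doc = parser.parse("index.html")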
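
For anyone who wants to try this without a report file on disk, here is a minimal sketch that applies the same two feature switches to an inline XHTML string. The sample markup, the `id="total"` paragraph and the variable names are made up for illustration and are not taken from the real coverage report. The import assumes Groovy 3 or later; on Groovy 2.x, XmlSlurper lives in groovy.util and the import line can be dropped.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on 2.x remove this line (groovy.util is imported by default)

// Hypothetical sample document with the same prolog and DOCTYPE as the report
def xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Unified coverage</title></head>
  <body><p id="total">Total: 87%</p></body>
</html>'''

def parser = new XmlSlurper()
// Accept the DOCTYPE declaration
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// Skip loading the external DTD, so no HTTP access is attempted
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def doc = parser.parseText(xhtml)

// GPath navigation starts at the root element (html)
def title = doc.head.title.text()
def total = doc.body.p.find { it.@id == 'total' }.text()
println "$title / $total"
```

Both features have to be set before calling parse or parseText, since they configure the underlying XML reader.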
