Editing DOCTYPE tag with BeautifulSoup

I need to add an ATTLIST declaration to the DOCTYPE tag in HTML documents.

After reading the documentation and googling, this is what I've come up with:

from bs4 import BeautifulSoup, Doctype

# minimal html document
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""

soup = BeautifulSoup(doc, features='html.parser')

# the modified doctype tag
doctype = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>] >"""

dt = BeautifulSoup(doctype, features='html.parser')

for item in soup.contents:
    if isinstance(item, Doctype):
        item.replace_with(dt)
        break

print(soup.prettify(formatter=None))

This produces the desired result, but it feels a bit "hacky". I'd like to insert just the ATTLIST part into the existing tag rather than replace the whole thing, as I've done here. Does anyone know how to do that?

1 Answer

Martin Evans

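A small improvement would be to build a Doctype object directly and replace the existing node with it. Note that the string passed to Doctype omits the leading <!DOCTYPE and the trailing >, which are added back on output. For example: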

from bs4 import BeautifulSoup, Doctype

# minimal html document
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""

# the modified doctype tag
doctype = """html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]"""

soup = BeautifulSoup(doc, features='html.parser')

for item in soup.contents:
    if isinstance(item, Doctype):
        item.replace_with(Doctype(doctype))
        break

print(soup.prettify(formatter=None))

This prints:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html>
</html>
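
If the goal is to avoid retyping the whole declaration, one option is to reuse the text of the existing Doctype node and append only the internal subset. Doctype is a string subclass, so it cannot be modified in place and a replacement node is still needed. The following is a minimal sketch under that assumption; add_internal_subset is a hypothetical helper name, not part of bs4, and it only looks at top-level nodes, as the loops above do.

```python
from bs4 import BeautifulSoup, Doctype


def add_internal_subset(soup, subset):
    """Swap the first top-level Doctype for a copy ending in [subset].

    Hypothetical helper, not part of bs4.
    Returns True if a Doctype was found and replaced, otherwise False.
    """
    old = next((n for n in soup.contents if isinstance(n, Doctype)), None)
    if old is None:
        return False
    # Doctype subclasses str, so build a new node from the old text
    # instead of trying to mutate the existing one.
    old.replace_with(Doctype(f"{str(old).rstrip()}\n[{subset}]"))
    return True


# same minimal html document as above
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""

soup = BeautifulSoup(doc, features="html.parser")
add_internal_subset(soup, "<!ATTLIST span bodyref CDATA #IMPLIED>")
print(soup.prettify(formatter=None))
```

This keeps whatever public and system identifiers the document already declares, so the same call should work across documents with different DOCTYPEs. It does not check whether the declaration already has an internal subset.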