How to convert SGML into XML without a DTD?


I have several files of state laws in SGML format. Unfortunately, no DTD was included with the files, and the tags and structure of the documents are unclear. How can I create a DTD (if necessary) and convert the SGML files to XML files? A command-line tool would be ideal.


There are 3 answers

1
kjhughes On

James Clark's SP has an SGML to XML converter, sx.

You may find a more modern converter (I haven't looked), but you're unlikely to find one that's more complete and standards-compliant.
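For batch use, the converter can be driven from a small script. The following is only a sketch under some assumptions: it uses osx, the OpenSP build of sx, and assumes it is on your PATH; the helper names build_osx_command and convert are made up for illustration; and the -xlower option (prefer lower-case names) should be checked against the documentation of your installed version. Since the files have no DTD, the parser may report errors and still emit XML, so the sketch returns the exit code and messages instead of raising.

```python
import subprocess


def build_osx_command(src, lowercase_names=False):
    """Return the argv list for converting one SGML file with osx.

    Hypothetical helper: "osx" is assumed to be installed and on PATH.
    """
    cmd = ["osx"]
    if lowercase_names:
        # Assumption: -xlower asks osx to prefer lower-case names.
        cmd.append("-xlower")
    cmd.append(str(src))
    return cmd


def convert(src, dst):
    """Run osx on src and write whatever XML it emits on stdout to dst.

    Returns (exit_code, stderr_text) so the caller can inspect parser
    messages; a non-zero exit code does not necessarily mean no output.
    """
    with open(dst, "wb") as out:
        result = subprocess.run(
            build_osx_command(src),
            stdout=out,
            stderr=subprocess.PIPE,
        )
    return result.returncode, result.stderr.decode(errors="replace")
```

Calling convert("laws.sgml", "laws.xml") (file names are placeholders) would then leave the converter's output in laws.xml and hand back any diagnostics for review.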

0
Michael Kay On

Once you have converted the files to XML, as suggested by @kjhughes, there are a number of tools around for DTD generation. Most IDEs such as Oxygen and XMLSpy probably have one. There's also a freestanding one that I wrote years ago and which is available at https://github.com/Saxonica/Saxon-Archive/tree/main/DTDGenerator, or on SourceForge. It works in streaming mode, because I once had to analyze a multi-gigabyte file to find out what it contained.
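To give an idea of what such a generator does, here is a minimal sketch, not the DTDGenerator linked above and far less thorough. It walks the XML incrementally with Python's xml.etree.ElementTree.iterparse, records which child elements and attributes occur under each element, and prints a deliberately loose DTD (every element with children gets a mixed-content model, every attribute is CDATA #IMPLIED). The function names summarize_structure and to_dtd and the sample document are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_structure(source):
    """Map each element name to the child and attribute names seen under it."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    stack = []  # names of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if stack:
                children[stack[-1]].add(elem.tag)
            children[elem.tag]  # make sure leaf elements get an entry too
            attributes[elem.tag].update(elem.attrib)
            stack.append(elem.tag)
        else:
            stack.pop()
            elem.clear()  # discard content we have already recorded
    return children, attributes


def to_dtd(children, attributes):
    """Emit a loose DTD: mixed content everywhere, all attributes optional."""
    lines = []
    for name in sorted(children):
        kids = sorted(children[name])
        if kids:
            model = "(" + " | ".join(["#PCDATA"] + kids) + ")*"
        else:
            model = "(#PCDATA)"
        lines.append(f"<!ELEMENT {name} {model}>")
        for attr in sorted(attributes.get(name, ())):
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = b'<law id="1"><section><title>A</title><para>B</para></section></law>'
    print(to_dtd(*summarize_structure(io.BytesIO(sample))))
```

A real generator would also infer ordering, cardinality, and attribute types; this version only answers "which elements and attributes exist, and where", which is often enough for a first look at an undocumented vocabulary.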

0
imhotap On

Check out https://sgmljs.net (also published for Node.js at https://www.npmjs.com/package/sgml), which can be installed via npm install -g sgml.

The installed sgmlproc command-line utility supports DTD-less SGML (aka WebSGML) and can convert to XML using the -v output_format=xml option (see the manual page at https://sgmljs.net/docs/sgmlproc-manual.html). You can also read about conversion to XML at https://sgmljs.net/docs/parsing-html-tutorial/parsing-html-tutorial.html, although that tutorial is specifically about converting HTML to XHTML.

It would be helpful if you could provide an example of your source SGML here. SGML needs DTD declarations for exactly those SGML features that go beyond the XML subset of SGML, such as tag omission, SGML/HTML-style empty elements (without end-element tags), attribute short forms, short references, and more. Depending on whether such features are used in your source SGML, you might be required to make up and add a DTD and then use sgmlproc or osx, as described in the answers to "How to parse invalid (bad / not well-formed) XML?" and "Adding missing XML closing tags in Javascript".
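One rough way to check whether your files rely on tag omission or end-tag-less empty elements is to count start and end tags per element name. The sketch below does this with regular expressions; it is a heuristic, not an SGML parser. It ignores comments, marked sections, and short references, and it lowercases names on the assumption that your files use SGML's default case-insensitive element names. The function names and the sample text are made up for illustration.

```python
import re
from collections import Counter

# "<name" opens an element; "</name" closes one. Declarations ("<!")
# and processing instructions ("<?") do not match either pattern.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9._-]*)")
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9._-]*)")


def tag_inventory(text):
    """Count start tags and end tags per (lowercased) element name."""
    starts = Counter(name.lower() for name in START_TAG.findall(text))
    ends = Counter(name.lower() for name in END_TAG.findall(text))
    return starts, ends


def unbalanced(text):
    """Return {name: (start_count, end_count)} where the counts differ."""
    starts, ends = tag_inventory(text)
    names = sorted(set(starts) | set(ends))
    return {n: (starts[n], ends[n]) for n in names if starts[n] != ends[n]}


if __name__ == "__main__":
    sample = (
        "<LAW><SECTION NUM=1><TITLE>Scope</TITLE>"
        "<P>First.<P>Second.<BR></SECTION></LAW>"
    )
    for name, (opened, closed) in unbalanced(sample).items():
        print(f"{name}: {opened} start tag(s), {closed} end tag(s)")
```

Elements that show up here (p and br in the sample) are the ones a made-up DTD would need to declare, either with omissible end tags or as EMPTY, before a converter can produce well-formed XML.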