OpenSP - Convert SGML to XML declaring html entity


On Windows 10, I use osx.exe from OpenSP to convert an SGML file to XML. The SGML contains HTML entities such as &nbsp;, &ndash;, &eacute; and many more.

The parser forces me to declare them:

reference to entity "ndash" for which no system identifier could be generated

So, in my DTD, I tried to declare them as follows:

<!ENTITY ndash "&#8211;"> 

But then I obtained this error:

"8211" is not a character number in the document character set

Finally, I tested adding the character itself:

<!ENTITY ndash "–"> 

And I obtained those errors:

non SGML character number 226

non SGML character number 8364

non SGML character number 8220

To answer @imhotap, I post here the SGML declaration given with my document:

<!SGML  "ISO 8879:1986"
-- Basic SGML declaration using Reference Concrete Syntax --
CHARSET
BASESET "ISO 646-1983//CHARSET
        International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
        0       9   UNUSED
        9       2   9
       11       2   UNUSED
       13       1   13
       14       18  UNUSED
       32       95  32
      127       1   UNUSED

CAPACITY SGMLREF
        TOTALCAP        35000
        ENTCAP          35000
        ENTCHCAP        35000
        ELEMCAP         35000
        GRPCAP          35000
        EXGRPCAP        35000
        EXNMCAP         35000
        ATTCAP          35000
        ATTCHCAP        35000
        AVGRPCAP        35000
        NOTCAP          35000
        NOTCHCAP        35000
        IDCAP           35000
        IDREFCAP        35000
        MAPCAP          35000
        LKSETCAP        35000
        LKNMCAP         35000

SCOPE    DOCUMENT

SYNTAX
        SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
                18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET  "ISO 646-1983//CHARSET
          International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET  0      128     0
FUNCTION RE     13
         RS     10
         SPACE  32
         TAB    SEPCHAR 9
NAMING   LCNMSTRT ""
         UCNMSTRT ""
         LCNMCHAR "-."
         UCNMCHAR "-."
         NAMECASE GENERAL YES
                  ENTITY  NO
DELIM    GENERAL SGMLREF
         SHORTREF SGMLREF
NAMES    SGMLREF
QUANTITY SGMLREF
        ATTCNT          250
        ATTSPLEN        960
        BSEQLEN         960
        DTAGLEN         16
        DTEMPLEN        16
        ENTLVL          16
        GRPCNT          250
        GRPGTCNT        96
        GRPLVL          16
        LITLEN          900
        NAMELEN         50
        NORMSEP         2
        PILEN           240
        TAGLEN          960
        TAGLVL          40

FEATURES
MINIMIZE DATATAG NO     OMITTAG  YES     RANK     NO     SHORTTAG YES
LINK     SIMPLE  NO     IMPLICIT NO     EXPLICIT NO
OTHER    CONCUR  NO     SUBDOC   NO     FORMAL   NO
APPINFO NONE>

I then declare the entities in the DTD as follows:

<!DOCTYPE gp [ 
<!ENTITY % MYDTD    SYSTEM  ".\my_dtd.dtd">
<!ENTITY ndash SDATA "&#8211;">
%MYDTD;
]>

How can I handle these HTML entities in the SGML-to-XML conversion?

Answer by imhotap:

It's difficult to tell when you're not including the SGML that osx complains about, but the error messages you're receiving occur because osx is assuming an incorrect document character set. Most probably, osx is told which document character set to assume by a so-called SGML declaration, or a reference to one, at the beginning of your file. Theoretically, it's also possible that osx assumes a different default character set on Windows machines or for your particular locale, is given Windows-style byte order marks, or derives an SGML declaration via catalog resolution rules.

At least, osx doesn't complain on my Unix machine when run on the following test document with its implicit SGML declaration defaults:

<!DOCTYPE test [
  <!ENTITY ndash "&#8211;">
  <!ELEMENT test - - (#PCDATA)>
]>
<test>&ndash;</test>
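
The three numbers in the question's "non SGML character" errors fit the character set explanation above. As a minimal Python sketch, assuming the file is saved as UTF-8 and is being read as Windows-1252 (an inference from the numbers alone, not something confirmed about osx), the three bytes of an en dash decode to exactly those character numbers:

```python
# U+2013 EN DASH as stored in a UTF-8 file: three bytes.
utf8_bytes = "\u2013".encode("utf-8")

# Assumption: the bytes are interpreted as Windows-1252 rather than UTF-8.
misread = utf8_bytes.decode("cp1252")

# Character numbers a parser would see after the misreading.
code_points = [ord(ch) for ch in misread]

print(list(utf8_bytes))  # raw byte values in the file
print(code_points)       # character numbers after misreading
```

If the numbers match in your case too, the literal character in the entity declaration is not the problem; the encoding used to read the file is.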

For a detailed explanation of SGML declarations, see e.g. https://sgmljs.net/docs/sgmlrefman.html#sgml-declaration. Note that sgmljs.net SGML supports ISO 8879 Annex K (aka WebSGML) and makes use of predefined entities for HTML in the SGML declaration, as described in https://sgmljs.net/docs/w3c-html51-sgmldecl.html. OpenSP's osx doesn't (fully) support WebSGML, so there you need to declare these as entities in the DTD, just like you're already doing. You may be able to sidestep your problem by declaring them as SDATA entities to make the error messages go away; that is, by declaring them as

<!ENTITY ndash SDATA "&#8211;">
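
Since every HTML entity used in the document needs its own declaration, the list can be generated rather than typed. The following is only a sketch: it relies on Python's standard html.entities.name2codepoint table (HTML 4 names), and the function name entity_declarations and its sdata flag are made up for illustration. Entities whose replacement text would be an SGML delimiter (amp, lt, gt, quot) are skipped because they need separate handling.

```python
from html.entities import name2codepoint

# Skipped on purpose: their replacement text would be markup delimiters.
SKIP = {"amp", "lt", "gt", "quot"}

def entity_declarations(names=None, sdata=False):
    """Return one <!ENTITY ...> declaration per requested HTML 4 entity name.

    names: iterable of entity names, or None for every known name.
    sdata: if True, emit the SDATA form of the declaration.
    """
    keyword = "SDATA " if sdata else ""
    selected = sorted(name2codepoint) if names is None else names
    return [
        f'<!ENTITY {name} {keyword}"&#{name2codepoint[name]};">'
        for name in selected
        if name not in SKIP
    ]

for line in entity_declarations(["nbsp", "ndash", "eacute"], sdata=True):
    print(line)
```

The printed lines can be pasted into the internal subset of the DOCTYPE. Entity names are case-sensitive under NAMECASE ENTITY NO, which matches the table, where eacute and Eacute are separate entries.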

If that doesn't work, or the resulting output file causes trouble, you could include the following SGML declaration, taken from https://sgmljs.net/docs/sgmlrefman.html#sgml-declaration-for-html5, as the first thing in your SGML. The important part is the line 160 55136 160 in the DESCSET (described character set) section, which tells the SGML parser that 55136 UCS code points starting at 160 (that is, 160 through 55295) are allowed in the document. Note that the BASESET is UCS (ISO/IEC 10646), with the document assumed to be encoded as UTF-8, which might or might not match your document data. Moreover, this SGML declaration switches on tag inference, attribute name omission, and other options that are appropriate for HTML but not necessarily for your SGML; I have no way of telling.

<!SGML "ISO 8879:1986 (WWW)"
CHARSET
         BASESET   "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
               17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- ampersand --
                  NESTC    "/"
                  NET      ">"
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   120     -- increased for HTML 5 --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   150     -- increased for HTML 5 --

FEATURES
        MINIMIZE DATATAG  NO
                 OMITTAG  YES
                 RANK     NO
                 SHORTTAG
                          STARTTAG EMPTY    NO
                                   UNCLOSED NO
                                   NETENABL IMMEDNET
                          ENDTAG   EMPTY    NO
                                   UNCLOSED NO
                          ATTRIB   DEFAULT  YES
                                   OMITNAME YES
                                   VALUE    YES
                 EMPTYNRM YES
                 IMPLYDEF ATTLIST  YES
                          DOCTYPE  NO
                          ELEMENT  YES
                          ENTITY   NO
                          NOTATION NO
         LINK
                 SIMPLE   NO
                 IMPLICIT NO
                 EXPLICIT NO
         OTHER
                 CONCUR   NO
                 SUBDOC   NO
                 FORMAL   NO
                 URN      NO
                 KEEPRSRE YES
                 VALIDITY NOASSERT
                 ENTITIES
                          REF      ANY
                          INTEGRAL NO
APPINFO NONE
>
<!-- your instance document following here eg.: -->
<!DOCTYPE test [
    <!ENTITY ndash "&#8211;">
    <!ELEMENT test - - (#PCDATA)>
]>
<test>&ndash;</test>

Update: Based on the SGML declaration you added to your question, here's an edited version of it that allows ndash and other UCS code points from 256 upward (the range 128 through 255 stays UNUSED). I've changed the base character set as in my earlier example above and added a described-set range for the additional code points, but have otherwise copied the details of your SGML declaration. Keep in mind that you're basically expanding the character set of your document; if you don't want to do that, you can instead map ndash to a plain U+002D HYPHEN-MINUS (&#45;) and leave your SGML declaration as it is, which keeps your document character set within 7-bit ASCII (i.e. the "IRV" code set).

<!SGML  "ISO 8879:1986"
CHARSET
BASESET   "ISO Registration Number 177//CHARSET
           ISO/IEC 10646-1:1993 UCS-4 with
           implementation level 3//ESC 2/5 2/15 4/6"

DESCSET
        0       9   UNUSED
        9       2   9
       11       2   UNUSED
       13       1   13
       14       18  UNUSED
       32       95  32
      127       1   UNUSED
      128       128 UNUSED
      256       55136   256

CAPACITY SGMLREF
        TOTALCAP        35000
        ENTCAP          35000
        ENTCHCAP        35000
        ELEMCAP         35000
        GRPCAP          35000
        EXGRPCAP        35000
        EXNMCAP         35000
        ATTCAP          35000
        ATTCHCAP        35000
        AVGRPCAP        35000
        NOTCAP          35000
        NOTCHCAP        35000
        IDCAP           35000
        IDREFCAP        35000
        MAPCAP          35000
        LKSETCAP        35000
        LKNMCAP         35000

SCOPE    DOCUMENT

SYNTAX
        SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
                18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET   "ISO Registration Number 177//CHARSET
           ISO/IEC 10646-1:1993 UCS-4 with
           implementation level 3//ESC 2/5 2/15 4/6"
DESCSET  0      128     0
FUNCTION RE     13
         RS     10
         SPACE  32
         TAB    SEPCHAR 9
NAMING   LCNMSTRT ""
         UCNMSTRT ""
         LCNMCHAR "-."
         UCNMCHAR "-."
         NAMECASE GENERAL YES
                  ENTITY  NO
DELIM    GENERAL SGMLREF
         SHORTREF SGMLREF
NAMES    SGMLREF
QUANTITY SGMLREF
        ATTCNT          250
        ATTSPLEN        960
        BSEQLEN         960
        DTAGLEN         16
        DTEMPLEN        16
        ENTLVL          16
        GRPCNT          250
        GRPGTCNT        96
        GRPLVL          16
        LITLEN          900
        NAMELEN         50
        NORMSEP         2
        PILEN           240
        TAGLEN          960
        TAGLVL          40

FEATURES
MINIMIZE DATATAG NO     OMITTAG  YES     RANK     NO     SHORTTAG YES
LINK     SIMPLE  NO     IMPLICIT NO     EXPLICIT NO
OTHER    CONCUR  NO     SUBDOC   NO     FORMAL   NO
APPINFO NONE>
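
Each DESCSET row reads start, count, base, so a row covers the character numbers start through start + count - 1. As a rough check of the edited declaration above, this Python sketch copies its DESCSET rows into a table and tests the three entities mentioned in the question. The helper name is_declared is made up for illustration, and the result only reflects the rows as written here, not how osx itself evaluates them.

```python
# DESCSET rows from the edited declaration: (start, count, base).
# A base of None stands for UNUSED.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 128, None),
    (256, 55136, 256),
]

def is_declared(number, descset=DESCSET):
    """Return True if `number` lies in a row that maps to a base character."""
    for start, count, _base in descset:
        if start <= number < start + count:
            return _base is not None
    return False  # not covered by any row

for name, number in [("nbsp", 160), ("ndash", 8211), ("eacute", 233)]:
    print(name, number, is_declared(number))
```

Under this reading, ndash (8211) is covered, while nbsp (160) and eacute (233) fall inside the 128 128 UNUSED row. If those two are needed as well, that row would presumably have to be narrowed in the way the HTML5 declaration does it (128 32 UNUSED followed by a range starting at 160).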