Regular expression for converting SGML to XML


I am converting SGML content to XML with the help of this link. Using the regular expression sgmlString.replaceAll("<(([^<>]+?)>)([^<>]+?)(?=<(?!\\1))", "<$1$3</$2>"); I am close to the expected result, but for the following file, where several consecutive tags share the same name and have no closing tag, only the last of them gets closed.

Input:

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417
    <ACCESSION-NUMBER>0001104659-17-052330
    <TYPE>8-K
    <PUBLIC-DOCUMENT-COUNT>4
    <PERIOD>20170816
    <ITEMS>7.01
    <ITEMS>8.16
    <FILING-DATE>20170817
    <DATE-OF-FILING-DATE-CHANGE>20170817
    <FILER>
        bye bye see you!
    </FILER>
</SEC-HEADER>

Output (note that only one closing ITEMS tag is produced and FILER is closed twice; neither is expected):

  <SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
     <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
     <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
     <TYPE>8-K</TYPE>
     <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
     <PERIOD>20170816</PERIOD>
     <ITEMS>7.01<ITEMS>8.16</ITEMS>
     <FILING-DATE>20170817</FILING-DATE>
     <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
     <FILER>bye bye see you!</FILER></FILER>
</SEC-HEADER>

Expected:

  <SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
         <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
         <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
         <TYPE>8-K</TYPE>
         <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
         <PERIOD>20170816</PERIOD>
         <ITEMS>7.01</ITEMS>
         <ITEMS>8.16</ITEMS>
         <FILING-DATE>20170817</FILING-DATE>
         <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
         <FILER>bye bye see you!</FILER>
    </SEC-HEADER>

I would appreciate your suggestions on the following questions:

  1. Is using a regular expression a good approach for adding the closing tags needed for XML? I have read that regular expressions are slow.
  2. I have fairly large files to process (up to 18000 lines/tags); is there a better way to do it?
  3. How can I get the expected result by changing the regular expression? (I am really weak in EL.)

There are 2 answers

Answer by Yunnosch

I have a solution in Perl. It relies on treating <SEC-HEADER> as a special case, which is built into the pattern.

Perl code:

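use strict;
use warnings;

# Read the whole input (files named on the command line, or STDIN) into one string.
my $Input = '';
while (<>)
{
    $Input .= $_;
}

# Close every unclosed tag whose text is followed by another opening tag.
# $1 = "NAME>", $2 = NAME, $3 = text, $4 = trailing whitespace (re-emitted after the new closing tag).
$Input =~ s/<((?!SEC-HEADER)([^\/<>]+?)>)([^<>]+?)(\s*?)(?=<[^\/])/<$1$3<\/$2>$4/g;
print $Input;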

To translate it to your tool (which I cannot test, so I have to guess its syntax), I suggest trying:

sgmlString.replaceAll("<((?!SEC-HEADER)([^/<>]+?)>)([^<>]+?)(\\s*?)(?=<[^/])", "<$1$3</$2>$4");

Sorry, you will have to polish any tool-specific mistakes yourself, perhaps by trial and error.
With my Perl version I got the following output, which I hope is close enough; it just does not eat the whitespace inside <FILER>.

Output:

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
    <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
    <TYPE>8-K</TYPE>
    <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
    <PERIOD>20170816</PERIOD>
    <ITEMS>7.01</ITEMS>
    <ITEMS>8.16</ITEMS>
    <FILING-DATE>20170817</FILING-DATE>
    <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
    <FILER>
        bye bye see you!
    </FILER>
</SEC-HEADER>

Details:

  • use the negative match with actually the found tag name instead of \1
  • / instead of \
  • at the start, expect a non-/
  • ignore the special tag-name SEC-HEADER, as you implicitly allowed
  • capture some whitespace and use it to get indentation and newlines right
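
The points above can be sketched as a small runnable program, assuming the "tool" is Java's java.util.regex (the `\\1` in the question suggests so). The class and method names are made up for illustration, and the sample input is a shortened version of the one in the question. Note that in a Java string literal the slash needs no escape, while `\s` must be written as `\\s`.

```java
import java.util.regex.Pattern;

public class LeafTagCloser {
    // Same pattern as the Perl substitution, written for java.util.regex.
    //   group 1: tag name plus '>'    group 2: tag name only
    //   group 3: the text content     group 4: trailing whitespace (kept for layout)
    // The lookahead (?=<[^/]) only accepts a following OPENING tag.
    private static final Pattern UNCLOSED_LEAF = Pattern.compile(
        "<((?!SEC-HEADER)([^/<>]+?)>)([^<>]+?)(\\s*?)(?=<[^/])");

    // Hypothetical helper: returns a new string with the closing tags added.
    static String closeLeafTags(String sgml) {
        return UNCLOSED_LEAF.matcher(sgml).replaceAll("<$1$3</$2>$4");
    }

    public static void main(String[] args) {
        String sgml = String.join("\n",
            "<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817",
            "    <ITEMS>7.01",
            "    <ITEMS>8.16",
            "    <FILER>",
            "        bye bye see you!",
            "    </FILER>",
            "</SEC-HEADER>");
        System.out.println(closeLeafTags(sgml));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which may matter for files with many thousands of tags. Strings are immutable in Java, so the result has to be taken from the return value.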

If you do want the whitespace eaten, here is a (Perl) replacement that does it:

$Input =~ s/<(?!\/)([^<>]+)>\s*([^<>]+[^\s<>])\s*<\/\1>/<$1>$2<\/$1>/g;

Guessed version for your tool
(again, sorry for any small mistakes; please polish them yourself):

sgmlString.replaceAll("<(?!/)([^<>]+)>\\s*([^<>]+[^\\s<>])\\s*</\\1>", "<$1>$2</$1>");

Output (applied after first code):

<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
    <ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
    <ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
    <TYPE>8-K</TYPE>
    <PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
    <PERIOD>20170816</PERIOD>
    <ITEMS>7.01</ITEMS>
    <ITEMS>8.16</ITEMS>
    <FILING-DATE>20170817</FILING-DATE>
    <DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
    <FILER>bye bye see you!</FILER>
</SEC-HEADER>
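
For reference, the whitespace-trimming step can be tried in isolation with the following sketch, again assuming Java's java.util.regex. The class and method names are hypothetical. The pattern expects an element that already has its closing tag and whose text is at least two characters long, so single-character text surrounded by whitespace is a corner case it does not fully handle.

```java
import java.util.regex.Pattern;

public class TagWhitespaceTrimmer {
    // group 1: tag name, matched again by the backreference \1 in the closing tag
    // group 2: the text, forced to end in a non-whitespace character
    private static final Pattern PADDED_ELEMENT = Pattern.compile(
        "<(?!/)([^<>]+)>\\s*([^<>]+[^\\s<>])\\s*</\\1>");

    // Hypothetical helper: strips whitespace between the tags and the text.
    static String trimElementText(String xml) {
        return PADDED_ELEMENT.matcher(xml).replaceAll("<$1>$2</$1>");
    }

    public static void main(String[] args) {
        String xml = String.join("\n",
            "    <FILER>",
            "        bye bye see you!",
            "    </FILER>");
        System.out.println(trimElementText(xml));
    }
}
```

Elements that are already tight, such as <ITEMS>7.01</ITEMS>, are matched and rewritten to the same text, so running this pass after the first one should leave them unchanged.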

Answer by imhotap

While it may work for the SGML at hand, in general using regexp match/replace is a terrible approach for converting SGML to XML, because SGML has tag omission/tag inference, attribute name and value omission (like in HTML), and other short forms and features not in the XML profile of SGML.
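
To make this concrete, here is a small sketch, assuming Java's java.util.regex and using the first pattern from the other answer. The input is a made-up fragment, not taken from the question: a container tag immediately followed by a child tag, and a child whose end tag is omitted right before the parent's closing tag. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ContainerTagPitfall {
    // Pattern from the other answer, written for java.util.regex.
    private static final Pattern UNCLOSED_LEAF = Pattern.compile(
        "<((?!SEC-HEADER)([^/<>]+?)>)([^<>]+?)(\\s*?)(?=<[^/])");

    static String closeLeafTags(String sgml) {
        return UNCLOSED_LEAF.matcher(sgml).replaceAll("<$1$3</$2>$4");
    }

    public static void main(String[] args) {
        // Made-up input: FILER is a container, NAME omits its end tag.
        String sgml = "<FILER>\n<NAME>ACME\n</FILER>";
        // The newline after <FILER> is taken as its "text", so FILER is
        // closed at once; NAME is followed by a closing tag, so the
        // lookahead fails and NAME stays open.
        System.out.println(closeLeafTags(sgml));
    }
}
```

The result is not well-formed XML: FILER ends up with two closing tags and NAME with none. A pattern cannot tell a container from a leaf without knowing the document's structure, which is what an SGML parser driven by a DTD provides.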

There is, however, a dedicated SGML-to-XML conversion program, osx, which I can fully recommend. Its source is available from http://openjade.sourceforge.net/. If you're on Debian/Ubuntu, you can install it via sudo apt-get install opensp, and if you're on macOS (using MacPorts, which you must install first) via sudo port install opensp (I don't know the Homebrew equivalent, though).