I am converting SGML content to XML content with the help of this link.
Using the regular expression
sgmlString.replaceAll("<(([^<>]+?)>)([^<>]+?)(?=<(?!\\1))", "<$1$3</$2>");
I am close to the expected result, but for the following file, where several consecutive tags have the same name and no closing tag, it closes only the last of them.
Input:
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417
<ACCESSION-NUMBER>0001104659-17-052330
<TYPE>8-K
<PUBLIC-DOCUMENT-COUNT>4
<PERIOD>20170816
<ITEMS>7.01
<ITEMS>8.16
<FILING-DATE>20170817
<DATE-OF-FILING-DATE-CHANGE>20170817
<FILER>
bye bye see you!
</FILER>
</SEC-HEADER>
Output (note that ITEMS is closed only once and FILER is closed twice, which is not expected):
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
<ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
<TYPE>8-K</TYPE>
<PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
<PERIOD>20170816</PERIOD>
<ITEMS>7.01<ITEMS>8.16</ITEMS>
<FILING-DATE>20170817</FILING-DATE>
<DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
<FILER>bye bye see you!</FILER></FILER>
</SEC-HEADER>
Expected:
<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817
<ACCEPTANCE-DATETIME>20170817060417</ACCEPTANCE-DATETIME>
<ACCESSION-NUMBER>0001104659-17-052330</ACCESSION-NUMBER>
<TYPE>8-K</TYPE>
<PUBLIC-DOCUMENT-COUNT>4</PUBLIC-DOCUMENT-COUNT>
<PERIOD>20170816</PERIOD>
<ITEMS>7.01</ITEMS>
<ITEMS>8.16</ITEMS>
<FILING-DATE>20170817</FILING-DATE>
<DATE-OF-FILING-DATE-CHANGE>20170817</DATE-OF-FILING-DATE-CHANGE>
<FILER>bye bye see you!</FILER>
</SEC-HEADER>
I would appreciate your suggestions/guidance on the following questions:
- Is it a good approach to use a regular expression to add the closing tags and produce XML? I have read that regular expressions are slow.
- I have quite heavy files to process (up to 18000 lines/tags); is there a better way to achieve this?
- How can I get the expected result by changing the regular expression? (I am really weak in EL.)
I have a solution in Perl. It is based on special treatment of <SEC-HEADER>, incorporating it. Perl code:
To translate it to your tool (which I cannot test on, so I have to guess at its syntax), I propose trying:
Sorry, you will have to polish a few tool-specific mistakes yourself, maybe by trial and error.
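As a minimal Java sketch of the idea described above (closing each unclosed tag while treating SEC-HEADER specially), something like the following might work. The class name SgmlTagCloser, the method closeTags and the exact pattern are assumptions made for illustration and are not necessarily identical to the Perl version. The negative lookahead checks for an existing closing tag of the same name instead of an opening one, so repeated ITEMS tags are each closed and FILER is not closed twice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class SgmlTagCloser {
    // Group 1: tag name; SEC-HEADER and closing tags (leading '/') are skipped.
    // Group 2: the text, ending in a non-whitespace character.
    // Group 3: trailing whitespace, kept outside the newly added closing tag.
    // Lookahead: another tag follows, and it is not already </name>.
    private static final Pattern UNCLOSED = Pattern.compile(
            "<((?!SEC-HEADER>)[^<>/][^<>]*)>([^<>]*?[^<>\\s])(\\s*)(?=<(?!/\\1>))");

    static String closeTags(String sgml) {
        return UNCLOSED.matcher(sgml).replaceAll("<$1>$2</$1>$3");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "<SEC-HEADER>0001104659-17-052330.hdr.sgml : 20170817",
                "<ITEMS>7.01",
                "<ITEMS>8.16",
                "<FILER>",
                "bye bye see you!",
                "</FILER>",
                "</SEC-HEADER>");
        // ITEMS is closed twice, FILER keeps its single closing tag,
        // and the whitespace inside FILER is left untouched.
        System.out.println(closeTags(input));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling it on every call, as String.replaceAll does, which may matter for files with many thousands of tags.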
With my Perl version I got the following output, which I hope is close enough; it just does not eat the whitespace inside <FILER>. Output:
Details:
- \1
- / instead of \/
- SEC-HEADER, as you implicitly allowed
If you do want the whitespace eaten, here is a (Perl) replace to do that:
Guessed version for your tool (again, sorry for little mistakes, please polish them yourself):
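One way such a whitespace-eating replace might look in Java is sketched below. The class name SgmlWhitespaceTrimmer, the method trimInsideTags and the pattern are assumptions for illustration, not necessarily the same as the Perl replace. It only touches elements whose content is plain text and which already have a matching closing tag, so it is meant to run after the closing tags have been added.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class SgmlWhitespaceTrimmer {
    // Group 1: tag name. Group 2: the text without surrounding whitespace.
    // [^<>]*? cannot cross another tag, so elements with children are left alone.
    private static final Pattern PADDED = Pattern.compile(
            "<([^<>/][^<>]*)>\\s*([^<>]*?)\\s*</\\1>");

    static String trimInsideTags(String xml) {
        return PADDED.matcher(xml).replaceAll("<$1>$2</$1>");
    }

    public static void main(String[] args) {
        String input = "<ITEMS>8.16</ITEMS>\n<FILER>\nbye bye see you!\n</FILER>\n</SEC-HEADER>";
        // FILER collapses onto one line; ITEMS and </SEC-HEADER> are unchanged.
        System.out.println(trimInsideTags(input));
    }
}
```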
Output (applied after first code):