DFDL Schema for parsing delimited text message


I need a little help with DFDL. I need to parse the message below into something like an XML/tree structure. The elements are not fixed; they are dynamic, and sometimes other elements will appear.

[Image: sample message; not reproduced here]

The expected XML/tree output is something like this:

<root>
<CLIENT_ID>DESKTOPCLIENT</CLIENT_ID>
<LOCALE>en-US</LOCALE>
<ENCODE/>
</root>

Answer by stevedlawrence

Something like this is a possible solution, tested in Daffodil:

<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format
        ref="GeneralFormat"
        lengthKind="delimited"
      />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="root" dfdl:initiator="%ESC;" dfdl:terminator="%SUB;">
    <xs:complexType>
      <xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix" dfdl:sequenceKind="unordered">
        <xs:element name="CLIENT_ID" type="xs:string" dfdl:initiator="CLIENT_ID%NAK;" />
        <xs:element name="LOCALE" type="xs:string" dfdl:initiator="LOCALE%NAK;" />
        <xs:element name="ENCODE" type="xs:string" dfdl:initiator="ENCODE%NAK;" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

Note that this assumes fixed names for the individual elements and that they all exist, though their order does not matter. If you know the fixed names but some elements may be missing, you can add minOccurs="0" to the elements in the unordered sequence.
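
As a rough sketch of that optional-element variant, the unordered sequence might look like the following. The dfdl:occursCountKind="parsed" attribute is an addition that is not in the schema above. It rests on the assumption that the DFDL specification requires optional members of an unordered sequence to use the "parsed" occurs count kind, whereas GeneralFormat is believed to default to "implicit". Check this against your processor before relying on it.

```xml
<xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix" dfdl:sequenceKind="unordered">
  <!-- minOccurs="0" makes each element optional; order still does not matter -->
  <!-- occursCountKind="parsed" is an assumption, see the note above -->
  <xs:element name="CLIENT_ID" type="xs:string" minOccurs="0"
    dfdl:occursCountKind="parsed" dfdl:initiator="CLIENT_ID%NAK;" />
  <xs:element name="LOCALE" type="xs:string" minOccurs="0"
    dfdl:occursCountKind="parsed" dfdl:initiator="LOCALE%NAK;" />
  <xs:element name="ENCODE" type="xs:string" minOccurs="0"
    dfdl:occursCountKind="parsed" dfdl:initiator="ENCODE%NAK;" />
</xs:sequence>
```

With this form, a message that contains only some of the three fields should still parse, and the missing fields are simply left out of the infoset.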

However, DFDL does not allow dynamic element names, so if you don't know the names in advance, you need a slightly different schema. Instead, describe the data as an unbounded number of name/value pairs, where the name and value are separated by %NAK;, for example:

  <xs:element name="root" dfdl:initiator="%ESC;" dfdl:terminator="%SUB;">
    <xs:complexType>
      <xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix">
        <xs:element name="element" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence dfdl:separator="%NAK;" dfdl:separatorPosition="infix">
              <xs:element name="name" type="xs:string" />
              <xs:element name="value" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

This results in an infoset that looks something like this:

<root>
  <element>
    <name>CLIENT_ID</name>
    <value>DESKTOPCLIENT</value>
  </element>
  <element>
    <name>LOCALE</name>
    <value>en-US</value>
  </element>
  <element>
    <name>ENCODE</name>
    <value></value>
  </element>
</root>

If you need the XML tags to match the name fields like in your question, you would then need to transform the infoset. XSLT can do this kind of transformation without much trouble.
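
A minimal XSLT 1.0 sketch of such a transformation is below. It assumes the infoset shape shown above (root/element/name/value with no namespaces) and that every name field is a valid XML element name. A name containing spaces or starting with a digit would make the transform fail.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" />

  <!-- Recreate the root element and process each name/value pair -->
  <xsl:template match="/root">
    <root>
      <xsl:apply-templates select="element" />
    </root>
  </xsl:template>

  <!-- Emit an element whose tag is the name field and whose content is the value field -->
  <xsl:template match="element">
    <xsl:element name="{normalize-space(name)}">
      <xsl:value-of select="value" />
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

An empty or nil value simply produces an empty element, so the ENCODE pair ends up as an empty ENCODE element, as in the question.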

Edit: There seems to be an issue where IBM DFDL does not like the above solution. I'm not sure why, but it works with Apache Daffodil. Something about value being the empty string causes an issue. After some trial and error, I've found that IBM DFDL (and Apache Daffodil too) are okay with it if you specify that empty value elements should be treated as nil. So changing the value element to this works:

<xs:element name="value" type="xs:string" nillable="true"
  dfdl:nilKind="literalValue" dfdl:nilValue="%ES;"
  dfdl:useNilForDefault="no"/>

In that case, the infoset ends up with something like this:

<element>
  <name>ENCODE</name>
  <value xsi:nil="true"></value>
</element>

Edit2: The nillable properties are required because otherwise IBM DFDL treats an empty string value as absent rather than as present with an empty value, and the absent value results in the error. Newer versions of the DFDL spec add a property, emptyElementParsePolicy, which controls whether empty strings are treated as absent or as empty strings. Daffodil implements this property as an extension and defaults to the treat-as-empty behavior, while IBM DFDL has the treat-as-absent behavior. Daffodil behaves like IBM DFDL when this property is set to treat as absent.
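
For Daffodil, setting this property on the top-level format might look roughly like the sketch below. The dfdlx prefix, its namespace URI, and the value names treatAsEmpty and treatAsAbsent are assumptions based on Daffodil's extension namespace and the behaviors described above. Verify them against the documentation for the Daffodil version in use.

```xml
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions">

  <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Assumed property and value names: "treatAsAbsent" is meant to mimic IBM DFDL, -->
      <!-- while "treatAsEmpty" would correspond to the Daffodil default described above -->
      <dfdl:format
        ref="GeneralFormat"
        lengthKind="delimited"
        dfdlx:emptyElementParsePolicy="treatAsAbsent"
      />
    </xs:appinfo>
  </xs:annotation>

  <!-- element declarations as in the schemas above -->

</xs:schema>
```

Switching the value is a quick way to reproduce the IBM DFDL behavior in Daffodil while testing a schema that must work on both processors.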