Converting CSV to hierarchichal XML using XSLT

3.5k views Asked by At

I need to create an XSLT to convert a CSV (comma separated file) into hierarchical XML.

This is the input file:

<root>
L11,L12,L21,L22,L31,L32
1,A,1,C,1,G
1,A,1,C,2,H
1,A,2,D,1,I
1,A,2,D,2,J
2,B,1,E,1,K
2,B,1,E,2,L
2,B,2,F,1,M
2,B,2,F,2,N
</root>

This is desired output XML:

<?xml version="1.0" encoding="utf-8"?>
<Document>
  <Level1>
    <L11>1</L11>
    <L12>A</L12>
    <Level2>
      <L21>1</L11>
      <L22>C</L12>
      <Level3>
        <L31>1</L31>
        <L32>G</L32>
      </Level3>
      <Level3>
        <L31>2</L31>
        <L32>H</L32>
      </Level3>
    </Level2>
    <Level2>
      <L21>2</L11>
      <L22>D</L12>
      <Level3>
        <L31>1</L31>
        <L32>I</L32>
      </Level3>
      <Level3>
        <L31>2</L31>
        <L32>J</L32>
      </Level3>
    </Level2>
  </Level1>
  <Level1>
    <L11>2</L11>
    <L12>B</L12>
    <Level2>
      <L21>1</L11>
      <L22>E</L12>
      <Level3>
        <L31>1</L31>
        <L32>K</L32>
      </Level3>
      <Level3>
        <L31>2</L31>
        <L32>L</L32>
      </Level3>
    </Level2>
    <Level2>
      <L21>2</L11>
      <L22>F</L12>
      <Level3>
        <L31>1</L31>
        <L32>M</L32>
      </Level3>
      <Level3>
        <L31>2</L31>
        <L32>N</L32>
      </Level3>
    </Level2>
  </Level1>
</Document>

I've been trying to find some example online, however couldn't find anything similar. I've never done XSLT transformations before so I'd appreciate if you could point me in the right direction.

Update 1: I am thinking of a 2-step transformation. E.g. first step is to transform CSV to XML:

<?xml version="1.0" encoding="utf-8"?>
<Document>
  <row><L11>1</L11><L12>A</L12><L21>1</L12><L31>C</L31><L32>1</L31><L32>G</L32></row>
  <row><L11>1</L11><L12>A</L12><L21>1</L12><L31>C</L31><L32>2</L31><L32>H</L32></row>
  <row><L11>1</L11><L12>A</L12><L21>2</L12><L31>D</L31><L32>1</L31><L32>I</L32></row>
  <row><L11>1</L11><L12>A</L12><L21>2</L12><L31>D</L31><L32>2</L31><L32>J</L32></row>
  <row><L11>2</L11><L12>B</L12><L21>1</L12><L31>E</L31><L32>1</L31><L32>K</L32></row>
  <row><L11>2</L11><L12>B</L12><L21>1</L12><L31>E</L31><L32>2</L31><L32>L</L32></row>
  <row><L11>2</L11><L12>B</L12><L21>2</L12><L31>F</L31><L32>1</L31><L32>M</L32></row>
  <row><L11>2</L11><L12>B</L12><L21>2</L12><L31>F</L31><L32>2</L31><L32>N</L32></row>   
</Document>

And a second step is to transform that XML into the desired format using some sort of grouping. I don't mind having 2 transformations if there's no other way to achieve that.

Any advice please?

Update 2: Microsoft .NET Framework XSLT processor will be used.

If the abstract example is hard to read you can see a real-life example of the required transformation here: http://servingxml.sourceforge.net/examples/#timesheets-eg

As I understand, using a single transformation is impossible, so if someone could show me how to transform an XML from the Update 1 format to the desired XML format, half of the job would done and I will accept that answer.

2

There are 2 answers

0
michael.hor257k On BEST ANSWER

For clarity, I have modified your input slightly, so that the labels make some sense:

XML

<root>
GroupName,GroupValue,SubGroupName,SubGroupValue,ItemName,ItemValue
1,A,1,C,1,G
1,A,1,C,2,H
1,A,2,D,1,I
1,A,2,D,2,J
2,B,1,E,1,K
2,B,1,E,2,L
2,B,2,F,1,M
2,B,2,F,2,N
</root>

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:key name="k1" match="row" use="cell[1]"/>
<xsl:key name="k2" match="row" use="concat(cell[1], '|', cell[3])"/>

<xsl:template match="/">
    <!-- tokenize csv -->
    <xsl:variable name="rows">
        <xsl:call-template name="tokenize">
            <xsl:with-param name="text" select="root"/>
        </xsl:call-template>
    </xsl:variable>
    <xsl:variable name="data">
        <xsl:for-each select="exsl:node-set($rows)/row[position() > 1]">
            <row>
                <xsl:call-template name="tokenize">
                    <xsl:with-param name="text" select="."/>
                    <xsl:with-param name="delimiter" select="','"/>
                    <xsl:with-param name="name" select="'cell'"/>
                </xsl:call-template>
            </row>
        </xsl:for-each>
    </xsl:variable>
    <!-- output -->
    <document>
        <xsl:for-each select="exsl:node-set($data)/row[count(. | key('k1', cell[1])[1]) = 1]">
            <group>
                <name>
                    <xsl:value-of select="cell[1]"/>
                </name>
                <value>
                    <xsl:value-of select="cell[2]"/>
                </value>
                <xsl:for-each select="key('k1', cell[1])[count(. | key('k2', concat(cell[1], '|', cell[3]))[1]) = 1]">
                    <subgroup>
                        <name>
                            <xsl:value-of select="cell[3]"/>
                        </name>
                        <value>
                            <xsl:value-of select="cell[4]"/>
                        </value>
                        <items>
                            <xsl:for-each select="key('k2', concat(cell[1], '|', cell[3]))">
                                <item>
                                    <name>
                                        <xsl:value-of select="cell[5]"/>
                                    </name>
                                    <value>
                                        <xsl:value-of select="cell[6]"/>
                                    </value>
                                </item>
                            </xsl:for-each>
                        </items>
                    </subgroup>
                </xsl:for-each>
            </group>
        </xsl:for-each>
    </document>
</xsl:template>

<xsl:template name="tokenize">
    <xsl:param name="text"/>
    <xsl:param name="delimiter" select="'&#10;'"/>
    <xsl:param name="name" select="'row'"/>
    <xsl:variable name="token" select="substring-before(concat($text, $delimiter), $delimiter)" />
    <xsl:if test="$token">
        <xsl:element name="{$name}">
            <xsl:value-of select="$token"/>
        </xsl:element>
    </xsl:if>
    <xsl:if test="contains($text, $delimiter)">
        <!-- recursive call -->
        <xsl:call-template name="tokenize">
            <xsl:with-param name="text" select="substring-after($text, $delimiter)"/>
            <xsl:with-param name="delimiter" select="$delimiter"/>
            <xsl:with-param name="name" select="$name"/>
        </xsl:call-template>
    </xsl:if>
</xsl:template>

</xsl:stylesheet>

Result

<?xml version="1.0" encoding="UTF-8"?>
<document>
   <group>
      <name>1</name>
      <value>A</value>
      <subgroup>
         <name>1</name>
         <value>C</value>
         <items>
            <item>
               <name>1</name>
               <value>G</value>
            </item>
            <item>
               <name>2</name>
               <value>H</value>
            </item>
         </items>
      </subgroup>
      <subgroup>
         <name>2</name>
         <value>D</value>
         <items>
            <item>
               <name>1</name>
               <value>I</value>
            </item>
            <item>
               <name>2</name>
               <value>J</value>
            </item>
         </items>
      </subgroup>
   </group>
   <group>
      <name>2</name>
      <value>B</value>
      <subgroup>
         <name>1</name>
         <value>E</value>
         <items>
            <item>
               <name>1</name>
               <value>K</value>
            </item>
            <item>
               <name>2</name>
               <value>L</value>
            </item>
         </items>
      </subgroup>
      <subgroup>
         <name>2</name>
         <value>F</value>
         <items>
            <item>
               <name>1</name>
               <value>M</value>
            </item>
            <item>
               <name>2</name>
               <value>N</value>
            </item>
         </items>
      </subgroup>
   </group>
</document>

Note:

  1. The element names are hard-coded into the stylesheet and not taken from the input (although that too would be possible with more effort);

  2. You may have to use the msxsl:node-set() function instead of the EXSLT one.

1
Martin Honnen On

Well, the data you have presented is not XML, so to tackle that problem you would at least need XSLT 2.0 like implemented by Saxon 9, XmlPrime, Exselt or Altova. I think in a first step one could transform the data in comma separated lines into XML elements, then in a second step one could transform the input, using grouping. It seems to me a problem where a composite grouping key use as supported in XSLT 3.0 could help so the following is XSLT 3.0:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:array="http://www.w3.org/2005/xpath-functions/array"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs math array mf" version="3.0">

    <xsl:param name="sep" as="xs:string" select="','"/>

    <xsl:output indent="yes"/>

    <xsl:function name="mf:nest" as="element(Level)*">
        <xsl:param name="levels" as="xs:string*"/>
        <xsl:for-each-group select="$levels" group-by="substring(., 1, 1)">
            <xsl:if test="current-grouping-key() != ''">
                <Level index="{current-grouping-key()}">
                    <xsl:sequence select="mf:nest(current-group() ! substring(., 2))"/>
                </Level>
            </xsl:if>
        </xsl:for-each-group>
    </xsl:function>

    <xsl:function name="mf:group" as="element()*">
        <xsl:param name="rows" as="element(row)*"/>
        <xsl:param name="levels" as="element(Level)*"/>
        <xsl:param name="index" as="xs:integer"/>
        <xsl:variable name="current-level" as="element(Level)?" select="$levels[1]"/>
        <xsl:if test="$current-level">
            <xsl:variable name="indices-of-current-level" select="$current-level/Level/@index!(. + $index)"/>
            <xsl:for-each-group select="$rows" group-by="cell[position() = $indices-of-current-level]" composite="yes">
                <Level index="{$current-level/@index}">
                    <xsl:for-each select="current-grouping-key()">
                        <Data index="L{$current-level/@index}{position()}">
                            <xsl:value-of select="."/>
                        </Data>
                    </xsl:for-each>
                    <xsl:sequence select="mf:group(current-group(), $levels[position() gt 1], $index + count(current-grouping-key()))"/>
                </Level>
            </xsl:for-each-group>
        </xsl:if>   
    </xsl:function>

    <xsl:template match="root">
        <xsl:variable name="lines" select="tokenize(., '(\r?\n)+')[normalize-space()]"/>

        <xsl:variable name="levels" select="tokenize(normalize-space($lines[1]), $sep)"/>

        <xsl:variable name="nesting" select="mf:nest($levels ! substring(., 2))"/>

        <xsl:variable name="data" select="$lines[position() gt 1]"/>
        <xsl:variable name="rows" as="element(row)*">
            <xsl:for-each select="$data">
                <row>
                    <xsl:for-each select="tokenize(normalize-space(), $sep)">
                        <cell>
                            <xsl:value-of select="."/>
                        </cell>
                    </xsl:for-each>
                </row>
            </xsl:for-each>
        </xsl:variable>
        <document>
            <!-- only for debugging respectively to show the intermediate XML data structure used for further processing -->
            <xsl:copy-of select="$nesting"/>
            <xsl:copy-of select="$rows"/>

            <xsl:sequence select="mf:group($rows, $nesting, 0)"/>
        </document>
    </xsl:template>


</xsl:stylesheet>

That gives the result

<?xml version="1.0" encoding="UTF-8"?>
<document>
   <Level index="1">
      <Level index="1"/>
      <Level index="2"/>
   </Level>
   <Level index="2">
      <Level index="1"/>
      <Level index="2"/>
   </Level>
   <Level index="3">
      <Level index="1"/>
      <Level index="2"/>
   </Level>
   <row>
      <cell>1</cell>
      <cell>A</cell>
      <cell>1</cell>
      <cell>C</cell>
      <cell>1</cell>
      <cell>G</cell>
   </row>
   <row>
      <cell>1</cell>
      <cell>A</cell>
      <cell>1</cell>
      <cell>C</cell>
      <cell>2</cell>
      <cell>H</cell>
   </row>
   <row>
      <cell>1</cell>
      <cell>A</cell>
      <cell>2</cell>
      <cell>D</cell>
      <cell>1</cell>
      <cell>I</cell>
   </row>
   <row>
      <cell>1</cell>
      <cell>A</cell>
      <cell>2</cell>
      <cell>D</cell>
      <cell>2</cell>
      <cell>J</cell>
   </row>
   <row>
      <cell>2</cell>
      <cell>B</cell>
      <cell>1</cell>
      <cell>E</cell>
      <cell>1</cell>
      <cell>K</cell>
   </row>
   <row>
      <cell>2</cell>
      <cell>B</cell>
      <cell>1</cell>
      <cell>E</cell>
      <cell>2</cell>
      <cell>L</cell>
   </row>
   <row>
      <cell>2</cell>
      <cell>B</cell>
      <cell>2</cell>
      <cell>F</cell>
      <cell>1</cell>
      <cell>M</cell>
   </row>
   <row>
      <cell>2</cell>
      <cell>B</cell>
      <cell>2</cell>
      <cell>F</cell>
      <cell>2</cell>
      <cell>N</cell>
   </row>
   <Level index="1">
      <Data index="L11">1</Data>
      <Data index="L12">A</Data>
      <Level index="2">
         <Data index="L21">1</Data>
         <Data index="L22">C</Data>
         <Level index="3">
            <Data index="L31">1</Data>
            <Data index="L32">G</Data>
         </Level>
         <Level index="3">
            <Data index="L31">2</Data>
            <Data index="L32">H</Data>
         </Level>
      </Level>
      <Level index="2">
         <Data index="L21">2</Data>
         <Data index="L22">D</Data>
         <Level index="3">
            <Data index="L31">1</Data>
            <Data index="L32">I</Data>
         </Level>
         <Level index="3">
            <Data index="L31">2</Data>
            <Data index="L32">J</Data>
         </Level>
      </Level>
   </Level>
   <Level index="1">
      <Data index="L11">2</Data>
      <Data index="L12">B</Data>
      <Level index="2">
         <Data index="L21">1</Data>
         <Data index="L22">E</Data>
         <Level index="3">
            <Data index="L31">1</Data>
            <Data index="L32">K</Data>
         </Level>
         <Level index="3">
            <Data index="L31">2</Data>
            <Data index="L32">L</Data>
         </Level>
      </Level>
      <Level index="2">
         <Data index="L21">2</Data>
         <Data index="L22">F</Data>
         <Level index="3">
            <Data index="L31">1</Data>
            <Data index="L32">M</Data>
         </Level>
         <Level index="3">
            <Data index="L31">2</Data>
            <Data index="L32">N</Data>
         </Level>
      </Level>
   </Level>
</document>

using Saxon PE 9.6 in Oxygen 18 or the XSLT 3.0 processor in Altova XMLSpy 2017. You can of course remove the lines outputting the intermediate data structure and you can change the creation of the final XML to output elements names containing the level in the name but I prefer to have names that can be represented by a schema and to put any counter or level index into an attribute.