CSV to XML XSLT: How to quote excape

1k views Asked by At

I've been working on CSV to XML conversions using this style-sheet: XSLT 2.0 to convert CSV to XML format

I needed to account for both comma and pipe delimiters so I changed the style sheet to this:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:param name="csv-uri" as="xs:string" select="'file:///c:/test.csv'"/>

    <xsl:template match="/" name="csv2xml">
        <Entity>
            <Rows>
                <xsl:choose>
                    <xsl:when test="unparsed-text-available($csv-uri)">
                        <xsl:variable name="csv" select="unparsed-text($csv-uri)" />

                        <xsl:variable name="pipe" select="'\|'"/>
                        <xsl:analyze-string  select="replace($csv,$pipe,',')" regex='\r\n?|\n' >

                            <xsl:non-matching-substring>
                                <xsl:if test="not(position()=0)" >
                                    <Row>
                                        <xsl:for-each select="tokenize(.,',')" >
                                            <xsl:element name="Column_{position()}">
                                                <xsl:value-of select="."/>
                                            </xsl:element>
                                        </xsl:for-each>
                                    </Row>
                                </xsl:if>
                            </xsl:non-matching-substring>
                        </xsl:analyze-string>

                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:variable name="error">
                            <xsl:text>Error reading file: "</xsl:text>
                            <xsl:value-of select="$csv-uri"/>
                        </xsl:variable>
                        <xsl:message>
                            <xsl:value-of select="$error"/>
                        </xsl:message>
                        <xsl:value-of select="$error"/>
                    </xsl:otherwise>
                </xsl:choose>
            </Rows>
        </Entity>
    </xsl:template>
</xsl:stylesheet>

However, I've been having some difficulty accounting for quoted values from the input.

Currently if a row in the csv is:

   1,2,3,4,5,"testing,1,1,1"
   red,white,blue,green
   dogs|cats|rabbits

the "testing,1,1,1" gets split into 4 columns into the CSV instead of one column.

output:

    <?xml version="1.0" encoding="UTF-8"?>
    <Entity>
       <Rows>
          <Row>
             <Column_1>1</Column_1>
             <Column_2>2</Column_2>
             <Column_3>3</Column_3>
             <Column_4>4</Column_4>
             <Column_5>5</Column_5>
             <Column_6>"testing</Column_6>
             <Column_7>1</Column_7>
             <Column_8>1</Column_8>
             <Column_9>1"</Column_9>
          </Row>
          <Row>
             <Column_1>red</Column_1>
             <Column_2>white</Column_2>
             <Column_3>blue</Column_3>
             <Column_4>green</Column_4>
          </Row>
          <Row>
             <Column_1>dogs</Column_1>
             <Column_2>cats</Column_2>
             <Column_3>rabbits</Column_3>
          </Row>
       </Rows>
    </Entity>

I've done some research and using regex='("[^"]*")+' can accomplish this. But I'm not entirely sure how to implement this without removing something I have that I need (maybe probably in the Analyze-string block?). I need help please! Its probably something simple, so please school me or point me in the right direction. Any advice would be helpful.

1

There are 1 answers

4
Daniel Haley On BEST ANSWER

You can just add another xsl:analyze-string to process the xsl:non-matching-substring from the first xsl:analyze-string...

CSV Input

1,2,3,4,5,"testing,1,1,1"
red,white,blue,green
dogs|cats|rabbits

Modified XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:param name="csv-uri" as="xs:string" select="'file:///c:/users/dhaley/desktop/so.csv'"/>

    <xsl:template match="/" name="csv2xml">
        <Entity>
            <Rows>
                <xsl:choose>
                    <xsl:when test="unparsed-text-available($csv-uri)">
                        <xsl:variable name="csv" select="unparsed-text($csv-uri)" />

                        <xsl:variable name="pipe" select="'\|'"/>
                        <xsl:analyze-string  select="replace($csv,$pipe,',')" regex='\r\n?|\n' >

                            <xsl:non-matching-substring>
                                <xsl:if test="not(position()=0)" >
                                    <Row>
                                        <xsl:analyze-string select="." regex="&quot;([^&quot;]*)&quot;,?|([^,]+),?">
                                            <xsl:matching-substring>
                                                <xsl:element name="Column_{position()}">
                                                    <xsl:value-of select="normalize-space(concat(regex-group(1),regex-group(2)))"/>
                                                </xsl:element>
                                            </xsl:matching-substring>
                                            <xsl:non-matching-substring>
                                                <xsl:element name="Column_{position()}"/>
                                            </xsl:non-matching-substring>
                                        </xsl:analyze-string>
                                    </Row>
                                </xsl:if>
                            </xsl:non-matching-substring>
                        </xsl:analyze-string>

                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:variable name="error">
                            <xsl:text>Error reading file: "</xsl:text>
                            <xsl:value-of select="$csv-uri"/>
                        </xsl:variable>
                        <xsl:message>
                            <xsl:value-of select="$error"/>
                        </xsl:message>
                        <xsl:value-of select="$error"/>
                    </xsl:otherwise>
                </xsl:choose>
            </Rows>
        </Entity>
    </xsl:template>
</xsl:stylesheet>

XML Output

<Entity>
   <Rows>
      <Row>
         <Column_1>1</Column_1>
         <Column_2>2</Column_2>
         <Column_3>3</Column_3>
         <Column_4>4</Column_4>
         <Column_5>5</Column_5>
         <Column_6>testing,1,1,1</Column_6>
      </Row>
      <Row>
         <Column_1>red</Column_1>
         <Column_2>white</Column_2>
         <Column_3>blue</Column_3>
         <Column_4>green</Column_4>
      </Row>
      <Row>
         <Column_1>dogs</Column_1>
         <Column_2>cats</Column_2>
         <Column_3>rabbits</Column_3>
      </Row>
   </Rows>
</Entity>

The regex from the second xsl:analyze-string is using captured substrings to ignore the quotes and the comma. Here's an easier to read version:

"([^"]*)",?|([^,]+),?

If you want to keep the quotes, move them inside the parens:

("[^"]*"),?|([^,]+),?