Saxon, XSLT: processing thousands of xml files in a complex tree structure


I use a Python script that iterates through thousands of XML files in a complex directory tree and executes the following Saxon command for each one:

java -cp C:\saxon\SaxonHE10-6J\saxon-he-10.6.jar net.sf.saxon.Transform -t -s:{input} -xsl:{xslt} -o:{output}

My final output is a single txt file; each line corresponds to one XML file and holds a selection of element values from it.

This works well, but performance is very poor. I suppose that is because my Python script launches the Saxon command (and therefore a new Java process) each time a new XML file is processed.

What would be the right approach to speeding up the process, if possible drastically?

Kind regards.

Excerpt from the python file:

for root, dirs, files in os.walk(folderXmlSource):

    for file in files:
        if file.endswith('.xml'):
            input = '"\\\\?\\' + str(os.path.join(root, file)) + '"'
            output = '"' + os.path.join(folderTxtTemp, file[:-4] + '.txt') + '"'
            try:
                transform(input, output)
                print(input, 'jjjjj', output)
                finalize(output)
            except:
                errorLog.write(input + '\n')

The transform function calls Saxon and runs the XSLT transformation. The finalize function appends the result of each file's transformation to the final result file.
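For readers who want to picture the concatenation step, here is a minimal sketch of what such a step could look like. It is not the script's actual finalize function, which is not shown; the function name `concatenate_results` and the assumption that each temporary .txt file holds one line are hypothetical.

```python
from pathlib import Path


def concatenate_results(temp_dir, final_path):
    """Write the content of every per-file .txt result in temp_dir to one final file.

    Assumption: each temporary file holds the single line produced for one XML file.
    Returns the number of temporary files that were read.
    """
    # Collect the parts first so the final file is never picked up by the glob
    parts = sorted(Path(temp_dir).glob("*.txt"))
    with open(final_path, "w", encoding="utf-8") as final:
        for part in parts:
            line = part.read_text(encoding="utf-8").strip()
            if line:  # skip empty results
                final.write(line + "\n")
    return len(parts)
```

Running the concatenation once at the end, instead of after every transformation, also avoids reopening the final file thousands of times.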

Excerpt from the XSL file:

<!--  //System:FileName  -->

    <xsl:variable name="System:FileName">
        <xsl:choose>
            <xsl:when test="//System:FileName">
                <xsl:choose>
                    <xsl:when test="//System:FileName !=''">
                        <xsl:value-of select="//System:FileName"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>System:FileName VIDE</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:when>
            <xsl:otherwise>
                <xsl:text>System:FileName ABSENT</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:variable>

The XSL file looks for specific elements such as System:FileName. If the element exists and is not empty, its value is stored in a variable; otherwise the variable receives a marker text (VIDE for empty, ABSENT for missing). The contents of all these variables are then concatenated into one line of the txt file.
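To make the three-way test explicit, the same logic can be sketched in Python with the standard library. This is only an illustration of what the xsl:choose does, not a replacement for the stylesheet. The namespace URI is a placeholder, because the real URI bound to the System prefix is not shown, and the helper name `value_or_marker` is hypothetical. It looks only at the first matching element.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the real one for the System prefix is not shown in the question
SYSTEM_NS = "http://example.com/system"


def value_or_marker(root, local_name, label, ns=SYSTEM_NS):
    """Return the element's text, or a VIDE/ABSENT marker as in the xsl:choose."""
    # root.iter() visits the root and all descendants, similar to // in XPath
    element = next(root.iter(f"{{{ns}}}{local_name}"), None)
    if element is None:
        return f"{label} ABSENT"
    text = "".join(element.itertext())
    if text == "":
        return f"{label} VIDE"
    return text
```

For example, `value_or_marker(ET.parse(path).getroot(), "FileName", "System:FileName")` would give the value for one file.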


There are 2 answers

Martin Honnen On

I would suggest trying SaxonC 12 (e.g. the PyPI package saxonche), so that the stylesheet is compiled once and everything runs inside a single Python process, and changing the Python code to e.g.

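import os

from saxonche import PySaxonProcessor

def transform(saxon_proc, xslt30_executable, input, output):
    # Parse the source once; use it as global context item and as match selection
    xdm_input = saxon_proc.parse_xml(xml_file_name=input)
    xslt30_executable.set_global_context_item(xdm_item=xdm_input)
    xslt30_executable.apply_templates_returning_file(xdm_value=xdm_input, output_file=output)


with PySaxonProcessor() as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()
    # Compile the stylesheet once and reuse the executable for every file
    xslt30_executable = xslt30_processor.compile_stylesheet(stylesheet_file='yourXsltStylesheet.xsl')

    # folderXmlSource, folderTxtTemp, finalize and errorLog come from the original script
    for root, dirs, files in os.walk(folderXmlSource):
        for file in files:
            if file.endswith('.xml'):
                # Assumption: no shell is involved here, so the surrounding quotes and the
                # \\?\ prefix of the command-line version are dropped; adjust if needed
                input = os.path.join(root, file)
                output = os.path.join(folderTxtTemp, file[:-4] + '.txt')
                try:
                    transform(saxon_proc, xslt30_executable, input, output)
                    print(input, 'jjjjj', output)
                    finalize(output)
                except Exception:
                    errorLog.write(input + '\n')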

See whether that alone already gives a drastic performance improvement.

You can then also consider using multithreading with Python and SaxonC, as done in https://github.com/martin-honnen/SaxonC12ThreadPoolExecutorXSLTTransformation, to further improve performance.
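As a rough idea of the thread pool side of that, here is a minimal sketch using only the Python standard library. It is not the code from the linked repository. `transform_one` is a hypothetical callable that wraps the SaxonC call for one file, and any per-thread setup that SaxonC may require is left to that callable; see the repository for a working version.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transform_all(jobs, transform_one, max_workers=4):
    """Run transform_one(input_path, output_path) for every job on a thread pool.

    jobs is an iterable of (input_path, output_path) pairs.
    Returns (done, failed): the input paths that succeeded and those that raised.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its input path so failures can be logged
        futures = {pool.submit(transform_one, src, dst): src for src, dst in jobs}
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                done.append(src)
            except Exception:
                failed.append(src)
    return done, failed
```

The failed list can then be written to the error log in one go, which keeps file writes out of the worker threads.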

I will look at the XSLT and whether it is possible to delegate all to a single XSLT next.

Martin Honnen On

As for using a single XSLT: you seem to look for all .xml files in a certain directory tree. Saxon can do that itself if you give it a file URI for that directory together with the URI query parameters ?select=*.xml;recurse=yes; see https://www.saxonica.com/html/documentation12/sourcedocs/collections/collection-directories.html.

So you would write a single XSLT taking a parameter for the source directory file URI e.g.

<xsl:param name="folder-uri" as="xs:string" expand-text="no">file:///C:/Users/foo/dir/subdir</xsl:param>

and then I would suggest using the uri-collection function on that with e.g.

<xsl:template name="xsl:initial-template">
  <xsl:variable name="input-uris" select="uri-collection($folder-uri || '?select=*.xml;recurse=yes')"/>
  <xsl:for-each select="$input-uris">
    <xsl:apply-templates select="doc(.)"/>
  </xsl:for-each>
</xsl:template>

If you then xsl:import your existing stylesheet into that main stylesheet and run it with -it from the command line without a source file (i.e. without the -s: option), or with e.g. call_template_returning_file(output_file=output) from the Python API, I would hope that it processes all input files and creates a single result file.
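For the command-line variant, the single call could be assembled from Python roughly as follows. This is a sketch under assumptions: the helper names `folder_uri` and `saxon_command` and the stylesheet name `main.xsl` are hypothetical, and the source directory must be an absolute Windows path because `as_uri()` rejects relative paths. The parameter is passed in Saxon's name=value form, and the query string is added inside the stylesheet, not here.

```python
from pathlib import PureWindowsPath


def folder_uri(windows_dir):
    """Turn an absolute Windows directory path into a file URI."""
    return PureWindowsPath(windows_dir).as_uri()


def saxon_command(saxon_jar, main_xslt, output, source_dir):
    """Build the argument list for one Saxon run over a whole directory tree."""
    return [
        "java", "-cp", saxon_jar, "net.sf.saxon.Transform",
        "-it",                      # start at xsl:initial-template, no -s: source
        f"-xsl:{main_xslt}",
        f"-o:{output}",
        f"folder-uri={folder_uri(source_dir)}",  # stylesheet parameter
    ]


command = saxon_command(
    r"C:\saxon\SaxonHE10-6J\saxon-he-10.6.jar",
    "main.xsl",                     # hypothetical name of the importing stylesheet
    "result.txt",
    r"C:\Users\foo\dir\subdir",
)
```

Passing the list to `subprocess.run(command, check=True)` starts Java once for all files, and because no shell is involved the paths need no extra quoting.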

As for individual files failing to parse or process, and keeping a record of those failures in a separate file: with XSLT 3.0 I think you can use xsl:iterate over the uri-collection with a nested xsl:try/xsl:catch. A minimal example is in https://github.com/martin-honnen/SaxonC12IterateTryCatchTest; the XSLT is in https://github.com/martin-honnen/SaxonC12IterateTryCatchTest/blob/master/iterate-try-catch-test1.xsl and basically does e.g.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:output method="xml" indent="no"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="input-uris" select="uri-collection('input-samples?select=*.xml')"/>
    <xsl:iterate select="$input-uris">
      <xsl:param name="failed-uris" as="xs:anyURI*" select="()"/>
      <xsl:on-completion>
        <xsl:if test="not(empty($failed-uris))">
          <xsl:result-document href="failed-uris.txt" method="text" item-separator="&#10;">
            <xsl:sequence select="$failed-uris"/>
          </xsl:result-document>
        </xsl:if>
      </xsl:on-completion>
      <xsl:try>
        <xsl:apply-templates select="doc(.)"/>
        <xsl:catch> 
          <xsl:next-iteration>
            <xsl:with-param name="failed-uris" select="$failed-uris, ."/>
          </xsl:next-iteration>
        </xsl:catch>
      </xsl:try>
    </xsl:iterate>
  </xsl:template>

</xsl:stylesheet>

In the context of the previous suggestion with apply-templates you could therefore try e.g.

<xsl:template name="xsl:initial-template">
  <xsl:variable name="input-uris" select="uri-collection($folder-uri || '?select=*.xml;recurse=yes')"/>
  <xsl:iterate select="$input-uris">
    <xsl:param name="failed-uris" as="xs:anyURI*" select="()"/>
    <xsl:on-completion>
      <xsl:if test="not(empty($failed-uris))">
        <xsl:result-document href="error-log.txt" method="text" item-separator="&#10;">
          <xsl:sequence select="$failed-uris"/>
        </xsl:result-document>
      </xsl:if>
    </xsl:on-completion>
    <xsl:try>
      <xsl:apply-templates select="doc(.)"/>
      <xsl:catch>
        <xsl:next-iteration>
          <xsl:with-param name="failed-uris" select="$failed-uris, ."/>
        </xsl:next-iteration>
      </xsl:catch>
    </xsl:try>
  </xsl:iterate>
</xsl:template>