Delete elements that don't match a regex with XSLT 2.0

The new version of the question (2023-10-01)

General overview

I am trying to build the table of contents (TOC) of a document by keeping only its title nodes, such as h1, h2, … h9 (h[0-9]), and deleting all the other nodes.

I tried to use the matches() function, which is only available from XSLT 2.0 onward; that is why I use Saxon.

For the moment, I have the following MWE:

Minimal working example (MWE)

document.xml

<?xml version="1.0" encoding="UTF-8"?>

<document>
<h1>Lorem ipsum dolor</h1>
<h2>Lorem ipsum dolor</h2>
<p>
Sed ut <i>perspiciatis</i> unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?
</p>

<h1>sit amet et consectetur</h1>
<h2>Quia adipit</h2>
<p>
Sed ut <i>perspiciatis</i> unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?
</p>

</document>

maketoc.xslt

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="*[not(*) and not(matches(name(), '^h[0-9]$'))]">
  </xsl:template>

</xsl:stylesheet>

The conversion command

saxon-xslt -o output.txt document.xml maketoc.xslt 

The rendering when I execute the command

Lorem ipsum dolor
Lorem ipsum dolor

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?


sit amet et consectetur
Quia adipit

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?

The problem

As you can see in the output, all the nodes remain. I am not able to delete the nodes whose names do not match h[0-9].

In the match attribute I used *[not(matches(name(), '^h[0-9]$'))] or *[not(*) and not(matches(name(), '^h[0-9]$'))], as suggested by Michael Kay, with the same result.

An intermediate solution

I finally arrived at the following maketoc.xslt, which isolates just the title nodes:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html"/>

    <!-- Template to match and output title nodes (h1 to h9) -->
    <xsl:template match="h1 | h2 | h3 | h4 | h5 | h6 | h7 | h8 | h9">
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text> <!-- Add a newline after each title -->
        <xsl:apply-templates select="node()"/> <!-- Process child nodes if needed -->
    </xsl:template>

    <!-- Template to skip all other nodes -->
    <xsl:template match="node()">
        <xsl:apply-templates select="node()"/> <!-- Process child nodes recursively -->
    </xsl:template>
</xsl:stylesheet>

But, as you can see, it doesn't really use a regex like h[0-9]. I have to list every possibility (h1, h2, … h9) explicitly, whereas the goal is to match them with a regex.
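For illustration, the enumerated pattern above could be written with matches() inside a predicate. This is only a minimal sketch, assuming an XSLT 2.0 processor, headings in no namespace, and the document.xml shown earlier; it is not taken from either answer.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Title elements (h0 to h9): print their text, then a newline -->
  <xsl:template match="*[matches(name(), '^h[0-9]$')]">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Every other element: print nothing, but keep looking for titles inside it.
       Selecting only child elements (*) means text nodes are never visited. -->
  <xsl:template match="*">
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```

The pattern with a predicate has a higher default priority than the bare * pattern, so title elements are handled by the first rule and everything else by the second. With the sample document, this should print the four titles, one per line.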

The question

So, how can I delete all the nodes that don't match the h[0-9] regex?


The old version of the question

To build a TOC for a document, I want to keep only the h[0-9] nodes and delete all the other nodes, which have no place in the TOC.

So, in XSLT 2.0, I wrote the following:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/> 


  <xsl:template match="*[not(matches(name(), '^h[0-9]$'))]"> <!-- This is the relevant line -->
  <!-- I left it empty in order to delete its content -->
  </xsl:template>


</xsl:stylesheet>

I run it using saxon-xslt -o output.txt example.xml maketoc.xslt

Unfortunately, the not(matches()) template doesn't affect the target nodes; in fact, it affects no node at all.

So, how can I delete all the nodes that don't match the h[0-9] regex?

There are 2 answers

2
Michael Kay On

You haven't shown your source document, but the most likely explanation is that the outermost element of the source document has a name that doesn't match h[0-9] which means that the element will be deleted and its children will not be processed.

Perhaps you should add a rule

<xsl:template match="*[*]"> 
  <xsl:apply-templates select="*"/>
</xsl:template>

to ensure that processing continues to the children of such an element.

Or you could change the pattern for elements that you want to delete to

match="*[not(*) and not(matches(name(), '^h[0-9]$'))]"
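As an illustration only, the two rules suggested above could be placed together in one stylesheet. This sketch assumes the text output method from the question; it is one reading of the answer, not its author's complete solution.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Elements that have child elements: continue to those children only,
       so text sitting directly inside such an element is skipped -->
  <xsl:template match="*[*]">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- Leaf elements whose name is not h0 to h9: produce nothing -->
  <xsl:template match="*[not(*) and not(matches(name(), '^h[0-9]$'))]"/>

</xsl:stylesheet>
```

Leaf title elements match neither rule, so the built-in template rules copy their text to the output. No separator is written between titles in this form; a dedicated rule for the title elements would be needed to add a newline after each one.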
0
Bryn Lewis On

You define a template, but you don't use it. You need to add an apply-templates in the root:

<xsl:apply-templates select="*" />

Otherwise, nothing gets selected.

This would go before the template. The value in the 'select' depends on what you want to match.

Note: in XSLT, you don't delete elements; you select the ones you want, i.e. anything not selected will not be in the output.

As @michael-kay noted, you probably need the apply-templates in the template as well.
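The point about selecting rather than deleting can be sketched as a single root template that picks out the title elements directly. This assumes an XSLT 2.0 processor (for matches()) and headings in no namespace; the for-each form is just one possible way to write it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Start at the document root and select only the elements wanted in the TOC -->
  <xsl:template match="/">
    <xsl:for-each select="//*[matches(name(), '^h[0-9]$')]">
      <!-- Print the title text followed by a newline -->
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Nothing else in the document is ever selected, so no rule is needed to remove the paragraphs.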