Sort strings, treating hyphen, slash, and space as equal, using UCA collation

Problem

I'm using Saxon-EE 11 and my platform's language is en-us.

I'm attempting to implement custom sorting behavior for an <xsl:sort> instruction by specifying a UCA collation. Ignoring the XML document details and just getting to the core, string-by-string comparison question, I want these strings:

ABSENTEES
ABSENTEE VOTING
MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)
MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT
MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD
MINNEAPOLIS
MINNEAPOLIS PORT AUTHORITY

to be sorted into this order:

ABSENTEE VOTING
ABSENTEES
MINNEAPOLIS
MINNEAPOLIS PORT AUTHORITY
MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD
MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT
MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)

Attempting to render the rules into English:

  1. A string that shares a common prefix with another string, but diverges from it at a space, should sort before that other string (ABSENTEE VOTING before ABSENTEES).
  2. Hyphens and slashes should be considered the same as spaces.

What I've tried

The UCA collation http://www.w3.org/2013/collation/UCA?alternate=shifted handles the MINNEAPOLIS* strings correctly, but it will put ABSENTEES before ABSENTEE VOTING.

The bare UCA collation http://www.w3.org/2013/collation/UCA handles ABSENTEES and ABSENTEE VOTING correctly, but will place the MINNEAPOLIS/SAINT PAUL and MINNEAPOLIS-SAINT PAUL strings after anything with MINNEAPOLIS and a space character.
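
As a sketch, the two attempts above correspond to `xsl:sort` instructions along the following lines. The `root`/`string` element names, the `shifted` and `bare` wrapper elements, and the `select="."` sort key are placeholders for illustration, not taken from the real document; only the two collation URIs come from the description above.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumes a hypothetical input shaped as <root><string>...</string>...</root> -->
  <xsl:template match="/root">
    <output>
      <!-- Attempt 1: UCA with alternate=shifted.
           Reported above: MINNEAPOLIS* strings in the wanted order,
           but ABSENTEES sorts before ABSENTEE VOTING. -->
      <shifted>
        <xsl:perform-sort select="string">
          <xsl:sort select="."
              collation="http://www.w3.org/2013/collation/UCA?alternate=shifted"/>
        </xsl:perform-sort>
      </shifted>

      <!-- Attempt 2: bare UCA.
           Reported above: ABSENTEE VOTING sorts before ABSENTEES,
           but the hyphen and slash strings land after the space ones. -->
      <bare>
        <xsl:perform-sort select="string">
          <xsl:sort select="."
              collation="http://www.w3.org/2013/collation/UCA"/>
        </xsl:perform-sort>
      </bare>
    </output>
  </xsl:template>

</xsl:stylesheet>
```

Running both sorts in one stylesheet makes it easy to compare the two orderings side by side.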

I've attempted a few other combinations of parameters, though none of them has produced anything closer to what I'm looking for. I'm close to giving up and implementing either a custom pre-processing before applying the collation or else dropping into a Java implementation.

If what I'm looking for is truly not achievable with UCA collations, that's good to know.

Answer by michael.hor257k:

Using an input of:

XML

<root>
    <string>ABSENTEES</string>
    <string>ABSENTEE VOTING</string>
    <string>MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)</string>
    <string>MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT</string>
    <string>MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD</string>
    <string>MINNEAPOLIS</string>
    <string>MINNEAPOLIS PORT AUTHORITY</string>
</root>

and the following stylesheet:

XSLT 2.0

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>

<xsl:template match="/root">
    <output>
        <xsl:perform-sort select="string">
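            <!-- Sort key only: hyphen and slash are mapped to a space; the output keeps the original strings -->
            <xsl:sort select="translate(., '-/', '  ')"/>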
        </xsl:perform-sort>
    </output>
</xsl:template>

</xsl:stylesheet>

I get:

Result

<?xml version="1.0" encoding="UTF-8"?>
<output>
   <string>ABSENTEE VOTING</string>
   <string>ABSENTEES</string>
   <string>MINNEAPOLIS</string>
   <string>MINNEAPOLIS PORT AUTHORITY</string>
   <string>MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD</string>
   <string>MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT</string>
   <string>MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)</string>
</output>
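
The `translate()` call affects only the sort key, so hyphens and slashes compare as spaces while the original strings are written to the output unchanged. The stylesheet above relies on the processor's default collation. If UCA ordering is also wanted for the remaining characters, the same key could probably be combined with the bare UCA collation, which the question reports as placing ABSENTEE VOTING before ABSENTEES. The following is an untested sketch; it assumes an XSLT 3.0 processor that recognizes the UCA collation URI, and the `collation` attribute is an addition that was not part of the stylesheet tested above.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/root">
    <output>
      <xsl:perform-sort select="string">
        <!-- Key: hyphen and slash replaced by a space (two spaces in the third argument).
             Collation: bare UCA, an assumption added for illustration. -->
        <xsl:sort select="translate(., '-/', '  ')"
            collation="http://www.w3.org/2013/collation/UCA"/>
      </xsl:perform-sort>
    </output>
  </xsl:template>

</xsl:stylesheet>
```

Other characters that should behave like a space can be added to the second argument of `translate()`, with one more space in the third argument for each.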