ICU (UCA) support for Frisian collation

In Frisian the y is treated as an i and sorts just after it; see http://download.mimer.com/pub/developer/charts/frisian.htm.

I am trying to sort data with the Saxonica XQuery processor (Saxon), using either the Frisian language code or explicit collation rules; see http://saxonica.com/html/documentation/extensibility/config-extend/collation/

So far no luck: I have tried several combinations of settings and nothing seems to work, including with the latest ICU4J on the classpath. ICU does support Frisian, although I doubt whether its collation is right.

Does anyone have experience with this and can give me some pointers?

Bye, Eduard

There is 1 answer

Martin Honnen (best answer)

I don't know whether ICU supports that language, and you haven't said which language code you used. Based on https://stackoverflow.com/a/48439714/252228, though, I copied the markup of the table at http://download.mimer.com/pub/developer/charts/frisian.htm into an XQuery file that generates the collation element for a Saxon configuration:

let $fy-table := <table xmlns="http://www.w3.org/1999/xhtml" summary="Chart">
<tbody><tr>
<td class="p" title="LATIN SMALL LETTER A [409C.002.02]">a<br /><tt>0061</tt></td>
<td class="t" title="LATIN CAPITAL LETTER A [409C.002.08]">A<br /><tt>0041</tt></td>
<td class="s" title="LATIN SMALL LETTER A WITH CIRCUMFLEX [409C.002.02][0000.048.02]">â<br /><tt>00E2</tt></td>
<td class="t" title="LATIN CAPITAL LETTER A WITH CIRCUMFLEX [409C.002.08][0000.048.02]">Â<br /><tt>00C2</tt></td>
<td class="s" title="LATIN SMALL LETTER A WITH DIAERESIS [409C.002.02][0000.050.02]">ä<br /><tt>00E4</tt></td>
<td class="t" title="LATIN CAPITAL LETTER A WITH DIAERESIS [409C.002.08][0000.050.02]">Ä<br /><tt>00C4</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER B [409E.002.02]">b<br /><tt>0062</tt></td>
<td class="t" title="LATIN CAPITAL LETTER B [409E.002.08]">B<br /><tt>0042</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER C [40A0.002.02]">c<br /><tt>0063</tt></td>
<td class="t" title="LATIN CAPITAL LETTER C [40A0.002.08]">C<br /><tt>0043</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER D [40A4.002.02]">d<br /><tt>0064</tt></td>
<td class="t" title="LATIN CAPITAL LETTER D [40A4.002.08]">D<br /><tt>0044</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER E [40A6.002.02]">e<br /><tt>0065</tt></td>
<td class="t" title="LATIN CAPITAL LETTER E [40A6.002.08]">E<br /><tt>0045</tt></td>
<td class="s" title="LATIN SMALL LETTER E WITH ACUTE [40A6.002.02][0000.042.02]">é<br /><tt>00E9</tt></td>
<td class="t" title="LATIN CAPITAL LETTER E WITH ACUTE [40A6.002.08][0000.042.02]">É<br /><tt>00C9</tt></td>
<td class="s" title="LATIN SMALL LETTER E WITH CIRCUMFLEX [40A6.002.02][0000.048.02]">ê<br /><tt>00EA</tt></td>
<td class="t" title="LATIN CAPITAL LETTER E WITH CIRCUMFLEX [40A6.002.08][0000.048.02]">Ê<br /><tt>00CA</tt></td>
<td class="s" title="LATIN SMALL LETTER E WITH DIAERESIS [40A6.002.02][0000.050.02]">ë<br /><tt>00EB</tt></td>
<td class="t" title="LATIN CAPITAL LETTER E WITH DIAERESIS [40A6.002.08][0000.050.02]">Ë<br /><tt>00CB</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER F [40A8.002.02]">f<br /><tt>0066</tt></td>
<td class="t" title="LATIN CAPITAL LETTER F [40A8.002.08]">F<br /><tt>0046</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER G [40AA.002.02]">g<br /><tt>0067</tt></td>
<td class="t" title="LATIN CAPITAL LETTER G [40AA.002.08]">G<br /><tt>0047</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER H [40AC.002.02]">h<br /><tt>0068</tt></td>
<td class="t" title="LATIN CAPITAL LETTER H [40AC.002.08]">H<br /><tt>0048</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER I [40AE.002.02]">i<br /><tt>0069</tt></td>
<td class="t" title="LATIN CAPITAL LETTER I [40AE.002.08]">I<br /><tt>0049</tt></td>
<td class="s" title="LATIN SMALL LETTER I WITH DIAERESIS [40AE.002.02][0000.050.02]">ï<br /><tt>00EF</tt></td>
<td class="t" title="LATIN CAPITAL LETTER I WITH DIAERESIS [40AE.002.08][0000.050.02]">Ï<br /><tt>00CF</tt></td>
<td class="s" title="LATIN SMALL LETTER Y [40AE.003.02]">y<br /><tt>0079</tt></td>
<td class="t" title="LATIN CAPITAL LETTER Y [40AE.003.08]">Y<br /><tt>0059</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER J [40B0.002.02]">j<br /><tt>006A</tt></td>
<td class="t" title="LATIN CAPITAL LETTER J [40B0.002.08]">J<br /><tt>004A</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER K [40B2.002.02]">k<br /><tt>006B</tt></td>
<td class="t" title="LATIN CAPITAL LETTER K [40B2.002.08]">K<br /><tt>004B</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER L [40B4.002.02]">l<br /><tt>006C</tt></td>
<td class="t" title="LATIN CAPITAL LETTER L [40B4.002.08]">L<br /><tt>004C</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER M [40B6.002.02]">m<br /><tt>006D</tt></td>
<td class="t" title="LATIN CAPITAL LETTER M [40B6.002.08]">M<br /><tt>004D</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER N [40B8.002.02]">n<br /><tt>006E</tt></td>
<td class="t" title="LATIN CAPITAL LETTER N [40B8.002.08]">N<br /><tt>004E</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER O [40BC.002.02]">o<br /><tt>006F</tt></td>
<td class="t" title="LATIN CAPITAL LETTER O [40BC.002.08]">O<br /><tt>004F</tt></td>
<td class="s" title="LATIN SMALL LETTER O WITH CIRCUMFLEX [40BC.002.02][0000.048.02]">ô<br /><tt>00F4</tt></td>
<td class="t" title="LATIN CAPITAL LETTER O WITH CIRCUMFLEX [40BC.002.08][0000.048.02]">Ô<br /><tt>00D4</tt></td>
<td class="s" title="LATIN SMALL LETTER O WITH DIAERESIS [40BC.002.02][0000.050.02]">ö<br /><tt>00F6</tt></td>
<td class="t" title="LATIN CAPITAL LETTER O WITH DIAERESIS [40BC.002.08][0000.050.02]">Ö<br /><tt>00D6</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER P [40BE.002.02]">p<br /><tt>0070</tt></td>
<td class="t" title="LATIN CAPITAL LETTER P [40BE.002.08]">P<br /><tt>0050</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER Q [40C0.002.02]">q<br /><tt>0071</tt></td>
<td class="t" title="LATIN CAPITAL LETTER Q [40C0.002.08]">Q<br /><tt>0051</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER R [40C2.002.02]">r<br /><tt>0072</tt></td>
<td class="t" title="LATIN CAPITAL LETTER R [40C2.002.08]">R<br /><tt>0052</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER S [40C4.002.02]">s<br /><tt>0073</tt></td>
<td class="t" title="LATIN CAPITAL LETTER S [40C4.002.08]">S<br /><tt>0053</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER T [40C6.002.02]">t<br /><tt>0074</tt></td>
<td class="t" title="LATIN CAPITAL LETTER T [40C6.002.08]">T<br /><tt>0054</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER U [40C8.002.02]">u<br /><tt>0075</tt></td>
<td class="t" title="LATIN CAPITAL LETTER U [40C8.002.08]">U<br /><tt>0055</tt></td>
<td class="s" title="LATIN SMALL LETTER U WITH ACUTE [40C8.002.02][0000.042.02]">ú<br /><tt>00FA</tt></td>
<td class="t" title="LATIN CAPITAL LETTER U WITH ACUTE [40C8.002.08][0000.042.02]">Ú<br /><tt>00DA</tt></td>
<td class="s" title="LATIN SMALL LETTER U WITH CIRCUMFLEX [40C8.002.02][0000.048.02]">û<br /><tt>00FB</tt></td>
<td class="t" title="LATIN CAPITAL LETTER U WITH CIRCUMFLEX [40C8.002.08][0000.048.02]">Û<br /><tt>00DB</tt></td>
<td class="s" title="LATIN SMALL LETTER U WITH DIAERESIS [40C8.002.02][0000.050.02]">ü<br /><tt>00FC</tt></td>
<td class="t" title="LATIN CAPITAL LETTER U WITH DIAERESIS [40C8.002.08][0000.050.02]">Ü<br /><tt>00DC</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER V [40CA.002.02]">v<br /><tt>0076</tt></td>
<td class="t" title="LATIN CAPITAL LETTER V [40CA.002.08]">V<br /><tt>0056</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER W [40CC.002.02]">w<br /><tt>0077</tt></td>
<td class="t" title="LATIN CAPITAL LETTER W [40CC.002.08]">W<br /><tt>0057</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER X [40CE.002.02]">x<br /><tt>0078</tt></td>
<td class="t" title="LATIN CAPITAL LETTER X [40CE.002.08]">X<br /><tt>0058</tt></td>
</tr>
<tr>
<td class="p" title="LATIN SMALL LETTER Z [40D2.002.02]">z<br /><tt>007A</tt></td>
<td class="t" title="LATIN CAPITAL LETTER Z [40D2.002.08]">Z<br /><tt>005A</tt></td>
</tr>
</tbody></table>
return 

<collations>
      <collation uri="http://example.com/fy"
      rules="{string-join(('', $fy-table//*:td/text()[1]/normalize-space()), ' &lt; ')}"/>
</collations>

I then took the result of that query and used oXygen to create a Saxon configuration file containing

<configuration edition="EE" xmlns="http://saxon.sf.net/ns/configuration">
     <collations>
          <collation uri="http://example.com/fy"
               rules="&lt; a &lt; A &lt; â &lt; Â &lt; ä &lt; Ä &lt; b &lt; B &lt; c &lt; C &lt; d &lt; D &lt; e &lt; E &lt; é &lt; É &lt; ê &lt; Ê &lt; ë &lt; Ë &lt; f &lt; F &lt; g &lt; G &lt; h &lt; H &lt; i &lt; I &lt; ï &lt; Ï &lt; y &lt; Y &lt; j &lt; J &lt; k &lt; K &lt; l &lt; L &lt; m &lt; M &lt; n &lt; N &lt; o &lt; O &lt; ô &lt; Ô &lt; ö &lt; Ö &lt; p &lt; P &lt; q &lt; Q &lt; r &lt; R &lt; s &lt; S &lt; t &lt; T &lt; u &lt; U &lt; ú &lt; Ú &lt; û &lt; Û &lt; ü &lt; Ü &lt; v &lt; V &lt; w &lt; W &lt; x &lt; X &lt; z &lt; Z"/>
     </collations>
</configuration>

Now when I run Saxon in oXygen with that configuration file (or pass it on the Saxon command line with e.g. -config:fy-collations-saxon.config.xml) to run the XQuery

sort(('L', 'J', 'Y', 'I', 'A'), 'http://example.com/fy')

the result is A I Y J L.
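
As far as I know, the rules attribute of a Saxon collation uses the syntax of java.text.RuleBasedCollator, so the ordering can also be checked in plain Java without Saxon. The following is only a sketch under that assumption: the class name FrisianRulesSketch and its helper methods are made up for illustration, and the rule string is a simplified version of the one above that keeps only the unaccented letters (with y right after i) and leaves out the accented variants.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class FrisianRulesSketch {
    // Base letters in the order of the chart: y comes directly after i.
    // Accented letters from the full rule string are omitted here.
    static final String ORDER = "abcdefghiyjklmnopqrstuvwxz";

    // Builds "< a < A < b < B ..." in the same style as the rules attribute.
    static String buildRules() {
        StringBuilder sb = new StringBuilder();
        for (char c : ORDER.toCharArray()) {
            sb.append("< ").append(c)
              .append(" < ").append(Character.toUpperCase(c))
              .append(' ');
        }
        return sb.toString().trim();
    }

    // Returns a sorted copy of the input using a collator built from the rules.
    static String[] sortWords(String... words) throws ParseException {
        RuleBasedCollator collator = new RuleBasedCollator(buildRules());
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) throws ParseException {
        // Same input as the XQuery sort() call above.
        System.out.println(String.join(" ", sortWords("L", "J", "Y", "I", "A")));
    }
}
```

With the same five letters as in the XQuery call, this prints A I Y J L, matching the Saxon result. Note that a rule string written this way makes every listed letter, including upper versus lower case, a separate primary step; that is what the generated configuration above does as well.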