JLanguageTool: don't ignore digits in words


I'm using JLanguageTool with German (de-DE) for spellchecking and noticed that digits seem to be treated as word separators (just like spaces). For example, We8lt is not reported as a single misspelled word but as two spelling errors (one for We and one for lt), and bis8 is not reported as an error at all.

Example call (I'm using it as a Java library, but the behaviour is the same on the command line):

$ echo "Hallo We8lt bis8 Test" | java -jar languagetool-commandline.jar -l de-DE -
Expected text language: German (Germany)
Working on STDIN...

1.) Line 1, column 7, Rule ID: GERMAN_SPELLER_RULE prio=-3
Message: Möglicher Tippfehler gefunden.
Suggestion: WE; Der; Den; Des; Dem
Hallo We8lt bis8 Test 
      ^^              

2.) Line 1, column 10, Rule ID: GERMAN_SPELLER_RULE prio=-3
Message: Möglicher Tippfehler gefunden.
Suggestion: LT; als; lag; alt; elf
Hallo We8lt bis8 Test 
         ^^           

Time: 1618ms for 1 sentences (0.6 sentences/sec)

This is a big problem for us because, for example, missing spaces between words and numbers are not detected. How can I get the library/tool to stop treating digits as word separators? Thanks a lot.

Answer by F. Knorr

Yes, you are right: LanguageTool treats numbers as word separators in German.

To change this behaviour, you have to modify the source code: change this line in GermanSpellerRule.java from

String pattern = "(" + nonWordPattern.pattern() + "|(?<=[\\d°])-|-(?=\\d+))";

to

String pattern = ("(" + nonWordPattern.pattern() + "|(?<=[\\d°])-|-(?=\\d+))").replace("{L}", "{L}\\d");
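To see what that replace() call changes, here is a minimal, self-contained sketch. It assumes a simplified stand-in for nonWordPattern ("any character that is not a letter"); the real expression in LanguageTool may differ, and the class and method names are hypothetical. With the original pattern a digit counts as a non-word character, so We8lt is split into We and lt, and bis8 is reduced to the valid word bis. After the replacement, digits are part of the excluded character class and the word stays in one piece.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DigitSplitDemo {

    // Simplified stand-in for LanguageTool's non-word pattern (an assumption,
    // not the library's actual expression): any character that is not a letter.
    static final Pattern NON_WORD = Pattern.compile("[^\\p{L}]");

    // Builds the splitting pattern the same way as the original line.
    static String originalPattern() {
        return "(" + NON_WORD.pattern() + "|(?<=[\\d°])-|-(?=\\d+))";
    }

    // Same pattern after the suggested replace(): "[^\p{L}]" becomes "[^\p{L}\d]",
    // so digits no longer count as non-word characters.
    static String patchedPattern() {
        return originalPattern().replace("{L}", "{L}\\d");
    }

    // Splits the text into the pieces that would be checked as separate words.
    static List<String> split(String text, String pattern) {
        return Arrays.asList(text.split(pattern));
    }

    public static void main(String[] args) {
        System.out.println(split("We8lt", originalPattern())); // [We, lt]
        System.out.println(split("We8lt", patchedPattern()));  // [We8lt]
        System.out.println(split("bis8", originalPattern()));  // [bis]
    }
}
```

With the patched pattern, We8lt and bis8 each reach the dictionary lookup as a single word, so each would be reported once as unknown.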

Alternatively, you could add another rule to grammar.xml that flags missing spaces before or after numbers:

<rule id="RULE" name="rule">
<pattern>
    <token regexp="yes">[a-zäöüß]+\d+[a-zäöüß]*</token>
</pattern>
<message>Fehlt hier ein Leerzeichen?</message>
<example correction=""><marker>P4sswort</marker>.</example>
</rule>

See also: LanguageTool rule editor
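Before adding the rule, it can help to check which tokens its expression would flag. The sketch below runs the same regular expression with plain java.util.regex. It assumes case-insensitive, whole-token matching to approximate how LanguageTool evaluates a token with regexp="yes"; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class TokenRegexDemo {

    // The token expression from the grammar rule. Case-insensitive matching is
    // an assumption made here to approximate LanguageTool's token matching.
    static final Pattern TOKEN = Pattern.compile(
            "[a-zäöüß]+\\d+[a-zäöüß]*",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // True if the whole token matches, i.e. the rule would flag it.
    static boolean flagged(String token) {
        return TOKEN.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"P4sswort", "We8lt", "bis8", "Hallo", "8bis"}) {
            System.out.println(t + " -> " + flagged(t));
        }
    }
}
```

Note that the expression requires at least one letter before the digits, so a token such as 8bis is not matched; covering that case would need a second alternative in the pattern.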