UIMA Ruta wordlist missing entries


I am using UIMA Ruta 2.7.0 with the following script:

DECLARE Substance;
WORDLIST EnzymeSearchList = 'enzyme.txt';
Document{-> MARKFAST(Substance, EnzymeSearchList, true)}; // true ignores case

enzyme.txt contains about 16,000 entries (one per line).

If I use a file containing only a few entries, for example 5, my further rules work without any problem. Once I provide the full list of thousands of entries, my results are incomplete.

Could the issue be caused by reaching a WORDLIST limit? Or maybe by the heap size? Nothing fails during program execution.

I have found a thread that specifically states:

There is no maximum size for the wordlists in UIMA Ruta. ... My largest wordlist consisted of about 500k entries

Answer by Peter Kluegl (best answer):

I assume that by incomplete you mean that several (obvious) entities have not been found/annotated in the document?

This is most likely caused by whitespace in the enzyme.txt file. Can you verify this, e.g., by removing all whitespace in the file and retesting the script?
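One way to run this check outside of Ruta is a small helper script. The following is only a rough sketch in Python, assuming the wordlist is a UTF-8 text file with one entry per line; the function names and the output file name enzyme_nows.txt are made up for illustration and are not part of UIMA Ruta.

```python
from pathlib import Path


def find_whitespace_entries(lines):
    """Return (line_number, entry) pairs for non-empty entries containing whitespace."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")  # ignore the line terminator itself
        if entry and any(ch.isspace() for ch in entry):
            hits.append((number, entry))
    return hits


def strip_all_whitespace(lines):
    """Return the entries with every whitespace character removed, dropping empty lines."""
    cleaned = ("".join(line.split()) for line in lines)
    return [entry for entry in cleaned if entry]


def check_wordlist(source, target):
    """Report entries with whitespace in `source` and write a stripped copy to `target`.

    Both paths are hypothetical examples; the file is assumed to be UTF-8.
    """
    lines = Path(source).read_text(encoding="utf-8").splitlines()
    hits = find_whitespace_entries(lines)
    stripped = strip_all_whitespace(lines)
    Path(target).write_text("\n".join(stripped) + "\n", encoding="utf-8")
    return hits
```

Calling `check_wordlist("enzyme.txt", "enzyme_nows.txt")` returns the entries that contain spaces, tabs, or trailing blanks together with their line numbers, and writes a whitespace-free copy that can be used in the WORDLIST declaration for the retest. Stripping everything is meant as a diagnostic step only, since it also joins multi-word entries.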

If the problem is caused by whitespace, there are several options to solve or avoid it. For example, you can set the configuration parameter 'dictRemoveWS' to true to remove the whitespace automatically when the dictionary is loaded.

Is upgrading to UIMA Ruta 2.8.1 (which should also fix this problem) an option?