UIMA Ruta. Retrieve phrases separated by WS (spaces, breaks, etc.)


I want to extract phrases that are delimited by spaces, line breaks, and punctuation.

I've spent a lot of time trying to find the best way to do that.

Option 1. The easiest way.

DECLARE T1, T2;
"cool rules" -> T1;
"cool rule" -> T2;

Input: "123cool rules". Result: both T1 and T2 are triggered, even though "cool" is not preceded by whitespace.

Option 2. Using WORDLIST and WORDTABLE.

Let the word list 1.txt contain two rows:

cool rules
cool

The extraction code is as follows:

WORDLIST WList = '1.txt';
DECLARE W1;
Document{-> MARKFAST(W1, WList, true, 2)};

Input: "cool rules". Result: only the first row is matched. I guess that overlapping entries are not matched in this case.

Option 3. Mark a combination of two tokens.

DECLARE T1;
("cool" "rule") {-> T1};

Input: "cool rules cool rule 1cool rule". Result: 2 annotations: "cool rule" and "1cool rule". Extraction becomes about 10 times slower.

Option 4. REGEXP matching.

Maybe it is possible to match a pattern such as "cool\\srule", but I have no idea how to define the type expression. SW*{REGEXP("cool\\srule")->T1} does not produce any results.

As you can see, I'm trying to solve a very simple task but have not succeeded yet. Option 3 is a really good way to do it, but extraction becomes about 10 times slower.

Answered by Peter Kluegl

If you want to identify specific phrases, you should use a dictionary lookup rather than rules directly.

I'd therefore recommend option 2, MARKFAST. However, there are two problems: (a) only longest matches are supported, and (b) you need to either change the segmentation (tokenization) or do some postprocessing.

(a) This cannot be solved with MARKFAST. If overlapping matches are really required, a different dictionary annotator should be used; see, e.g., the UIMA mailing lists.
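To make (a) concrete, here is a small Python model of a longest-match lookup. It is only a sketch of the behavior described above, not how MARKFAST is implemented; the function name `longest_match_lookup` is made up for illustration.

```python
def longest_match_lookup(tokens, entries):
    """Greedy dictionary lookup: at each position keep only the longest entry."""
    phrases = {tuple(e.split()) for e in entries}
    max_len = max(len(p) for p in phrases)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first and stop at the first hit.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in phrases:
                found.append(" ".join(tokens[i:i + n]))
                i += n  # jump past the match, so entries nested in it are never reported
                break
        else:
            i += 1  # no entry starts here
    return found


matches = longest_match_lookup("cool rules".split(), ["cool rules", "cool"])
print(matches)
```

With the two rows from 1.txt, the shorter entry "cool" is only reported where the longer "cool rules" does not apply, which matches what the question observed.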

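(b) MARKFAST works on RutaBasic annotations, which are created automatically for each smallest segment of the text. With the default seeder, the token "1cool" consists of two RutaBasics, one for the NUM and one for the SW, so a dictionary entry can start in the middle of it. If you do not want to change the preprocessing, you can simply apply a rule that fixes this, such as:

// Make whitespace visible to the rules (it is filtered by default).
RETAINTYPE(WS);
// Remove any T1 that directly follows something other than whitespace.
ANY{-PARTOF(WS)} t:@T1{-> UNMARK(t)};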
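If it helps to see the idea outside Ruta, the following Python sketch imitates the situation. The `BASIC` pattern is a hypothetical stand-in for the default seeder (it is not Ruta's actual segmentation), and `find_phrase` is an illustrative helper: it accepts only hits that line up with segment boundaries and then flags the ones that do not follow whitespace, which is roughly what the UNMARK rule removes.

```python
import re

# Hypothetical stand-in for the default seeder: runs of digits, runs of letters,
# runs of whitespace, and any other single character each form one basic unit.
BASIC = re.compile(r"\d+|[^\W\d_]+|\s+|.", re.DOTALL)


def basics(text):
    """Split text into (begin, end, covered_text) units."""
    return [(m.start(), m.end(), m.group()) for m in BASIC.finditer(text)]


def find_phrase(text, phrase):
    """Return (begin, end, keep) for each occurrence aligned to unit boundaries."""
    units = basics(text)
    starts = {b for b, _, _ in units}
    ends = {e for _, e, _ in units}
    hits = []
    for m in re.finditer(re.escape(phrase), text):
        if m.start() in starts and m.end() in ends:
            # Keep the hit only if it starts the text or directly follows whitespace.
            keep = m.start() == 0 or text[m.start() - 1].isspace()
            hits.append((m.start(), m.end(), keep))
    return hits


print(basics("1cool"))
print(find_phrase("cool rule 1cool rule", "cool rule"))
```

In this model "1cool" splits into "1" and "cool", so the second "cool rule" is found but flagged with keep set to False, while "cool rules" yields no hit for "cool rule" at all because "rules" is a single unit.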

By the way, option 4 won't work because the REGEXP condition is checked against the covered text of the matched annotation, and SW represents only a single token. If you do something like (SW+){REGEXP("cool\\srule")->T1}, the rule won't match if another SW follows.

DISCLAIMER: I am a developer of UIMA Ruta.
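The same point in a few lines of Python, as a rough analogy only. The assumption here, which is mine and is merely consistent with the behavior described above, is that the condition succeeds only when the whole covered text matches the pattern; `regexp_condition` is an illustrative name, not a Ruta function.

```python
import re

PATTERN = re.compile(r"cool\srule")


def regexp_condition(covered_text):
    # Assumption: the whole covered text has to match, not just a part of it.
    return PATTERN.fullmatch(covered_text) is not None


single_token = regexp_condition("cool")          # SW covers one word only
exact_span = regexp_condition("cool rule")       # SW+ covering exactly the phrase
greedy_span = regexp_condition("cool rule now")  # SW+ also took in a following word
print(single_token, exact_span, greedy_span)
```

Only the span that covers exactly the two words passes, which is why a single SW can never satisfy the pattern and a greedy SW+ fails as soon as more words follow.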