Using the Tokenizer in openNLP


I am getting the POS tagged text in R in the form of:


id   type   start   end   features
1    word   1       5     POS=NNP
2    word   7       8     POS=IN

.....

I want to retrieve the actual words that were tagged: instead of a 'type' column in which every value is just "word", I want the words themselves. I can use scan_tokenizer, but it breaks down on forms like "isn't". The POS tagger splits that into "is" and "not", which is great, but scan_tokenizer doesn't tokenize that way and keeps it as "isn't". Can anyone please help me retrieve the words that R tokenized and used for POS tagging?

Thanks


There is 1 answer

Daniel On BEST ANSWER

Why don't you use the Illinois POS tagger? It is easy to use, and its output is easy to visualize:

http://cogcomp.cs.illinois.edu/page/software_view/3

http://cogcomp.cs.illinois.edu/demo/pos/?id=4
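
If you would rather stay in R, the start and end columns in the question's output look like character offsets into the original text, so the tagged words can probably be recovered by slicing the text at those offsets. Below is a minimal sketch in base R under that assumption. The sentence, the offsets, and the data frame `ann` are made up for illustration; they are not the asker's data.

```r
# Hypothetical input text (an assumption, not the asker's actual data).
text <- "Paris in spring isn't cold."

# Hypothetical word annotations with 1-based, inclusive character offsets,
# laid out like the table in the question.
ann <- data.frame(
  id    = 1:7,
  type  = "word",
  start = c(1, 7, 10, 17, 19, 23, 27),
  end   = c(5, 8, 15, 18, 21, 26, 27),
  stringsAsFactors = FALSE
)

# substring() is vectorized over start and end, so one call slices out
# every token exactly as the offsets delimit it.
ann$word <- substring(text, ann$start, ann$end)

print(ann[, c("id", "word", "start", "end")])
```

Because the words come from the same offsets the tagger used, a contraction such as "isn't" comes back in whatever pieces the tokenizer produced, with no second tokenizer involved. If the annotations are an Annotation object from the NLP package and the text was converted with as.String(), subsetting the string by the word annotations, for example s[subset(a, type == "word")], should give the same result without building a data frame by hand.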