How to use OpenNLP to get POS tags in R?

Question

How to use OpenNLP to get POS tags in R?

13.3k views Asked by user4599 At 23 June 2015 at 06:19

Here is the R Code:

library(NLP) 
library(openNLP)
tagPOS <-  function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)}
str <- "this is a the first sentence."
tagged_str <-  tagPOS(str)

Output is :

tagged_str $POStagged [1]"this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."

Now I want to extract only NN word i.e sentence from the above sentence and want to store it into a variable .Can anyone help me out with this .

Original Q&A

There are 3 answers

Ken Benoit On 23 June 2015 at 10:30

Here is a more general solution, where you can describe the Treebank tag you desire to extract using a regular expression. In your case for instance, "NN" returns all noun types (e.g. NN, NNS, NNP, NNPS) while "NN$" returns just NN.

It operates on a character type, so if you have your texts as a list, you will need to lapply() it as in the examples below.

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

extractPOS <- function(x, thisPOSregex) {
    x <- as.String(x)
    wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
    POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
    POSwords <- subset(POSAnnotation, type == "word")
    tags <- sapply(POSwords$features, '[[', "POS")
    thisPOSindex <- grep(thisPOSregex, tags)
    tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
    untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
    untokenizedAndTagged
}

lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
## 
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
## 
## [[2]]
## [1] ""

Ken Benoit On 24 June 2017 at 15:22

Here is another answer that uses the spaCy parser and tagger, from Python, and the spacyr package to call it.

This library is orders of magnitude faster and almost as good as the stanford NLP models. It is still incomplete in some languages, but for english is a pretty good and promising option.

You first need to have Python installed and to have installed spaCy and a language module. Instructions are available from the spaCy page and here.

Then:

txt <- c("This is a short tagging example, by John Doe.",
         "Too bad OpenNLP is so slow on large texts.")

require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)

spacy_parse(txt, pos = TRUE, tag = TRUE)
##    doc_id sentence_id token_id   token   lemma   pos tag   entity
## 1   text1           1        1    This    this   DET  DT         
## 2   text1           1        2      is      be  VERB VBZ         
## 3   text1           1        3       a       a   DET  DT         
## 4   text1           1        4   short   short   ADJ  JJ         
## 5   text1           1        5 tagging tagging  NOUN  NN         
## 6   text1           1        6 example example  NOUN  NN         
## 7   text1           1        7       ,       , PUNCT   ,         
## 8   text1           1        8      by      by   ADP  IN         
## 9   text1           1        9    John    john PROPN NNP PERSON_B
## 10  text1           1       10     Doe     doe PROPN NNP PERSON_I
## 11  text1           1       11       .       . PUNCT   .         
## 12  text2           1        1     Too     too   ADV  RB         
## 13  text2           1        2     bad     bad   ADJ  JJ         
## 14  text2           1        3 OpenNLP opennlp PROPN NNP         
## 15  text2           1        4      is      be  VERB VBZ         
## 16  text2           1        5      so      so   ADV  RB         
## 17  text2           1        6    slow    slow   ADJ  JJ         
## 18  text2           1        7      on      on   ADP  IN         
## 19  text2           1        8   large   large   ADJ  JJ         
## 20  text2           1        9   texts    text  NOUN NNS         
## 21  text2           1       10       .       . PUNCT   .

**RHertel** · Accepted Answer · 2015-06-23T08:14:03+00:00

RHertel On 23 June 2015 at 08:14 BEST ANSWER

There might be more elegant ways to obtain the result, but this one should work:

q <- strsplit(unlist(tagged_str[1]),'/NN')
q <- tail(strsplit(unlist(q[1])," ")[[1]],1)
#> q
#[1] "sentence"

Hope this helps.

TechQA.

How to use OpenNLP to get POS tags in R?

There are 3 answers

Related Questions in R

Related Questions in NLP

Related Questions in TEXT-MINING

Related Questions in OPENNLP

Related Questions in POS-TAGGER

Popular Questions

Trending Questions