NLP in R: working with tokenization in a CONLLU-style dataframe

113 views Asked by Bruno Maroneze At 02 June 2022 at 16:53

I am working in a Portuguese Digital Humanities project using R. I created a CONLLU-style dataframe with the corpus data, using the UDPipe library:

library(magrittr)  # provides the %>% pipe
textAnnotated <- udpipe::udpipe_annotate(m_port, x = textCorpus) %>%
  as.data.frame()

The beginning of my dataframe is like this:

  doc_id paragraph_id sentence_id sentence            token_id token
1 doc1              1           1 DICCIONARIO DOS...         1 DICCIONARIO
2 doc1              1           1 DICCIONARIO DOS...       2-3 DOS
3 doc1              1           1 DICCIONARIO DOS...         2 DE
4 doc1              1           1 DICCIONARIO DOS...         3 OS
5 doc1              1           1 DICCIONARIO DOS...         4 TERMOS
6 doc1              1           1 DICCIONARIO DOS...         5 TECHNICOS

What I would like to do is mark each token within its sentence, for instance by rewriting the sentence in every row with that row's token in bold. The first sentence is "DICCIONARIO DOS TERMOS TECHNICOS". In row 1 the sentence would become "<b>DICCIONARIO</b> DOS TERMOS TECHNICOS"; in row 2, "DICCIONARIO <b>DOS</b> TERMOS TECHNICOS"; in row 5 (rows 3 and 4 would be deleted), "DICCIONARIO DOS <b>TERMOS</b> TECHNICOS"; and so on.

I cannot simply match the token in the sentence with, say, str_replace(), because the same token may occur multiple times in one sentence.

At first, I thought that I could use word(sentence, token_id) to find the token in the sentence and replace it with, say, <b>token</b>, using code like this:

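library(stringr)

for (i in seq_len(nrow(textAnnotated))) {
  # word() needs a numeric position, so token_id is converted in case it is stored as text
  target <- word(textAnnotated$sentence[i], as.integer(textAnnotated$token_id[i]))
  # fixed = TRUE treats the token literally, not as a regular expression
  textAnnotated$sentence[i] <- sub(target, paste0("<b>", target, "</b>"),
                                   textAnnotated$sentence[i], fixed = TRUE)
}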

But there are two problems with that:

First, some token numbers are ranges such as "2-3", because of contractions ("dos" = "de" + "os"). I wrote some simple code that works around this by deleting the uncontracted forms and renumbering the tokens:

# Mark the two rows after each multiword row (e.g. "2-3") for deletion;
# this assumes every contraction expands to exactly two tokens.
n <- nrow(textAnnotated)
for (i in which(stringr::str_detect(textAnnotated$token_id, "-"))) {
  textAnnotated$token_id[intersect(i + 1:2, seq_len(n))] <- "0"
}
textAnnotated <- subset(textAnnotated, token_id != "0")

# Renumber the remaining tokens from 1 within each sentence_id.
for (k in unique(textAnnotated$sentence_id)) {
  rows <- textAnnotated$sentence_id == k
  textAnnotated$token_id[rows] <- seq_len(sum(rows))
}
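
A possible variation on the step above, as a sketch only: rather than assuming that a contraction always covers exactly two rows, the range in token_id ("2-3") can be parsed and used to drop whichever rows it covers. The helper name keep_surface_tokens and the new column surface_id are hypothetical, the toy data frame only mimics the columns shown in the question, and the code assumes that doc_id plus sentence_id identifies a sentence.

```r
# Toy rows mimicking the data frame in the question (illustrative values only).
toks <- data.frame(
  doc_id      = "doc1",
  sentence_id = 1L,
  token_id    = c("1", "2-3", "2", "3", "4", "5"),
  token       = c("DICCIONARIO", "DOS", "DE", "OS", "TERMOS", "TECHNICOS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the forms that appear in the text
# and number them 1, 2, 3, ... within each sentence.
keep_surface_tokens <- function(df) {
  is_range <- grepl("-", df$token_id, fixed = TRUE)
  drop <- rep(FALSE, nrow(df))
  for (i in which(is_range)) {
    # "2-3" -> c(2, 3) -> ids "2" and "3" are the expanded parts to drop
    bounds <- as.integer(strsplit(df$token_id[i], "-", fixed = TRUE)[[1]])
    covered <- as.character(seq(bounds[1], bounds[2]))
    same_sentence <- df$doc_id == df$doc_id[i] &
      df$sentence_id == df$sentence_id[i]
    drop <- drop | (same_sentence & df$token_id %in% covered)
  }
  out <- df[!drop, , drop = FALSE]
  # Running position of each remaining row inside its document and sentence.
  out$surface_id <- ave(seq_len(nrow(out)), out$doc_id, out$sentence_id,
                        FUN = seq_along)
  out
}

surface <- keep_surface_tokens(toks)
```

Writing the new numbering to a separate column leaves the original token_id values available for later checks.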

But there is a second problem: punctuation (commas, parentheses, etc.) also counts as tokens for the UDPipe annotator, but not for the word() function. Besides that, word() treats a string such as "word)," as a single word instead of just "word" (I suppose this is because it only uses whitespace as a separator).
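
The mismatch described above can be reproduced in base R. This is only an illustration: the example sentence is made up, splitting on a single space stands in for what word() does with its default separator, and the regular expression is a rough approximation, not UDPipe's tokenizer.

```r
sentence <- "Termo (technico), usado."

# Splitting on spaces keeps punctuation glued to the neighbouring word.
by_space <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Matching runs of letters/digits or single punctuation marks gives
# pieces that are closer to what a tokenizer emits.
by_token <- regmatches(sentence,
                       gregexpr("[[:alnum:]]+|[[:punct:]]", sentence))[[1]]
```

Here by_space has three elements while by_token has seven, so a position counted in one list does not point at the same piece in the other.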

I was wondering if there is a function similar to word() that also counts punctuation characters as tokens, but ignores them when returning the result. Or maybe someone could point me to another way to solve this. Thank you!
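
One direction that avoids word() altogether, sketched in base R under stated assumptions: walk through the tokens of a sentence in order and search for each one only to the right of the previous match, so repeated tokens and punctuation are located by position, not by word count. The helper highlight_tokens is hypothetical. It assumes that the tokens are the surface forms (contractions kept as "DOS", expanded parts such as "DE" and "OS" already removed) and that each token appears verbatim in the sentence text.

```r
# Hypothetical helper: returns one copy of `sentence` per token,
# each with that token wrapped in <b> tags.
highlight_tokens <- function(sentence, tokens) {
  out <- character(length(tokens))
  cursor <- 1L  # first character not yet consumed by an earlier token
  for (i in seq_along(tokens)) {
    rest <- substr(sentence, cursor, nchar(sentence))
    hit <- as.integer(regexpr(tokens[i], rest, fixed = TRUE))
    if (hit == -1L) {  # token not found: leave this copy unmarked
      out[i] <- sentence
      next
    }
    start <- cursor + hit - 1L
    end <- start + nchar(tokens[i]) - 1L
    out[i] <- paste0(
      substr(sentence, 1L, start - 1L),
      "<b>", substr(sentence, start, end), "</b>",
      substr(sentence, end + 1L, nchar(sentence))
    )
    cursor <- end + 1L
  }
  out
}

# Made-up sentence with repeated tokens and punctuation.
sentence <- "O termo (de arte) e o termo de uso."
tokens <- c("O", "termo", "(", "de", "arte", ")",
            "e", "o", "termo", "de", "uso", ".")
marked <- highlight_tokens(sentence, tokens)
```

In this example marked[2] highlights the first "termo" and marked[9] the second. Applied to the data frame, the function would be called once per sentence, for example on groups produced by split().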
