NLP in R: working with tokenization in a CONLLU-style dataframe

I am working in a Portuguese Digital Humanities project using R. I created a CONLLU-style dataframe with the corpus data, using the UDPipe library:

textAnnotated <- udpipe::udpipe_annotate(m_port, x = textCorpus) %>%
  as.data.frame()

The beginning of my dataframe is like this:

  doc_id paragraph_id sentence_id sentence           token_id token
1 doc1              1           1 DICCIONARIO DOS...        1 DICCIONARIO
2 doc1              1           1 DICCIONARIO DOS...      2-3 DOS
3 doc1              1           1 DICCIONARIO DOS...        2 DE
4 doc1              1           1 DICCIONARIO DOS...        3 OS
5 doc1              1           1 DICCIONARIO DOS...        4 TERMOS
6 doc1              1           1 DICCIONARIO DOS...        5 TECHNICOS

What I would like to do is mark each token within its sentence, for instance by rewriting the sentence with that token in bold. The first sentence is "DICCIONARIO DOS TERMOS TECHNICOS". In row 1, I need to replace the sentence with <b>DICCIONARIO</b> DOS TERMOS TECHNICOS; in row 2, with DICCIONARIO <b>DOS</b> TERMOS TECHNICOS; in row 5 (rows 3 and 4 would be deleted), with DICCIONARIO DOS <b>TERMOS</b> TECHNICOS; and so on.

I cannot simply match the token in the sentence with, say, str_replace(), because the same token may occur multiple times in one sentence.
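To make the problem concrete, here is a minimal sketch with a made-up sentence (not from my corpus). It uses base R's sub(), which, like str_replace(), only replaces the first match, so there is no way to tell it that the target is the second "o":

```r
# Made-up example sentence; the token I want to mark is the SECOND "o" (token 4)
sentence <- "o gato viu o rato"

# sub() replaces only the first match, so the wrong "o" gets marked
marked <- sub("o", "<b>o</b>", sentence, fixed = TRUE)
print(marked)
```

The result marks the first "o" instead of the fourth token, which is why position information is needed in addition to the token string.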

At first, I thought I could use stringr's word(sentence, token_id) to find the token in the sentence and replace it with <b>token</b>, using code like this:

for (i in seq_len(nrow(textAnnotated))) {
  # word() picks the token_id-th whitespace-separated word of the sentence
  target <- word(textAnnotated$sentence[i], textAnnotated$token_id[i])
  # Wrap the first occurrence of that word in <b> tags
  textAnnotated$sentence[i] <- sub(target,
                                   paste0("<b>", target, "</b>"),
                                   textAnnotated$sentence[i])
}

But there are two problems with that:

  1. Some token numbers are ranges such as "2-3", because of contractions ("dos" = "de" + "os"). I wrote some simple code that handles this by deleting the uncontracted forms and renumbering the tokens:
# Mark the two rows that follow each contraction row (e.g. "2-3") for removal
for (i in seq_len(nrow(textAnnotated))) {
  if (str_detect(textAnnotated$token_id[i], "-")) {
    textAnnotated$token_id[i + 1] <- 0
    textAnnotated$token_id[i + 2] <- 0
  }
}
textAnnotated <- subset(textAnnotated, token_id != 0)
# Renumber the remaining tokens within each sentence
for (k in unique(textAnnotated$sentence_id)) {
  rows <- textAnnotated$sentence_id == k
  textAnnotated$token_id[rows] <- seq_len(sum(rows))
}
  2. Punctuation (commas, parentheses, etc.) counts as tokens for the UDPipe annotator, but not for the word() function. In addition, word() treats a string such as "word)," as a single word instead of just "word" (I suppose because it only uses a space as the separator).
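A small sketch of this second mismatch, using a made-up phrase. I mimic word()'s default behaviour (splitting on a single space) with base R's strsplit(), and the token vector is written by hand to show the kind of output a punctuation-aware tokenizer gives; it is not real UDPipe output:

```r
# Made-up phrase containing parentheses and a comma
sentence <- "termo (technico), usado"

# Splitting on single spaces, which is what word() does by default
pieces <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Hand-written token list (assumed, for illustration only)
tokens <- c("termo", "(", "technico", ")", ",", "usado")

print(length(pieces))  # number of space-separated pieces
print(length(tokens))  # number of tokens when punctuation is split off
print(pieces[2])       # the punctuation stays glued to the word
```

Because the two counts differ, a token_id taken from the annotation no longer points at the right space-separated piece once any punctuation appears earlier in the sentence.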

I was wondering if there is a function similar to word() that also counts punctuation characters as tokens, but leaves them out of the returned result. Or maybe someone could point me to another way to solve this. Thank you!
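One direction I have been considering, as a rough sketch only: instead of counting words, walk through the sentence from left to right and search for each surface token starting from where the previous one ended. The helper highlight_tokens() and the small data frame below are hypothetical and written by hand for illustration. The sketch assumes the tokens appear in the sentence verbatim and in order, and it handles a single sentence (real data would need this applied per doc_id and sentence_id):

```r
# Hypothetical helper: returns one marked copy of `sentence` per token,
# locating each token by searching forward from a moving cursor.
highlight_tokens <- function(sentence, tokens) {
  cursor <- 1L
  out <- character(length(tokens))
  for (i in seq_along(tokens)) {
    rest <- substring(sentence, cursor)
    # Literal (non-regex) search, so "(" or "." need no escaping
    hit <- as.integer(regexpr(tokens[i], rest, fixed = TRUE))
    if (hit == -1L) {
      out[i] <- NA_character_  # token not found after the cursor
      next
    }
    start <- cursor + hit - 1L
    end <- start + nchar(tokens[i]) - 1L
    out[i] <- paste0(substr(sentence, 1L, start - 1L),
                     "<b>", tokens[i], "</b>",
                     substring(sentence, end + 1L))
    cursor <- end + 1L
  }
  out
}

# Hand-built sample mirroring the first rows of my data frame
df <- data.frame(
  sentence = "DICCIONARIO DOS TERMOS TECHNICOS",
  token_id = c("1", "2-3", "2", "3", "4", "5"),
  token = c("DICCIONARIO", "DOS", "DE", "OS", "TERMOS", "TECHNICOS"),
  stringsAsFactors = FALSE
)

# Ids covered by a range row such as "2-3" belong to the split forms (DE, OS)
ranges <- strsplit(df$token_id[grepl("-", df$token_id)], "-", fixed = TRUE)
covered <- unlist(lapply(ranges, function(r) seq(as.integer(r[1]), as.integer(r[2]))))

# Keep only the surface tokens, i.e. the forms that occur in the sentence text
surface <- df[!(df$token_id %in% as.character(covered)), ]
surface$marked <- highlight_tokens(surface$sentence[1], surface$token)
print(surface$marked)

# Repeated tokens are resolved by position, not by first match
repeated <- highlight_tokens("o gato viu o rato", c("o", "gato", "viu", "o", "rato"))
print(repeated[4])
```

This avoids renumbering altogether, since the range rows ("2-3") are kept and the split forms are dropped. It would break if a token in the data frame differs from the text in the sentence (for example after normalization), in which case the search returns NA or may land on a later match.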
