I am working on a Portuguese Digital Humanities project in R. I created a CoNLL-U-style data frame from the corpus data using the udpipe package:
textAnnotated <- udpipe::udpipe_annotate(m_port, x = textCorpus) %>%
as.data.frame()
The beginning of my dataframe is like this:
doc_id paragraph_id sentence_id sentence token_id token
1 doc1 1 1 DICCIONARIO DOS... 1 DICCIONARIO
2 doc1 1 1 DICCIONARIO DOS... 2-3 DOS
3 doc1 1 1 DICCIONARIO DOS... 2 DE
4 doc1 1 1 DICCIONARIO DOS... 3 OS
5 doc1 1 1 DICCIONARIO DOS... 4 TERMOS
6 doc1 1 1 DICCIONARIO DOS... 5 TECHNICOS
What I would like to do is mark each token within its sentence; for instance, I could rewrite each sentence with the corresponding token in bold.
For example, the first sentence is "DICCIONARIO DOS TERMOS TECHNICOS". I need to replace the sentence in row 1 with <b>DICCIONARIO</b> DOS TERMOS TECHNICOS; the sentence in row 2 would then be DICCIONARIO <b>DOS</b> TERMOS TECHNICOS; the sentence in row 5 (because rows 3 and 4 would be deleted) would be DICCIONARIO DOS <b>TERMOS</b> TECHNICOS; and so on.
I cannot simply match the token in the sentence with, say, str_replace(), because the same token may occur multiple times in one sentence.
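To make the problem concrete, here is a minimal illustration with a made-up sentence (not taken from my corpus). It uses base R's sub(), which, like str_replace(), only replaces the first match:

```r
# Hypothetical sentence in which the token "DE" occurs twice
sentence <- "TERMOS DE ARTE E DE SCIENCIA"

# Goal: mark the second "DE" (the 5th word).
# Plain pattern matching marks the first occurrence instead.
marked <- sub("DE", "<b>DE</b>", sentence, fixed = TRUE)
print(marked)
```

The first "DE" ends up in bold, whichever row of the data frame the replacement was meant for.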
At first, I thought that I could use the function word(sentence, token_id) to find the token in the sentence and replace it with, say, <b>token</b>, with code like this:
for (i in seq_len(nrow(textAnnotated))) {
  # Pick the token_id-th whitespace-separated word of the sentence
  target <- word(textAnnotated$sentence[i], textAnnotated$token_id[i])
  # Wrap the first match of that word in <b>...</b>
  textAnnotated$sentence[i] <- sub(target,
                                   paste0("<b>", target, "</b>"),
                                   textAnnotated$sentence[i])
}
But there are two problems with that:
- Some token numbers are marked "2-3", because of contractions ("dos" = "de + os"). I wrote some simple code that solves this problem by deleting all the uncontracted forms and renumbering the tokens:
# Flag the two expanded rows that follow each contraction row such as "2-3"
# (this assumes every contraction spans exactly two tokens)
for (i in seq_along(textAnnotated$token_id)) {
  if (str_detect(textAnnotated$token_id[i], "-")) {
    textAnnotated$token_id[i + 1] <- 0
    textAnnotated$token_id[i + 2] <- 0
  }
}
textAnnotated <- subset(textAnnotated, token_id != 0)
# Renumber the remaining tokens from 1 within each sentence
for (k in unique(textAnnotated$sentence_id)) {
  in_sentence <- textAnnotated$sentence_id == k
  textAnnotated$token_id[in_sentence] <- seq_len(sum(in_sentence))
}
- But there is a second problem: punctuation (commas, parentheses, etc.) also counts as tokens for the UDPipe annotator, but not for the word() function; besides that, the word() function treats a string such as "word)," as a word, instead of just "word" (I suppose this is because it only uses whitespace as a separator).
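A small reproduction of this second problem, again with a made-up sentence. It assumes the stringr package is installed; word() splits on single spaces by default:

```r
library(stringr)

# Hypothetical sentence with punctuation attached to words
sentence <- "TERMOS (DE ARTE), E SCIENCIA"

# Punctuation stays glued to the neighbouring word
second <- word(sentence, 2)
third  <- word(sentence, 3)
print(c(second, third))

# Number of whitespace-separated words seen by word()
n_words <- str_count(sentence, " ") + 1
print(n_words)
```

Here word() sees five words, while a tokenizer that splits off the parentheses and the comma would produce more tokens, so the token_id values no longer line up with word positions.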
I was wondering if there is a function similar to word() that also counts punctuation characters as tokens, but ignores them when returning the result. Or maybe someone could point me to another way to solve this. Thank you!
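One possible direction, as a rough sketch rather than a tested solution: instead of counting words, walk through the tokens of a sentence from left to right and search for each one only after the end of the previous match. The helper mark_tokens() below is hypothetical (it is not part of udpipe or stringr), uses only base R, and assumes that the tokens are the surface forms in sentence order, that is, with the expanded parts of contractions already removed.

```r
# Hypothetical helper: returns one marked copy of `sentence` per token,
# with that token wrapped in <b>...</b>.
mark_tokens <- function(sentence, tokens) {
  cursor <- 1L                         # where the next search starts
  out <- character(length(tokens))
  for (i in seq_along(tokens)) {
    rest <- substring(sentence, cursor)
    pos  <- as.integer(regexpr(tokens[i], rest, fixed = TRUE))
    if (pos == -1L) {                  # token not found: leave NA
      out[i] <- NA_character_
      next
    }
    start <- cursor + pos - 1L         # position in the full sentence
    end   <- start + nchar(tokens[i]) - 1L
    out[i] <- paste0(substring(sentence, 1L, start - 1L),
                     "<b>", tokens[i], "</b>",
                     substring(sentence, end + 1L))
    cursor <- end + 1L                 # continue after this token
  }
  out
}

# Made-up example: "DE" occurs twice and punctuation counts as a token
sentence <- "TERMOS (DE ARTE), E DE SCIENCIA"
tokens   <- c("TERMOS", "(", "DE", "ARTE", ")", ",", "E", "DE", "SCIENCIA")
marked   <- mark_tokens(sentence, tokens)
print(marked[3])
print(marked[8])
```

Because each search starts where the previous token ended, repeated tokens are marked at the right position and punctuation tokens consume their own characters. On the real data frame this would presumably be applied per sentence (for example per doc_id and sentence_id group), which I have left out here.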