I'm working on speech in conversational speaking turns and want to extract words that are repeated across turns. The task I'm grappling with is extracting words that are repeated inexactly.
Data:
X <- data.frame(
  speaker = c("A","B","A","B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)
I have a for loop that successfully extracts exact repetitions:
# initialize vectors:
pattern1 <- c()
extracted1 <- c()
# run `for` loop:
library(stringr)
for(i in 2:nrow(X)){
  # define each `speech` element as a pattern for the next `speech` element:
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(X$speech[i-1], " ")), collapse = "|"), ")\\b")
  # extract all matched words:
  extracted1[i] <- str_extract_all(X$speech[i], pattern1[i-1])
}
# result:
extracted1
[[1]]
NULL
[[2]]
[1] "take" "a" "look" "you"
[[3]]
character(0)
[[4]]
[1] "i" "think" "that" "it"
However, I also want to extract inexact repetitions. For example, "looks" in row #2 is an inexact repetition of "look" in row #1, "looked" in row #3 fuzzily repeats "looks" in row #2, and "yes" in row #4 is an approximate match of "yeah" in row #3.
I've recently come across agrep, which is used for approximate matching, but I don't know how to use it here or whether it's the right way to go at all. Any help is greatly appreciated.
Note that the actual data comprises thousands of speaking turns with highly unpredictable content so that it's not possible to define a list of all possible variants beforehand.
I think this can be done really well using a tidy approach. The problem you already solved can be done (probably much quicker) using tidytext:
But of course what you want to do is a bit more complex. The example words you highlight are not exactly the same but have a Levenshtein distance of up to 2:
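The distances for the three word pairs from the question can be checked with base R's adist(), which computes the generalized Levenshtein distance. This is only a small illustration; the `pairs` data frame and its column names are made up for the example.

```r
# Word pairs taken from the question: earlier turn vs. later turn
pairs <- data.frame(
  earlier = c("look", "looks", "yeah"),
  later   = c("looks", "looked", "yes"),
  stringsAsFactors = FALSE
)

# adist() on two vectors returns the full distance matrix,
# so compare the words pair by pair instead
pairs$distance <- unname(mapply(
  function(a, b) as.vector(utils::adist(a, b)),
  pairs$earlier, pairs$later
))

pairs$distance
# distances: 1 2 2
```

So a cutoff of 2 covers all three examples, although with short words it will also let through unrelated pairs.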
There is a great package for this following the same tidyverse logic. Unfortunately, the by argument in the respective function does not seem to be able to handle two columns (or it applies fuzzy logic to both columns, so 0 and 2 are treated as the same?), so this does not work:
However, using a loop we can implement the missing function anyway:
Created on 2021-04-22 by the reprex package (v2.0.0)
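A minimal sketch of such a loop, written with base R's adist() rather than the package discussed above, might look like the following. The helper fuzzy_repeats() and its max_dist and min_chars arguments are illustrative names, and the cutoff of 2 edits and the minimum word length of 3 characters are assumptions you would want to tune on real data.

```r
# Data from the question
X <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)

# Illustrative helper: return the words of `current` whose closest word in
# `previous` is between 1 and `max_dist` edits away (i.e. no exact match).
fuzzy_repeats <- function(current, previous, max_dist = 2, min_chars = 3) {
  cur  <- unique(strsplit(current, " ", fixed = TRUE)[[1]])
  prev <- unique(strsplit(previous, " ", fixed = TRUE)[[1]])
  # drop very short words, which are within 2 edits of almost anything
  cur  <- cur[nchar(cur) >= min_chars]
  prev <- prev[nchar(prev) >= min_chars]
  if (length(cur) == 0 || length(prev) == 0) return(character(0))
  d <- utils::adist(cur, prev)   # rows: current words, columns: previous words
  best <- apply(d, 1, min)       # distance to the closest previous word
  cur[best >= 1 & best <= max_dist]
}

# Compare every turn with the turn before it
fuzzy <- vector("list", nrow(X))
for (i in 2:nrow(X)) {
  fuzzy[[i]] <- fuzzy_repeats(X$speech[i], X$speech[i - 1])
}
fuzzy
```

With these settings the result contains "looks" for row 2, "looked" for row 3 and "yes" for row 4, but it also picks up other words that merely happen to be within two edits of something in the previous turn, so the output needs checking.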
I'm not sure how ideal the distance is in your case and if you consider the results correct. Alternatively you can try stemming or lemmatization before matching, which might work better. I also wrote a new function for the package implementing a stringsim_join version, which takes into account the length of the words you are trying to match. But the PR hasn't been approved yet.
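To give an idea of the stemming route, here is a small sketch that assumes the SnowballC package is installed and uses its wordStem() function. The word list is just the examples from the question.

```r
library(SnowballC)  # assumption: installed from CRAN

words <- c("look", "looks", "looked", "yeah", "yes")

# Reduce each word to its stem before comparing turns
stems <- wordStem(words, language = "english")

data.frame(word = words, stem = stems, stringsAsFactors = FALSE)

# The three forms of "look" collapse to a single stem, so after stemming
# they can be found with the exact-match approach from the question
length(unique(stems[1:3])) == 1
```

Stemming typically only helps with inflected forms of the same word. Pairs like "yeah" and "yes" are not morphological variants, so they would still need a distance-based match.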