I am trying to compare the texts in a column to measure their similarity in terms of swapped adjacent letters: how many substitutions of two adjacent letters are necessary to make both texts the same.
Example: JANE-JNAE (1 - AN/NA), MARY-MART(0), CLERA-LCREA(2 - CL/LC & ER/RE)
I have tried stringdist methods but they do not provide solutions for my problem.
Since I am new to R, I could not write efficient code to show here:
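substition <- function(text1, text2) {
  # Identical texts need no further comparison
  if (text1 == text2) {
    return(TRUE)
  }
  # Texts of different lengths cannot differ only by swapped letters
  if (nchar(text1) != nchar(text2)) {
    return(FALSE)
  }
  # Split each text into single characters
  vec1 <- strsplit(text1, split = "")[[1]]
  vec2 <- strsplit(text2, split = "")[[1]]
  # (can't go on)
}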
To illustrate, the data is something like this:
df$NO df$names
1 JANE
2 MARY
3 CLERA
4 JNAE
5 LCREA
6 MART
and the desired output is:
df$NO df$names df$substition
1 JANE 1
2 MARY 0
3 CLERA 2
4 JNAE 1
5 LCREA 2
6 MART 0
You can use the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) between strings. The distance gives the minimal number of insertions, deletions and substitutions needed to transform one string into another.
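As a minimal sketch of what this looks like in R, base R's `adist()` (from the `utils` package) computes the generalized Levenshtein distance between all pairs of strings. The vector name `nms` is only an illustration built from the names in the question. Note that plain Levenshtein distance does not treat a swap of two adjacent letters as a single edit, so the values will not match the desired `df$substition` column exactly.

```r
# Names taken from the example data in the question
nms <- c("JANE", "MARY", "CLERA", "JNAE", "LCREA", "MART")

# Pairwise Levenshtein distances (insertions, deletions, substitutions)
d <- adist(nms)
dimnames(d) <- list(nms, nms)

# Distances for the pairs mentioned in the question
d["JANE", "JNAE"]    # the swapped pair AN/NA costs two edits here
d["MARY", "MART"]    # one substitution (Y -> T)
d["CLERA", "LCREA"]
```

If a swap of adjacent letters should count as one edit, a distance that includes transpositions would be needed instead of plain Levenshtein.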
Usage
Returns a 3x3 matrix of distances: