I am attempting to get a percentage of an edit distance from a group of sequences. So far this is what I have:
library(stringdist)
sequence <- c("CA--------W----------------------EKDRRTEAF---F------",
"CA--------W----------------------EKDRRTEAF---F------",
"CA--------S-------------------SLVFGQGDNIQY---F------",
"RA--------S-------------------SLIYSP----LH---F------")
edit_dist <- stringdistmatrix(sequence)
#0
#13 13
#14 14 11
len <- stri_length(gsub('-', '', sequence))
#13 13 16 12
As each line of len
is equivalent to each line of sequence
, when comparing the two lines I would like to use the largest len
to get the percentage. So as when having an edit distance between the second and third sequence it would use the length of 16 rather than 13 to get a percentage.
I know this code is wrong, but it is generally the idea I am going for:
for (i in len) {
num1 <- len[i]
for (j in len){
num2 <- len[j]
if (num2 > num1){
num <- num2
}else{
num <- num1
}
}
edit_dist/num
}
The answer should look something similar to that below:
0 .8125 .8125 1.0769 1.0769 .6875
You can construct a suitable matrix of the max length with
outer
andpmax
, which you can then coerce todist
class (likeedit_dist
) so you can divide: