Hamming distance on categorical data with multiple values in one cell


I'm calculating the pairwise Hamming distance between rows of a data frame that contains strings. Some of the cells contain two comma-separated values.

I would like to calculate the Hamming distance as follows:

if row1 = "d,e" and row2 = "d", Hd(row1, row2) is
(Hd("d", "d") + Hd("e", "d")) / (length(row1) + length(row2))
= (0 + 1) / (2 + 1) = 1/3 ≈ 0.33

or

if row1 = "d,e" and row2 = "d,f", Hd(row1, row2) is
(Hd("d", "d") + Hd("d", "f") + Hd("e", "d") + Hd("e", "f")) / (length(row1) + length(row2))
= (0 + 1 + 1 + 1) / (2 + 2) = 3/4 = 0.75

I've managed to calculate the Hamming distance between cells that contain only one value, but I'm stuck on cells that contain more than one value.
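To make the rule above concrete, here is a minimal sketch in Python (the question does not name a language, so that choice is an assumption). The helper names `cell_distance`, `row_distance` and `pairwise_distances` are hypothetical, the comma separator is taken from the examples, and summing the per-cell distances across columns is my assumption about how cells combine into a row distance.

```python
from itertools import combinations, product


def cell_distance(a: str, b: str, sep: str = ",") -> float:
    """Distance between two cells, following the rule in the question.

    Every value in cell `a` is compared with every value in cell `b`;
    the number of mismatching pairs is divided by the total number of
    values in both cells.
    """
    vals_a = [v.strip() for v in a.split(sep)]
    vals_b = [v.strip() for v in b.split(sep)]
    mismatches = sum(x != y for x, y in product(vals_a, vals_b))
    return mismatches / (len(vals_a) + len(vals_b))


def row_distance(row1, row2, sep: str = ",") -> float:
    """Assumption: a row distance is the sum of its per-cell distances."""
    return sum(cell_distance(a, b, sep) for a, b in zip(row1, row2))


def pairwise_distances(rows, sep: str = ","):
    """Symmetric matrix of row distances as a list of lists.

    The diagonal is left at 0.0 by convention; the rule is not applied
    to a row compared with itself.
    """
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = row_distance(rows[i], rows[j], sep)
        dist[i][j] = dist[j][i] = d
    return dist


# The two examples from the question
print(cell_distance("d,e", "d"))    # 1 mismatch / 3 values
print(cell_distance("d,e", "d,f"))  # 3 mismatches / 4 values

# Hypothetical two-column data, one inner list per row
rows = [["d,e", "a"], ["d", "a"], ["d,f", "b"]]
print(pairwise_distances(rows))
```

With pandas, something like `pairwise_distances(df.values.tolist())` should feed the rows in, assuming every cell is a string. Note that under this rule two different single-valued cells get 0.5 rather than 1, and a multi-valued cell compared with an identical cell does not get 0 (for example "d,e" against "d,e" gives 2/4), so the result does not reduce to the ordinary Hamming distance.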

There are 0 answers