R String similarity matrix


I am working on a text analytics project on a large volume of complaints data. One of the issues with the data is that you get multiple synonyms of the same word, e.g. bill, billing, billed, bills etc. Normally I would create a word frequency list, manually match the obvious ones, and then replace every synonym instance in the original corpus with the main word, e.g. billing, billed, bills -> bill (as they are all bill related). I have a nifty piece of code for this that someone on here helped me with.

Recently I have been playing around with the idea of using a string distance algorithm to make my life easier by identifying possible synonyms. I am using the stringdist package, but I am at a loss as to how to implement the test efficiently. Basically I need a matrix of all words against all words, with the result of the stringdist function at each intersection.

I use the stringdist function as follows:

library(stringdist)
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

This gives a similarity score of 0.955.

So from a word list of a,b,c, I want to get to (values purely indicative):

   a    b    c
a  1    0.4  0.4
b  0.4  1    0.4
c  0.4  0.4  1

Where the intersection is the result of the stringdist function.

Alternatively I can also work with:

a  a  1
a  b  0.4
a  c  0.4
b  a  0.4
b  b  1
b  c  0.4
c  a  0.4
c  b  0.4
c  c  1

The only problem with the latter is the duplicates, e.g. a, b and b, a, which could be eliminated as they yield the same result.

So clever R coders, please help me. I guess the answer is somewhere in matrix functions, but I am not a good enough R coder.

Cheers


There are 2 answers

RUser (best answer)

To remove the duplicates as described above:

dist.mat.tab.sort <- t(apply(dist.mat.tab, 1, sort))
dist.mat.tab <- dist.mat.tab[!duplicated(dist.mat.tab.sort),]

Here dist.mat.tab is the melted (long-format) distance matrix, i.e. one row per pair of words with its score.
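For context, here is a minimal sketch of how the similarity matrix and the long table could be built in the first place. It assumes the word list is a plain character vector; the names `words`, `sim.mat` and `pairs` are illustrative and do not come from the original posts. `stringdistmatrix` from the stringdist package computes all pairwise distances in one call, and taking only the upper triangle keeps each pair once, so the a, b / b, a duplicates never arise.

```r
library(stringdist)

# Illustrative word list (assumption: in practice this would come from
# the word frequency list of the corpus)
words <- c("bill", "billing", "billed", "bills", "account")

# Full pairwise Jaro-Winkler distance matrix, converted to a similarity
sim.mat <- 1 - stringdistmatrix(words, words, method = "jw", p = 0.1)
dimnames(sim.mat) <- list(words, words)

# Long format, keeping each unordered pair once (upper triangle only,
# which also drops the a,a self-pairs on the diagonal)
idx <- which(upper.tri(sim.mat), arr.ind = TRUE)
pairs <- data.frame(
  word1      = words[idx[, "row"]],
  word2      = words[idx[, "col"]],
  similarity = sim.mat[idx],
  stringsAsFactors = FALSE
)

# Most similar pairs first: these are the candidates for merging
pairs <- pairs[order(-pairs$similarity), ]
```

On a large vocabulary the full matrix grows quadratically, so it may be worth restricting `words` to the more frequent terms before running this.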

Hisham Mahrous

I suggest you use a stemmer; you will find one in the tm package. If a distance measure is required, you can use cosine similarity rather than Jaro-Winkler.
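A small sketch of the stemming idea, assuming the SnowballC package is installed (as far as I know, it is the stemmer that tm's `stemDocument` relies on). The word list and the `lookup` name are illustrative only.

```r
library(SnowballC)

# Illustrative variants from the question
words <- c("bill", "billing", "billed", "bills")

# Porter-style stemming reduces each variant to a common stem
stems <- wordStem(words, language = "english")

# Named vector mapping each original word to its stem, which can be
# used to replace the variants in the corpus
lookup <- setNames(stems, words)
```

Stemming only collapses inflected forms; misspellings or genuinely different words with the same meaning would still need a distance-based check or manual review.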