R String similarity matrix

Question

1.8k views Asked by RUser At 11 December 2014 at 03:20

I am working on a text analytics project on a large volume of complaints data. One of the issues with the data is that you get multiple synonyms of the same word, e.g. bill, billing, billed, bills etc. Normally I would create a word frequency list, manually match the obvious ones, and then map every synonym instance in the original corpus back to the main word, e.g. billing, billed, bills -> bill (as it is all bill related). I have a nifty piece of code for that which someone on here helped me with.

Recently I have been playing around with the idea of using a string distance algorithm to make my life easier by identifying possible synonyms. I am using the stringdist package, but I am at a loss as to how to implement the test efficiently. Basically I need a matrix with all words along both rows and columns and, at each intersection, the result of the stringdist function for that pair.

I use the stringdist function as follows:

library(stringdist)
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

Gives a similarity score of 0.955

So from a word list of a,b,c, I want to get to (values purely indicative):

   a    b    c
a  1    0.4  0.4
b  0.4  1    0.4
c  0.4  0.4  1

Where the intersection is the result of the stringdist function.
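
For readers after the same layout, here is a minimal sketch of one way to get it: the stringdist package also provides stringdistmatrix(), which computes all pairwise distances in one call. The word list below is made up for illustration, and useNames = "strings" assumes a reasonably recent version of the package.

```r
library(stringdist)

# Hypothetical word list standing in for the terms of the corpus
words <- c("bill", "billing", "billed", "bills", "payment")

# Pairwise Jaro-Winkler distances between every pair of words;
# useNames = "strings" labels rows and columns with the words themselves
dist.mat <- stringdistmatrix(words, words, method = "jw", p = 0.1,
                             useNames = "strings")

# Convert distances to similarities, as in the single-pair call above
sim.mat <- 1 - dist.mat
round(sim.mat, 2)
```

The diagonal is 1 because every word has distance 0 to itself. For a very long word list the full matrix grows quadratically, so it may be worth restricting it to the more frequent words first.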

Alternatively I can also work with:

a  a  1
a  b  0.4
a  c  0.4
b  a  0.4
b  b  1
b  c  0.4
c  a  0.4
c  b  0.4
c  c  1

The only problem with the latter is the duplicates, e.g. a, b and b, a, one of which could be eliminated as both yield the same result.

So clever R coders, please help me. I guess the answer is somewhere in matrix functions, but I am not a good enough R coder.

Cheers

There are 2 answers

Hisham Mahrous On 18 September 2015 at 01:00

I suggest you use a stemmer; you will find one in the tm package. If you do need a distance measure, you can use cosine similarity rather than Jaro-Winkler.
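
As a rough sketch of both suggestions: the example below calls wordStem() from the SnowballC package directly (as far as I know this is the stemmer that tm's stemDocument() builds on), and uses the "cosine" method of stringdist, which compares q-gram profiles. The word list and the choice of q = 2 are assumptions for illustration.

```r
library(SnowballC)
library(stringdist)

# Hypothetical variants of the same word
words <- c("bill", "billing", "billed", "bills")

# Porter stemming: the variants should collapse to a common stem
stems <- wordStem(words, language = "porter")
stems

# Cosine distance between the bigram (q = 2) profiles of two words;
# subtract from 1 to get a similarity
cos.sim <- 1 - stringdist("billing", "billed", method = "cosine", q = 2)
cos.sim
```

Stemming handles regular inflections like these without any pairwise comparison, so a distance measure is mainly useful for what the stemmer misses, such as misspellings.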

**RUser** · Accepted Answer · 2016-06-21T00:26:36+00:00

To remove the duplicates as described above:

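# Sort the entries within each row so that (a, b) and (b, a) become identical
dist.mat.tab.sort <- t(apply(dist.mat.tab, 1, sort))
# Keep only the first occurrence of each sorted row
dist.mat.tab <- dist.mat.tab[!duplicated(dist.mat.tab.sort),]

Here dist.mat.tab is the melted distance matrix, i.e. a table with one row per pair of words and its distance.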
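
To make this reproducible end to end, here is a sketch that builds the matrix, melts it with base R, and then removes the duplicates. The word list and the column names word1, word2 and dist are made up for illustration, and sorting only the two word columns (rather than the whole row) is a small variation on the code above.

```r
library(stringdist)

# Hypothetical word list, for illustration only
words <- c("bill", "billing", "billed", "bills")

# Full distance matrix, labelled with the words
dist.mat <- stringdistmatrix(words, words, method = "jw", p = 0.1,
                             useNames = "strings")

# "Melt" to long format with base R: one row per (word1, word2) pair
dist.mat.tab <- as.data.frame(as.table(dist.mat), stringsAsFactors = FALSE)
names(dist.mat.tab) <- c("word1", "word2", "dist")

# Sort the two word columns within each row so (a, b) and (b, a) match,
# then drop the repeated pairs
pair.sort <- t(apply(dist.mat.tab[, c("word1", "word2")], 1, sort))
dist.mat.tab <- dist.mat.tab[!duplicated(pair.sort), ]

# Optionally drop the trivial self-pairs (a, a) as well
dist.mat.tab <- dist.mat.tab[dist.mat.tab$word1 != dist.mat.tab$word2, ]
dist.mat.tab
```

With n distinct words this leaves n(n-1)/2 rows, one per unordered pair, which can then be sorted by dist to surface the most likely candidates.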
