Merging data.frame rows based on similar strings in R


I have a data.frame with multiple columns. The first column contains company names. These were entered by users, and many values are similar strings that represent the same entity, for example "Company A Pty.", "Company A Pty. Ltd." and "Company A Georgia".

I would like to replace these variations with a single common string, "Company A", in another column. I have looked at stringdist and other functions, but they don't seem to support this use case.

This would then allow me to summarise/aggregate based on that common string.

Third-party tools such as Google Refine would work, but I would prefer to stay within R.

There is 1 answer

bartektartanus On

Use the agrep function.

Initial data:

x <- c("Company A Pty.","BigData GMBH","Company A Pty. Ltd.","Red Pants Warsaw", "Company A Georgia", "Red Pants Ltd", "BlueSocks House")

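The first argument is the pattern you want to look for (e.g. x[1]), and the second is the vector to search in. max (which R partially matches to the full argument name, max.distance) is the maximum distance allowed for a match; a value below 1 is read as a fraction of the pattern length. value=TRUE means we want the matching strings back instead of their indices in the vector.

If there is no match, you can increase max, but be careful! More is not always better: a larger value also lets in unrelated strings, as the last call below shows.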

agrep(x[1],x, max=0.1, value=TRUE)
## [1] "Company A Pty."      "Company A Pty. Ltd."
agrep(x[1],x, max=0.3, value=TRUE)
## [1] "Company A Pty."      "Company A Pty. Ltd." "Company A Georgia"  
agrep(x[1],x, max=0.7, value=TRUE)
## [1] "Company A Pty."      "Company A Pty. Ltd." "Company A Georgia"   "Red Pants Ltd" 
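
To get from calls like these to the extra column the question asks for, one option is to loop over a short list of canonical names and label every row that agrepl matches. This is only a sketch under assumptions: the keys vector is a hand-written, hypothetical list of canonical names, revenue is a made-up column used to demonstrate aggregation, and company_clean is an illustrative column name. Rows that match no key keep their original name.

```r
# Example data; `revenue` is a made-up column used only to show aggregation.
df <- data.frame(
  company = c("Company A Pty.", "BigData GMBH", "Company A Pty. Ltd.",
              "Red Pants Warsaw", "Company A Georgia", "Red Pants Ltd",
              "BlueSocks House"),
  revenue = c(10, 20, 30, 40, 50, 60, 70),
  stringsAsFactors = FALSE
)

# Hypothetical, hand-written list of canonical names (an assumption).
keys <- c("Company A", "Red Pants")

# Start from the raw name so unmatched rows keep their own label.
df$company_clean <- df$company
for (key in keys) {
  # agrepl returns a logical vector: TRUE where `key` approximately
  # matches somewhere inside the company name.
  hits <- agrepl(key, df$company, max.distance = 0.1)
  df$company_clean[hits] <- key
}

# Aggregate on the cleaned name.
totals <- aggregate(revenue ~ company_clean, data = df, FUN = sum)
totals
```

Because the key is matched anywhere inside each name, a short key can also catch unrelated companies whose names happen to contain it, so it is worth inspecting the labelled rows before aggregating.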

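What's more, matching is not symmetric. Using "Red Pants Warsaw" (x[4]) as the pattern does not find "Red Pants Ltd" (x[6]), but the other way round works: using x[6] as the pattern does find x[4]. agrep looks for an approximate occurrence of the pattern inside each string, so it matters which string plays the role of the pattern. Be aware of this.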

agrep(x[4],x, max=0.2, value=TRUE)
## [1] "Red Pants Warsaw"
agrep(x[6],x, max=0.2, value=TRUE)
## [1] "Red Pants Warsaw" "Red Pants Ltd"
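
If the asymmetry is a problem, one possible workaround (a sketch, not part of the original answer) is to run every string as a pattern against every other string and treat two names as related when either direction matches. The matrix names m and sym are illustrative, and 0.2 is simply the threshold used in the calls above.

```r
x <- c("Company A Pty.", "BigData GMBH", "Company A Pty. Ltd.",
       "Red Pants Warsaw", "Company A Georgia", "Red Pants Ltd",
       "BlueSocks House")

# Column j holds the result of using x[j] as the pattern, so
# m[i, j] is TRUE when pattern x[j] approximately matches inside x[i].
m <- unname(sapply(x, function(p) agrepl(p, x, max.distance = 0.2)))

# Symmetric version: related if either direction matched.
sym <- m | t(m)

m[4, 6]    # pattern x[6] ("Red Pants Ltd") against x[4]
m[6, 4]    # pattern x[4] ("Red Pants Warsaw") against x[6]
sym[6, 4]  # symmetric relation between the two
```

This costs one agrepl call per string, which is fine for a few thousand names but may become slow for much larger vectors.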