I have a data.frame with multiple columns. The first column contains company names. These have been entered by users, and many values contain similar strings representing the same entity, for example: "Company A Pty.", "Company A Pty. Ltd.", "Company A Georgia".
I would like to replace these variations with a single common string, "Company A", in another column. I have looked at stringdist and other functions, but they don't seem to support this use case.
This would then allow me to summarise/aggregate based on that common string.
Third-party tools such as Google Refine would work, but I would prefer to operate within R.
Use the agrep function.
Initial data:
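As a minimal sketch of such a call, assuming a made-up vector x of company names (the values are illustrative, not real data) and an arbitrarily chosen maximum distance of 2:

```r
# Hypothetical sample data: user-entered company names (an assumption for illustration)
x <- c("Company A Pty", "Company A Pty Ltd", "Company A Georgia",
       "Red Pants Warsaw", "Blue Socks", "Red Pants Ltd")

# Indices of the elements of x that approximately contain the pattern x[1]
idx <- agrep(x[1], x, max.distance = 2)
print(idx)

# Same search, returning the matching strings instead of their indices
matches <- agrep(x[1], x, max.distance = 2, value = TRUE)
print(matches)
```

The pattern always matches itself, so x[1] appears in its own result; which of the other names are picked up depends on the distance you allow.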
The first argument is the pattern you want to look for in the data (e.g. x[1]), the second is the vector you want to search in, and max (agrep's max.distance argument) is the maximum distance by which two strings can differ. Setting value to TRUE means that we want to obtain the matching strings instead of their indices in the vector.
If there is no match, you can increase max, but be careful! More is not always better.
What's more, this is not symmetrical. "Red Pants Warsaw" (x[4]) was not matched to "Red Pants Ltd" (x[6]), but it worked the other way: x[6] was matched to x[4]. Be aware of this.
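To get from these matches to the common column asked about in the question, one possible approach is sketched below. It assumes you can supply a short list of canonical names yourself. The data frame df, its columns company and sales, and the vector canonical are all hypothetical. Because agrep looks for an approximate occurrence of the pattern within each string, using the short canonical name as the pattern avoids the asymmetry described above.

```r
# Hypothetical data frame; column names and values are assumptions for illustration
df <- data.frame(
  company = c("Company A Pty", "Company A Pty Ltd", "Company A Georgia",
              "Red Pants Warsaw", "Red Pants Ltd"),
  sales = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Hypothetical list of the common strings we want to end up with
canonical <- c("Company A", "Red Pants")

# Start with no assignment, then label every row that approximately
# contains one of the canonical names
df$common <- NA_character_
for (name in canonical) {
  hits <- agrep(name, df$company, max.distance = 1)
  df$common[hits] <- name
}

# Summarise on the common string
totals <- aggregate(sales ~ common, data = df, FUN = sum)
print(totals)
```

If two canonical names could match the same row, the later one in the loop overwrites the earlier one, so it is worth checking the common column by hand before aggregating. Rows that match nothing stay NA and are dropped by aggregate.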