The R package {stringdist} (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf) offers many string distance methods. Is it possible to include user-defined match items, via regex or some other means, in the Jaro or Jaro-Winkler distance calculations? If not, are there any other packages that provide this kind of functionality?
For example, take the strings "USA Starwar Corporation" (a), "US Starwar Corporation" (b), and "United States Starwar Corporation" (c).
Currently the Jaro distances for the pairs ((a),(b)), ((b),(c)), and ((a),(c)) are 0.01449275, 0.2020202, and 0.216513, respectively.
Is there any way to define that "USA" matches "US" matches "United States" in the calculation, so that the distances would be 0, 0, 0?
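
A minimal sketch of the setup, assuming the numbers above come from stringdist() with method = "jw" and p = 0 (plain Jaro). The normalize_aliases() helper and its alias pattern are hypothetical and not part of {stringdist}; they only illustrate the desired behavior by rewriting each alias to one canonical token before the distance is computed:

```r
library(stringdist)

s_a <- "USA Starwar Corporation"
s_b <- "US Starwar Corporation"
s_c <- "United States Starwar Corporation"

# Plain Jaro distance: method "jw" with the Winkler prefix weight p set to 0
jaro <- function(x, y) stringdist(x, y, method = "jw", p = 0)

# Distances on the raw strings, in the order ((a),(b)), ((b),(c)), ((a),(c))
d_raw <- c(ab = jaro(s_a, s_b), bc = jaro(s_b, s_c), ac = jaro(s_a, s_c))

# Hypothetical helper (an assumption, not a stringdist feature):
# collapse the user-defined aliases to a single canonical token.
normalize_aliases <- function(x) {
  gsub("\\b(USA|US|United States)\\b", "US", x, perl = TRUE)
}

n_a <- normalize_aliases(s_a)
n_b <- normalize_aliases(s_b)
n_c <- normalize_aliases(s_c)

# Distances after normalization: all three strings are now identical
d_norm <- c(ab = jaro(n_a, n_b), bc = jaro(n_b, n_c), ac = jaro(n_a, n_c))

print(d_raw)
print(d_norm)
```

This kind of pre-processing gives the 0, 0, 0 result, but it happens outside the distance function, so the open question is whether such match rules can be supplied to the Jaro or Jaro-Winkler calculation itself.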
Thanks!