I have 5 vectors with column names, which are similar, but not identical. I am trying to find a way to correct the entries in vector2, vector3, vector4 and vector5, based on the names in vector1.
I have been getting some ideas here and here, leading to the code below. But in the end, I even get stuck comparing the first two vectors, let alone overwriting them.
library(dplyr)
library(fuzzyjoin)
vector1 <- c("something","nothing", "anything", "number4")
vector2 <- c("some thing","no thing","addition", "anything", "number4")
vector3 <- c("some thing wrong","nothing", "anything_")
vector4 <- c("something","nothingg", "anything", "number_4")
vector5 <- c("something","nothing", "anything happening", "number4")
I started out as follows:
apply(adist(x = vector1, y = vector2), 1, which.min)
data.frame(string_to_match = vector1,
           closest_match = vector2[apply(adist(x = vector1, y = vector2), 1, which.min)])
  string_to_match closest_match
1       something    some thing
2         nothing      no thing
3        anything      anything
4         number4       number4
Is there any way to add the distance to this solution and to overwrite the vector based on the distance?
Desired result:
  string_to_match closest_match distance
1       something    some thing        1
2         nothing      no thing        1
3        anything      anything        0
4         number4       number4        0
vector1 <- c("something","nothing", "anything", "number4")
vector2 <- c("something","nothing","addition", "anything", "number4")
vector3 <- c("something","nothing", "anything")
vector4 <- c("something","nothing", "anything", "number4")
vector5 <- c("something","nothing", "anything", "number4")
Is there anyone who can put me on the right track?
fuzzyjoin functions will add the distance metric. You don't need to overwrite if you just select the closest_match column/vector.

Created on 2021-01-06 by the reprex package (v0.3.0)
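For comparison, here is a minimal base R sketch that stays with adist() from the question rather than fuzzyjoin. It is not the code this answer refers to. It first adds a distance column to the table from the question, then overwrites entries that are close enough to a name in vector1. The helper name snap_to_reference and the threshold max_dist = 2 are assumptions made for illustration only.

```r
vector1 <- c("something", "nothing", "anything", "number4")
vector2 <- c("some thing", "no thing", "addition", "anything", "number4")
vector4 <- c("something", "nothingg", "anything", "number_4")

# Table from the question with the distance added.
# Rows of d are vector1 entries, columns are vector2 entries.
d <- adist(vector1, vector2)
idx <- apply(d, 1, which.min)
matches <- data.frame(
  string_to_match = vector1,
  closest_match   = vector2[idx],
  distance        = apply(d, 1, min)
)
print(matches)

# Hypothetical helper (not part of any package): replace each entry of x
# by its nearest name in reference, but only when the edit distance is at
# most max_dist. max_dist = 2 is an assumed threshold; tune it for your data.
snap_to_reference <- function(x, reference, max_dist = 2) {
  d <- adist(x, reference)              # rows: x, columns: reference
  nearest <- apply(d, 1, which.min)     # index of the closest reference name
  best <- apply(d, 1, min)              # distance to that name
  ifelse(best <= max_dist, reference[nearest], x)
}

vector2_fixed <- snap_to_reference(vector2, vector1)
vector4_fixed <- snap_to_reference(vector4, vector1)
print(vector2_fixed)
print(vector4_fixed)
```

With this threshold, "addition" in vector2 is left alone because it is not within 2 edits of any name in vector1. Entries such as "some thing wrong" or "anything happening" are much further from their targets, so they would need a larger max_dist (with a higher risk of false matches) or a different distance method.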