I'm running into an issue with encoding and partial matching.
I have two data frames, A and B. A was read in with UTF-8 encoding and B with Latin-1. This may already be part of the problem, although I'm not sure; it was the only way I knew to import both files properly.
edit: I should clarify. This is just sample data. Both dataframes contain a large number of rows and other columns as well.
A
ID  Name         Expense
1   Mike Adall   3
2   Brian Adams  4
3   Adrián       1
4   Floyd Oid    1
5   Semi Ajayi   4
6   Jomu Aké     3

B
Employee           Category
Lothar Fiend       B2
Rohan Sudarsh      A2
Adrián Silva       A1
Semi Ajayi         A1
Micheal Adall      A1
Jomü Ria Aké       B1
Brian Adams        B2
Floyd Öid Matheus  B1
I've been trying to partially match B$Employee against A$Name to create a new data frame C that includes B$Category. This is the output that I would like.
edit: Along with Category, I would also want to keep all the other columns of both A and B, excluding Employee.
C
ID  Name         Expense  Category
1   Mike Adall   3        A1
2   Brian Adams  4        B2
3   Adrián       1        A1
4   Floyd Oid    1        B1
5   Semi Ajayi   4        A1
6   Jomu Aké     3        B1
So far I've got it to match 80% of the characters using the fuzzyjoin package.
C <- A %>% fuzzy_inner_join(B, by = c(Name = "Employee"))
The main issue seems to be non-ASCII Latin characters such as Ö and ß, and sometimes an accented character at the end of a name, as in 'Aké'. The results vary from name to name.
How could I get it to partially match all the names?
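Since the two data frames arrive in different encodings, one possible first step is to bring every name to UTF-8 and fold the accented letters to plain ASCII before any fuzzy matching. The following is only a sketch: `fold_accents` is a hypothetical helper, and its character map is an assumption that covers just a handful of Latin letters and would need extending for real data. The stringi function `stri_trans_general(x, "Latin-ASCII")` is a more general alternative if that package is available.

```r
# Hypothetical helper: convert to UTF-8, then replace a few accented letters
# with their unaccented counterparts. The letter map is an assumption.
fold_accents <- function(x) {
  # enc2utf8() converts strings whose encoding is declared (e.g. latin1).
  # For undeclared Latin-1 bytes, iconv(x, from = "latin1", to = "UTF-8")
  # may be needed instead.
  x <- enc2utf8(x)
  # The sharp s has no single-letter equivalent, so handle it separately.
  x <- gsub("\u00df", "ss", x, fixed = TRUE)
  chartr(
    "\u00e1\u00e9\u00ed\u00f3\u00fa\u00e4\u00f6\u00fc\u00c1\u00c9\u00cd\u00d3\u00da\u00c4\u00d6\u00dc",
    "aeiouaouAEIOUAOU",
    x
  )
}

# Names written with \u escapes so the example does not depend on file encoding.
folded <- fold_accents(c("Jom\u00fc Ria Ak\u00e9", "Floyd \u00d6id Matheus"))
print(folded)
```

Applying the same function to both A$Name and B$Employee means a later distance calculation no longer counts 'Ö' versus 'O' as an edit.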
In base R, you could use both agrep and adist as follows:
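One way the two functions might be combined, as a sketch rather than a definitive solution: agrep shortlists the employees that approximately contain a given name, and adist with partial = TRUE ranks that shortlist. The helper `best_match` and the 0.5 cutoff are assumptions chosen for the sample data, and the cutoff will likely need tuning on the full data.

```r
# Sample data from the question (non-ASCII letters written as \u escapes).
A <- data.frame(
  ID = 1:6,
  Name = c("Mike Adall", "Brian Adams", "Adri\u00e1n",
           "Floyd Oid", "Semi Ajayi", "Jomu Ak\u00e9"),
  Expense = c(3, 4, 1, 1, 4, 3),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Employee = c("Lothar Fiend", "Rohan Sudarsh", "Adri\u00e1n Silva",
               "Semi Ajayi", "Micheal Adall", "Jom\u00fc Ria Ak\u00e9",
               "Brian Adams", "Floyd \u00d6id Matheus"),
  Category = c("B2", "A2", "A1", "A1", "A1", "B1", "B2", "B1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the index of the closest candidate, or NA.
# max.distance is the fraction of the name's length that may be edited.
best_match <- function(name, candidates, max.distance = 0.5) {
  # Shortlist candidates that approximately contain `name`.
  idx <- agrep(name, candidates, max.distance = max.distance,
               ignore.case = TRUE)
  if (length(idx) == 0) return(NA_integer_)
  # Rank the shortlist by partial edit distance and keep the closest.
  d <- adist(name, candidates[idx], partial = TRUE, ignore.case = TRUE)
  idx[which.min(d)]
}

best <- vapply(A$Name, best_match, integer(1), candidates = B$Employee)

# Keep every column of A and add the matched Category (NA when nothing matched).
C <- cbind(A, Category = B$Category[best])
print(C)
```

Any other columns of B can be carried over the same way, by indexing them with `best`.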
EDIT: Using the stringdist package, you could do:
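A minimal sketch with stringdist, assuming the package is installed: `amatch` returns, for each name, the position of the closest employee within a maximum distance. The Jaro-Winkler method and the 0.3 cutoff are assumptions picked for illustration, not values taken from the original answer.

```r
library(stringdist)

# Sample data from the question (non-ASCII letters written as \u escapes).
A <- data.frame(
  ID = 1:6,
  Name = c("Mike Adall", "Brian Adams", "Adri\u00e1n",
           "Floyd Oid", "Semi Ajayi", "Jomu Ak\u00e9"),
  Expense = c(3, 4, 1, 1, 4, 3),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Employee = c("Lothar Fiend", "Rohan Sudarsh", "Adri\u00e1n Silva",
               "Semi Ajayi", "Micheal Adall", "Jom\u00fc Ria Ak\u00e9",
               "Brian Adams", "Floyd \u00d6id Matheus"),
  Category = c("B2", "A2", "A1", "A1", "A1", "B1", "B2", "B1"),
  stringsAsFactors = FALSE
)

# Position in B of the closest employee for each name, or NA if no employee
# is within maxDist. Lower-casing both sides makes the match case-insensitive.
idx <- amatch(tolower(A$Name), tolower(B$Employee),
              method = "jw", p = 0.1, maxDist = 0.3)

# Keep every column of A and add the matched Category.
C <- cbind(A, Category = B$Category[idx])
print(C)
```

Names that come back as NA found no employee within the cutoff; raising maxDist or folding accents beforehand are two things to try in that case.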