Merge 2 data frames by the columns that do not match exactly

79 views Asked by At

I have 2 data frames. I am trying to merge/join them together while specifying how I want rows to align. Mock data below.

df <- data.frame(Race = c("White", "NHPI", "AA"),
                 Cases = c(20, 4, 7)
                 )
df
> df
   Race Cases
1 White    20
2  NHPI     4
3    AA     7

df2 <- data.frame(Race2 = c("African American", "White", "Hawaiian or Pacific Islander"),
                  TotalPopulation = c(200, 400, 50))
df2
> df2
                         Race2 TotalPopulation
1             African American             200
2                        White             400
3 Hawaiian or Pacific Islander              50

What I would like to do is merge the data while specifying which row matches with the other dataset- for example, I want the merge to align "African American" and "AA". Ideal output would look like this-

   Race Cases                        Race2 TotalPopulation
1    AA     7             African American             200
2 White    20                        White             400
3  NHPI     4 Hawaiian or Pacific Islander              50

I am slightly newer to coding, so if you could walk through the code that would be great. Thanks so much.

2

There are 2 answers

3
Jilber Urbina On BEST ANSWER

Based on @M-- comment, you can do:

df %>%
  left_join(df2 %>% 
               mutate(Race = case_when(
                 Race2 == "African American" ~ "AA",
                 Race2 == "Hawaiian or Pacific Islander" ~ "NHPI",
                 .default = Race2
               )))

   Race Cases                        Race2 TotalPopulation
1 White    20                        White             400
2  NHPI     4 Hawaiian or Pacific Islander              50
3    AA     7             African American             200

If you can set full name and then extract abbreviations you can do it faster, however, note you have NHPI and Hawaiian or Pacific Islander and there is not match. If you can include N in Hawaiian or Pacific Islander, then you can use:

df %>% 
  left_join(df2 %>% 
              mutate(Race = gsub("[:a-z:]*\\s*", "", Race2),
                     Race = sub("W", "White", Race)))

This work under the assumption of the df2 is as follows:

df2 <- data.frame(Race2 = c("African American", "White", "Naaaa Hawaiian or Pacific Islander"),
                  TotalPopulation = c(200, 400, 50))
0
M-- On

Here's an attempt at matching the columns by the first letter of each Race name. As I said in my comment, matching these without any "human supervision" is error prone, so keep that in mind.

library(dplyr)
library(powerjoin)
library(stringdist)

power_full_join(mutate(df1, Race_match = gsub("[^A-Z]", "", df1$Race)),
                mutate(df2, Race_match = gsub("[^A-Z]", "", df2$Race)), 
                           by = ~ stringdist(.x$Race_match, .y$Race_match) < 2) %>% 
  select(-contains("Race_match"))
#>    Race Cases                        Race2 TotalPopulation
#> 1 White    20                        White             400
#> 2  NHPI     4 Hawaiian or Pacific Islander              50
#> 3    AA     7             African American             200

Created on 2023-12-13 with reprex v2.0.2

Explanation:

Here, I am adding a column to each dataframe with the capital letters extracted from Race columns, using dplyr::mutate.

mutate(df1, Race_match = gsub("[^A-Z]", "", df1$Race))
#>    Race Cases Race_match
#> 1 White    20          W
#> 2  NHPI     4       NHPI
#> 3    AA     7         AA

mutate(df2, Race_match = gsub("[^A-Z]", "", df2$Race2))
#>                          Race2 TotalPopulation Race_match
#> 1             African American             200         AA
#> 2                        White             400          W
#> 3 Hawaiian or Pacific Islander              50        HPI

As you can see, Race_match columns are not a 1 to 1 match (see HPI and NHPI). But they are close enough (what do I mean by close enough?).

We can use powerjoin library along with stringdist to define where/how do we wanna join the dataframes. Here, I am joining the rows if the difference in Race_match columns is less than 2. So, that's what I considered close enough!

power_full_join(mutate(df1, Race_match = gsub("[^A-Z]", "", df1$Race)),
                mutate(df2, Race_match = gsub("[^A-Z]", "", df2$Race)), 
                           by = ~ stringdist(.x$Race_match, .y$Race_match) < 2)
#>    Race Cases Race_match.x                        Race2 TotalPopulation   Race_match.y
#> 1 White    20            W                        White             400     W
#> 2  NHPI     4         NHPI Hawaiian or Pacific Islander              50   HPI
#> 3    AA     7           AA             African American             200    AA

Lastly, I drop Race_match columns, using dplyr::select, as they were created just to help with the join.