I am attempting to join 2 tibbles (dplyr::inner_join) based on a UUID column in each. Oddly, instead of the expected one-to-one relationship, the join produces a many-to-many relationship (as stated in a warning message and also indicated by the resulting tibble containing substantially more records than either of the original tibbles). In digging into the join results, it is apparent that the tibbles are being joined on unmatched UUIDs and that is producing the incorrect many-to-many relationship. Changing the data type of the UUID columns from UUID to character produces the correct one-to-one relationship (as indicated by the resulting tibble containing the same number of records as the two original tibbles).
I shared my code with a colleague and when he ran the code (i.e., inner joining by the UUID columns, data type for each column = UUID), it produced the expected one-to-one relationship. The only obvious difference between our respective computing environments is the OS: mine is Ubuntu 22.04.4 and his is Mac OS Ventura. We are both using RStudio 2023.12.1+402.
I'd appreciate any insights into what is causing this OS-related difference, and how to avoid the mis-joining of the UUID columns when using a Linux OS.
Here is a minimum reproducible example to demonstrate the issue:
# load packages
library(dplyr)
library(uuid)
# create 2 datasets
id1 <- UUIDgenerate(n = 100000, output = "uuid")
data1a <- tibble(id1, type_m = "aaaa")
data1b <- data1a %>%
rename(type_n = type_m) %>%
mutate(type_n = replace(type_n, type_n == "aaaa", "bbbb"))
# join on id1 column with "UUID" data type
data1c <- data1a %>%
inner_join(data1b, by = join_by(id1), keep = TRUE)
# identify mis-joined records
data1c_troubleshoot <- data1c %>% filter(id1.x != id1.y)
# change data type of join_by column to "character" and join
data1a <- data1a %>% mutate(id1 = as.character(id1))
data1b <- data1b %>% mutate(id1 = as.character(id1))
data1c <- data1a %>%
inner_join(data1b, by = join_by(id1), keep = TRUE)
data1c_troubleshoot <- data1c %>% filter(id1.x != id1.y)
I ran the above code on an AWS Linux/Ubuntu (18.04.1-Ubuntu) server with R version 4.0.2 (2020-06-22) and had the same issues/improper joining of UUID data type. A many to many relationship occurred.