Generate new unique ID numbers while excluding previously generated ID numbers in R


I would like to generate unique IDs for rows in my database. I will be adding entries to this database on an ongoing basis so I'll need to generate new IDs in tandem. While my database is relatively small and the chance of duplicating random IDs is minuscule, I still want to build in a programmatic fail-safe to ensure that I never generate an ID that has already been used in the past.
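For context on just how small that chance is, the birthday approximation gives the collision probability directly. A minimal sketch (assuming the 5-byte IDs used below, i.e. a keyspace of 256^5 possible values):

# Birthday approximation: probability of at least one collision
# among k random IDs drawn uniformly from a keyspace of size N
collision_prob <- function(k, N) 1 - exp(-k * (k - 1) / (2 * N))

collision_prob(k = 1e4, N = 256^5)  # ~4.5e-05 for ten thousand rows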

For starters, here are some sample data that I can use to build an example database:

library(tidyverse)
library(ids)
library(babynames)
    
database <- data.frame(rid = random_id(5, 5), first_name = sample(babynames$name, 5))

print(database)
          rid first_name
1  07282b1da2      Sarit
2  3c2afbb0c3        Aly
3  f1414cd5bf    Maedean
4  9a311a145e    Teriana
5  688557399a    Dreyton

And here is some sample data that I can use to represent new data that will be appended to the existing database:

new_data <- data.frame(first_name = sample(babynames$name, 5))

print(new_data)

 first_name
1    Hamzeh
2   Mahmoud
3   Matelyn
4    Camila
5     Renae

Now, what I want is to bind a new column of randomly generated IDs, using the random_id() function, while simultaneously checking that the newly generated IDs don't match any existing IDs in the database object. If the generator produces an ID that already exists, it should ideally keep generating replacements until a truly unique ID is created.

Any help would be much appreciated!

UPDATE

I've thought of a partial solution, but it's still limited. I could generate new IDs and then use a for() loop to test whether any of the newly generated IDs are already present in the existing database, and regenerate any that are. For example...

new_data$rid <- random_id(nrow(new_data), 5)

# Replace any newly generated ID that already exists in the database
for (i in seq_len(nrow(new_data))) {
  if (new_data$rid[i] %in% database$rid) {
    new_data$rid[i] <- random_id(1, 5)
  }
}

The problem with this approach is that a replacement ID is itself never re-checked, so to be safe I would need an endless stream of nested if statements, each testing the latest value against the original database again. I need a process that keeps testing until a value not found in the original database is generated.
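A while() loop seems like one way to express that "keep testing" idea. This is only a rough sketch of what I have in mind (it also checks against the other new rows, not just the database):

for (i in seq_len(nrow(new_data))) {
  # keep drawing until this ID collides with neither the database
  # nor any of the other newly generated IDs
  while (new_data$rid[i] %in% c(database$rid, new_data$rid[-i])) {
    new_data$rid[i] <- random_id(1, 5)
  }
}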


There are 2 answers

manotheshark (best answer)

Use of ids::uuid() would likely preclude having to check for duplicate ID values at all. In fact, if you were to generate 10 trillion UUIDs, there would be something on the order of a 0.00000006 chance of two of them being the same, per "What is a UUID?"
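For instance, a sketch of swapping uuid() in for random_id() (uuid() returns random version-4 UUIDs by default):

library(ids)
new_data$rid <- uuid(nrow(new_data))  # e.g. "f7bfd54e-0b92-4b8a-bd55-cc1d59dd3fa2"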

Here is a base R function that will quickly check for duplicate values without needing any explicit iteration:

anyDuplicated(1:4)
[1] 0

anyDuplicated(c(1:4,1))
[1] 5

The first result above shows there are no duplicate values. The second shows that element 5 is a duplicate, as 1 appears twice. Below is how to check without iterating; new_data has database$rid copied into it so that all five values start out as duplicates. The loop repeats until every rid is unique, but note that it presumes the existing database$rid values are already unique.

library(ids)
set.seed(7)
new_data$rid <- database$rid  # force collisions for demonstration
repeat {
  # index of the first duplicate in the combined vector (0 if none)
  duplicates <- anyDuplicated(c(database$rid, new_data$rid))
  if (duplicates == 0L) {
    break
  }
  # offset into new_data, since database$rid occupies the first positions
  new_data$rid[duplicates - nrow(database)] <- random_id(1, 5)
}

All new_data$rid have been replaced with unique values.

rbind(database, new_data)

          rid first_name
1  07282b1da2      Sarit
2  3c2afbb0c3        Aly
3  f1414cd5bf    Maedean
4  9a311a145e    Teriana
5  688557399a    Dreyton
6  52f494c714     Hamzeh
7  ac4f522860    Mahmoud
8  ffe74d535b    Matelyn
9  e3dccc4a8e     Camila
10 e0839a0d34      Renae
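As a quick check, using anyDuplicated() from above, the combined table can be verified to contain no duplicate rid values:

anyDuplicated(rbind(database, new_data)$rid)
#> [1] 0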
Emman

This answer is inspired by @manotheshark's answer, with 2 major changes:

  1. It's a function.
  2. I changed the mechanism of replacing the duplicates: instead of looping and replacing one duplicate per iteration as in @manotheshark's answer, I replace them in bigger chunks.

library(ids)

generate_random_unique_ids <- function(n) {
  vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
  repeat {
    # logical vector flagging each value that repeats an earlier one
    duplicates <- duplicated(vec_ids)
    if (!any(duplicates)) {
      break
    }
    # regenerate only the duplicated positions, then check again
    vec_ids[duplicates] <- ids::random_id(n = sum(duplicates), bytes = 4, use_openssl = FALSE)
  }
  vec_ids
}
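Note that this function only guarantees uniqueness within the vector it returns. To also exclude IDs already present in database$rid, one could add an exclusion set; a possible variant (the exclude argument is my own addition, not part of the original answer):

generate_random_unique_ids2 <- function(n, exclude = character(0)) {
  vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
  repeat {
    # flag values duplicated within the vector or already present in `exclude`
    bad <- duplicated(vec_ids) | vec_ids %in% exclude
    if (!any(bad)) {
      break
    }
    vec_ids[bad] <- ids::random_id(n = sum(bad), bytes = 4, use_openssl = FALSE)
  }
  vec_ids
}

new_data$rid <- generate_random_unique_ids2(nrow(new_data), exclude = database$rid)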

Some example timings:

library(tictoc)

tic()
v_1e6 <- generate_random_unique_ids(1e6)
toc()
#> 7.14 sec elapsed

tic()
v_3e7 <- generate_random_unique_ids(3e7)
toc()
#> 296.42 sec elapsed

Would love to learn if there's a way to optimize this function to get speedier execution times.