Remove duplicated rows within and between data frames stored in a list

49 views Asked by At

A list of data frames:

my_list <- list(structure(list("_uuid" = c("xxxyz", 
                                      "xxxyz", "zzuio", "iiopz"), country = c("USA", 
                                                                              "USA", "Canada", "Switzerland")), class = "data.frame", row.names = c(NA, -4L)), 
                structure(list("_uuid" = c("xxxyz", "ppuip", 
                                      "zzuio"), country = c("USA", "Canada", "Canada")), class = "data.frame", row.names = c(NA, 
                                                                                                                             -3L)))
my_list
[[1]]
  _uuid     country
1 xxxyz         USA
2 xxxyz         USA
3 zzuio      Canada
4 iiopz Switzerland

[[2]]
  _uuid country
1 xxxyz     USA
2 ppuip  Canada
3 zzuio  Canada

I want to remove duplicated rows both within and between the data frames stored in that list.

This works to remove duplicates within each data frame:

my_list <- lapply(my_list, function(z) z[!duplicated(z[["_uuid"]]),])
my_list
[[1]]
  _uuid     country
1 xxxyz         USA
3 zzuio      Canada
4 iiopz Switzerland

[[2]]
  _uuid country
1 xxxyz     USA
2 ppuip  Canada
3 zzuio  Canada

But there are still duplicates between data frames. I want to remove them all, with the following desired output:

[[1]]
 _uuid   country
 iiopz   Switzerland

 [[2]]
 _uuid   country
 xxxyz   USA
 zzuio   Canada
 ppuip   Canada
 

Notes:

  1. I want to eliminate duplicates on _uuid (other variables can be duplicated)
  2. I need a solution where it is not needed to merge the data frames before checking for duplicates
  3. If possible, I wish to retain the last observation. For example, in the desired output above, "zzuio Canada" existed in both df, but was kept in the last df only, that is, df 2.
  4. I have more than 100 dfs, with variable names that don't necessarily match between dfs. That said, the id is always called "_uuid"
  5. I need to reassign the result to the same object (in the case above, my_list)
2

There are 2 answers

2
r2evans On BEST ANSWER

Here's a shot, starting with a reduction and then Map-applying it to the original list of frames.

previous_ids <- rev(Reduce(
  function(prev, this) unique(c(prev, this$id)),
  rev(my_list), init = character(0), accumulate = TRUE))[-1]
previous_ids
# [[1]]
# [1] "xxxyz" "ppuip" "zzuio"
# [[2]]
# character(0)

This gives us the ids from frames further in my_list that we need to remove "here". We interpret this to mean that for the last frame, we have no IDs we need to remove from other frames; in the first frame, we have 3 ids that are seen later in the list, so they need to be removed from "this" (first) frame. (Side note: perhaps the name previous_ids is a misnomer ...)

With this, we can do simply:

Map(my_list, previous_ids,
    f = function(dat, rmid) {
      dat[!duplicated(dat$id, fromLast = TRUE) & !dat$id %in% rmid,]
    })
# [[1]]
# [1] id      country
# <0 rows> (or 0-length row.names)
# [[2]]
#      id country
# 1 xxxyz     USA
# 2 ppuip  Canada
# 3 zzuio  Canada

Using your updated data, this still works. The only thing is since your ID field is non-standard (starting with a _, R does not like that), we need to either use backticks dat$`_uuid` or use [[ dat[["_uuid"]].


previous_ids <- rev(Reduce(
  function(prev, this) unique(c(prev, this[["_uuid"]])),
  rev(my_list), init = character(0), accumulate = TRUE))[-1]
previous_ids

Map(my_list, previous_ids,
    f = function(dat, rmid) {
      dat[!duplicated(dat[["_uuid"]], fromLast = TRUE) & !dat[["_uuid"]] %in% rmid,]
    })
# [[1]]
#   _uuid     country
# 4 iiopz Switzerland
# [[2]]
#   _uuid country
# 1 xxxyz     USA
# 2 ppuip  Canada
# 3 zzuio  Canada
5
jay.sf On

We could add a temporary column ".rm" (for remove), created from unlisting the "id" column and scanning it for rev duplicated. This gives a vector that can be splitted along repeated consecutive integers each nrow times for each sub-list and added to the sub-lists using `[<-`(). Finally we subset for not TRUEs in ".rm" and remove that temporary column.

> fn <- \(x, idcol) {
+   Map(`[<-`, x, '.rm', value=
+         lapply(x, `[[`, idcol) |> 
+         unlist() |> 
+         rev() |> 
+         duplicated() |> 
+         rev() |> 
+         split(
+           sapply(x, nrow) |> {
+             \(.) mapply(rep.int, seq_along(.), .) |> unlist()
+           }()
+         )
+   ) |> 
+     lapply(subset, !.rm, select=-.rm)
+ }
> fn(my_list, '_uuid')
[[1]]
  _uuid     country
4 iiopz Switzerland

[[2]]
  _uuid country
1 xxxyz     USA
2 ppuip  Canada
3 zzuio  Canada

Data:

> dput(my_list)
list(structure(list(`_uuid` = c("xxxyz", "xxxyz", "zzuio", "iiopz"
), country = c("USA", "USA", "Canada", "Switzerland")), class = "data.frame", row.names = c(NA, 
-4L)), structure(list(`_uuid` = c("xxxyz", "ppuip", "zzuio"), 
    country = c("USA", "Canada", "Canada")), class = "data.frame", row.names = c(NA, 
-3L)))