Row bind dataframes and keep unique IDs incrementing


I have 2 datasets that I am about to stack together. Please note that the record_ids for the first dataset also start at 1.

record_id       any_column
     591             a
     591             b
     591             c
     592             d
     592             e
     592             f
     592             g
     593             h
     593             i
     593             j

2nd dataset

record_id       any_column
     1             a
     1             b
     1             c
     2             d
     2             e
     2             f
     2             g
     3             h
     3             i
     3             j 

When stacking, I want the record_ids for the 2nd dataset to start right where the record_ids for the 1st dataset ended. For instance, in this case the 1st dataset ends at 593, so I would like to convert the ids for the 2nd dataset to start at 594 and so forth. How can I best approach this? Thanks.

The expected output for the stacked dataset is as follows:

record_id       any_column        grand_id
     591             a               591
     591             b               591
     591             c               591
     592             d               592
     592             e               592
     592             f               592
     592             g               592
     593             h               593
     593             i               593
     593             j               593
       1             a               594
       1             b               594
       1             c               594
       2             d               595
       2             e               595
       2             f               595
       2             g               595
       3             h               596
       3             i               596
       3             j               596

There are 4 answers

Ronak Shah (best answer)

You can take the max of record_id from df1 and add it to the record_id of df2.

rbind(transform(df1, grand_id = record_id), 
      transform(df2, grand_id = record_id + max(df1$record_id)))

If you prefer dplyr:

library(dplyr)

df1 %>%
  mutate(grand_id = record_id) %>%
  bind_rows(df2 %>% mutate(grand_id = record_id + max(df1$record_id)))

#   record_id any_column grand_id
#1        591          a      591
#2        591          b      591
#3        591          c      591
#4        592          d      592
#5        592          e      592
#6        592          f      592
#7        592          g      592
#8        593          h      593
#9        593          i      593
#10       593          j      593
#11         1          a      594
#12         1          b      594
#13         1          c      594
#14         2          d      595
#15         2          e      595
#16         2          f      595
#17         2          g      595
#18         3          h      596
#19         3          i      596
#20         3          j      596
Ххх ххХ

I think you should use an if statement: once the 1st dataset ends, save its last id value and increment from there for the 2nd dataset. Don't forget to check whether any record_ids in the 2nd dataset overlap with those already used.
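
A minimal base-R sketch of that idea, assuming the two data frames are named df1 and df2 with the columns shown in the question (the object names are illustrative):

offset <- max(df1$record_id)                     # last id used by the 1st dataset
df2_shifted <- df2
df2_shifted$record_id <- df2$record_id + offset  # 1 -> 594, 2 -> 595, ...

# sanity check: make sure the shifted ids don't collide with ids already in df1
stopifnot(!any(df2_shifted$record_id %in% df1$record_id))

stacked <- rbind(df1, df2_shifted)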

wibeasley

Can the id values of the top dataset shift? If so, another option is to use dplyr::dense_rank(). This might make it easier to generalize to more than two datasets, and it will preserve the relative order within and between the input datasets.

ds_stack <- 
  ds_top %>% 
  # dplyr::union_all(ds_middle) %>% # Add more datasets in here if necessary
  dplyr::union_all(ds_bottom) %>% 
  dplyr::mutate(
    grand_id = dplyr::dense_rank(paste0(source, "-", record_id))
  )

# # A tibble: 20 x 4
#   record_id any_column source grand_id
#       <int> <chr>       <int>    <int>
# 1       591 a               1        1
# 2       591 b               1        1
# 3       591 c               1        1
# 4       592 d               1        2
# 5       592 e               1        2
# 6       592 f               1        2
# 7       592 g               1        2
# 8       593 h               1        3
# 9       593 i               1        3
# 10      593 j               1        3
# 11        1 a               2        4
# 12        1 b               2        4
# 13        1 c               2        4
# 14        2 d               2        5
# 15        2 e               2        5
# 16        2 f               2        5
# 17        2 g               2        5
# 18        3 h               2        6
# 19        3 i               2        6
# 20        3 j               2        6

Input datasets:

library("magrittr")
ds_top <- 
  tibble::tribble(
    ~record_id, ~any_column,
    591L,"a",
    591L,"b",
    591L,"c",
    592L,"d",
    592L,"e",
    592L,"f",
    592L,"g",
    593L,"h",
    593L,"i",
    593L,"j"
  ) %>% 
  dplyr::mutate(
    source = 1L
  )

ds_bottom <- 
  tibble::tribble(
    ~record_id, ~any_column,
    1L, "a",
    1L, "b",
    1L, "c",
    2L, "d",
    2L, "e",
    2L, "f",
    2L, "g",
    3L, "h",
    3L, "i",
    3L, "j"
  ) %>% 
  dplyr::mutate(
    source = 2L
  )
wibeasley

Here's a slight variation of a previous answer that generalizes to an arbitrary number of input datasets. Since you don't know the number of input datasets ahead of time, you won't be able to hardcode a fixed number of dplyr::union_all() calls.

If the input datasets are elements in a list called l, then purrr::map_dfr() handles the unions implicitly.

l %>% 
  purrr::map_dfr(~.) %>% 
  dplyr::mutate(
    grand_id = dplyr::dense_rank(paste0(source, "-", record_id))
  )
# # A tibble: 15 x 4
#   source record_id any_column grand_id
#    <int>     <int> <chr>         <int>
# 1      1       591 a                 1
# 2      1       591 b                 1
# 3      1       592 c                 2
# 4      1       592 d                 2
# 5      1       593 e                 3
# 6      2         1 a                 4
# 7      2         1 b                 4
# 8      2         2 c                 5
# 9      2         2 d                 5
# 10     2         3 e                 6
# 11     3         1 f                 7
# 12     3         1 g                 7
# 13     3         2 h                 8
# 14     3         3 i                 9
# 15     3         3 j                 9

In this scenario, the datasets are probably coming from a variety of similarly-structured databases that essentially produce:

library("magrittr")
ds_top <- 
  tibble::tribble(
    ~source, ~record_id, ~any_column,
    1L, 591L,"a",
    1L, 591L,"b",
    1L, 592L,"c",
    1L, 592L,"d",
    1L, 593L,"e"
  )

ds_middle <- 
  tibble::tribble(
    ~source, ~record_id, ~any_column,
    2L, 1L, "a",
    2L, 1L, "b",
    2L, 2L, "c",
    2L, 2L, "d",
    2L, 3L, "e"
  )

ds_bottom <- 
  tibble::tribble(
    ~source, ~record_id, ~any_column,
    3L, 1L, "f",
    3L, 1L, "g",
    3L, 2L, "h",
    3L, 3L, "i",
    3L, 3L, "j"
  )
l <- list(ds_top, ds_middle, ds_bottom)

But if you're pulling from something like identically-structured csv files, some of these operations can be combined, like:

ds_stack <-
  vector_of_file_paths %>%
  purrr::map_dfr(~readr::read_csv(.), .id = "source") %>% 
  dplyr::mutate(
    grand_id = dplyr::dense_rank(paste0(source, "-", record_id))
  )
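
Here vector_of_file_paths is just a character vector of csv paths. One way to build it, assuming all the extracts sit in a single (hypothetical) directory, is:

vector_of_file_paths <-
  list.files(
    path       = "path/to/csv-extracts",   # hypothetical folder holding the csv files
    pattern    = "\\.csv$",                # keep only csv files
    full.names = TRUE                      # return full paths, not just file names
  )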