Combining dtplyr and multidplyr to handle a large mutate operation

I am combining the dtplyr and multidplyr libraries to handle some basic mutate/summarise operations on a very large database. The merged table, final_db_partition, is sometimes 30 million rows long.

I cannot figure out whether I am doing something wrong, but either the R session aborts or I run out of memory.

R version 4.0.5 (2021-03-31) / Platform: x86_64-apple-darwin17.0 (64-bit) / Running under: macOS Big Sur 10.16

How should I tackle this issue?
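
For reference, here is a minimal sketch of what db1 and db2 look like (hypothetical toy values; the column names are taken from the code below, and the real tables also carry v1, v2, v3, finalv1 and finalv2):

library(data.table)

# Hypothetical stand-ins for db1 and db2, only to show the assumed layout;
# the real merged table reaches roughly 30 million rows.
db1 <- data.table(
  id       = c(1L, 1L, 2L),
  m_origin = c("rome", "milan", "paris")
)

db2 <- data.table(
  id            = c(1L, 2L),
  m_destination = c("rome milan", "paris london")
)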

library(multidplyr)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(data.table)
library(stringr)


# Set up the worker cluster and load the packages the workers need
default_cluster(parallel::detectCores() - 1)
cluster_library(default_cluster(), 'dplyr')
cluster_library(default_cluster(), 'stringr')

# Convert both inputs to lazy data.tables
db1 <- db1 %>% 
  data.table::data.table() %>% 
  lazy_dt(immutable = FALSE)

db2 <- db2 %>% 
  data.table::data.table() %>% 
  lazy_dt(immutable = FALSE)

# Merge the two tables, then spread the result across the workers by id
final_db_partition <- db1 %>% 
  left_join(db2) %>% 
  as.data.frame() %>% 
  group_by(id) %>% 
  partition(cluster = default_cluster())

# Flag rows where m_origin appears as a whole word in m_destination,
# then group by everything except v1, v2, v3 and sum the two value columns
final_db <- final_db_partition %>% 
  as.data.table() %>% 
  #lazy_dt(immutable = FALSE) %>% 
  mutate(m1 = ifelse(stringi::stri_detect_regex(m_destination, paste0("\\b", m_origin, "\\b")), 1, 0)) %>% 
  as.data.frame() %>% 
  group_by(across(c(-v1, -v2, -v3))) %>% 
  summarise(finalv1 = sum(finalv1, na.rm = TRUE),
            finalv2 = sum(finalv2, na.rm = TRUE)) %>% 
  collect()

