I'm trying to expand a data frame based on the value of a column, using parallel cores with multidplyr (on top of dplyr). Since uncount() does not work with multidplyr, I am using the base rep() function instead, but I get an error. Below is a MWE where I want to parallelise by YEAR (2010 and 2020) and then expand each ID according to the FACTOR variable.
library(dplyr)
library(multidplyr)
set.seed(123)
df <- data.frame(ID = rep(1:4, 2),
                 YEAR = c(rep(2010, 4), rep(2020, 4)),
                 FACTOR = floor(runif(8, 1, 10)))
cluster <- new_cluster(2)
# Parallelised dataset, by YEAR
df2 <- df %>% group_by(YEAR) %>% partition(cluster)
# This code works without parallelisation (note that I need to group by YEAR too)
df %>%
group_by(ID,YEAR) %>%
slice(rep(1:n(), first(FACTOR))) %>%
arrange(ID, YEAR, FACTOR) %>%
collect()
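As an aside, without multidplyr the same expansion can be written directly with tidyr::uncount(), which is what the slice(rep(1:n(), ...)) idiom is emulating. This is shown only as the non-parallel reference, since uncount() is exactly what fails on the partitioned data frame:

library(tidyr)
# Repeat each row FACTOR times; .remove = FALSE keeps the FACTOR column
df %>%
  uncount(FACTOR, .remove = FALSE) %>%
  arrange(ID, YEAR, FACTOR)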
# This code produces an error
df2 %>%
group_by(ID) %>%
# uncount(FACTOR, .remove = FALSE) %>% BUG, NOT WORKING YET
slice(rep(1:n(), first(FACTOR))) %>%
arrange(ID, YEAR, FACTOR) %>%
collect()
The error I get is:
Error in `cluster_call()`:
! Remote computation failed in worker 1
Caused by error:
ℹ In argument: `rep(1:n(), first(FACTOR))`.
ℹ In group 1: `ID = 1`.
Caused by error in `n()`:
! could not find function "n"
It's probably trivial, but I haven't found a solution.
Noob mistake. I had to pass the library to the clusters using cluster_library():
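cluster_library(cluster, "dplyr")

The workers are fresh R sessions, so dplyr (and hence n()) is not available on them until it is loaded explicitly. With dplyr attached on each worker, the parallel pipeline from the question runs as intended (with the Factor/FACTOR case typo fixed):

df2 %>%
  group_by(ID) %>%
  slice(rep(1:n(), first(FACTOR))) %>%
  arrange(ID, YEAR, FACTOR) %>%
  collect()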