Error with rep using multidplyr: cannot find function "n"


I'm trying to expand a data frame based on the value of a column, running in parallel with multidplyr (on top of dplyr). Since uncount() does not work with multidplyr, I am using base R's rep() inside slice() instead, but I get an error. Below is an MWE, where I want to parallelise by YEAR (2010 and 2020) and then expand each ID according to its FACTOR value.

library(dplyr)
library(multidplyr)

set.seed(123)
df <- data.frame(ID = rep(1:4, 2), 
                 YEAR = c(rep(2010,4),rep(2020,4)),
                 FACTOR = floor(runif(8,1,10)))

cluster <- new_cluster(2)

# Parallelised dataset, by YEAR
df2 <- df %>% group_by(YEAR) %>% partition(cluster)

# This code works without parallelisation (notice I need to group by YEAR too)
df %>% 
  group_by(ID,YEAR) %>% 
  slice(rep(1:n(), first(FACTOR))) %>%
  arrange(ID, YEAR, FACTOR) %>%
  collect()

# This code produces an error
df2 %>% 
  group_by(ID) %>% 
  # uncount(FACTOR, .remove = FALSE) %>% BUG, NOT WORKING YET
  slice(rep(1:n(), first(FACTOR))) %>%
  arrange(ID, YEAR, FACTOR) %>%
  collect()

The error I get is

Error in `cluster_call()`:
! Remote computation failed in worker 1
Caused by error:
ℹ In argument: `rep(1:n(), first(FACTOR))`.
ℹ In group 1: `ID = 1`.
Caused by error in `n()`:
! could not find function "n"

It's probably trivial, but I haven't found a solution.

There is 1 answer

luchonacho On

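Noob mistake. Each multidplyr worker is a separate R session, so the packages attached in the main session are not available there, which is why n() could not be found. I had to load dplyr on the workers before running the pipeline, using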

cluster_library(cluster, "dplyr")
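For completeness, here is a sketch of how the whole MWE might look with that fix in place. It assumes the same toy data as in the question. The name `expanded` is only illustrative, and grouping by both ID and YEAR on the workers is a precaution I added so that the result does not depend on which YEAR groups end up on which worker.

```r
library(dplyr)
library(multidplyr)

set.seed(123)
df <- data.frame(ID = rep(1:4, 2),
                 YEAR = c(rep(2010, 4), rep(2020, 4)),
                 FACTOR = floor(runif(8, 1, 10)))

cluster <- new_cluster(2)

# Attach dplyr in every worker session so that n() and first() can be found there
cluster_library(cluster, "dplyr")

expanded <- df %>%
  group_by(YEAR) %>%
  partition(cluster) %>%          # send the YEAR groups to the workers
  group_by(ID, YEAR) %>%          # one row per group (assumption: ID/YEAR pairs are unique)
  slice(rep(seq_len(n()), first(FACTOR))) %>%  # repeat each row FACTOR times
  collect() %>%                   # bring the results back to the main session
  arrange(ID, YEAR, FACTOR)       # sort locally after collecting
```

Sorting after collect() rather than before is a deliberate choice here: an arrange() run on the workers only orders rows within each shard, so the final ordering is done once in the main session.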