Error in checkForRemoteErrors(val) : 7 nodes produced errors; first error: could not find function "fread"


All of the code in this question comes from the script called "LASSO code (Version for Antony)" in my GitHub repo for this project. You can run it on the file folder called "last 40" to verify my claim that it does run on limited-size datasets, and if you really feel like going the extra mile, message me here and I'll share a 10k-scale folder of zipped datasets via OneDrive or Google Drive (whichever you prefer) so you can also verify that the same script doesn't work on folders of that volume.

This is absolutely going to drive me mad, I swear. I have been using the parLapply call below without issue for a week now, and starting several hours ago it has been giving me this error:

> datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
Error in checkForRemoteErrors(val) : 
  7 nodes produced errors; first error: could not find function "fread" 

Here is the rest of the script I am working with, up until that line (everything after the lines that load my libraries):

# these 2 lines together create a character vector of 
# all the file names in the folder of datasets
folderpath <- "C:/Users/Spencer/Documents/EER Project/12th & 13th 10k"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)

# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)


# sort both lists of file names so that they are in the proper order
my_order = DS_names_list |> 
  # split apart the numbers, convert them to numeric 
  strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)

DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]
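
To show what that ordering is doing, here is a tiny self-contained sketch with made-up file names (not my real files, just for illustration):

# minimal sketch of the ordering above, using made-up file names
demo_names <- c("12-10", "12-2", "13-1", "12-1")
demo_order <- demo_names |> 
  strsplit(split = "-", fixed = TRUE) |> unlist() |> as.numeric() |>
  matrix(nrow = length(demo_names), byrow = TRUE) |> as.data.frame() |>
  do.call(order, args = _)
demo_names[demo_order]
#> [1] "12-1"  "12-2"  "12-10" "13-1"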

# these lines read all of the data in each of the csv files 
# using the names stored in the list we just created
CL <- makeCluster(detectCores() - 2L)
clusterExport(CL, c('paths_list'))
library(data.table)
system.time( datasets <- parLapply(CL, paths_list, fread) )

After looking up the documentation for the 3rd time today, I am thinking of trying:

system.time( datasets <- parLapply(CL, paths_list, fun = fread) )

Will that work??

p.s. Here are all of the libraries I load as the first thing in the script:

# load all necessary packages
library(plyr)
library(dplyr)
library(tidyverse)
library(readr)
library(stringi)
library(purrr)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)

Also, I have already tried the following, and none of them worked:

datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
datasets <- parLapply(CL, paths_list, function(i) {fread[i]})
datasets <- parLapply(CL, paths_list, function(i) {fread[[i]]})

datasets <- parLapply(CL, paths_list, \(ds) 
                      {fread(ds)})

system.time( datasets <- lapply(paths_list, fread) )

And when I run that last one, datasets <- lapply(paths_list, fread), I get the same error. That was exactly the original version that ran successfully at the beginning of last week; I only switched to the parallel version because the folder I am importing has 260,000 csv-formatted datasets in it. So two versions which have worked dozens of times already just stopped working suddenly today!

1 Answer

wibeasley (Best Answer)

See if this works consistently. It hasn't failed yet on my Windows desktop with 20k files (I copied & pasted your 40 files a bunch). I've run it 5 times, restarting the R session and RStudio each time.

It's too bad that the problem arises non-deterministically, but that's part of the parallel-computation game. See if this stripped-down example runs consistently for you.

Notice I'm avoiding library() to eliminate naming collisions caused by packages with identically-named functions. Also, I closed the cluster connection at the end.

# Enumerate files
paths_list <- 
  "~/Documents/delete-me/EER-Research-Project-main/20k" |> 
  list.files(full.names = T, recursive = T)

# Establish cluster
CL <- parallel::makeCluster(parallel::detectCores() - 2L)
parallel::clusterExport(CL, c('paths_list'))

# Read files
system.time({
  datasets <- parallel::parLapply(CL, paths_list, data.table::fread)
})

# Stop cluster
parallel::stopCluster(CL)

#>    user  system elapsed 
#>    7.09    1.22  101.93
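
If you'd rather keep the wrapped function(i) fread(i) style from your question, another option (just a sketch; I haven't timed it on a folder your size) is to attach data.table on each worker with parallel::clusterEvalQ() before calling parLapply(). clusterExport() only copies variables to the workers; library(data.table) on the master doesn't load the package on them, which is why the anonymous-function versions couldn't find fread:

# sketch: attach data.table on every worker before the parallel read
CL <- parallel::makeCluster(parallel::detectCores() - 2L)
parallel::clusterExport(CL, c('paths_list'))
parallel::clusterEvalQ(CL, library(data.table))  # runs library() on each worker
datasets <- parallel::parLapply(CL, paths_list, function(i) fread(i))
parallel::stopCluster(CL)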