I have the following code, which downloads each link into the appropriate folder > subfolder. The code works, but it is very slow: there are a couple of hundred .zip files that I am attempting to download so that they can be processed.
Within the folder/subfolder structure of the country data, some countries have a single subfolder while others have several.
To mimic the existing conditions, I am also enclosing some dummy code below:
library("furrr")
library("curl")
country.year.dir <- c("/test/GB/GB_2010", "/test/GB/GB_2014", "/test/GN/GN_2016",
"/test/GY/GY_2000", "/test/GY/GY_2006-2007", "/test/GY/GY_2014")
my.country.names_DTs$URL <- c("https://GB 2010 Datasets.zip",
"https://GB 2014 Datasets.zip", "https://GN 2016 Datasets.zip",
"https://GY 2006-2007 Datasets.zip", "https://GY 2014 Datasets.zip")
my.country.names_DTs$URL_Clean <- c("https://GB_2010_Datasets.zip",
"https://GB_2014_Datasets.zip", "https://GN_2016_Datasets.zip",
"https://GY_2006-2007_Datasets.zip", "https://GY_2014_Datasets.zip")
# Working (but slow) sequential version: move into each directory and download one file
for (i in seq(country.year.dir)) {
  setwd(country.year.dir[i])
  my.shortcut.2 <- curl_download(url = my.country.names_DTs[i]$URL,
                                 destfile = my.country.names_DTs[i]$URL_Clean)
}
I searched for ways to speed up the downloads and came across this answer: How can I configure future to download more files?
I modified that code to fit my situation; however, the code below does not work, and I receive an error message.
download_template <- function(.x) {
  for (i in seq(country.year.dir)) {
    my.shortcut.2 <- curl_download(url = my.country.names_DTs[i]$URL,
                                   destfile = my.country.names_DTs[i]$URL_Clean)
  }
}
download_future_core <- function() {
  plan(multiprocess)
  future_map(my.country.names_DTs$URL, download_template)
}

download_future_core()
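For reference, my understanding is that future_map() calls the supplied function once per element of its first argument, so .x should receive a single URL at a time rather than the whole vector. Below is a minimal sketch of that pattern against the dummy objects above; mapping over row indices and building the destination path with file.path() are my own guesses, not the code from the linked answer, and this is untested on the real data.

library("furrr")
library("future")
library("curl")

plan(multisession)  # explicit parallel strategy; multiprocess is deprecated in newer versions of future

# One task per row: a single URL and a single destination file.
# Assumes the directories in country.year.dir already exist, as in the loop above.
future_map(seq_len(nrow(my.country.names_DTs)), function(i) {
  curl_download(url      = my.country.names_DTs[i]$URL,
                destfile = file.path(country.year.dir[i], my.country.names_DTs[i]$URL_Clean))
})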
Is there any way to speed up the code that works while keeping the same functionality?
Thank you.
UPDATE
Instead of attempting to use furrr, I rewrote the function using foreach. The revised code is below:
library("foreach")
library("curl")
import::from(future, plan, cluster)
import::from(doParallel, registerDoParallel)
import::from(snow, stopCluster)
import::from(parallel, makeCluster, detectCores)
cl <- makeCluster(detectCores())
plan(strategy = "cluster", workers = cl)
registerDoParallel(cl)
download_MICS_files <- foreach(i = seq(country.year.dir_MICS)) %dopar% {
  # save and restore the worker's working directory around each download
  currDir <- getwd()
  on.exit(setwd(currDir))
  setwd(country.year.dir_MICS[i])
  MICS_downloaded <- curl_download(url = my.country.names_MICS_DTs[i]$URL,
                                   destfile = my.country.names_MICS_DTs[i]$URL_Clean)
}
However, I was (and still am) getting the following error message from the foreach loop:
Error in { : task 1 failed - "cannot change working directory"
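As far as I can tell, this error means the worker process could not resolve the directory handed to setwd(), which would happen, for example, if the entries in country.year.dir_MICS are relative paths and each worker starts in a different working directory. A small diagnostic that can be run with the same %dopar% setup (a sketch for checking only, not a fix):

check_dirs <- foreach(i = seq(country.year.dir_MICS), .combine = rbind) %dopar% {
  data.frame(task      = i,
             worker_wd = getwd(),                               # where the worker actually is
             target    = country.year.dir_MICS[i],              # directory the loop tries to enter
             exists    = dir.exists(country.year.dir_MICS[i]))  # FALSE would explain the setwd() failure
}
check_dirs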
I searched for help regarding setwd and foreach loops, and I came across the following answer: How to change working directory in asynchronous futures in R. I used a couple of lines from that answer, but I am still getting the same error message.
What is the best way to handle the working directories so that the foreach construct works the same way as the plain for loop, without the setwd() error?
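In other words, is the right approach simply to drop setwd() inside the workers and hand curl_download() the full destination path instead? Something along these lines, sketched against the dummy objects above rather than the real MICS objects (the file.path() construction is my assumption, and I have not verified it on the full data):

download_files <- foreach(i = seq(country.year.dir), .packages = "curl") %dopar% {
  # no setwd(): the destination directory is baked into destfile,
  # so nothing depends on the worker's working directory
  curl_download(url      = my.country.names_DTs[i]$URL,
                destfile = file.path(country.year.dir[i], my.country.names_DTs[i]$URL_Clean))
}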
Thank you.