Downloading a large batch of files: want a way to skip files that already exist in case the download times out


I am downloading CMIP6 data files using this code:

#install.packages("epwshiftr")
library("epwshiftr")

# this indexes all the information about the models
test <- init_cmip6_index(activity = "CMIP",
                         variable = "pr",
                         frequency = "day",
                         experiment = "historical",
                         source = NULL,
                         years = c(1981, 1991, 2001, 2014),
                         variant = "r1i1p1f1",
                         replica = FALSE,
                         latest = TRUE,
                         limit = 10000L,
                         data_node = NULL,
                         resolution = NULL)

# Download GCMs
ntest <- nrow(test)
for (i in 1:ntest) {
  url <- test$file_url[i]
  destfile <- paste("D:/CMIP6 data/Data/", test$source_id[i], "-", test$experiment_id[i], "-",
                    test$member_id[i], "-", test$variable_id[i], "-",
                    test$datetime_start[i], "to", test$datetime_end[i], ".nc", sep = "")
  download.file(url, destfile)
}

The files are very large and will take a few hours to download. I am encountering some time-outs, so I may need to run this code multiple times to finish downloading all the files.

Is there a way to code it so that it checks whether the specific file name already exists and, if it does, skips that file and moves on to the next one?

For reference, the files look like this when they are downloaded: [screenshot of downloaded files]

Any help would be appreciated. Thank you!

EDIT: Would it also be possible for the code not to stop completely in case the URL of a particular file is not responding? I ask because I noticed that some URLs take too long to respond, and R times out the operation after waiting for a certain period of time.

1 Answer

Accepted answer by marine-ecologist:

I had the exact same unanswered question as @mikaeldp, and it took me a few moments to figure out how to make the answer in @user12728748's comment work within the loop.

The following code using an if…else statement works and may save someone else some time in case they come across the same question:

# add an if...else statement to the loop to check for file existence before downloading
ntest <- nrow(test)
for (i in 1:ntest) {
  url <- test$file_url[i]
  destfile <- paste("D:/CMIP6 data/Data/", test$source_id[i], "-", test$experiment_id[i], "-",
                    test$member_id[i], "-", test$variable_id[i], "-",
                    test$datetime_start[i], "to", test$datetime_end[i], ".nc", sep = "")

  if (file.exists(destfile)) {
    # file already downloaded: skip it
  } else {
    download.file(url, destfile)
  }
}

Similarly, the timeout can be changed from the default 60 seconds to avoid dropouts (which occur pretty frequently) as follows:

options(timeout = max(900, getOption("timeout")))

EDIT:

Even after increasing the timeout, my connection frequently dropped out, leaving multiple incomplete partial files in the download folder. To avoid this, I added an AND condition to the if...else that checks (i) that the file name exists in the folder AND (ii) that the file size matches the file size in the index output from test. If the file name does exist (i.e. something downloaded) but the size doesn't match (i.e. it is a partially downloaded file), then the file gets re-downloaded in the loop, like so:

ntest <- nrow(test)
for (i in 1:ntest) {
  url <- test$file_url[i]
  destfile <- paste("D:/CMIP6 data/Data/", test$source_id[i], "-", test$experiment_id[i], "-",
                    test$member_id[i], "-", test$variable_id[i], "-",
                    test$datetime_start[i], "to", test$datetime_end[i], ".nc", sep = "")

  if (file.exists(destfile) && file.info(destfile)$size == test$file_size[i]) {
    # complete file already downloaded: skip it
  } else {
    download.file(url, destfile)
  }
}
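
To address the edit in the original question (URLs that stop responding), one option is to wrap download.file() in base R's tryCatch() so that a failed or timed-out download is reported but does not abort the whole loop. The sketch below is only a suggestion, not part of the accepted answer; it combines tryCatch() with the existence/size check above and assumes the same test index and destination folder:

# sketch: skip complete files and keep looping even if a single download fails
options(timeout = max(900, getOption("timeout")))  # allow slow downloads

ntest <- nrow(test)
for (i in 1:ntest) {
  url <- test$file_url[i]
  destfile <- paste("D:/CMIP6 data/Data/", test$source_id[i], "-", test$experiment_id[i], "-",
                    test$member_id[i], "-", test$variable_id[i], "-",
                    test$datetime_start[i], "to", test$datetime_end[i], ".nc", sep = "")

  # skip files that are already fully downloaded
  if (file.exists(destfile) && file.info(destfile)$size == test$file_size[i]) next

  # report a failed or timed-out download and move on to the next file
  tryCatch(
    download.file(url, destfile),
    error = function(e) message("Failed to download ", url, ": ", conditionMessage(e))
  )
}

A file that fails partway through is still caught by the size check on the next run, so re-running the loop should eventually complete the set.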