Lock file when writing to it from parallel processes in R

1.8k views · Asked by user1603038

I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to write results to a file from the clusters at regular intervals using write.table(), because the process crashes from time to time when it runs out of memory or for some other random reason, and I want to be able to continue the calculation from the place where it stopped. I noticed that some lines of the CSV files I get are cut off in the middle, probably as a result of several processes writing to the file at the same time. Is there a way to place a lock on the file while write.table() executes, so that other clusters can't access it, or is the only way out to write to a separate file from each cluster and then merge the results?
There are 2 answers
The old Unix technique looks like this:
```
# Make sure other processes are not writing to the file by trying to create
# a lock directory: mkdir is atomic, so it fails with an error if the
# directory already exists. Repeat until the lock directory is created.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# Get rid of the locking directory to release the lock
system2(command = "rmdir", args = "lockdir")
```
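For context, here is a minimal sketch of how this directory-based lock might be wrapped in a helper and called from each worker; the names write_locked, process_chunk, my_chunks, and the 4-worker cluster are hypothetical placeholders, not part of the original answer.

```
library(parallel)

# Hypothetical helper: acquire the directory lock, append one worker's
# results, then release the lock. Assumes all workers share a filesystem.
write_locked <- function(result, filename, lockdir = "lockdir") {
  # Spin until the lock directory can be created (mkdir is atomic).
  repeat {
    if (system2("mkdir", args = lockdir, stderr = NULL) == 0) break
    Sys.sleep(0.05)  # short pause so the loop does not spin at full speed
  }
  on.exit(system2("rmdir", args = lockdir))  # release the lock on exit
  write.table(result, file = filename, append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = ",")
}

cl <- makeCluster(4)
clusterExport(cl, c("write_locked", "process_chunk"))
res <- parSapply(cl, my_chunks, function(chunk) {
  out <- process_chunk(chunk)        # hypothetical per-chunk computation
  write_locked(out, "results.csv")   # safe append under the lock
  TRUE
})
stopCluster(cl)
```

The on.exit() call releases the lock even if write.table() fails inside the helper, which avoids leaving a stale lock directory behind after an error.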
It is now possible to create file locks using the filelock package (GitHub). To make this work with parSapply() you would need to edit your loop so that if the file is locked the process does not simply quit, but either tries again or calls Sys.sleep() for a short amount of time. However, I am not certain how this will affect your performance.
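As an illustration of that retry idea, here is a minimal sketch using filelock; the helper name, the lock-file name, and the 0.1-second retry interval are assumptions for the example.

```
library(filelock)

# Hypothetical helper: keep trying to take an exclusive lock, append the
# result once the lock is held, then release it.
append_with_lock <- function(result, filename,
                             lockfile = paste0(filename, ".lck")) {
  repeat {
    lck <- lock(lockfile, timeout = 0)  # returns NULL right away if locked
    if (!is.null(lck)) break
    Sys.sleep(0.1)                      # wait briefly before trying again
  }
  on.exit(unlock(lck))                  # release the lock when done
  write.table(result, file = filename, append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = ",")
}
```

With the default timeout = Inf, lock() itself blocks until the lock is free, so the explicit retry loop is only needed if you want control over what happens between attempts.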
Instead, I recommend creating cluster-specific files to hold your data, which eliminates the need for a lock file and does not reduce your performance. Afterwards you should be able to merge these files into your final results file. If size is an issue, you can use disk.frame to work with files that are larger than your system RAM.
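Here is a rough sketch of that per-worker-file approach, assuming each worker can be identified by its process id; process_chunk, my_chunks, and the file names are placeholders.

```
library(parallel)

cl <- makeCluster(4)
clusterExport(cl, "process_chunk")                 # hypothetical computation
res <- parSapply(cl, my_chunks, function(chunk) {
  out <- process_chunk(chunk)
  # Each worker appends only to its own file, so no locking is needed.
  fname <- sprintf("results_%d.csv", Sys.getpid())
  write.table(out, file = fname, append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = ",")
  TRUE
})
stopCluster(cl)

# Afterwards, combine the per-worker files into the final results file.
parts <- lapply(Sys.glob("results_*.csv"), read.csv, header = FALSE)
write.csv(do.call(rbind, parts), "results_final.csv", row.names = FALSE)
```

If the combined result does not fit in memory, the final rbind step is where disk.frame (or another out-of-core approach) would come in, as suggested above.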