How to Append Large Files Efficiently Using Rsync

In our company, we frequently need to merge multiple large files ranging from 5GB to 200GB in size. These files, compressed in .fastq.gz format, are stored on our storage servers (servers A through H), which are mounted via sshfs under /storage/archXX, where archXX is one of several hard drives on a given server. Our production server X is the destination for the merged files.

To concatenate these files, we've been using the following command:

pv /storage/archXX/customerID/fastq/data.fq.gz >> /work/customerID/fastq/data.fq.gz && pv /storage/archYY/customerID/fastq/data.fq.gz >> /work/customerID/fastq/data.fq.gz

This method works well because concatenated .gz files remain a valid gzip stream, so no data is lost.
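A quick sanity check of that claim, with made-up file names:

cat part1.fastq.gz part2.fastq.gz > merged.fastq.gz
gzip -t merged.fastq.gz   # tests the integrity of every gzip member in the file
cmp <(zcat part1.fastq.gz part2.fastq.gz) <(zcat merged.fastq.gz) && echo "contents identical"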

My Problem: However, this method puts significant load on the storage servers while the data streams through the sshfs mounts to the production server, which slows down data access for my colleagues.

My Solution: To resolve this, I've started using rsync to copy the files directly from the archive to our production machine. This reduced the load on our systems.

Currently, I'm using the following template to handle multiple files:

CustomerID=10
ARCH=arch49
# resolve the sshfs mount for this archive into a host:path that rsync can use
# (see the example df output below for what this expects)
ARCH_PATH=$(ssh production "df | grep -w ${ARCH} | sed 's/.*@//'" | awk '{print $1}')
cd /work/${CustomerID}/fastq/
# first archive: copy directly into /work/CustomerID/fastq/
rsync -r -v --progress user@${ARCH_PATH}/${CustomerID}/fastq/* /work/${CustomerID}/fastq/

ARCH=arch51
ARCH_PATH=$(ssh production "df | grep -w ${ARCH} | sed 's/.*@//'" | awk '{print $1}')
cd /work/temp
# second archive: copy into the temporary directory /work/temp
rsync -r -v --progress user@${ARCH_PATH}/${CustomerID}/fastq/* /work/temp/
# then append the files from /work/temp onto the ones in /work/CustomerID/fastq/
pv "/work/temp/${CustomerID}_R1.fastq.gz" >> "/work/${CustomerID}/fastq/${CustomerID}_R1.fastq.gz" && pv "/work/temp/${CustomerID}_R2.fastq.gz" >> "/work/${CustomerID}/fastq/${CustomerID}_R2.fastq.gz"

This works, but it is still inefficient because of the unnecessary copies to and from the temporary directory. So I would like to append to /work/CustomerID/fastq/data.gz directly with rsync.
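Ideally I could run something along these lines, but only if --append (or some other flag) actually adds the remote file's contents onto the end of the existing destination file, which is exactly the part I'm unsure about:

ARCH=arch51
ARCH_PATH=$(ssh production "df | grep -w ${ARCH} | sed 's/.*@//'" | awk '{print $1}')
rsync -v --progress --append user@${ARCH_PATH}/${CustomerID}/fastq/${CustomerID}_R1.fastq.gz /work/${CustomerID}/fastq/${CustomerID}_R1.fastq.gz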

The Question: While exploring the rsync manual, I came across the --append flag. According to the manual, this flag only appends data to destination files that are shorter than the source file. Unfortunately, I can't guarantee that the file already on my server is the shorter one, and once two or more files have been merged, the growing destination file will often be larger than the remaining source files.

Do I understand the --append flag correctly? Does it work the way I expect? Is there another rsync flag that would let me append files in this way?

Alternatively, do you have any other ideas on how I can efficiently move these large files through my network?
