Rsync a single (archive) file that changes every time


I am working on an open source backup utility that backs up files and transfers them to various external locations such as Amazon S3, Rackspace Cloud Files, Dropbox, and remote servers through FTP/SFTP/SCP protocols.

Now, I have received a feature request for incremental backups (for cases where the backups are large and become expensive to transfer and store). I have been looking around, and someone mentioned the rsync utility. I performed some tests with it but am unsure whether it is suitable, so I would like to hear from anyone who has some experience with rsync.

Let me give you a quick rundown of what happens when a backup is made. Basically, it starts by dumping databases such as MySQL, PostgreSQL, MongoDB, and Redis. It might also take a few regular files (like images) from the file system. Once everything is in place, it bundles it all into a single .tar archive, then compresses and encrypts it using gzip and openssl.

Once that's all done, we have a single file that looks like this:
mybackup.tar.gz.enc
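
For illustration, here is a simplified sketch of what the utility does each run (database names, credentials, and paths are placeholders):

# dump a database, grab some files, then bundle, compress, and encrypt
mysqldump -u backup_user -p'secret' mydb > dump.sql
tar -cf mybackup.tar dump.sql images/
gzip mybackup.tar
openssl enc -aes-256-cbc -salt -pass pass:secret \
    -in mybackup.tar.gz -out mybackup.tar.gz.enc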

Now I want to transfer this file to a remote location. The goal is to reduce bandwidth and storage cost. Let's assume this backup package is about 1GB in size. We use rsync to transfer it to a remote location and remove the local copy. Tomorrow a new backup file is generated; it turns out a lot more data was added in the past 24 hours, so the new mybackup.tar.gz.enc comes out at about 1.2GB.

Now, my question is: Is it possible to transfer just the 200MB that got added in the past 24 hours? I tried the following command:

rsync -vhP --append mybackup.tar.gz.enc backups/mybackup.tar.gz.enc

The result:

mybackup.tar.gz.enc 1.20G 100% 36.69MB/s 0:00:46 (xfer#1, to-check=0/1)

sent 200.01M bytes
received 849.40K bytes
8.14M bytes/sec
total size is 1.20G
speedup is 2.01

Looking at the "sent 200.01M bytes" I'd say the "appending" of the data worked properly. What I'm wondering now is whether it transferred the whole 1.2GB in order to figure out how much to append and where, or whether it really transferred only the 200MB. Because if it transferred the whole 1.2GB, I don't see how it's much different from using the scp utility on single large files.

Also, if what I'm trying to accomplish is at all possible, what flags do you recommend? If it isn't possible with rsync, is there any utility you would recommend instead?

Any feedback is much appreciated!


There are 3 answers

Best answer, by Piskvor left the building:

It sent only what it says it sent: transferring just the changed parts is one of rsync's major features. It uses some rather clever checksumming algorithms (the checksums do travel over the network, but that's negligible, several orders of magnitude less data than the file itself; in your case, I'd assume that's the .01 in 200.01M) and transfers only the parts it needs.
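
You can see this for yourself by running the transfer with --stats (remote host and path are placeholders here): the "Literal data" line is what actually went over the wire, while "Matched data" is what the receiver reconstructed from the copy it already had.

rsync -vh --stats mybackup.tar.gz.enc user@remote:backups/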

Note also that quite powerful backup tools built on rsync already exist, namely Duplicity. Depending on the license of your code, it may be worthwhile to see how they do it.

Answer by Rob Redpath:

The nature of gzip is such that small changes in the source file can result in very large changes in the compressed output: gzip makes its own decisions each run about the best way to compress the data you give it.

Some versions of gzip have an --rsyncable switch, which makes gzip periodically reset its compression state. Compression becomes slightly less efficient (in most cases), but a change in the source file only alters the corresponding region of the output file rather than rippling through everything that follows.
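
If your gzip supports it, it's a single extra flag before the encryption step:

gzip --rsyncable mybackup.tar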

If that's not available to you, it's typically best to rsync the uncompressed file (using rsync's own in-transit compression if bandwidth is a consideration) and compress at the end (if disk space is a consideration). Obviously this depends on the specifics of your use case.
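
For example (remote host and path are placeholders):

# -z compresses in transit only; the file stays uncompressed on both ends,
# so tomorrow's run can still delta-transfer against yesterday's copy
rsync -zvhP mybackup.tar user@remote:backups/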

Answer by Tapio Rantala:

Beware: since version 3.0.0, rsync's --append WILL BREAK your file contents if there are any changes in your existing data. It appends whatever lies beyond the destination file's current length without verifying the data that is already there; the old verifying behaviour is now a separate option, --append-verify.
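
A sketch of the safer alternatives (remote path is a placeholder): either drop --append and let rsync's delta algorithm handle changes anywhere in the file, or use --append-verify, which restores the old checksum-the-existing-data behaviour:

rsync -vhP mybackup.tar.gz.enc user@remote:backups/
rsync -vhP --append-verify mybackup.tar.gz.enc user@remote:backups/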