Using md5sum for speeding up dd disk imaging, sample script: Good idea?


I was thinking of ways to have my laptop HDD backed up safely while still being able to put the backup rapidly into use if needed. My plan is the following: I would buy a 2.5" HDD of the same size with a USB-to-SATA cable and clone the internal drive to it. When disaster strikes, I would just have to swap the internal HDD for the backup and I would be good to go again. However, I would like to avoid writing 500 GB each time I back up, especially since I know that a fair part of the drive (+/- 80 GB) is rarely written to. This is where the following md5sum/dd script comes to the rescue, I hope:

#!/bin/bash
block="1M"        # dd block size
end=50000         # number of chunks to compare
count=10          # blocks per chunk

input="/dev/sda"

output="/dev/sdb"
#output="/path/to/imagefile"    # alternative target: an image file

# Checksum one chunk: $count blocks of size $block, starting at block offset $2 of $1
function md5compute()
{
    dd if="$1" skip="$2" bs="$block" count="$count" 2>/dev/null | md5sum - | awk '{ print $1 }'
}

# {0..$end} does not expand when $end is a variable, so use a C-style loop
for ((i=0; i<=end; i++))
do
    start=$((i*count))
    md5source=$(md5compute "$input" "$start")
    md5destination=$(md5compute "$output" "$start")
    if [ "$md5source" != "$md5destination" ]
    then
        # Re-copy only this chunk; bs must match the block size used for the checksums
        dd if="$input" of="$output" bs="$block" skip="$start" seek="$start" count="$count" conv=sync,noerror,notrunc
    fi
done

Now, the question part:

A) By running this, would I miss some part of the disk? Do you see any flaws?

B) Would I gain some time compared to the full 500 GB read/write?

C) Obviously I potentially write less to the target disk. Will I improve the lifetime of that disk?

D) I was thinking of leaving count at 1 and increasing the block size. Is this a good idea or a bad idea?

E) Would this same script work with an image file as output?

Not being very fluent in programming, I am sure there is plenty of room for improvement. Any tips?

Thank you all...

1 Answer

Best answer, by F. Hauri:

Point by point answer:

  1. By running this, would I miss some part of the disk?

    • No.
  2. Do you see any flaws?

    • Where the two drives differ, this implies a double read at the source and a full read of the destination before each write, which will mostly increase the backup time.
    • There is a small probability of the MD5 sums matching while differences exist. This probability is reduced by using SHA-1, SHA-256 or another stronger checksum algorithm, but that implies more resources on both ends (see the Birthday problem on Wikipedia). A sketch of this swap follows the list.
  3. Would I gain some time compared to the full 500 GB read/write?

    • If both drives are already mostly identical, yes, because reading is generally quicker than writing. (The cost of the checksum computation depends on the processor; it could be significant on very weak processors.)
  4. Obviously I potentially write less to the target disk. Will I improve the lifetime of that disk?

    • In this case, yes, but if you write only the differences, the backup will go a lot faster and really improve your disk's lifetime.
    • When the disks are different, you re-write the whole disk, which is not efficient!
  5. I was thinking of leaving count at 1 and increasing the block size. Is this a good idea or a bad idea?

    • Overall I find this a bad idea. Why reinvent the wheel?
  6. Would this same script work with an image file as output?

    • Yes.
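
To illustrate the checksum swap mentioned in point 2, here is a minimal sketch. It reuses the question's $block and $count variables; sha1compute is just an illustrative name, and sha256sum would work the same way:

# Illustrative drop-in replacement for md5compute(): same chunked read,
# but hashed with sha1sum so an accidental match is even less likely.
function sha1compute()
{
    dd if="$1" skip="$2" bs="$block" count="$count" 2>/dev/null | sha1sum - | awk '{ print $1 }'
}

The trade-off is a little more CPU time per chunk, which is the extra resource usage mentioned above.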

Functionality answer

For jobs like this, you may use rsync! With this tool you can (see the example after this list):

  • Compress data during transfer
  • Copy over the network
  • Tunnel through SSH (or not)
  • Transfer (and write) only modified blocks
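
For example, a minimal sketch of keeping a disk image file in sync with rsync; the image path and destination are hypothetical, and the --inplace / --no-whole-file options force rsync's delta algorithm so only the changed blocks of the image are rewritten:

# Hypothetical paths; --inplace rewrites changed blocks of the existing target
# file, and --no-whole-file keeps the delta algorithm enabled even for local copies.
rsync -av --inplace --no-whole-file /path/to/laptop.img /mnt/backupdisk/laptop.img

For a file-level backup (mounted source and mounted backup drive), a plain rsync -a --delete source/ target/ gives the same "copy only what changed" behaviour.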

Using ssh, dd and sha1sum

Here is the kind of command I sometimes run:

# Read the source device remotely, checksum the stream on both ends, and write it locally
ssh $USER@$SOURCE "dd if=$SRCPATH/$SRCDEV | tee >(sha1sum >/dev/stderr); sleep 1" |
    tee >(sha1sum >/dev/tty) | dd of=$LOCALPATH/$LOCALDEV

This does a full read on the source host, then a sha1sum before sending the data to localhost (the destination), then another sha1sum to verify the transfer before writing to the local device.

This may render something like:

2998920+0 records in
2998920+0 records out
1535447040 bytes (1.4 GiB) copied, 81.42039 s, 18.3 MB/s
d61c645ab2c561eb10eb31f12fbd6a7e6f42bf11  -
d61c645ab2c561eb10eb31f12fbd6a7e6f42bf11  -
2998920+0 records in
2998920+0 records out
1535447040 bytes (1.4 GiB) copied, 81.42039 s, 18.3 MB/s