Do I need to verify a file's checksum after uploading it to my Hadoop cluster using WebHDFS? How do I compare the checksums of a local file and an HDFS file?

  1. Does WebHDFS carry out checksum verification? When I upload a file to my remote Hadoop cluster using WebHDFS, does it compute a checksum of the file before and after the upload to verify that the file was uploaded correctly? For the sake of completeness, I am using this library to carry out WebHDFS actions: https://github.com/mtth/hdfs. Can anyone share the GitHub repo for WebHDFS itself so that I can check this myself? I tried finding it in the official Hadoop repo but was not able to.
  2. How do I compare the checksum of my local file with that of the HDFS file? By default, HDFS computes CRC32C checksums over individual chunks and reports the file checksum as an MD5 of the MD5s of those chunk checksums. The Unix md5sum command does not calculate a checksum in this manner, so the same file gives me one checksum on my local system and a different one from Hadoop.
  3. Possible ways to compare checksums that I have found so far, and why they don't work for me:

Suggested way 1:

$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f  -

$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f  /path/to/localFilesystem/file.txt

Why I do not like this: It will not work efficiently for very large files. My files can have millions of rows and be several GB in size.
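For reference, the MD5 comparison in way 1 does not require holding the whole file in memory. The following is a minimal Python sketch, not a definitive implementation; the function names and the 1 MiB chunk size are illustrative assumptions. It hashes any stream of byte chunks incrementally, so memory use stays bounded even for multi-GB files.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex MD5 of an iterable of bytes objects, consumed chunk by chunk."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield the contents of a local file in pieces of at most chunk_size bytes."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def local_md5(path, chunk_size=1024 * 1024):
    """Hex MD5 of a local file; comparable with the output of md5sum."""
    return md5_of_chunks(read_in_chunks(path, chunk_size))
```

The same `md5_of_chunks` helper could be fed from a WebHDFS read. With the mtth/hdfs library this would presumably look like `with client.read(hdfs_path, chunk_size=1024 * 1024) as reader: md5_of_chunks(reader)`, assuming `client` is an already configured client object. Note that this only bounds memory: the full file still has to be transferred from the cluster, so it does not remove the network cost for very large files.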

Suggested way 2:

Starting from Hadoop 3.1, HDFS can report a file checksum that is comparable with one computed outside HDFS. However, the comparison depends on how you put the file into HDFS in the first place. By default, HDFS computes CRC32C checksums over individual chunks and reports the file checksum as an MD5 of the MD5s of those chunk checksums.

This means that you can't easily compare the default checksum with that of a local copy. Instead, you can write the file initially with the CRC32 checksum type:

hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp

Then, to get the checksum:

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile

For the local copy:

crc32 myFile
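If the crc32 command is not available locally, the same value can be computed in Python. This is a minimal sketch under stated assumptions: the function name and chunk size are illustrative, and `zlib.crc32` implements plain CRC32 (not CRC32C), so the result is only meaningful for files written to HDFS with the CRC32 checksum type as above.

```python
import zlib


def local_crc32(path, chunk_size=1024 * 1024):
    """Return the CRC32 of a local file as 8 lowercase hex digits."""
    crc = 0
    with open(path, "rb") as handle:
        # Feed the running CRC back in so the file is processed incrementally.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")
```

Because the CRC is updated chunk by chunk, this works for files of any size without loading them into memory.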

If you didn't upload the file with the CRC32 checksum type and don't want to upload it again, you can instead upload the local copy you want to compare against, using the default CRC32C checksum type:

hdfs dfs -put myFile /tmp

And compare the two files on HDFS with:

hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile

Why this solution does not work for me: I am using WebHDFS. How do I achieve the above using a WebHDFS library?
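For context, the WebHDFS REST API has a GETFILECHECKSUM operation that returns a JSON object with a FileChecksum entry containing algorithm, bytes, and length fields. The sketch below only builds the request URL and parses a response body; the host, port, user, and function names are hypothetical placeholders, and the sample response in the usage note is made up for illustration. Which algorithm the server reports depends on the cluster configuration, so the value is only comparable with another checksum computed the same way.

```python
import json
from urllib.parse import quote, urlencode


def checksum_url(host, port, hdfs_path, user=None):
    """Build a WebHDFS GETFILECHECKSUM URL (host, port, and user are placeholders)."""
    params = {"op": "GETFILECHECKSUM"}
    if user:
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


def parse_file_checksum(body):
    """Extract (algorithm, bytes, length) from a GETFILECHECKSUM JSON response."""
    info = json.loads(body)["FileChecksum"]
    return info["algorithm"], info["bytes"], info["length"]


def same_checksum(body_a, body_b):
    """True if two responses report the same algorithm and checksum bytes."""
    return parse_file_checksum(body_a)[:2] == parse_file_checksum(body_b)[:2]
```

For example, `checksum_url("namenode.example.com", 9870, "/tmp/myFile", user="alice")` yields a URL that can be fetched with any HTTP client that follows redirects, since the namenode redirects this request to a datanode. A response such as `{"FileChecksum": {"algorithm": "MD5-of-1MD5-of-512CRC32", "bytes": "eadb10de", "length": 28}}` (an invented sample) would parse into its three fields. The mtth/hdfs client also appears to expose this operation as `client.checksum(hdfs_path)`, which would avoid building the URL by hand.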
