Python hashlib MD5 digest of any UNC file always yields same hash


The code below shows that three files on a UNC share hosted on another machine all produce the same hash, while local files produce different hashes. Why would this be? I suspect there is some UNC consideration that I don't know about.

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> fn_a = '\\\\some.host.com\\Shares\\folder1\\file_a'
>>> fn_b = '\\\\some.host.com\\Shares\\folder1\\file_b'
>>> fn_c = '\\\\some.host.com\\Shares\\folder2\\file_c'
>>> fn_d = 'E:\\file_d'
>>> fn_e = 'E:\\file_e'
>>> fn_f = 'E:\\folder3\\file_f'
>>> f_a = open(fn_a, 'r')
>>> f_b = open(fn_b, 'r')
>>> f_c = open(fn_c, 'r')
>>> f_d = open(fn_d, 'r')
>>> f_e = open(fn_e, 'r')
>>> f_f = open(fn_f, 'r')
>>> hashlib.md5(f_a.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_b.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_c.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_d.read()).hexdigest()
'd2bf541b1a9d2fc1a985f65590476856'
>>> hashlib.md5(f_e.read()).hexdigest()
'e84be3c598a098f1af9f2a9d6f806ed5'
>>> hashlib.md5(f_f.read()).hexdigest()
'e11f04ed3534cc4784df3875defa0236'

EDIT: To further investigate the problem, I also tested using a file from another host. It appears that changing the host will change the result.

>>> fn_h = '\\\\host\\share\\file'
>>> f_h = open(fn_h, 'r')
>>> hashlib.md5(f_h.read()).hexdigest()
'f23ee2dbbb0040bf2586cfab29a03634'

...but then I tried a different file on the new host, and got a new result!

>>> fn_i = '\\\\host\\share\\different_file'
>>> f_i = open(fn_i, 'r')
>>> hashlib.md5(f_i.read()).hexdigest()
'a8ad771db7af8c96f635bcda8fdce961'

So, now I'm really confused. Could it have something to do with the fact that the original host is a \\host.com format and the new host is a \\host format?


There are 3 answers

0
Shaun On BEST ANSWER

I did some additional research based on the comments and answers everyone provided. I decided I needed to study permutations of these two features of the code:

  1. Whether a raw string literal is used for the path name, i.e.:
    A. The file path string is raw, with single backslashes in the path, vs.
    B. The file path string is not raw, with double backslashes in the path

    (FYI to those who don't know, a raw string is one which is preceded by an "r", like this: r'This is a raw string')

  2. Whether the open function mode is r or rb.
    (FYI again to those who don't know, the b in rb mode tells open to read the file as binary.)
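As a quick check of the first point, the two spellings can be compared directly. This is only a sketch: the path is the placeholder from the question, and nothing is opened, so the share does not need to exist.

```python
# Both literals spell the same UNC path; only the source notation differs.
raw_path = r'\\some.host.com\Shares\folder1\file_a'
escaped_path = '\\\\some.host.com\\Shares\\folder1\\file_a'

# The resulting string objects are equal, character for character.
print(raw_path == escaped_path)
print(len(raw_path), len(escaped_path))
```

Since open receives the same string either way, the literal style alone should not affect what is read.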

The results demonstrated:

  • The choice of string literal (raw vs. escaped backslashes) makes no difference to whether the hashes of different files differ
  • My error was not opening the file in binary mode. When using rb mode in open, I got a different hash for each file.

Yay! And thanks for the help.
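For reference, here is a minimal sketch of hashing in binary mode. The helper names md5_of_file and write_temp, the chunk size, and the sample bytes are all made up for illustration; they are not from the original post. One likely explanation for the identical hashes, which the post itself does not confirm, is that Python 2's text mode on Windows stops reading at the first 0x1A (Ctrl-Z) byte and translates line endings, so files that share their leading bytes can look identical. The sample files below share a prefix up to such a byte to show that binary mode still tells them apart.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary mode."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks until read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(data):
    """Write bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


# Hypothetical contents: identical up to a 0x1A byte, different after it.
path_a = write_temp(b'header\x1aAAAA')
path_b = write_temp(b'header\x1aBBBB')
try:
    hash_a = md5_of_file(path_a)
    hash_b = md5_of_file(path_b)
    print(hash_a)
    print(hash_b)
    print(hash_a != hash_b)
finally:
    os.remove(path_a)
    os.remove(path_b)
```

Reading in chunks is a design choice rather than a requirement; it avoids loading a large file into memory at once, and the digest is the same as hashing f.read() in one call.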

1
satoru On

Call f1.seek(0) if you intend to read the file again; otherwise the file has already been read completely, and calling read() again just returns an empty string.
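A small sketch of that behaviour, using a throwaway temporary file as a stand-in (the file and its contents are made up for illustration). In binary mode on Python 3 the "empty string" is an empty bytes object.

```python
import os
import tempfile

# Create a small temporary file to read from.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'some file contents')

with open(path, 'rb') as f:
    first = f.read()    # reads everything, leaving the position at end of file
    second = f.read()   # nothing left to read, so this is empty
    f.seek(0)           # rewind to the start of the file
    third = f.read()    # the full contents again

os.remove(path)
print(len(first), len(second), len(third))
```

Hashing the result of a second read() without rewinding would hash empty input, which gives the same digest for every file.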

0
patthoyts On

I can't reproduce your problem. I'm using Python 3.4 on Windows 7 here, with the following test script, which accesses files on a network hard disk:

import sys, hashlib
def main():
    fn0 = r'\\NAS\Public\Software\Backup\Test\Vagrantfile'
    fn1 = r'\\NAS\Public\Software\Backup\Test\z.xml'
    with open(fn0, 'rb') as f:
        h0 = hashlib.md5(f.read())
        print(h0.hexdigest())
    with open(fn1, 'rb') as f:
        h1 = hashlib.md5(f.read())
        print(h1.hexdigest())

if __name__ == '__main__':
    sys.exit(main())

Running this results in two different hash values (as expected):

c:\src\python>python hashtest.py
8af202dffb88739c2dbe188c12291e3d
2ff3db61ff37ca5ceac6a59fd7c1018b

If reading the file contents returns different data for the remote files, then passing that data into md5 has to result in different hash values. You might want to print out the first 80 bytes of each file to check that you are getting what you expect.
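One way to do that check might look like the following. The helper name peek is hypothetical, and a temporary file stands in for the remote files so the snippet runs anywhere; in practice you would pass each UNC path instead.

```python
import os
import tempfile


def peek(path, count=80):
    """Return the first `count` bytes of a file, read in binary mode."""
    with open(path, 'rb') as f:
        return f.read(count)


# Stand-in for a remote file: 200 bytes of made-up data.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'x' * 200)

head = peek(sample_path)
os.remove(sample_path)

print(len(head))
print(repr(head[:16]))
```

Comparing the output for two files that hash the same shows quickly whether the data being read is really identical or merely truncated at the same point.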