I am building a site where users can upload content. As always I aim for world dominance, so I would like to avoid storing the same file twice, for instance when a user uploads the same file two times (after renaming it, or simply because she forgot what she has done in the past).
My current approach is to have the database that tracks each uploaded file store the following information about each file:
- file size in bytes
- MD5 sum of file contents
- SHA1 sum of file contents
I would then put a unique index on those three columns. The idea behind using two hashes is to minimize the risk of false positives.
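To make the approach concrete, here is a minimal sketch of how the three values could be computed in a single pass over an upload. The function name `fingerprint` and the chunk size are my own placeholders, not part of any existing code:

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex, sha1_hex) for the file at `path`.

    Hypothetical helper: the file is read in chunks so that large
    uploads do not have to fit in memory.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            size += len(chunk)
            md5.update(chunk)
            sha1.update(chunk)
    return size, md5.hexdigest(), sha1.hexdigest()
```

The resulting tuple is what would go into the three indexed columns.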
So, my question is really: what is the probability of two different ("real-world") files of the same size having identical MD5 and SHA1 hashes?
Or: Is there a smarter method of similar (un)complexity?
(I understand that the probability could depend on the file size).
Thanks!
The probability of two real-world files of the same size having the same SHA-1 hash is zero for all practical purposes. Some weaknesses in SHA-1 have been found, but constructing a file that matches a given SHA-1 hash and size (1) is incredibly expensive in terms of computing power and (2) produces either garbage or the original file.
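To put a rough number on "zero for all practical purposes", here is a back-of-the-envelope sketch using the birthday (union) bound. It assumes the hash behaves like a uniformly random 160-bit function on non-adversarial files, which is an idealization and says nothing about deliberately crafted inputs. The function name and the figure of one billion files are illustrative only:

```python
from fractions import Fraction


def birthday_bound(n_files, hash_bits):
    """Upper bound on the chance that any two of n_files distinct
    files share a hash, assuming an ideal hash with hash_bits of output.
    """
    # Number of unordered pairs of files; n * (n - 1) is always even.
    pairs = n_files * (n_files - 1) // 2
    # Union bound: each pair collides with probability 2**-hash_bits.
    return Fraction(pairs, 2 ** hash_bits)


# One billion stored files, SHA-1 (160 bits).
print(float(birthday_bound(10**9, 160)))
```

Under that assumption the bound for a billion files comes out far below 1e-30, which is many orders of magnitude smaller than the chance of a hardware fault corrupting the comparison.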
Adding MD5 to the mix is total overkill. If you don't trust SHA-1, then a better option is to switch to SHA-2.
If you're really paranoid, compare the contents of files with identical (size, SHA-1) signatures byte by byte. Note, however, that such a comparison has to read both files in their entirety whenever they are in fact equal.
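A minimal sketch of such a comparison, assuming both files are available as regular files on local disk (the name `same_contents` is made up for illustration):

```python
import os


def same_contents(path_a, path_b, chunk_size=1 << 20):
    """Return True if both files hold exactly the same bytes."""
    # Cheap early exit: files of different sizes cannot be equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # first differing chunk ends the scan
            if not a:
                return True  # both files reached EOF together
```

Files that differ are usually rejected after the first mismatching chunk; only genuinely identical files are read to the end, which is the cost mentioned above.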