I am implementing a way to quickly find changes in the sources of my data warehouse.
After a couple of tries, we have found that hashing all the attributes of a given table and comparing the hash to the target is one of the most efficient ways to detect changes.
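Roughly, what we do looks like this (a minimal sketch; the column values, the NULL marker and the delimiter choice are just for illustration):

```python
import hashlib

def row_hash(values) -> str:
    """SHA-512 over all attributes of a row.

    A field separator and an explicit NULL marker are used so that,
    for example, ('ab', 'c') and ('a', 'bc') cannot produce the same
    concatenated input string.
    """
    parts = ("\\N" if v is None else str(v) for v in values)
    return hashlib.sha512("\x1f".join(parts).encode("utf-8")).hexdigest()

source_row = ("42", "John", "some long free text ...")
target_row = ("42", "John", "some long free text ...")
if row_hash(source_row) != row_hash(target_row):
    print("row changed")
```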
However, the non-negligible issue for us is the collision risk, because I need to be able to trust my data 100%.
My understanding is that with SHA-512 the probability should be close to 0 (2^-256...). But what we cannot find out is whether the length of my input string can influence the probability of a collision.
In the case of a table with 20 fields I am confident it will work, but for a table with 280 fields, some of them containing free text... I want to be sure.
I know the maximum input length for SHA-512 is 2^128 bits, but does hashing a longer string of 20,000 characters instead of 200 raise the probability of a collision?
Thanks for your help.
A hashing algorithm's internal compression function always works with fixed-length inputs. So when hashing a long string, the algorithm splits it into data blocks as long as the compression function's required input length (padding the last one if necessary). It then loops over the blocks, combining each block's output with the current state, which is the accumulated output of all the previous blocks.
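Here is a toy sketch of that loop in Python, just to illustrate the structure; the block size, state size and compression function are simplified stand-ins, not the real SHA-512 internals:

```python
import hashlib
import struct

BLOCK_SIZE = 16  # toy block size in bytes; real SHA-512 uses 128-byte blocks
STATE_SIZE = 8   # toy chaining-state size in bytes

def toy_compress(state: bytes, block: bytes) -> bytes:
    """Stand-in compression function: fixed-size state + fixed-size block
    -> new fixed-size state. (Real SHA-512 uses 80 rounds of bit operations.)"""
    return hashlib.blake2b(state + block, digest_size=STATE_SIZE).digest()

def toy_merkle_damgard(message: bytes) -> bytes:
    # Merkle-Damgard strengthening: append a 0x80 byte, zero padding, then
    # the original message length in bits, so that no two distinct messages
    # pad out to the same sequence of blocks.
    length = struct.pack(">Q", len(message) * 8)
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + len(length)) % BLOCK_SIZE)
    padded += length

    state = b"\x01" * STATE_SIZE  # fixed initialisation vector
    for i in range(0, len(padded), BLOCK_SIZE):
        block = padded[i:i + BLOCK_SIZE]
        state = toy_compress(state, block)  # each output feeds the next block
    return state  # the final state is the digest

print(toy_merkle_damgard(b"a" * 200).hex())
print(toy_merkle_damgard(b"a" * 20000).hex())  # longer input, same-size digest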
It has been shown that this construction makes the final hash as collision-resistant as the internal compression function, regardless of how many blocks are processed: any collision in the full hash implies a collision in the compression function itself. See the article on the Merkle–Damgård construction (which SHA-512 uses).
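To put a rough number on the original question: with a 512-bit digest, the chance that any two of n distinct rows collide is approximately n^2 / 2^513 (the birthday bound). Even for n = 10^12 rows that works out to about 4 x 10^-131, i.e. effectively zero, and the figure does not depend on whether each row is 200 or 20,000 characters long.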