How do content addressable storage systems deal with possible hash collisions?

Question

How do content addressable storage systems deal with possible hash collisions?

1.1k views Asked by user7610 At 19 December 2014 at 11:48

Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?

To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?

Original Q&A

There are 3 answers

eltronix On 04 January 2022 at 14:29

Composite Key e.g hash + userId

Jim Grisham On 14 February 2022 at 22:48

In an ideal collision-resistant system, when a new file / object is ingested:

A hash is computed of the incoming item.
If the incoming hash does not already exist in the store:
1. the item data is saved and associated with the hash as its identifier
If incoming hash does match an existing hash in the store:
1. The existing data is retrieved
2. A bit-by-bit comparison of the existing data is performed with the new data
3. If the two copies are found to be identical, the new entry is linked to the existing hash
4. If the new copies are not identical, the new data is either
  1. Rejected, or
  2. Appended or prefixed* with additional data (e.g. a timestamp or userid) and re-hashed; this entire process is then repeated.

So no, it's not inevitable that information is lost in a content-addressable storage system.

* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to notate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.

While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.

Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)

The risk profile of such a system (e.g. a single git repository) is far different than for an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparent random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.

TechQA.

How do content addressable storage systems deal with possible hash collisions?

There are 3 answers

Related Questions in STORAGE

Related Questions in HASH-COLLISION

Related Questions in FUTURE-PROOF

Related Questions in CAMLISTORE

Popular Questions

Popular Tags

Trending Questions