How to use std::hash to hash data by chunks and not all together?

653 views Asked by At

I am learning about std::hash and I would like to make a program that hashes a file, even if that file is large. I don't want to load that entire file in memory, I'd rather just hash individual chunks and then somehow hash the hashes.

I understand that no matter what I do, if the base of it is std::hash, it will not be super robust. But at least, I want a way to combine the data in the hash so that all data have influence on the hash.

This is my code now, it should run just fine with C++17:

#include <cstddef>
#include <string>
#include <functional>
#include <cinttypes>
#include <fstream>
#include <array>
#include <cstring>

using byte = unsigned char;
using hashType = std::uint64_t;

std::size_t hashFileNaive(const std::string& file)
{
  std::hash<hashType> myHash;

  constexpr std::size_t chunkSize = 4*1024;
  // we need to be able to split each chunk into a set of hashable numbers
  static_assert(chunkSize % sizeof(hashType) == 0);
  constexpr std::size_t valuesPerChunk = chunkSize / sizeof(hashType);

  size_t hash = 0;

  std::ifstream ifs(file, std::ios::binary | std::ios::ate);

  if (!ifs)
    throw std::runtime_error(std::string("Cannot open file, ERROR:") + std::strerror(errno) + " for path " + file);

  // calculate size, for empty files we define the hash as 0
  auto end = ifs.tellg();
  ifs.seekg(0, std::ios::beg);
  auto start = ifs.tellg();
  std::size_t size = end - start;

  if(size == 0)
    return 0;

  // read in chunks, later maybe read while hashing the previous chunk
  // also maybe use larger chunks and allocate them on the heap
  std::array<hashType, valuesPerChunk> chunk;
  
  do
  {
    // clear old data
    memset(chunk.data(), 0, chunkSize);
    if (!ifs.read((char*)chunk.data(), chunk.size()))
      throw std::runtime_error(std::string("Read failed, ERROR:") + std::strerror(errno) + " for path " + file);

    for (std::size_t i = 0; i < chunk.size(); ++i)
    {
      // this is already sketchy - feels like at some point, early values will have no effect on the hash
      hash *= 7;
      hash = hash ^ myHash(chunk[i]);
    }
  } while ((std::size_t)ifs.tellg() + chunkSize < size);

  return hash;
}

It seems to work, but note the part where I multiply the hash by 7 every time I XOR it with a new value. For sure that must mean that with enough values, the first value will have no influence on the hash as all the multiplications will move it "out of scope" of the 16 bit uint.

Is there a better way? I know I can get actual cryptographic library, or at least rip some more robust code off the internet, but I'd like to educate myself - this is not a work project so it doesn't need to be perfect.

0

There are 0 answers