I have an algorithm to one-hot encode minHashed genomes and I am seeking opinions on whether I have constructed it correctly based on the nature of minHashing. There's some disagreement between myself and a collaborator and we are trying to find the correct approach.
I have used MASH (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) to minHash a database of raw sequencing reads (fastq files) for 1,000 samples. In summary, for one sample this produces a sketch of 2,000 hash values, where each value is the hash of a 21-mer of nucleotides (alphabet {A,T,C,G}).
I one-hot encode these sketches by comparing the hashes in each new sketch against a database of hashes from previously processed samples. If the new sketch contains a hash that is already in the database, the sample gets a 1 in that hash's column; if the hash is not in the database, we add a new column for it with a 1 for the current sample and a 0 for all previous samples. I believe this produces an accurate one-hot encoding.
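To make sure we're debating the same algorithm, here is a minimal sketch of the encoding step I'm describing, assuming each sample's sketch is just a list of integer hash values (function and variable names are hypothetical, not from MASH):

```python
# Hypothetical illustration of the encoding described above.
# Each sample's sketch is treated as a set of integer hash values;
# order within the sketch is deliberately ignored.

def one_hot_encode(sketches):
    """Build a samples-by-hashes binary matrix from a list of sketches."""
    columns = {}   # hash value -> column index
    rows = []      # one set of column indices per sample
    for sketch in sketches:
        present = set()
        for h in sketch:
            if h not in columns:
                columns[h] = len(columns)  # unseen hash: append a new column
            present.add(columns[h])
        rows.append(present)
    # Materialize as dense 0/1 rows (earlier samples get 0 in later columns)
    width = len(columns)
    return [[1 if j in row else 0 for j in range(width)] for row in rows]
```

For example, two toy sketches `[5, 9]` and `[9, 12]` share the hash 9, so the second sample gets a 1 in 9's existing column and a new column is appended for 12.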
My collaborator believes the order of the hashes in the sketches matters. If this is true, then comparison to the database of previous hashes is only valid when the hash in the new sample sits at the same index of the 2,000-length vector as the previous hash it is being compared to.
My understanding of minHashing is that, assuming no hash collisions, each hash value represents a unique k-mer. The sketch simply stores the 2,000 smallest hash values in ascending order; the randomization comes from the hash function itself, so a hash's index within the sorted sketch carries no meaning. It is therefore not important to compare hashes at the same index, but rather to check whether any of the hashes in one sketch are present in the others.
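To illustrate the difference between the two views with toy hash values: two sketches can share most of their hashes while almost none of them line up at the same index, so index-wise comparison undercounts the overlap that set membership finds.

```python
# Toy sketches (hypothetical hash values, sorted ascending as MASH stores them).
# They share three hashes, but only two happen to sit at the same index.
a = [3, 17, 42, 99]
b = [3, 42, 57, 99]

set_overlap = len(set(a) & set(b))               # membership-based: 3, 42, 99
positional  = sum(x == y for x, y in zip(a, b))  # index-based: only 3 and 99 align
```

Here `set_overlap` is 3 while `positional` is 2, even though both sketches are in sorted order; inserting a single new minimum shifts every subsequent index, which is why positional comparison breaks down.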
This feels quite niche and is difficult to explain in writing, so please let me know if any clarification is needed. Thanks!