hash table suffix tree explanation

1k views Asked by At

I am asking this here because I couldn't find the answer I am looking for elsewhere and I don't know where else I could ask this. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently using bioinformatics. I need to understand in lay language hash tables and suffix trees. Something simple, I don't get the O(n) concepts and all that stuff, I think they are both kind of the same: a way to store string data? But I would like to understand better the differences. This will help enormously to other people like me. We are a lot in this field now!

Thanks in advance.

1

There are 1 answers

1
dkatzel On BEST ANSWER

OK, lets use bioinformatics to help illustrate the differences.

Let's say you have several DNA sequences that are pretty long. If we want to store these sequences in a datastructure.

If we want to use a hashtable

A Hashtable is a useful way to store a bunch of objects but very quickly search the datastructure to see if we already contain a particular object.

One bioinformatics usecase that we can solve with a hashtable is de-duping a large sequence set. Let's say we have a huge dataset of next-gen sequenced data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences. Before inserting any sequences into the hashtable, we can first check to see if it already exists in the hashtable and if it does we skip that read. Only if it is not yet in the hashtable do we add it. Then when we are done the elements in the hash will be the unique sequences.

Hashtables are basically an array of LinkedLists. Each cell in the array we will call a "bin". When we insert or search for something in the hashtable, we have to first know what bin it is in. The way we determine which bin to use is by a hash algorithm.

  1. We have to come up with a hash algorithm. Something that will convert our sequence into a number. A requirement of this equation is the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called as hash collision) since there are an infinite number of possible sequences and we will only have a limited range of possible number values in our hash.

A simple hash algorithm is to assign a value to each base A =1 G =2 C = 3 T =4 (assume no ambiguities) then we can just sum up the bases in our sequence. This would mean that any sequences with the same number of As, Cs Gs and Ts will have the same hash value. If we wanted, we could also have a more complicated algorithm that also takes position into account so to get the same number we would have to also have the same sequence in the same order.

  1. Once we have our hash algorithm. We can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Hashtables are often implemented by an array of LinkedLists. This is a very fast lookup because to see if a sequence is in our hashtable or to add a new sequence to our hash table, we just compute the hash value for the sequence to see what bin it is in, then we only have to look at the values inside that bin. We can ignore the rest of the bins.

suffix tree

A Suffix Tree is a different datastructure which is a graph where each node is (in this case) a residue in our sequence. Edges in the graph will point to the next node etc. So for example if our sequence was ACGT the path in the graph will be A->C->G->T->$. If we had another sequence ACTT the path will be A->C->T->T->$.

We can combine consecutive nodes if there is only 1 path so in the previous example since both sequence start with AC then the paths will be AC->G->T->$and AC->T->T->$.

In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites etc) since we can easily see where there are subpaths in our graph that match our motif.

Hope that helps