I have a large set of strings, on the order of ~10^12, and I need to choose an appropriate data structure so that, given a string, I can retrieve an associated integer value in something like O(log n) or O(m) time, where n is the number of strings in the set and m is the length of each string.

We can expect that our set of strings, each of length m and encoded over some alphabet of size q, covers nearly all possible strings of this length. For example, for binary strings of length m = 39 there are only 2^39 ≈ 5.5 × 10^11 possibilities, so a set of ~10^12 strings of that length would cover essentially the entire space.

As such, I'm concerned about finding an appropriate hashing function for the strings that avoids collisions. Is there a good one I can use? How long will it take me to index my set of n strings?

Or should I go with a suffix tree? We know that Ukkonen's algorithm allows for linear-time construction, and my guess is that this will save space given the large number of similar strings.


There are 3 answers

Leaurus

Given your extremely large number of strings, your choice has to take several points into account:

1. Are your indexing structures going to fit in memory?

For hash tables the answer is clearly no, so access will be much slower than an in-memory O(1) lookup. Still, you need only one disk access per lookup (and the whole insertion process is O(n)).

For a B-tree I made some theoretical calculations, assuming a B+tree (to save more space in interior nodes) and also that interior nodes get fully occupied. By this analysis, even the interior nodes won't fit in memory:

  • The usual disk page size is 4096 bytes. That is the size of one B-tree node.
  • The average size of your strings is 70 bytes (if it is less, all the better).
  • A child-node address takes 4 bytes.
  • An interior node holds d keys and d+1 child addresses, so:
    4096 B ≥ 70·d + 4·(d+1), i.e. 74·d + 4 ≤ 4096 ⇒ d ≤ ~55; take d = 54

#internal nodes in memory -> #leaf nodes on disk -> #strings mapped

0 internal nodes -> 1 leaf node -> 53 strings mapped
1 internal node -> 54 leaf nodes (each holding 53 keys) -> ≈53² strings mapped
1+54 internal nodes -> 54² leaf nodes -> ≈53³ strings mapped
...
1+54+...+54⁵ internal nodes -> 54⁶ leaf nodes -> ≈53⁷ strings mapped

53⁷ > 10^12, but those internal nodes alone take roughly 54⁵·4096 bytes ≈ 1.9 TB, i.e. more than 1 TB of memory.
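The arithmetic above is easy to sanity-check with a few lines of C. The sketch below simply re-derives the fanout, the number of interior levels needed for 10^12 keys, and the memory the interior pages would occupy; the 4096-byte page, 70-byte key and 4-byte pointer sizes are the same assumptions as in the list above.

```c
#include <stdio.h>

int main(void)
{
    const double PAGE = 4096.0;   /* assumed disk page / node size in bytes */
    const double KEY  = 70.0;     /* assumed average key (string) size      */
    const double PTR  = 4.0;      /* assumed child-pointer size             */
    const double N    = 1e12;     /* number of strings to index             */

    /* Interior node: d keys plus (d+1) child pointers must fit in one page. */
    long d = (long)((PAGE - PTR) / (KEY + PTR));

    double leaves   = 1.0;        /* leaf pages reachable                    */
    double mapped   = d - 1;      /* strings mapped (roughly d-1 keys/leaf)  */
    double interior = 0.0;        /* interior pages that must stay in memory */
    int    levels   = 0;

    while (mapped < N) {          /* add interior levels until N keys fit    */
        interior = interior * d + 1;
        leaves  *= d;
        mapped  *= d;
        levels++;
    }

    printf("fanout d            : %ld\n", d);
    printf("interior levels     : %d\n", levels);
    printf("interior pages      : %.3g (%.2f TB at 4096 B/page)\n",
           interior, interior * PAGE / 1e12);
    printf("strings addressable : %.3g\n", mapped);
    return 0;
}
```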

If your strings are not uniformly distributed, you can exploit common prefixes (prefix compression). That way an internal node can address more children, saving memory. BerkeleyDB has that option.


2. What kind of access are you going to employ? A large or small number of reads?
If you have a large number of reads, are they random or sequential?
  • If your access is sequential you might still benefit from a B-tree, because you will hit cached nodes a lot (avoiding disk accesses) and the leaves are sequentially linked (B+tree). This is also great for range queries (which I think is not your case). If your access is completely random, then a hash table is faster, as it always needs only one disk access, while a B-tree needs one disk access for each level stored on disk.

  • If you are going to make a small number of accesses, a hash table is preferable, since insertion will always be faster.

  • Since you know the total number of strings in advance, you can tell the hash table, and you will not lose time on bucket-resizing operations (which force all elements to be rehashed). See the sketch after this list.
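A minimal sketch of that pre-sizing idea in C (illustrative only: a chained in-memory table whose bucket array is allocated once from the known key count; for 10^12 keys the bucket array alone would not fit in RAM, so in practice the same idea would be applied to the on-disk structure instead):

```c
#include <stdint.h>
#include <stdlib.h>

/* One entry per key: the string and its associated integer value. */
struct entry {
    char         *key;
    uint64_t      value;
    struct entry *next;   /* chaining for collisions */
};

struct table {
    struct entry **buckets;
    size_t         nbuckets;
};

/* Pre-size the bucket array from the known key count and a target
 * load factor, so no resize/rehash ever happens later. */
struct table *table_create(size_t expected_keys, double load_factor)
{
    struct table *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->nbuckets = (size_t)(expected_keys / load_factor) + 1;
    t->buckets  = calloc(t->nbuckets, sizeof *t->buckets);
    if (!t->buckets) { free(t); return NULL; }
    return t;
}
```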

Note: I found something about Ukkonen's suffix tree. Insertion is linear, and access is sequential as well. However, I only found it being used on data sets of a few GB. Here are some references about suffix-tree algorithms: [ref1], [ref2] and [ref3].

Hope this helps somehow...

Georgi


Hi Bob,

The long answer made short: the classic HASH+BTREE approach is strong and super fast.

Whether 10 million or 10 billion strings are to be stored in the above structure doesn't matter - you always have a very low maximum number of seeks.

Well, you need 10^12 = 1,000,000,000,000 keys - that is 1 trillion, which surprises me; even my heavy string corpora are in the range of 1 billion.

Just check my implementation in C at: http://www.sanmayce.com/#Section13Level

> As such, I'm concerned about finding an appropriate hashing function for the strings that avoids collisions. Is there a good one I can use?

The fastest hash-table lookup function in C that I know of is here:

http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture3

It is 300-500% faster than strong CRC32 8-slice variants (both the Castagnoli and Koopman polynomials) while exhibiting a similar collision rate.
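For comparison, a much simpler (and slower) general-purpose string hash such as 64-bit FNV-1a can be written in a few lines of C; this is a generic baseline, not the function linked above:

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a: a simple general-purpose string hash.
 * Generic baseline only - not the lookup function linked above. */
static uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV-1a 64-bit prime */
    }
    return h;
}
```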

user1951747

Hash tables are useful when the keys are sparse, but when the keys are dense there is no need to hash; you can use the key (the string) itself as the index. To support simple membership queries, you could use a bit vector. If your data is 39-bit binary strings, you'd have a bit vector of length 2^39: 1 means the string is present, 0 means it's absent. The bit vector would not be terribly large for this problem, since 2^39 bits = 2^36 bytes = 64 GB.
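A minimal sketch of this bit-vector lookup in C, sized for the 39-bit example (the 64 GB allocation obviously assumes a machine with that much memory, or a memory-mapped file):

```c
#include <stdint.h>
#include <stdlib.h>

#define NBITS (1ULL << 39)   /* one bit per possible 39-bit string */

/* Set and test one bit in a flat bit vector. */
static void bit_set(uint8_t *bits, uint64_t key)
{
    bits[key >> 3] |= (uint8_t)(1u << (key & 7));
}

static int bit_test(const uint8_t *bits, uint64_t key)
{
    return (bits[key >> 3] >> (key & 7)) & 1;
}

int main(void)
{
    /* 2^39 bits = 2^36 bytes = 64 GB; calloc also zero-initialises it. */
    uint8_t *bits = calloc(NBITS / 8, 1);
    if (!bits) return 1;

    bit_set(bits, 0x123456789ULL);            /* mark one 39-bit key as present */
    return bit_test(bits, 0x123456789ULL) ? 0 : 1;
}
```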

To go from a string over a q-letter alphabet to an integer, you treat it as a number in base q. For example, if q=4 and the string is 3011, find the integer as 3*4^3 + 0*4^2 + 1*4^1 + 1*4^0, which equals 197.
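In code this is just positional-number evaluation. A small C helper, assuming the characters are the digits '0' through '0'+q-1 as in the example:

```c
#include <stdint.h>

/* Interpret a string of digits '0'..('0'+q-1) as a base-q number.
 * key_to_index("3011", 4) == 3*64 + 0*16 + 1*4 + 1 == 197, as above. */
static uint64_t key_to_index(const char *s, unsigned q)
{
    uint64_t x = 0;
    while (*s)
        x = x * q + (uint64_t)(*s++ - '0');
    return x;
}
```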

The associated integer values will consume a lot of space. You could store them in an array indexed by the string, so in your example you'd have an array of 2^39 integers, with some slots empty. That is unlikely to fit in memory, though, since it would consume half a terabyte even if each integer were only one byte. In that case, you could store the values sequentially in a disk file instead.
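A sketch of that disk-file layout: treat the file as a flat array of fixed-size values and seek to key × sizeof(value). The 32-bit value size is an illustrative assumption, and fseeko is the POSIX large-file variant of fseek:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Read the value stored for `key` from a file laid out as one fixed-size
 * slot per possible key. Returns 0 on success, -1 on seek/read failure. */
static int value_lookup(FILE *f, uint64_t key, uint32_t *value_out)
{
    if (fseeko(f, (off_t)(key * sizeof *value_out), SEEK_SET) != 0)
        return -1;
    return fread(value_out, sizeof *value_out, 1, f) == 1 ? 0 : -1;
}
```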

You'll probably find it helpful to look up information on bit vectors/bit arrays: http://en.wikipedia.org/wiki/Bit_array

The Wikipedia article discusses compression, which might be applicable here.