I have a large set of strings, on the order of ~10^12 or so, and I need to choose an appropriate data structure so that, given a string, I can retrieve an associated integer value in something like O(log(n)) or O(m) time, where 'n' is the number of strings in the set and 'm' is the length of each string.
We can expect that our set of strings, each of length 'm' and encoded over some alphabet of size 'q', covers nearly all possible strings of this length. For example, imagine we have 10^12 all-unique binary strings of length m = 40. Since there are 2^40 ≈ 1.1 × 10^12 possible binary strings of that length, this implies we've covered ~91% of them.
As such, I'm concerned about finding an appropriate hash function for the strings, one that avoids collisions. Is there a good one I can use? And how long will it take me to index my set of n strings?
Or should I go with a suffix tree? We know that Ukkonen's algorithm allows linear-time construction, and my guess is that this will save space given the large number of similar strings?
Given your extremely large number of strings, your choice has to focus on several points:
For a hashtable, the table clearly will not fit in memory, so access time will be much slower than the in-memory O(1). Still, you need just one disk access per lookup (and the whole insertion process would be O(n) disk accesses).
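To make the "one disk access per lookup" point concrete, here is a minimal sketch of a disk-resident hashtable with fixed-size slots, so the slot offset is computable directly from the hash. All sizes and names here are my own illustrative assumptions, not anything from your question:

```python
import struct
import hashlib

KEY_SIZE = 40                      # keys padded to 40 bytes (assumption)
VAL_SIZE = 8                       # 8-byte integer value
SLOT_SIZE = KEY_SIZE + VAL_SIZE
NUM_SLOTS = 2 * 10**12             # ~2n slots keeps the load factor near 0.5

def slot_of(key: bytes, probe: int) -> int:
    # One strong hash per key; collisions are handled by linear probing.
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return (h + probe) % NUM_SLOTS

def lookup(f, key: bytes):
    # Each probe is one seek plus one small read; at load factor ~0.5 the
    # expected number of probes per lookup is close to 1.
    padded = key.ljust(KEY_SIZE, b"\0")
    for probe in range(64):
        f.seek(slot_of(key, probe) * SLOT_SIZE)
        slot = f.read(SLOT_SIZE)
        if slot[:KEY_SIZE] == b"\0" * KEY_SIZE:
            return None            # empty slot: key not present
        if slot[:KEY_SIZE] == padded:
            return struct.unpack(">q", slot[KEY_SIZE:])[0]
    return None
```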
For the b-tree I made some theoretical calculations, assuming a b+tree (to save more space in interior nodes) and also that interior nodes get fully occupied. By this analysis it won't fit in memory (a quick sanity check follows the table below):
**4096 B ≥ 4·(d+1) + 70·d ⇒ d ≤ (4096 − 4)/74 ≈ 55 ⇒ take d = 54**

(assuming 4 KiB nodes, 4-byte child pointers, and 70-byte string keys)
*#internal nodes in memory -> #leaf nodes on disk -> #strings mapped*
0 internal nodes -> 1 leaf node -> 53 strings mapped
1 internal node -> 54 leaf nodes (each holding 53 strings) -> 54·53 ≈ 2.9×10³ strings mapped
1+54 internal nodes -> 54² leaf nodes -> 54²·53 ≈ 1.5×10⁵ strings mapped
...
1+54+...+54⁵ internal nodes -> 54⁶ leaf nodes -> 54⁶·53 ≈ 1.3×10¹² strings mapped
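Summing those interior levels gives (54⁶ − 1)/53 ≈ 4.7×10⁸ internal nodes, i.e. about 1.9 TB of interior nodes at 4 KiB each, which is why it won't fit in memory. A quick sanity check of the numbers, under the same assumed node/key/pointer sizes as above:

```python
PAGE = 4096          # node size in bytes (assumed)
FANOUT = 54          # children per interior node, from the formula above
LEAF_CAP = 53        # strings per leaf node
N = 10**12           # total number of strings

levels, leaves = 0, 1
while leaves * LEAF_CAP < N:     # grow the tree one interior level at a time
    leaves *= FANOUT
    levels += 1

interior = sum(FANOUT**k for k in range(levels))
print(levels)                    # 6 interior levels
print(leaves * LEAF_CAP)         # ~1.3e12 strings mapped
print(interior * PAGE / 1e12)    # ~1.9 TB of interior nodes
```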
If your strings are not uniformly distributed, you can exploit common prefixes. That way an internal node can address more children, allowing you to save memory. BerkeleyDB has that option; see the sketch below.
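For intuition, here is a sketch of front coding, one common form of prefix compression. The function name is mine and this is not BerkeleyDB's actual on-disk format:

```python
def front_code(sorted_keys):
    # Store each key as (shared-prefix length, remaining suffix) relative
    # to the previous key; long common prefixes shrink to almost nothing.
    coded, prev = [], b""
    for k in sorted_keys:
        p = 0
        while p < min(len(prev), len(k)) and prev[p] == k[p]:
            p += 1
        coded.append((p, k[p:]))
        prev = k
    return coded

# front_code([b"0001101", b"0001110", b"0001111"])
# -> [(0, b"0001101"), (5, b"10"), (6, b"1")]
```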
If your access pattern is sequential, you might still benefit from the b-tree: you will hit cached internal nodes a lot (needing no disk access), and the leaves are sequentially linked in a b+tree. This is also great for range queries (which I think is not your case). If your access is completely random, the hashtable is faster, as it always needs exactly one disk access while the b-tree needs one disk access for each level stored on disk.
If you are only going to make a small number of accesses, the hashtable is also preferable, because insertion into it will always be faster.
Since you know the total number of your strings in advance, you can indicate it to the hashtable, so you will not lose time on bucket-rescaling operations (which imply rehashing all elements). A sketch of the idea follows.
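As an in-memory illustration (CPython's built-in dict exposes no capacity hint, so this is a toy open-addressing table of my own; a disk-resident version would likewise fix the file size up front, as in the hashtable sketch above):

```python
class PresizedTable:
    # Bucket count is fixed at construction from the known n:
    # no resizing, hence no rehashing, ever.
    def __init__(self, n, load=0.5):
        self.size = int(n / load) + 1
        self.keys = [None] * self.size
        self.vals = [None] * self.size

    def put(self, key, val):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size      # linear probing
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % self.size
        return None
```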
Note: I looked up Ukkonen's suffix tree. Construction is linear-time, and lookup is linear in the length of the query string. However, I only found it being used with inputs of a few GBs. Here are some references about suffix tree algorithms: [ref1], [ref2] and [ref3].
Hope this helps somehow...