Find Substring of Trie Keys

671 views Asked by At

This seems like it should be a common problem yet I can't seem to find anything that exactly fits my needs.

I have 2 sequence(fasta) files. One is an assembly of the other, so I would like to make a trie or suffix tree out of the assembled sequences, and search each for the sub sequences.

For example:

s1 = 'ATTCCG'
s2 = 'ATT'
s3 = 'CCG'

I would like to use s1 as a key in a trie, and search it for substrings s2 and s3.

Here is what I have tried so far:

  1. Bio pythons trie and triefind:

This only allows for the whole key to match, or for a prefix to match. So in the above example, I could only successfully search s2 but not s3.

  1. Suffix tree for python.

I looked at numbers 2 and 4 from this but neither seemed capable of handling the immense string sizes that I am working with (each string can be up to a few million characters long).

If possible, I would prefer to use the Trie becuase I have used it on this sized dataset before with success. Is there a way to search for substrings that are not whole matches or prefixes?

If not, what is a suitable suffix tree library that can handle very large strings?

I am using a linux 24 core 128 GB RAM machine.

Thank you

0

There are 0 answers