Repeated ordered sequence search algorithm

213 views Asked by At

I have large ordered sequence of symbols, millions of symbols. I have to find repeated ordered subsequences such that:

  1. Search subsequences are unknown, I have to find subsequences that repeats elsewhere of large sequence.
  2. Subsequences may have differences such as presence some amount of noise and absence of some symbols.

Not necessary condition:

  1. Subsequences may have little amount of permutations of neighbor symbols.

The alphabet consists of thousands symbols.

Can you recommend well-known and well-studied algorithm for such task?

1

There are 1 answers

1
Micromega On

You can try aho-corasick multiple pattern matching and use a wildcard to search for substrings. For subsequence you want also the levenstein-distance. You can try my implementation in PHP of aho-corasick algorithm with wildcard at https://phpahocorasick.codeplex.com.