I have large ordered sequence of symbols, millions of symbols. I have to find repeated ordered subsequences such that:
- Search subsequences are unknown, I have to find subsequences that repeats elsewhere of large sequence.
- Subsequences may have differences such as presence some amount of noise and absence of some symbols.
Not necessary condition:
- Subsequences may have little amount of permutations of neighbor symbols.
The alphabet consists of thousands symbols.
Can you recommend well-known and well-studied algorithm for such task?
You can try aho-corasick multiple pattern matching and use a wildcard to search for substrings. For subsequence you want also the levenstein-distance. You can try my implementation in PHP of aho-corasick algorithm with wildcard at https://phpahocorasick.codeplex.com.