I am looking for the most efficient way to store binary strings in a data structure (insert function) and then when getting a string I want to check if some cyclic string of the given string is in my structure.
I thought about storing the input strings in a Trie but then when trying to determine whether some cyclic string of the string I got now was inserted to the Trie means to do |s| searches in the Trie for all the possible cyclic strings.
Is there any way to do that more efficiently while the place complexity will be like in a Trie?
Note: When I say cyclic strings of a string I mean that for example all the cyclic strings of 1011
are: 0111, 1110, 1101, 1011
Can you come up with a canonicalizing function for cyclic strings based on the following:
This would canonicalize everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically least value: 0111.
0101010101
is a thorny instance for which this algo will not perform well, but if your bits are roughly randomly distributed, it should work well in practice for long strings.You can then hash based on the canonical form or use a trie that will include only the empty string and strings that start with 0 and a single trie run will answer your question.
EDIT:
That's why I said
010101....
is a value for which it performs badly. Let's say the string is of length n and the longest run of 1's is of length r. If the bits are randomly distributed, the length of the longest run is O(log n) according to "Distribution of longest run".The time to find the longest run is O(n). You can implement shifting using an offset instead of a buffer copy, which should be O(1). The number of runs is worst case O(n / m).
Then, the time to do step 3 should be
This leads to a worst case of O(n²) but an average case of O(n + log² n) ≅ O(n).