I am trying to find the time complexity of str.replace() built into Python, and this is what I've managed to gather (here and on other sites):

I know replace() is based on the Boyer–Moore algorithm, which takes O(n*m) time in the worst case to find a substring; but is that for a single substring? Does replace() produce a copy of the "fixed" string when it finds the first substring and then start searching again?
And what about when we have multiple occurrences of a substring, like the following example:
old_string = '192.168.1.1'
new_string = old_string.replace('.', '|')
If it can only replace one substring at a time, then we get O(n*m) for a single substring, times the number of substrings, which is at most n/m. That comes to O(n^2)!
Given that a simple loop would take O(n), like:
old_string = '192.168.1.1'
chars = []
for ch in old_string:
    chars.append('|' if ch == '.' else ch)
new_string = ''.join(chars)  # join the pieces back into a string
Does that make sense? Am I missing something?
Could the built-in replace() be flawed for multiple replacements, or is it implemented in a way that continues where it left off?
The worst case is O(n*(m1 + m2/m1)), where n is the length of the string, m1 is the length of the searched-for string and m2 is the length of the replacement. The average case is O(n * (1 + m2/m1)).

For your example that means m1 = m2 = 1 (replacing '.' with '|'), so even the worst case is O(n*(1 + 1)) = O(n), not O(n^2).

In principle the algorithm looks like this:
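Sketched in Python for clarity (this is my approximation of the structure; the real implementation is C code inside CPython with extra fast paths, so treat the names and details below as illustrative):

def replace(s, old, new):
    # Illustrative sketch only; CPython writes into a preallocated,
    # resizable buffer rather than a Python list. Assumes old != ''.
    result = []
    i = 0
    while True:
        j = s.find(old, i)     # search continues where it left off
        if j < 0:
            break
        result.append(s[i:j])  # copy the data up to this match
        result.append(new)     # copy the replacement
        i = j + len(old)       # resume just past the match
    result.append(s[i:])       # copy the data after the last match
    return ''.join(result)

The key point for your question: the search picks up after each match instead of restarting from the beginning.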
There are a lot of details. (For example, the implementation has to manage memory, and it takes fast paths for cases like the replacement being the same size as what it replaces.) But here is an explanation of each step and why it takes that time:

- There can be O(n) data to initialize for the result, so that takes O(n) time.
- The search may advance m1-1 characters forward, fail to match the last one, back up and try again. Hence the searching can be O(n*m1).
- Copying O(n) data takes O(n) time.
- There are at most O(n/m1) matches, each of which we copy m2 data for. However, we can also exceed the size we allocated to put data in. In that case we have to create a new place to put data, copy over what we did, then proceed. The thresholds for resizing are chosen so that the total cost is at most O(n) time (see the sketch after this list).
- Copying the O(n) data after the last match takes O(n) time.
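On the resizing point, the standard trick is geometric growth; here is a generic sketch (my illustration, not CPython's actual strategy or thresholds):

def write(buf, used, data):
    # Grow the buffer geometrically when it would overflow. Each byte
    # then gets copied O(1) times on average across all resizes, so
    # the total copying over a whole replace stays O(n).
    if used + len(data) > len(buf):
        new_buf = bytearray(max(2 * len(buf), used + len(data)))
        new_buf[:used] = buf[:used]  # copy over what we did so far
        buf = new_buf
    buf[used:used + len(data)] = data
    return buf, used + len(data)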
Add those together and absorb the O(n) terms into O(n*m1), and you get the original estimate: O(n*m1 + n*m2/m1 + n) = O(n*(m1 + m2/m1)).

Going back to the average case, the string search usually won't get near the end of the searched-for string before falling back. Most characters don't match; most of the time, if the first character matches, the second one does not, and so on. So the search is usually O(n). Drop the m1 factor from the search term and you get the other estimate, O(n * (1 + m2/m1)).
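And a quick empirical sanity check you can run yourself (timings are machine-dependent; with linear behavior the times should roughly double as the input doubles, rather than quadruple):

import timeit

# Linear behavior: doubling the input size roughly doubles the time.
# Quadratic behavior would roughly quadruple it instead.
for size in (10**5, 2 * 10**5, 4 * 10**5):
    s = '192.168.1.1' * size
    t = timeit.timeit(lambda: s.replace('.', '|'), number=10)
    print(f'{len(s):>8} chars: {t:.4f} s')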