Does str.replace() have a time complexity of O(n^2)?


I am trying to find the time complexity of Python's built-in str.replace(), and this is what I've managed to gather so far (here and on other sites):

I know replace() is based on the Boyer–Moore algorithm, which takes worst-case O(n*m) time to find a substring, but is that the cost for a single occurrence?

Does replace() create a copy of the "fixed" string as soon as it finds the first occurrence, and then start the search all over again from the beginning?

And what about when we have multiple occurrences of a substring, like the following example:

old_string = '192.168.1.1'
new_string = old_string.replace('.', '|')

If it can only replace one occurrence at a time, then each replacement costs O(n*m), times the number of occurrences, which can be as large as n/m. That comes to O(n*m * n/m) = O(n^2)!
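To make the worry concrete, here is a sketch of the behaviour I am imagining, where the search restarts from the beginning after every replacement (replace_restarting is just a name I made up, and it assumes the replacement does not itself contain the old substring, or the loop would never terminate):

def replace_restarting(s, old, new):
    # Hypothetical worst case: restart the scan from index 0 every time.
    # Assumes `new` does not contain `old`, otherwise this never ends.
    while old in s:                          # each membership test is O(n*m)
        i = s.find(old)                      # search again from the start
        s = s[:i] + new + s[i + len(old):]   # O(n) copy per replacement
    return s

print(replace_restarting('192.168.1.1', '.', '|'))  # 192|168|1|1

With up to n/m matches, that is n/m iterations of O(n*m) work each, i.e. O(n^2).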

Meanwhile, a simple character-by-character loop would take only O(n), like:

old_string = '192.168.1.1'
chars = []
for ch in old_string:
    chars.append('|' if ch == '.' else ch)
new_string = ''.join(chars)  # join once at the end, so the whole loop is O(n)
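(The append-then-join pattern is what keeps this O(n). Building the result with += in a loop can, at least in principle, degrade to O(n^2), because each concatenation may copy everything accumulated so far; CPython sometimes optimizes this in place, but that is an implementation detail:)

slow_string = ''
for ch in old_string:
    slow_string += '|' if ch == '.' else ch   # may copy the whole prefix each time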

Does that make sense? Am I missing something?

Could the built-in replace() really be this inefficient for multiple replacements, or is it implemented so that the search continues from where it left off?

btilly (accepted answer):

The worst case is O(n*(m1 + m2/m1)), where n is the length of the string, m1 is the length of the searched-for substring, and m2 is the length of the replacement.

The average case is O(n * (1 + m2/m1)).

In principle the algorithm looks like this:

initialize data structures      # max time O(n)
while find next match:          # max time O(n*m1)
    copy unchanged string       # max time O(n)
    copy replacement            # max time O((n/m1) * m2) + O(n)
copy rest of the string         # max time O(n)

There are a lot of details. (For example, the implementation has to manage memory, and it takes fast paths for special cases, such as the replacement being the same size as the original.) But here is an explanation of each step and why it takes that time.
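For concreteness, here is a minimal Python sketch of that structure. This is not CPython's actual implementation (which is in C), just the same shape expressed with str.find, which resumes from wherever the previous match ended; it also assumes old is non-empty, since the real replace() special-cases an empty search string:

def replace_sketch(s, old, new):
    pieces = []                       # initialize data structures
    start = 0
    while True:
        i = s.find(old, start)        # find next match, resuming at `start`
        if i == -1:
            break
        pieces.append(s[start:i])     # copy unchanged string
        pieces.append(new)            # copy replacement
        start = i + len(old)
    pieces.append(s[start:])          # copy rest of the string
    return ''.join(pieces)

print(replace_sketch('192.168.1.1', '.', '|'))  # 192|168|1|1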

  1. You are initializing a data structure to hold the result. The initialization itself is fast, but there is O(n) data to initialize, so it takes O(n) time.
  2. In the worst case, at each position the search compares m1-1 characters forward, fails to match the last one, backs up, and tries again from the next position. Hence finding all the matches can take O(n*m1) time (see the example after this list).
  3. Copying O(n) data takes O(n) time.
  4. There can be at most O(n/m1) matches, and for each one we copy m2 characters of replacement data. However, we can also run past the size we allocated for the result. In that case we have to allocate a new, larger buffer, copy over what we have so far, and continue. The resizing thresholds are chosen so that the total cost of all these copies is at most O(n).
  5. There can be at most O(n) data after the last match.
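To make step 2 concrete, this is the kind of input that pushes a simple scan toward its O(n*m1) worst case (the sizes are just illustrative, and CPython's actual search has extra heuristics that handle repetitive inputs far better than the plain scan described above):

text = 'a' * 100_000          # n = 100,000
pattern = 'a' * 99 + 'b'      # m1 = 100: matches 99 chars, fails on the 100th
text.find(pattern)            # a plain scan does roughly n*m1 comparisons here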

Add those together, absorb the O(n) terms into O(n*m1), and you get the estimate above.

Going back to the average case: the search usually fails well before it gets near the end of the pattern. Most characters don't match at all; most of the time, if the first character matches, the second one does not, and so on. So the search step is usually O(n). Substitute that in and you get the other estimate.
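As a rough empirical check (timings are machine-dependent, so treat the numbers as indicative only): if replace() behaves linearly, doubling the input should roughly double the time.

import timeit

for k in (1, 2, 4, 8):
    s = '192.168.1.1' * (100_000 * k)
    t = timeit.timeit(lambda: s.replace('.', '|'), number=10)
    print(f'{len(s):>10} chars: {t:.3f} s')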