python member str performance too slow

563 views Asked by At

I have encountered a strange performance problem when adding to an str class member in python 2.7.3. I know that accessing local variables are faster, however, in the problem below, there is more than 100x speed difference between the two loops. The one that accesses a.accum_ starts fast but slows down, as if str iadd is O(n^2) with the length of str.

Does anyone know the reason?

# Fast ( < 1sec):
accum = str()
for ii in range(1000000):
    if (ii % 10000) == 0:
        print 'fast cnt', ii
    accum += 'zzzzz\n'

# Much slower ( > 5 mins):
class Foo:
    pass
a = Foo()
a.accum_ = str()
for ii in range(1000000):
    if (ii % 10000) == 0:
        print 'slow cnt', ii
    a.accum_ += 'zzzzz\n'
2

There are 2 answers

1
Dunes On

Python strings are immutable and so can't have an __iadd__ method. What you are witnessing in the first example is a micro optimisation of the CPython interpreter. In the first example the interpreter has noticed that it has a local variable that has a reference count of 1. Thus, the interpreter can cheekily get away with modifying the string in place. Even though this violates the contract of str, at no point during the execution of the program will it be apparent that this contract was briefly violated.

In the latter example this micro optimisation isn't being implemented, which is why it's so slow. It looks like the optimisation could be applied, so I'm not sure why it isn't being applied.

Generally though, if building a string, collate the substrings in a list and then use str.join to create the end product.

0
Ashwini Chaudhary On

For the first example it is pretty clear that it is case of single reference optimisation(there are two references actually: one from the object itself and one LOAD_FAST; unicode_concatenate will try to reduce it to 1 before passing control to PyUnicode_Append) done by CPython using this unicode_modifiable function:

static int
unicode_modifiable(PyObject *unicode)
{
    assert(_PyUnicode_CHECK(unicode));
    if (Py_REFCNT(unicode) != 1)
        return 0;
    if (_PyUnicode_HASH(unicode) != -1)
        return 0;
    if (PyUnicode_CHECK_INTERNED(unicode))
        return 0;
    if (!PyUnicode_CheckExact(unicode))
        return 0;
#ifdef Py_DEBUG
    /* singleton refcount is greater than 1 */
    assert(!unicode_is_singleton(unicode));
#endif
    return 1;
}

But in the second case as the instance data is stored in a Python dict rather than a simple variable, so the things get little different.

a.accum_ += 'foo'

actually requires prefetching the value of a.accum_ and storing it on to the stack. So, now the string has atleast three references: one from the instance dictionary, one from DUP_TOP and one from PyObject_GetAttr used by LOAD_ATTR. Hence, Python cannot optimise this case as modifying one of them in-place will affect other references as well.

>>> class A:
    pass
...
>>> a = A()
>>> def func():
    a.str = 'spam'
    print a.str
    return '_from_func'
...
>>> a.str = 'foo'
>>> a.str += func()
spam

You would expect the output here to be 'spam_from_func', but it is going to be different because the original value of a.str was stored by Python before func() was called.

>>> a.str
'foo_from_func'

Byte Code:

>>> import dis
>>> def func_class():
        a = Foo()
        a.accum = ''
        a.accum += 'zzzzz\n'
...
>>> dis.dis(func_class)
  2           0 LOAD_GLOBAL              0 (Foo)
              3 CALL_FUNCTION            0 (0 positional, 0 keyword pair)
              6 STORE_FAST               0 (a)

  3           9 LOAD_CONST               1 ('')
             12 LOAD_FAST                0 (a)
             15 STORE_ATTR               1 (accum)

  4          18 LOAD_FAST                0 (a)
             21 DUP_TOP
             22 LOAD_ATTR                1 (accum)
             25 LOAD_CONST               2 ('zzzzz\n')
             28 INPLACE_ADD
             29 ROT_TWO
             30 STORE_ATTR               1 (accum)
             33 LOAD_CONST               0 (None)
             36 RETURN_VALUE

Note that this optimisation was done in around 2004(CPython 2.4) to prevent users from slowness of a += b or a = a + b, so it is mostly meant for simple variables and works only if the next instruction is STORE_FAST(local variable), STORE_DEREF(closures) and STORE_NAME. It is not a general solution, the best way to do this in Python is to create a list and join its items using str.join.

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.