(python) Implementing C insort in Python


So I'm using insort from bisect to insert strings into a sorted list. Somehow it's faster if I use the built-in one. By faster I mean, on average, twice as fast (1 millisecond vs 2 milliseconds on a 10,000-word list). I time it from the Unix command line, on a bigger list than the one below, with:

time python script.py < wordList.txt

I'm thinking it has to do with C, but I don't understand how or why. I want to make mine as fast, but without using the built-in one.

Here it is, straight from the bisect source code:

def insort_left(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.

    If x is already in a, insert it to the left of the leftmost x.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    a.insert(lo, x)

Bisect Source Code

This is the part that I think makes it different:

# Overwrite above definitions with a fast C implementation
try:
    from _bisect import *
except ImportError:
    pass
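
A quick way to check which implementation you actually got (assuming CPython; other interpreters may not ship the _bisect accelerator at all):

import bisect

# '_bisect' means the C accelerator loaded; 'bisect' means the
# pure-Python fallback is in use.
print(bisect.insort_left.__module__)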

Here is a list of input:

wordlist = ["hello", "jupiter", "albacore", "shrimp", "axe", "china", "lance", "peter", "sam", "uncle", "jeniffer", "alex", "fer", "joseph"]

Some code to make it work:

sorted_list = []
for line in wordlist:
    insort_left(sorted_list, line)
print(sorted_list)
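
For a more controlled comparison than the shell's time, here is a rough timeit sketch (illustrative only; it assumes the pure-Python insort_left copied above is defined in the same file, and the word list and repeat count are arbitrary):

import timeit
import bisect

words = ["hello", "jupiter", "albacore", "shrimp", "axe", "china"] * 1000

def insort_all(insort, data):
    # Build a sorted list by repeated insertion with the given insort.
    out = []
    for w in data:
        insort(out, w)
    return out

# Pure-Python copy from above vs. the (usually C) bisect.insort_left.
print(timeit.timeit(lambda: insort_all(insort_left, words), number=10))
print(timeit.timeit(lambda: insort_all(bisect.insort_left, words), number=10))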

So my concern is implementing a C-based insort in pure Python, without using modules. How can I do this?

1 Answer

user2246674:

Python executing on a VM will never be as fast as native code in a case like this.

The reason is that the Python VM simply has to do more work to execute Python code: the C code is already compiled and runs directly. While the C extension code still needs to interact with the Python VM/runtime, it has the additional benefit that it can perform certain operations directly through the exposed C API. There is the RPython project, which can eliminate some of the runtime overhead of Python code thanks to typing, but it has restrictions (no pun intended).

Instead of trying to make the Python code "faster" than C (which will not happen for the same algorithm), make the Python code "smarter" than C.

In this case, the algorithm used has a poor complexity of ~O(n^2), because it is effectively a nested loop: one loop reads each line, and each insort call, while it only needs O(lg n) comparisons for the binary search, ends with an O(n) list.insert that shifts elements to make room. Summed over n insertions, that is ~O(n^2). This complexity can and likely should be improved to ~O(n lg n), and that will make a difference in performance for n above some (relatively small) value.

Assume that lines is a list containing all the lines in the file, then consider the following approaches. Do not re-run these sort proposals inside a loop!

  1. lines.sort() - the built-in list.sort is a good hybrid sort and has a much better bound than the repeated insort calls above. This is possibly cheating because it still uses the "native C" implementation provided by Python. In any case, it should absolutely crush the insort implementation for more than a couple [dozen] lines.

  2. largeSort(lines) - where largeSort is a "pure Python" implementation of a merge sort or a comb sort (see the merge sort sketch after this list). For large enough n (over unsorted data) this will be faster than the C insort code, because O(n lg n) eventually beats O(n^2). Due to the additional constant overhead of executing Python code, tests will need to be conducted to find the value of n at which this starts to dominate.

  3. smallSort(lines) - where smallSort is a "pure Python" insertion sort implementation (see the second sketch below). This may be better for really small lists and can be performed online/streaming like the insort approach. Performance tests would need to be done, but it may be faster than a "pure Python" insort in this particular case.
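
For illustration, here is one way largeSort could look as a plain recursive merge sort (a sketch only; largeSort is just the placeholder name used above, not a standard function):

def largeSort(items):
    # Pure-Python top-down merge sort; returns a new sorted list.
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = largeSort(items[:mid])
    right = largeSort(items[mid:])
    # Merge the two sorted halves, keeping the sort stable.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged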
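And one way smallSort could look as a textbook insertion sort (again only a sketch, under the same naming assumption):

def smallSort(items):
    # Pure-Python insertion sort; returns a new sorted list.
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot right, then place key.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result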

In any case, the precise constant-factor overheads and the datasets being sorted are relevant. I would suggest measuring (i.e. graphing) the performance of the various approaches on anticipated data, including worst-case scenarios; a rough harness for that follows below. Even something such as pre-splitting the data may (relatively) benefit one approach over another or yield misleading/non-ideal gains. One more consideration: timings ("benchmarks") of the entire Python program may be dominated by overhead not even associated with the sorting mechanism employed.
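
A sketch of such a harness (illustrative only; it assumes the largeSort and smallSort definitions above, the word length and list sizes are arbitrary, and the plotting/graphing step is left out):

import random
import string
import timeit

def random_words(n):
    # n pseudo-random 8-letter lowercase words.
    return ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(n)]

# Insertion sort degrades quickly, so keep n modest here.
for n in (100, 1000, 5000):
    data = random_words(n)
    t_builtin = timeit.timeit(lambda: sorted(data), number=5)
    t_merge = timeit.timeit(lambda: largeSort(data), number=5)
    t_insert = timeit.timeit(lambda: smallSort(data), number=5)
    print(n, t_builtin, t_merge, t_insert)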