Slice List of floats by value in Python


I have a list of several thousand floats that I want to be able to slice by min and max values.

E.G. using:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

(my actual list is 400,000 floats long, but the above is a working example)

I want something like

def listclamp(minn, maxn, nlist):

such that

print listclamp(3, 8, flist)

should give me

[3.3333, 5.4325, 7.6855]

I also need to do this 10,000 to 30,000 times, so speed does count.

(I have no sample code for what I've tried so far, because this is new python territory for me)


There are 3 answers

Answer by syntagma (BEST ANSWER):

This will return the sorted list you want:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

def listclamp(minn, maxn, nlist):
    return sorted(filter(lambda x: minn <= x <= maxn, nlist))

print listclamp(3, 8, flist) 

A faster approach, using list comprehensions:

def listclamp2(minn, maxn, nlist):
    return sorted([x for x in nlist if minn <= x <= maxn])

print listclamp2(3, 8, flist)

Note that depending on your data it may be better to filter the list first and then sort it (as I did in the code above).

For more information on performance, refer to this link.
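If you want to compare the two variants on your own data, a quick check with the standard timeit module might look like this (this is just a sketch; the setup string assumes listclamp, listclamp2 and flist are defined at the top level of your script):

import timeit

setup = "from __main__ import listclamp, listclamp2, flist"
print(timeit.timeit("listclamp(3, 8, flist)", setup=setup, number=1000))
print(timeit.timeit("listclamp2(3, 8, flist)", setup=setup, number=1000))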

Answer by ventsyv:

Sort the list (if you use the same list over and over, sort it just once), then use binary search to find the positions of the lower and upper bounds. Come to think of it, there is a standard-library module that does exactly that: bisect.
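A minimal sketch of that idea, assuming the list is sorted once up front (the function name here is mine, not from the question):

import bisect

def listclamp_sorted(minn, maxn, sorted_list):
    # sorted_list must already be sorted in ascending order
    lo = bisect.bisect_left(sorted_list, minn)    # first index with value >= minn
    hi = bisect.bisect_right(sorted_list, maxn)   # first index with value > maxn
    return sorted_list[lo:hi]

flist_sorted = sorted(flist)
print(listclamp_sorted(3, 8, flist_sorted))   # [3.3333, 5.4325, 7.6855]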

Answer by abarnert:

The obvious thing to do is either sort then filter, or filter then sort.

If you have the same list every time, sorting first is obviously a win, because then you only need to sort once instead of every time. It also means you can use a binary search for the filtering instead of a linear walk, as explained in ventsyv's answer (although that probably won't pay off unless your lists are much longer than this one).

If you have different lists every time, filtering first is probably a win, because the sort is probably the slow part, and you're sorting a smaller list that way.

But let's stop speculating and start testing.

Using a list of several thousand floats, about half of which are in range:

In [1591]: flist = [random.random()*10 for _ in range(5000)]
In [1592]: %timeit sorted(x for x in flist if 3 <= x < 8)
100 loops, best of 3: 3.12 ms per loop
In [1593]: %timeit [x for x in sorted(flist) if 3 <= x < 8]
100 loops, best of 3: 4 ms per loop
In [1594]: %timeit l=sorted(flist); l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
100 loops, best of 3: 3.36 ms per loop

So, filtering then sorting wins; ventsyv's algorithm does make up part of the difference, but not all of it. But of course, if we only have a single list to sort, sorting it once instead of thousands of times is an obvious win:

In [1596]: l = sorted(flist)
In [1597]: %timeit l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
10000 loops, best of 3: 29.2 µs per loop

So, if you have the same list over and over, obviously sort it once.

Otherwise, you could test on your real data… but we're talking about shaving up to 22% off of something that takes milliseconds. Even if you do it many thousands of times, that's saving you under a second. Just the cost of typing the different implementations—much less understanding them, generalizing them, debugging them, and performance testing them—is more than that.


But really, if you're doing millions of operations spread over hundreds of thousands of values, and speed is important, you shouldn't be using a list in the first place, you should be using a NumPy array. NumPy can store just the raw float values, without boxing them up as Python objects. Besides saving memory (and improving cache locality), this means that the inner loop in, say, np.sort is faster than the inner loop in sorted, because it doesn't have to make a Python function call that ultimately involves unboxing two numbers, it just has to do a comparison directly.

Assuming you're storing your values in an array in the first place, how does it stack up?

In [1607]: flist = np.random.random(5000) * 10
In [1608]: %timeit a = np.sort(flist); a = a[3 <= a]; a = a[a < 8]
1000 loops, best of 3: 742 µs per loop
In [1610]: b = np.sort(flist)
In [1611]: %timeit c = b[3 <= b]; d = c[c < 8]
10000 loops, best of 3: 29.8 µs per loop

So, it's about 4x faster than filter-and-sort for the "different lists" case, even using a clunky algorithm (I was looking for something I could cram onto one %timeit line, rather than the fastest or most readable…). And for the "same list over and over" case, it's almost as fast as the bisect solution even without bisecting (but of course you can bisect with NumPy, too).
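For reference, here is a sketch of both NumPy variants described above: a boolean mask for the "different array each time" case, and np.searchsorted (NumPy's analogue of bisect) for the presorted case. The function names are mine, and this is just an illustration of the approach:

import numpy as np

def clamp_unsorted(minn, maxn, arr):
    # boolean mask first, then sort only the values that survive the filter
    return np.sort(arr[(arr >= minn) & (arr <= maxn)])

def clamp_presorted(minn, maxn, sorted_arr):
    # sorted_arr must already be sorted ascending
    lo = np.searchsorted(sorted_arr, minn, side='left')
    hi = np.searchsorted(sorted_arr, maxn, side='right')
    return sorted_arr[lo:hi]

flist = np.random.random(5000) * 10
print(clamp_unsorted(3, 8, flist))
b = np.sort(flist)
print(clamp_presorted(3, 8, b))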