The task by example:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
Expected result (all bin-0 items first, then bin 1, then bin 2, each keeping its original order):
binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])
Constraints:
Should be fast.
Should be O(n+k), where n is the length of data and k is the number of bins.
Should be stable, i.e. order within bins is preserved.
Obvious solution
data[np.argsort(idx, kind='stable')]
is O(n log n).
O(n+k) solution
import numpy as np

def sort_to_bins(idx, data, mx=-1):
    if mx == -1:
        mx = idx.max() + 1
    # count the bin sizes, shifted one slot to the right ...
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    # ... so that after the running sum, cnts[j] is the offset at which
    # bin j starts in the output
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    # scan the data left to right, placing each item at the next free
    # slot of its bin; the left-to-right scan is what makes this stable
    res = np.empty_like(data)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res
is loopy and slow.
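Plugging the example arrays from the top into this loopy version reproduces the expected output:

print(sort_to_bins(idx, data))
# [2 6 3 4 7 8 1 5 9]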
Is there a better method in pure numpy < scipy < pandas < numba/pythran?
Here are a few solutions:
1. Use np.argsort anyway; after all, it is fast compiled code.
2. Use np.bincount to get the bin sizes and np.argpartition, which is O(n) for a fixed number of bins. Downside: currently, no stable algorithm is available, so we have to sort each bin afterwards (first sketch below).
3. Use scipy.ndimage.measurements.labeled_comprehension. This does roughly what is required, but no idea how it is implemented.
4. Use pandas. I'm a complete pandas noob, so what I cobbled together here using groupby may be suboptimal (second sketch below).
5. Use scipy.sparse: switching between the compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for (third sketch below).
6. Use pythran (I'm sure numba works as well) on the loopy code in the question. All that is required is to insert an export directive at the top after the numpy import and then compile (last sketch below).
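For item 2, a sketch of the bincount/argpartition route (the function name is mine); note the per-bin sorts of the positions, which restore stability:

import numpy as np

def sort_to_bins_argpartition(idx, data, mx=-1):
    if mx == -1:
        mx = idx.max() + 1
    # bin sizes -> positions where each bin ends in the grouped output
    bnds = np.bincount(idx, minlength=mx).cumsum()
    # argpartition groups positions bin by bin in O(n) for a fixed
    # number of bins, but it is not stable ...
    order = np.argpartition(idx, bnds[:-1])
    # ... so sort the original positions within each bin to restore
    # the original order
    for lo, hi in zip(np.r_[0, bnds[:-1]], bnds):
        order[lo:hi].sort()
    return data[order]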
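For item 4, one way to spell the groupby version (a sketch, quite possibly suboptimal): groupby sorts the group keys and preserves the original order within each group, which is exactly the stable binning wanted here.

import numpy as np
import pandas as pd

def sort_to_bins_pandas(idx, data):
    # groups are iterated in sorted key order; order within each
    # group is the original order, so the result is stable
    return np.concatenate([g.values for _, g in pd.Series(data).groupby(idx)])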
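For item 5, a sketch of the scipy.sparse trick (the function name is mine): build an n x k CSR matrix with one entry per row and let the format conversion do the work.

import numpy as np
from scipy import sparse

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx == -1:
        mx = idx.max() + 1
    n = len(data)
    # CSR matrix with one entry per row: row i holds data[i] in column idx[i]
    aux = sparse.csr_matrix((data, idx, np.arange(n + 1)), (n, mx))
    # CSR -> CSC regroups the entries column by column while preserving
    # row order within each column, i.e. a stable O(n+k) binning
    return aux.tocsc().data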
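For item 6, the pythran setup could look like this (a sketch: the file name and the export signature, int indices and float data, are my assumptions; adjust them to your dtypes):

# stb_pythran.py -- the loopy code from the question plus an export directive
#pythran export sort_to_bins(int[:], float[:], int)
import numpy as np

def sort_to_bins(idx, data, mx=-1):
    if mx == -1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(data)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res

Compiling with pythran stb_pythran.py produces a native module that is imported in place of the pure-Python version.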
Benchmarks (100 bins, variable number of items):
Take home:
If you are ok with numba/pythran, that is the way to go; if not, scipy.sparse scales rather well.
Code: