There are two NumPy arrays, `groups` and `selectors`, where `selectors` is an array containing integers that need to be grouped:
```python
import numpy as np

np.random.seed(0)
selectors = np.random.randint(0, 300, 5)
# [172  47 117 192 251]
```
`groups` is a structured array mapping the first index (int) of each group to its label (str):
```python
# Generate groups 'a' to 't' and their first index
start = ord('a')
groups = []
for i in range(20):
    e = (i*i, chr(start+i))
    groups.append(e)
groups = np.array(groups, dtype=[('index', np.uint32), ('selector', '|U1')])
groups = np.sort(groups, order='index')
# [(  0, 'a') (  1, 'b') (  4, 'c') (  9, 'd') ( 16, 'e') ( 25, 'f')
#  ( 36, 'g') ( 49, 'h') ( 64, 'i') ( 81, 'j') (100, 'k') (121, 'l')
#  (144, 'm') (169, 'n') (196, 'o') (225, 'p') (256, 'q') (289, 'r')
#  (324, 's') (361, 't')]
```
Given these example arrays, the desired result after grouping is a dictionary of `np.ndarray`s/lists:

```python
{
    "g": [47],
    "k": [117],
    "n": [172, 192],
    "p": [251]
}
```
Is there a quicker way to perform this grouping in NumPy than nesting two loops, as shown below? This would be useful for large `selectors` arrays with 10-100 million rows against a `groups` array with 100-1000 rows.
Using Nested Loops
```python
results = {}
for s in selectors:
    for i in range(len(groups)-1):
        if s >= groups[i][0] and s < groups[i+1][0]:
            j = i
            break
    else:
        j = i + 1
    try:
        results[groups[j][1]].append(s)
    except KeyError:
        results[groups[j][1]] = [s]

print(results)
# {'n': [172, 192], 'g': [47], 'k': [117], 'p': [251]}
```
If you use binary search on each selector, you effectively change the running time of your routine from O(len(groups) * len(selectors)) to O(log2(len(groups)) * len(selectors)). The Python documentation on the `bisect` module explains how to use it to find the right-most element less than or equal to a specified value.
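The same binary search can be applied to the whole `selectors` array in one call with `np.searchsorted`, NumPy's vectorized equivalent of `bisect`. A minimal sketch, reusing the example data from the question; `side='right'` minus one yields, for each selector, the index of the right-most group start that is less than or equal to it:

```python
import numpy as np

np.random.seed(0)
selectors = np.random.randint(0, 300, 5)  # [172  47 117 192 251]

# Group boundaries and labels, as constructed in the question
# (starts must be sorted for binary search to work)
starts = np.array([i * i for i in range(20)], dtype=np.uint32)
labels = np.array([chr(ord('a') + i) for i in range(20)])

# One vectorized binary search over all selectors at once
idx = np.searchsorted(starts, selectors, side='right') - 1

# Collect selectors under their group label
results = {}
for label, s in zip(labels[idx], selectors):
    results.setdefault(label, []).append(int(s))

print(results)
# {'n': [172, 192], 'g': [47], 'k': [117], 'p': [251]}
```

The final dictionary-building loop is still Python-level, but it runs once per selector with no inner search, so the O(log2(len(groups))) lookup cost is paid inside a single `searchsorted` call.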