Fast way to extract submatrix from numpy matrix


I have a huge .csv file (~2 GB) that I import into my program with read_csv and then convert to a NumPy matrix with as_matrix. The generated matrix has the same form as data_mat in the example below. My problem is that I need to extract the blocks that share the same uuid4 (the entry in the first column of the matrix). The submatrices are then processed by another function. My example below seems not to be the best way of doing this; faster methods are welcome.

import numpy as np
data_mat = np.array([['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 4, 3, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 1, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 3, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 6, 1, 1],
                     ['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 4, 1],
                     ['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 1, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 2, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 9, 0],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 0],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 5, 1, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 1],
                     ['d3a8a9d0-4380-42e3-b35f-733a9f9770da', 3, 6, 10]], dtype=object)

# index of the first occurrence of each unique id (ids are assumed grouped)
unique_ids, indices = np.unique(data_mat[:, 0], return_index=True, axis=None)
length = len(data_mat)
i = 0
for idd in unique_ids:
    index = indices[i]
    k = 0
    # walk forward while the id stays the same to find the block length
    while (index + k) < length and idd == data_mat[index + k, 0]:
        k += 1
    tmp_mat = data_mat[index:(index + k), :]
    # do something with tmp_mat ...
    print(tmp_mat)
    i += 1
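(For comparison, here is a pure-stdlib sketch of the same grouping, not taken from any of the answers: itertools.groupby only merges consecutive rows, so like the loop above it assumes equal ids are contiguous, as they are in the sample.)

```python
import numpy as np
from itertools import groupby

# small sample with the same structure: ids in column 0, already grouped
sample = np.array([['a', 1, 2],
                   ['a', 3, 4],
                   ['b', 5, 6],
                   ['c', 7, 8],
                   ['c', 9, 0]], dtype=object)

# groupby only merges consecutive equal keys, so this relies on
# rows with the same id being contiguous
blocks = [np.array(list(rows)) for _, rows in groupby(sample, key=lambda r: r[0])]
```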

There are 3 answers

Divakar (accepted answer)

The idea for optimizing is to minimize the computation done inside the loop. With that in mind, we first rearrange the rows of the array so they are sorted by the first column. Then we get the indices that define the group boundaries. Finally, we loop and simply slice out each group to get a submatrix per iteration. Slicing is virtually free when working with arrays, so that should help us.

Thus, one implementation would be -

a0 = data_mat[:,0]
sidx = a0.argsort()
sd = data_mat[sidx] # sorted data_mat
idx = np.flatnonzero(np.concatenate(( [True], sd[1:,0] != sd[:-1,0], [True] )))
for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = sd[i:j]
    print(tmp_mat)

If you are looking to store each submatrix as an array to have a list of arrays as the final output, simply do -

[sd[i:j] for i,j in zip(idx[:-1], idx[1:])]
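As a quick self-contained check of this approach (small integer keys are used here instead of uuid strings, and a stable mergesort is chosen so rows within a group keep their original relative order, which the answer's default argsort does not guarantee):

```python
import numpy as np

data_mat = np.array([[2, 5],
                     [1, 6],
                     [1, 7],
                     [2, 8]])

a0 = data_mat[:, 0]
sidx = a0.argsort(kind='mergesort')  # stable: preserves within-group row order
sd = data_mat[sidx]                  # rows sorted by the first column
# True at every position where a new group starts, plus both ends
idx = np.flatnonzero(np.concatenate(([True], sd[1:, 0] != sd[:-1, 0], [True])))
blocks = [sd[i:j] for i, j in zip(idx[:-1], idx[1:])]
```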

For sorted data_mat

For a case with data_mat already being sorted as shown in the sample, we could avoid sorting the entire array and directly use the first column, like so -

a0 = data_mat[:,0]
idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))

for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = data_mat[i:j]
    print(tmp_mat)

Again, to get all those submatrices as a list of arrays, use -

[data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]

Note that the submatrices that we would get with this one would be in a different order than with the sorting done in the previous approach.


Benchmarking for sorted data_mat

Approaches -

# @Daniel F's soln-2
def split_app(data_mat):
    idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
    return np.split(data_mat, idx)

# Proposed in this post
def zip_app(data_mat):
    a0 = data_mat[:,0]
    idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))
    return [data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]

Timings -

The sample only had a handful of rows per group. So, let's extend to a bigger case with around 6 rows per group -

In [442]: a = np.random.randint(0,100000,(6*100000,4)); a[:,0].sort()

In [443]: %timeit split_app(a)
10 loops, best of 3: 88.8 ms per loop

In [444]: %timeit zip_app(a)
10 loops, best of 3: 40.2 ms per loop

In [445]: a = np.random.randint(0,1000000,(6*1000000,4)); a[:,0].sort()

In [446]: %timeit split_app(a)
1 loop, best of 3: 917 ms per loop

In [447]: %timeit zip_app(a)
1 loop, best of 3: 414 ms per loop
Daniel F

You can do this with boolean indexing.

unique_ids = np.unique(data_mat[:, 0])
masks = np.equal.outer(unique_ids, data_mat[:, 0])
for mask in masks:
    tmp_mat = data_mat[mask]
    # do something with tmp_mat ...
    print(tmp_mat)
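One thing worth noting about this approach: np.equal.outer materializes a boolean matrix of shape (n_unique, n_rows), so with many distinct ids on a ~2 GB input it can get memory-hungry. A tiny demo of the mask layout (integer keys used here just to keep the example simple):

```python
import numpy as np

data = np.array([[1, 10],
                 [1, 11],
                 [2, 12]])

unique_ids = np.unique(data[:, 0])
# one boolean row per unique id, one column per data row
masks = np.equal.outer(unique_ids, data[:, 0])
# masks.shape is (2, 3); each True marks a row belonging to that id
```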
Daniel F

If the unique ids are already grouped, you can do this with np.split, similar to @Divakar's approach:

idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
for tmp_mat in np.split(data_mat, idx):
    # do something with tmp_mat ...
    print(tmp_mat)
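For reference, the same split on a tiny self-contained grouped array (integer ids standing in for the uuid strings):

```python
import numpy as np

# ids in column 0 are already contiguous
data_mat = np.array([[1, 10],
                     [1, 11],
                     [2, 12],
                     [3, 13],
                     [3, 14]])

# positions where the id changes between consecutive rows
idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
blocks = np.split(data_mat, idx)
```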