I have a huge .csv file (~2GB) that I import in my program with read_csv and then convert to a NumPy matrix with as_matrix. The generated matrix has the same form as data_mat in the example below. My problem is that I need to extract the blocks with the same uuid4 (the entry in the first column of the matrix). The submatrices are then processed by another function. My example below does not seem to be the best way of doing this. Faster methods are welcome.
import numpy as np
data_mat = np.array([['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 4, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 1, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 6, 1, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 4, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 2, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 9, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 5, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 1],\
['d3a8a9d0-4380-42e3-b35f-733a9f9770da', 3, 6, 10]],dtype=object)
unique_ids, indices = np.unique(data_mat[:,0],return_index=True,axis=None)
length = len(data_mat)
i=0
for idd in unique_ids:
    index = indices[i]
    k=0
    while ((index+k)<length and idd == data_mat[index+k,0]):
        k+=1
    tmp_mat=data_mat[index:(index+k),:]
    # do something with tmp_mat ...
    print(tmp_mat)
    i+=1
To optimize, the idea is to minimize the computation done inside the loop. With that in mind, we rearrange the rows of the array so that they are sorted by the first column, then get the indices that mark the group boundaries. Finally, we loop over those boundaries and simply slice out each group to get one submatrix per iteration. Basic slicing returns a view rather than a copy, so it is virtually free, which should help us.
Thus, one implementation would be -
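A minimal sketch of the sort, find boundaries, slice steps described above. It is not necessarily the exact code that was posted here. The shortened ids stand in for the uuid4 strings, and the names a and cut_idx are illustrative. kind="stable" is an added choice so that rows keep their original relative order inside each group.

```python
import numpy as np

# Small stand-in for data_mat; short ids replace the uuid4 strings.
data_mat = np.array([
    ['f9', 4, 3, 1],
    ['f9', 3, 1, 1],
    ['f3', 3, 4, 1],
    ['a4', 3, 2, 1],
    ['a4', 3, 9, 0],
    ['d3', 3, 6, 10],
], dtype=object)

# Rearrange the rows so that they are sorted by the first column.
a = data_mat[data_mat[:, 0].argsort(kind="stable")]

# A boundary sits wherever the id differs from the previous row's id;
# the leading and trailing True add the start and end of the array.
cut_idx = np.flatnonzero(np.r_[True, a[1:, 0] != a[:-1, 0], True])

# Each pair of consecutive boundaries delimits one submatrix.
for i, j in zip(cut_idx[:-1], cut_idx[1:]):
    tmp_mat = a[i:j]
    # do something with tmp_mat ...
    print(tmp_mat)
```

All comparisons happen once, before the loop, so each iteration only takes a slice.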
If you are looking to store each submatrix as an array to have a list of arrays as the final output, simply do -
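One way to collect the submatrices into a list, sketched under the same assumptions as the sorting approach described earlier. The function name split_by_first_column and the shortened ids are hypothetical and used only for illustration.

```python
import numpy as np

def split_by_first_column(mat):
    """Return a list of submatrices, one per distinct value in column 0.

    Assumes mat has at least one row.
    """
    # Sort rows by the first column, keeping the row order inside each group.
    a = mat[mat[:, 0].argsort(kind="stable")]
    # Indices where a new group starts, plus the end of the array.
    cut_idx = np.flatnonzero(np.r_[True, a[1:, 0] != a[:-1, 0], True])
    return [a[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]

# Small stand-in for data_mat; short ids replace the uuid4 strings.
data_mat = np.array([
    ['f9', 4, 3, 1],
    ['f9', 3, 1, 1],
    ['f3', 3, 4, 1],
    ['a4', 3, 2, 1],
    ['a4', 3, 9, 0],
    ['d3', 3, 6, 10],
], dtype=object)

blocks = split_by_first_column(data_mat)
print([b.shape for b in blocks])
```

The blocks come back ordered by id, because the rows were sorted first.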
For sorted data_mat
For a case where data_mat already has equal uuids in contiguous rows, as shown in the sample, we could avoid sorting the entire array and directly use the first column to find the boundaries. Again, the same slices can be collected to get all those submatrices as a list of arrays.
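A sketch of this variant, assuming that rows with the same id are already adjacent. The shortened ids and the names a0, cut_idx and blocks are illustrative, not taken from the original post.

```python
import numpy as np

# Small stand-in for data_mat, already grouped by id as in the sample.
data_mat = np.array([
    ['f9', 4, 3, 1],
    ['f9', 3, 1, 1],
    ['f3', 3, 4, 1],
    ['a4', 3, 2, 1],
    ['a4', 3, 9, 0],
    ['d3', 3, 6, 10],
], dtype=object)

# No sorting: compare each id in the first column with its predecessor.
a0 = data_mat[:, 0]
cut_idx = np.flatnonzero(np.r_[True, a0[1:] != a0[:-1], True])

# Loop version: each slice is a view into data_mat, not a copy.
for i, j in zip(cut_idx[:-1], cut_idx[1:]):
    tmp_mat = data_mat[i:j]
    # do something with tmp_mat ...
    print(tmp_mat)

# List version: all submatrices at once.
blocks = [data_mat[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
```

If the same id could appear in two separate runs of rows, this variant would return two blocks for it, so it relies on the grouping assumption.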
Note that the submatrices we get this way come out in the order in which they appear in data_mat, which differs from the uuid-sorted order produced by the sorting in the previous approach.
Benchmarking for sorted data_mat
Approaches -
Timings -
In the sample, the longest submatrix has 5 rows. So, let's extend to a bigger case, keeping the same pattern -
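A hedged sketch of how such a bigger case could be built and timed. The helper names make_grouped, loop_with_unique and slice_at_boundaries, the group count, and the cap on block length are all assumptions made for illustration. No timings are claimed here, since they depend on the machine.

```python
import timeit
import uuid

import numpy as np


def make_grouped(n_groups, max_len=5, seed=0):
    """Build an object array whose rows are grouped by a uuid4 in column 0."""
    rng = np.random.default_rng(seed)
    lens = rng.integers(1, max_len + 1, size=n_groups)  # rows per group
    ids = np.array([str(uuid.uuid4()) for _ in range(n_groups)], dtype=object)
    out = np.empty((int(lens.sum()), 4), dtype=object)
    out[:, 0] = np.repeat(ids, lens)
    out[:, 1:] = rng.integers(0, 10, size=(out.shape[0], 3))
    return out


def loop_with_unique(mat):
    """The question's approach: np.unique, then scan forward from each start."""
    unique_ids, indices = np.unique(mat[:, 0], return_index=True)
    n = len(mat)
    blocks = []
    for idd, index in zip(unique_ids, indices):
        k = 0
        while index + k < n and idd == mat[index + k, 0]:
            k += 1
        blocks.append(mat[index:index + k])
    return blocks


def slice_at_boundaries(mat):
    """Boundary-based slicing for an array that is already grouped by id."""
    a0 = mat[:, 0]
    cut_idx = np.flatnonzero(np.r_[True, a0[1:] != a0[:-1], True])
    return [mat[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]


big = make_grouped(n_groups=2000)
for f in (loop_with_unique, slice_at_boundaries):
    t = timeit.timeit(lambda: f(big), number=5)
    print(f"{f.__name__}: {t:.4f} s for 5 runs")
```

Both functions return the same blocks, only in a different order, so the comparison measures the grouping strategy rather than any extra work.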