I want to use concurrent.futures together with numpy to manipulate two scipy.sparse matrices:
import numpy as np
import scipy.sparse

matrix_A = scipy.sparse.lil_matrix((1000, 1000), dtype=np.float32)
matrix_B = scipy.sparse.lil_matrix((500, 1000), dtype=np.float32)
The algorithm works like this: every row in matrix_B has a one-to-many relationship to rows in matrix_A. For every row_B in matrix_B, I find its corresponding rows [row_A1, row_A2, ..., row_An] in matrix_A, sum them up, and assign the sum to row_B.
def update_values(row):
    # Look up the rows of matrix_A that belong to this row of matrix_B
    # and write the resulting values into matrix_B in place.
    indices, values = find_rows_in_matrix_A(row)
    matrix_B[row, indices] = values
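
For concreteness, here is a minimal sketch of what find_rows_in_matrix_A could look like, assuming the row_B-to-rows_A relationship is available as a precomputed list of index lists (the mapping variable is hypothetical; the question does not show how that relationship is stored):

def find_rows_in_matrix_A(row):
    rows_A = mapping[row]                     # hypothetical: row indices of matrix_A for this row_B
    summed = matrix_A[rows_A, :].sum(axis=0)  # sum the corresponding rows (1 x n_cols dense result)
    summed = np.asarray(summed).ravel()
    indices = np.nonzero(summed)[0]           # keep only the nonzero columns
    return indices, summed[indices]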
The matrices are large (10^7 rows), and I'd like to run this operation in parallel:
with concurrent.futures.ProcessPoolExecutor(max_workers=32) as executor:
    futures = {row: executor.submit(update_values, row)
               for row in range(matrix_B.shape[0])}
But this doesn't work because changes made by child processes to global variables will be invisible to the parent process (as mentioned in this answer).
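
To illustrate the issue with a toy example (the names here are illustrative, not part of the actual code): each worker process operates on its own copy of the module-level state, so its writes never reach the parent.

import concurrent.futures
import numpy as np

shared = np.zeros(4)  # stands in for matrix_B

def worker(i):
    shared[i] = 1.0   # modifies only the copy inside the child process

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as ex:
        list(ex.map(worker, range(4)))
    print(shared)  # still all zeros in the parent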
Another option would be to return the values from update_values, but that would require merging the results in the parent process, which takes too long for my use case.
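
A sketch of that return-and-merge variant, assuming matrix_B and find_rows_in_matrix_A from above are importable by the worker processes (the chunksize value is just illustrative):

def compute_row(row):
    indices, values = find_rows_in_matrix_A(row)
    return row, indices, values

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=32) as executor:
        results = executor.map(compute_row, range(matrix_B.shape[0]), chunksize=1024)
        for row, indices, values in results:
            matrix_B[row, indices] = values  # the serial merge step that is too slow here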
Using multiprocessing.Manager.Array could be a solution, but that would create copies of the matrices at every write, and given their size, that's not an option.
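
For reference, a minimal sketch of what the Manager route looks like: the data lives in a separate manager process, every element access from a worker is a proxied round trip, and Manager.Array only holds flat dense buffers rather than scipy.sparse objects.

from multiprocessing import Manager, Process

def fill(shared_row):
    shared_row[0] = 42.0  # each assignment is shipped to the manager process

if __name__ == "__main__":
    with Manager() as manager:
        shared_row = manager.Array('f', [0.0] * 1000)  # one dense row held by the manager
        p = Process(target=fill, args=(shared_row,))
        p.start()
        p.join()
        print(shared_row[0])  # 42.0 -- visible to the parent, but only via IPC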
Is there any way to make matrix_B writable from the child processes? Or what would be a better approach to this problem?