Why does multi-processing slow down a nested for loop?

I have a lot of very large matrices AFeatures that I am comparing against some other very large matrices BFeatures, both of which have a shape of (878, 2, 4, 15, 17, 512), using the Euclidean distance. I am trying to parallelise this process to speed up the comparison. I am using Python 3 in a Conda environment and my original code uses an average of two CPU cores at 100%:

    import numpy as np

    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    # Euclidean distance between the two (15, 17, 512) slices
                    per_slice_comparisons[i, j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

I have tried two approaches for speeding up the code.

  1. Using multi-processing

    from multiprocessing import Pool

    def fill_array(i):
        # Each worker computes the comparisons of AFeatures[i] against all BFeatures.
        comparisons = np.zeros(shape=(878, 2, 4))

        for j in range(878):
            for k in range(2):
                for l in range(4):
                    comparisons[j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

        return comparisons

    pool = Pool(processes=6)
    list_start_vals = range(878)
    per_slice_comparisons = np.array(pool.map(fill_array, list_start_vals))
    pool.close()
    

This approach increases the run time by around 5%, even though all 8 CPU cores are now being used at 100%. I have tried different numbers of processes; the more processes there are, the slower it gets.

  2. This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

     import os
     os.environ['NUMEXPR_MAX_THREADS'] = '8'
     os.environ['NUMEXPR_NUM_THREADS'] = '4'
     import numexpr as ne

     def linalg_norm(a):
         sq_norm = ne.evaluate('sum(a**2)')
         return ne.evaluate('sqrt(sq_norm)')

     per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

     for i in range(878):
         for j in range(878):
             for k in range(2):
                 for l in range(4):
                     per_slice_comparisons[i, j, k, l] = linalg_norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])
    

However, for a nested for loop this approach increases total execution time by a factor of 3. I don't understand why simply putting this operation inside a nested for loop decreases performance so dramatically. If anyone has any ideas on how to fix this, I would really appreciate it!

There are 2 answers

Jérôme Richard (best answer)

Why does multi-processing slow down a nested for loop in python?

Creating a process is a very expensive system operation. The operating system has to remap a lot of pages (program, shared libraries, data, etc.) so that the newly created processes can access those of the initial process. The multiprocessing package also uses inter-process communication to share the work between processes, which is slow too, not to mention the required final join operation. To be efficient (i.e. to reduce the overheads as much as possible), a Python program using the multiprocessing package should share a small amount of data and perform expensive computations. In your case, you do not need the multiprocessing package since you only use Numpy arrays (see later).

This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

Numexpr uses threads rather than processes, and threads are lightweight compared to processes (i.e. less expensive). Numexpr also applies aggressive optimizations to speed up the evaluated expression as much as possible (something CPython does not do).

I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically?

The default implementation of Python is CPython, which is an interpreter. Interpreters are generally very slow (especially CPython), and CPython performs almost no optimization of your code. If you want fast loops, then you need alternatives that compile them to native code or JIT them. You can use Cython or Numba for that. Both provide simple ways to parallelize your program. Using Numba is probably the simplest solution in your case. You can start by looking at the example programs.
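
For illustration, here is a minimal Numba sketch of the nested loop above (not from the original answer; the function name all_pairs_norms, the up-front reshape and the float64 output are my own assumptions):

    import numpy as np
    from numba import njit, prange

    # Flatten each (15, 17, 512) slice to a 1-D vector up front (assumption).
    A2 = AFeatures.reshape(878, 2, 4, -1)
    B2 = BFeatures.reshape(878, 2, 4, -1)

    @njit(parallel=True)
    def all_pairs_norms(A, B):
        n, kd, ld, d = A.shape
        out = np.zeros((n, n, kd, ld))
        for i in prange(n):              # the outer loop is spread over all cores
            for j in range(n):
                for k in range(kd):
                    for l in range(ld):
                        acc = 0.0
                        for m in range(d):
                            diff = A[i, k, l, m] - B[j, k, l, m]
                            acc += diff * diff
                        out[i, j, k, l] = np.sqrt(acc)
        return out

    per_slice_comparisons = all_pairs_norms(A2, B2)

Note that Numba compiles the function on its first call, so the second call is the one to time.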


Update: if the Numpy implementation is multithreaded, then the multiprocessing code can be much slower. Indeed, each process will create N threads on a machine with N cores. Consequently, N*N threads will be run. This situation is called over-subscription and is known to be inefficient (due to preemptive multitasking and especially context switches). One way to check this hypothesis is simply to look at how many threads are created (e.g. using the hwloc tool on POSIX systems) or to monitor the processor usage.
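
One way to test this hypothesis directly from Python (my own sketch, not part of the original answer, assuming the threadpoolctl package is installed) is to inspect the BLAS thread pools NumPy uses and pin each worker process to a single BLAS thread:

    import numpy as np
    from multiprocessing import Pool
    from threadpoolctl import threadpool_info, threadpool_limits

    # Show which BLAS/OpenMP pools NumPy is linked against and their thread counts.
    print(threadpool_info())

    def limit_blas_threads():
        # Runs once per worker: one BLAS thread per process, so 6 workers do not
        # spawn 6 * N threads on an N-core machine.
        threadpool_limits(limits=1)

    # Reuse fill_array from the question, but with single-threaded workers.
    pool = Pool(processes=6, initializer=limit_blas_threads)
    per_slice_comparisons = np.array(pool.map(fill_array, range(878)))
    pool.close()

If the multiprocessing version becomes faster with this limit in place, over-subscription was indeed the problem.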

William Smith

Just a quick update from me on this issue. I found that when calculating the Euclidean distance between different high-dimensional vectors, I got the best results using NumPy within Anaconda. Using multiprocessing on top of that did not achieve any significant improvement.

However, I later found the recent Faiss library via a code example (https://github.com/QVPR/Patch-NetVLAD). Faiss (https://anaconda.org/pytorch/faiss-gpu) is a library that is used for clustering and calculating the distance between different vectors and can be used to calculate both the cosine and Euclidean distance. The speed up that can be achieved with this library is, to put it simply, gigantic - well in excess of a factor of 100 speed up for comparing large amounts of highly dimensional matrixes. It has been a total game changer for my research and I would highly recommend it, particularly for comparing large neural network descrciptors.