Why does multi-processing slow down a nested for loop?

I have a lot of very large matrices AFeatures that I am comparing against some other very large matrices BFeatures, both of which have a shape of (878, 2, 4, 15, 17, 512), using the Euclidean distance. I am trying to parallelise this process to speed up the comparison. I am using Python 3 in a Conda environment and my original code uses an average of two CPU cores at 100%:

    import numpy as np

    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    # Euclidean distance between the two (15, 17, 512) slices
                    per_slice_comparisons[i, j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

I have tried two approaches for speeding up the code.

  1. Using multi-processing

    from multiprocessing import Pool

    def fill_array(i):
        # Each worker computes the comparisons of AFeatures[i] against all BFeatures.
        comparisons = np.zeros(shape=(878, 2, 4))

        for j in range(878):
            for k in range(2):
                for l in range(4):
                    comparisons[j, k, l] = np.linalg.norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

        return comparisons

    pool = Pool(processes=6)
    list_start_vals = range(878)
    per_slice_comparisons = np.array(pool.map(fill_array, list_start_vals))
    pool.close()
    

This approach increases the run time by around 5%, even though all 8 CPU cores are now being used at 100%. I have tried different numbers of processes; the more processes there are, the slower it gets.

  2. This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

     import os
     os.environ['NUMEXPR_MAX_THREADS'] = '8'
     os.environ['NUMEXPR_NUM_THREADS'] = '4'
     import numexpr as ne

     def linalg_norm(a):
         sq_norm = ne.evaluate('sum(a**2)')
         return ne.evaluate('sqrt(sq_norm)')

     per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

     for i in range(878):
         for j in range(878):
             for k in range(2):
                 for l in range(4):
                     per_slice_comparisons[i, j, k, l] = linalg_norm(AFeatures[i, k, l, :] - BFeatures[j, k, l, :])
    

However, for a nested for loop this approach increases total execution time by a factor of 3. I don't understand why simply putting this operation inside a nested for loop decreases performance so dramatically. If anyone has any ideas on how to fix this, I would really appreciate it!

There are 2 answers

Jérôme Richard (best answer)

Why does multi-processing slow down a nested for loop in python?

Creating a process is a very expensive system operation. The operating system has to remap a lot of pages (program, shared libraries, data, etc.) so that the newly created processes can access those of the initial process. The multiprocessing package also uses inter-process communication to share the work between processes, which is slow too, not to mention the required final join operation. To be efficient (i.e. to reduce the overheads as much as possible), a Python program using the multiprocessing package should share a small amount of data and perform expensive computations. In your case, you do not need the multiprocessing package since you only use Numpy arrays (see later).

This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

Numexpr uses threads rather than processes, and threads are lightweight compared to processes (i.e. less expensive). Numexpr also applies aggressive optimizations to speed up the evaluated expression as much as possible (something CPython does not do).

I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically?

The default implementation of Python is CPython, which is an interpreter. Interpreters are generally very slow (especially CPython), and CPython performs almost no optimization of your code. If you want fast loops, then you need alternatives that compile them to native code or JIT them. You can use Cython or Numba for that. Both provide simple ways to parallelize your program. Using Numba is probably the simplest solution in your case. You can start by looking at the example programs.
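
For illustration, here is a minimal Numba sketch of the nested loop above (not from the original answer; the function name all_pairs_norms, the up-front reshape and the float64 output are my own assumptions):

    import numpy as np
    from numba import njit, prange

    # Flatten each (15, 17, 512) slice to a 1-D vector up front (assumption).
    A2 = AFeatures.reshape(878, 2, 4, -1)
    B2 = BFeatures.reshape(878, 2, 4, -1)

    @njit(parallel=True)
    def all_pairs_norms(A, B):
        n, kd, ld, d = A.shape
        out = np.zeros((n, n, kd, ld))
        for i in prange(n):              # the outer loop is spread over all cores
            for j in range(n):
                for k in range(kd):
                    for l in range(ld):
                        acc = 0.0
                        for m in range(d):
                            diff = A[i, k, l, m] - B[j, k, l, m]
                            acc += diff * diff
                        out[i, j, k, l] = np.sqrt(acc)
        return out

    per_slice_comparisons = all_pairs_norms(A2, B2)

Note that Numba compiles the function on its first call, so the second call is the one to time.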


Update: if the Numpy implementation is multithreaded, then the multiprocessing code can be much slower. Indeed, each process will create N threads on a machine with N cores. Consequently, N*N threads will be run. This situation is called over-subscription and is known to be inefficient (due to preemptive multitasking and especially context switches). One way to check this hypothesis is simply to look at how many threads are created (e.g. using the hwloc tool on POSIX systems) or to monitor the processor usage.
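
One way to test this hypothesis directly from Python (my own sketch, not part of the original answer, assuming the threadpoolctl package is installed) is to inspect the BLAS thread pools NumPy uses and pin each worker process to a single BLAS thread:

    import numpy as np
    from multiprocessing import Pool
    from threadpoolctl import threadpool_info, threadpool_limits

    # Show which BLAS/OpenMP pools NumPy is linked against and their thread counts.
    print(threadpool_info())

    def limit_blas_threads():
        # Runs once per worker: one BLAS thread per process, so 6 workers do not
        # spawn 6 * N threads on an N-core machine.
        threadpool_limits(limits=1)

    # Reuse fill_array from the question, but with single-threaded workers.
    pool = Pool(processes=6, initializer=limit_blas_threads)
    per_slice_comparisons = np.array(pool.map(fill_array, range(878)))
    pool.close()

If the multiprocessing version becomes faster with this limit in place, over-subscription was indeed the problem.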

William Smith

Just a quick update from me on this issue. I found that when calculating the Euclidean distance between different high-dimensional vectors, I got the best results using NumPy within Anaconda. Using multiprocessing on top of that did not achieve any significant improvement.

However, I later found the recent Faiss library via a code example (https://github.com/QVPR/Patch-NetVLAD). Faiss (https://anaconda.org/pytorch/faiss-gpu) is a library that is used for clustering and calculating the distance between different vectors and can be used to calculate both the cosine and Euclidean distance. The speed up that can be achieved with this library is, to put it simply, gigantic - well in excess of a factor of 100 speed up for comparing large amounts of highly dimensional matrixes. It has been a total game changer for my research and I would highly recommend it, particularly for comparing large neural network descrciptors.