python pathos - main process runs very slow and sub-processes run serially


I need to speed up the processing of tasks that involve rather large data sets, loaded from pickle files of up to 1.5GB containing CSV data. I started with Python's multiprocessing, but for unpickleable class objects I had to switch to pathos. I got some parallel code running and it reproduces the results of the usual serial runs: so far so good. But the processing speed is far from useful: the central main process runs at full throttle while the actual sub-processes (sometimes hundreds in total) run serially rather than in parallel, separated by huge gaps of time during which the main process is again running flat out - for what? It worked 'best' with ProcessPool, and apipe and amap behaved alike, without a noticeable difference.

Below is an excerpt of my code, first the parallel attempt, then the serial portion. Both give the same result, but the parallel approach is much slower. Importantly, each parallel sub-process takes roughly the same amount of time as the corresponding iteration of the serial loop. All variables are preloaded much earlier in a long processing pipeline.

#### PARALLEL multiprocessing
if me.MP:

    import pathos
    cpuN     = pathos.multiprocessing.cpu_count() - 1
    pool     = pathos.pools.ProcessPool( cpuN)  # alternatives: ThreadPool, ParallelPool

    argsIn1  = []  # will hold repeated refs to me (a mid-large complex dictionary)
    argsIn2  = []  # will hold repeated refs to Db (a very large complex dictionary, a CSV pickle of 400MB or more)
    argsIn3  = []  # will hold repeated refs to UID_lst (a list of strings)
    argsIn4  = []  # one brief string per task
    for Uscr in UID_lst:  # build one argument set per task; amap maps these element-wise
        argsIn1.append( me)
        argsIn2.append( Db)
        argsIn3.append( UID_lst)
        argsIn4.append( Uscr)

    result_pool  = pool.amap( functionX, argsIn1, argsIn2, argsIn3, argsIn4)

    results      = result_pool.get()

    for result in results:
        [ meTU, U]  = result
        me.v[T][U]  = meTU[U]   # insert result !

#### SERIAL processing
else:

    for Uscr in UID_lst:

        meTU, U     = functionX( me, Db, UID_lst, Uscr)
        me.v[T][U]  = meTU[U]   # insert result !


I tested this code on two Linux machines, an i3 box (32GB RAM, Slackware 14.2, Python 3.7) and a dual-Xeon box (64GB RAM, Slackware current, Python 3.8). pathos 0.2.6 was installed with pip3. As said, both machines show the same speed problem with the code shown here.

What am I missing here?

ADDENDUM: it appears that only the very first PID does the entire job, working through all items in UID_lst, while the other 10 sub-processes sit idle, waiting for nothing, as seen with top and with os.getpid(). cpuN is 11 in this example.
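
For reference, the os.getpid() check can be done with a small wrapper around the worker; this is a minimal sketch (functionX_traced is just a placeholder name, functionX and its arguments are those from the code above):

import os, time

def functionX_traced( me, Db, UID_lst, Uscr):
    # hypothetical tracing wrapper: report which PID runs this task and how long it takes
    t0 = time.time()
    print( f"task {Uscr}: started on PID {os.getpid()}", flush=True)
    result = functionX( me, Db, UID_lst, Uscr)
    print( f"task {Uscr}: finished on PID {os.getpid()} after {time.time()-t0:.1f}s", flush=True)
    return result

# then hand the wrapper to the pool instead of functionX:
# result_pool = pool.amap( functionX_traced, argsIn1, argsIn2, argsIn3, argsIn4)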

ADDENDUM 2: sorry about this new revision, but running this code under different loads (many more jobs to solve) did eventually get more than just one sub-process busy - but only after a long time! Here is a top output:

top - 14:09:28 up 19 days,  4:04,  3 users,  load average: 6.75, 6.20, 5.08
Tasks: 243 total,   6 running, 236 sleeping,   0 stopped,   1 zombie
%Cpu(s): 48.8 us,  1.2 sy,  0.0 ni, 49.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64061.6 total,   2873.6 free,  33490.9 used,  27697.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29752.0 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5441 userx     20   0 6597672   4.9g  63372 S 100.3   7.9  40:29.08 python3 do_Db_job
 5475 userx     20   0 6252176   4.7g   8828 R 100.0   7.5   9:24.46 python3 do_Db_job
 5473 userx     20   0 6260616   4.7g   8828 R 100.0   7.6  17:02.44 python3 do_Db_job
 5476 userx     20   0 6252432   4.7g   8828 R 100.0   7.5   5:37.52 python3 do_Db_job
 5477 userx     20   0 6252432   4.7g   8812 R 100.0   7.5   1:48.18 python3 do_Db_job
 5474 userx     20   0 6253008   4.7g   8828 R  99.7   7.5  13:13.38 python3 do_Db_job
 1353 userx     20   0    9412   4128   3376 S   0.0   0.0   0:59.63 sshd: userx@pts/0
 1354 userx     20   0    7960   4692   3360 S   0.0   0.0   0:00.20 -bash
 1369 userx     20   0    9780   4212   3040 S   0.0   0.0  31:16.80 sshd: userx@pts/1
 1370 userx     20   0    7940   4632   3324 S   0.0   0.0   0:00.16 -bash
 4545 userx     20   0    5016   3364   2296 R   0.0   0.0   3:01.76 top
 5437 userx     20   0   19920  13280   6544 S   0.0   0.0   0:00.07 python3
 5467 userx     20   0       0      0      0 Z   0.0   0.0   0:00.00 [git] <defunct>
 5468 userx     20   0 3911460   2.5g   9148 S   0.0   4.0  17:48.90 python3 do_Db_job
 5469 userx     20   0 3904568   2.5g   9148 S   0.0   4.0  16:13.19 python3 do_Db_job
 5470 userx     20   0 3905408   2.5g   9148 S   0.0   4.0  16:34.32 python3 do_Db_job
 5471 userx     20   0 3865764   2.4g   9148 S   0.0   3.9  18:35.39 python3 do_Db_job
 5472 userx     20   0 3872140   2.5g   9148 S   0.0   3.9  20:43.44 python3 do_Db_job
 5478 userx     20   0 3844492   2.4g   4252 S   0.0   3.9   0:00.00 python3 do_Db_job
27052 userx     20   0    9412   3784   3052 S   0.0   0.0   0:00.02 sshd: userx@pts/2
27054 userx     20   0    7932   4520   3224 S   0.0   0.0   0:00.01 -bash

It appears to me that at most 6 sub-processes run at any given time, which may correspond to psutil.cpu_count(logical=False) = 6 rather than to pathos.multiprocessing.cpu_count() = 12...?
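
For reference, this is how the two counts compare on this box (a minimal sketch; the numbers in the comments are the values quoted above):

import psutil
import pathos

print( pathos.multiprocessing.cpu_count())    # logical CPUs incl. hyper-threads -> 12 here
print( psutil.cpu_count(logical=True))        # same count via psutil            -> 12
print( psutil.cpu_count(logical=False))       # physical cores only              -> 6 here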

1 Answer

Answer by pisti:

Actually, the problem got solved - and it turns out it never really was one at this development stage of my code. The problem was elsewhere: the variables the worker processes are supplied with are very large, many gigabytes at times. That scenario keeps the main/central process busy forever dilling/undilling the arguments (pathos' analogue of pickling/unpickling), even on the new dual-Xeon machine, not to speak of the old i3 box. On the former I saw up to 6 or 7 workers active (out of 11), while the latter never got beyond 1 active worker; and even on the Xeon machine it took a huge amount of time, dozens of minutes, before a few workers accumulated in top.
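
A rough way to see how expensive this shipping of arguments is, is to time the serialization of the big dictionary once by hand (a minimal sketch; it assumes the Db object from the question is in scope):

import time
import dill  # pathos serializes task arguments with dill

t0 = time.time()
payload = dill.dumps( Db)   # roughly what the main process does for every task it dispatches
print( f"dill.dumps(Db): {len(payload)/1e6:.0f} MB in {time.time()-t0:.1f} s")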

So I will need to adjust the code so that each worker re-reads the huge variables from disk/network itself. That also takes some time, but it makes sense to free the central process from this silly repeated task and instead let it do the job it was designed for: scheduling and organizing the workers' show.
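
Something along these lines is what I have in mind (a rough sketch, not the final code; db_pickle_path and functionX_fromdisk are placeholder names, and the actual loading will depend on how the pickle was written):

import pickle

def functionX_fromdisk( me, db_pickle_path, UID_lst, Uscr):
    # each worker loads the huge CSV pickle itself instead of receiving it
    # dilled from the main process
    with open( db_pickle_path, 'rb') as fh:
        Db = pickle.load( fh)
    return functionX( me, Db, UID_lst, Uscr)

# the main process then only ships a short path string per task, e.g.:
# result_pool = pool.amap( functionX_fromdisk,
#                          [me]*len(UID_lst),
#                          [db_pickle_path]*len(UID_lst),
#                          [UID_lst]*len(UID_lst),
#                          UID_lst)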

I am also happy to say that the resulting output, a CSV file (wc: 36722 1133356 90870757), is identical for the parallel run and the traditional serial version.

Having said this, I am truly surprised by how convenient it is to use python/pathos - and by not having to change the relevant worker code between the serial and parallel runs!