I am trying to make use of multiprocessing across several different computers, which `pathos` seems geared towards: "Pathos is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources." In looking at the documentation, however, I am at a loss as to how to get a cluster up and running. I am looking to:
1. Set up a remote server or set of remote servers with secure authentication.
2. Securely connect to the remote server(s).
3. Map a task across all CPUs in both the remote servers and my local machine using a straightforward API like `pool.map` in the standard `multiprocessing` package (like the pseudocode in this related question).
I do not see an example for (1) and I do not understand the tunnel example provided for (2). The example does not actually connect to an existing service on the localhost. I would also like to know if/how I can require this communication to come with a password/key of some kind that would prevent someone else from connecting to the server. I understand this uses SSH authentication, but absent a preexisting key, that only ensures that the traffic is not read as it passes over the Internet and does nothing to prevent someone else from hijacking the server.
I'm the `pathos` author. Basically, for (1) you can use `pathos.pp` to connect to another computer through a socket connection. `pathos.pp` has almost exactly the same API as `pathos.multiprocessing`, although with `pathos.pp` you can give the address and port of a remote host to connect to, using the keyword `servers` when setting up the `Pool`.

However, if you want to make a secure connection with SSH, it's best to establish an SSH tunnel (as in the example you linked to), and then pass `localhost` and the local port number to the `servers` keyword in `Pool`. This will then connect to the remote `pp-worker` through the SSH tunnel. See: https://github.com/uqfoundation/pathos/blob/master/examples/test_ppmap2.py and http://www.cacr.caltech.edu/~mmckerns/pathos.html

Lastly, if you are using `pathos.pp` with a remote server, as above, you should already be doing (3). However, it can be more efficient (for a sufficiently embarrassingly parallel set of jobs) to nest the parallel maps: first use `pathos.pp.ParallelPythonPool` to build a parallel map across servers, then call an `N`-way job using a parallel map in `pathos.multiprocessing.ProcessingPool` inside the function you are mapping with `pathos.pp`. This will minimize the communication across the remote connection.

Also, you don't need to give an SSH password if you have ssh-agent working for you. See: http://mah.everybody.org/docs/ssh. Pathos assumes that for parallel maps across remote servers, you will have ssh-agent working and won't need to type your password every time there's a connection.
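As a rough sketch of the tunnel approach described above (not a definitive recipe): open the tunnel with `pathos.core.connect`, then hand `localhost` plus the tunnel's local port to `servers`. The helper names `server_address`, `square`, and `map_through_tunnel` are made up for illustration, the host and port in the usage comment are placeholders, and `_lport` is a private attribute of the tunnel object, so it may change between releases. It also assumes a worker is already listening on the remote port and that ssh-agent handles authentication.

```python
def server_address(host, port):
    """Format a host/port pair as the 'host:port' string used by `servers`."""
    return f"{host}:{port}"


def square(x):
    """Toy task to map; replace with your own function."""
    return x * x


def map_through_tunnel(remote_host, remote_port, inputs, nodes=4):
    """Map `square` over `inputs` via an SSH tunnel to a remote worker.

    Assumes a pp worker is already listening on `remote_port` on
    `remote_host`, and that SSH login works without a password prompt.
    """
    # Imported here so the helpers above can be used without pathos installed.
    from pathos.core import connect
    from pathos.pp import ParallelPythonPool

    # Forward a free local port to remote_host:remote_port over SSH.
    tunnel = connect(remote_host, port=remote_port)
    try:
        # Assumption: `_lport` holds the local end of the tunnel (private attribute).
        local = server_address("localhost", tunnel._lport)
        pool = ParallelPythonPool(nodes, servers=(local,))
        return pool.map(square, inputs)
    finally:
        tunnel.disconnect()


# Hypothetical usage (placeholder host and port):
# results = map_through_tunnel("remote.example.com", 1234, range(10))
```

Because the pool only ever sees `localhost:<local port>`, the traffic to the remote machine goes through SSH rather than over a bare socket.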
EDIT: added example code on your question here: Python Multiprocessing with Distributed Cluster
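For the nested-map idea, here is a minimal, untested sketch under some assumptions: the work is split into one chunk per machine, the outer `pathos.pp.ParallelPythonPool` ships each chunk to a machine, and the function running there fans its chunk out over local cores with `pathos.multiprocessing.ProcessingPool`. The names `chunk`, `run_chunk`, and `nested_map` are invented for illustration, and the choice of one chunk per server plus one for the local machine is my assumption, not something pathos requires.

```python
def chunk(items, n_chunks):
    """Split `items` into `n_chunks` contiguous lists of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_chunk(values, inner_nodes=4):
    """Runs on one machine: spread its chunk over that machine's cores."""
    # Import and helper live inside the function so it is self-contained
    # when it is shipped to a remote worker.
    from pathos.multiprocessing import ProcessingPool

    def work(x):
        return x * x  # toy task; replace with the real computation

    return ProcessingPool(inner_nodes).map(work, values)


def nested_map(values, servers, outer_nodes=1):
    """Outer map across machines, inner map across cores on each machine.

    `servers` is a tuple of 'host:port' strings, e.g. tunnel endpoints
    such as ('localhost:55774',).
    """
    from pathos.pp import ParallelPythonPool

    outer = ParallelPythonPool(outer_nodes, servers=servers)
    # Assumption: one chunk per remote server plus one for the local machine.
    parts = chunk(values, len(servers) + 1)
    results = outer.map(run_chunk, parts)
    # Flatten the per-chunk results back into one list, preserving order.
    return [y for part in results for y in part]
```

Only one message per chunk crosses the remote connection in each direction, which is the point of nesting the maps.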