How to package python script with dependencies into zip/tar?

1.1k views Asked by At

I've got a hadoop cluster that I'm doing data analytics on using Numpy, SciPy, and Pandas. I'd like to be able to submit my hadoop jobs as a zip/tar file using the '--file ' argument to a command. That zip file should have EVERYTHING that my python program needs to execute such that no matter what node my script executes on in the cluster, I won't run into an ImportError at runtime.

Due to company policies, installing these libraries on every node isn't exactly feasible, especially for exploratory/agile development. I do have pip and virtualenv installed to create sandboxes as needed though.

I've looked at zipimport and python packaging but none of that seems to fulfill my needs/I'm having difficulty using these tools.

Has anyone had any luck doing this? I can't seem to find any success stories online.

Thanks!

1

There are 1 answers

2
NikoNyrh On

I have solved similar problem in Apache Spark and Python context by creating a Docker image which has needed python libraries and Spark slave script installed. The image is distributed to other machines and when the container is started it auto-joins to to the cluster, we have only one such image / host machine.

Our ever-changing python project is submitted as a zip file along with the job and imports work transparently form there. Luckily we rarely need to re-create those slave images and we don't run jobs with conflicting requirements.

I'm not sure how applicable this in your scenario, especially since (in my understanding) some python libraries must be compiled.