Importing modules for code that runs on the workers


I wrote a simple job that filters an RDD using a custom function that uses a module.

Where is the correct place to put the import statement?

  • putting the import in the driver code doesn't help
  • putting the import inside the filter function works (sketched below), but it doesn't look very good
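For reference, a minimal sketch of the placement that works; the module name (my_module) and its predicate (is_even) are just hypothetical stand-ins:

  # Minimal sketch: the import runs inside the filter function, i.e. on the worker.
  # "my_module" and "is_even" are hypothetical names; the module must still be
  # importable on each worker node (installed there or otherwise on its path).
  from pyspark import SparkContext

  def keep(x):
      import my_module          # executed on the worker that evaluates the filter
      return my_module.is_even(x)

  if __name__ == "__main__":
      sc = SparkContext(appName="import-placement")
      rdd = sc.parallelize(range(10))
      print(rdd.filter(keep).collect())
      sc.stop()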

1 Answer

Answered by AudioBubble:

You can submit jobs as batch operations, together with the modules they depend on, using the command-line spark-submit interface. The Spark 1.6.1 documentation gives it the following signature:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

If your Python script is called python_job.py and the module it depends on is other_module.py, you'd call:

 ./bin/spark-submit --py-files other_module.py python_job.py

This makes sure that other_module.py is available on the worker nodes. Note that --py-files, like the other options, goes before the application script; anything after the script is passed to it as an application argument, per the signature above. More commonly you'll submit a full package, in which case you'd pass other_module_library.egg or even a .zip; all of these are acceptable to --py-files.
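For illustration, a minimal python_job.py under this setup; it assumes other_module exposes a predicate keep(x), which is a hypothetical name:

  # python_job.py -- minimal sketch; assumes other_module defines keep(x) -> bool (hypothetical).
  from pyspark import SparkContext

  import other_module  # importable on the workers because it was shipped via --py-files

  if __name__ == "__main__":
      sc = SparkContext(appName="filter-with-dependency")
      rdd = sc.parallelize(range(100))
      print(rdd.filter(other_module.keep).collect())
      sc.stop()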

If, however, you want to work in the interactive shell, I believe that you'll have to stick with importing the module within the function.