Importing modules for code that runs on the workers


I wrote a simple job that filters an RDD using a custom function that uses a module.

Where is the correct place to put the import statement?

  • putting the import in the driver code doesn't help
  • putting the import inside the filter function works (sketched below), but it doesn't look very good
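For reference, a minimal sketch of the placement that works; the module name (my_module) and its predicate (is_even) are just hypothetical stand-ins:

  # Minimal sketch: the import runs inside the filter function, i.e. on the worker.
  # "my_module" and "is_even" are hypothetical names; the module must still be
  # importable on each worker node (installed there or otherwise on its path).
  from pyspark import SparkContext

  def keep(x):
      import my_module          # executed on the worker that evaluates the filter
      return my_module.is_even(x)

  if __name__ == "__main__":
      sc = SparkContext(appName="import-placement")
      rdd = sc.parallelize(range(10))
      print(rdd.filter(keep).collect())
      sc.stop()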

1 Answer

Answered by AudioBubble:

You can submit jobs as batch operations, together with the modules they depend on, using the command-line spark-submit interface. The Spark 1.6.1 documentation gives it the following signature:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

If your Python script is called python_job.py and the module it depends on is other_module.py, you'd call:

 ./bin/spark-submit --py-files other_module.py python_job.py

This makes sure that other_module.py is available on the worker nodes. Note that --py-files, like the other options, goes before the application script; anything after the script is passed to it as an application argument, per the signature above. More commonly you'll submit a full package, in which case you'd pass other_module_library.egg or even a .zip; all of these are acceptable to --py-files.
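For illustration, a minimal python_job.py under this setup; it assumes other_module exposes a predicate keep(x), which is a hypothetical name:

  # python_job.py -- minimal sketch; assumes other_module defines keep(x) -> bool (hypothetical).
  from pyspark import SparkContext

  import other_module  # importable on the workers because it was shipped via --py-files

  if __name__ == "__main__":
      sc = SparkContext(appName="filter-with-dependency")
      rdd = sc.parallelize(range(100))
      print(rdd.filter(other_module.keep).collect())
      sc.stop()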

If, however, you want to work in the interactive shell, I believe that you'll have to stick with importing the module within the function.