I wrote a simple job that filters an RDD using a custom function that uses a module.
Where is the correct place to put the import statement?
- Putting the import in the driver code doesn't help.
- Putting the import inside the filter function works, but it doesn't look very good (see the sketch below).
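For illustration, a minimal sketch of the setup being described (the module name `other_module` and its `is_valid` helper are hypothetical):

```python
from pyspark import SparkContext

import other_module  # per the question: importing here, in the driver, doesn't help

sc = SparkContext("local[2]", "import-placement-demo")

def keep(x):
    import other_module  # per the question: importing inside the filter function works, but looks awkward
    return other_module.is_valid(x)

filtered = sc.parallelize(range(100)).filter(keep)
print(filtered.count())
```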
You can submit jobs as batch operations, together with their dependent modules, using the command-line `spark-submit` interface. From the Spark 1.6.1 documentation, it has the following signature ...

If your Python script is called `python_job.py` and the module upon which it depends is `other_module.py`, you'd call `spark-submit` along the lines of the sketch below.
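A minimal sketch of such a call (the `./bin/spark-submit` path is an assumption; adjust it and any `--master`/config flags for your installation):

```bash
# ship other_module.py to the executors along with the main script
./bin/spark-submit --py-files other_module.py python_job.py
```

With the module distributed this way, a plain `import other_module` at the top of `python_job.py` should resolve on the executors as well.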
This will make sure that `other_module.py` is on the worker nodes. It is more common that you'll submit a full package, in which case you'd submit an `other_module_library.egg` or even a `.zip`; these should all be acceptable to `--py-files`, as sketched below.
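For example, a minimal sketch of bundling and shipping a whole package (the directory name `other_module_library/` is hypothetical):

```bash
# bundle the package directory into an archive that Spark can distribute,
# then pass it to --py-files so it lands on the executors' PYTHONPATH
zip -r other_module_library.zip other_module_library/
./bin/spark-submit --py-files other_module_library.zip python_job.py
```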
If, however, you want to work in the interactive shell, I believe that you'll have to stick with importing the module within the function.