I have deployed an HDInsight 3.5 Spark (2.0) cluster on Microsoft Azure with the standard configuration (Location = US East, Head Nodes = D12 v2 (x2), Worker Nodes = D4 v2 (x4)). When the cluster is running, I connect to a Jupyter notebook and try to import a module I created myself:
import own_module
This unfortunately does not work, so I tried to 1) upload own_module.py to the Jupyter Notebook home directory and 2) copy own_module.py to /home/sshuser via an SSH connection. Afterwards I added /home/sshuser to sys.path and PYTHONPATH:
import os
import sys

sys.path.append('/home/sshuser')
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ':/home/sshuser'
This also does not work, and the error persists:
Traceback (most recent call last):
ImportError: No module named own_module
Could someone tell me how I can import my own modules? Preferably by putting them in Azure Blob storage and then transferring them to the HDInsight cluster.
You can use the Spark context's addPyFile method. First upload the file to Azure Blob storage, then copy its public http/https URL and pass that URL to addPyFile. The module will then be accessible on the driver and on all executors.
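For example, in an HDInsight PySpark Jupyter notebook the SparkContext is already available as sc. A minimal sketch, assuming a publicly readable blob; the storage account, container, and file name below are placeholders you would replace with your own:

# In an HDInsight PySpark notebook, `sc` (SparkContext) is predefined.
# The URL is a placeholder: substitute your storage account, container,
# and uploaded file name.
sc.addPyFile('https://mystorageaccount.blob.core.windows.net/mycontainer/own_module.py')

# Once the file has been shipped to the cluster, import it as usual.
import own_module

addPyFile distributes the file to the driver and to every executor node, so functions that run on the cluster (for example inside rdd.map) can import the module as well.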