I'm trying to run a python script that uses a custom Python interpreter with --deploy-mode cluster
on an Enterprise 4.2 cluster.
First, I create a Hive table and load some sample data:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hc = HiveContext(sc)

# hc.sql() returns a DataFrame; collect() pulls the rows back to the driver
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I then try to run the script like this:
export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7
spark-submit --master yarn \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py
This runs on the YARN cluster, but it does NOT use the PYSPARK_PYTHON variable.
However, if I use --deploy-mode client, PYSPARK_PYTHON is picked up fine.
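As far as I can tell, in --deploy-mode cluster the driver itself runs inside a YARN container on the cluster, so variables exported in my shell on the gateway node never reach it. Here is a sketch of what I would expect to work instead, passing the interpreter through the standard spark.yarn.appMasterEnv.* (driver/ApplicationMaster) and spark.executorEnv.* (executor) properties:
spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7 \
--conf spark.executorEnv.PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7 \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py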
Update
I have tried adding this before the SparkContext is initialised:
import os

os.environ["PYSPARK_PYTHON"] = '/home/biadmin/anaconda2/bin/python2.7'
os.environ["PYSPARK_DRIVER_PYTHON"] = '/home/biadmin/anaconda2/bin/python2.7'
I have also tried setting --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7 as sketched above, but the job fails with:
Container: container_e09_1477084339086_0508_02_000001 on bi4c-xxxxxx-data-1.bi.services.bluemix.net_45454
==========================================================================================================
LogType:stderr
...
java.io.IOException: Cannot run program "/home/biadmin/anaconda2/bin/python2.7": error=2, No such file or directory
However, the interpreter does exist at that path on the data node:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ ssh bi4c-xxxxxx-data-1.bi.services.bluemix.net
[biadmin@bi4c-xxxxxx-data-1 ~]$ ls /home/biadmin/anaconda2/bin/python2.7
/home/biadmin/anaconda2/bin/python2.7
You are right, exporting PYSPARK_PYTHON in the shell will not work that way in cluster mode.
You can try adding these lines to your script before the SparkContext is started:
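import os

os.environ["PYSPARK_PYTHON"] = '/home/biadmin/anaconda2/bin/python2.7'
os.environ["PYSPARK_DRIVER_PYTHON"] = '/home/biadmin/anaconda2/bin/python2.7'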
Of course, this assumes the Anaconda path above is correct for your cluster. If it is not, you either need to install Anaconda at the same path on every worker node, or change the path to wherever Anaconda is installed on each worker.
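If installing Anaconda at an identical path on every node is not practical, another option is to ship the environment with the job via YARN's --archives mechanism and point PYSPARK_PYTHON at the unpacked copy. This is only a sketch: the archive name anaconda2.tar.gz and the alias env are examples, and a plain tarball of a full Anaconda install may not relocate cleanly (hardcoded paths are the problem that tools like conda-pack were later built to solve):
# pack the environment contents into a tarball (example name)
tar -czf anaconda2.tar.gz -C /home/biadmin/anaconda2 .

spark-submit --master yarn \
--deploy-mode cluster \
--archives anaconda2.tar.gz#env \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python2.7 \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py
YARN extracts the archive into each container's working directory under the alias after the #, so the relative path ./env/bin/python2.7 resolves on every node without anything being pre-installed there.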