PyArrow >= 0.8.0 must be installed; however, it was not found


I am on the Cloudera platform and am trying to use a pandas UDF in PySpark. I am getting the error below: PyArrow >= 0.8.0 must be installed; however, it was not found.

Installing pyarrow 0.8.0 on the platform will take time.

Is there any workaround to use pandas udf without installing pyarrow? I can install on my personal anaconda environment, is it possible to export conda and use it in pyspark?

1 answer

Answered by E.ZY.
  • "I can install on my personal anaconda environment, is it possible to export conda and use it in pyspark?" No, you can't simply install it on your own machine and use it, because PySpark is distributed: the pandas UDF runs on the worker nodes, so they need PyArrow as well.

But you can pack your virtual environment and ship it to the PySpark workers, without installing custom packages such as pyarrow on every machine of your platform.
To do this with virtualenv, follow the instructions for the venv-pack package: https://jcristharif.com/venv-pack/spark.html
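
As a rough sketch of what shipping a packed environment can look like, the helper below only assembles the spark-submit command line; it does not run Spark. It assumes a YARN cluster and an archive named environment.tar.gz that was already created with venv-pack in an environment where pyarrow is installed. The function name build_spark_submit, the archive name, and the alias "environment" are illustrative choices, not anything mandated by Cloudera or PySpark, so check the linked venv-pack page for the exact options that match your deploy mode.

```python
import shlex


def build_spark_submit(script, archive="environment.tar.gz",
                       alias="environment", deploy_mode="cluster"):
    """Assemble the environment variables and arguments for spark-submit.

    Hypothetical helper: `archive` is assumed to be a packed virtualenv
    (for example produced by `venv-pack -o environment.tar.gz`) that
    already contains pyarrow. YARN unpacks it on each node under `alias`.
    """
    # Python interpreter inside the unpacked archive, relative to the
    # container's working directory on each node.
    worker_python = f"./{alias}/bin/python"

    env = {"PYSPARK_PYTHON": worker_python}
    argv = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        # Make the application master use the shipped interpreter too.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={worker_python}",
        # "archive#alias" tells YARN to unpack the archive as ./alias.
        "--archives", f"{archive}#{alias}",
        script,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_spark_submit("script.py")
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    # Print a shell-ready command line for inspection before running it.
    print(prefix, shlex.join(argv))
```

Since the question mentions an Anaconda environment: conda-pack is the analogous tool for conda environments, and the resulting archive can be passed to --archives in the same way.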