I am on the Cloudera platform, trying to use a pandas UDF in PySpark, and I am getting the error below:

PyArrow >= 0.8.0 must be installed; however, it was not found.

Installing pyarrow 0.8.0 on the platform will take time. Is there any workaround to use a pandas UDF without installing pyarrow cluster-wide? I can install it in my personal Anaconda environment; is it possible to export the conda environment and use it in PySpark?
Pandas UDFs need pyarrow, but you can pack your virtualenv (or conda environment) and ship it to your PySpark workers, without installing custom packages like pyarrow on every machine of your platform.

To use a virtualenv this way, simply follow the venv-pack package's Spark instructions: https://jcristharif.com/venv-pack/spark.html
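As a rough sketch of that workflow (following the linked venv-pack documentation): build the environment locally, pack it into an archive, and pass the archive to `spark-submit` so Spark distributes and unpacks it on each executor. The archive name, the `environment` alias, and `my_pandas_udf_job.py` are placeholders; depending on your deploy mode (e.g. YARN cluster mode) the Python path may need to be set via a `spark.yarn.appMasterEnv` conf instead of the environment variable shown here.

```shell
# Build a virtualenv locally with the packages the workers need.
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install 'pyarrow>=0.8.0' pandas

# Pack the environment into a relocatable archive.
pip install venv-pack
venv-pack -o pyspark_venv.tar.gz

# Ship the archive with the job. The "#environment" suffix tells Spark
# to unpack it under the alias "environment" in each executor's
# working directory, so the packed interpreter can be used there.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --archives pyspark_venv.tar.gz#environment \
  my_pandas_udf_job.py
```

For a conda environment, the analogous tool is conda-pack, and the `--archives` mechanism is the same.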