I'm trying to run a machine learning processing job using the new SageMaker Spark container. The cluster launches, but I immediately run into an ImportError - my dependencies are missing.
I get that the Spark container doesn't have those dependencies, and I've tried to follow steps outlined on SO to install them - namely, using the submit_py_files parameter in PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to be installing them.
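For reference, the call looks roughly like this (a sketch with placeholder role, paths, and framework version, not my exact values):

    from sagemaker.spark.processing import PySparkProcessor

    # Placeholder role ARN, instance settings, and framework version
    spark_processor = PySparkProcessor(
        base_job_name="spark-preprocess",
        framework_version="2.4",
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=2,
        instance_type="ml.m5.xlarge",
    )

    # Submit the main script plus a zip of the Python dependencies
    spark_processor.run(
        submit_app="spark_preprocess.py",
        submit_py_files=["dependencies.zip"],
    )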
Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when the cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I see an ImportError when the cluster tries to use PyArrow:
Traceback (most recent call last):
File "/opt/ml/processing/input/code/spark_preprocess.py", line 35 in <module>
@pandas_udf("float", PandasUDFType.GROUPED_AGG)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149 in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
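The UDF itself is nothing special - a stripped-down version of the pattern (column and function names here are placeholders):

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Grouped-aggregate pandas UDF: takes a pandas Series, returns a scalar
    @pandas_udf("float", PandasUDFType.GROUPED_AGG)
    def mean_value(v):
        return v.mean()

    # Used roughly like: df.groupBy("group_col").agg(mean_value(df["value_col"]))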
The dependencies.zip contains PyArrow 0.16.0, and I'm using the latest version of the SageMaker Python SDK.
I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
Instead of using the PySparkProcessor directly, you can use SageMaker script mode. Script mode lets you point to a script as the entry point and declare your dependencies and configuration through a requirements.txt file.
Here is a minimal sketch of what that could look like, assuming the FrameworkProcessor from the SageMaker Python SDK (which, as I understand it, installs a requirements.txt found in source_dir before running your script); the framework choice, role, instance settings, and paths below are placeholders:
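    from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.estimator import SKLearn

    # Placeholder framework choice, role ARN, instance settings, and S3 paths
    processor = FrameworkProcessor(
        estimator_cls=SKLearn,        # any supported framework; SKLearn is just an example
        framework_version="0.23-1",
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    processor.run(
        code="spark_preprocess.py",   # entry point script
        source_dir="code/",           # directory containing the script and requirements.txt
        inputs=[ProcessingInput(source="s3://my-bucket/input/",
                                destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                                  destination="s3://my-bucket/output/")],
    )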
In the same directory as your entry point script, create a requirements.txt file listing your dependencies. This will include PyArrow and any other required packages.
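For example (the pin on PyArrow matches the version you mentioned; the other entries are just illustrative):

    pyarrow==0.16.0
    pandas
    # plus any other packages your script imports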