How best to install dependencies in a SageMaker PySpark cluster


I'm trying to run a processing job for machine learning using the new SageMaker Spark container. The cluster launches, but I immediately hit an ImportError: my dependencies are missing.

I understand that the Spark container doesn't ship with those dependencies, and I've tried to follow the steps outlined on SO to install them, namely passing a .zip file of all my dependencies to the submit_py_files parameter of PySparkProcessor.run(). However, they don't seem to be installed.
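
For reference, here's a minimal sketch of my current setup (the role ARN, job name, and file names are placeholders):

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="2.4",  # version tag of the SageMaker Spark container
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

spark_processor.run(
    submit_app="spark_preprocess.py",      # the script from the traceback below
    submit_py_files=["dependencies.zip"],  # zipped dependencies, incl. PyArrow 0.16.0
)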

Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when the cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I see an ImportError when the cluster tries to use PyArrow:

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
    @pandas_udf("float", PandasUDFType.GROUPED_AGG)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

The dependencies.zip contains PyArrow 0.16.0, and I'm using the latest version of the SageMaker Python SDK.

I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!

1 Answer

Answered by Akbari:

Instead of using the PySparkProcessor directly, use SageMaker script mode through the ScriptProcessor class. This lets you run your own script as the entry point and declare your dependencies in a requirements.txt file.

For example:

from sagemaker.processing import ScriptProcessor  # ScriptProcessor lives in sagemaker.processing

script_processor = ScriptProcessor(
    base_job_name="your-job-name",
    image_uri="your-spark-container-image-uri",  # the Spark container URI for your region
    command=["python3"],                         # run the entry-point script with Python
    role="your-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

script_processor.run(
    code="your-entry-point-script.py",  # ScriptProcessor.run() takes `code`, not `submit_app`
    arguments=["arg1", "arg2"],
    # Add inputs/outputs or other parameters as needed
)

In the same directory as your entry-point script, create a requirements.txt file listing your dependencies. It should include PyArrow and any other packages you need:

pandas==your_pandas_version
pyarrow==0.16.0
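
One caveat, which is my assumption rather than documented behavior: a plain ScriptProcessor may not install requirements.txt for you automatically. As a fallback, you can pip-install the pinned packages at the top of the entry-point script, before anything imports them. A minimal sketch (package versions are illustrative):

import subprocess
import sys

# Install pinned dependencies into the container before importing them.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "pyarrow==0.16.0",
    "pandas",
])

import pandas as pd  # safe to import now that pip has run
import pyarrow       # needed for pandas_udf support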