How to pass arguments (entryPointArguments) to a Spark job using EMR Serverless?


**I'm trying to pass some arguments to my PySpark script through the `entryPointArguments` parameter of the boto3 (emr-serverless) client, but it doesn't work at all. I would like to know if I'm doing it the right way.**

**My Python script looks like this:**

```python
import argparse

parser = argparse.ArgumentParser()

parser.add_argument('-env', nargs='?', metavar='Environment', type=str,
                    help='String: Environment to run. Options: [dev, prd]',
                    choices=['dev', 'prd'],
                    required=True,
                    default="prd")

# Capture args
args = parser.parse_args()
env = args.env

print(f"HELLO WORLD FROM {env}")
```
**And my script that runs emr-serverless looks like this:**

```python
jobDriver={
    "sparkSubmit": {
        "entryPoint": "s3://example-bucket-us-east-1-codes-prd/hello_world.py",
        "entryPointArguments": ["-env prd"],
        "sparkSubmitParameters":
            "--conf spark.executor.cores=2 \
             --conf spark.executor.memory=4g \
             --conf spark.driver.cores=2 \
             --conf spark.driver.memory=8g \
             --conf spark.executor.instances=1 \
             --conf spark.dynamicAllocation.maxExecutors=12 \
            ",
    }
}
```
**I've already tried single quotes and double quotes, and I've tried passing these parameters along in `sparkSubmitParameters`, but so far nothing works. There aren't many examples of how to do this on the internet, so my hope is that someone has already done it and succeeded. Thank you!**

There are 2 answers

**Leoads99:**

I was testing it out and ended up figuring out how to do this. From what I understand, when the parameter looks like this:

`-env prd`

you have to pass it in `entryPointArguments` like this:

`["-env", "prd"]`

separating the flag from its value, so each one is its own list item.
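For context, here is a minimal sketch of how the corrected argument list could look in a full boto3 `start_job_run` call. The application ID, role ARN, and region below are placeholders, not values from the question:

```python
import boto3

client = boto3.client("emr-serverless", region_name="us-east-1")

response = client.start_job_run(
    applicationId="00exampleapplicationid",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder role
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket-us-east-1-codes-prd/hello_world.py",
            # Flag and value as separate list items, not a single "-env prd" string.
            "entryPointArguments": ["-env", "prd"],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=2 "
                "--conf spark.executor.memory=4g "
                "--conf spark.driver.cores=2 "
                "--conf spark.driver.memory=8g"
            ),
        }
    },
)
print(response["jobRunId"])
```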

**MeHow89:**

To pass parameters into the application, a configuration named `entryPointArguments` should be specified in the `sparkSubmit` part of the command.

Below is a full AWS CLI command that runs a job on an EMR Serverless application, passing named arguments into a Python script containing PySpark code. Additional parameters in the spark-submit part of the command pass packages (`utilities.zip`) and JAR files (`JDBC_Driver.jar`) to the Spark executors so the application can use them. The `--execution-role-arn` value should come from IAM, and `--application-id` identifies the EMR Serverless application (which must be created beforehand) that will run the job.

```bash
aws emr-serverless start-job-run --execution-role-arn arn:aws:iam::123456:role/RoleName \
 --application-id 1234567 --job-driver \
'{
  "sparkSubmit": {
    "entryPoint": "s3://MyS3Bucket/dir/pyspark/spark_app.py",
    "entryPointArguments": [
      "--s3",
      "MyS3Bucket",
      "--prefix",
      "dir/pyspark",
      "--env",
      "dev"
    ],
    "sparkSubmitParameters": "--conf spark.submit.pyFiles=s3://MyS3Bucket/dir/pyspark/utilities.zip, --jars s3://MyS3Bucket/dir/drivers/JDBC_Driver.jar"
  }
}' \
--configuration-overrides \
'{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {
      "logUri": "s3://MyS3Bucket/logs/"
    }
  }
}'
```
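For completeness, here is a minimal sketch of how `spark_app.py` might read those named arguments with argparse. The argument names match the command above, but the job body is assumed, since the answer doesn't show it:

```python
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--s3", required=True, help="S3 bucket name")
parser.add_argument("--prefix", required=True, help="Key prefix under the bucket")
parser.add_argument("--env", choices=["dev", "prd"], default="dev", help="Target environment")
args = parser.parse_args()

spark = SparkSession.builder.appName("spark_app").getOrCreate()

# Illustrative use of the parsed arguments -- the real job logic is not shown in the answer.
print(f"Running in {args.env} against s3://{args.s3}/{args.prefix}")

spark.stop()
```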