I have a question about EMR Serverless. I want to create a script that reads data from S3 and then uploads it to a DynamoDB table using EMR Serverless.
As I would on a normal EMR cluster, I want to use the package com.audienceproject:spark-dynamodb_2.12:1.1.1.
But when I set this package in the Spark properties, the step never finishes, and when I stop it manually no error appears; it looks like the package is never actually loaded. The role I'm using has dynamodb:* on * resources, and the Spark part of my code is:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EMR_SERVERLESS") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35") \
    .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED") \
    .config("spark.sql.avro.datetimeRebaseModeInWrite", "CORRECTED") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.1") \
    .config("yarn.nodemanager.vmem-check-enabled", "false") \
    .config("yarn.nodemanager.pmem-check-enabled", "false") \
    .getOrCreate()

# Read the CSV from S3 (the path is a placeholder)
df = spark.read.format("csv").option("header", "true").load("MYS3")

## TODO: code to process the file

# Write to DynamoDB through the spark-dynamodb data source
df.write.mode("append").option("tableName", "MYTABLE") \
    .option("targetCapacity", "0.99").option("region", "MYREGION") \
    .format("dynamodb").save()
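For reference, I start the job with boto3 along these lines (the application ID, role ARN, and S3 path below are placeholders, not my real values):

import boto3

emr = boto3.client("emr-serverless")
emr.start_job_run(
    applicationId="MY_APPLICATION_ID",
    executionRoleArn="MY_JOB_ROLE_ARN",
    jobDriver={
        "sparkSubmit": {
            # Entry point script in S3; the Spark properties above are set in the script itself
            "entryPoint": "s3://MYBUCKET/my_script.py"
        }
    },
)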
Can someone help me, please?
While it's impossible to diagnose this without access to your job logs, one thing worth checking is the package download itself: spark.jars.packages makes the driver fetch the artifact from Maven Central at startup, and an EMR Serverless application has no outbound internet access unless it is configured for VPC connectivity, so a job can hang indefinitely at exactly that point.
Independently of that, I would suggest using an alternative package.
com.audienceproject:spark-dynamodb_2.12:1.1.1
is archived and has not been updated in several years; EMR Serverless did not even exist at the time of its last release. My suggestion is to use the official AWS connector for DynamoDB and Spark, which is actively maintained:
https://github.com/awslabs/emr-dynamodb-connector
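Note that this connector works through the Hadoop InputFormat/OutputFormat API rather than a DataFrame source, so the code looks different from spark-dynamodb. A minimal read sketch, reusing your spark session (the table name and region are placeholders, and the connector jar must already be on the classpath, e.g. shipped with --jars at submit time):

# Connector configuration; table name and region are placeholders
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "MYTABLE",
    "dynamodb.regionid": "MYREGION",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    "mapred.output.format.class": "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat",
}

# Each record comes back as a (Text, DynamoDBItemWritable) pair; the value is
# exposed to Python as a JSON-like string describing the item's attributes
items = spark.sparkContext.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)

Writing goes through org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat, which is straightforward from Scala or Java but clumsy from PySpark. If you want to stay in pure Python, a common workaround that needs no extra jar at all is writing each partition with boto3's batch writer (again, table and region are placeholders; note that boto3 rejects Python floats, so numeric columns must be converted to Decimal first):

def write_partition(rows):
    import boto3  # create the client per partition; boto3 clients aren't serializable
    table = boto3.resource("dynamodb", region_name="MYREGION").Table("MYTABLE")
    with table.batch_writer() as batch:  # batches puts and retries unprocessed items
        for row in rows:
            batch.put_item(Item=row.asDict())

df.foreachPartition(write_partition)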