AWS DynamoDB package problem - EMR Serverless


I have a question about EMR Serverless. I want to create a script that reads data from S3 and then uploads it to a DynamoDB table using EMR Serverless.

As with a regular EMR cluster, I want to use the package com.audienceproject:spark-dynamodb_2.12:1.1.1.

But when I set it in the Spark properties (screenshot not included), the job does not behave as expected.

The step in my EMR Serverless job never finishes. When I stop it manually, no error appears, but it seems that it never reaches the package. The role I'm using has dynamodb:* on all resources (*), and the Spark part of my code is:

from pyspark.sql import SparkSession

# spark.jars.packages asks Spark to resolve the connector from its Maven coordinates
spark = SparkSession.builder.appName("EMR_SERVERLESS")\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")\
    .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")\
    .config("spark.sql.avro.datetimeRebaseModeInWrite", "CORRECTED")\
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.1")\
    .config("yarn.nodemanager.vmem-check-enabled", "false")\
    .config("yarn.nodemanager.pmem-check-enabled", "false").getOrCreate()

df = spark.read.format("csv").option("header","true").load(f'MYS3')

##TODO CODE TO PROCESS FILE

df.write.mode("append").option("tableName", f'MYTABLE').option("targetCapacity","0.99").option("region","MYREGION").format("dynamodb").save()

Can someone help me, please?


There is 1 answer

Leeroy Hannigan

While it's impossible to diagnose this without access to your cluster logs, I would suggest using an alternative package. com.audienceproject:spark-dynamodb_2.12:1.1.1 is archived and has not been updated in several years; when it was last updated, EMR Serverless did not exist.

My suggestion is to use the official AWS connector for DynamoDB and Spark, which is actively maintained:

https://github.com/awslabs/emr-dynamodb-connector
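
As a rough sketch of how both points could be wired into a job submission: the helper below only assembles the arguments for the EMR Serverless StartJobRun call, attaching a connector jar with `--jars` and sending logs to S3 so that there is something to inspect if the job hangs again. Everything specific here is an assumption for illustration: the function name `build_job_run_request` is hypothetical, all IDs, ARNs and S3 paths are placeholders, and it assumes you have already built the connector jar and uploaded it to S3 yourself. It does not show the DynamoDB write itself.

```python
def build_job_run_request(application_id, role_arn, script_uri,
                          connector_jar_uri, log_uri):
    """Assemble keyword arguments for an EMR Serverless StartJobRun call.

    Hypothetical helper: every argument is a placeholder supplied by the caller.
    """
    # Ship the connector as a pre-built jar instead of resolving it at startup.
    spark_params = " ".join([
        f"--jars {connector_jar_uri}",
        "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    ])
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "sparkSubmitParameters": spark_params,
            }
        },
        # Write driver and executor logs to S3 so a stuck job can be inspected.
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }


# Placeholder values only; replace with your own application, role and bucket.
request = build_job_run_request(
    application_id="MYAPPLICATIONID",
    role_arn="arn:aws:iam::123456789012:role/MYROLE",
    script_uri="s3://MYBUCKET/scripts/job.py",
    connector_jar_uri="s3://MYBUCKET/jars/emr-dynamodb-connector.jar",
    log_uri="s3://MYBUCKET/logs/",
)
print(request["jobDriver"]["sparkSubmit"]["sparkSubmitParameters"])
```

With boto3 installed and credentials configured, the request could then be submitted with `boto3.client("emr-serverless").start_job_run(**request)`. Once the logs land under the `logUri` prefix, the driver output should show whether the job is stuck resolving dependencies or somewhere later.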