How to optimize writing to a large table in Hive/HDFS using Spark


I am running into a common problem when writing a large table to Hive/HDFS from a source table containing 24 billion rows.

The Spark job applies a Window function to keep only the latest state for each key in the source table, which leaves around 11 billion rows to insert into the target table.
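
For illustration, the latest-state extraction looks roughly like this (a minimal sketch; the key and ordering columns are placeholders, here called key and update_ts):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Placeholder names: "key" identifies a record, "update_ts" orders its successive states
val latestStateWindow = Window.partitionBy("key").orderBy(col("update_ts").desc)

val df = spark.table(sourceTable)
  .withColumn("rn", row_number().over(latestStateWindow)) // rank each key's states, newest first
  .filter(col("rn") === 1)                                // keep only the latest state per key
  .drop("rn")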

The tables are partitioned on a date field named dt_part, in YYYYMM format.

The screenshot below shows the data-writing Stage:

[screenshot of the Spark Stage performing the write]

This Stage runs for a very long time and bogs down towards the end, with no data actually being written out.

The data writing function is as follows:

df
  .write
  .partitionBy("dt_part")
  .mode(SaveMode.Overwrite)
  .saveAsTable(targetTable)
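
The intent with SaveMode.Overwrite is to fully replace the target table, producing one Hive partition per distinct dt_part value.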

For reference, the Spark job runs with the following resources, with dynamic allocation enabled (the corresponding configuration keys are sketched after the list):

  • Driver memory = 15 G
  • Executor memory = 16 G
  • Executor cores = 8
  • Max number of executors = 80
  • spark.sql.shuffle.partitions = 10000
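
These settings correspond to the following configuration keys (shown here as a SparkConf sketch; in practice they would typically be passed at submission time, e.g. via --conf):

import org.apache.spark.SparkConf

// Configuration keys matching the resources listed above
val conf = new SparkConf()
  .set("spark.driver.memory", "15g")
  .set("spark.executor.memory", "16g")
  .set("spark.executor.cores", "8")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "80")
  .set("spark.sql.shuffle.partitions", "10000")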

I have tested several values of spark.sql.shuffle.partitions (250, 3000 and 20000), but the result remains the same: I cannot write the data in a reasonable timeframe.

Do you have any suggestions for optimising the writing process?
