How to optimize writing to a large table in Hive/HDFS using Spark


I am running into a common problem when writing a large table to Hive/HDFS from a source table containing 24 billion rows.

The Spark job applies a Window function to keep only the latest state for each key in the source table, which leaves around 11 billion rows to insert into the target table.
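
For illustration, the latest-state extraction looks roughly like this (a minimal sketch; the key and ordering columns are placeholders, here called key and update_ts):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Placeholder names: "key" identifies a record, "update_ts" orders its successive states
val latestStateWindow = Window.partitionBy("key").orderBy(col("update_ts").desc)

val df = spark.table(sourceTable)
  .withColumn("rn", row_number().over(latestStateWindow)) // rank each key's states, newest first
  .filter(col("rn") === 1)                                // keep only the latest state per key
  .drop("rn")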

The tables are partitioned on a date field named dt_part, in YYYYMM format.

The screenshot below shows the data-writing Stage:

[screenshot of the Spark Stage performing the write]

This Stage runs for a very long time and bogs down towards the end, with no data actually being written out.

The data writing function is as follows:

df
  .write
  .partitionBy("dt_part")
  .mode(SaveMode.Overwrite)
  .saveAsTable(targetTable)
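
The intent with SaveMode.Overwrite is to fully replace the target table, producing one Hive partition per distinct dt_part value.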

For reference, the Spark job runs with the following resources, with dynamic allocation enabled (the corresponding configuration keys are sketched after the list):

  • Driver memory = 15 G
  • Executor memory = 16 G
  • Executor cores = 8
  • Max number of executors = 80
  • spark.sql.shuffle.partitions = 10000
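
These settings correspond to the following configuration keys (shown here as a SparkConf sketch; in practice they would typically be passed at submission time, e.g. via --conf):

import org.apache.spark.SparkConf

// Configuration keys matching the resources listed above
val conf = new SparkConf()
  .set("spark.driver.memory", "15g")
  .set("spark.executor.memory", "16g")
  .set("spark.executor.cores", "8")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "80")
  .set("spark.sql.shuffle.partitions", "10000")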

I have tested several values of spark.sql.shuffle.partitions (250, 3000 and 20000), but the result remains the same: I cannot write the data in a reasonable timeframe.

Do you have any suggestions for optimising the writing process?
