Gap between the job duration and stage duration in Spark


I am trying to use Spark to execute the TPC-H queries at a 1 TB scale factor. The partsupp table from TPC-H is stored in Postgres, and Spark creates a temp view for it using the following script:

# Read partsupp from Postgres as a partitioned JDBC scan and expose it as a temp view.
(
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgreshost:5432/postgres")
    .option("dbtable", "partsupp")
    .option("driver", "org.postgresql.Driver")
    .option("user", "xxx")
    .option("password", "yyy")
    # Partition the scan: 100 parallel reads striding PS_PARTKEY from 0 to 200000001.
    .option("numPartitions", "100")
    .option("lowerBound", "0")
    .option("upperBound", "200000001")
    .option("partitionColumn", "PS_PARTKEY")
    .load()
    .createOrReplaceTempView("partsupp")
)
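
For what it's worth, here is one way to confirm that this view is backed by a 100-partition JDBC scan (a minimal sketch against the temp view registered above; spark is the existing SparkSession):

# Look up the temp view registered above.
df = spark.table("partsupp")

# Each JDBC partition becomes one task in the scan stage.
print(df.rdd.getNumPartitions())  # expected: 100

# The physical plan shows the JDBC relation backing the scan.
df.explain()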

When I execute query Q11, the execution contains a job (1.4 min) that scans the partsupp table, consisting of a single stage (1.4 min), as shown in the screenshot below: [Spark UI screenshot]

But when I execute query Q2, the execution contains a similar job (2.4 min) that scans the partsupp table, also with a single stage (1.2 min): [Spark UI screenshot]

My question is: the two jobs are similar and do similar work scanning the partsupp table, so why does one show a gap between the job time (2.4 min) and the stage time (1.2 min), while the other does not (1.4 min vs. 1.4 min)? Thanks very much!
