Let's say I have a Spark 2.x application with speculation enabled (spark.speculation=true) that writes data to a specific location on HDFS.
Now, if a task that writes data to HDFS takes too long, Spark launches a speculative copy of that task on another executor, and both task attempts run in parallel.
How does Spark handle this? Obviously the two attempts shouldn't both be writing to the same file location at the same time (which seems to be what is happening in this case).
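To make the setup concrete, here is a minimal sketch of what I mean (the app name, HDFS path, and data are placeholders, not my real job):

```scala
import org.apache.spark.sql.SparkSession

object SpeculativeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("speculative-write")
      .config("spark.speculation", "true") // re-launch slow tasks speculatively
      .getOrCreate()

    val df = spark.range(0, 1000000L).toDF("id")

    // If one of the write tasks is slow, Spark may start a second attempt
    // of the same task on another executor while the first is still running.
    df.write.mode("overwrite").parquet("hdfs:///data/output")

    spark.stop()
  }
}
```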
Any help would be appreciated.
Thanks
As I understand what is happening in my tasks: each task attempt writes its output to its own temporary file (under the job's _temporary directory), not directly to the final location.
When Spark kills the losing attempt, it also deletes the temporary file written by that attempt.
So no data will be duplicated.
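To illustrate the pattern as I understand it, here is a simplified sketch of the write-to-temp, rename-on-commit idea. This is not Spark's actual committer code; the file names and paths are made up:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

object CommitSketch {
  def main(args: Array[String]): Unit = {
    val out = Paths.get("/tmp/output")
    Files.createDirectories(out)

    // Two attempts of the same task each write to their own temporary file.
    val attempt0 = Files.write(out.resolve("_temporary_attempt_0"), "rows".getBytes)
    val attempt1 = Files.write(out.resolve("_temporary_attempt_1"), "rows".getBytes)

    // Only the attempt that commits first renames its file into place.
    Files.move(attempt0, out.resolve("part-00042"), StandardCopyOption.REPLACE_EXISTING)

    // The losing attempt's temporary file is simply deleted when that
    // attempt is killed, so no duplicate data reaches the final output.
    Files.deleteIfExists(attempt1)
  }
}
```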
I will continue to study this situation, so maybe this answer will be more helpful some day.