Spark 2.x DataFrame write consistency check in Append mode

I am reading data in Spark inside a for loop, performing a join, and writing the result to an output path in append mode.

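import org.apache.spark.sql.SaveMode

for (partition <- partitionlist) {
    val df = spark.read.parquet("path")
    // Join on the shared key; Seq("col1") keeps a single col1 column in the result
    val df2 = df.join(anotherdf, Seq("col1"))
    // Pass the SaveMode enum (or the string "append"), not the string "SaveMode.Append"
    df2.write.mode(SaveMode.Append).partitionBy("partitionColumn").format("parquet").save("anotherpath")
}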

The sample code above runs on Spark 2.x. Since the Spark 2 write APIs are not consistent, suppose that in some iteration the write stages/tasks fail, are retried, and succeed after a few attempts. Is it possible that the output written by that iteration then contains duplicate data?

EDIT: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is being used.
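
For reference, a minimal sketch of how that setting might be applied when the session is built. Only the property name and value come from my setup; the application name here is a hypothetical placeholder.

```scala
import org.apache.spark.sql.SparkSession

// The appName is a placeholder for illustration only.
val spark = SparkSession.builder()
  .appName("append-consistency-check")
  // Committer algorithm version mentioned in the EDIT above
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```

The same property can also be passed on the command line with `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`.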

There are 0 answers