I am using Spark 3.3.1 to read an input CSV file with the following header and values:

ID, CREATE_DATE
1,  14:42:23.0

Only the time (HH:MM:SS.SSS) is present; the date part (YYYY-MM-DD) is missing from the CREATE_DATE field. I read CREATE_DATE as TimestampType() with the schema defined in the code below.

from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, StringType

# Define schema (includes _corrupt_record to capture bad rows)
sparkCustomSchema = StructType([
    StructField('ID', IntegerType(), True),
    StructField('CREATE_DATE', TimestampType(), True),
    StructField('_corrupt_record', StringType(), True)])

# Read CSV with the schema
input_df = (sparkSession.read.format("csv")
            .schema(sparkCustomSchema)
            .option("sep", ",").option("header", "true").option("mode", "PERMISSIVE")
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
            .option("ignoreLeadingWhiteSpace", "true").option("ignoreTrailingWhiteSpace", "true")
            .option("treatEmptyValuesAsNulls", "true")
            .csv(csv_input_path))

I'm getting the following result from input_df.show():

ID  CREATE_DATE          _corrupt_record
1   2023-07-09 14:42:23  null

In the above result, the current date 2023-07-09 was automatically inserted alongside the timestamp even though the date is missing from the input CSV record. Spark 3.3.1 does not treat the input 14:42:23.0 as a corrupt record, but Spark 2.4 does, producing the output below.

Spark 2.4 Output:

ID    CREATE_DATE  _corrupt_record
null  null         1, 14:42:23
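To illustrate the difference: the Spark 2.4 behavior corresponds to strict pattern matching, where a time-only string simply fails to match the full timestamp pattern. A plain-Python analogue (the format string is the `strptime` equivalent of the Spark pattern, used here only for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # strptime analogue of "yyyy-MM-dd HH:mm:ss.SSS"

def parse_strict(value: str):
    """Return a datetime if value matches FMT in full, else None (corrupt)."""
    try:
        return datetime.strptime(value, FMT)
    except ValueError:
        return None

print(parse_strict("2023-07-09 14:42:23.0"))  # full timestamp parses
print(parse_strict("14:42:23.0"))             # time-only -> None (corrupt)
```

Under strict matching there is nothing to fill in a missing date with, so the time-only value can only be rejected.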

I tried setting the spark.sql.legacy.timeParserPolicy property to LEGACY and to CORRECTED, but the result did not change: the current date is still inserted into the timestamp and the record is not treated as corrupt. Is there a way to mark this record as corrupt and get the same result as Spark 2.4 for a missing-date record in Spark 3.3.1?

I ran the code after setting the property below, but the record is still not treated as corrupt, and the current date is still inserted alongside the given time:

sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
# sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

How can I achieve the same result as Spark 2.4 in Spark 3.3.1?
