I am using Spark 3.3.1 to read an input CSV file with the following header and values:

ID, CREATE_DATE
1,  14:42:23.0

Only the time (HH:MM:SS.SSS) is present; the date part (YYYY-MM-DD) is missing from the CREATE_DATE field. I read CREATE_DATE as TimestampType() with the schema defined in the code below.

from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, StringType

# Define schema (includes _corrupt_record to capture bad rows)
sparkCustomSchema = StructType([
    StructField('ID', IntegerType(), True),
    StructField('CREATE_DATE', TimestampType(), True),
    StructField('_corrupt_record', StringType(), True)])

# Read CSV with the schema
input_df = (sparkSession.read.format("csv")
            .schema(sparkCustomSchema)
            .option("sep", ",").option("header", "true").option("mode", "PERMISSIVE")
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
            .option("ignoreLeadingWhiteSpace", "true").option("ignoreTrailingWhiteSpace", "true")
            .option("treatEmptyValuesAsNulls", "true")
            .csv(csv_input_path))

I'm getting the following result from input_df.show():

ID  CREATE_DATE          _corrupt_record
1   2023-07-09 14:42:23  null

In the above result, the current date 2023-07-09 was automatically inserted alongside the timestamp even though the date is missing from the input CSV record. Spark 3.3.1 does not treat the input 14:42:23.0 as a corrupt record, but Spark 2.4 does, producing the output below.

Spark 2.4 Output:

ID    CREATE_DATE  _corrupt_record
null  null         1, 14:42:23
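To illustrate the difference: the Spark 2.4 behavior corresponds to strict pattern matching, where a time-only string simply fails to match the full timestamp pattern. A plain-Python analogue (the format string is the `strptime` equivalent of the Spark pattern, used here only for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # strptime analogue of "yyyy-MM-dd HH:mm:ss.SSS"

def parse_strict(value: str):
    """Return a datetime if value matches FMT in full, else None (corrupt)."""
    try:
        return datetime.strptime(value, FMT)
    except ValueError:
        return None

print(parse_strict("2023-07-09 14:42:23.0"))  # full timestamp parses
print(parse_strict("14:42:23.0"))             # time-only -> None (corrupt)
```

Under strict matching there is nothing to fill in a missing date with, so the time-only value can only be rejected.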

I tried setting the spark.sql.legacy.timeParserPolicy property to LEGACY and to CORRECTED, but the result did not change: the current date is still inserted into the timestamp and the record is not treated as corrupt. Is there a way to mark this record as corrupt and get the same result as Spark 2.4 for a missing-date record in Spark 3.3.1?

I ran the code after setting the property below, but the record is still not treated as corrupt, and the current date is still inserted alongside the given time:

sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
# sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

How can I achieve the same result as Spark 2.4 in Spark 3.3.1?
