ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year


We use data-type-dependent logic in Spark 3.2. For the interval year data type, the DataFrame methods schema and dtypes don't seem to work.

Without an interval year column, the methods work fine:

df1 = spark.range(1)

df1.printSchema()
# root
#  |-- id: long (nullable = false)

print(df1.schema)
# StructType(List(StructField(id,LongType,false)))

print(df1.dtypes)
# [('id', 'bigint')]
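
For context, the logic simply branches on these type strings. A minimal, hypothetical example (the numeric-type filter is made up for illustration):

# Hypothetical example of the type-dependent logic (the type list is illustrative only)
numeric_cols = [name for name, dtype in df1.dtypes if dtype in ('bigint', 'int', 'double')]
print(numeric_cols)
# ['id']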

But as soon as I add an interval year column, the schema and dtypes methods start throwing a parsing error:

from pyspark.sql import functions as F

df2 = df1.withColumn('col_interval_y', F.expr("INTERVAL '2021' YEAR"))

df2.printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- col_interval_y: interval year (nullable = false)

print(df2.schema)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year

print(df2.dtypes)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year

For our logic to work, we need to access the column data types of a DataFrame. How can we access the interval year type in Spark 3.2? (Spark 3.5 doesn't throw these errors, but we cannot upgrade yet.)


1 Answer

Answered by ZygD:

I have found that it's possible to use the underlying _jdf (the py4j handle to the JVM DataFrame).

The following recreates the output of dtypes by reading the schema on the JVM side:

# Read the schema from the JVM DataFrame, so the Python-side type parser is never invoked
jdtypes = [(x.name(), x.dataType().typeName()) for x in df2._jdf.schema().fields()]
print(jdtypes)
# [('id', 'long'), ('col_interval_y', 'interval year')]
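
With that, the type-dependent logic can branch on the JVM type names instead of dtypes. A minimal sketch (dropping the interval columns is only an illustration of how the type names can be used):

# Drop the year-month interval columns before running logic that relies on dtypes
non_interval_cols = [name for name, dtype in jdtypes if not dtype.startswith('interval')]

df3 = df2.select(*non_interval_cols)
df3.printSchema()
# root
#  |-- id: long (nullable = false)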