We use data-type-dependent logic in Spark 3.2. For the interval year data type, the DataFrame methods schema and dtypes don't seem to work. Without an interval year column, the methods work fine:
df1 = spark.range(1)
df1.printSchema()
# root
# |-- id: long (nullable = false)
print(df1.schema)
# StructType(List(StructField(id,LongType,false)))
print(df1.dtypes)
# [('id', 'bigint')]
But when I add an interval year column, the schema and dtypes methods throw a parsing error:
from pyspark.sql import functions as F

df2 = df1.withColumn('col_interval_y', F.expr("INTERVAL '2021' YEAR"))
df2.printSchema()
# root
# |-- id: long (nullable = false)
# |-- col_interval_y: interval year (nullable = false)
print(df2.schema)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year
print(df2.dtypes)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year
For our logic to work, we need to access the column data types of a DataFrame. How can we access the interval year type in Spark 3.2? (Spark 3.5 doesn't throw these errors, but we cannot use it yet.)
I have found that it's possible to use the underlying _jdf. The following recreates the result of dtypes:
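For example, a rough sketch of that idea (my own approximation, not necessarily the exact code: it walks the JVM schema through py4j and calls the JVM-side StructField.name() and DataType.simpleString() methods rather than any public PySpark API):
# Iterate the JVM StructType's fields; simpleString() yields names like
# 'bigint' and 'interval year', matching what dtypes normally reports.
jvm_fields = df2._jdf.schema().fields()
dtypes_workaround = [(f.name(), f.dataType().simpleString()) for f in jvm_fields]
print(dtypes_workaround)
# [('id', 'bigint'), ('col_interval_y', 'interval year')]
Since _jdf is a private attribute, this relies on implementation details and may break across Spark versions.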