Given a date, I create a column in the ISO 8601 week date format:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])
df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),
        # lpad to width 3 with 'W0': week 1 -> 'W01', week 12 -> 'W12'
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),
        # weekday() is 0-based (Monday = 0), so +1 gives the ISO day of week
        F.expr('weekday(date_col) + 1')
    )
)
df.show()
# +----------+-------------+
# | date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18| 2019-W12-1|
# |2019-12-30| 2020-W01-1|
# |2022-01-03| 2022-W01-1|
# |2022-01-10| 2022-W02-1|
# +----------+-------------+
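For reference, these values match Python's own ISO calendar; datetime.date.isocalendar() returns the (ISO year, ISO week, ISO weekday) triple:
import datetime
datetime.date(2019, 12, 30).isocalendar()  # ISO year 2020, week 1, weekday 1 -> 2020-W01-1
datetime.date(2022, 1, 3).isocalendar()    # ISO year 2022, week 1, weekday 1 -> 2022-W01-1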
Using Spark 3, how can I get the date back from an ISO 8601 week date?
I tried the following, but it is both incorrect and relies on the LEGACY configuration, which I would like to avoid.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# | date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18| 2019-W12-1| 2019-03-18|
# |2019-12-30| 2020-W01-1| 2019-12-30|
# |2022-01-03| 2022-W01-1| 2021-12-27|
# |2022-01-10| 2022-W02-1| 2022-01-03|
# +----------+-------------+-------------+
I am aware of the possibility to create a udf, which works:
import datetime
@F.udf('date')
def iso_to_date(iso_date):
    # %G = ISO week-based year, %V = ISO week number, %u = ISO weekday (1 = Monday)
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
But I am looking for a more efficient option. The ideal option should not use the LEGACY configuration and should be translatable to SQL or Scala (no inefficient udf).
In PySpark, I have found an option nicer than a udf: a pandas_udf, which is vectorized (more efficient). It works in Spark 3 without the LEGACY configuration, so it's acceptable. However, there is room for improvement, as this option is not transferable to SQL or Scala.
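A pandas_udf along these lines does the parsing (a sketch; the function name and exact body are illustrative, and it assumes the installed pandas' to_datetime supports the ISO directives %G, %V and %u):
import pandas as pd
@F.pandas_udf('date')
def iso_to_date_pd(iso_date: pd.Series) -> pd.Series:
    # vectorized parse of the whole batch; assumes pandas' to_datetime handles
    # %G (ISO week-based year), %V (ISO week) and %u (ISO weekday, 1 = Monday)
    return pd.to_datetime(iso_date, format='%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date_pd('iso_from_date')).show()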