Date from week date format: 2022-W02-1 (ISO 8601)

Given a date, I create a column in the ISO 8601 week date format:

from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])

df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),   # ISO week-numbering year
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),     # e.g. 1 -> 'W01', 12 -> 'W12'
        F.expr('weekday(date_col) + 1')                # ISO weekday (Monday = 1)
    )
)
df.show()
# +----------+-------------+
# |  date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18|   2019-W12-1|
# |2019-12-30|   2020-W01-1|
# |2022-01-03|   2022-W01-1|
# |2022-01-10|   2022-W02-1|
# +----------+-------------+
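For reference, the same construction can be expressed in Spark SQL; a minimal sketch (the view name dates is just for illustration):

df.createOrReplaceTempView('dates')
spark.sql("""
    SELECT date_col,
           concat_ws('-',
                     extract(yearofweek FROM date_col),
                     lpad(weekofyear(date_col), 3, 'W0'),
                     weekday(date_col) + 1) AS iso_from_date
    FROM dates
""").show()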

Using Spark 3, how can I get the date back, given the ISO 8601 week date?

I tried the following, but it is both incorrect (note the last two rows below) and relies on the LEGACY time parser policy, which I don't like. Week-based pattern letters (Y, w, u) are unsupported since Spark 3.0, so without the setting this pattern is rejected outright; with it, the legacy SimpleDateFormat parser follows the JVM locale's week rules rather than strict ISO rules, which is presumably why the 2022 dates come out one week early.

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2021-12-27|
# |2022-01-10|   2022-W02-1|   2022-01-03|
# +----------+-------------+-------------+
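To quantify the error, a quick check with the standard library's ISO calendar shows that both mis-parsed dates land exactly one ISO week earlier than intended:

import datetime
print(datetime.date(2022, 1, 3).isocalendar())    # year 2022, week 1, weekday 1
print(datetime.date(2021, 12, 27).isocalendar())  # year 2021, week 52, weekday 1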

I am aware that I could create a udf which works:

import datetime

@F.udf('date')
def iso_to_date(iso_date):
    # %G = ISO year, %V = ISO week number, %u = ISO weekday (Monday = 1)
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u').date()

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()

But I am looking for a more efficient option. The ideal option should not use the LEGACY configuration and should be translatable to SQL or Scala (no inefficient udf).

1 Answer

ZygD (accepted answer):

In PySpark, I have found an option nicer than a plain udf: pandas_udf, which is vectorized and therefore more efficient:

import pandas as pd

@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    # %G = ISO year, %V = ISO week number, %u = ISO weekday (Monday = 1)
    return pd.to_datetime(iso_date, format='%G-W%V-%u')

df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2022-01-03|
# |2022-01-10|   2022-W02-1|   2022-01-10|
# +----------+-------------+-------------+

It works in Spark 3 without the LEGACY configuration, so it's acceptable.

However, there is room for improvement, as this option is not transferable to SQL or Scala. One possible direction using only built-in functions is sketched below.
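
For completeness, here is a minimal sketch that reverses the conversion with built-in functions only (and therefore translates to SQL or Scala). It assumes the fixed YYYY-'W'ww-d layout produced above and the ISO rule that 4 January always falls in week 1, so the Monday of week 1 is 4 January minus its weekday:

from pyspark.sql import functions as F

df_iso = (
    df
    .withColumn('y', F.substring('iso_from_date', 1, 4))              # ISO year
    .withColumn('w', F.substring('iso_from_date', 7, 2).cast('int'))  # week number
    .withColumn('d', F.substring('iso_from_date', 10, 1).cast('int')) # ISO weekday
    .withColumn('jan4', F.to_date(F.concat('y', F.lit('-01-04'))))
    .withColumn('week1_mon', F.expr('date_sub(jan4, weekday(jan4))')) # Monday of week 1
    .withColumn('date_from_iso',
                F.expr('date_add(week1_mon, (w - 1) * 7 + (d - 1))'))
    .select('date_col', 'iso_from_date', 'date_from_iso')
)
df_iso.show()

On the four sample rows above, date_from_iso matches date_col.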