How to use date_sub in pyspark to filter data between dates by a given date


I have a column 'calendar_date' which contains dates from 2010 to 2024, and I have to filter between dates relative to a given date. For example, given the date '2023-09-15', I have to filter the 7 days before that date on the calendar_date column. I tried to use date_sub with something like the following, but it didn't work.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_sub

spark = SparkSession.builder.appName("DateFilterExample").getOrCreate()


data = [("2023-09-01",), ("2023-09-02",), ("2023-09-03",), ("2023-09-04",), ("2023-09-05",),
        ("2023-09-06",), ("2023-09-07",), ("2023-09-08",), ("2023-09-09",), ("2023-09-10",),
        ("2023-09-11",), ("2023-09-12",), ("2023-09-13",), ("2023-09-14",), ("2023-09-15",),
        ("2023-09-16",), ("2023-09-17",), ("2023-09-18",), ("2023-09-19",), ("2023-09-20",)]
df = spark.createDataFrame(data, ["date_string"])

given_date = date_sub(col('date_string'), 0)  
start_date = date_sub(given_date, 8)  
end_date = date_sub(given_date, 1)   


# Filter the DataFrame
filter_df = df.filter(
    (col("date_string") >= start_date) & 
    (col("date_string") <= end_date) 
).where(col('date_string') <= '2023-09-15' )
filter_df.show()

which prints nothing:

+-----------+
|date_string|
+-----------+
+-----------+

1 Answer

Tomás Jullier (Best Answer)

I solved it. Since date_sub only takes a string or a column as its first parameter, filtering against a plain Python date string doesn't find any match, so you have to wrap the given date in a literal value with lit() (and to_date()) so Spark can understand it, like this:

from pyspark.sql.functions import col, date_sub, lit, to_date

given_date = '2023-09-15'
start_date = date_sub(to_date(lit(given_date)), 7)   # 2023-09-08
end_date = date_sub(to_date(lit(given_date)), 1)     # 2023-09-14

# Filter the DataFrame
filter_df = df.filter(
    (col("date_string") >= start_date) & 
    (col("date_string") <= end_date)
)
filter_df.show()
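
For reference, an equivalent way to write the same filter (a sketch, not tested here) is to cast the string column to a date and use between(), so the comparison happens date-to-date instead of relying on Spark's implicit string-to-date cast:

from pyspark.sql.functions import col, date_sub, lit, to_date

given_date = '2023-09-15'

# Same window: the 7 days before given_date, i.e. 2023-09-08 through 2023-09-14
filter_df = df.filter(
    to_date(col("date_string")).between(
        date_sub(to_date(lit(given_date)), 7),
        date_sub(to_date(lit(given_date)), 1),
    )
)
filter_df.show()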