Not all dates are captured when filtering by dates. Python Pandas

478 views Asked by At

I am filtering a dataframe by dates to produce two seperate versions:

  1. Data from only today's date
  2. Data from the last two years

However, when I try to filter on the date, it seems to miss dates that are within the last two years.

date_format = '%m-%d-%Y'  # desired date format

today = dt.now().strftime(date_format)  # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date()  # converting 'today' into a datetime object

today = today.strftime(date_format)
two_years = today - relativedelta(years=2)  # date is today's date minus two years. 
two_years = two_years.strftime(date_format)

# normalizing the format of the date column to the desired format 
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)

df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]

Which results in:

all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']

The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.

2

There are 2 answers

3
FObersteiner On BEST ANSWER

you don't need to convert anything to string, simply work with datetime dtype. Ex:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})

today = pd.Timestamp('now')

print(df[df['date'].dt.date == today.date()])
#         date
# 0 2020-07-17

print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
#         date
# 1 2020-07-15
# 2 2019-08-01

What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.

3
Kay On

Your datatype conversions are the problem here. You could do this:

today = dt.now()  # today's date. Will always result in today's date
two_years = today - relativedelta(years=2)  # date is today's date minus two years. 

This prints '2018-07-17 18:40:42.704395'. You can then convert it to the date only format.

two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()