Compare dates with null values

3.2k views Asked by At

I have two columns. I want to check if the difference between them is between 0 and 10 days. One of the fields often contains null values.

df['Diff'] = (df['Dt1'] - df['Dt2'])

def wdw(x):
    if pd.notnull(x):
        if type(x) !=long:
            if type(timedelta(days=10)) != long:
                if x > timedelta(days=10):
                    return 1
    else:
        return 0

df['Diff'].df(wdw)

When I run this I get the following error.

TypeError: can't compare datetime.timedelta to long

When I look at the values of df['Diff'] They all appear to be timedeltas. Any idea what is going on here? It seems like creating an indicator based on the difference between two date fields should be easier than this...

3

There are 3 answers

4
unutbu On BEST ANSWER

The values in df['Diff'] are numpy timedelta64s. You can compare them with pd.Timedeltas; see below.

Moreover, you do not need to call df['Diff'].apply(wdw), which calls wdw for each value in the Series; you can compare whole Series with a pd.Timedelta:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
                   'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan

df['Diff'] = df['Dt1'] - df['Dt2']
print(df)

#          Dt1        Dt2    Diff
# 0 2010-01-01        NaT     NaT
# 1 2010-01-06 2010-01-05  1 days
# 2 2010-01-11 2010-01-08  3 days
# 3 2010-01-16        NaT     NaT
# 4 2010-01-21 2010-01-14  7 days
# 5 2010-01-26 2010-01-17  9 days
# 6 2010-01-31        NaT     NaT
# 7 2010-02-05 2010-01-23 13 days
# 8 2010-02-10 2010-01-26 15 days
# 9 2010-02-15        NaT     NaT

mask = (df['Diff'] < pd.Timedelta(days=10)) & (pd.Timedelta(days=0) < df['Diff'])
print(mask)

yields

0    False
1     True
2     True
3    False
4     True
5     True
6    False
7    False
8    False
9    False
Name: Diff, dtype: bool

pd.Timedeltas were introduced in Pandas v.0.15. Here is a workaround for older version of Pandas which uses np.timedela64s:

mask = ((df['Diff'] / np.timedelta64(10, 'D') < 10)
        & (df['Diff'] / np.timedelta64(10, 'D') > 0))
0
Chris On

This also worked but isn't as good as the answer unutbu provided.

def wdw(x):
    if pd.notnull(x):
        if x/np.timedelta64(1,'D') <= 10:
            if x/np.timedelta64(1,'D') >0:
                return 1
    else:
        return 0

df['Diff'].df(wdw)
0
Golden Lion On

use assign to create a diff date dt1 and dt2 column. Then use timedelta to get a 0 day and 10 day variable to compare diff then mask the results for output.

df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
                'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
df.iloc[::3, 1] = np.nan

print(df)
zero_days=timedelta(days=0)
ten_days=timedelta(days=10)
print(zero_days,ten_days)

df['Diff']=np.empty(len(df))
df=df.assign(Diff=lambda row:  row['Dt1']-row['Dt2'])
mask=(df['Diff'] >=zero_days)&(df['Diff'] <=ten_days)  
print(df[mask])

output:

  Dt1        Dt2   Diff
1 2010-01-06 2010-01-05 1 days
2 2010-01-11 2010-01-08 3 days
4 2010-01-21 2010-01-14 7 days
5 2010-01-26 2010-01-17 9 days