How can I work better with dates in Python to remove NaNs and identify workdays and holidays between two intervals?


I have a dataframe with two date fields as shown below. I want to be able to use this data to calculate 'adjusted pay' for an employee - if the employee joined after the 15th of a month, they are paid for 15 days of March + April on the 10th of the month (payday), and equally if they leave in April, the calculation should only consider the days worked in April.

Hire_Date  | Leaving_Date
_________________________
01/02/2007 | NaN
02/03/2007 | NaN
23/03/2020 | NaN
01/01/1999 | 04/04/2020

Oh and the above data didn't pull through in datetime format, and there are of course plenty of NaNs in the leaving_date field :)

Therefore, I did the following:

Converted the data to datetime format, retained the date, and filled N/As with a random date (not too happy about this, but this is only missing in a few records so not worried about the impact).

df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
df['Hire_Date'] = [a.date() for a in df['Hire_Date']]
df['Hire_Date'] = df['Hire_Date'].fillna('1800-01-01')

Repeated for Leaving date. Only difference here is that I've filled the NaNs with 0, given that we don't have that many leavers.

df['Leaving_Date'] = pd.to_datetime(df['Leaving_Date'])
df['Leaving_Date'] = [a.date() for a in df['Leaving_Date']]
df['Leaving_Date'] = df['Leaving_Date'].fillna('0')

I then ended up creating a fresh column to capture workdays, and here's where I run into the issue. My code is given below.

I identified the first day of the hire month, and attempted to work out the number of days worked in March, using a np.where() function.

df['z_First_Day_H_Month'] = df['Hire_Date'].values.astype('datetime64[M]')
df['March_Workdays'] = np.where((df['z_First_Day_H_Month'] >= '2020-03-01'), 
(np.busday_count(df['z_First_Day_H_Month'], '2020-03-31')), 'N/A') 

Similar process repeated, albeit a simpler calculation to work out the number of days worked in the termination month.

df['z_First_Day_T_Month'] = df.apply(lambda x: '2020-04-01').astype('datetime64[M]')
df['T_Mth_Workdays'] = df.apply(lambda x: np.busday_count(x['z_First_Day_T_Month'],
x['Leaving_Date']), axis=1)

However, the above process returns the following error:

TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'

Please can I get some help to fix this issue? Thanks!


There is 1 answer

Jae On BEST ANSWER

I did a bit of research, and it looks like the datetime unit is the problem. Your values are held with [ns] (nanosecond) precision, whereas np.busday_count expects plain dates, which is the [D] (day) unit. NumPy will not make that cast implicitly under the 'safe' casting rule, which is what causes the error. Take a look at the NumPy datetime documentation and check the Datetime Units section.
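As a rough sketch of that idea, you could cast the dates to day precision yourself before calling np.busday_count. The sample frame below is rebuilt from the table in your question, and I am assuming the dates are DD/MM/YYYY and that you want the workdays in March 2020 from the hire date onwards; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Hire_Date": ["01/02/2007", "02/03/2007", "23/03/2020", "01/01/1999"],
})

# Assumption: the strings are day-first (DD/MM/YYYY).
hire = pd.to_datetime(df["Hire_Date"], format="%d/%m/%Y")

# Explicitly cast to day precision; np.busday_count will not do this
# implicitly for higher-precision datetimes.
hire_days = hire.to_numpy().astype("datetime64[D]")

# Anyone hired before March 2020 is counted from 1 March 2020.
start = np.maximum(hire_days, np.datetime64("2020-03-01"))

# The end date is exclusive, so 1 April covers the whole of March.
march_workdays = np.busday_count(start, np.datetime64("2020-04-01"))

df["March_Workdays"] = march_workdays
print(df)
```

Note that the end date in np.busday_count is not included in the count, so passing '2020-03-31' as in your code would leave out the last day of March.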

Numpy, TypeError: Could not be cast from dtype('<M8[us]') to dtype('<M8[D]')

Take a look at the post above. It is exactly the same problem as yours!
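For the leaving dates, one possible approach is to keep the missing values as NaT rather than filling them with '0', and only run the count on rows that actually have a date. This is a sketch under my own assumptions: the dates are DD/MM/YYYY, the leaving day itself is not counted, and the two holidays listed (Good Friday and Easter Monday 2020) are just example values for the holidays argument of np.busday_count.

```python
import numpy as np
import pandas as pd

# Leaving dates from the question's table; missing values stay missing.
leaving = pd.to_datetime(
    pd.Series([None, None, None, "04/04/2020"]), format="%d/%m/%Y"
)

# Example holiday list (an assumption, replace with your own calendar).
holidays = np.array(["2020-04-10", "2020-04-13"], dtype="datetime64[D]")

# np.busday_count cannot handle NaT, so select the leavers first.
has_left = leaving.notna().to_numpy()
leave_days = leaving.to_numpy().astype("datetime64[D]")

# NaN marks employees who have not left.
april_workdays = np.full(len(leaving), np.nan)
april_workdays[has_left] = np.busday_count(
    np.datetime64("2020-04-01"),   # first day of the termination month
    leave_days[has_left],          # exclusive end date
    holidays=holidays,
)

print(april_workdays)
```

If the leaving day should count as a worked day, you could add one day to leave_days[has_left] before passing it in.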