Calculating the difference in dates in a Pandas GroupBy object

1.9k views Asked by At

I have a Pandas DataFrame with the following format:

In [0]: df
Out[0]: 
       col1  col2       date
 0     1     1          2015-01-01
 1     1     2          2015-01-09
 2     1     3          2015-01-10
 3     2     1          2015-02-10
 4     2     2          2015-02-10
 5     2     3          2015-02-25

In [1]: df.dtypes
Out[1]:
 col1             int64
 col2             int64
 date    datetime64[ns]
 dtype: object

We want to find the value for col2 corresponding to the greatest difference in date (between consecutive elements in the sorted-by-dates groups), grouped by col1. Assume there are no groups of size 1.

Desired Output

In [2]: output
Out[2]:
col1   col2
1      1         # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2      2         # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest

The real df has many values for col1 that we need to groupby to do calculations. Is this possible by applying a function to the following? Please note, the dates are already in ascending order.

gb = df.groupby(col1)
gb.apply(right_maximum_date_difference)
2

There are 2 answers

5
Ami Tavory On BEST ANSWER

Here's something that's almost your dataframe (I avoided copying the dates):

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': [1, 9, 10, 10, 10, 25]
})

With this, define:

def max_diff_date(g):
    g = g.sort(columns=['date'])
    return g.col2.ix[(g.date.ix[1: ] - g.date.shift(1)).argmax() - 1]

and you have:

>> df.groupby(df.col1).apply(max_diff_date)
col1
1    1
2    2
dtype: int64
0
JoeCondron On

I would try a slightly differnt tack: Pivot the table so that you have a column for each value in col2 containing the dates and the values of col1 as the index. Then you can use the .diff method to get the differences between consecutive cells. This might not work if there are duplicate col1, col2 pairs though, which is not clear from the question.

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
          'col2': [1, 2, 3, 1, 2, 3],
          'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10', 
                                  '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
    col1    1   2
col2        
1   2015-01-01  2015-02-10
2   2015-01-09  2015-02-10
3   2015-01-10  2015-02-25

p.diff().shift(-1).idxmax() 

col1
1       1
2       2

The .shift(-1) takes care of the fact that you want the first of the two consecutive dates with the largest difference.