Calculating the difference in dates in a Pandas GroupBy object

Question

Calculating the difference in dates in a Pandas GroupBy object

2k views Asked by invoker At 08 June 2015 at 18:07

I have a Pandas DataFrame with the following format:

In [0]: df
Out[0]: 
       col1  col2       date
 0     1     1          2015-01-01
 1     1     2          2015-01-09
 2     1     3          2015-01-10
 3     2     1          2015-02-10
 4     2     2          2015-02-10
 5     2     3          2015-02-25

In [1]: df.dtypes
Out[1]:
 col1             int64
 col2             int64
 date    datetime64[ns]
 dtype: object

We want to find the value for col2 corresponding to the greatest difference in date (between consecutive elements in the sorted-by-dates groups), grouped by col1. Assume there are no groups of size 1.

Desired Output

In [2]: output
Out[2]:
col1   col2
1      1         # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
2      2         # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest

The real df has many values for col1 that we need to groupby to do calculations. Is this possible by applying a function to the following? Please note, the dates are already in ascending order.

gb = df.groupby(col1)
gb.apply(right_maximum_date_difference)

Original Q&A

There are 2 answers

JoeCondron On 08 June 2015 at 19:14

I would try a slightly differnt tack: Pivot the table so that you have a column for each value in col2 containing the dates and the values of col1 as the index. Then you can use the .diff method to get the differences between consecutive cells. This might not work if there are duplicate col1, col2 pairs though, which is not clear from the question.

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2],
          'col2': [1, 2, 3, 1, 2, 3],
          'date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-01-10', 
                                  '2015-02-10', '2015-02-10', '2015-02-25'])})
p = df.pivot(columns='col1', index='col2', values='date')
p
    col1    1   2
col2        
1   2015-01-01  2015-02-10
2   2015-01-09  2015-02-10
3   2015-01-10  2015-02-25

p.diff().shift(-1).idxmax() 

col1
1       1
2       2

The .shift(-1) takes care of the fact that you want the first of the two consecutive dates with the largest difference.

**Ami Tavory** · Accepted Answer · 2015-06-08T18:23:40+00:00

Here's something that's almost your dataframe (I avoided copying the dates):

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': [1, 9, 10, 10, 10, 25]
})

With this, define:

def max_diff_date(g):
    g = g.sort(columns=['date'])
    return g.col2.ix[(g.date.ix[1: ] - g.date.shift(1)).argmax() - 1]

and you have:

>> df.groupby(df.col1).apply(max_diff_date)
col1
1    1
2    2
dtype: int64

TechQA.

Calculating the difference in dates in a Pandas GroupBy object

There are 2 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in TIME-SERIES

Popular Questions

Trending Questions