I would like to calculate timedelta between groups in my data.

My dataframe has two grouping variable columns and then a timecode column.

id      g1      timecode                g2          lot         
...
6753    AX8     2018-12-11 21:06:00     A911928     1153
6754    AX8     2018-12-12 06:55:00     A912152     1154
6755    AX8     2018-12-12 16:35:00     A912152     1154
6756    AX8     2018-12-13 07:10:00     A912159     1155
6757    AX8     2018-12-13 12:28:00     A911404     1156
6758    AX8     2018-12-14 06:11:00     A911927     1157
6759    AX8     2018-12-15 20:00:00     A911928     1158
6760    AX8     2018-12-16 09:59:00     A911928     1158
6761    AX8     2018-12-17 01:20:00     A911914     1159
6762    AX8     2018-12-18 20:00:00     A911928     1160
6763    AX8     2018-12-19 09:59:00     A911928     1160
...

The data is sorted in order by g1 and then within that group by timecode.

If the g1 and g2 in successive rows is the same, I have generated a same lot number for those. This is just for information, probably is not needed for this.

Now what I would like to have is a the timedelta for all g2 groups. So basically for the g2 item A911928 for example:

...
6753    AX8     2018-12-11 21:06:00     A911928     1153
...

6759    AX8     2018-12-15 20:00:00     A911928     1158
6760    AX8     2018-12-16 09:59:00     A911928     1158
...
6762    AX8     2018-12-18 20:00:00     A911928     1160
6763    AX8     2018-12-19 09:59:00     A911928     1160
...

I would like to generate new column to the dataframe (df["diff"]) that every group would have the timedelta interval to the next group, so that the last row in that group would get the value. And otherwise it would just input np.nan. In terms of the above example:

  • row 6753 would get (2018-12-15 20:00:00)-(2018-12-11 21:06:00)
  • row 6759 would get np.nan
  • row 6760 would get (2018-12-18 20:00:00)-(2018-12-16 09:59:00)
  • row 6762 would get np.nan
  • row 6763 would get the timedelta to next (if any, otherwise np.nan)

I have tried different groupbys and diff() functions, but I just don't seem to be able to quite get to my goal.

Most promising try was probably below code, but I just quite cannot cross the bridge from that to calculating the deltas and keeping the right order to then return the diffs to the original dataframe.

df.groupby(["g1", "g2", "lot"]).agg({"timecode": ["min", "max"]})

                                    timecode
g1  g2      lot             min                     max
...
AX8     A911404     1156.0  2018-12-13 12:28:00     2018-12-13 12:28:00
        A911927     1157.0  2018-12-14 06:11:00     2018-12-14 06:11:00
        ...
        A911928     1153.0  2018-04-25 05:57:00     2018-04-25 05:57:00
                    1158.0  2018-12-15 20:00:00     2018-12-16 09:59:00
                    1160.0  2018-12-18 20:00:00     2018-12-19 09:59:00
...

I have also played around with iterrows() solution, but at least my code seems to be quite slow with the data.

0 Answers