Pandas resample and aggregate with condition

586 views Asked by At

I have a DataFrame which have a column with statuses like this:

datetime               |    session    |    try       |    status
2020-09-17 10:00:01    |    '1a'       |    '1a_1'    |    'success'
2020-09-17 10:00:02    |    '2a'       |    '2a_1'    |    'fail'
2020-09-17 10:00:03    |    '2a'       |    '2a_2'    |    'success'
2020-09-17 10:00:03    |    '3a'       |    '3a_1'    |    'interrupted'
2020-09-18 13:00:04    |    '4a'       |    '4a_1'    |    'fail'

I want to resample data by day with counting status types by condition in sessions (not tries).

I can resample it by tries easyly like this:

df['date'] = df['datetime'].dt.date
df['ones'] = np.ones(df.shape[0])
piv = df.pivot_table(index='date', columns='status', values='ones', aggfunc=len).fillna(0)

And have:

day           |    success    |    fail    |    interrupted
2020-09-17    |    2          |    2       |    1
2020-09-18    |    0          |    1       |    0

But I want to aggregate it by session with condition no matter how much tries in session.:

  • if 'success' status in session tries, then success +1, fail +0, interrupted +0;
  • if 'interrupted' status in session AND NO 'success' status in session, then success +0, fail +0, interrupted +1;
  • if NO 'interrupted' AND 'success' statuses in session, then then success +0, fail +1, interrupted +0.

So I shoul get something like this:

day           |    success    |    fail    |    interrupted
2020-09-17    |    2          |    0       |    1
2020-09-18    |    0          |    1       |    0

I stuck with function and all I come up with ends with "ValueError: The truth value of a Series is ambiguous". I will be very greatfull for any ideas.

2

There are 2 answers

2
jezrael On BEST ANSWER

My idea is convert values of statust to ordered categories, sorting and get row only with most important value which is defined in list passed to parameter categories:

print (df)
             datetime session   try       status
0 2020-09-17 10:00:01      1a  1a_1      success
1 2020-09-17 10:00:02      2a  2a_1         fail
2 2020-09-17 10:00:03      2a  2a_2      success
3 2020-09-17 10:00:03      3a  3a_1  interrupted
4 2020-09-18 13:00:04      4a  4a_1         fail
5 2020-09-19 10:00:01      1a  1a_1  interrupted
6 2020-09-19 10:00:02      1a  2a_1         fail
7 2020-09-19 10:00:03      2a  2a_2      success
8 2020-09-19 10:00:03      2a  3a_1  interrupted

df['status'] = pd.Categorical(df['status'], 
                              ordered=True, 
                              categories=['success','interrupted','fail'])
df['date'] = df['datetime'].dt.date

df1 = df.sort_values(['date','status']).drop_duplicates(['date','session'])
print (df1)
             datetime session   try       status        date
0 2020-09-17 10:00:01      1a  1a_1      success  2020-09-17
2 2020-09-17 10:00:03      2a  2a_2      success  2020-09-17
3 2020-09-17 10:00:03      3a  3a_1  interrupted  2020-09-17
4 2020-09-18 13:00:04      4a  4a_1         fail  2020-09-18
7 2020-09-19 10:00:03      2a  2a_2      success  2020-09-19
5 2020-09-19 10:00:01      1a  1a_1  interrupted  2020-09-19

piv = pd.crosstab(df1['date'], df1['status'])
print (piv)
status      success  interrupted  fail
date                                  
2020-09-17        2            1     0
2020-09-18        0            0     1
2020-09-19        1            1     0
0
Paddy Alton On

pandas has quite a lot of different ways to aggregate information. One option here would be to use a groupby. I like this approach as it's quite robust - another answer outlines a very elegant approach using crosstab but I wonder how extensible that would be.

NB your description of what you want doesn't seem to match the example output - you mentioned that you wanted to aggregate by session, rather than date. It is easy to do either of these with this approach.

def count_successes(x):

    s = list(x)

    return len([i for i in s if i == "success"])

def count_interrupteds(x):

    s = list(x)

    if "success" in s:
        return 0

    return len([i for i in s if i == "interrupted"])

def count_failures(x):

    s = list(x)

    if "success" in s:
        return 0
    
    if "interrupted" in s:
        return 0

    return len([i for i in s if i == "fail"])
    
    

df["date"] = df.datetime.dt.date

result = (
  df.groupby("date", as_index=False)
    .status
    .agg({ 
      "success": count_successes,
      "fail": count_failures,
      "interrupted": count_interrupteds,
    })
)

This produces your example output. You can see how the functions I've defined to generate each column are quite simple but could become arbitrarily complex.

To group by session each day instead, use groupby(["date", "session"] in place of groupby("date".