Pandas: add values to two new columns in table if values in two previous columns are continuous?

127 views Asked by At

So, I was trying to accomplish this in SQL but was advised there would be a simple way to do this in Pandas... I would appreciate your help/hints!

I currently have the table on the left with two columns (begin subsession and end subsession), and I would like to add the two left columns "session start" and "session end". I know how to simply add the columns, but I can't figure out the query that would allow me to identify the continuous values in the two original columns (ie the end sub-session value is the same as the next rows begin sub-session value) and then add the first begin session value, and last end session value (for continuous rows) to the respective rows in my new columns. Please refer to the image.. for example, for the first three rows the "end subsession" value is the same as the next rows "begin subsession" values, so the first three "session start" and "session end" would be the same, with the minimum of the "begin subsession" values and the maximum "end sub session" value.

I was trying something along these lines in SQL, obviously didn't work, and I realized the aggregate function doesn't work in this case...

SELECT
FROM viewershipContinuous =
    CASE 
        WHEN endSubsession.ROWID = beginSubession.ROWID+1
        THEN MIN(beginSubsession)
        ELSE beginSubsession.ROWID+1
        END;

The table on the left is what I have, the table on the right is what I want to achieve

enter image description here

1

There are 1 answers

0
jezrael On BEST ANSWER

You can first compare next value by shifted column esub with column bsub if not equal (!=) and then create groups by cumsum:

s = df['bsub'].ne(df['esub'].shift()).cumsum()
print (s)
0    1
1    1
2    1
3    2
4    2
5    2
6    2
7    3
8    3
dtype: int32

Then groupby by Series s with transform min and max:

g = df.groupby(s)
df['session start'] = g['bsub'].transform('min')
df['session end'] = g['esub'].transform('max')
print (df)
   bsub  esub  session start  session end
0  1700  1705           1700         1800
1  1705  1730           1700         1800
2  1730  1800           1700         1800
3  1900  1920           1900         1965
4  1920  1950           1900         1965
5  1950  1960           1900         1965
6  1960  1965           1900         1965
7  2000  2001           2000         2002
8  2001  2002           2000         2002