Find user sessions by grouping nearby timestamps


I would like to GROUP BY rows in SQL based on "time of inactivity", that is, detect large gaps between timestamps in order to recognize user sessions.

Assume this actions table in an SQLite database:

_actions_
| id | userID | timestamp | actionType |

I can already view all user actions and their types grouped by hour, month, year, etc.:

SELECT 
    userID, strftime('%H', timestamp), group_concat(actionType)
FROM 
    actions 
GROUP BY userID, strftime('%H', timestamp);

However, that is not completely what I want. Consider this:

| 1 | 333 | 13.11.2014-20:59 | action6 |
| 2 | 333 | 13.11.2014-21:01 | action3 |
| 3 | 333 | 13.11.2014-21:47 | action5 |

My query groups these into row 1 and rows 2+3 (by hour); however, if we assume a session timeout of 30 minutes, we should get rows 1+2 and row 3. Any SQL ideas?

Sure, one could load the database in R, Pandas, etc. and check for each user whether current_timestamp - last_timestamp > 30min, but is this really the way to go?

1 Answer

Answered by CL.:

To determine whether a row is the first row in a session, check whether any other row of the same user exists in the previous 30 minutes:

SELECT *,
       NOT EXISTS (SELECT 1
                   FROM actions AS prev
                   WHERE userid = actions.userid
                     AND timestamp < actions.timestamp
                     AND timestamp >= datetime(actions.timestamp, '-30 minutes')
                  ) AS first
FROM actions;

id          userid      timestamp         actiontype  first     
----------  ----------  ----------------  ----------  ----------
1           333         2014-11-13 20:59  action6     1         
2           333         2014-11-13 21:01  action3     0         
3           333         2014-11-13 21:47  action5     1         
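
Note that the output above shows timestamps as 'YYYY-MM-DD HH:MM'. SQLite's date functions such as datetime() only understand ISO-8601-style strings, so values stored as 'DD.MM.YYYY-HH:MM' (as in the question) would need converting first. A minimal sketch of one way to do that, assuming Python's built-in sqlite3 module, an in-memory copy of the table, and that every value has exactly the 16-character layout shown in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions ("
    "id INTEGER PRIMARY KEY, userid INTEGER, timestamp TEXT, actiontype TEXT)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, 333, "13.11.2014-20:59", "action6"),
        (2, 333, "13.11.2014-21:01", "action3"),
        (3, 333, "13.11.2014-21:47", "action5"),
    ],
)

# Rearrange 'DD.MM.YYYY-HH:MM' into 'YYYY-MM-DD HH:MM' using substr().
# Assumption: all stored values follow exactly this fixed-width layout.
conn.execute(
    """
    UPDATE actions
    SET timestamp = substr(timestamp, 7, 4) || '-' ||   -- year
                    substr(timestamp, 4, 2) || '-' ||   -- month
                    substr(timestamp, 1, 2) || ' ' ||   -- day
                    substr(timestamp, 12, 5)            -- HH:MM
    """
)

normalized = [row[0] for row in conn.execute("SELECT timestamp FROM actions ORDER BY id")]
print(normalized)
```

After this conversion, plain string comparison of timestamps also matches chronological order, which the queries here rely on.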

But this does not help directly with grouping; we need some value that is the same for all rows in a group and differs between groups.

Let's take the ID of the first row in the group. To find the first row of the current row's group, we take the latest row that starts a session and is not after the current row:

SELECT *,
       (SELECT id
        FROM actions AS first
        WHERE userid = actions.userid
          AND timestamp <= actions.timestamp
          AND NOT EXISTS (SELECT 1
                          FROM actions AS prev
                          WHERE userid = first.userid
                            AND timestamp < first.timestamp
                            AND timestamp >= datetime(first.timestamp, '-30 minutes')
                         )
        ORDER BY timestamp DESC
        LIMIT 1
       ) AS sessionid
FROM actions;

id          userid      timestamp         actiontype  sessionid 
----------  ----------  ----------------  ----------  ----------
1           333         2014-11-13 20:59  action6     1         
2           333         2014-11-13 21:01  action3     1         
3           333         2014-11-13 21:47  action5     3         
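
With sessionid available, the grouping asked for in the question can be done by wrapping the query in an outer GROUP BY. The following is a sketch rather than a definitive implementation: it assumes Python's built-in sqlite3 module and the three sample rows with ISO-style timestamps, and the aliases first_row, session_start, session_end and action_list are names made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions ("
    "id INTEGER PRIMARY KEY, userid INTEGER, timestamp TEXT, actiontype TEXT)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, 333, "2014-11-13 20:59", "action6"),
        (2, 333, "2014-11-13 21:01", "action3"),
        (3, 333, "2014-11-13 21:47", "action5"),
    ],
)

SESSION_QUERY = """
SELECT userid,
       sessionid,
       min(timestamp)           AS session_start,
       max(timestamp)           AS session_end,
       group_concat(actiontype) AS action_list
FROM (SELECT *,
             -- id of the latest session-starting row not after this row
             (SELECT id
              FROM actions AS first_row
              WHERE first_row.userid = actions.userid
                AND first_row.timestamp <= actions.timestamp
                AND NOT EXISTS (SELECT 1
                                FROM actions AS prev
                                WHERE prev.userid = first_row.userid
                                  AND prev.timestamp < first_row.timestamp
                                  AND prev.timestamp >= datetime(first_row.timestamp, '-30 minutes'))
              ORDER BY first_row.timestamp DESC
              LIMIT 1) AS sessionid
      FROM actions)
GROUP BY userid, sessionid
ORDER BY userid, session_start
"""

sessions = conn.execute(SESSION_QUERY).fetchall()
for row in sessions:
    print(row)
```

For the sample data this yields one session containing rows 1 and 2 and a second session containing only row 3. The order of values inside group_concat() is not guaranteed, so action_list should not be relied on to be chronological.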

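To make this query efficient, the timestamp column must be indexed. Because every subquery also filters on userid, a composite index on (userid, timestamp) is likely to serve these lookups best.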
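
One possible way to create such an index, sketched with Python's built-in sqlite3 module. The composite (userid, timestamp) layout and the index name idx_actions_user_time are assumptions made for this example, chosen because the subqueries filter on userid as well as on timestamp. EXPLAIN QUERY PLAN can then be used to check whether SQLite picks the index up; the exact wording of its output varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions ("
    "id INTEGER PRIMARY KEY, userid INTEGER, timestamp TEXT, actiontype TEXT)"
)

# Hypothetical index name; userid first (equality test), timestamp second (range test).
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_actions_user_time ON actions (userid, timestamp)"
)

# Confirm that the index was created.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'actions'"
    )
]

# Ask SQLite how it would run the kind of lookup used by the inner subquery.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT 1 FROM actions "
    "WHERE userid = ? AND timestamp < ? AND timestamp >= ?",
    (333, "2014-11-13 21:47", "2014-11-13 21:17"),
).fetchall()

print(index_names)
for step in plan:
    print(step[-1])  # last column holds the human-readable plan detail
```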