I have a dataset for one bus line, recorded every day, with 32 buses and two values of route_direction (0, 1). The first direction has 18 stations, each with a seq from 1 to 18, and the other direction has 15 stations with seq 1 to 15. The time at which a bus enters and exits each station is recorded. Each record contains bus_id, route_direction, station_seq, in_time, out_time, station_id.

       route_id  route_direction  bus_id      station_seq  schdeule_date  in_time   out_time
    0  59        1                1349508393  2            2021-01-01     05:04:31  05:04:58
    1  59        1                1349508393  2            2021-01-01     05:04:27  05:04:58
    2  59        1                1349508393  2            2021-01-01     05:04:31  05:06:31
    3  59        1                1349508393  2            2021-01-01     05:04:27  05:06:31
    4  59        1                1349508393  1            2021-01-01     05:00:35  05:00:56

First I tried to group by some columns to give an index to each trip:

grouped = df.groupby(['bus_id', 'route_direction'])

I get something like this:

    index  route_id  route_direction  bus_id      station_seq  schdeule_date  in_time   out_time
    654    59        0                1349508329  1            2021-01-01     NaN       06:34:10
    663    59        0                1349508329  2            2021-01-01     06:33:34  06:34:04
    664    59        0                1349508329  2            2021-01-01     06:33:33  06:34:04
    677    59        0                1349508329  2            2021-01-01     06:33:34  06:35:34
    678    59        0                1349508329  2            2021-01-01     06:33:33  06:35:34
    ...    ...       ...              ...         ...          ...            ...       ...
    12133  59        0                1349508329  12           2021-01-01     NaN       NaN

As you can see, there are also duplicates: the same bus_id enters and exits the same station at almost the same date and time. I tried drop_duplicates, but it did not work well:

df = df.drop_duplicates(subset=['bus_id', 'route_direction', 'station_seq', 'station_id', 'in_time'], keep='first').reset_index(drop=True)
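A likely reason this has little effect is that drop_duplicates only removes rows whose subset columns match exactly, so records whose in_time differs by a few seconds all survive. A minimal sketch using the four station_seq 2 rows shown above (station_id is left out of the subset here because it does not appear in the printed sample; the names `sample` and `deduped` are only for illustration):

```python
import pandas as pd

# Rows copied from the sample above (bus 1349508393, station_seq 2).
sample = pd.DataFrame({
    "bus_id": [1349508393] * 4,
    "route_direction": [1] * 4,
    "station_seq": [2] * 4,
    "in_time": ["05:04:31", "05:04:27", "05:04:31", "05:04:27"],
    "out_time": ["05:04:58", "05:04:58", "05:06:31", "05:06:31"],
})

# Exact-match deduplication: 05:04:31 and 05:04:27 count as different values,
# so one row per distinct in_time is kept.
deduped = sample.drop_duplicates(
    subset=["bus_id", "route_direction", "station_seq", "in_time"],
    keep="first",
).reset_index(drop=True)

print(deduped)
```

Two rows remain for what looks like a single station visit, which is why a time tolerance is needed rather than an exact comparison.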

There are also some NaN values in in_time or out_time, so if I use dropna I may lose the records for some of the stations along the bus line.
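One possible way to keep such rows, sketched here under the assumption that out_time is an acceptable stand-in when in_time is missing, is to build a separate timestamp column that falls back to out_time instead of dropping anything. The column name `event_t` and the toy frame `sample` are hypothetical, and the time columns are assumed to be "HH:MM:SS" strings as in the printed sample:

```python
import pandas as pd

# Three rows modelled on the sample above: two of them have a missing in_time.
sample = pd.DataFrame({
    "schdeule_date": ["2021-01-01"] * 3,
    "station_seq": [1, 2, 12],
    "in_time": [None, "06:33:34", None],
    "out_time": ["06:34:10", "06:34:04", None],
})

# Use in_time when present, otherwise out_time (still missing if both are).
event_time = sample["in_time"].fillna(sample["out_time"])

# Date plus time-of-day offset; a missing time simply yields NaT.
sample["event_t"] = pd.to_datetime(sample["schdeule_date"]) + pd.to_timedelta(event_time)

print(sample)
```

Only the row where both times are missing ends up without a timestamp, so it can be handled separately rather than removed together with every partially filled row.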

How can I group each bus's records into trips and give each trip an id, and how can I drop the duplicated records in this case (small differences in entry time)? Any help will be appreciated.

Answer by Ferris:
  1. Sort the rows by 'bus_id' and 'in_time' with sort_values.
  2. Group by 'bus_id' and, for every bus_id, calculate the time difference between each record and its previous record.
  3. Tag a record with 0 if that difference is less than 60 s and with 1 otherwise, so that a 1 starts a new group and records less than 60 s apart stay in the current group.
  4. Use cumsum on the tag to create a group tag.
  5. Group by the group tag and, for every group, keep min(in_time) and max(out_time).
import numpy as np
import pandas as pd

# Build a datetime from the date and in_time strings, then sort by bus and time
df['in_time_t'] = pd.to_datetime(df['schdeule_date'] + ' ' + df['in_time'])
df = df.sort_values(['bus_id', 'in_time_t'])

# Time difference to the previous record of the same bus
df['t_diff'] = df.groupby('bus_id')['in_time_t'].diff()

# Tag: 0 continues the current group (< 60 s apart), 1 starts a new one.
# total_seconds() is used instead of .dt.seconds, which ignores whole days.
# A missing t_diff (NaT) compares as False, so it also starts a new group.
cond = df['t_diff'].dt.total_seconds() < 60
df['tag'] = np.where(cond, 0, 1).cumsum()

# For every group tag keep min(in_time) and max(out_time)
df_result = df.groupby(['route_id', 'route_direction', 'bus_id', 'station_seq', 'schdeule_date',
       'tag']).agg({'in_time':'min', 'out_time':'max'}).reset_index()
df
        route_id    route_direction bus_id  station_seq schdeule_date   in_time out_time
    0   59  1   1349508393  2   2021-01-01  05:04:31    05:04:58
    1   59  1   1349508393  2   2021-01-01  05:04:27    05:04:58
    2   59  1   1349508393  2   2021-01-01  05:04:31    05:06:31
    3   59  1   1349508393  2   2021-01-01  05:04:27    05:06:31
    4   59  1   1349508393  1   2021-01-01  05:00:35    05:00:56
    654 59  0   1349508329  1   2021-01-01  NaN 06:34:10
    663 59  0   1349508329  2   2021-01-01  06:33:34    06:34:04
    664 59  0   1349508329  2   2021-01-01  06:33:33    06:34:04
    677 59  0   1349508329  2   2021-01-01  06:33:34    06:35:34
    678 59  0   1349508329  2   2021-01-01  06:33:33    06:35:34
    12133   59  0   1349508329  12  2021-01-01  NaN NaN

df_result
        route_id    route_direction bus_id  station_seq schdeule_date   tag in_time out_time
    0   59  0   1349508329  1   2021-01-01  2   NaN 06:34:10
    1   59  0   1349508329  2   2021-01-01  1   06:33:33    06:35:34
    2   59  0   1349508329  12  2021-01-01  3   NaN NaN
    3   59  1   1349508393  1   2021-01-01  4   05:00:35    05:00:56
    4   59  1   1349508393  2   2021-01-01  5   05:04:27    05:06:31

df with tag
|       |   route_id |   route_direction |     bus_id |   station_seq | schdeule_date   | in_time   | out_time   | in_time_t           | t_diff          |   tag |
|------:|-----------:|------------------:|-----------:|--------------:|:----------------|:----------|:-----------|:--------------------|:----------------|------:|
|   664 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:34:04   | 2021-01-01 06:33:33 | NaT             |     1 |
|   678 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:33  | 06:35:34   | 2021-01-01 06:33:33 | 0 days 00:00:00 |     1 |
|   663 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:34:04   | 2021-01-01 06:33:34 | 0 days 00:00:01 |     1 |
|   677 |         59 |                 0 | 1349508329 |             2 | 2021-01-01      | 06:33:34  | 06:35:34   | 2021-01-01 06:33:34 | 0 days 00:00:00 |     1 |
|   654 |         59 |                 0 | 1349508329 |             1 | 2021-01-01      | nan       | 06:34:10   | NaT                 | NaT             |     2 |
| 12133 |         59 |                 0 | 1349508329 |            12 | 2021-01-01      | nan       | nan        | NaT                 | NaT             |     3 |
|     4 |         59 |                 1 | 1349508393 |             1 | 2021-01-01      | 05:00:35  | 05:00:56   | 2021-01-01 05:00:35 | NaT             |     4 |
|     1 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:04:58   | 2021-01-01 05:04:27 | 0 days 00:03:52 |     5 |
|     3 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:27  | 05:06:31   | 2021-01-01 05:04:27 | 0 days 00:00:00 |     5 |
|     0 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:04:58   | 2021-01-01 05:04:31 | 0 days 00:00:04 |     5 |
|     2 |         59 |                 1 | 1349508393 |             2 | 2021-01-01      | 05:04:31  | 05:06:31   | 2021-01-01 05:04:31 | 0 days 00:00:00 |     5 |
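
The steps above deduplicate station visits but do not assign a trip id, which the question also asks for. One possible sketch, not part of the original answer: assume a new trip starts whenever, for the same bus, route_direction changes or station_seq drops compared with the previous visit. The frame `visits` below is made-up data for illustration, and `trip_id` is a hypothetical column name.

```python
import pandas as pd

# Made-up, already deduplicated visits for two buses (illustration only).
visits = pd.DataFrame({
    "bus_id": [7, 7, 7, 7, 7, 7, 8, 8],
    "route_direction": [0, 0, 0, 1, 1, 0, 0, 0],
    "station_seq": [1, 2, 3, 1, 2, 1, 1, 2],
    "in_time_t": pd.to_datetime([
        "2021-01-01 05:00", "2021-01-01 05:05", "2021-01-01 05:10",
        "2021-01-01 05:30", "2021-01-01 05:35", "2021-01-01 06:00",
        "2021-01-01 05:02", "2021-01-01 05:07",
    ]),
})

visits = visits.sort_values(["bus_id", "in_time_t"]).reset_index(drop=True)

# Previous visit of the same bus (NaN for each bus's first record).
prev = visits.groupby("bus_id")[["route_direction", "station_seq"]].shift()

# Assumed rule: first record of a bus, a direction change,
# or a drop in station_seq starts a new trip.
new_trip = (
    prev["route_direction"].isna()
    | (visits["route_direction"] != prev["route_direction"])
    | (visits["station_seq"] < prev["station_seq"])
)
visits["trip_id"] = new_trip.cumsum()

print(visits)
```

Under this rule the sample yields trips 1, 1, 1, 2, 2, 3 for bus 7 and 4, 4 for bus 8. Records with a missing in_time would need a usable timestamp first, since the rule depends on the sort order.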