I have dataset for one bus line every day with 32 buses and two route_direction(0,1)
, in the first direction there are 18 stations each one have a seq from 1 to 18 and the other direction has 15 station with seq(1-15) and recorded time when enter/exit each station.
each record contains bus_id, route_direction, station_seq, in_time, out_time, station_id.
enter image description here
route_id route_direction bus_id station_seq schdeule_date in_time out_time
0 59 1 1349508393 2 2021-01-01 05:04:31 05:04:58
1 59 1 1349508393 2 2021-01-01 05:04:27 05:04:58
2 59 1 1349508393 2 2021-01-01 05:04:31 05:06:31
3 59 1 1349508393 2 2021-01-01 05:04:27 05:06:31
4 59 1 1349508393 1 2021-01-01 05:00:35 05:00:56
first I have tried to groupby some column to give index to each trip with this:
grouped = df.groupby(['bus_id', 'route_direction'])
I get something like in this imageenter image description here:
index route_id route_direction bus_id station_seq schdeule_date in_time out_time
654 59 0 1349508329 1 2021-01-01 NaN 06:34:10
663 59 0 1349508329 2 2021-01-01 06:33:34 06:34:04
664 59 0 1349508329 2 2021-01-01 06:33:33 06:34:04
677 59 0 1349508329 2 2021-01-01 06:33:34 06:35:34
678 59 0 1349508329 2 2021-01-01 06:33:33 06:35:34
... ... ... ... ... ... ... ...
12133 59 0 1349508329 12 2021-01-01 NaN NaN
As you can see there is also duplicates in the same station enter exit for the same bus_id in almost the same date and time: I have tried drop duplicates but no luck to work well:
df = df.drop_duplicates(subset=['bus_id', 'route_direction', 'station_seq', 'station_id', 'in_time'], keep='first').reset_index(drop=True)
also there is some NaN values in in_time or out_time so if I dropna then I will may miss records for one of the stations along the bus line.
Any help to group each bus records in one trip to give it id and how can I drop the duplicated records in this case(small different in entering time)? Any help will be appreciated.
cumsum
on the tag, to create grouptag