I have large CSV files of traffic data, similar to the sample below, for which I need to calculate the total bytes and the duration of each data transfer. The time ranges overlap, so overlapping ranges must be merged before computing the duration:
first_packet_ts  last_packet_ts  bytes_uplink  bytes_downlink  service   user_id
1441901695012    1441901696009   165           1212            facebook  3
1441901695500    1441901696212   23            4321            facebook  3
1441901698000    1441901698010   242           3423            youtube   4
1441901698400    1441901698500   423           2344            youtube   4
Desired output:

duration  bytes_uplink  bytes_downlink  service   user_id
1200      188           5533            facebook  3
110       665           5767            youtube   4

(For the facebook rows, the two intervals overlap and merge into a single interval from 1441901695012 to 1441901696212, hence the duration of 1200; the youtube intervals are disjoint, so their durations 10 and 100 are simply added.)
I currently use something like the following lines:
import pandas as pd

df = pd.read_csv(input_file_path)
grouped = df.groupby(['service', 'user_id'])
durations = grouped.apply(calculate_duration)
df = grouped[['bytes_uplink', 'bytes_downlink']].sum()
df['duration'] = durations  # both share the (service, user_id) index
df = df.reset_index()
The calculate_duration function (below) iterates over the contents of each group, merges the overlapping time intervals, and returns the total duration, which is then joined to the summed dataframe df.
def calculate_duration(group):
    # Assumes the rows of each group are sorted by first_packet_ts.
    ranges = group[['first_packet_ts', 'last_packet_ts']].itertuples()
    duration = 0
    for _, current_start, current_stop in ranges:
        # The inner loop consumes the remainder of the same iterator,
        # so the outer loop body runs only once.
        for _, start, stop in ranges:
            if start > current_stop:
                # Disjoint interval: close the current one and start a new one.
                duration += current_stop - current_start
                current_start, current_stop = start, stop
            else:
                # Overlapping interval: extend the current one.
                current_stop = max(current_stop, stop)
        duration += current_stop - current_start
    return duration
This approach is very slow because it iterates over the rows in pure Python and calls calculate_duration via apply for every group.
Is there a more efficient way to calculate the duration of the data transfers, merging the overlapping intervals, using pandas (avoiding iteration somehow?), preferably without resorting to Cython?
How about this? (Having timed it, it might be a bit slower...)
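For instance, keep the groupby/apply structure but let pandas merge the intervals inside each group with a cumulative maximum instead of the nested Python loop. A minimal sketch (merged_duration is just an illustrative name, and df is assumed to be the raw frame straight from read_csv):

def merged_duration(group):
    # Order the intervals of this group by their start timestamp.
    group = group.sort_values('first_packet_ts')
    # Running maximum of the end timestamps, shifted one row so each
    # interval sees only the intervals that precede it.
    prev_stop = group['last_packet_ts'].cummax().shift()
    # An interval opens a new merged block when it starts after everything
    # before it has ended; cumsum turns the boolean flags into block labels.
    block = (group['first_packet_ts'] > prev_stop).cumsum()
    spans = group.groupby(block).agg(start=('first_packet_ts', 'min'),
                                     stop=('last_packet_ts', 'max'))
    return (spans['stop'] - spans['start']).sum()

durations = df.groupby(['service', 'user_id']).apply(merged_duration)

This still makes one Python call per group, which is why it may not beat your version on frames with many small groups.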
Edit: I don't think this is any more valid than yours, but you could try something along these lines:
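A fully vectorised sketch, assuming the whole frame fits in memory: apply the same cumulative-maximum trick once across the sorted frame, label the merged intervals, and aggregate, with no per-group Python call at all.

import pandas as pd

df = pd.read_csv(input_file_path)
keys = ['service', 'user_id']

# Sort so that, within each group, intervals appear by start timestamp.
df = df.sort_values(keys + ['first_packet_ts'])

# Running maximum of end timestamps within each group, shifted one row
# down so every row sees only the intervals before it in its group.
running_stop = df.groupby(keys)['last_packet_ts'].cummax()
prev_stop = running_stop.groupby([df[k] for k in keys]).shift()

# A row opens a new merged interval when it starts after everything
# before it has ended, or when it is the first row of its group.
new_interval = (df['first_packet_ts'] > prev_stop) | prev_stop.isna()
interval_id = new_interval.cumsum().rename('interval_id')

# Collapse each merged interval to its span, then sum spans per group.
spans = df.groupby(keys + [interval_id]).agg(
    start=('first_packet_ts', 'min'),
    stop=('last_packet_ts', 'max'))
durations = (spans['stop'] - spans['start']).groupby(level=keys).sum()

result = df.groupby(keys)[['bytes_uplink', 'bytes_downlink']].sum()
result['duration'] = durations
result = result.reset_index()

On the sample data this yields durations of 1200 and 110, matching the desired output; everything is done with column-wise operations and groupby aggregations.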