I have large CSV files of traffic data, similar to the sample below, for which I need to calculate the total bytes and the duration of each data transfer. The time ranges overlap, so overlapping ranges must be merged before computing the duration:
first_packet_ts  last_packet_ts  bytes_uplink  bytes_downlink  service   user_id
1441901695012    1441901696009   165           1212            facebook  3
1441901695500    1441901696212   23            4321            facebook  3
1441901698000    1441901698010   242           3423            youtube   4
1441901698400    1441901698500   423           2344            youtube   4
Desired output:

duration  bytes_uplink  bytes_downlink  service   user_id
1200      188           5533            facebook  3
110       665           5767            youtube   4

(For the facebook rows, the two intervals overlap and merge into a single interval from 1441901695012 to 1441901696212, hence the duration of 1200; the youtube intervals are disjoint, so their durations 10 and 100 are simply added.)
I currently use something like the following lines:
import pandas as pd

df = pd.read_csv(input_file_path)
grouped = df.groupby(['service', 'user_id'])
durations = grouped.apply(calculate_duration)
df = grouped[['bytes_uplink', 'bytes_downlink']].sum()
df['duration'] = durations  # both share the (service, user_id) index
df = df.reset_index()
The calculate_duration function (below) iterates over the contents of each group, merges the overlapping time intervals, and returns the total duration, which is then joined to the summed dataframe df.
def calculate_duration(group):
    # Assumes the rows of each group are sorted by first_packet_ts.
    ranges = group[['first_packet_ts', 'last_packet_ts']].itertuples()
    duration = 0
    for _, current_start, current_stop in ranges:
        # The inner loop consumes the remainder of the same iterator,
        # so the outer loop body runs only once.
        for _, start, stop in ranges:
            if start > current_stop:
                # Disjoint interval: close the current one and start a new one.
                duration += current_stop - current_start
                current_start, current_stop = start, stop
            else:
                # Overlapping interval: extend the current one.
                current_stop = max(current_stop, stop)
        duration += current_stop - current_start
    return duration
This approach is very slow because it iterates over the rows in pure Python and calls calculate_duration via apply for every group.
Is there a more efficient way to calculate the duration of the data transfers, merging the overlapping intervals, using pandas (avoiding iteration somehow?), preferably without resorting to Cython?
How about this? (Having timed it, it might be a bit slower...)
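For instance, keep the groupby/apply structure but let pandas merge the intervals inside each group with a cumulative maximum instead of the nested Python loop. A minimal sketch (merged_duration is just an illustrative name, and df is assumed to be the raw frame straight from read_csv):

def merged_duration(group):
    # Order the intervals of this group by their start timestamp.
    group = group.sort_values('first_packet_ts')
    # Running maximum of the end timestamps, shifted one row so each
    # interval sees only the intervals that precede it.
    prev_stop = group['last_packet_ts'].cummax().shift()
    # An interval opens a new merged block when it starts after everything
    # before it has ended; cumsum turns the boolean flags into block labels.
    block = (group['first_packet_ts'] > prev_stop).cumsum()
    spans = group.groupby(block).agg(start=('first_packet_ts', 'min'),
                                     stop=('last_packet_ts', 'max'))
    return (spans['stop'] - spans['start']).sum()

durations = df.groupby(['service', 'user_id']).apply(merged_duration)

This still makes one Python call per group, which is why it may not beat your version on frames with many small groups.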
Edit: I don't think this is any more valid than yours, but you could try something along these lines:
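A fully vectorised sketch, assuming the whole frame fits in memory: apply the same cumulative-maximum trick once across the sorted frame, label the merged intervals, and aggregate, with no per-group Python call at all.

import pandas as pd

df = pd.read_csv(input_file_path)
keys = ['service', 'user_id']

# Sort so that, within each group, intervals appear by start timestamp.
df = df.sort_values(keys + ['first_packet_ts'])

# Running maximum of end timestamps within each group, shifted one row
# down so every row sees only the intervals before it in its group.
running_stop = df.groupby(keys)['last_packet_ts'].cummax()
prev_stop = running_stop.groupby([df[k] for k in keys]).shift()

# A row opens a new merged interval when it starts after everything
# before it has ended, or when it is the first row of its group.
new_interval = (df['first_packet_ts'] > prev_stop) | prev_stop.isna()
interval_id = new_interval.cumsum().rename('interval_id')

# Collapse each merged interval to its span, then sum spans per group.
spans = df.groupby(keys + [interval_id]).agg(
    start=('first_packet_ts', 'min'),
    stop=('last_packet_ts', 'max'))
durations = (spans['stop'] - spans['start']).groupby(level=keys).sum()

result = df.groupby(keys)[['bytes_uplink', 'bytes_downlink']].sum()
result['duration'] = durations
result = result.reset_index()

On the sample data this yields durations of 1200 and 110, matching the desired output; everything is done with column-wise operations and groupby aggregations.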