Python - Time-weighted average Pandas, grouped by time interval


I have a time series in a Pandas DataFrame. The timestamps can be unevenly spaced (one every 1-5 minutes), but there will always be one on every 5-minute mark (minute values 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55).

Example:

2017-01-01 2:05:00   32.90
2017-01-01 2:07:30   29.83
2017-01-01 2:10:00   45.76
2017-01-01 2:15:00   16.22
2017-01-01 2:20:00   17.33
2017-01-01 2:25:00   23.40
2017-01-01 2:28:45   150.12
2017-01-01 2:30:00   100.29
2017-01-01 2:35:00   38.45
2017-01-01 2:40:00   67.12
2017-01-01 2:45:00   20.00
2017-01-01 2:50:00   58.41
2017-01-01 2:55:00   58.32
2017-01-01 3:00:00   59.89

I want to get the time-weighted averages of 15-minute blocks. A row whose timestamp falls exactly on a 15-minute mark (minute values 0, 15, 30, 45) ends an interval, so the grouping is as follows:

Group 1 (interval 2017-01-01 2:00:00):
    2017-01-01 2:05:00   32.90
    2017-01-01 2:07:30   29.83
    2017-01-01 2:10:00   45.76
    2017-01-01 2:15:00   16.22

Group 2 (interval 2017-01-01 2:15:00):
    2017-01-01 2:20:00   17.33
    2017-01-01 2:25:00   23.40
    2017-01-01 2:28:45   150.12
    2017-01-01 2:30:00   100.29

Group 3 (interval 2017-01-01 2:30:00):
    2017-01-01 2:35:00   38.45
    2017-01-01 2:40:00   67.12
    2017-01-01 2:45:00   20.00

Group 4 (interval 2017-01-01 2:45:00):
    2017-01-01 2:50:00   58.41
    2017-01-01 2:55:00   58.32
    2017-01-01 3:00:00   59.89

The average MUST be time-weighted, so not just a standard average of all values in a group.

For instance, the time-weighted average of group 2 is NOT 72.785, which is the regular average of all 4 values. Rather, it should be:

 (5 minutes / 15 minutes) * 17.33 = 5.776667     ==> The 5 minutes is taken from the difference between this timestamp and the previous timestamp
+(5 minutes / 15 minutes) * 23.40 = 7.8
+(3.75 minutes / 15 minutes) * 150.12 = 37.53
+(1.25 minutes / 15 minutes) * 100.29 = 8.3575

= **59.46417**
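
As a quick check of the arithmetic above, here is a minimal NumPy sketch. The variable names are made up for illustration; only the durations and values come from group 2.

```python
import numpy as np

# Minutes elapsed since the previous timestamp, for each row of group 2
minutes = np.array([5.0, 5.0, 3.75, 1.25])
values = np.array([17.33, 23.40, 150.12, 100.29])

# Each value is weighted by the share of the 15-minute window it covers
weights = minutes / 15.0
twa = float(np.sum(weights * values))
print(round(twa, 5))  # 59.46417
```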

Ideally the 15 minutes is parameterized, since it might change to 60 minutes (hourly) in the future, but I don't think this is an issue here.

Performance is also quite important: my dataset will have around 10k rows, so iterating over each record one by one will be pretty slow.
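
For reference, the two requirements above (a parameterized window and no row-by-row loop) could be met with something like the sketch below. It is not taken from any of the answers: the function name `time_weighted_mean` and its `freq` parameter are hypothetical, and it assumes a sorted, unique DatetimeIndex, right-closed windows, and that the first sample in a window is weighted from the window start.

```python
import pandas as pd

def time_weighted_mean(s, freq="15min"):
    """Hypothetical helper: time-weighted mean per right-closed window."""
    window = pd.Timedelta(freq)
    stamps = s.index.to_series()
    ends = stamps.dt.ceil(freq)      # the window each sample closes into
    starts = ends - window
    prev = stamps.shift()            # previous timestamp (NaT for the first row)
    # A sample is credited from the previous sample or from the window start,
    # whichever is later (assumption: no carry-over from the prior window)
    span_start = prev.where(prev > starts, starts)
    weights = (stamps - span_start) / window
    return (s * weights).groupby(ends).sum()

times = ["2:05:00", "2:07:30", "2:10:00", "2:15:00", "2:20:00", "2:25:00",
         "2:28:45", "2:30:00", "2:35:00", "2:40:00", "2:45:00", "2:50:00",
         "2:55:00", "3:00:00"]
values = [32.90, 29.83, 45.76, 16.22, 17.33, 23.40, 150.12, 100.29,
          38.45, 67.12, 20.00, 58.41, 58.32, 59.89]
s = pd.Series(values, index=pd.to_datetime(["2017-01-01 " + t for t in times]))

result = time_weighted_mean(s, "15min")
print(result)
```

The result is labeled by the end of each window. Windows that contain no rows simply do not appear in the output, so reindex against a full date range if they should show up as NaN.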

I tried looking into the df.rolling() function of Pandas, but couldn't figure out how to apply it directly to my specific scenario.

Thanks a lot for the help!

UPDATE 1:

Following Simon's brilliant solution, I made a few tweaks to adapt it to my specific case:

def func(df):
    if df.size == 0: return
    timestep = 15*60
    indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
    seconds = indexes.seconds
    weight = [seconds[n]/timestep if n == 0 else (seconds[n] - seconds[n - 1])/timestep
          for n, k in enumerate(seconds)]
    return np.sum(weight*df.values)

This copes with possibly empty 15-minute intervals (missing rows in the DB).

There are 5 answers

Simon On BEST ANSWER

This one was tricky. I would love to see another commenter do this more efficiently, since I have a hunch there is a better way to do this.

I also skipped one part, which is parameterizing the 15 minute value, but I point out how you might do it in the comment. This is left as an exercise for the reader :D It should be parameterized though, as it stands now there are a lot of random '*15' and '*60' values scattered around the place, which looks clumsy.

I'm also tired, and my wife wants to watch a movie, so I didn't clean up my code. It's a little messy, and should be written cleaner -- which may or may not be worth doing, depending on if someone else can redo this all in 6 lines of code. If tomorrow morning it is still unanswered, I'll go back through and do this better.

Updated better solution 1

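import numpy as np

def func(df):
    timestep = 15*60  # window length in seconds
    # Seconds since the window opened; a stamp on the boundary closes the
    # window (closed='right'), so it counts as a full timestep, not 0
    seconds = np.asarray(df.index.minute*60 + df.index.second) % timestep
    seconds = np.where(seconds == 0, timestep, seconds)
    # Weight = time since the previous sample (or window start) / window length
    weight = np.diff(seconds, prepend=0)/timestep
    return np.sum(weight*df.values)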

df.resample('15min', closed='right').apply(func)
hugovdberg On

Another option is to multiply each value by the fraction of the interval that elapsed between ticks and then sum the results. The following function takes the Series or DataFrame with the values and the requested index:

import numpy as np
import pandas as pd


def resample_time_weighted_mean(x, target_index, closed=None, label=None):
    shift = 1 if closed == "right" else -1
    fill = "bfill" if closed == "right" else "ffill"
    # Determine length of each interval (daylight saving aware)
    extended_index = target_index.union(
        [target_index[0] - target_index.freq, target_index[-1] + target_index.freq]
    )
    interval_lengths = -extended_index.to_series().diff(periods=shift)

    # Create a combined index of the source index and target index and reindex to combined index
    combined_index = x.index.union(extended_index)
    x = x.reindex(index=combined_index, method=fill)
    interval_lengths = interval_lengths.reindex(index=combined_index, method=fill)

    # Determine weights of each value and multiply source values
    weights = -x.index.to_series().diff(periods=shift) / interval_lengths
    x = x.mul(weights, axis=0)

    # Resample to new index, the final reindex is necessary because resample 
    # might return more rows based on the frequency
    return (
        x.resample(target_index.freq, closed=closed, label=label)
        .sum()
        .reindex(target_index)
    )

Applying this to the sample data:

x = pd.Series(
    [
        32.9,
        29.83,
        45.76,
        16.22,
        17.33,
        23.4,
        150.12,
        100.29,
        38.45,
        67.12,
        20.0,
        58.41,
        58.32,
        59.89,
    ],
    index=pd.to_datetime(
        [
            "2017-01-01 2:05:00",
            "2017-01-01 2:07:30",
            "2017-01-01 2:10:00",
            "2017-01-01 2:15:00",
            "2017-01-01 2:20:00",
            "2017-01-01 2:25:00",
            "2017-01-01 2:28:45",
            "2017-01-01 2:30:00",
            "2017-01-01 2:35:00",
            "2017-01-01 2:40:00",
            "2017-01-01 2:45:00",
            "2017-01-01 2:50:00",
            "2017-01-01 2:55:00",
            "2017-01-01 3:00:00",
        ]
    ),
)

opts = dict(closed="right", label="right")
resample_time_weighted_mean(
    x, pd.DatetimeIndex(x.resample("15T", **opts).groups.keys(), freq="infer"), **opts
)

Which returns:

2017-01-01 02:15:00    18.005000
2017-01-01 02:30:00    59.464167
2017-01-01 02:45:00    41.856667
2017-01-01 03:00:00    58.873333
Freq: 15T, dtype: float64

Regarding the performance issues mentioned under Simon's answer: this method also performs well on millions of rows, because the weights are calculated all at once instead of in a relatively slow Python loop:

new_index = pd.date_range("2017-01-01", "2021-01-01", freq="1T")
new_index = new_index + pd.TimedeltaIndex(
    np.random.rand(*new_index.shape) * 60 - 30, "s"
)
values = pd.Series(np.random.rand(*new_index.shape), index=new_index)
print(values.shape)
(2103841,)


%%timeit
resample_time_weighted_mean(
    values, pd.date_range("2017-01-01", "2021-01-01", freq="15T"), closed="right"
)
4.93 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Monirrad On

Let the label of the first column be ts and the label of the next column be value. The DataFrame is assumed to have the default 0..n-1 integer index.

import pandas as pd
from datetime import timedelta

def tws(df, length):
    df['ts'] = pd.to_datetime(df['ts'])
    # Minutes elapsed since the previous row (0 for the first row)
    interval = [0]
    df1 = df
    for i in range(1, len(df1)):
        interval.append((df1.loc[i, 'ts'] - df1.loc[i-1, 'ts']).total_seconds() / 60)
    df1['time_interval'] = interval

    start = pd.to_datetime('2017-01-01 2:00:00')
    TWS = []
    ave = 0
    for i in range(1, len(df1)):
        if df1.loc[i, 'ts'] <= (start + timedelta(minutes=length)):
            ave = ave + df1.loc[i, 'value']*df1.loc[i, 'time_interval']
        else:
            # The window is complete: store it and start the next one
            TWS.append(ave/length)
            ave = df1.loc[i, 'value']*df1.loc[i, 'time_interval']
            start = df1.loc[i-1, 'ts']
    # Store the last window
    TWS.append(ave/length)

    return TWS

tws(df, 15)

The output is a list with the time-weighted average of each interval.

cheesus On

I tried all of these answers and they all ended up breaking in certain situations.

This worked for me:

import pandas as pd

def resample_timeweighted(ts, freq="15min"):
    new_index = pd.date_range(start=ts.index[0], end=ts.index[-1], freq=freq)
    last_value = 0
    means = {}
    for t0, t1 in zip(new_index, new_index[1:]):
        s = ts.truncate(before=t0, after=t1)
        # Pad the window: carry the previous value in at t0 and hold the
        # last value until t1
        s = pd.concat([pd.Series(index=[t0], data=[last_value]), s])
        s = pd.concat([s, pd.Series(index=[t1], data=[s.iloc[-1]])])
        last_value = s.iloc[-1]
        delta_time = s.index[1:] - s.index[:-1]
        # Each value counts for the time until the next sample
        tw_mean = s.iloc[:-1].mul(delta_time.total_seconds()).sum() / (t1-t0).total_seconds()
        means[t0] = tw_mean
    return pd.Series(means)

Given your input, it returns

2017-01-01 02:05:00    31.115000
2017-01-01 02:20:00    57.566667
2017-01-01 02:35:00    41.856667
dtype: float64
Draco On

The accepted solution by Simon assumes the timestamps are time-ending, so it does not take into account the value(s) present before each 15-minute window opens. Raw datasets are not typically sampled that way. The correct solution for time-beginning timestamps requires inserting a sample at each 15th minute of the time series and carrying forward the last sample from the prior 15-minute window, so the time-weighted calculation needs a little more manipulation of the samples.

For time-beginning timestamps, the value 32.90 sampled at 2017-01-01 2:05:00 was not present for 5/15 of the 15-minute window: it held only from 2017-01-01 2:05:00 to 2017-01-01 2:07:30 and thus has a weight of 2.5/15. We do not know what value was present for the first 5 minutes of the window. The accepted solution also fails to calculate a time-weighted average for windows in which no value was recorded. In reality, the last value before such an empty 15-minute window is the time-weighted average of every following window until a new value is posted to the dataset.

The correct solution requires forcing a sample at each 15th minute of the time series, carrying forward the last value from the prior 15-minute window.

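import numpy as np
from pandas import Series, Timestamp, concat, date_range

# Add an empty (NaN) sample at every 15-minute mark
series = concat(
    [
        series,
        Series(index=date_range(
            Timestamp('2017-01-01 2:15:00'), 
            Timestamp('2017-01-01 3:00:00'), 
            freq='15min'
        ), dtype=np.float64)
    ]
).sort_index()
# Keep the real sample where one already exists, then carry values forward
series = series[~series.index.duplicated(keep='first')].ffill()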

The provided data already has a sample at each 15th minute, but this step should still be done to guarantee data integrity when resampling. The time-weighted average function for samples with time-beginning timestamps is as follows:

def fifteen_minute_weighted_average(series):
    # Seconds from the start of the window to each sample, plus the window end
    seconds = np.append(
        (series.index.minute.values % 15) * 60 + series.index.second.values,
        15 * 60
    )
    # Each sample holds until the next sample (or until the window closes)
    weights = np.diff(seconds) / (15 * 60)
    values = series.values
    return np.sum(weights * values)

This solution pairs each time-delta weight with the correct sample for time-beginning timestamps. When you resample the data, use closed='left' instead of closed='right' so that each window captures its starting sample; the ending sample is useless in the window because it did not occur in it. The groupings provided by Vincent are not how the data must be grouped for time-weighted averages of time-beginning timestamps.

Resampling to 15-minutes is as follows:

weighted_averages = series.loc[
    Timestamp('2017-01-01 2:15:00'):Timestamp('2017-01-01 3:00:00')
].resample('15min', closed='left').apply(fifteen_minute_weighted_average)
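
The same time-beginning idea can also be written without a per-window Python function. The following is only a sketch under stated assumptions, not the code from this answer: the name `time_beginning_weighted_mean` and its `freq` parameter are hypothetical, the index is assumed sorted and unique, and each value is assumed to hold until the next timestamp.

```python
import pandas as pd

def time_beginning_weighted_mean(s, freq="15min"):
    """Hypothetical helper: each sample holds until the next timestamp."""
    # Force a sample at every window boundary and carry the last value forward
    grid = pd.date_range(s.index[0].ceil(freq), s.index[-1].floor(freq), freq=freq)
    full = s.reindex(s.index.union(grid)).ffill()
    # How long each value holds before the next sample arrives
    hold = full.index.to_series().diff().shift(-1)
    weighted = full * (hold / pd.Timedelta(freq))
    # The last sample has no known duration, so it is dropped
    return weighted.iloc[:-1].resample(freq, closed="left", label="left").sum()

times = ["2:05:00", "2:07:30", "2:10:00", "2:15:00", "2:20:00", "2:25:00",
         "2:28:45", "2:30:00", "2:35:00", "2:40:00", "2:45:00", "2:50:00",
         "2:55:00", "3:00:00"]
values = [32.90, 29.83, 45.76, 16.22, 17.33, 23.40, 150.12, 100.29,
          38.45, 67.12, 20.00, 58.41, 58.32, 59.89]
s = pd.Series(values, index=pd.to_datetime(["2017-01-01 " + t for t in times]))

tb_result = time_beginning_weighted_mean(s)
print(tb_result)
```

The first window (2:00 to 2:15) only has data from 2:05 onward, so its weights sum to 10/15 rather than 1. Drop it, or rescale by the covered time, if partially observed windows should not be reported.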