I'm experimenting with characterizing data over time. Let's say I have the following time-series data:
import numpy as np
import pandas as pd
import scipy.signal as signal
import matplotlib.pyplot as plt
# Function to generate one triangular pulse over a single period
def generate_triangular_pulse(rise_duration, fall_duration, period, samples):
    # Breakpoints: rise to 1, fall back to 0, then stay flat until the period ends
    control_points_x = np.array([0, rise_duration, rise_duration + fall_duration, period])
    control_points_y = np.array([0, 1, 0, 0])
    x = np.linspace(0, period, samples)
    return np.interp(x, control_points_x, control_points_y)
# Set common parameters
samples = 288
rise_duration = 1
fall_duration = 3
period = 5
num_triangles = 5
# Create a time array with 5-minute intervals
t_num = pd.date_range(start='2024-01-01', freq='5min', periods=samples)
# Build the positive triangle pulse train
triangular_pulse = np.zeros(samples)
for i in range(num_triangles):
    start_index = i * (samples // num_triangles)
    end_index = (i + 1) * (samples // num_triangles)
    triangular_pulse[start_index:end_index] = generate_triangular_pulse(rise_duration, fall_duration, period, end_index - start_index)
# Convert data to a Pandas DataFrame
data = {'datetime': t_num, 'Positive Triangle Pulse': triangular_pulse}
df = pd.DataFrame(data)
print(df.head())
#              datetime  Positive Triangle Pulse
# 0 2024-01-01 00:00:00                 0.000000
# 1 2024-01-01 00:05:00                 0.089286
# 2 2024-01-01 00:10:00                 0.178571
# 3 2024-01-01 00:15:00                 0.267857
# 4 2024-01-01 00:20:00                 0.357143
# (full DataFrame: 288 rows × 2 columns)
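As an aside, the loop above can also be written as a vectorized construction. A minimal sketch reusing the parameters defined above (note that the 3 leftover samples, since 288 is not divisible by 5, stay at zero exactly as the loop leaves them):
one_triangle = generate_triangular_pulse(rise_duration, fall_duration, period, samples // num_triangles)
leftover = samples - num_triangles * (samples // num_triangles)  # 3 trailing samples stay zero
tiled_pulse = np.concatenate([np.tile(one_triangle, num_triangles), np.zeros(leftover)])
assert np.array_equal(tiled_pulse, triangular_pulse)  # identical to the loop's result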
I want to downsample the data from 5-minute to hourly frequency while losing as little information as possible.
resampled_df = (df.set_index('datetime')  # set the datetime column as index, required by resample()
                  .resample('1h')         # downsample with a frequency of 1 hour
                  .mean()                 # aggregate each hourly bin with mean()
                  .interpolate()          # fill NaNs and missing values [just in case]
               )
resampled_df.shape # (24, 1)
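For comparison, the same hourly mean can be produced with groupby() plus pd.Grouper instead of resample(). A minimal sketch on the df built above, showing the two spellings are interchangeable for a plain aggregation:
# groupby + Grouper on the 'datetime' column; resample() is essentially a
# convenience wrapper around this kind of time-based grouping
grouped_df = df.groupby(pd.Grouper(key='datetime', freq='1h')).mean()
grouped_df.shape  # (24, 1), same as resampled_df
print(grouped_df.equals(resampled_df))  # should be True: interpolate() had no NaNs to fill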
If we plot the original and the resampled series side by side:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
# Original 5-minute series
axes[0].plot(df['datetime'], df['Positive Triangle Pulse'], "b.-", label="original")
axes[0].set_title(f'Positive Triangle Pulse incl. {len(df)} observations')
# Resampled hourly series
axes[1].plot(resampled_df.index, resampled_df['Positive Triangle Pulse'], "b.-", label="resampled")
axes[1].set_title(f'Positive Triangle Pulse (resampled frequency=1h) incl. {len(resampled_df)} observations')
step_size = 12
selected_ticks = df['datetime'][::step_size]
for ax in axes:
    ax.set_xticks(selected_ticks)
    ax.set_xticklabels(selected_ticks, rotation=90)
    ax.legend(loc="best")  # per-axes legend; a single plt.legend() would only label the last subplot
plt.tight_layout()
plt.show()
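One way to make "minimum information loss" measurable rather than purely visual (my own addition, not taken from any of the posts below): upsample each candidate back onto the original 5-minute grid and compute the reconstruction error against the original signal. reconstruction_rmse is a hypothetical helper name, not a pandas API:
def reconstruction_rmse(resampled, original):
    # Bring the hourly series back onto the 5-minute grid by time-based
    # linear interpolation, then measure RMSE against the original signal
    reconstructed = (resampled.reindex(original.index)
                              .interpolate(method='time')
                              .ffill().bfill())
    return np.sqrt(((reconstructed - original) ** 2).mean())

original = df.set_index('datetime')['Positive Triangle Pulse']
print(reconstruction_rmse(original.resample('1h').mean(), original))   # representing
print(reconstruction_rmse(original.resample('1h').first(), original))  # selecting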

I want to find out the best practice for this: downsampling with minimum impact on the pattern of the data over time. The related posts listed at the end come very close to my objective, but don't settle it.
Questions:
- What is the difference between aggregation and resampling, if one is expressed with resample() and the other with agg() or groupby() methods?
- Which methods work by (de-)selecting individual records, and which produce a value that represents/reflects all the observations they digest? (See the sketch after this list.)
- Which of these methods has the least impact on the behavior of the data over time?
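To make the selecting-vs-representing distinction concrete, a minimal sketch (my own illustration, reusing the df built above): asfreq() and positional decimation keep one existing record per hour and discard the other eleven, while mean()/max() synthesize a new value from all twelve 5-minute observations in each hourly bin:
s = df.set_index('datetime')['Positive Triangle Pulse']
# Selecting: keep one real record per hour, drop the rest
picked = s.asfreq('1h')       # equivalent to s.resample('1h').asfreq()
decimated = s.iloc[::12]      # positional decimation, every 12th row
# Representing: digest all 12 observations per bin into one new value
averaged = s.resample('1h').mean()
peak = s.resample('1h').max() # keeps the pulse amplitude that mean() flattens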
Related posts:
- Historical Data Resample Issue
- Resample/aggregate intervals in pandas
- Pandas data frame: resample with linear interpolation
- Resampling timeseries Data Frame for different customized seasons and finding aggregates
- aggregate groups results in pandas data frame
- using resample to aggregate data with different rules for different columns in a pandas dataframe
- Need to aggregate data in pandas data frame
- pandas resample when cumulative function returns data frame
- Pandas resample data frame with fixed number of rows