I'm experimenting with characterizing data over time. Let's say I have the following time-series data:
import numpy as np
import pandas as pd
import scipy.signal as signal
import matplotlib.pyplot as plt
# Function to generate one triangular pulse over a single period
def generate_triangular_pulse(rise_duration, fall_duration, period, samples):
    # Breakpoints: rise to 1, fall back to 0, then stay flat until the period ends
    control_points_x = np.array([0, rise_duration, rise_duration + fall_duration, period])
    control_points_y = np.array([0, 1, 0, 0])
    x = np.linspace(0, period, samples)
    return np.interp(x, control_points_x, control_points_y)
# Set common parameters
samples = 288
rise_duration = 1
fall_duration = 3
period = 5
num_triangles = 5
# Create a time array with 5-minute intervals
t_num = pd.date_range(start='2024-01-01', freq='5min', periods=samples)
# Build the positive triangle pulse train
triangular_pulse = np.zeros(samples)
for i in range(num_triangles):
    start_index = i * (samples // num_triangles)
    end_index = (i + 1) * (samples // num_triangles)
    triangular_pulse[start_index:end_index] = generate_triangular_pulse(rise_duration, fall_duration, period, end_index - start_index)
# Convert data to a Pandas DataFrame
data = {'datetime': t_num, 'Positive Triangle Pulse': triangular_pulse}
df = pd.DataFrame(data)
print(df.head())
#              datetime  Positive Triangle Pulse
# 0 2024-01-01 00:00:00                 0.000000
# 1 2024-01-01 00:05:00                 0.089286
# 2 2024-01-01 00:10:00                 0.178571
# 3 2024-01-01 00:15:00                 0.267857
# 4 2024-01-01 00:20:00                 0.357143
# (full DataFrame: 288 rows × 2 columns)
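As an aside, the loop above can also be written as a vectorized construction. A minimal sketch reusing the parameters defined above (note that the 3 leftover samples, since 288 is not divisible by 5, stay at zero exactly as the loop leaves them):
one_triangle = generate_triangular_pulse(rise_duration, fall_duration, period, samples // num_triangles)
leftover = samples - num_triangles * (samples // num_triangles)  # 3 trailing samples stay zero
tiled_pulse = np.concatenate([np.tile(one_triangle, num_triangles), np.zeros(leftover)])
assert np.array_equal(tiled_pulse, triangular_pulse)  # identical to the loop's result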
I want to downsample the data from 5-minute to hourly frequency while losing as little information as possible.
resampled_df = (df.set_index('datetime')  # set the datetime column as index, required by resample()
                  .resample('1h')         # downsample with a frequency of 1 hour
                  .mean()                 # aggregate each hourly bin with mean()
                  .interpolate()          # fill NaNs and missing values [just in case]
               )
resampled_df.shape # (24, 1)
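For comparison, the same hourly mean can be produced with groupby() plus pd.Grouper instead of resample(). A minimal sketch on the df built above, showing the two spellings are interchangeable for a plain aggregation:
# groupby + Grouper on the 'datetime' column; resample() is essentially a
# convenience wrapper around this kind of time-based grouping
grouped_df = df.groupby(pd.Grouper(key='datetime', freq='1h')).mean()
grouped_df.shape  # (24, 1), same as resampled_df
print(grouped_df.equals(resampled_df))  # should be True: interpolate() had no NaNs to fill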
If we plot the original and the resampled series side by side:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
# Original 5-minute series
axes[0].plot(df['datetime'], df['Positive Triangle Pulse'], "b.-", label="original")
axes[0].set_title(f'Positive Triangle Pulse incl. {len(df)} observations')
# Resampled hourly series
axes[1].plot(resampled_df.index, resampled_df['Positive Triangle Pulse'], "b.-", label="resampled")
axes[1].set_title(f'Positive Triangle Pulse (resampled frequency=1h) incl. {len(resampled_df)} observations')
step_size = 12
selected_ticks = df['datetime'][::step_size]
for ax in axes:
    ax.set_xticks(selected_ticks)
    ax.set_xticklabels(selected_ticks, rotation=90)
    ax.legend(loc="best")  # per-axes legend; a single plt.legend() would only label the last subplot
plt.tight_layout()
plt.show()
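One way to make "minimum information loss" measurable rather than purely visual (my own addition, not taken from any of the posts below): upsample each candidate back onto the original 5-minute grid and compute the reconstruction error against the original signal. reconstruction_rmse is a hypothetical helper name, not a pandas API:
def reconstruction_rmse(resampled, original):
    # Bring the hourly series back onto the 5-minute grid by time-based
    # linear interpolation, then measure RMSE against the original signal
    reconstructed = (resampled.reindex(original.index)
                              .interpolate(method='time')
                              .ffill().bfill())
    return np.sqrt(((reconstructed - original) ** 2).mean())

original = df.set_index('datetime')['Positive Triangle Pulse']
print(reconstruction_rmse(original.resample('1h').mean(), original))   # representing
print(reconstruction_rmse(original.resample('1h').first(), original))  # selecting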

I want to find out the best practice for this: downsampling with minimum impact on the pattern of the data over time. The related posts listed at the end come very close to my objective, but don't settle it.
Questions:
- What is the difference between aggregation and resampling, if one is expressed with resample() and the other with agg() or groupby() methods?
- Which methods work by (de-)selecting individual records, and which produce a value that represents/reflects all the observations they digest? (See the sketch after this list.)
- Which of these methods has the least impact on the behavior of the data over time?
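To make the selecting-vs-representing distinction concrete, a minimal sketch (my own illustration, reusing the df built above): asfreq() and positional decimation keep one existing record per hour and discard the other eleven, while mean()/max() synthesize a new value from all twelve 5-minute observations in each hourly bin:
s = df.set_index('datetime')['Positive Triangle Pulse']
# Selecting: keep one real record per hour, drop the rest
picked = s.asfreq('1h')       # equivalent to s.resample('1h').asfreq()
decimated = s.iloc[::12]      # positional decimation, every 12th row
# Representing: digest all 12 observations per bin into one new value
averaged = s.resample('1h').mean()
peak = s.resample('1h').max() # keeps the pulse amplitude that mean() flattens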
Related posts:
- Historical Data Resample Issue
- Resample/aggregate intervals in pandas
- Pandas data frame: resample with linear interpolation
- Resampling timeseries Data Frame for different customized seasons and finding aggregates
- aggregate groups results in pandas data frame
- using resample to aggregate data with different rules for different columns in a pandas dataframe
- Need to aggregate data in pandas data frame
- pandas resample when cumulative function returns data frame
- Pandas resample data frame with fixed number of rows