I figured that statsmodel package's cross-correlation function sm.tsa.stattools.ccf (am testing two breathing signals recorded with different devices) only tests for forward lags and not for backward lags once it gave the results for one participant like below (says the best cross-correlation is achieved with zero lags but signal2 should clearly be shifted backwards):
Then, I added a backwards part to my cross-correlation calculation by just switching the places of input arguments: so the backwards one is calculated by first putting signal2 and then signal1 (See the code):
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Initialize a list to store cross-correlation values for each participant
cross_correlation_values = []
participants = [39161, 25919]
# Loop through each participant
for participant in participants:
print(f"Processing data for participant {participant}...")
# Access data for the current participant (replace this with your data access method)
df_rsp_imp5fs = results[participant]['df_rsp']
df_raw_rsp_imp5fs = results[participant]['df_raw_rsp']
df_rsp_hexoskinthorax = results_hexoskin[participant]['df_rsp_thorax']
# Convert 'exact_time' column to datetime index for df_rsp_imp5fs
df_rsp_imp5fscopy = df_rsp_imp5fs.copy()
df_rsp_imp5fscopy['exact_time'] = pd.to_datetime(df_rsp_imp5fscopy['exact_time'])
df_rsp_imp5fscopy.set_index('exact_time', inplace=True)
# Data preprocessing for df_rsp_imp5fs
# No need to check 'exact_time' column, it's already the index
data_type = df_rsp_imp5fscopy.index.dtype
print(f'Data type of index for participant {participant}: {data_type}')
# Upsample the DataFrame to 1000 Hz and interpolate data gaps
desired_frequency = 1000
upsampled_df_rsp_imp5fscopy = df_rsp_imp5fscopy.resample(f'{1000//desired_frequency}ms').asfreq()
upsampled_df_rsp_imp5fscopy = upsampled_df_rsp_imp5fscopy.interpolate(method='linear')
# Convert 'exact_time' column to datetime index for df_rsp_hexoskinthorax
df_rsp_hexoskinthoraxcopy = df_rsp_hexoskinthorax.copy()
df_rsp_hexoskinthoraxcopy['exact_time'] = pd.to_datetime(df_rsp_hexoskinthoraxcopy['exact_time'])
df_rsp_hexoskinthoraxcopy.set_index('exact_time', inplace=True)
# Data preprocessing for df_rsp_hexoskinthorax
# No need to check 'exact_time' column, it's already the index
data_type = df_rsp_hexoskinthoraxcopy.index.dtype
print(f'Data type of index for participant {participant}: {data_type}')
# Upsample the DataFrame to 1000 Hz and interpolate data gaps
desired_frequency = 1000
upsampled_df_rsp_hexoskinthoraxcopy = df_rsp_hexoskinthoraxcopy.resample(f'{1000//desired_frequency}ms').asfreq()
upsampled_df_rsp_hexoskinthoraxcopy = upsampled_df_rsp_hexoskinthoraxcopy.interpolate(method='linear')
# Determine the start and end times for both dataframes
start_time_hexoskin = upsampled_df_rsp_hexoskinthoraxcopy.index[0]
start_time_imp5fs = upsampled_df_rsp_imp5fscopy.index[0]
end_time_hexoskin = upsampled_df_rsp_hexoskinthoraxcopy.index[-1]
end_time_imp5fs = upsampled_df_rsp_imp5fscopy.index[-1]
# Identify which dataframe starts later and which one ends earlier
if start_time_hexoskin < start_time_imp5fs:
start_time = start_time_imp5fs
else:
start_time = start_time_hexoskin
if end_time_hexoskin < end_time_imp5fs:
end_time = end_time_hexoskin
else:
end_time = end_time_imp5fs
# Slice both dataframes to match the determined start and end times
cut_upsampled_df_rsp_hexoskinthoraxcopy = upsampled_df_rsp_hexoskinthoraxcopy[start_time:end_time]
cut_upsampled_df_rsp_imp5fscopy = upsampled_df_rsp_imp5fscopy[start_time:end_time]
# Now you have the two dataframes with matching time ranges
signal_1 = cut_upsampled_df_rsp_imp5fscopy['RSP_Clean'].to_numpy()
signal_2 = cut_upsampled_df_rsp_hexoskinthoraxcopy['RSP_Clean'].to_numpy()
# Normalize signal_1 using Z-score
mean1 = np.mean(signal_1)
std1 = np.std(signal_1)
signal1 = (signal_1 - mean1) / (std1)
# Normalize signal_2 using Min-Max scaling
mean2 = np.mean(signal_2)
std2 = np.std(signal_2)
signal2 = (signal_2 - mean2) / (std2)
len_signal1 = len(signal1)
len_signal2 = len(signal2)
print(len(signal1))
print(len(signal2))
# Calculate cross-correlation
corr = sm.tsa.stattools.ccf(signal1, signal2, adjusted=False)
corr_backwards = sm.tsa.stattools.ccf(signal2, signal1, adjusted=False)
# Find the index of the maximum correlation value
max_corr_index = np.argmax(corr)
max_corr_index_backwards = np.argmax(corr_backwards)
# Find the maximum correlation value using the index
max_corr_value = corr[max_corr_index]
max_corr_value_backwards = corr[max_corr_index_backwards]
print(f"Maximum Correlation Value for participant {participant}: {max_corr_value}")
print(f"Index of Maximum Correlation Value for participant {participant}: {max_corr_index}")
print(f"Maximum -backwards- Correlation Value for participant {participant}: {max_corr_value_backwards}")
print(f"Index of Maximum -backwards- Correlation Value for participant {participant}: {max_corr_index_backwards}")
# Append the maximum correlation value to the list
cross_correlation_values.append(max_corr_value)
# Define the shift value
shift = max_corr_index
# Define the sampling rate (Hz)
sampling_rate = 1000
# Calculate the length of the signals
total_samples = len_signal1
total_seconds = total_samples / sampling_rate
# Calculate the number of 1-minute periods
minutes_per_period = 1
total_periods = int(np.ceil(total_seconds / 60))
#################################
# Create subplots for each 1-minute period
# Create subplots for each 1-minute period
for period in range(total_periods):
start_sample = period * 60 * sampling_rate # Start of the period
end_sample = (period + 1) * 60 * sampling_rate # End of the period
try:
# Ensure that the end_sample does not exceed the length of the shorter signal
end_sample = min(end_sample, min(len_signal1, len_signal2))
# Plot signal1 and signal2 shifted with the specified range
plt.figure(figsize=(10, 6))
plt.plot(range(start_sample, end_sample), signal1[start_sample:end_sample], label='Signal 1', color='blue')
plt.plot(
range(start_sample + shift, end_sample + shift),
signal2[start_sample:end_sample],
label='Signal 2 (Shifted)',
color='red'
)
plt.xlabel('Sample Index')
plt.ylabel('Signal Value')
plt.title(f'Signal 1 and Signal 2 (Shifted) - Participant {participant} - Period {period + 1}')
plt.legend()
# Show or save the plot (you can save using plt.savefig)
plt.show()
except ValueError as e:
print(f"Error plotting for participant {participant}, period {period + 1}: {str(e)}")
continue
However when I do this, the output it gives for the first participant (whose second signal should be shifted backwards; so I would assume that the backwards portion results in a much higher cross-correlation value) an even lower cross-correlation value. This is the outpu:
Processing data for participant 39161... Data type of index for participant 39161: datetime64[ns] Data type of index for participant 39161: datetime64[ns] 3883001 3883001 Maximum Correlation Value for participant 39161: 0.4651858220211519 Index of Maximum Correlation Value for participant 39161: 0 Maximum -backwards- Correlation Value for participant 39161: 0.07670746799847256 Index of Maximum -backwards- Correlation Value for participant 39161: 492
Is my way of backwards calculation (simply switching the place of signal2 and signal1) not working? What am I missing here? I would really appreciate if someone can help me correct it.
Note that for the second participant, the forward lag (ordinary function) already worked well. I also asked for the backwards calculation for that and that gave a zero lag but still a reasonable cross-correlation. And in that case, perhaps, the code is correct but for some reason, the second signal for the first participant really shouldn't be shifted backwards? Below is the output for the second participant in case that helps:
Processing data for participant 25919... Data type of index for participant 25919: datetime64[ns] Data type of index for participant 25919: datetime64[ns] 3788001 3788001 Maximum Correlation Value for participant 25919: 0.6311162750258501 Index of Maximum Correlation Value for participant 25919: 266 Maximum -backwards- Correlation Value for participant 25919: 0.5906250524659878 Index of Maximum -backwards- Correlation Value for participant 25919: 0
To find the possibility that signal2 should be shifted backwards for the best cross-correlation, I first put signal2 and then signal1 into the function and that did not work out for the first participant (based on visualizations where, without adjusting for any lags yet, the second signal (Red) always comes a bit before signal1.
Also, in case anyone directs me to this question page https://stackoverflow.com/questions/63491991/how-to-use-the-ccf-method-in-the-statsmodels-library#:~:text=The%20statsmodels%20ccf%20function%20only,input%20series%20and%20the%20output.
I already tried
corr_backwards = sm.tsa.stattools.ccf(signal2, signal1, adjusted=False)[::-1]
and
corr_backwards = sm.tsa.stattools.ccf(signal1[::-1], signal2[::-1], adjusted=False)[::-1]
and these give cross-correlation values of 0 and does not make sense for any participant.