Is this code capturing both the forward and backward cross-correlation lags correctly?

51 views Asked by At

I figured that statsmodel package's cross-correlation function sm.tsa.stattools.ccf (am testing two breathing signals recorded with different devices) only tests for forward lags and not for backward lags once it gave the results for one participant like below (says the best cross-correlation is achieved with zero lags but signal2 should clearly be shifted backwards):

enter image description here

enter image description here

Then, I added a backwards part to my cross-correlation calculation by just switching the places of input arguments: so the backwards one is calculated by first putting signal2 and then signal1 (See the code):

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Initialize a list to store cross-correlation values for each participant
cross_correlation_values = []

participants = [39161, 25919]

# Loop through each participant
for participant in participants:
    print(f"Processing data for participant {participant}...")

    # Access data for the current participant (replace this with your data access method)
    df_rsp_imp5fs = results[participant]['df_rsp']
    df_raw_rsp_imp5fs = results[participant]['df_raw_rsp']
    df_rsp_hexoskinthorax = results_hexoskin[participant]['df_rsp_thorax']

    # Convert 'exact_time' column to datetime index for df_rsp_imp5fs
    df_rsp_imp5fscopy = df_rsp_imp5fs.copy()
    df_rsp_imp5fscopy['exact_time'] = pd.to_datetime(df_rsp_imp5fscopy['exact_time'])
    df_rsp_imp5fscopy.set_index('exact_time', inplace=True)

    # Data preprocessing for df_rsp_imp5fs
    # No need to check 'exact_time' column, it's already the index
    data_type = df_rsp_imp5fscopy.index.dtype
    print(f'Data type of index for participant {participant}: {data_type}')

    # Upsample the DataFrame to 1000 Hz and interpolate data gaps
    desired_frequency = 1000
    upsampled_df_rsp_imp5fscopy = df_rsp_imp5fscopy.resample(f'{1000//desired_frequency}ms').asfreq()
    upsampled_df_rsp_imp5fscopy = upsampled_df_rsp_imp5fscopy.interpolate(method='linear')

    # Convert 'exact_time' column to datetime index for df_rsp_hexoskinthorax
    df_rsp_hexoskinthoraxcopy = df_rsp_hexoskinthorax.copy()

    df_rsp_hexoskinthoraxcopy['exact_time'] = pd.to_datetime(df_rsp_hexoskinthoraxcopy['exact_time'])
    df_rsp_hexoskinthoraxcopy.set_index('exact_time', inplace=True)

    # Data preprocessing for df_rsp_hexoskinthorax
    # No need to check 'exact_time' column, it's already the index
    data_type = df_rsp_hexoskinthoraxcopy.index.dtype
    print(f'Data type of index for participant {participant}: {data_type}')

    # Upsample the DataFrame to 1000 Hz and interpolate data gaps
    desired_frequency = 1000
    upsampled_df_rsp_hexoskinthoraxcopy = df_rsp_hexoskinthoraxcopy.resample(f'{1000//desired_frequency}ms').asfreq()
    upsampled_df_rsp_hexoskinthoraxcopy = upsampled_df_rsp_hexoskinthoraxcopy.interpolate(method='linear')

    # Determine the start and end times for both dataframes
    start_time_hexoskin = upsampled_df_rsp_hexoskinthoraxcopy.index[0]
    start_time_imp5fs = upsampled_df_rsp_imp5fscopy.index[0]

    end_time_hexoskin = upsampled_df_rsp_hexoskinthoraxcopy.index[-1]
    end_time_imp5fs = upsampled_df_rsp_imp5fscopy.index[-1]

    # Identify which dataframe starts later and which one ends earlier
    if start_time_hexoskin < start_time_imp5fs:
        start_time = start_time_imp5fs
    else:
        start_time = start_time_hexoskin

    if end_time_hexoskin < end_time_imp5fs:
        end_time = end_time_hexoskin
    else:
        end_time = end_time_imp5fs

    # Slice both dataframes to match the determined start and end times
    cut_upsampled_df_rsp_hexoskinthoraxcopy = upsampled_df_rsp_hexoskinthoraxcopy[start_time:end_time]
    cut_upsampled_df_rsp_imp5fscopy = upsampled_df_rsp_imp5fscopy[start_time:end_time]
    


    # Now you have the two dataframes with matching time ranges

    signal_1 = cut_upsampled_df_rsp_imp5fscopy['RSP_Clean'].to_numpy()
    signal_2 = cut_upsampled_df_rsp_hexoskinthoraxcopy['RSP_Clean'].to_numpy()

    # Normalize signal_1 using Z-score
    mean1 = np.mean(signal_1)
    std1 = np.std(signal_1)
    signal1 = (signal_1 - mean1) / (std1)

    # Normalize signal_2 using Min-Max scaling
    mean2 = np.mean(signal_2)
    std2 = np.std(signal_2)
    signal2 = (signal_2 - mean2) / (std2)

    len_signal1 = len(signal1)
    len_signal2 = len(signal2)
    
    print(len(signal1))
    print(len(signal2))

    # Calculate cross-correlation
    corr = sm.tsa.stattools.ccf(signal1, signal2, adjusted=False)
    corr_backwards = sm.tsa.stattools.ccf(signal2, signal1, adjusted=False)

    # Find the index of the maximum correlation value
    max_corr_index = np.argmax(corr)
    max_corr_index_backwards = np.argmax(corr_backwards)
    
    # Find the maximum correlation value using the index
    max_corr_value = corr[max_corr_index]
    max_corr_value_backwards = corr[max_corr_index_backwards]

    print(f"Maximum Correlation Value for participant {participant}: {max_corr_value}")
    print(f"Index of Maximum Correlation Value for participant {participant}: {max_corr_index}")
    print(f"Maximum -backwards- Correlation Value for participant {participant}: {max_corr_value_backwards}")
    print(f"Index of Maximum -backwards- Correlation Value for participant {participant}: {max_corr_index_backwards}")
    

    # Append the maximum correlation value to the list
    cross_correlation_values.append(max_corr_value)

    # Define the shift value
    shift = max_corr_index

    # Define the sampling rate (Hz)
    sampling_rate = 1000

    # Calculate the length of the signals
    total_samples = len_signal1
    total_seconds = total_samples / sampling_rate

    # Calculate the number of 1-minute periods
    minutes_per_period = 1
    total_periods = int(np.ceil(total_seconds / 60))
    #################################
    # Create subplots for each 1-minute period
    
    # Create subplots for each 1-minute period
    for period in range(total_periods):
        start_sample = period * 60 * sampling_rate  # Start of the period
        end_sample = (period + 1) * 60 * sampling_rate  # End of the period

        try:
            # Ensure that the end_sample does not exceed the length of the shorter signal
            end_sample = min(end_sample, min(len_signal1, len_signal2))

            # Plot signal1 and signal2 shifted with the specified range
            plt.figure(figsize=(10, 6))
            plt.plot(range(start_sample, end_sample), signal1[start_sample:end_sample], label='Signal 1', color='blue')
            plt.plot(
                range(start_sample + shift, end_sample + shift),
                signal2[start_sample:end_sample],
                label='Signal 2 (Shifted)',
                color='red'
            )
            plt.xlabel('Sample Index')
            plt.ylabel('Signal Value')
            plt.title(f'Signal 1 and Signal 2 (Shifted) - Participant {participant} - Period {period + 1}')
            plt.legend()

            # Show or save the plot (you can save using plt.savefig)
            plt.show()
        except ValueError as e:
            print(f"Error plotting for participant {participant}, period {period + 1}: {str(e)}")
            continue



However when I do this, the output it gives for the first participant (whose second signal should be shifted backwards; so I would assume that the backwards portion results in a much higher cross-correlation value) an even lower cross-correlation value. This is the outpu:

Processing data for participant 39161... Data type of index for participant 39161: datetime64[ns] Data type of index for participant 39161: datetime64[ns] 3883001 3883001 Maximum Correlation Value for participant 39161: 0.4651858220211519 Index of Maximum Correlation Value for participant 39161: 0 Maximum -backwards- Correlation Value for participant 39161: 0.07670746799847256 Index of Maximum -backwards- Correlation Value for participant 39161: 492

Is my way of backwards calculation (simply switching the place of signal2 and signal1) not working? What am I missing here? I would really appreciate if someone can help me correct it.

Note that for the second participant, the forward lag (ordinary function) already worked well. I also asked for the backwards calculation for that and that gave a zero lag but still a reasonable cross-correlation. And in that case, perhaps, the code is correct but for some reason, the second signal for the first participant really shouldn't be shifted backwards? Below is the output for the second participant in case that helps:

Processing data for participant 25919... Data type of index for participant 25919: datetime64[ns] Data type of index for participant 25919: datetime64[ns] 3788001 3788001 Maximum Correlation Value for participant 25919: 0.6311162750258501 Index of Maximum Correlation Value for participant 25919: 266 Maximum -backwards- Correlation Value for participant 25919: 0.5906250524659878 Index of Maximum -backwards- Correlation Value for participant 25919: 0

To find the possibility that signal2 should be shifted backwards for the best cross-correlation, I first put signal2 and then signal1 into the function and that did not work out for the first participant (based on visualizations where, without adjusting for any lags yet, the second signal (Red) always comes a bit before signal1.

Also, in case anyone directs me to this question page https://stackoverflow.com/questions/63491991/how-to-use-the-ccf-method-in-the-statsmodels-library#:~:text=The%20statsmodels%20ccf%20function%20only,input%20series%20and%20the%20output.

I already tried

    corr_backwards = sm.tsa.stattools.ccf(signal2, signal1, adjusted=False)[::-1]

and

    corr_backwards = sm.tsa.stattools.ccf(signal1[::-1], signal2[::-1], adjusted=False)[::-1]

and these give cross-correlation values of 0 and does not make sense for any participant.

0

There are 0 answers