Why am I getting different autocorrelation results from different libraries?

  1. Why am I getting different autocorrelation results from different libraries?
  2. Which one is correct?
import numpy as np
from scipy import signal

# Given data
data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

# Compute the autocorrelation using scipy's correlate function
autocorrelations = signal.correlate(data, data, mode='full')

# The middle of the autocorrelations array is at index len(data)-1
mid_index = len(data) - 1

# Show autocorrelation values for lag=1,2,3,4,...
print(autocorrelations[mid_index + 1:])

Output:

[21.2425 17.285  13.4525  9.8075  6.4125  3.33  ]

import pandas as pd

# Given data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Convert data to pandas Series
series = pd.Series(data)

# Compute and print the autocorrelation for lags 0 through 5
# (at lag 6 only one pair of values overlaps, so the correlation is undefined)
for lag in range(len(data) - 1):
    print(series.autocorr(lag=lag))

Output:

1.0
0.9374115462038415
0.9287843240596312
0.9260849979667674
0.9407970411588671
0.9999999999999999

from statsmodels.tsa.stattools import acf

# Your data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Calculate the autocorrelation using the acf function
autocorrelation = acf(data, nlags=len(data)-1, fft=True)

# Display the autocorrelation coefficients for lags 0, 1, 2, ...
print(autocorrelation)

Output:

[ 1.       0.39072553  0.13718689 -0.08148897 -0.24787067 -0.3445268 -0.35402598]

There are 3 answers

mrk (Best answer)

"They are likely each correct according their chosen definition of autocorrelation. Edge discontinuity effects at start and end of the array dominates short runs of data." - @Martin Brown in the comments.

Scipy's correlate function: Documentation

Scipy's signal.correlate function computes the cross-correlation of two sequences. Since you pass the same data for both inputs, it calculates the autocorrelation. Crucially, the result is the raw sliding sum of products: the data are not mean-centered and nothing is normalized. With mode='full' you get the values for every lag from -(N-1) to N-1, and you are extracting the positive lags.
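
As a sanity check (a minimal sketch using the data from the question), the positive-lag values scipy returns are just the raw sums of products of the overlapping segments:

import numpy as np
from scipy import signal

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
mid = len(data) - 1

# Full (raw) autocorrelation from scipy
full = signal.correlate(data, data, mode='full')

# Reproduce the positive-lag values by hand as plain dot products
# of the two overlapping segments at each lag.
manual = np.array([np.dot(data[lag:], data[:len(data) - lag])
                   for lag in range(1, len(data))])

print(full[mid + 1:])  # [21.2425 17.285  13.4525  9.8075  6.4125  3.33  ]
print(manual)          # same values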

Pandas Series autocorr method: Documentation

Pandas' autocorr method computes the Pearson correlation coefficient between the Series and a shifted (lagged) version of itself, using only the overlapping observations; each of the two segments is demeaned with its own mean and scaled by its own standard deviation. Looping over lags therefore gives the autocorrelation at lag 0 (which is always 1.0) and at the positive lags.
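
For example (a minimal sketch with the question's data at lag 1), series.autocorr(1) is just the Pearson correlation between the overlapping parts of the series and its shifted copy:

import numpy as np
import pandas as pd

data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]
series = pd.Series(data)

lag = 1
print(series.autocorr(lag=lag))      # 0.9374...

# Same value by hand: keep only the overlapping pairs and apply the
# Pearson formula, i.e. each segment uses its own mean and std.
x = np.asarray(data[lag:])           # series
y = np.asarray(data[:-lag])          # lagged copy
print(np.corrcoef(x, y)[0, 1])       # 0.9374...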

Statsmodels acf function: Documentation

Statsmodels' acf function calculates the autocorrelation function (ACF): it subtracts the overall sample mean, computes the autocovariances, and normalizes them by the lag-0 autocovariance (the variance). By default it uses the biased estimator, which divides each autocovariance by the number of observations n (pass adjusted=True for the n-k denominator); fft=True only selects the FFT-based way of computing it. The output includes the autocorrelation at lag 0 and the positive lags.
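
A minimal sketch of that normalization, reproducing the statsmodels output by hand under the default (biased) estimator:

import numpy as np
from statsmodels.tsa.stattools import acf

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
n = len(data)

print(acf(data, nlags=n - 1, fft=True))

# Same values by hand: subtract the overall mean once, divide each
# lagged sum of products by n, then normalize by the lag-0 value.
centered = data - data.mean()
autocov = np.array([np.dot(centered[lag:], centered[:n - lag]) / n
                    for lag in range(n)])
print(autocov / autocov[0])   # matches the acf output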

  1. Why are you seeing different results? Different implementations (see the docs above for details)!
  2. Which one is correct depends on what you mean by "correct" in the context of your analysis.
Cem Koçak

Your question got me really curious, so I did some research and I want to share my findings with you:

  • The different autocorrelation results come from differences in the algorithms and conventions these libraries use to compute correlations.
  • There is no universal formulation of autocorrelation, so each library has its own approach. I suggest you take a look at the following for a better understanding: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html (the docs define correlate as a raw, un-normalized sum of products)

from: https://github.com/pandas-dev/pandas/blob/v2.2.0/pandas/core/series.py#L3115-L3158

 def autocorr(self, lag: int = 1) -> float:
        """
        Compute the lag-N autocorrelation.

        This method computes the Pearson correlation between
        the Series and its shifted self.

        Parameters
        ----------
        lag : int, default 1
            Number of lags to apply before performing autocorrelation.

        Returns
        -------
        float
            The Pearson correlation between self and self.shift(lag).

        See Also
        --------
        Series.corr : Compute the correlation between two Series.
        Series.shift : Shift index by desired number of periods.
        DataFrame.corr : Compute pairwise correlation of columns.
        DataFrame.corrwith : Compute pairwise correlation between rows or
            columns of two DataFrame objects.

        Notes
        -----
        If the Pearson correlation is not well defined return 'NaN'.

        Examples
        --------
        >>> s = pd.Series([0.25, 0.5, 0.2, -0.05])
        >>> s.autocorr()  # doctest: +ELLIPSIS
        0.10355...
        >>> s.autocorr(lag=2)  # doctest: +ELLIPSIS
        -0.99999...

        If the Pearson correlation is not well defined, then 'NaN' is returned.

        >>> s = pd.Series([1, 0, 0, 0])
        >>> s.autocorr()
        nan
        """
        return self.corr(cast(Series, self.shift(lag)))
Josef

Quoting the abstract of the first article a search turns up:

" We consider the estimation of the covariance structure in time series for which the classical conditions of both mean and variance stationary may not be satisfied. It is well known that the classical estimators of the autocovariance are biased even when the process is stationary; even for series of length 100–200 this bias can be surprisingly large. When the process is not mean stationary these estimators become hopelessly biased. When the process is not variance stationary the autocovariance is not even defined. "

from https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9884.00101

Small differences in how the estimator is defined can then have a large impact.

For example:
AFAIR, statsmodels subtracts the overall mean of the series before taking lags.
AFAICS, pandas subtracts the mean of each lagged segment separately. This means that for a series with a positive trend the lagged segments have lower means, which distorts the covariance and correlation computation (see the sketch below).
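
A minimal sketch that makes the two conventions explicit for the question's data at lag 1 (the libraries' internals may differ in detail, but the numbers line up):

import numpy as np

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
lag = 1
x, y = data[lag:], data[:-lag]

# pandas-style: Pearson r, each overlapping segment demeaned with its own mean
r_pandas = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# statsmodels-style: one overall mean, normalized by the lag-0 autocovariance
c = data - data.mean()
r_sm = np.dot(c[lag:], c[:-lag]) / np.dot(c, c)

print(r_pandas)   # ~0.937, matches pd.Series(data).autocorr(1)
print(r_sm)       # ~0.391, matches acf(data)[1]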

In a large sample with a stationary mean, the means of the lagged series all converge to the same estimate.
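
For illustration (a sketch using an assumed white-noise series, not data from the original post), the two estimates are nearly identical once the series is long and mean-stationary:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
x = rng.normal(size=5000)            # long, mean-stationary series (assumed example)
s = pd.Series(x)

print(s.autocorr(lag=1))             # pandas: Pearson r with the shifted series
print(acf(x, nlags=1, fft=True)[1])  # statsmodels: biased ACF estimate
# For a series this long the two values are nearly identical.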

We can always compute the empirical correlation between two samples, but without additional conditions we do not get the time-series interpretation of autocorrelation.