Why am I getting different autocorrelation results from different libraries?

  1. Why am I getting different autocorrelation results from different libraries?
  2. Which one is correct?
import numpy as np
from scipy import signal

# Given data
data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

# Compute the autocorrelation using scipy's correlate function
autocorrelations = signal.correlate(data, data, mode='full')

# The middle of the autocorrelations array is at index len(data)-1
mid_index = len(data) - 1

# Show autocorrelation values for lag=1,2,3,4,...
print(autocorrelations[mid_index + 1:])

Output:

[21.2425 17.285  13.4525  9.8075  6.4125  3.33  ]

import pandas as pd

# Given data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Convert data to pandas Series
series = pd.Series(data)

# Compute and print the autocorrelation for lags 0 through 5
# (at lag 6 only one pair of values overlaps, so the correlation is undefined)
for lag in range(len(data) - 1):
    print(series.autocorr(lag=lag))

Output:

1.0
0.9374115462038415
0.9287843240596312
0.9260849979667674
0.9407970411588671
0.9999999999999999

from statsmodels.tsa.stattools import acf

# Your data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Calculate the autocorrelation using the acf function
autocorrelation = acf(data, nlags=len(data)-1, fft=True)

# Display the autocorrelation coefficients for lags 0, 1, 2, ...
print(autocorrelation)

Output:

[ 1.       0.39072553  0.13718689 -0.08148897 -0.24787067 -0.3445268 -0.35402598]

There are 3 answers

mrk (Best answer)

"They are likely each correct according their chosen definition of autocorrelation. Edge discontinuity effects at start and end of the array dominates short runs of data." - @Martin Brown in the comments.

Scipy's correlate function: Documentation

Scipy's signal.correlate function computes the cross-correlation of two sequences. Since you pass the same data for both inputs, it calculates the autocorrelation. Crucially, the result is the raw sliding sum of products: the data are not mean-centered and nothing is normalized. With mode='full' you get the values for every lag from -(N-1) to N-1, and you are extracting the positive lags.
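
As a sanity check (a minimal sketch using the data from the question), the positive-lag values scipy returns are just the raw sums of products of the overlapping segments:

import numpy as np
from scipy import signal

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
mid = len(data) - 1

# Full (raw) autocorrelation from scipy
full = signal.correlate(data, data, mode='full')

# Reproduce the positive-lag values by hand as plain dot products
# of the two overlapping segments at each lag.
manual = np.array([np.dot(data[lag:], data[:len(data) - lag])
                   for lag in range(1, len(data))])

print(full[mid + 1:])  # [21.2425 17.285  13.4525  9.8075  6.4125  3.33  ]
print(manual)          # same values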

Pandas Series autocorr method: Documentation

Pandas' autocorr method computes the Pearson correlation coefficient between the Series and a shifted (lagged) version of itself, using only the overlapping observations; each of the two segments is demeaned with its own mean and scaled by its own standard deviation. Looping over lags therefore gives the autocorrelation at lag 0 (which is always 1.0) and at the positive lags.
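
For example (a minimal sketch with the question's data at lag 1), series.autocorr(1) is just the Pearson correlation between the overlapping parts of the series and its shifted copy:

import numpy as np
import pandas as pd

data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]
series = pd.Series(data)

lag = 1
print(series.autocorr(lag=lag))      # 0.9374...

# Same value by hand: keep only the overlapping pairs and apply the
# Pearson formula, i.e. each segment uses its own mean and std.
x = np.asarray(data[lag:])           # series
y = np.asarray(data[:-lag])          # lagged copy
print(np.corrcoef(x, y)[0, 1])       # 0.9374...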

Statsmodels acf function: Documentation

Statsmodels' acf function calculates the autocorrelation function (ACF): it subtracts the overall sample mean, computes the autocovariances, and normalizes them by the lag-0 autocovariance (the variance). By default it uses the biased estimator, which divides each autocovariance by the number of observations n (pass adjusted=True for the n-k denominator); fft=True only selects the FFT-based way of computing it. The output includes the autocorrelation at lag 0 and the positive lags.
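
A minimal sketch of that normalization, reproducing the statsmodels output by hand under the default (biased) estimator:

import numpy as np
from statsmodels.tsa.stattools import acf

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
n = len(data)

print(acf(data, nlags=n - 1, fft=True))

# Same values by hand: subtract the overall mean once, divide each
# lagged sum of products by n, then normalize by the lag-0 value.
centered = data - data.mean()
autocov = np.array([np.dot(centered[lag:], centered[:n - lag]) / n
                    for lag in range(n)])
print(autocov / autocov[0])   # matches the acf output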

  1. Why are you seeing different results? Different implementations (see the docs above for details)!
  2. Which one is correct depends on what you mean by "correct" in the context of your analysis.
Cem Koçak

Your question got me really curious, so I did some research and I want to share my findings with you:

  • The different autocorrelation results come from differences in the algorithms and conventions these libraries use to compute correlations.
  • There is no universal formulation of autocorrelation, so each library has its own approach. I suggest you take a look at the following for a better understanding: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html (the docs define correlate as a raw, un-normalized sum of products)

from: https://github.com/pandas-dev/pandas/blob/v2.2.0/pandas/core/series.py#L3115-L3158

 def autocorr(self, lag: int = 1) -> float:
        """
        Compute the lag-N autocorrelation.

        This method computes the Pearson correlation between
        the Series and its shifted self.

        Parameters
        ----------
        lag : int, default 1
            Number of lags to apply before performing autocorrelation.

        Returns
        -------
        float
            The Pearson correlation between self and self.shift(lag).

        See Also
        --------
        Series.corr : Compute the correlation between two Series.
        Series.shift : Shift index by desired number of periods.
        DataFrame.corr : Compute pairwise correlation of columns.
        DataFrame.corrwith : Compute pairwise correlation between rows or
            columns of two DataFrame objects.

        Notes
        -----
        If the Pearson correlation is not well defined return 'NaN'.

        Examples
        --------
        >>> s = pd.Series([0.25, 0.5, 0.2, -0.05])
        >>> s.autocorr()  # doctest: +ELLIPSIS
        0.10355...
        >>> s.autocorr(lag=2)  # doctest: +ELLIPSIS
        -0.99999...

        If the Pearson correlation is not well defined, then 'NaN' is returned.

        >>> s = pd.Series([1, 0, 0, 0])
        >>> s.autocorr()
        nan
        """
        return self.corr(cast(Series, self.shift(lag)))
Josef

Quoting the abstract of the first article a search turns up:

" We consider the estimation of the covariance structure in time series for which the classical conditions of both mean and variance stationary may not be satisfied. It is well known that the classical estimators of the autocovariance are biased even when the process is stationary; even for series of length 100–200 this bias can be surprisingly large. When the process is not mean stationary these estimators become hopelessly biased. When the process is not variance stationary the autocovariance is not even defined. "

from https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9884.00101

Small differences in how the estimator is defined can then have a large impact.

For example:
AFAIR, statsmodels subtracts the overall mean of the series before taking lags.
AFAICS, pandas subtracts the mean of each lagged segment separately. This means that for a series with a positive trend the lagged segments have lower means, which distorts the covariance and correlation computation (see the sketch below).
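
A minimal sketch that makes the two conventions explicit for the question's data at lag 1 (the libraries' internals may differ in detail, but the numbers line up):

import numpy as np

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])
lag = 1
x, y = data[lag:], data[:-lag]

# pandas-style: Pearson r, each overlapping segment demeaned with its own mean
r_pandas = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# statsmodels-style: one overall mean, normalized by the lag-0 autocovariance
c = data - data.mean()
r_sm = np.dot(c[lag:], c[:-lag]) / np.dot(c, c)

print(r_pandas)   # ~0.937, matches pd.Series(data).autocorr(1)
print(r_sm)       # ~0.391, matches acf(data)[1]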

In a large sample with a stationary mean, the means of the lagged series all converge to the same estimate.
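
For illustration (a sketch using an assumed white-noise series, not data from the original post), the two estimates are nearly identical once the series is long and mean-stationary:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
x = rng.normal(size=5000)            # long, mean-stationary series (assumed example)
s = pd.Series(x)

print(s.autocorr(lag=1))             # pandas: Pearson r with the shifted series
print(acf(x, nlags=1, fft=True)[1])  # statsmodels: biased ACF estimate
# For a series this long the two values are nearly identical.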

We can always compute the empirical correlation between two samples, but without additional conditions we do not get the time-series interpretation of autocorrelation.