I obtained two sets of data from a Monte Carlo simulation of polymer movement.
One is a list of $r^2_{\text{end-to-end}}$ values:
#r-end-squared
58617.14512069625
55606.1082220041
51244.724846185505
65344.140277928964
63392.17236605068
62101.375689808614
58814.8832777033
60040.00627213609
52019.130880313125
56984.44642212785
... ... ... ... ...
and the other is a list of $\overrightarrow{r}_{\text{end-to-end}}$ vectors:
# r_end-X r_end-Y r_end-Z
-177.236 100.309 -130.930
-184.354 88.047 -117.760
-172.577 87.168 -117.745
-197.651 103.270 -124.953
-190.053 104.223 -128.100
-187.985 102.387 -127.593
-190.839 91.210 -118.643
-193.851 98.069 -113.333
-177.084 83.960 -116.667
-178.312 92.759 -128.782
... ... ... ... ...
The autocorrelation formula used in the case of $r^2_{\text{end-to-end}}$ is as follows:
$$ R(t) = \frac{1}{(n - t) \cdot \sigma^2} \sum_{i=0}^{n-t-1} (X_i - \mu)(X_{i+t} - \mu) $$
where:
- $n$ is the number of observations in the dataset,
- $t$ is the lag, or the distance between the compared points,
- $\mu$ is the mean of the dataset,
- $\sigma^2$ is the variance of the dataset,
- $X_i$ and $X_{i+t}$ are the data points at indexes $i$ and $i+t$ respectively.
The autocorrelation value $R(t)$ for each lag $t$ is calculated by summing the products of the mean-adjusted data points separated by the lag, and then normalizing the sum by the variance times the number of terms in the sum, $n - t$. This is done for each lag from 0 up to the specified number of lags (numLag), capped at the number of data points ($n$). The computed autocorrelation values are then filtered against the specified thresholds (thresholdMin and thresholdMax).
public (List<double> lags, List<double> autocorrelationValues) AutoCorrelationPointsR2(
    double thresholdMin = 0.0001, double thresholdMax = 0.9999, int numLag = 1000)
{
    var autocorrelationValues = new List<double>();
    var lags = new List<double>();
    int n = data_.Count;
    double varianceValue = variance_.GetValueOrDefault();
    if (numLag > n)
    {
        numLag = n; // Ensure the number of lags does not exceed the data count.
    }
    // Subtract the mean once, so each lag only needs products of centered values.
    double[] centeredData = data_.Select(x => x - mean_.Value).ToArray();
    double[] autocorrelations = new double[numLag];
    // Each lag is independent, so the sums can be computed in parallel.
    Parallel.For(0, numLag, t =>
    {
        double sumProduct = 0.0;
        if (t == 0)
        {
            // For lag 0, the sum of products is simply the sum of squared centered data.
            sumProduct = centeredData.Sum(x => x * x);
        }
        else
        {
            // For lag t, sum the products of centered points separated by t.
            for (int i = 0; i < n - t; i++)
            {
                sumProduct += centeredData[i] * centeredData[i + t];
            }
        }
        double autocorrValue = sumProduct / (n - t);
        // Normalize by the variance so that R(0) = 1 (guard against zero variance).
        if (varianceValue > 0)
        {
            autocorrValue /= varianceValue;
        }
        autocorrelations[t] = autocorrValue;
    });
    // Keep only the lags whose autocorrelation falls inside the thresholds.
    for (int t = 0; t < numLag; t++)
    {
        double autocorrValue = autocorrelations[t];
        if (autocorrValue >= thresholdMin && autocorrValue <= thresholdMax)
        {
            autocorrelationValues.Add(autocorrValue);
            lags.Add(t);
        }
    }
    return (lags, autocorrelationValues);
}
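For reference, mean_ and variance_ are assumed here to be the sample mean and the divide-by-n ("biased") variance of the $r^2$ series; that convention is what makes $R(0) = 1$ above. A minimal sketch of how those fields could be filled (hypothetical, since that code is not shown):

// Hypothetical initialization of the cached fields used by AutoCorrelationPointsR2;
// the divide-by-n variance matches the normalization R(0) = 1.
mean_ = data_.Average();
double mu = mean_.Value;
variance_ = data_.Sum(x => (x - mu) * (x - mu)) / data_.Count;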
Here's how I adapted the autocorrelation function for the vector dataset:

- Calculate the mean vector $\boldsymbol{\mu}$ by averaging each component of the vectors separately:
$$ \boldsymbol{\mu} = \left( \frac{1}{n}\sum_{i=0}^{n-1} x_i, \frac{1}{n}\sum_{i=0}^{n-1} y_i, \frac{1}{n}\sum_{i=0}^{n-1} z_i \right) $$
- Calculate the autocorrelation function for the vector dataset using the dot product to get a scalar autocorrelation value for each lag $t$:
$$ R(t) = \frac{1}{(n - t) \cdot \sigma^2} \sum_{i=0}^{n-t-1} \left( \mathbf{X}_i - \boldsymbol{\mu} \right) \cdot \left( \mathbf{X}_{i+t} - \boldsymbol{\mu} \right) $$
where $\mathbf{X}_i$ is the vector at index $i$, and $\sigma^2$ is the total variance of the vectors, i.e. the mean squared deviation from the mean vector:
$$ \sigma^2 = \frac{1}{n} \sum_{i=0}^{n-1} \left| \mathbf{X}_i - \boldsymbol{\mu} \right|^2 $$
Here, $| \mathbf{X}_i - \boldsymbol{\mu} |^2$ is the squared magnitude of the vector difference $\mathbf{X}_i - \boldsymbol{\mu}$.
In the autocorrelation function, the dot product in the summands gives a scalar value, as the dot product of two vectors is a scalar. This is appropriate because it reduces the vector correlation to a single number per lag, capturing how well the end-to-end vectors stay aligned in both direction and magnitude.
Now, the correction to the formula is in how $\sigma^2$ is interpreted. In the scalar case it is simply the variance of the dataset; for vectors you are dealing with magnitudes, so you need the average squared distance from the mean vector, as shown above.
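As a sanity check (a standard identity, not specific to this dataset), this $\sigma^2$ equals the sum of the per-component variances, i.e. the trace of the covariance matrix:
$$ \sigma^2 = \sigma_x^2 + \sigma_y^2 + \sigma_z^2 $$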
// Requires: using System; using System.Collections.Concurrent; using System.Linq;
// using System.Numerics; using System.Threading.Tasks.
// Note: System.Numerics.Vector3 stores single-precision floats.
public static (List<double> lags, List<double> autocorrelationValues) AutoCorrelationVec3(
    List<Vector3> data, double thresholdMin = 0.0001, double thresholdMax = 0.9999, int numLag = 1000)
{
    int n = data.Count;
    Vector3 mean = new Vector3(0, 0, 0);
    double variance = 0;
    // Calculate the mean vector component by component.
    foreach (Vector3 vec in data)
    {
        mean.X += vec.X;
        mean.Y += vec.Y;
        mean.Z += vec.Z;
    }
    mean.X /= n;
    mean.Y /= n;
    mean.Z /= n;
    // Pre-calculate the differences from the mean and accumulate the total
    // variance (mean squared deviation from the mean vector).
    Vector3[] diff = new Vector3[n];
    for (int i = 0; i < n; i++)
    {
        diff[i] = data[i] - mean;
        variance += Vector3.Dot(diff[i], diff[i]);
    }
    variance /= n;
    // Use a concurrent collection to hold the results from the parallel loop.
    var results = new ConcurrentBag<(double lag, double autocorrelation)>();
    // Each lag is independent, so compute the sums in parallel.
    Parallel.For(0, Math.Min(numLag, n), t =>
    {
        double autocorrelation = 0;
        for (int i = 0; i < n - t; i++)
        {
            autocorrelation += Vector3.Dot(diff[i], diff[i + t]);
        }
        autocorrelation /= ((n - t) * variance);
        if (autocorrelation >= thresholdMin && autocorrelation <= thresholdMax)
        {
            results.Add((t, autocorrelation));
        }
    });
    // The bag is unordered, so sort the results by lag.
    var sortedResults = results.OrderBy(r => r.lag).ToList();
    List<double> lags = sortedResults.Select(r => r.lag).ToList();
    List<double> autocorrelationValues = sortedResults.Select(r => r.autocorrelation).ToList();
    return (lags, autocorrelationValues);
}
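A usage sketch for the vector routine (the file name, header handling, and whitespace splitting are assumptions based on the three-column listing above):

// Hypothetical driver: parse the three-column r_end file and run the vector ACF.
// Requires: using System; using System.Globalization; using System.IO;
// using System.Linq; using System.Numerics.
var data = File.ReadLines("r_end.dat")
    .Where(line => !line.TrimStart().StartsWith("#"))     // skip the header comment
    .Select(line => line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
    .Select(p => new Vector3(
        float.Parse(p[0], CultureInfo.InvariantCulture),
        float.Parse(p[1], CultureInfo.InvariantCulture),
        float.Parse(p[2], CultureInfo.InvariantCulture)))
    .ToList();
var (lags, acf) = AutoCorrelationVec3(data);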
Both autocorrelation outputs were then fitted with an exponential decay, $y = A_0 \exp(-x/t_0)$.
The curve fits also give different results:
scalar case:
0.882774268144487 157.919249303066
0.878763993621615 677.348343962392
0.911347264318643 1350.38650246438
vector case:
0.921929100391233 674.795111783111
0.945989572924983 2343.60182994056
0.963686736797716 4307.35583098765
// Requires: using MathNet.Numerics.LinearAlgebra;
// using MathNet.Numerics.LinearAlgebra.Double; using MathNet.Numerics.Optimization.
public static (List<double> XFit, List<double> YFit, List<double> MinimizingPoint) Fit(List<double> xData, List<double> yData)
{
    var xDataDense = new DenseVector(xData.ToArray());
    var yDataDense = new DenseVector(yData.ToArray());
    // Model y = p0 * exp(p1 * x); for a decay y = A0 * exp(-x / t0), p1 = -1 / t0.
    Vector<double> Model(Vector<double> parameters, Vector<double> x)
    {
        var y = CreateVector.Dense<double>(x.Count);
        for (int i = 0; i < x.Count; i++)
        {
            y[i] = parameters[0] * Math.Exp(parameters[1] * x[i]);
        }
        return y;
    }
    // Initial guess; note that a positive p1 corresponds to growth, so a
    // negative starting value may converge more reliably on decaying data.
    var start = new DenseVector(new double[] { 1.0, 0.1 });
    var objective = ObjectiveFunction.NonlinearModel(Model, xDataDense, yDataDense);
    var solver = new LevenbergMarquardtMinimizer(maximumIterations: 10000);
    var result = solver.FindMinimum(objective, start);
    Vector<double> points = result.MinimizedValues;      // fitted y values
    Vector<double> minimizing = result.MinimizingPoint;  // fitted (p0, p1)
    Console.WriteLine($"Reason for exit: {result.ReasonForExit}");
    return (xData, new List<double>(points.ToArray()), new List<double>(minimizing.ToArray()));
}
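To recover the decay time from the fit, the second parameter has to be mapped back via $t_0 = -1/p_1$, since the model is written as $y = p_0 e^{p_1 x}$. A sketch (variable names here are placeholders):

// Hypothetical post-processing: Fit() returns the fitted curve and the
// minimizing parameters (p0, p1) of y = p0 * exp(p1 * x).
var (xFit, yFit, p) = Fit(lags, acf);
double a0 = p[0];
double t0 = -1.0 / p[1];   // decay time; only meaningful if the fitted p1 is negative
Console.WriteLine($"A0 = {a0}, t0 = {t0}");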
Then I draw plots of polymer length vs. $\tau$ (the fitted decay time $t_0$).
The autocorrelation formula used in the case of $r^2_{\text{end-to-end}}$ works fine and gives the plots I expect.
However, the formula used in the case of $\overrightarrow{r}_{\text{end-to-end}}$ does not give similar plots.
So, what am I missing in the vector case?



Your raw r^2 values are formatted f18.12, which means you have effectively used the full double precision of your data (assuming that it was double precision).
Your vector r_x, r_y, r_z values are formatted f8.3, so when you convert them into an r^2 or correlate them there is a rounding error. A quick test in a spreadsheet shows the two datasets are broadly consistent, but the vector form has inferior precision. I don't think you are missing anything much other than significant digits in the vector data.
A quick scratchpad calculation shows you have only got about 6 good decimal digits in the vector-derived quantities (if you are working from the same table of saved values).
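A minimal sketch of that precision argument in C# (the full-precision components are invented for illustration; only the 3-decimal rounding matters):

// Round the components to 3 decimals (as in the saved f8.3 file) and compare
// the reconstructed r^2 against the full-precision value. With components of
// magnitude ~100-200 and rounding error up to 5e-4 each, the error in r^2 is
// of order 2*|r|*5e-4, i.e. only ~6 significant digits survive.
using System;

class PrecisionCheck
{
    static void Main()
    {
        double x = -177.236451, y = 100.309127, z = -130.930488; // hypothetical full precision
        double r2Full = x * x + y * y + z * z;
        double xr = Math.Round(x, 3), yr = Math.Round(y, 3), zr = Math.Round(z, 3);
        double r2Rounded = xr * xr + yr * yr + zr * zr;
        Console.WriteLine($"full:     {r2Full:F12}");
        Console.WriteLine($"rounded:  {r2Rounded:F12}");
        Console.WriteLine($"abs diff: {Math.Abs(r2Full - r2Rounded):E3}");
    }
}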
In addition, something else is wrong with the fitted line in Figure 2: the Mk 1 eyeball says that line is not a fit to the points plotted on the graph.