Combining daily Welford computed variance into monthly

372 views Asked by At

I'm successfully using Welford's method to compute running variance and standard deviation as described many times on Stack Overflow and John D Cook's excellent blog post.

I store the timestamp, count, sum, and running calculation of "Sk" and stdev in a database table PER DAY.

I would like to combine or rollup the daily computed count,sum, and Sk values get a MONTHLY standard deviation.

John D Cook's blog has another post that provides an algorithm to combine two "RunningStats" into 1 (see operator+ method). This works for combining 2 days into 1. I could use this to iterate through the days to combine all days in the month. However, unlike calculating daily stdev where I have a large number of samples that must be dealt with in a streaming fashion, I have access to all the daily data, and would like a single formula to combine all days in the month at once. That would lend itself to creation of a database view.

It does not appear that simply summing up the Sk values and dividing by the total monthly count - 1 produces accurate variance.

Example Data:

DATE, COUNT, SUM, Sk, STDDEV
1-Jun-15, 60, 514, 1556.733336, 5.14
2-Jun-15, 51, 455, 1523.686274, 5.52
3-Jun-15, 61, 556, 1494.196722, 4.99
...
1

There are 1 answers

0
David Eisenstat On

Let x1, ..., xn be the values for one day. Then, as I understand it, Sk is defined (mathematically; the rolling implementation is different) as follows.

Mk = (sum_{i=1}^n xi) / n     [Mk = SUM / COUNT]
Sk = sum_{i=1}^n (xi - Mk)^2.

The problem with summing the Sk column is that each value is computed with respect to a different mean, so the overall variance is an underestimate. Instead, we should have a term like

sum_{i=1}^n (xi - (Mk + delta))^2,

which we rewrite in terms of the existing quantities.

  sum_{i=1}^n (xi - (Mk + delta))^2
= sum_{i=1}^n (xi - Mk - delta)^2
= sum_{i=1}^n ((xi - Mk)^2 - 2 (xi - Mk) delta + delta^2)
= Sk + n delta^2.

Here delta is the monthly mean minus the daily mean. Once again, I do not warrant numerical stability.