I'm successfully using Welford's method to compute running variance and standard deviation as described many times on Stack Overflow and John D Cook's excellent blog post.
I store the timestamp, count, sum, and running calculation of "Sk" and stdev in a database table PER DAY.
I would like to combine or rollup the daily computed count,sum, and Sk values get a MONTHLY standard deviation.
John D Cook's blog has another post that provides an algorithm to combine two "RunningStats" into 1 (see operator+ method). This works for combining 2 days into 1. I could use this to iterate through the days to combine all days in the month. However, unlike calculating daily stdev where I have a large number of samples that must be dealt with in a streaming fashion, I have access to all the daily data, and would like a single formula to combine all days in the month at once. That would lend itself to creation of a database view.
It does not appear that simply summing up the Sk values and dividing by the total monthly count - 1 produces accurate variance.
Example Data:
DATE, COUNT, SUM, Sk, STDDEV
1-Jun-15, 60, 514, 1556.733336, 5.14
2-Jun-15, 51, 455, 1523.686274, 5.52
3-Jun-15, 61, 556, 1494.196722, 4.99
...
Let
x1, ..., xn
be the values for one day. Then, as I understand it,Sk
is defined (mathematically; the rolling implementation is different) as follows.The problem with summing the
Sk
column is that each value is computed with respect to a different mean, so the overall variance is an underestimate. Instead, we should have a term likewhich we rewrite in terms of the existing quantities.
Here
delta
is the monthly mean minus the daily mean. Once again, I do not warrant numerical stability.