Get the Population Standard Deviation of streaming input data


So let's say I have a sensor that's giving me a number, say the local temperature, or whatever really, every 1/100th of a second.

So in a second I've filled up an array with a hundred numbers.

What I want to do is, over time, build a statistical model, most likely a bell curve, of this streaming data so I can get its population standard deviation.
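For reference (in my own notation, not anything from the sensor docs), the population standard deviation of $N$ readings $x_1, \dots, x_N$ is

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2},$$

where $\mu$ is the mean of all $N$ readings, not some running mean that changes as readings arrive.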

Now on a computer with a lot of storage this won't be an issue, but on something small like a Raspberry Pi, or any microprocessor, storing all the numbers generated by multiple sensors over a period of months becomes very unrealistic.

When I looked at the math of getting the standard deviation, I thought of simply storing a few numbers:

The total running sum of all the numbers so far, the count of numbers, and lastly a running sum of (each number - the current mean)^2.

Using this, whenever I get a new number, I would simply add one to the count, add the number to the running sum, compute the new mean, add (new number - new mean)^2 to the running sum of squared deviations, divide that by the count, and take the square root to get the new standard deviation.
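As a rough sketch of that bookkeeping (Python, with names of my own choosing, and carrying the flaws discussed below):

```python
class NaiveRunningStd:
    """The approach described above: keep only a count, a running sum,
    and a running sum of (reading - mean at the time)^2."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.sum_sq_dev = 0.0   # running sum of (x - current mean)^2

    def add(self, x):
        self.count += 1
        self.total += x
        mean = self.total / self.count             # new mean
        self.sum_sq_dev += (x - mean) ** 2         # each term uses a *different* mean
        return (self.sum_sq_dev / self.count) ** 0.5   # "population" std dev estimate
```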

There are a few problems with this approach, however:

It would take roughly 476 years to overflow the running sum of incoming readings, assuming the values are temperatures averaging around 60 degrees Fahrenheit and arriving at 100 Hz.

The same level of confidence cannot be held for the running sum of (number - mean)^2, since it is a sum of squared values.

Most importantly, this approach is highly inaccurate, since each squared deviation is computed against whatever the mean happens to be at that moment rather than the final mean, which undermines the mathematical meaning of a standard deviation, especially a population standard deviation.
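To spell that out in my own notation: after $N$ readings the approach above has accumulated

$$\sum_{i=1}^{N}(x_i - \mu_i)^2, \qquad \mu_i = \frac{1}{i}\sum_{j=1}^{i}x_j,$$

whereas the population variance needs every term measured against the same final mean,

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_N)^2,$$

and the earlier terms are never corrected once the mean moves.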

If you believe a population standard deviation is impossible to achieve, then how should I go about a sample standard deviation? Taking every nth number will still result in the same problems.

I also don't want to limit my data set to a time window (e.g. a model built from only the last 24 hours of sensor data), since I want the statistical model to be representative of the sensor data over a long period of time, namely a year, and if I have to wait a year to do testing and debugging, or even to get a usable model, I won't be having fun.

Is there any sort of mathematical workaround to get a population, or at least a sample, standard deviation of an ever-growing set of numbers without actually storing that set (since that would be impossible), while still being able to accurately detect when a value is multiple standard deviations away from the mean?

The closest answer I've seen is wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm, but I have no idea what it is saying, or whether it requires storing the whole set of numbers.

Thank you!


There is 1 answer.

Answer by MBo:

The linked page shows code, and it is clear that you need to store only 3 variables: the number of samples so far, the current mean, and the running sum of squared differences from the mean.
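A minimal sketch of that online update (Welford's algorithm, the one the Wikipedia link describes) in Python, with class and variable names of my own choosing:

```python
class WelfordStd:
    """Welford's online algorithm: only three values are stored,
    no matter how many samples arrive."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean of all samples
        self.m2 = 0.0     # running sum of squared differences from the mean

    def add(self, x):
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # The second factor uses the *updated* mean; this correction keeps m2
        # equal to sum((x_i - current mean)^2) without revisiting old samples.
        self.m2 += delta * (x - self.mean)

    def population_std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else float("nan")

    def sample_std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else float("nan")


# Example: feed readings as they arrive at 100 Hz; memory use never grows.
stats = WelfordStd()
for reading in (59.8, 60.1, 60.4, 59.9):   # stand-in sensor values
    stats.add(reading)
print(stats.mean, stats.population_std())
```

No past samples are ever stored or revisited, so it runs fine on a Raspberry Pi or a microcontroller, and you can read off either the population or the sample standard deviation at any time.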