Calculate standard deviation incrementally, one batch at a time

In this article, I would like to present a way to calculate standard deviation incrementally. Unlike existing articles, which show how to update the calculation when adding one more element, I will show a method to calculate the standard deviation (std.dev) of an existing batch of data combined with an additional batch of data.

Motivation - suppose you have an ETL system, for example one built on Apache Spark, that processes a batch of data every day, and you need to display statistics such as std.dev over all of the data. If you calculate it over the entire data set each time, the calculation becomes increasingly expensive, since the data set keeps growing and every run takes more and more time.

The definition of the (population) standard deviation of $n$ elements $x_1, \dots, x_n$ with average $\mu$ is:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

So once additional data is added, the entire formula has to be re-calculated. The "hard" parts, where we do a full scan of the data, are the average and the squared difference from the mean, and both have to be recomputed over all of the data.

However, if we have the average, standard deviation and count of the existing data (denoted as $a$) and of the additional batch (denoted as $b$), then we can do the following for the average.

From the average definition:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and for the existing data together with the additional batch:

$$\mu_{total} = \frac{n_a \mu_a + n_b \mu_b}{n_a + n_b}$$

[1] Average for the total data

So the total average can be calculated as the sum of the products of each average with its number of elements, divided by the total number of elements.
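As a quick sanity check of [1], here is a minimal Python sketch; the helper combined_mean and the sample lists are my own names, not from the article:

```python
def combined_mean(n_a, mean_a, n_b, mean_b):
    """Average of the union of two batches from their counts and averages, as in [1]."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0]
# The combined average matches a direct computation over all the data.
assert abs(combined_mean(len(a), sum(a) / len(a), len(b), sum(b) / len(b))
           - sum(a + b) / len(a + b)) < 1e-9
```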

For the standard deviation, let us denote the squared difference from the mean as:

$$SS = \sum_{i=1}^{n}(x_i - \mu)^2 = n\sigma^2$$

Additionally, expanding the square and using $\sum_{i=1}^{n} x_i = n\mu$ gives:

$$\sum_{i=1}^{n} x_i^2 = SS + n\mu^2 = n(\sigma^2 + \mu^2)$$

[2] Sum of squares

The squared difference from the mean for the total data is:

$$SS_{total} = \sum_{i \in a \cup b}(x_i - \mu_{total})^2 = \sum_{i \in a \cup b} x_i^2 - (n_a + n_b)\mu_{total}^2$$

Using [2] we can write:

$$SS_{total} = n_a(\sigma_a^2 + \mu_a^2) + n_b(\sigma_b^2 + \mu_b^2) - (n_a + n_b)\mu_{total}^2$$

[3] Squared difference from the mean for the total data
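Identity [2] is what makes the batch summaries sufficient: the sum of squares of a batch can be recovered from its count, average and std.dev alone. Below is a minimal Python sketch of that identity (the helper name sum_of_squares is mine), assuming the population std.dev as computed by statistics.pstdev:

```python
import statistics

def sum_of_squares(n, mean, std):
    """Recover sum(x_i^2) from count, average and population std.dev, as in [2]."""
    return n * (std ** 2 + mean ** 2)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = statistics.fmean(xs)
sigma = statistics.pstdev(xs)
# Both sides of [2] agree on a direct computation.
assert abs(sum_of_squares(len(xs), mu, sigma) - sum(x * x for x in xs)) < 1e-9
```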

The standard deviation for the total data is:

$$\sigma_{total} = \sqrt{\frac{SS_{total}}{n_a + n_b}}$$

By plugging [1] into [3] we get the below expression for the total std.dev, expressed entirely in terms of the counts, averages and standard deviations of the existing data $a$ and the additional batch $b$:

$$\sigma_{total} = \sqrt{\frac{n_a(\sigma_a^2 + \mu_a^2) + n_b(\sigma_b^2 + \mu_b^2)}{n_a + n_b} - \left(\frac{n_a \mu_a + n_b \mu_b}{n_a + n_b}\right)^2}$$
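Putting [1] and [3] together, here is a minimal Python sketch of the combination step; combined_stats and all variable names are mine, and it assumes the population std.dev (matching statistics.pstdev). The assertions verify the incremental result against a direct computation over the full data:

```python
import statistics

def combined_stats(n_a, mean_a, std_a, n_b, mean_b, std_b):
    """Combine (count, average, population std.dev) of two batches using [1] and [3]."""
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n                                   # [1]
    ss = n_a * (std_a ** 2 + mean_a ** 2) + n_b * (std_b ** 2 + mean_b ** 2)   # sums of squares via [2]
    std = (ss / n - mean ** 2) ** 0.5                                          # [3] divided by n, then sqrt
    return n, mean, std

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 12.0, 14.0]
n, mean, std = combined_stats(len(a), statistics.fmean(a), statistics.pstdev(a),
                              len(b), statistics.fmean(b), statistics.pstdev(b))
assert abs(mean - statistics.fmean(a + b)) < 1e-9
assert abs(std - statistics.pstdev(a + b)) < 1e-9
```

In a daily Spark-style pipeline you would persist only the (count, average, std.dev) triple and fold each new batch into it with a function like this, so the work per day stays proportional to the batch size.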
Conclusion - this approach enables us to do statistical calculations incrementally and keeps the calculation time and resources bounded by the batch size rather than by the total data size.

 
