variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()
This section describes the variance()
, var_pop()
, var_samp()
, stddev()
, stddev_pop()
, and stddev_samp()
aggregate functions. They provide a confidence measure for the computed arithmetic mean of a set of values.
Each of these aggregate functions is invoked by using the same syntax:
- either the simple syntax,
select aggregate_fun(expr) from t
- or the
GROUP BY
syntax - or the
OVER
syntax
Only the simple invocation is illustrated in this section. See, for example, the sections GROUP BY
syntax and OVER
syntax in the section avg(), count(), max(), min(), sum()
for how to use these syntax patterns.
Background
The notions "variance" and "standard deviation" are trivially related: the latter is the square root of the former. The variance of a set of N values, v, is defined, naïvely, in terms of the arithmetic mean, a of those values:
variance = ( sum over all "v" of (v - a)^2 ) / N
Statisticians distinguish between the variance and the standard deviation of an entire population and the variance and the standard deviation of a sample of a population. The formulas for computing the "population" variants use the naïve definition of variance. And the formulas for computing the "sample" variants divide by (N - 1) rather than by N.
This example demonstrates that the built-in functions for the "population" and the "sample" variants of variance and standard deviation produce the same values as the text-book formulas that define them. First create a small set of values:
drop table if exists t cascade;
create table t(v numeric primary key);
insert into t(v)
select 100 + s.v*0.01
from generate_series (-5, 5) as s(v);
select to_char(v, '999.99') as v from t order by v;
This is the result:
v
---------
99.95
99.96
99.97
99.98
99.99
100.00
100.01
100.02
100.03
100.04
100.05
Now create a function to test the equality between what the built-in functions produce and what the formulas that define them produce:
drop function if exists fmt(x in numeric) cascade;
drop function if exists f() cascade;
create function fmt(x in numeric)
returns text
language sql
as $body$
select to_char(x, '0.99999999');
$body$;
create function f()
returns table(t text)
language plpgsql
as $body$
declare
sum constant numeric not null := (
select count(v)::numeric from t);
avg constant numeric not null := (
select avg(v) from t);
s constant numeric not null := (
select sum((avg - v)^2) from t);
variance numeric not null := 0;
var_samp numeric not null := 0;
var_pop numeric not null := 0;
stddev numeric not null := 0;
stddev_samp numeric not null := 0;
stddev_pop numeric not null := 0;
begin
select variance(v), var_samp(v), var_pop(v), stddev(v), stddev_samp(v), stddev_pop(v)
into variance, var_samp, var_pop, stddev, stddev_samp, stddev_pop
from t;
assert variance = var_samp, 'unexpected';
assert stddev = stddev_samp, 'unexpected';
assert var_samp = s/(sum - 1), 'unexpected';
assert var_pop = s/sum, 'unexpected';
assert stddev_samp = sqrt(s/(sum - 1)), 'unexpected';
assert stddev_pop = sqrt(s/sum), 'unexpected';
t = 'var_samp: '||fmt(var_samp); return next;
t = 'var_pop: '||fmt(var_pop); return next;
t = 'stddev_samp: '||fmt(stddev_samp); return next;
t = 'stddev_pop: '||fmt(stddev_pop); return next;
t = 'stddev_samp/stddev_pop: '||fmt(stddev_samp/stddev_pop); return next;
end;
$body$;
\t on
select t from f();
\t off
Notice that the semantics of variance()
and var_samp()
are identical; and that the semantics of stddev()
and stddev_samp()
are identical. Each of the assert
statements succeeds and the function produces this result:
var_samp: 0.00110000
var_pop: 0.00100000
stddev_samp: 0.03316625
stddev_pop: 0.03162278
stddev_samp/stddev_pop: 1.04880885
This section assumes that you understand the distinction between the "population" and the "sample" variants and that you know which variant you need for your present purpose.
Signature:
Each one of the "confidence measure" aggregate functions has the same signature:
input value: smallint, int, bigint, numeric, double precision, real
return value: numeric, double precision
Notes: The lists of input and return data types give the distinct kinds. Because, the output of each function is computed by division, the return data type is never one whose values are constrained to be whole numbers. Here are the specific mappings:
INPUT OUTPUT
---------------- ----------------
smallint numeric
int numeric
bigint numeric
numeric numeric
double precision double precision
real double precision
variance()
Purpose: the semantics of variance()
and var_samp()
are identical.
var_pop()
Purpose: Returns the variance of a set of values using the naïve formula (i.e. the "population" variant) that divides by the number of values, N, as explained in the Background section. In other words, it treats the set of values as the entire population of interest.
var_samp()
Purpose: Returns the variance of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values, as explained in the Background section. In other words, it treats the set of values as just a sample of the entire population of interest. The value produced by var_samp()
is bigger than that produced by var_pop()
, reflecting the fact that using only a sample is less reliable than using the entire population.
stddev()
Purpose: the semantics of stddev()
and stddev_samp()
are identical.
stddev_pop()
Purpose: Returns the standard deviation of a set of values using the naïve formula (i.e. the "population" variant) that divides by the number of values, N, as explained in the Background section. In other words, it treats the set of values as the entire population of interest.
stddev_samp()
Purpose: Returns the standard deviation of a set of values using the "sample" variant of the formula that divides by (N - 1) where N is the number of values, as explained in the Background section. In other words, it treats the set of values as just a sample of the entire population of interest. The value produced by stddev_samp()
is bigger than that produced by stddev_pop()`, reflecting the fact that using only a sample is less reliable than using the entire population.
Example
The example uses the function normal_rand()
, brought by the tablefunc extension, to populate the test table:
drop table if exists t cascade;
create table t(v double precision primary key);
do $body$
declare
no_of_rows constant int := 100000;
mean constant double precision := 0.0;
stddev constant double precision := 50.0;
begin
insert into t(v)
select normal_rand(no_of_rows, mean, stddev);
end;
$body$;
Of course, the larger is the value that you choose for "no_of_rows", the closer will be the values returned by the "sample" variants of the confidence measures to the values returned by the "population" variants.
Because the demonstration (for convenience) uses a table with a single double precision
column, "v", this must be the primary key. It's just possible that normal_rand()
will create some duplicate values. However, this is so very rare that it was never seen while the script was repeated, many times, during the development of this code example. If insert into t(v)
does fail because of this, just repeat the script by hand.
Now display the values for avg(v)
, stddev_samp(v)
, stddev_pop(v)
, and the value of stddev_samp(v)/stddev_pop(v)
.
with a as (
select
avg(v) as avg,
stddev_samp(v) as stddev_samp,
stddev_pop(v) as stddev_pop
from t)
select
to_char(avg, '0.999') as avg,
to_char(stddev_samp, '999.999999') as stddev_samp,
to_char(stddev_pop, '999.999999') as stddev_pop,
to_char(stddev_samp/stddev_pop, '90.999999') as "stddev_samp/stddev_pop"
from a;
Because of the pseudorandom nature of normal_rand()
, the values produced will change from run to run. Here are some typical values:
avg | stddev_samp | stddev_pop | stddev_samp/stddev_pop
--------+-------------+-------------+------------------------
0.138 | 49.880052 | 49.879802 | 1.000005