Home > APIs > YSQL > Functions and operators > Aggregate functions > Per function signature and purpose > linear regression >

regr_%()

This section describes these aggregate functions for linear regression analysis:

regr_avgy(), regr_avgx(), regr_count(), regr_slope(), regr_intercept(), regr_r2(), regr_syy(), regr_sxx(), regr_sxy().

Make sure that you have read the Functions for linear regression analysis parent section before reading this section. You will need the same data that the parent section shows you how to create. You will also need the function approx_equal() that was used in the Checking the formulas for covariance and correlation section. Create it like this.

drop function if exists approx_equal(double precision, double precision) cascade;
create function approx_equal(
  n1  in double precision,
  n2  in double precision)
  returns boolean
  language sql
as $body$
  select 2.0::double precision*abs(n1 - n2)/(n1 + n2) < (2e-15)::double precision;
$body$;

regr_avgy(), regr_avgx()

Purpose: regr_avgy() returns the average of the first argument for those rows where both arguments are NOT NULL. regr_avgx() returns the average of the second argument for those rows where both arguments are NOT NULL.

regr_count()

Purpose: returns the number of rows where both arguments are NOT NULL.

regr_avgy(), regr_avgx(), regr_count() semantics demonstration

Do this:

create or replace view v as select x, y from t;
create or replace view test as
with a as (
  select
    regr_avgy(y, x)                        as r_avgy,
    regr_avgx(y, x)                        as r_avgx,
    regr_count(y, x)                       as r_count,

    avg(y) filter (where x is not null)    as avgy,
    avg(x) filter (where y is not null)    as avgx,
    count(y) filter (where x is not null)  as county,
    count(x) filter (where y is not null)  as countx
  from v)
select
  approx_equal(r_avgy, avgy)::text     as "r_avgy = avgy",
  approx_equal(r_avgx, avgx)::text     as "r_avgx = avgx",
  approx_equal(r_count, county)::text  as "r_count = county",
  approx_equal(r_count, countx)::text  as "r_count = countx"
from a;

Test it like this on the noise-free data:

create or replace view v as select x, y from t;
select * from test;

This is the result:

 r_avgy = avgy | r_avgx = avgx | r_count = county | r_count = countx
---------------+---------------+------------------+------------------
 true          | true          | true             | true

Now test it on the noisy data:

create or replace view v as select x, (y + delta) as y from t;
select * from test;

The result is the same.

regr_slope(), regr_intercept()

Purpose: regr_slope() returns the slope of the straight line that linear regression analysis has determined best fits the "(y, x)" pairs. And regr_intercept() returns the point at which this line intercepts the "y"-axis.

Try this:

select
  to_char(regr_slope(y, x),               '0.99999999')  as "noise-free slope",
  to_char(regr_intercept(y, x),           '0.99999999')  as "noise-free intercept",

  to_char(regr_slope((y + delta), x),     '0.99999999')  as "noisy slope",
  to_char(regr_intercept((y + delta), x), '0.99999999')  as "noisy intercept"
from t;

This is a typical result:

 noise-free slope | noise-free intercept | noisy slope | noisy intercept
------------------+----------------------+-------------+-----------------
  5.00000000      |  3.00000000          |  4.99778685 |  3.70820546

Recall (see the Create the test table section) that the table was populated using this parameterization:

\set no_of_rows 100
call populate_t(
  no_of_rows  => :no_of_rows,

  mean        =>  0.0,
  stddev      => 20.0,
  slope       =>  5.0,
  intercept   =>  3.0);

The results from regr_slope() and regr_intercept() line up well with what you'd expect.

The section Scatter-plot for synthetic data shows a scatter plot of the data in table "t" (created with actual arguments for procedure "populate_t()" which that section specifies) with the straight line whose slope and y-axis intercept, as returned by regr_slope() and regr_intercept(), superimposed.

regr_r2()

Purpose: Returns the square of the correlation coefficient, corr().

Try this:

select
  approx_equal(regr_r2(y, x),           corr(y, x)^2          )::text  as "test for noise-free data",
  approx_equal(regr_r2((y + delta), x), corr((y + delta), x)^2)::text  as "test for noisy data"
from t;

This is the result:

 test for noise-free data | test for noisy data
--------------------------+---------------------
 true                     | true

regr_syy(), regr_sxx(), regr_sxy()

Purpose: These three measures are clearly defined in terms of other statistical aggregate functions described in this overall section. Statisticians will know when they need them.

regr_syy(y, x) returns regr_count(y, x)*var_pop(y) for NOT NULL pairs.
regr_sxx(y, x) returns regr_count(y, x)*var_pop(x) for NOT NULL pairs.
regr_sxy(y, x) returns regr_count(y, x)*covar_pop(y, x) for NOT NULL pairs.

regr_syy(), regr_sxx(), regr_sxy() basic semantics demonstration

Try this first:

create or replace view v as select x, y from t;
create or replace view test as
with a as (
  select x, y
  from v
  where x is not null and y is not null)
select
  approx_equal( regr_syy(y, x), (regr_count(y, x)*var_pop(y)) )     ::text  as "regr_syy() test",
  approx_equal( regr_sxx(y, x), (regr_count(y, x)*var_pop(x)) )     ::text  as "regr_sxx() test",
  approx_equal( regr_sxy(y, x), (regr_count(y, x)*covar_pop(y, x)) )::text  as "regr_sxy() test"
from a;

Now do this to test the semantics on the noise-free data:

create or replace view v as select x, y from t;
select * from test;

This is the result:

 regr_syy() test | regr_sxx() test | regr_sxy() test
-----------------+-----------------+-----------------
 true            | true            | true

Now do this to test the semantics on the noisy data:

create or replace view v as select x, (y + delta) as y from t;
select * from test;

The result is the same as the result for the noise-free data.

Checking the formulas for regr_syy(), regr_sxx(), regr_sxy()

Now try this:

create or replace view v as select x, y from t;

drop function if exists f() cascade;
create function f()
  returns table(t text)
  language plpgsql
as $body$
declare
  regr_syy constant double precision not null := (
    select regr_syy(y, x) from v);
  regr_sxx constant double precision not null := (
    select regr_sxx(y, x) from v);
  regr_sxy constant double precision not null := (
    select regr_sxy(y, x) from v);

  count  double precision not null := 0;
  sum_yy double precision not null := 0;
  sum_xx double precision not null := 0;
  sum_xy double precision not null := 0;
  sum_y  double precision not null := 0;
  sum_x  double precision not null := 0;
begin
  with a as (
    select x, y
    from v
    where x is not null and y is not null)
  select count(*), sum(y^2), sum(x^2), sum(x*y), sum(y), sum(x)
  into count, sum_yy, sum_xx, sum_xy, sum_y, sum_x
  from a;

  assert
    approx_equal(regr_syy, (sum_yy - sum_y^2)/count),
  'unexpected';

  assert
    approx_equal(regr_sxx, (sum_xx - sum_x^2)/count),
  'unexpected';

  assert
    approx_equal(regr_sxy, (sum_xy - sum_y*sum_x)/count),
  'unexpected';

  t := 'regr_syy: '||to_char(regr_syy, '99999990.99999999'); return next;
  t := 'regr_sxx: '||to_char(regr_sxx, '99999990.99999999'); return next;
  t := 'regr_sxy: '||to_char(regr_sxy, '99999990.99999999'); return next;
end;
$body$;

create or replace view v as select x, y from t;
select t as "noise-free data" from f();

create or replace view v as select x, (y + delta) as y from t;
select t as "noisy data" from f();

This is a typical result:

       noise-free data
------------------------------
 regr_syy:   2083125.00000000
 regr_sxx:     83325.00000000
 regr_sxy:    416625.00000000

          noisy data
------------------------------
 regr_syy:   2121737.65951556
 regr_sxx:     83325.00000000
 regr_sxy:    416440.58954855