Home > APIs > YSQL > Functions and operators > Aggregate functions > Per function signature and purpose >

Within-group hypothetical-set aggregate functions

This section describes the uses of rank(), dense_rank(), percent_rank(), and cume_dist() as "within-group hypothetical-set" aggregate functions.

The same functions can also be invoked as window functions. That use is described here:

The basic semantic definition of each function is the same in each invocation scenario. But the goals of the two invocation methods are critically different.

Syntax example: Suppose that table "t" can be populated like this:

insert into t(g1,... gN, v1,... vN) values (:a,... :b, :c,... :d);

The columns with names like "gN" represent the grouping columns and the columns with names like "vN" represent the columns whose values will be used to invoke the function. In the "within-group hypothetical-set" use, the function "f()" would be invoked like this:

select
  g1,... gN,
  f(:v1,... :vN) within group (order by v1,... vN)
from t
group by g1,... gN;

For each of these hypothetical-set aggregates, the list of actual arguments specified for the function must match the number and types of the column expressions specified following ORDER BY.

Consider this counter-example:

select
  g1
  f(:v1,... :vN) within group (order by v1,... vN)
from t;

Notice that "g1" is not mentioned in a group by clause. It causes this error:

42803: column "t.g1" must appear in the GROUP BY clause or be used in an aggregate function

Generic purpose: Suppose that "f()" is invoked as a window function like this:

select
  g1,... gN, v1,... vN,
  f() over(partition by g1,... gN order by v1,... vN)
from t;

Imagine, hypothetically, that, now, a row is inserted with a particular set of values for "(g1,... gN, v1,... vN)", that the same SELECT is repeated, and that the value returned by "f()" for the new row is "x". The data type of "x" is function-specific. It's bigint for rank() and dense_rank(). And it's double precision for percent_rank() and cume_dist(). This thought experiment can be trivially conducted in actuality simply by starting a transaction, inserting the new row, running the query, noting the result for the new row, and then rolling back.

Suppose that, instead of doing the hypothetical insert, the function is invoked using the "within-group hypothetical-set" syntax shown above using the values for "(v1,... vN)" that were hypothetically used for the thought experiment. The result will be, for the hypothesized values for "(g1,... gN)", the same as would have been seen if the "start transaction;... rollback;" approach had been used. But, beyond this, values will be produced for each possible combination of values for "(g1,... gN)" that occur in the table at the moment that the SELECT statement is executed. (The expressions "g1,... gN" are those that are mentioned in the GROUP BY clause.) See the Semantics demonstration section.

Unlike most built-in aggregate functions, these aggregate functions are not strict—that is they do not filter out rows containing NULLs. Rather, NULLs sort according to the rule specified in the ORDER BY clause.

rank()

Signature:

input value:       VARIADIC "any" ORDER BY VARIADIC "any"
                   
return value:      bigint

Purpose: Returns the integer ordinal rank of each row according to the emergent order that the window ORDER BY clause specifies. The series of values starts with 1 but, when the window contains ties, the series is not dense.

See the account of rank() in the Window functions section for more information.

dense_rank()

Signature:

input value:       VARIADIC "any" ORDER BY VARIADIC "any"
                   
return value:      bigint

Purpose: Returns the integer ordinal rank of the distinct value of each row according to what the window ORDER BY clause specifies. The series of values starts with 1 and, even when the window contains ties, the series is dense.

See the account of dense_rank() in the Window functions section for more information.

percent_rank()

Signature:

input value:       VARIADIC "any" ORDER BY VARIADIC "any"
                   
return value:      double precision

Purpose: Returns the percentile rank of each row within the window, with respect to the argument of the window_definition's window ORDER BY clause. The value p returned by percent_rank() is a number in the range 0 <= p <= 1. It is calculated like this:

percentile_rank = (rank - 1) / ("no. of rows in window" - 1)

See the account of percent_rank() in the Window functions section for more information.

cume_dist()

input value:       VARIADIC "any" ORDER BY VARIADIC "any"
                   
return value:      double precision

Purpose: Returns a value that represents the number of rows with values less than or equal to the current row’s value divided by the total number of rows—in other words, the relative position of a value in a set of values. The graph of all values of cume_dist() within the window is known as the cumulative distribution of the argument of the window_definition's window ORDER BY clause. The value c returned by cume_dist() is a number in the range 0 <= c <= 1. It is calculated like this:

cume_dist() =
  "no of rows with a value <= the current row's value" /
  "no. of rows in window"

See the account of cume_dist() in the Window functions section for more information.

Semantics demonstration

First, create and populate the test table:

drop table if exists t cascade;
create table t(
  k      int primary key,
  class  int not null,
  score  int);

insert into t(k, class, score)
with a as (
  select s.v from generate_series(1, 10) as s(v))
values(0, 1, null::int)
union all
select
  v,
  ntile(2) over (order by v),
  case v <= 5
    when true then v*2
              else (v - 5)*2
  end
from a;

\pset null <null>
select class, score
from t
order by class, score nulls first;

This is the result:

 class | score  
-------+--------
     1 | <null>
     1 |      2
     1 |      4
     1 |      6
     1 |      8
     1 |     10
     2 |      2
     2 |      4
     2 |      6
     2 |      8
     2 |     10

Next, create a view defined by a SELECT statement that invokes the four functions of interest as window functions:

create or replace view v as
select
  k,
  class,
  score,
  (rank()         over w) as r,
  (dense_rank()   over w) as dr,
  (percent_rank() over w) as pr,
  (cume_dist()    over w) as cd
from t
window w as (partition by class order by score nulls first);

Visualize the results that the view defines:

select
  class, score, r, dr, to_char(pr, '90.99') as pr, to_char(cd, '90.99') as cd
from v
order by class, r;

This is the result:

 class | score  | r | dr |   pr   |   cd   
-------+--------+---+----+--------+--------
     1 | <null> | 1 |  1 |   0.00 |   0.17
     1 |      2 | 2 |  2 |   0.20 |   0.33
     1 |      4 | 3 |  3 |   0.40 |   0.50
     1 |      6 | 4 |  4 |   0.60 |   0.67
     1 |      8 | 5 |  5 |   0.80 |   0.83
     1 |     10 | 6 |  6 |   1.00 |   1.00
     2 |      2 | 1 |  1 |   0.00 |   0.20
     2 |      4 | 2 |  2 |   0.25 |   0.40
     2 |      6 | 3 |  3 |   0.50 |   0.60
     2 |      8 | 4 |  4 |   0.75 |   0.80
     2 |     10 | 5 |  5 |   1.00 |   1.00

Now, simulate the hypothetical insert of two new rows, one in each class, and visualize the values that the four functions of interest produce for them. Do this within a transaction that you rollback.

start transaction;
insert into t(k, class, score) values (21, 1, 5), (22, 2, 6);

select
  class, score, r, dr, to_char(pr, '90.99') as pr, to_char(cd, '90.99') as cd
from v
where k in (21, 22)
order by class, r;

rollback;

This is the result:

 class | score | r | dr |   pr   |   cd   
-------+-------+---+----+--------+--------
     1 |     5 | 4 |  4 |   0.50 |   0.57
     2 |     6 | 3 |  3 |   0.40 |   0.67

Next, create a table function, parameterized by the value of the hypothetical to-be-inserted score, to show the result of the "Within-group hypothetical-set" invocation of the four functions of interest:

drop function if exists h(int) cascade;
create function h(hypothetical_score in int)
  returns table(
    class int,
    score int,
    r bigint,
    dr bigint,
    pr double precision,
    cd double precision)
  language sql
as $body$
  with v as (
    select
      class,
      rank(hypothetical_score)         within group (order by score nulls first) as r,
      dense_rank(hypothetical_score)   within group (order by score nulls first) as dr,
      percent_rank(hypothetical_score) within group (order by score nulls first) as pr,
      cume_dist(hypothetical_score)    within group (order by score nulls first) as cd
    from t
    group by class)
  select class, hypothetical_score as score, r, dr, pr, cd from v;
$body$;

First invoke it with a hypothetical score of 5:

select
  class, score, r, dr, to_char(pr, '90.99') as pr, to_char(cd, '90.99') as cd
from h(5)
order by class;

This is the result:

 class | score | r | dr |   pr   |   cd   
-------+-------+---+----+--------+--------
     1 |     5 | 4 |  4 |   0.50 |   0.57
     2 |     5 | 3 |  3 |   0.40 |   0.50

Now invoke it with a hypothetical score of 6:

select
  class, score, r, dr, to_char(pr, '90.99') as pr, to_char(cd, '90.99') as cd
from h(6)
order by class;

This is the result:

 class | score | r | dr |   pr   |   cd   
-------+-------+---+----+--------+--------
     1 |     6 | 4 |  4 |   0.50 |   0.71
     2 |     6 | 3 |  3 |   0.40 |   0.67

Notice that, as promised, each invocation produces two rows—one row for each of the two distinct values of "class".

It's easy to produce the identical result that simulating the hypothetical inserts within a rolled back transaction produced:

with v as (
  select class, score, r, dr, pr, cd from h(5) where class = 1
  union all
  select class, score, r, dr, pr, cd from h(6) where class = 2)
select
  class, score, r, dr, to_char(pr, '90.99') as pr, to_char(cd, '90.99') as cd
from v
order by class;

This is the result:

 class | score | r | dr |   pr   |   cd   
-------+-------+---+----+--------+--------
     1 |     5 | 4 |  4 |   0.50 |   0.57
     2 |     6 | 3 |  3 |   0.40 |   0.67

This is indeed identical to the result that the simulated two hypothetical inserts produced.