The bucket allocation scheme
It might seem that the built-in function width_bucket()
(here in the PostgreSQL documentation) is tailor-made for this task. But there's a snag.
The values in the column t4.dp_score" lie in this so-called closed-closed interval [0, 100]
—expressed like this in SQL:
dp_score is between 0 and 100
Note: This Wikipedia article explains the terminology used to characterize intervals. For example:
The open-closed interval (0,1] means greater than 0 and less than or equal to 1. And the closed-open interval [0,1) means greater than or equal to 0 and less than 1.
The values returned both by percent_rank()
and by cume_dist()
, when scaled so that they are percentages, also lie in this same range.
Here is the signature (using sensibly chosen names for the formal parameters) of the double precision
overload of width_bucket()
:
function width_bucket(
val in double precision, -- the value to be allocated
lb in double precision, -- the lower bound of the overall closed-open interval
ub in double precision, -- the upper bound of the overall closed-open interval
n_vals in int) -- the required number of buckets
returns int
If you invoke width_bucket()
to allocate a value $1
with lb = 0
, ub = 100
, and n_vals = 10
, like this:
width_bucket($1, 0, 100, 10)
then the buckets are defined as these closed-open intervals:
[ 0, 10) -> bucket 1
[10, 20) -> bucket 2
[20, 30) -> bucket 3
...
[80, 90) -> bucket 9
[90, 100) -> bucket 10
Any value that is less than 0 is allocated to the lower overflow bucket 0. And any value that is greater than or equal to the value 100 is allocated to the upper overflow bucket 11. Here is an example:
do $body$
declare
lb constant double precision := 0;
ub constant double precision := 100;
n_vals constant int := 10;
begin
assert width_bucket( -0.0000000001, lb, ub, n_vals) = 0, 'unexpected';
assert width_bucket( 0, lb, ub, n_vals) = 1, 'unexpected';
assert width_bucket( 99.9999999999, lb, ub, n_vals) = 10, 'unexpected';
assert width_bucket(100, lb, ub, n_vals) = 11, 'unexpected';
end;
$body$;
This means that attention has to be paid to the special case when the value to be allocated is 100 in order that it is allocated to bucket 10, and not to the upper overflow bucket 11. However, this is not the main snag. Rather, it's this:
- Pen-and-paper analysis, and some empirical testing, show that
ntile()
,percent_rank()
, andcume_dist()
can produce the same results only when these open-closed intervals are used:
( 0, 10] -> bucket 1
(10, 20] -> bucket 2
(20, 30] -> bucket 3
...
(80, 90] -> bucket 9
(90, 100] -> bucket 10
This means that attention has to be paid to the special case at the other end of the closed-closed interval of possible input values: when the value to be allocated is 0, it must be allocated to bucket 1, and not to the lower overflow bucket 0.
The consequence is that width_bucket()
cannot be used as is. This leaves two choices. Each relies on implementing a new function bucket()
with the same signature as width_bucket()
but with a different implementation.
-
The first choice is to implement
bucket
as a wrapper aroundwidth_bucket()
to change its behavior at the bucket boundaries, See cr_bucket_using_width_bucket.sql. -
The second choice is to implement
bucket
from scratch. Seecr_bucket_dedicated_code.sql
.
See do_assert_bucket_ok.sql for the acceptance test that each must pass.