Join us on YugabyteDB Community Slack
Star us on
Get Started
Slack
GitHub
Get Started
v2.13 (latest) v2.12 (stable) v2.8 (earlier version) v2.6 (earlier version) v2.4 (earlier version) v2.2 (earlier version) v2.1 (earlier version) v2.0 (earlier version) v1.3 (earlier version)
  • YUGABYTEDB CORE
    • Quick start
      • 1. Install YugabyteDB
      • 2. Create a local cluster
      • 3. Explore distributed SQL
      • 4. Build an application
        • Java
        • Node.js
        • Go
        • Python
        • Ruby
        • C#
        • PHP
        • C++
        • C
        • Scala
    • Explore
      • SQL features
        • Schemas and Tables
        • Data Types
        • Data Manipulation
        • Queries and Joins
        • Expressions and Operators
        • Stored Procedures
        • Triggers
        • Advanced features
          • Cursors
          • Table Partitioning
          • Views
          • Savepoints
          • Collations
          • Extensions
        • Going beyond SQL
          • Follower reads
          • Tablespaces
      • Fault tolerance
      • Horizontal scalability
        • Scaling Transactions
        • Sharding Data
      • Transactions
        • Distributed Transactions
        • Isolation Levels
        • Explicit Locking
      • Indexes and Constraints
        • Overview
        • Unique Indexes
        • Partial Indexes
        • Expression Indexes
        • Generalized Inverted Indexes
        • Primary Key
        • Foreign Key
        • Other Constraints
      • JSON support
      • Multi-region deployments
        • Sync replication (3+ regions)
        • Async Replication (2+ regions)
        • Row-Level Geo-Partitioning
        • Read replicas
      • Query tuning
        • Introduction
        • Get query statistics using pg_stat_statements
        • Viewing live queries with pg_stat_activity
        • Analyzing queries with EXPLAIN
        • Optimizing YSQL queries using pg_hint_plan
      • Cluster management
        • Point-in-time recovery
      • Security
      • Observability
        • Prometheus Integration
        • Grafana Dashboard
    • Develop
      • Learn app development
        • 1. SQL vs NoSQL
        • 2. Data modeling
        • 3. Data types
        • 4. ACID transactions
        • 5. Aggregations
        • 6. Batch operations
        • 7. Date and time
        • 8. Strings and text
        • 9. TTL for data expiration
      • Real-world examples
        • E-Commerce app
        • IoT fleet management
      • Explore sample apps
      • Best practices
      • Cloud-native development
        • Codespaces
        • Gitpod
    • Migrate
      • Migration process overview
      • Migrate from PostgreSQL
        • Convert a PostgreSQL schema
        • Migrate a PostgreSQL application
        • Export PostgreSQL data
        • Prepare a cluster
        • Import PostgreSQL data
        • Verify Migration
    • Deploy
      • Deployment checklist
      • Manual deployment
        • 1. System configuration
        • 2. Install software
        • 3. Start YB-Masters
        • 4. Start YB-TServers
        • 5. Verify deployment
      • Kubernetes
        • Single-zone
          • Open Source
          • Amazon EKS
          • Google Kubernetes Engine
          • Azure Kubernetes Service
        • Multi-zone
          • Amazon EKS
          • Google Kubernetes Engine
        • Multi-cluster
          • Google Kubernetes Engine
        • Best practices
        • Connect Clients
      • Docker
      • Public clouds
        • Amazon Web Services
        • Google Cloud Platform
        • Microsoft Azure
      • Multi-DC deployments
        • Three+ data center (3DC)
        • Asynchronous Replication
        • Read replica clusters
    • Benchmark
      • TPC-C
      • sysbench
      • YCSB
      • Key-value workload
      • Large datasets
      • Scalability
        • Scaling queries
      • Resilience
        • Jepsen testing
      • Performance Troubleshooting
    • Secure
      • Security checklist
      • Enable Authentication
        • Enable User Authentication
        • Configure ysql_hba_conf_csv
      • Authentication Methods
        • Password Authentication
        • LDAP Authentication
        • Host-Based Authentication
        • Trust Authentication
      • Role-Based Access Control
        • Overview
        • Manage Users and Roles
        • Grant Privileges
        • Row-Level Security (RLS)
        • Column-Level Security
      • Encryption in Transit
        • Create server certificates
        • Enable server-to-server encryption
        • Enable client-to-server encryption
        • Connect to Clusters
        • TLS and authentication
      • Encryption at rest
      • Column-level encryption
      • Audit Logging
        • Configure Audit Logging
        • Session-Level Audit Logging
        • Object-Level Audit Logging
      • Vulnerability disclosure policy
    • Manage
      • Back up and restore
        • Back up data
        • Restore data
        • Point-in-time recovery
        • Snapshot and restore data
      • Migrate data
        • Bulk import
        • Bulk export
      • Change cluster configuration
      • Diagnostics reporting
      • Upgrade a deployment
      • Grow cluster
    • Troubleshoot
      • Troubleshooting
      • Cluster level issues
        • YCQL connection issues
        • YEDIS connection Issues
        • Recover tserver/master
        • Replace a failed YB-TServer
        • Replace a failed YB-Master
        • Manual remote bootstrap when a majority of peers fail
      • Node level issues
        • Check servers
        • Inspect logs
        • System statistics
        • Disk failure
        • Common error messages
    • Contribute
      • Core database
        • Contribution checklist
        • Build the source
        • Configure a CLion project
        • Run the tests
        • Coding style
  • YUGABYTE PLATFORM
    • Overview
      • Install
      • Configure
    • Install Yugabyte Platform
      • Prerequisites
      • Prepare the environment
      • Install software
      • Prepare nodes (on-prem)
      • Uninstall software
    • Configure Yugabyte Platform
      • Create admin user
      • Configure the cloud provider
      • Configure the backup target
      • Configure alerts
    • Create deployments
      • Multi-zone universe
      • Multi-region universe
      • Multi-cloud universe
      • Read replica cluster
      • Asynchronous replication
    • Manage deployments
      • Start and stop processes
      • Add a node
      • Eliminate an unresponsive node
      • Enable high availability
      • Edit configuration flags
      • Edit a universe
      • Delete a universe
      • Configure instance tags
      • Upgrade YugabyteDB software
      • Migrate to Helm 3
    • Back up universes
      • Configure backup storage
      • Back up universe data
      • Restore universe data
      • Schedule data backups
    • Security
      • Security checklist
      • Customize ports
      • LDAP authentication
      • Authorization platform
      • Create a KMS configuration
      • Enable encryption at rest
      • Enable encryption in transit (TLS)
      • Network security
    • Alerts and monitoring
      • Alerts
      • Live Queries dashboard
      • Slow Queries dashboard
    • Troubleshoot
      • Install and upgrade issues
      • Universe issues
    • Administer Yugabyte Platform
      • Back Up Yugabyte Platform
      • Authenticate with LDAP
    • Upgrade Yugabyte Platform
      • Upgrade using Replicated
  • YUGABYTE CLOUD
    • Overview
    • Quick start
      • Create a free cluster
      • Connect to the cluster
      • Create a database
      • Explore distributed SQL
      • Build an application
        • Before you begin
        • Java
        • Go
        • Python
        • Node.js
        • C
        • C++
        • C#
        • Ruby
        • Rust
        • PHP
    • Deploy clusters
      • Planning a cluster
      • Create a free cluster
      • Create a standard cluster
      • VPC network
        • Overview
        • Set up a VPC network
        • VPCs
        • Peering Connections
    • Secure clusters
      • IP allow lists
      • Database authorization
      • Add database users
      • Encryption in transit
      • Audit cloud activity
    • Connect to clusters
      • Cloud Shell
      • Client shell
      • Connect applications
    • Alerts and monitoring
      • Alerts
      • Performance metrics
      • Live queries
      • Slow YSQL queries
      • Cluster activity
    • Manage clusters
      • Backup and restore
      • Scale and configure clusters
      • Create extensions
    • Administer Yugabyte Cloud
      • Manage cloud users
      • Manage billing
      • Cluster costs
    • Example applications
      • Connect a Spring application
      • Connect a YCQL Java application
      • Hasura Cloud
      • Deploy a GraphQL application
    • Security architecture
      • Security architecture
      • Shared responsibility model
    • Troubleshoot
    • Yugabyte Cloud FAQ
    • What's new
  • INTEGRATIONS
    • Apache Kafka
    • Apache Spark
    • JanusGraph
    • KairosDB
    • Presto
    • Metabase
    • WSO2 Identity Server
    • YSQL Loader
    • Yugabyte JDBC Driver
    • Prisma
    • Hasura
      • Application Development
      • Benchmarking
    • Spring Framework
      • Spring Data YugabyteDB
      • Spring Data Cassandra
    • Flyway
    • GORM
    • Liquibase
    • Sequelize
    • SQLAlchemy
    • Entity Framework
    • Django REST framework
  • REFERENCE
    • Architecture
      • Design goals
      • Key concepts
        • Universe
        • YB-TServer Service
        • YB-Master Service
      • Core functions
        • Universe creation
        • Table creation
        • Write IO path
        • Read IO path
        • High availability
      • Layered architecture
      • Query layer
        • Overview
      • DocDB transactions layer
        • Transactions overview
        • Transaction isolation levels
        • Explicit locking
        • Read Committed
        • Single-row transactions
        • Distributed transactions
        • Transactional IO path
      • DocDB sharding layer
        • Hash & range sharding
        • Tablet splitting
        • Colocated tables
      • DocDB replication layer
        • Replication
        • xCluster replication
        • Read replicas
        • Change data capture (CDC)
      • DocDB storage layer
        • Persistence
        • Performance
    • APIs
      • YSQL
        • The SQL language
          • SQL statements
            • ABORT
            • ALTER DATABASE
            • ALTER DEFAULT PRIVILEGES
            • ALTER DOMAIN
            • ALTER GROUP
            • ALTER POLICY
            • ALTER ROLE
            • ALTER SEQUENCE
            • ALTER TABLE
            • ALTER USER
            • ANALYZE
            • BEGIN
            • CALL
            • COMMENT
            • COMMIT
            • COPY
            • CREATE AGGREGATE
            • CREATE CAST
            • CREATE DATABASE
            • CREATE DOMAIN
            • CREATE EXTENSION
            • CREATE FUNCTION
            • CREATE GROUP
            • CREATE INDEX
            • CREATE MATERIALIZED VIEW
            • CREATE OPERATOR
            • CREATE OPERATOR CLASS
            • CREATE POLICY
            • CREATE PROCEDURE
            • CREATE ROLE
            • CREATE RULE
            • CREATE SCHEMA
            • CREATE SEQUENCE
            • CREATE TABLE
            • CREATE TABLE AS
            • CREATE TRIGGER
            • CREATE TYPE
            • CREATE USER
            • CREATE VIEW
            • DEALLOCATE
            • DELETE
            • DO
            • DROP AGGREGATE
            • DROP CAST
            • DROP DATABASE
            • DROP DOMAIN
            • DROP EXTENSION
            • DROP FUNCTION
            • DROP GROUP
            • DROP MATERIALIZED VIEW
            • DROP OPERATOR
            • DROP OPERATOR CLASS
            • DROP OWNED
            • DROP POLICY
            • DROP PROCEDURE
            • DROP ROLE
            • DROP RULE
            • DROP SEQUENCE
            • DROP TABLE
            • DROP TRIGGER
            • DROP TYPE
            • DROP USER
            • END
            • EXECUTE
            • EXPLAIN
            • GRANT
            • INSERT
            • LOCK
            • PREPARE
            • REASSIGN OWNED
            • REFRESH MATERIALIZED VIEW
            • RELEASE SAVEPOINT
            • RESET
            • REVOKE
            • ROLLBACK
            • ROLLBACK TO SAVEPOINT
            • SAVEPOINT
            • SELECT
            • SET
            • SET CONSTRAINTS
            • SET ROLE
            • SET SESSION AUTHORIZATION
            • SET TRANSACTION
            • SHOW
            • SHOW TRANSACTION
            • TRUNCATE
            • UPDATE
            • VALUES
          • WITH clause
            • WITH clause—SQL syntax and semantics
            • recursive CTE
            • case study—traversing an employee hierarchy
            • traversing general graphs
              • graph representation
              • common code
              • undirected cyclic graph
              • directed cyclic graph
              • directed acyclic graph
              • rooted tree
              • Unique containing paths
              • Stress testing find_paths()
            • case study—Bacon Numbers from IMDb
              • Bacon numbers for synthetic data
              • Bacon numbers for IMDb data
        • Data types
          • Array
            • array[] constructor
            • Literals
              • Text typecasting and literals
              • Array of primitive values
              • Row
              • Array of rows
            • FOREACH loop (PL/pgSQL)
            • array of DOMAINs
            • Functions and operators
              • ANY and ALL
              • Array comparison
              • Array slice operator
              • Array concatenation
              • Array properties
              • array_agg(), unnest(), generate_subscripts()
              • array_fill()
              • array_position(), array_positions()
              • array_remove()
              • array_replace() / set value
              • array_to_string()
              • string_to_array()
          • Binary
          • Boolean
          • Character
          • Date and time
            • Conceptual background
            • Timezones and UTC offsets
              • Catalog views
              • Extended_timezone_names
                • Unrestricted full projection
                • Real timezones with DST
                • Real timezones no DST
                • Synthetic timezones no DST
              • Offset/timezone-sensitive operations
                • Timestamptz to/from timestamp conversion
                • Pure 'day' interval arithmetic
              • Four ways to specify offset
                • Name-resolution rules
                  • 1 case-insensitive resolution
                  • 2 ~names.abbrev never searched
                  • 3 'set timezone' string not resolved in ~abbrevs.abbrev
                  • 4 ~abbrevs.abbrev before ~names.name
                  • Helper functions
              • Syntax contexts for offset
              • Recommended practice
            • Typecasting between date-time and text-values
            • Semantics of the date-time data types
              • Date data type
              • Time data type
              • Plain timestamp and timestamptz
              • Interval data type
                • Interval representation
                  • Ad hoc examples
                  • Representation model
                • Interval value limits
                • Declaring intervals
                • Justify() and extract(epoch...)
                • Interval arithmetic
                  • Interval-interval comparison
                  • Interval-interval addition and subtraction
                  • Interval-number multiplication
                  • Moment-moment overloads of "-"
                  • Moment-interval overloads of "+" and "-"
                • Custom interval domains
                • Interval utility functions
            • Typecasting between date-time datatypes
            • Operators
              • Test comparison overloads
              • Test addition overloads
              • Test subtraction overloads
              • Test multiplication overloads
              • Test division overloads
            • General-purpose functions
              • Creating date-time values
              • Manipulating date-time values
              • Current date-time moment
              • Delaying execution
              • Miscellaneous
                • Function age()
                • Function extract() | date_part()
                • Implementations that model the overlaps operator
            • Formatting functions
            • Case study—SQL stopwatch
            • Download & install the date-time utilities
            • ToC
          • JSON
            • JSON literals
            • Primitive and compound data types
            • Code example conventions
            • Indexes and check constraints
            • Functions & operators
              • ::jsonb, ::json, ::text (typecast)
              • ->, ->>, #>, #>> (JSON subvalues)
              • - and #- (remove)
              • || (concatenation)
              • = (equality)
              • @> and <@ (containment)
              • ? and ?| and ?& (key or value existence)
              • array_to_json()
              • jsonb_agg()
              • jsonb_array_elements()
              • jsonb_array_elements_text()
              • jsonb_array_length()
              • jsonb_build_object()
              • jsonb_build_array()
              • jsonb_each()
              • jsonb_each_text()
              • jsonb_extract_path()
              • jsonb_extract_path_text() and json_extract_path_text()
              • jsonb_object()
              • jsonb_object_agg()
              • jsonb_object_keys()
              • jsonb_populate_record()
              • jsonb_populate_recordset()
              • jsonb_pretty()
              • jsonb_set() and jsonb_insert()
              • jsonb_strip_nulls()
              • jsonb_to_record()
              • jsonb_to_recordset()
              • jsonb_typeof()
              • row_to_json()
              • to_jsonb()
          • Money
          • Numeric
          • Range
          • Serial
          • UUID
        • Functions and operators
          • Aggregate functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • grouping sets, rollup, cube
            • Per function signature and purpose
              • avg(), count(), max(), min(), sum()
              • array_agg(), string_agg(), jsonb_agg(), jsonb_object_agg()
              • bit_and(), bit_or(), bool_and(), bool_or()
              • variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()
              • linear regression
                • covar_pop(), covar_samp(), corr()
                • regr_%()
              • mode(), percentile_disc(), percentile_cont()
              • rank(), dense_rank(), percent_rank(), cume_dist()
            • case study—percentile_cont() and the "68–95–99.7" rule
            • case study—linear regression on COVID data
              • Download the COVIDcast data
              • Ingest the COVIDcast data
                • Inspect the COVIDcast data
                • Copy the .csv files to staging tables
                • Check staged data conforms to the rules
                • Join the staged data into a single table
                • SQL scripts
                  • Create cr_staging_tables()
                  • Create cr_copy_from_scripts()
                  • Create assert_assumptions_ok()
                  • Create xform_to_covidcast_fb_survey_results()
                  • ingest-the-data.sql
              • Analyze the COVIDcast data
                • symptoms vs mask-wearing by day
                • Data for scatter-plot for 21-Oct-2020
                • Scatter-plot for 21-Oct-2020
                • SQL scripts
                  • analysis-queries.sql
                  • synthetic-data.sql
          • currval()
          • lastval()
          • nextval()
          • Window functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • Per function signature and purpose
              • row_number(), rank() and dense_rank()
              • percent_rank(), cume_dist() and ntile()
              • first_value(), nth_value(), last_value()
              • lag(), lead()
              • Tables for the code examples
                • table t1
                • table t2
                • table t3
                • table t4
            • case study—analyzing a normal distribution
              • Bucket allocation scheme
              • do_clean_start.sql
              • cr_show_t4.sql
              • cr_dp_views.sql
              • cr_int_views.sql
              • cr_pr_cd_equality_report.sql
              • cr_bucket_using_width_bucket.sql
              • cr_bucket_dedicated_code.sql
              • do_assert_bucket_ok
              • cr_histogram.sql
              • cr_do_ntile.sql
              • cr_do_percent_rank.sql
              • cr_do_cume_dist.sql
              • do_populate_results.sql
              • do_report_results.sql
              • do_compare_dp_results.sql
              • do_demo.sql
              • Reports
                • Histogram report
                • dp-results
                • compare-dp-results
                • int-results
          • yb_hash_code()
        • Extensions
        • Keywords
        • Reserved names
      • YCQL
        • ALTER KEYSPACE
        • ALTER ROLE
        • ALTER TABLE
        • CREATE INDEX
        • CREATE KEYSPACE
        • CREATE ROLE
        • CREATE TABLE
        • CREATE TYPE
        • DROP INDEX
        • DROP KEYSPACE
        • DROP ROLE
        • DROP TABLE
        • DROP TYPE
        • GRANT PERMISSION
        • GRANT ROLE
        • REVOKE PERMISSION
        • REVOKE ROLE
        • USE
        • INSERT
        • SELECT
        • EXPLAIN
        • UPDATE
        • DELETE
        • TRANSACTION
        • TRUNCATE
        • Simple expressions
        • Subscripted expressions
        • Function call
        • Operators
        • BLOB
        • BOOLEAN
        • Collection
        • FROZEN
        • INET
        • Integer and counter
        • Non-integer
        • TEXT
        • DATE, TIME, and TIMESTAMP
        • UUID and TIMEUUID
        • JSONB
        • Date and time
        • BATCH
    • CLIs
      • yb-ctl
      • yb-docker-ctl
      • ysqlsh
      • ycqlsh
      • yb-admin
      • yb-ts-cli
      • ysql_dump
      • ysql_dumpall
    • Configuration
      • yb-tserver
      • yb-master
      • yugabyted
      • Default ports
    • Drivers
      • Client drivers for YSQL
      • Client drivers for YCQL
    • Connectors
      • Kafka Connect YugabyteDB
    • Third party tools
      • Arctype
      • DBeaver
      • DbSchema
      • pgAdmin
      • SQL Workbench/J
      • TablePlus
      • Visual Studio Code
    • Sample datasets
      • Chinook
      • Northwind
      • PgExercises
      • SportsDB
      • Retail Analytics
  • RELEASES
    • Releases overview
      • v2.13 series (latest)
      • v2.12 series (stable)
      • v2.11 series
      • v2.9 series
      • v2.8 series
      • v2.7 series
      • v2.6 series
      • v2.5 series
      • v2.4 series
      • v2.3 series
      • v2.2 series
      • v2.1 series
      • v2.0 series
      • v1.3 series
      • v1.2 series
    • Release versioning
  • FAQ
    • Comparisons
      • Amazon Aurora
      • Google Cloud Spanner
      • CockroachDB
      • TiDB
      • Vitess
      • MongoDB
      • FoundationDB
      • Amazon DynamoDB
      • Azure Cosmos DB
      • Apache Cassandra
      • PostgreSQL
      • Redis in-memory store
      • Apache HBase
    • General FAQ
    • Operations FAQ
    • API compatibility FAQ
    • Yugabyte Platform FAQ
  • MISC
    • YEDIS
      • Quick start
      • Develop
        • Build an application
        • C#
        • C++
        • Go
        • Java
        • NodeJS
        • Python
      • API reference
        • APPEND
        • AUTH
        • CONFIG
        • CREATEDB
        • DELETEDB
        • LISTDB
        • SELECT
        • DEL
        • ECHO
        • EXISTS
        • EXPIRE
        • EXPIREAT
        • FLUSHALL
        • FLUSHDB
        • GET
        • GETRANGE
        • GETSET
        • HDEL
        • HEXISTS
        • HGET
        • HGETALL
        • HINCRBY
        • HKEYS
        • HLEN
        • HMGET
        • HMSET
        • HSET
        • HSTRLEN
        • HVALS
        • INCR
        • INCRBY
        • KEYS
        • MONITOR
        • PEXPIRE
        • PEXPIREAT
        • PTTL
        • ROLE
        • SADD
        • SCARD
        • RENAME
        • SET
        • SETEX
        • PSETEX
        • SETRANGE
        • SISMEMBER
        • SMEMBERS
        • SREM
        • STRLEN
        • ZRANGE
        • TSADD
        • TSCARD
        • TSGET
        • TSLASTN
        • TSRANGEBYTIME
        • TSREM
        • TSREVRANGEBYTIME
        • TTL
        • ZADD
        • ZCARD
        • ZRANGEBYSCORE
        • ZREM
        • ZREVRANGE
        • ZSCORE
        • PUBSUB
        • PUBLISH
        • SUBSCRIBE
        • UNSUBSCRIBE
        • PSUBSCRIBE
        • PUNSUBSCRIBE
    • Legal
      • Third party software
> APIs > YSQL > Functions and operators > Window functions > Per function signature and purpose >

percent_rank(), cume_dist() and ntile()

Report a doc issue Suggest new content
  • percent_rank()
  • cume_dist()
  • ntile()
  • Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input

These three window functions bear a strong family resemblance to each other. The explanation of what they achieve rests on the notion of a centile. A centile is a measure used in statistics to denote the value below which a given fraction of the values in a set falls. The term percentile is often preferred by those who like to express fractions as percentages. For example, the 70th percentile is the value below which 70% of the values fall.

Briefly:

  • percent_rank() assigns the centile fraction to each value in the input population
  • cume_dist() function returns, for a specified value within a population, the number of values that are less than or equal to the specified value divided by the total number of values—in other words, the relative position of a value within the population
  • ntile assigns the bucket number, in an equiheight histogram (see below) to each value in the population by specifying the required number of buckets as the function's actual argument.

Suppose that you want to divide a set of values into N subsets, where each subset contains the same number of values. The result is referred to as an equiheight histogram. This implies unequal value-axis bucket widths. Compare this with the width_bucket() regular built-in function that produces an equiwidth histogram—that is it marks off the value axis into equal width buckets that, in general each contains a different number of values.

If the required number of unequal width buckets for the equal height histogram is 5, then your goal is equivalently stated by saying that you want to split the population at the values that fall at the 20th, 40th, 60th, and 80th percentiles. In this use case, and if and only if the values are unique and the population size is an exact multiple of the required number of buckets, then any one of percent_rank(), cume_dist() or ntile() can be used to meet the goal. But using ntile() needs noticeably simpler SQL than using the other two because it's tailored for this, but only this, purpose. Moreover, if the set contains duplicate values, then only ntile() meets the goal (to the extent that this is feasible) because it randomly assigns ties at the boundaries between the buckets so that some go to the higher-value set and some go to the lower-value set. The dedicated section Analyzing a normal distribution with percent_rank(), cume_dist() and ntile() demonstrates this with two large populations of approximately normally distributed values—one with no duplicates and one with lots of duplicates. And it shows that if the population has ties, or when the population size is not an exact multiple of the required number of buckets, then percent_rank() and cume_dist() produce different results from each other, and each divides the population differently than ntile(), because they handle ties differently and because of other subtle differences.

If you want to split the population into N subsets of unequal population sizes, then you must use one of percent_rank() or cume_dist().

This documentation assumes that you already understand the subtle difference between the percentile notion and the cumulative distribution notion, and know how to define your goal with respect to splitting populations of values into subsets—and that you will therefore know how to choose which of the functions percent_rank(), cume_dist() or ntile() that you should use.

percent_rank()

Signature:

input value:       <no formal parameter>
return value:      double precision

Purpose: Return the percentile rank of each row within the window, with respect to the argument of the window_definition's window ORDER BY clause. The value p returned by percent_rank() is a number in the range 0 <= p <= 1. It is calculated like this:

percentile_rank = (rank - 1) / ("no. of rows in window" - 1)

Arguably, then, this function is wrongly named because to deserve its name it would return a value in the range 0 through 100.

The lowest value of percent_rank() within the window will always be 0.0, even when there is a tie between the lowest-ranking rows. The highest possible value of percent_rank() within the window is 1.0, but this value is seen only when the highest-ranking row has no ties. If the highest-ranking row does have ties, then the highest vale of percent_rank() within the window will be correspondingly less than 1.0 according to how many rows ties for the top rank.

For example, the query computes the percentile rank for a table "t" with primary key "k" and "score" column, over all the rows:

select
  k,
  score,
  100*(percent_rank() over (order by score)) as "percent_rank()"
from t;

When percent_rank() for a particular row has a value of, say,0.42, it means this score is higher than 42% of the other rows.

cume_dist()

Signature:

input value:       <no formal parameter>
return value:      double precision

Purpose: Return a value that represents the number of rows with values less than or equal to the current row’s value divided by the total number of rows—in other words, the relative position of a value in a set of values. The graph of all values of cume_dist() within the window is known as the cumulative distribution of the argument of the window_definition's window ORDER BY clause. The value c returned by cume_dist() is a number in the range 0 <= c <= 1. It is calculated like this:

cume_dist() =
  "no of rows with a value <= the current row's value" /
  "no. of rows in window"

Note: the formula suggests that The return value c will lie in the open-closed range 0 < c <= 1. But the results shown in the study described in the section Analyzing a normal distribution with percent_rank(), cume_dist() and ntile() show that the range is indeed the closed-closed 0 <= c <= 1. This is doubtless due to rounding errors in the function's implementation.

You can use cume_dist() to answer questions like this:

  • Show me the rows whose score is within the top x% of the window's population, ranked by score.

    prepare top_few_percent(double precision) as
    with v as (
      select
        score,
        100*cume_dist() over (order by dp_score) as cd,
        k
      from t)
    select
      score,
      to_char(cd, '999.9') as "cume_dist()",
      k
    from v
    where cd > $1
    order by cd desc, k;
    

ntile()

Signature:

input value:       no_of_buckets int
return value:      int

Purpose: Return an integer value for each row that maps it to a corresponding percentile. For example, if you wanted to mark the boundaries between the highest-ranking 20% of rows, the next-ranking 20% of rows, and so on, then you would use ntile(5). The top 20% of rows would be marked with 1, the next-to-top 20% of rows would be marked with 2, and so on so that the bottom 20% of rows would be marked with 5.

Note: If the number of rows in the window, N, is a multiple of the actual value with which you invoke ntile(), n, then each percentile set would have exactly N/n rows. This is achieved, if there are ties right at the boundary between two percentile sets, by randomly assigning some to one set and some to the other. If N is not a multiple of n, then ntile() assigns the rows to the percentile sets so that the numbers assigned to each are as close as possible to being the same.

Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input

If you haven't yet installed the tables that the code examples use, then go to the section The data sets used by the code examples.

The query that this section presents shows that the results produced by percent_rank() and cume_dist() are consistent with the formulas for these values that are given in the accounts, above, of these two functions.

Create a data set using the ysqlsh script that table t2 presents. This has been designed:

  • so that the scores for the rows with "class=1" are unique so that the ordering produces no ties
  • and so that the scores for the rows with "class=2" have duplicate values so that the ordering does produce ties. The first tie group has three rows; and the second has two rows.

Now do this:

with
  v1 as (
    select
      class,
      k,
      score,
      (rank()         over w1) as r,
      (percent_rank() over w1) as pr,
      (cume_dist()    over w1) as cd,
      (ntile(4)       over w1) as nt,
      (count(*)       over w2) as n_tot,
      (count(*)       over w3) as n_thru_curr
    from t2
    window
      w1 as (partition by class order by score),
      w2 as (w1 range between unbounded preceding and unbounded following),
      w3 as (w1 range between unbounded preceding and current row)),
  v2 as (
    select
      class,
      k,
      score,
      n_tot,
      n_thru_curr,
      r,
      (r::numeric - 1.0)/(n_tot::numeric - 1.0) as pr_calc,
      (n_thru_curr::numeric)/(n_tot::numeric)   as cd_calc,
      pr,
      cd,
      nt
    from v1)
select
  class,
  k,
  score,
  n_tot,
  n_thru_curr,
  (pr = pr_calc)::text     as "pr check",
  (cd = cd_calc)::text     as "cd check",
  r                        as "rank()",
  to_char(pr*100, '990.9') as "percent_rank()",
  to_char(cd*100, '990.9') as "cume_dist",
  nt                       as "ntile()"
from v2
order by class, "rank()", k;

This is the result. To make it easier to see the pattern, several blank lines have been manually inserted here between each successive set of rows with the same value for "class". And in the second set, which has ties, one blank line has been inserted between each tie group.

 class | k  | score | n_tot | n_thru_curr | pr check | cd check | rank() | percent_rank() | cume_dist | ntile() 
-------+----+-------+-------+-------------+----------+----------+--------+----------------+-----------+---------
     1 |  1 |     1 |     9 |           1 | true     | true     |      1 |    0.0         |   11.1    |       1
     1 |  2 |     2 |     9 |           2 | true     | true     |      2 |   12.5         |   22.2    |       1
     1 |  3 |     3 |     9 |           3 | true     | true     |      3 |   25.0         |   33.3    |       1
     1 |  4 |     4 |     9 |           4 | true     | true     |      4 |   37.5         |   44.4    |       2
     1 |  5 |     5 |     9 |           5 | true     | true     |      5 |   50.0         |   55.6    |       2
     1 |  6 |     6 |     9 |           6 | true     | true     |      6 |   62.5         |   66.7    |       3
     1 |  7 |     7 |     9 |           7 | true     | true     |      7 |   75.0         |   77.8    |       3
     1 |  8 |     8 |     9 |           8 | true     | true     |      8 |   87.5         |   88.9    |       4
     1 |  9 |     9 |     9 |           9 | true     | true     |      9 |  100.0         |  100.0    |       4

     2 | 10 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1
     2 | 11 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1
     2 | 12 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1

     2 | 13 |     4 |     9 |           4 | true     | true     |      4 |   37.5         |   44.4    |       2

     2 | 14 |     5 |     9 |           5 | true     | true     |      5 |   50.0         |   55.6    |       2

     2 | 15 |     6 |     9 |           6 | true     | true     |      6 |   62.5         |   66.7    |       3

     2 | 16 |     7 |     9 |           8 | true     | true     |      7 |   75.0         |   88.9    |       4
     2 | 17 |     7 |     9 |           8 | true     | true     |      7 |   75.0         |   88.9    |       3

     2 | 18 |     9 |     9 |           9 | true     | true     |      9 |  100.0         |  100.0    |       4

Notice that in this example, the number of rows per window, 9, is not a multiple of the actual value, 4 with which the function is invoked. This means that the number of rows assigned to each bucket can't be the same. Here, as promised, ntile() makes a best effort to get the numbers as close to each other as is possible. You can confirm, visually, that the populations are three for one of the four buckets and two for the other three.

Notice, too, what the outcomes are for the tie groups. Each of percent_rank() and cume_dist() produces a different value for each row where "class=1", which has no ties. For "class=2", when there are ties, these functions each produce the same value for all the rows in each of the two tie groups. In contrast, ntile() assigns the two rows in the second tie group (with "score=7") to different percentile groups.

  • percent_rank()
  • cume_dist()
  • ntile()
  • Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input
Ask our community
  • Slack
  • Github
  • Forum
  • StackOverflow
Yugabyte
Contact Us
Copyright © 2017-2022 Yugabyte, Inc. All rights reserved.