Best practices
Core features
Global secondary indexes
Indexes use the multi-shard transactional capability of YugabyteDB and are global and strongly consistent (ACID). To add secondary indexes, you need to create tables with transactions enabled. Indexes can also serve as materialized views by using the INCLUDE clause.
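For example, a minimal sketch with an illustrative users table (the table and column names are hypothetical):

```sql
-- Transactions must be enabled on the table for it to support secondary indexes.
CREATE TABLE users (
    id    INT PRIMARY KEY,
    email TEXT,
    name  TEXT
) WITH transactions = { 'enabled' : true };

-- A global, strongly consistent secondary index.
CREATE INDEX users_by_email ON users (email);
```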
Unique indexes
YCQL supports unique indexes. A unique index disallows duplicate values from being inserted into the indexed columns.
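For example, assuming the illustrative users table above:

```sql
-- Rejects any INSERT or UPDATE that would duplicate an existing email value.
CREATE UNIQUE INDEX users_email_unique ON users (email);
```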
Covered indexes
When querying by a secondary index, the original table is consulted to get the columns that aren't specified in the index. This can result in multiple random reads across the main table.
Often a better approach is to use the INCLUDE clause to add the other columns you query that are not part of the index. When additional columns are included in the index, queries can be answered directly from the index without consulting the table. This converts a potential random read of the main table into just a filter on the index.
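A sketch against the same hypothetical users table; because name is included, the query below is answered entirely from the index:

```sql
-- 'name' is stored in the index but is not part of the index key.
CREATE INDEX users_by_email_covered ON users (email) INCLUDE (name);

-- Served directly from the index; the main table is never consulted.
SELECT name FROM users WHERE email = 'alice@example.com';
```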
Atomic read-modify-write operations with UPDATE IF EXISTS
For operations like UPDATE ... IF EXISTS and INSERT ... IF NOT EXISTS that require an atomic read-modify-write, Apache Cassandra uses LWT, which requires four round trips between peers. YugabyteDB supports these operations much more efficiently: because of its strongly consistent (CP in the CAP theorem) design, they require only a single Raft round trip between peers. Number and counter types work the same way and don't need a separate "counters" table.
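For example, on the hypothetical users table:

```sql
-- Succeeds only if no row with this primary key exists yet.
INSERT INTO users (id, email, name)
    VALUES (1, 'alice@example.com', 'Alice')
    IF NOT EXISTS;

-- Succeeds only if the row already exists.
UPDATE users SET name = 'Alice B.' WHERE id = 1 IF EXISTS;
```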
JSONB document data type
YugabyteDB supports the jsonb data type, which makes it easy to model JSON data that does not have a set schema and might change often. You can use JSONB to group together the less interesting and less frequently accessed columns of a table. YCQL also supports JSONB expression indexes, which can speed up data retrieval that would otherwise require scanning the JSON entries.
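A minimal sketch with an illustrative books table; ->> extracts a JSONB attribute as text (verify the expression-index syntax against the YCQL reference for your version):

```sql
CREATE TABLE books (
    id      INT PRIMARY KEY,
    details JSONB
) WITH transactions = { 'enabled' : true };

INSERT INTO books (id, details)
    VALUES (1, '{ "author" : "Charles Dickens", "year" : 1859 }');

-- Expression index on a JSONB attribute, so the query below
-- does not have to scan and parse every JSON entry.
CREATE INDEX books_by_author ON books ((details->>'author'));

SELECT * FROM books WHERE details->>'author' = 'Charles Dickens';
```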
Use jsonb columns only when necessary
jsonb columns are slower to read and write than regular columns. They also take more space, because they must store keys as strings, and they make it harder to keep data consistent. A good schema design keeps most columns as regular columns or collections and uses jsonb only for truly dynamic values. Rather than a single data jsonb column that stores everything, prefer a dynamic_data jsonb column alongside primitive columns for the fixed attributes.
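For example, a possible shape for such a table (all names illustrative):

```sql
-- Fixed, frequently queried attributes stay as regular columns;
-- only the genuinely schema-less part lives in jsonb.
CREATE TABLE devices (
    id           INT PRIMARY KEY,
    model        TEXT,
    location     TEXT,
    dynamic_data JSONB
);
```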
Incrementing numeric types
In YugabyteDB, YCQL extends Apache Cassandra with increment and decrement operators for integer data types. Integers can be set, inserted, incremented, and decremented, while COUNTER columns can only be incremented or decremented. YugabyteDB implements CAS (compare-and-swap) operations in one round trip, compared to four for Apache Cassandra.
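For example (tables hypothetical; views is a plain INT column, hits a COUNTER column):

```sql
-- INT columns can be set, inserted, incremented, and decremented.
UPDATE page_stats SET views = views + 1 WHERE page_id = 10;

-- COUNTER columns can only be incremented or decremented.
UPDATE page_hits SET hits = hits + 1 WHERE page_id = 10;
```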
Expire older records automatically with TTL
YCQL supports automatic expiration of data using the TTL feature. You can set a retention policy for data at the table, row, or column level, and older data is automatically purged from the database.
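For example, row- and column-level TTL on an illustrative table (table-level retention is set separately, via the default_time_to_live table property):

```sql
CREATE TABLE page_views (
    page_id   INT,
    viewed_at TIMESTAMP,
    referrer  TEXT,
    PRIMARY KEY (page_id, viewed_at)
);

-- Row-level TTL: the whole row expires 86400 seconds (one day) after insert.
INSERT INTO page_views (page_id, viewed_at, referrer)
    VALUES (10, '2024-01-01 00:00:00', 'example.com') USING TTL 86400;

-- Column-level TTL: only the value written by this statement expires.
UPDATE page_views USING TTL 3600
    SET referrer = 'other.example'
    WHERE page_id = 10 AND viewed_at = '2024-01-01 00:00:00';
```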
Note
TTL is not applicable to transactional tables and hence is not supported in that context.
Performance
Use YugabyteDB drivers
Use YugabyteDB-specific client drivers because they are cluster- and partition-aware and support jsonb columns.
Leverage connection pooling in the YCQL client
A single client (for example, a multi-threaded application) should ideally use a single cluster object. Under the covers, that cluster object holds a configurable number of connections to the YB-TServers. Typically, one or two connections per TServer suffice to serve even 64-128 application threads, because the same connection can carry multiple outstanding requests (multiplexing).
Use prepared statements
Use prepared statements whenever possible. Prepared statements ensure that partition-aware YugabyteDB drivers can route queries directly to the tablet leader, improving throughput, and spare the server from parsing the query on every operation.
Use batching for higher throughput
Use batching for writing a set of operations. Batching sends all operations in a single RPC call instead of one RPC call per operation. Each batch operation has higher latency than a single-row operation, but overall throughput is higher.
Column and row sizes
For consistent latency and performance, try to keep columns in the 2 MB range or less, even though an individual column can be about 32 MB.
Big columns add up when selecting full rows, or several of them. For consistent latency and performance, keep the size of rows in the 32 MB range or less.
Don't use big collections
Collections are designed for storing small sets of values that are not expected to grow to arbitrary size (such as phone numbers or addresses for a user rather than posts or messages). While collections of larger sizes are allowed, they may have a significant impact on performance for queries involving them. In particular, some list operations (insert at an index and remove elements) require a read-before-write.
Collections with many elements
Each element inside a collection ends up as a separate key-value pair in DocDB, adding per-element overhead.
If your collections are immutable, or you always update the whole collection in full, consider using the JSONB data type. An alternative is to use ProtoBuf or FlatBuffers and store the serialized data in a BLOB column.
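A sketch of both shapes (names illustrative):

```sql
-- Fine as a collection: a small, bounded set of values per row.
CREATE TABLE user_phones (
    id     INT PRIMARY KEY,
    phones LIST<TEXT>
);

-- Better for larger, write-once data: a single jsonb value per row avoids
-- the per-element key-value overhead in DocDB.
CREATE TABLE user_profiles (
    id      INT PRIMARY KEY,
    profile JSONB
);
```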
Use partition_hash for large table scans
The partition_hash function can be handy for querying a subset of the data to get approximate row counts, or for breaking down full-table operations into smaller sub-tasks that can be run in parallel. See example usage along with a working Python script.
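For example, a sketch assuming a table t whose full partition key is the single column h; partition_hash values span 0 to 65535, so each 1024-wide slice covers roughly 1/64 of the table, and 64 such queries with adjacent ranges can run in parallel:

```sql
-- Approximate row count for one slice of the hash space.
SELECT COUNT(*) FROM t
    WHERE partition_hash(h) >= 0 AND partition_hash(h) < 1024;
```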
Miscellaneous
Use TRUNCATE to empty tables instead of DELETE
TRUNCATE deletes the database files that store the table, which is very fast. DELETE, on the other hand, inserts a delete marker for each row in a transaction; the rows are removed from storage only when a compaction runs.
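For example (table name illustrative):

```sql
-- Fast: drops the files backing the table.
TRUNCATE events;

-- Slow at scale: each deleted row gets a delete marker that persists
-- until a compaction runs.
DELETE FROM events WHERE id = 1;
```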