Bulk export
This page documents bulk export for YugabyteDB’s Cassandra-compatible YCQL API. To export data from a YugabyteDB (or an Apache Cassandra) table, you can use the cassandra-unloader
tool.
We will first create a source YugabyteDB table and populate it with data. Then you will export the data out using the cassandra-unloader tool. We will use a generic gaming user profile use case as a running example to illustrate the export process.
Create Source Table
Following is the schema of the destination YugabyteDB table.
CREATE KEYSPACE load;
USE load;
CREATE TABLE users(
user_id varchar,
score1 double,
score2 double,
points int,
object_id varchar,
PRIMARY KEY (user_id));
Generate Sample Data
# sample usage:
# To generate a 10GB (10240 MB) file.
# % python gen_csv.py <outfile_name> <outfile_size_MB>
# % python gen_csv.py file01.csv 10240
#
import numpy as np
import uuid
import csv
import os
import sys
outfile = sys.argv[1] # output file name
outsize_mb = int(sys.argv[2])
print("Outfile = " + outfile)
print("Outfile Size (MB) = " + str(outsize_mb))
chunksize = 10000
with open(outfile, 'ab') as csvfile:
while (os.path.getsize(outfile)//1024**2) < outsize_mb:
data = [[uuid.uuid4() for i in range(chunksize)],
np.random.random(chunksize)*1000,
np.random.random(chunksize)*50,
np.random.randint(1000000, size=(chunksize,)),
[uuid.uuid4() for i in range(chunksize)]]
csvfile.writelines(['%s,%.6f,%.6f,%i,%s\n' % row for row in zip(*data)])
Sample rows generated by script would like the following.
$ head file00.csv
3399bebc-d2cc-40c6-89d4-26102e08ff61,622.491927,40.262305,658257,44d73f8c-1d3c-424e-8fd2-d316c56b8454
4f362eac-f79f-45f6-b6b1-bd5a81f931dc,141.344278,3.024717,694290,7768b010-8411-490a-b523-88cc3ec53cb5
a24a6587-eea4-4907-ac7f-9f99dcac8f82,345.110599,3.869150,510943,5765d1d3-2855-4dbe-9f11-bb3b8631789f
...
To generate 5 CSV files of about 5 GB each, run the following commands.
python ./gen_csv.py file00.csv 5120 &
python ./gen_csv.py file01.csv 5120 &
python ./gen_csv.py file02.csv 5120 &
python ./gen_csv.py file03.csv 5120 &
python ./gen_csv.py file04.csv 5120 &
Load Sample Data
cassandra-loader
is a general purpose bulk loader for CQL that supports various types of delimited files (particularly CSV files). For more details, review the README of the YugabyteDB cassandra-loader fork. Note that cassandra-loader requires quotes for collection types (for example, “[1,2,3]” rather than [1,2,3] for lists).
Install cassandra-loader
You can do this as shown below.
$ wget https://github.com/yugabyte/cassandra-loader/releases/download/v0.0.27-yb-2/cassandra-loader
$ chmod a+x cassandra-loader
Run cassandra-loader
The files can be queued up for upload one at a time. Sample invocation:
./cassandra-loader \
-schema "load.users(user_id, score1, score2, points, object_id)" \
-boolStyle 1_0 \
-numFutures 1000 \
-rate 10000 \
-queryTimeout 65 \
-numRetries 10 \
-progressRate 200000 \
-host <clusterNodeIP> \
-f file01.csv
For additional options to cassandra-loader, see here.
Export Data
Install cassandra-unloader
$ wget https://github.com/brianmhess/cassandra-loader/releases/download/v0.0.27/cassandra-unloader
$ chmod a+x cassandra-unloader
Run cassandra-unloader
./cassandra-unloader \
-schema "load.users(user_id, score1, score2, points, object_id)" \
-boolStyle 1_0 \
-host <clusterNodeIP> \
-f outfile.csv
For additional options to cassandra-unloader, see here.
Always specify timezone
The time zone is not added to the default timestamp formats when using thecassandra-loader
and cassandra-unloader
utilities.
Make sure that timestamps are exported and imported in the same format, including the time zone.
Examples: yyyy-MM-dd HH:mm:ss.SSSZ
and yyyy-MM-dd HH:mm:ss.SSSXXX
.
Use tab character for delim on JSONB columns
The default delimiter (,
) does not work with JSONB
columns. Use the tab character for delim -delim $'\t'
.