CockroachDB supports operations on JSON objects. In these labs, we will get familiar with the JSONB data type, as well as with ways to optimize queries on JSONB objects.
- Build the single region dev cluster following these instructions.
- You also need:
  - a modern web browser,
  - a SQL client:
    - the Cockroach SQL client
    - psql
    - DBeaver Community edition (SQL tool with built-in CockroachDB plugin)
Connect to the database to confirm it loaded successfully:
# use cockroach sql, defaults to localhost:26257
cockroach sql --insecure
# or use the --url param for any other host:
cockroach sql --url "postgresql://localhost:26257/defaultdb?sslmode=disable"
# or use psql
psql -h localhost -p 26257 -U root defaultdb
Create a table by importing a CSV file from a cloud storage bucket.
IMPORT TABLE jblob (
id INT PRIMARY KEY,
myblob JSONB
) CSV DATA ('https://raw.githubusercontent.com/cockroachlabs/workshop_labs/master/JSON-optimization/data/raw_test_blob.tsv')
WITH
delimiter = e'\t';
Check how many JSON objects were imported:
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+---------
624769960352055297 | succeeded | 1 | 1 | 0 | 261102
(1 row)
Time: 504ms total (execution 503ms / network 1ms)
Just 1 row! The entire blob was imported into a single row. How useful is that within a database? Not much, but we can use it to test the built-in JSONB functions.
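You can also confirm the row count directly at the SQL prompt:
-- expect a single row
SELECT count(*) FROM jblob;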
Let's practice with some JSONB built-in functions.
The JSON blob looks like this:
{
"myjson_blob": [
{
"r_id": 2259,
"r_c_id": 54043195528453770,
[...]
},
{
"r_id": 1222,
"r_c_id": 21673573206743750,
[...]
},
{many more such items}
]
}
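Before drilling in, it can help to check how many items the array holds; the built-in jsonb_array_length function does exactly that:
-- count the items in the myjson_blob array
SELECT jsonb_array_length(myblob -> 'myjson_blob') FROM jblob WHERE id = 0;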
It's always nice to start by viewing a JSON object nicely formatted, so we know what we're actually dealing with:
-- here we access the item 0 of the json array
SELECT jsonb_pretty(myblob -> 'myjson_blob' -> 0) FROM jblob WHERE id = 0;
{
"c_balance": 7122,
"c_base_ap_id": 192,
"c_iattr00": 142027308,
"c_iattr01": 379685059,
"c_iattr02": 389136665,
"c_iattr03": 145392585,
"c_iattr04": 931118926,
"c_iattr05": 8816575,
"c_iattr06": 984249473,
"c_iattr07": 116663385,
"c_iattr08": 907154685,
"c_iattr09": 15899371,
"c_iattr10": 648717549,
"c_iattr11": 724567744,
"c_iattr12": 1051370766,
"c_iattr13": 50210225,
"c_iattr14": 451755713,
"c_iattr15": 982547218,
"c_iattr16": 543188731,
"c_iattr17": 564981351,
"c_iattr18": 471058604,
"c_iattr19": 759207747,
"c_id": 54043195528453770,
"c_id_str": 54043195528453770,
"c_sattr00": "whtzcyuhoywdeigoqvrivmlhedp",
"c_sattr01": "xkqenedl",
"c_sattr02": "ppbphg",
"c_sattr03": "alxslwmk",
"c_sattr04": "xovvm",
"c_sattr05": "xkj",
"c_sattr06": "tfywzzd",
"c_sattr07": "jpry",
"c_sattr08": "nvlbkmg",
"c_sattr09": "lrxjgzd",
"c_sattr10": "otolyrq",
"c_sattr11": "znzk",
"c_sattr12": "viyqtkm",
"c_sattr13": "cujp",
"c_sattr14": "kutbaoda",
"c_sattr15": "nbuuodbh",
"c_sattr16": "hdvftpl",
"c_sattr17": "n",
"c_sattr18": "ri",
"c_sattr19": "yukz",
"f_al_id": 363,
"f_arrive_ap_id": 16,
"f_arrive_time": "2020-01-06 17:10:20.962+00:00",
"f_base_price": 258,
"f_depart_ap_id": 192,
"f_depart_time": "2019-10-15 09:10:20.962+00:00",
"f_iattr00": 8520160,
"f_iattr01": 675846595,
"f_iattr02": 504955969,
"f_iattr03": 767109370,
"f_iattr04": 707645173,
"f_iattr05": 340637155,
"f_iattr06": 304036642,
"f_iattr07": 421641226,
"f_iattr08": 440205470,
"f_iattr09": 413668930,
"f_iattr10": 712127756,
"f_iattr11": 51274104,
"f_iattr12": 641344442,
"f_iattr13": 862018850,
"f_iattr14": 386515711,
"f_iattr15": 840809361,
"f_iattr16": 916318900,
"f_iattr17": 418637645,
"f_iattr18": 515763995,
"f_iattr19": 967932899,
"f_iattr20": 586772453,
"f_iattr21": 713528331,
"f_iattr22": 993065765,
"f_iattr23": 234788091,
"f_iattr24": 343690263,
"f_iattr25": 289777773,
"f_iattr26": 1066280989,
"f_iattr27": 842571001,
"f_iattr28": 506399830,
"f_iattr29": 637903749,
"f_id": 422229648081259,
"f_seats_left": 147,
"f_seats_total": 150,
"f_status": 0,
"r_c_id": 54043195528453770,
"r_f_id": 422229648081259,
"r_iattr00": 673352173,
"r_iattr01": 355513020,
"r_iattr02": 209823742,
"r_iattr03": 772737207,
"r_iattr04": 199858911,
"r_iattr05": 737503122,
"r_iattr06": 945498537,
"r_iattr07": 824685761,
"r_iattr08": 810968743,
"r_id": 2259,
"r_price": 978,
"r_seat": 1
}
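Beyond pretty-printing, you can chain -> and ->> to drill down to a single value. For example, this extracts r_price from the first array item (978, per the output above):
-- -> returns JSONB; ->> returns a string
SELECT myblob -> 'myjson_blob' -> 0 ->> 'r_price' AS r_price FROM jblob WHERE id = 0;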
The jsonb_each() function expands the outermost JSONB object into a set of key-value pairs:
SELECT jsonb_each(myblob -> 'myjson_blob' -> 0) FROM jblob WHERE id = 0;
jsonb_each
-------------------------------------------------------
(c_balance,7122)
(c_base_ap_id,192)
(c_iattr00,142027308)
(c_iattr01,379685059)
(c_iattr02,389136665)
(c_iattr03,145392585)
[...]
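A related function, jsonb_each_text, returns the values as strings rather than JSONB, which is convenient when you plan to cast or compare them:
-- like jsonb_each, but values come back as strings
SELECT jsonb_each_text(myblob -> 'myjson_blob' -> 0) FROM jblob WHERE id = 0;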
The jsonb_object_keys() function returns the sorted set of keys in the outermost JSONB object:
SELECT jsonb_object_keys(myblob -> 'myjson_blob' -> 0) FROM jblob WHERE id = 0;
jsonb_object_keys
---------------------
c_balance
c_base_ap_id
c_iattr00
c_iattr01
c_iattr02
c_iattr03
c_iattr04
[...]
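If you are ever unsure what a field holds, jsonb_typeof reports the JSON type (object, array, string, number, and so on):
-- expect 'array' for the top-level blob member
SELECT jsonb_typeof(myblob -> 'myjson_blob') FROM jblob WHERE id = 0;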
Cool, good job! We can now drop this table:
DROP TABLE IF EXISTS jblob CASCADE;
Create a table with FLATTENED JSONB objects by importing a CSV file from a cloud storage bucket. This CSV file was pre-processed to extract each item of the JSON array into its own row; the code used is below for reference.
import json

# This works only if you have enough memory to read the entire file.
with open('file.json') as f:
    data = json.load(f)

# Write one TSV row per array item: "<id>\t<compact JSON>".
# Compact separators keep each JSON value on one line with no stray whitespace.
with open('file.tsv', 'w') as f:
    for i, item in enumerate(data):
        f.write('{}\t{}\n'.format(i, json.dumps(item, separators=(',', ':'))))
At the SQL prompt, import the data:
IMPORT TABLE jflat (
id INT PRIMARY KEY,
myflat JSONB
) CSV DATA ('https://raw.githubusercontent.com/cockroachlabs/workshop_labs/master/JSON-optimization/data/raw_test_flat.tsv')
WITH
delimiter = e'\t';
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+---------
624770775819517953 | succeeded | 1 | 110 | 0 | 262051
The flat file has a total of 110 rows.
Let's write a query that counts the rows sharing the same c_base_ap_id. Use the ->> operator, which accesses a JSONB field and returns it as a string.
SELECT myflat ->> 'c_base_ap_id' AS c_base_ap_id, count(*)
FROM jflat
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
c_base_ap_id | count
---------------+--------
16 | 14
202 | 10
131 | 7
78 | 6
148 | 6
Create a query that sums the r_price values by c_base_ap_id, showing the TOP 10 sums of r_price.
SELECT myflat ->> 'c_base_ap_id' AS c_base_ap_id,
SUM(CAST(myflat ->> 'r_price' AS INT)) AS price
FROM jflat
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
c_base_ap_id | price
---------------+--------
16 | 8364
202 | 5351
148 | 3900
211 | 3429
131 | 3020
78 | 2932
77 | 2340
149 | 1996
60 | 1626
168 | 1616
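As an aside, you can also filter on the aggregate itself with a HAVING clause; a minimal sketch, with an arbitrarily chosen threshold:
-- keep only airports whose summed r_price exceeds 3000
SELECT myflat ->> 'c_base_ap_id' AS c_base_ap_id,
       SUM(CAST(myflat ->> 'r_price' AS INT)) AS price
FROM jflat
GROUP BY 1
HAVING SUM(CAST(myflat ->> 'r_price' AS INT)) > 3000
ORDER BY 2 DESC;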
Import more data into the jflat table:
IMPORT INTO jflat (id, myflat)
CSV DATA (
'https://raw.githubusercontent.com/cockroachlabs/workshop_labs/master/JSON-optimization/data/raw_test_flat2.tsv'
)
WITH
delimiter = e'\t';
Let's review how many rows we have now in total:
SELECT COUNT(*) FROM jflat;
count
---------
15939
Very good, we have a lot more data to work with!
Run the following query. The @> operator tests whether the left JSONB value contains the right JSONB value.
SELECT id
FROM jflat
WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
id
---------
3358
3944
4179
6475
16007
16501
(6 rows)
Time: 557ms total (execution 554ms / network 2ms)
557ms is a bit too slow. Check the query plan:
EXPLAIN (VERBOSE) SELECT id FROM jflat WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
tree | field | description | columns | ordering
-----------------+---------------------+---------------------------------------+--------------+-----------
| distribution | full | |
| vectorized | true | |
project | | | (id) |
│ | estimated row count | 1771 | |
└── filter | | | (id, myflat) |
│ | estimated row count | 1771 | |
│ | filter | myflat @> '{"c_sattr19": "momjzdfu"}' | |
└── scan | | | (id, myflat) |
| estimated row count | 15939 | |
| table | jflat@primary | |
| spans | FULL SCAN | |
As expected, it's doing a FULL SCAN of the primary index, which we always want to avoid.
We can improve the response time (RT) by creating an inverted index.
CREATE INVERTED INDEX idx_json_inverted ON jflat(myflat);
Once created, pull the query plan again:
EXPLAIN (VERBOSE) SELECT id FROM jflat WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
tree | field | description | columns | ordering
-------+---------------------+-----------------------------------------------------------+---------+-----------
| distribution | local | |
| vectorized | true | |
scan | | | (id) |
| estimated row count | 1771 | |
| table | jflat@idx_json_inverted | |
| spans | /"c_sattr19"/"momjzdfu"-/"c_sattr19"/"momjzdfu"/PrefixEnd | |
Good, it's leveraging the inverted index. Run the query again:
SELECT id FROM jflat WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
id
---------
3358
3944
4179
6475
16007
16501
(6 rows)
Time: 14ms total (execution 13ms / network 1ms)
14ms, down from 557ms. Great improvement!
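Note that the inverted index covers every key in the JSONB column, so any containment predicate can use it, not just those on c_sattr19. For example:
-- the same index serves containment on any key
SELECT count(*) FROM jflat WHERE myflat @> '{"r_seat": 1}';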
Create a simple table out of jflat:
CREATE TABLE jflat3 AS SELECT * FROM jflat LIMIT 5;
Check the c_base_ap_id of the inserted rows:
SELECT myflat ->> 'c_base_ap_id' AS c_base_ap_id FROM jflat3;
c_base_ap_id
----------------
192
77
168
148
269
(5 rows)
Run the query below to confirm you can do joins on JSONB fields.
SELECT jflat.myflat ->> 'c_base_ap_id' AS c_base_ap_id
FROM jflat JOIN jflat3
ON jflat.myflat ->> 'c_base_ap_id' = jflat3.myflat ->> 'c_base_ap_id';
c_base_ap_id
----------------
192
77
168
148
269
168
148
192
148
77
168
77
269
77
77
148
148
148
(18 rows)
While joins on JSONB fields are possible, it is recommended to extract the join field into a computed column for efficiency.
We discuss computed columns next.
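As a quick preview of that approach, here is a minimal sketch on jflat3. It assumes a CockroachDB version that supports adding stored computed columns via ALTER TABLE; the column and index names are ours:
-- materialize the join key as a stored computed column, then index it
ALTER TABLE jflat3 ADD COLUMN c_base_ap_id STRING AS (myflat ->> 'c_base_ap_id') STORED;
CREATE INDEX ON jflat3(c_base_ap_id);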
Run the following query:
SELECT myflat ->> 'c_sattr19' AS attr19,
myflat ->> 'r_seat' AS seat,
count(*),
sum(CAST(myflat ->> 'r_price' AS INT))
FROM jflat
WHERE myflat ->> 'c_sattr19' LIKE '%mom%'
GROUP BY 1,2;
attr19 | seat | count | sum
-----------+------+-------+-------
momjzdfu | 1 | 2 | 1091
momjzdfu | 0 | 3 | 1747
momjzdfu | 2 | 1 | 865
(3 rows)
Time: 76ms total (execution 74ms / network 1ms)
Let's pull the query plan:
EXPLAIN (VERBOSE)
SELECT myflat ->> 'c_sattr19' AS attr19,
myflat ->> 'r_seat' AS seat,
count(*),
sum(CAST(myflat ->> 'r_price' AS INT))
FROM jflat
WHERE myflat ->> 'c_sattr19' LIKE '%mom%'
GROUP BY 1,2;
tree | field | description | columns | ordering
----------------------+---------------------+-------------------------------------+----------------------------+-----------
| distribution | full | |
| vectorized | true | |
group | | | (attr19, seat, count, sum) |
│ | estimated row count | 5313 | |
│ | aggregate 0 | count_rows() | |
│ | aggregate 1 | sum(column7) | |
│ | group by | attr19, seat | |
└── render | | | (column7, attr19, seat) |
│ | estimated row count | 5313 | |
│ | render 0 | (myflat->>'r_price')::INT8 | |
│ | render 1 | myflat->>'c_sattr19' | |
│ | render 2 | myflat->>'r_seat' | |
└── filter | | | (myflat) |
│ | estimated row count | 5313 | |
│ | filter | (myflat->>'c_sattr19') LIKE '%mom%' | |
└── scan | | | (myflat) |
| estimated row count | 15939 | |
| table | jflat@primary | |
| spans | FULL SCAN | |
As you can see, it's doing a FULL SCAN: a LIKE filter on a string extracted from JSONB cannot be served by an inverted index.
We can tune the above query by adding computed columns to the table.
Let's create a copy of the table with computed columns, insert the data, then create an index on the field used in the WHERE clause:
CREATE TABLE jflat_new (
id INT PRIMARY KEY,
myflat JSONB,
r_seat STRING AS (myflat::JSONB ->> 'r_seat') STORED,
attr19 STRING AS (myflat::JSONB ->> 'c_sattr19') STORED,
r_price INT AS (CAST(myflat::JSONB ->> 'r_price' AS INT)) STORED,
FAMILY "primary" (id, r_seat, attr19, r_price),
FAMILY "blob" (myflat)
);
INSERT INTO jflat_new SELECT id, myflat from jflat;
CREATE INDEX ON jflat_new(attr19) STORING (r_seat, r_price);
Let's review the table again to confirm. Note the STORING clause: copies of r_seat and r_price are kept in the index entries, so the query can be answered from the index alone.
SHOW CREATE jflat_new;
table_name | create_statement
-------------+-------------------------------------------------------------------------------
jflat_new | CREATE TABLE public.jflat_new (
| id INT8 NOT NULL,
| myflat JSONB NULL,
| r_seat STRING NULL AS (myflat->>'r_seat':::STRING) STORED,
| attr19 STRING NULL AS (myflat->>'c_sattr19':::STRING) STORED,
| r_price INT8 NULL AS (CAST(myflat->>'r_price':::STRING AS INT8)) STORED,
| CONSTRAINT "primary" PRIMARY KEY (id ASC),
| INDEX jflat_new_attr19_idx (attr19 ASC) STORING (r_seat, r_price),
| FAMILY "primary" (id, r_seat, attr19, r_price),
| FAMILY blob (myflat)
| )
Now we can rewrite the query and pull the plan to confirm the optimizer uses the newly created index:
EXPLAIN (VERBOSE)
SELECT attr19,
r_seat,
count(*),
sum(r_price)
FROM jflat_new
WHERE attr19 LIKE '%mom%'
GROUP BY 1,2;
tree | field | description | columns | ordering
-----------------+---------------------+--------------------------------+------------------------------+-----------
| distribution | full | |
| vectorized | true | |
group | | | (attr19, r_seat, count, sum) |
│ | estimated row count | 4111 | |
│ | aggregate 0 | count_rows() | |
│ | aggregate 1 | sum(r_price) | |
│ | group by | r_seat, attr19 | |
│ | ordered | +attr19 | |
└── filter | | | (r_seat, attr19, r_price) | +attr19
│ | estimated row count | 5313 | |
│ | filter | attr19 LIKE '%mom%' | |
└── scan | | | (r_seat, attr19, r_price) | +attr19
| estimated row count | 15939 | |
| table | jflat_new@jflat_new_attr19_idx | |
| spans | /!NULL- | |
Very good, the query plan is using the index. Let's run the query to see if the response time improved:
SELECT attr19,
r_seat,
count(*),
sum(r_price)
FROM jflat_new
WHERE attr19 LIKE '%mom%'
GROUP BY 1,2;
attr19 | r_seat | count | sum
-----------+--------+-------+-------
momjzdfu | 0 | 3 | 1747
momjzdfu | 1 | 2 | 1091
momjzdfu | 2 | 1 | 865
(3 rows)
Time: 10ms total (execution 9ms / network 1ms)
Very good, from 76ms down to 10ms, good job!
Consider the following queries, which yield the same result:
SELECT id FROM jflat WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
SELECT id FROM jflat_new WHERE attr19 = 'momjzdfu';
id
---------
3358
3944
4179
6475
16007
16501
(6 rows)
Time: 15ms total (execution 14ms / network 2ms)
id
---------
3944
6475
16501
3358
4179
16007
(6 rows)
Time: 2ms total (execution 1ms / network 1ms)
It's very clear that the query leveraging computed columns plus an index is far more performant than the inverted index option.
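To quantify the difference yourself, EXPLAIN ANALYZE reports actual execution statistics for each variant:
-- compare rows scanned and execution time for both approaches
EXPLAIN ANALYZE SELECT id FROM jflat WHERE myflat @> '{"c_sattr19": "momjzdfu"}';
EXPLAIN ANALYZE SELECT id FROM jflat_new WHERE attr19 = 'momjzdfu';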