This document intends to build on the JSONB value encoding RFC. Currently, it is only possible to value-encode JSONB values in CRDB. The following document intends on proposing a format for key-encoding JSONB values in hopes of having a semantic ordering for JSON which would in turn allow forward indexes on JSON columns. In order to implement this, a lexicographical order is required for the encoding of each JSONB value.
JSON is stored in the JSONB format in CRDB. The motivation and use-cases for including JSONB were included in the scoping RFC of JSONB. Currently, it is only possible to value-encode JSONB values which do not permit forward-indexes on JSON columns. A key-encoding of JSONB will include a valid lexicographical ordering for JSON which shall in turn allow the creation of primary indexes as well as the execution of ORDER BY
statements on a JSON column. Moreover, it is imperative to note that JSON numbers that are decimals and JSON containers (arrays and objects) that contain such numbers could have composite encodings. Thus, these changes also allow JSON columns to now be composite in nature.
A JSON value can be one of seven types:
- true,
- false,
- null,
- a UTF-8 string,
- an arbitrary-precision number,
- an array of zero or more JSON values,
- or an object consisting of key-value pairs, where the keys are unique strings and the values are JSON values.
The following rules were kept in mind while designing this form of encoding, as defined in the postgres documentation
-
Object > Array > Boolean > Number > String > Null
-
Object with n pairs > Object with n-1 pairs
-
Array with n elements > Array with n - 1 elements
-
Moreover, arrays with equal number of elements are compared in the order: element - 1, element - 2, element - 3, ….
-
Objects with an equal number of key value pairs are compared in the order: key - 1, value - 1, key - 2, value - 2, ….
In order to satisfy property 1 at all times, tags are defined in an increasing order of bytes. These tags will also have to be defined in a way where the tag representing an object is a large byte representation for a hexadecimal value (such as 0xff) and the subsequent objects have a value 1 less than the previous one, where the ordering is described in point 1 above. Moreover, care has to be taken to ensure that none of these bytes correspond to a possible encoding of a JSON value.
Additionally, tags representing terminators will also be defined. There will be two terminators, one for the ascending designation and the other for the descending one, and will be required to denote the end of a key encoding of the following JSON values: Objects, Arrays, Number and Strings. JSON Boolean and JSON Null are not required to have the terminator since they do not have variable length encoding due to the presence of a single tag (as explained later in this document).
When converting textual JSON input into JSONB, the primitive types described by RFC 7159 are mapped into the primitive types described in CRDB:
A JSONB null is stored solely as the tag representing the type.
A JSONB String is stored as:
- a JSONB String tag
- a UTF-8 string
- JSONB terminator tag
A JSONB Number is stored as:
- a JSONB Number tag
- encoding of the number using CRDB’s Decimal encoding.
- JSONB terminator tag
It is imperative to note that the following query results in being true in postgres:
select 'true'::JSONB > 'false'::JSONB;
Thus, instead of a single Boolean tag in the enum (which was described above), two separate tags, one for true and one for false, shall be present. Moreover, the tag for true will have a higher value, in bytes, than the tag for false.
Thus, a JSONB Boolean is stored as just the tag depicting its type.
A JSONB array is stored as:
- a JSONB array tag
- number of elements present in the array
- encoding of the JSON values
- JSONB terminator tag
An interesting property to note about a JSON object, while conversing about its possible encoding, is the fact that there is no order in which JSON keys are stored inside the object container. This results in the following query to be true (tested inside of psql):
select '{"a": "b", "c":"d"}'::JSONB = '{"c":"d", "a":"b"}'::JSONB;
Thus, it is important to sort the key-value pairs present within a JSON object. In order to achieve a sorting “effect”, encoding will be done in a fashion where all the key-value pairs will be sorted first according to the encodings of the current keys of these pairs. This will be followed by an encoding of each key and value present in these pairs and concatenating them to build the resulting encoding. The key and value encodings, of each such pair, will be intertwined to keep the property that the resultant encoding shall sort the same way as the original object.
Thus, a JSONB object is stored as:
- a JSONB object tag
- number of key-value pairs present in the object
- an encoding of the key-value pairs where the pairs have been sorted first on the encodings of the keys.
- JSONB terminator tag
Example JSON's | Encodings |
---|---|
enc(NULL::JSONB ) |
[JSONB_NULL] |
enc("a"::JSONB ) |
[JSONB_String, enc("a"), JSONB_Terminator] |
enc('1'::JSONB ) |
[JSONB_Number, enc(1), JSONB_Terminator] |
enc('False'::JSONB ) |
[JSONB_False] |
enc('True'::JSONB ) |
[JSONB_True] |
enc('[1, 2, "a"]'::JSONB ) |
[JSONB_Array, enc(3), JSONB_Number, enc(1), JSONB_Terminator, JSONB_Number, enc(2), JSONB_Terminator, JSONB_String, enc("a"), JSONB_Terminator, JSONB_Terminator] |
enc('{“a”:”b”, “c”:”d”}'::JSONB ) |
[ JSONB_Object, enc(2), JSONB_String, enc(a), JSONB_Terminator, JSONB_String, enc(b), JSONB_Terminator, JSONB_String, enc(c), JSONB_Terminator, JSONB_String, enc(d), JSONB_Terminator, JSONB_Terminator] |
enc('{“a”: ‘[1]’, “b”: {“c”: 0}'::JSONB ) |
[ JSONB_Object, enc(2), JSONB_String, enc(a), JSONB_Terminator, JSONB_Array, enc(1), enc(1), JSONB_Terminator, JSONB_String, enc(b), JSONB_Object, enc(1), JSONB_String, enc(c), JSONB_Terminator, JSONB_Number, enc(0), JSONB_Terminator, JSONB_Terminator ] |