Features:
- Converting protobuf columns to and from JSON.
- Selecting parts of a protobuf.
Examples:

```sql
SELECT protobuf_query('MyProto:some_submessage.some_map[some_key]', my_proto_column) FROM ...;
-- or for multiple results as an array
SELECT protobuf_query_array('MyProto:some_repeated_field[*].some_map[*].some_field', my_proto_column) FROM ...;
-- or as rows
SELECT protobuf_query_multi('MyProto:some_repeated_field[*].some_map[*].some_field', my_proto_column) FROM ...;
```
Protobufs in a database are in many ways similar to JSON in a database:
- less code is needed to convert data structures into database rows and back
- fewer `ALTER TABLE`s are required as data fields are added and removed
The main advantage of protobufs over JSON is a compact and efficient representation. The main advantage of JSON is human-readability without extra steps.
Protobufs have a well-defined JSON representation supported by most implementations.
Using that for storage (as `json` or `jsonb`) is a valid strategy if efficiency is not a major concern.
Note that JSON protobufs store fields by name instead of by number, which means that field renames will break backwards- and forwards-compatibility.
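A minimal sketch of that strategy, assuming a hypothetical `pkg.Event` message and `raw_events` table, and using the `protobuf_to_json_text` function described under the function list below:

```sql
-- store the JSON representation in a jsonb column instead of the binary protobuf
CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
INSERT INTO events (payload)
SELECT protobuf_to_json_text('pkg.Event', e.raw_proto)::jsonb FROM raw_events e;
```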
Requires Postgres 11 or newer.
Prerequisites for installing from source:
- Postgres server headers (if you use the Postgres APT repo, install `postgresql-server-dev-$VERSION`)
- A (mostly) C++17-capable compiler (Ubuntu 18.04's default GCC 7.4.0 works)
To install:

```sh
make
sudo make install
```

Then in Postgres:

```sql
CREATE EXTENSION postgres_protobuf;
```
Prebuilt binaries are here.
Currently they are built with and tested on Ubuntu 20.04 only.
You can build your own binary package from source with `make dist`.
On Ubuntu, install the contents of the binary package like this:
- copy the contents of `lib/` to `/usr/lib/postgresql/11/lib/`
- copy the contents of `extension/` to `/usr/share/postgresql/11/extension/`
Other distros may have those Postgres directories elsewhere.
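For example, for a Postgres 11 installation (a sketch; adjust the version number in the paths to match your server):

```sh
sudo cp lib/* /usr/lib/postgresql/11/lib/
sudo cp extension/* /usr/share/postgresql/11/extension/
```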
First, you need to tell the extension about your protobuf schema.
Run the protobuf compiler `protoc` on your `.proto` files with `--descriptor_set_out=path/to/schema.pb` and `--include_imports`, then insert that file into the database with

```sql
INSERT INTO protobuf_file_descriptor_sets (name, file_descriptor_set)
VALUES ('default', contents_of_file);
```

where `contents_of_file` is a Postgres byte array. Commit the transaction if you're in one.
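An end-to-end sketch (file and schema names are hypothetical; `pg_read_binary_file` is a built-in Postgres function that reads from the *server's* filesystem and requires superuser or the `pg_read_server_files` role):

```sh
protoc --descriptor_set_out=/tmp/schema.pb --include_imports my_messages.proto
```

```sql
INSERT INTO protobuf_file_descriptor_sets (name, file_descriptor_set)
VALUES ('default', pg_read_binary_file('/tmp/schema.pb'));
```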
Now you can query protobufs described by your schema like this:

```sql
SELECT protobuf_query('path.to.Message:path.to.field', protobuf_as_byte_array) AS result;
```
The following functions are defined:
- `protobuf_query(query, protobuf)` returns the first matching field in the protobuf, or NULL if missing or proto3 default. (You can `coalesce` the NULLs.)
- `protobuf_query_array(query, protobuf)` returns all matching fields in the protobuf as a text array. Missing or proto3 default values are not returned.
- `protobuf_query_multi(query, protobuf)` returns all matching fields in the protobuf as a set of rows. Missing or proto3 default values are not returned.
- `protobuf_to_json_text(protobuf_type, protobuf)` converts the protobuf to a JSON string, assuming it's of the given type.
- `protobuf_from_json_text(protobuf_type, json_str)` parses a protobuf from a JSON string, assuming it's of the given type.
- `protobuf_extension_version()` returns the extension version `X.Y.Z` as the number `X*10000 + Y*100 + Z`.
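For example, under that encoding a hypothetical version 1.2.3 would be reported as:

```sql
SELECT protobuf_extension_version(); -- 10203 for version 1.2.3
```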
Queries take the form `[<descriptor_set>:]<message_name>:<path>` where
- `<descriptor_set>` (optional) is the name of the descriptor set you inserted into `protobuf_file_descriptor_sets`. Defaults to `default`.
- `<message_name>` is a fully qualified name of a protobuf message.
- `<path>` may be composed of the following selectors (combined examples follow this list):
  - field selectors like `submessage.field`. Fields may be specified by name or by number.
  - index selectors like `field[123]`, which select the Nth element of a repeated field.
  - map value selectors like `field[123]` or `field[abc]`, which select the value for the given map key (both numbers and strings work).
  - universal selectors written `field[*]`, which select all elements of a repeated field or map.
  - universal map key selectors written `field|keys`, which select all keys of a map.
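A few combined examples, assuming a hypothetical message `pkg.Order` with a repeated submessage field `items` whose elements carry a `map<string, string>` named `attrs`:

```sql
-- the value of the 'color' key in every item's attrs map, one row per match
SELECT protobuf_query_multi('pkg.Order:items[*].attrs[color]', order_proto) FROM orders;
-- all keys of the attrs map of the item at index 0
SELECT protobuf_query_multi('pkg.Order:items[0].attrs|keys', order_proto) FROM orders;
```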
While this extension should be good to go for exploratory queries and other non-demanding workloads, there are some limitations to consider before relying on it too heavily in a design.
This extension is written in C++, so some caution is warranted. It may be unwise to give it untrusted queries or protobuf data that you haven't parsed and reserialized first. Conversion to/from JSON should be safer since it thinly wraps the well-tested protobuf library, but see the note about memory management below.
Queries need to load and scan through the entire protobuf column. Storing and querying large protobufs can be significantly slower than splitting the data into columns that can be queried individually.
In the current version, protobuf schemas are deserialized and cached only for the duration of a single transaction. If you intend to run many SELECTs on protobufs, wrap them in a transaction for better performance.
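For instance (a sketch with placeholder queries):

```sql
BEGIN;
-- the descriptor set is deserialized once, then cached for the rest of the transaction
SELECT protobuf_query('pkg.Order:status', order_proto) FROM orders;
SELECT protobuf_query('pkg.Order:items[*].name', order_proto) FROM orders;
COMMIT;
```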
In the current version, there is no way to use the query functions as index expressions,
because the query functions depend on your protobuf schema, which may change over time.
In other words, you can't create an index (nor a `UNIQUE` constraint) on the contents of a protobuf column.
A future version may allow such indices for queries that are written entirely in terms of field numbers.
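One possible workaround (a sketch, not a feature of the extension; table and field names are hypothetical) is to extract the field into an ordinary column at write time and index that:

```sql
ALTER TABLE orders ADD COLUMN status text;
UPDATE orders SET status = protobuf_query('pkg.Order:status', order_proto);
CREATE INDEX orders_status_idx ON orders (status);
```

You would then have to keep the extracted column in sync yourself, e.g. from application code.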
This extension allocates most things on the default C++ heap, because the protobuf library does not support custom allocators. This memory might not be properly accounted for by Postgres's memory management and monitoring systems.
While care has been taken to avoid memory leaks and to tolerate memory exhaustion at any point, this has not been rigorously tested, and I am unsure whether the protobuf library always cleans up correctly after a `bad_alloc` exception.
In the current version, protobuf map values are buffered before being scanned by queries. This means that huge protobufs whose bulk is under a highly nested map may take a lot of memory. Note that an attacker can easily construct such a protobuf if the schema has recursion involving a map. This caveat applies only to queries, not to JSON conversion.
Other than the above case, memory use is linear, or roughly `O(|descriptor sets| + |largest protobuf queried| + |result set|)`.
Protobuf versions 2 and 3 should both work, but groups (which have been deprecated for a long time) are currently not supported and will cause queries to fail if encountered.
Note that proto3 does not store default values for fields. The current implementation returns no result for missing values. This means that e.g. a numeric field whose value is 0 will not show up in results. A future version may change this default.
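A sketch of papering over this with `coalesce` (message, field, and table names are hypothetical):

```sql
-- a proto3 uint32 field whose value is 0 yields NULL; coalesce restores the default
SELECT coalesce(protobuf_query('pkg.User:age', user_proto), '0') AS age FROM users;
```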
This extension has so far only been tested on AMD64.
The current version does not provide functions for modifying the contents of protobufs, and there are no concrete plans to add such functionality. There are also no concrete plans to significantly extend the query language.
If you do need these features occasionally, consider converting to JSON and back, since Postgres has a wide range of JSON operations.
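A sketch of such a round trip using the standard `jsonb_set` function (message, table, and column names are hypothetical):

```sql
-- rewrite one field by converting to JSON, editing it, and converting back
UPDATE users
SET user_proto = protobuf_from_json_text(
      'pkg.User',
      jsonb_set(protobuf_to_json_text('pkg.User', user_proto)::jsonb,
                '{name}', '"Alice"')::text)
WHERE id = 1;
```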
There is an older project pg_protobuf, which appears to have been an experiment that is no longer actively maintained. It has fewer features and conveniences, e.g. fields must be referred to by number, but its code is much simpler, with no dependency on the protobuf library or C++.