This repository is currently archived, please get in touch via [email protected] or the TileDB Forum if you have a use-case to discuss.
TileDB is an efficient library for managing large-scale, multi-dimensional dense and sparse array data introducing a novel array format. For more information about TileDB see the official TileDB documentation
This connector allows running SQL on TileDB arrays via Presto. The TileDB-Presto interface supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.
A quickstart Docker image is available. The docker image will start a single-node
Presto cluster and open the CLI Presto interface where SQL can be run.
The Docker image includes two example tiledb arrays
/opt/tiledb_example_arrays/dense_global
and /opt/tiledb_example_arrays/sparse_global
.
Simply run:
docker run -it --rm tiledb/tiledb-presto
or mount a local array into the Docker container with the -v
option:
docker run -it --rm -v /local/array/path:/data/local_array tiledb/tiledb-presto
In the above example, replace /local/array/path
with the path to the
array folder on your local machine. The /data/local_array
path is the
path you will use within the Docker image to access /local/array/path
(you can replace it with another path of your choice).
The TileDB presto connector supports most SQL operations from PrestoDB. Arrays can be referenced dynamically and are not required to be "pre-registered" with Presto. No external service (such as Apache Hive) is required.
Examples:
show columns from "file:///opt/tiledb_example_arrays/dense_global";
Column | Type | Extra | Comment
--------+---------+-------+-----------
rows | integer | | Dimension
cols | integer | | Dimension
a | integer | | Attribute
select * from "file:///opt/tiledb_example_arrays/dense_global" WHERE rows = 3 AND cols between 1 and 2;
rows | cols | a
------+------+---
3 | 1 | 5
3 | 2 | 6
Presto uses the form of catalog
.schema
.table_name
for querying. TileDB
does not have a concept of a schema, so any valid string can be used for the
schema name when querying. tiledb
is used for convenience in the examples.
table_name
is the array URI and can be local (file://) or remote (s3://).
It is possible to specify a file that contains sql to be run from the docker image:
echo 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;' > example.sql
docker run -it --rm -v ${PWD}/example.sql:/tmp/example.sql tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --file /tmp/example.sql
You can also run a sql statement directly:
docker run -it --rm tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --execute 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;'
For more examples see docs/Examples.md.
For custom connector SQL options see docs/SQL.md.
Currently this connector is built as a plugin. It must be packaged and installed on the PrestoDB instances.
Download the latest release and skip to the section Installation on existing Presto instance.
The TileDB connector can be built using the following command from the top level directory of the Presto source.
./mvnw package
Tests can be skipped by adding -DskipTests
./mvnw package -DskipTests
If you are installing the plugin on an existing Presto instance, such as Amazon
EMR, you need to copy the target/presto-tiledb-$VERSION
folder
to a tiledb
directory under the plugin directory on echo Presto node.
Using Amazon EMR target/presto-tiledb-$VERSION
needs to be copied to
/usr/lib/presto/plugin/tiledb/
See docs/Limitations.md.
When a multi-dimensional array is queried in Presto, the dimensions are converted to table columns for the result set. TileDB array attributes attributes are also returned as columns.
A sparse array is materialized similarly to dense arrays. The following example
depicts a 2D 4x4
sparse array with dimensions dim1
, dim2
and
a single attribute a
. Notice that this array has mostly empty cells.
+-------+-------+-------+-------+
| | | | |
| a:1 | | | |
| | | | |
+-------------------------------+
| | | | |
| | | a:3 | a:2 |
| | | | |
+-------------------------------+
| | | | |
| | | | |
| | | | |
+-------------------------------+
| | | | |
| | | | |
| | | | |
+-------+-------+-------+-------+
For sparse arrays only non-empty cells are materialized and returned. The above array is modeled in Presto as a table of the form:
dim1 | dim2 | a
------+------+---
1 | 1 | 1
2 | 4 | 2
2 | 3 | 3