Skip to content
This repository has been archived by the owner on Mar 7, 2024. It is now read-only.

TileDB-Inc/TileDB-Presto

Repository files navigation

This repository is currently archived, please get in touch via [email protected] or the TileDB Forum if you have a use-case to discuss.

TileDB Presto Connector

TileDB-Presto CI

TileDB is an efficient library for managing large-scale, multi-dimensional dense and sparse array data introducing a novel array format. For more information about TileDB see the official TileDB documentation

This connector allows running SQL on TileDB arrays via Presto. The TileDB-Presto interface supports column subselection on attributes and predicate pushdown on dimension fields, leading to superb performance for projection and range queries.

Quickstart

Docker

A quickstart Docker image is available. The docker image will start a single-node Presto cluster and open the CLI Presto interface where SQL can be run. The Docker image includes two example tiledb arrays /opt/tiledb_example_arrays/dense_global and /opt/tiledb_example_arrays/sparse_global. Simply run:

docker run -it --rm tiledb/tiledb-presto

or mount a local array into the Docker container with the -v option:

docker run -it --rm -v /local/array/path:/data/local_array tiledb/tiledb-presto

In the above example, replace /local/array/path with the path to the array folder on your local machine. The /data/local_array path is the path you will use within the Docker image to access /local/array/path (you can replace it with another path of your choice).

The TileDB presto connector supports most SQL operations from PrestoDB. Arrays can be referenced dynamically and are not required to be "pre-registered" with Presto. No external service (such as Apache Hive) is required.

Examples:

show columns from "file:///opt/tiledb_example_arrays/dense_global";
Column |  Type   | Extra |  Comment  
--------+---------+-------+-----------
 rows   | integer |       | Dimension 
 cols   | integer |       | Dimension 
 a      | integer |       | Attribute 

select * from "file:///opt/tiledb_example_arrays/dense_global" WHERE rows = 3 AND cols between 1 and 2;
 rows | cols | a 
------+------+---
    3 |    1 | 5 
    3 |    2 | 6 

Presto uses the form of catalog.schema.table_name for querying. TileDB does not have a concept of a schema, so any valid string can be used for the schema name when querying. tiledb is used for convenience in the examples. table_name is the array URI and can be local (file://) or remote (s3://).

Running SQL from a File or Command Line Argument

It is possible to specify a file that contains sql to be run from the docker image:

echo 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;' > example.sql
docker run -it --rm -v ${PWD}/example.sql:/tmp/example.sql tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --file /tmp/example.sql

You can also run a sql statement directly:

docker run -it --rm tiledb/tiledb-presto /opt/presto/bin/entrypoint.sh --execute 'select * from "file:///opt/tiledb_example_arrays/dense_global" limit 10;'

For more examples see docs/Examples.md.

For custom connector SQL options see docs/SQL.md.

Installation

Currently this connector is built as a plugin. It must be packaged and installed on the PrestoDB instances.

Latest Release

Download the latest release and skip to the section Installation on existing Presto instance.

Building Connector From Source

The TileDB connector can be built using the following command from the top level directory of the Presto source.

./mvnw package

Tests can be skipped by adding -DskipTests

./mvnw package -DskipTests

Installation on an existing Presto instance

If you are installing the plugin on an existing Presto instance, such as Amazon EMR, you need to copy the target/presto-tiledb-$VERSION folder to a tiledb directory under the plugin directory on echo Presto node.

AWS EMR

Using Amazon EMR target/presto-tiledb-$VERSION needs to be copied to /usr/lib/presto/plugin/tiledb/

Configuration

See docs/Configuration.md.

Limitations

See docs/Limitations.md.

Arrays as SQL Tables

When a multi-dimensional array is queried in Presto, the dimensions are converted to table columns for the result set. TileDB array attributes attributes are also returned as columns.

Sparse Arrays

A sparse array is materialized similarly to dense arrays. The following example depicts a 2D 4x4 sparse array with dimensions dim1, dim2 and a single attribute a. Notice that this array has mostly empty cells.

+-------+-------+-------+-------+
|       |       |       |       |
|  a:1  |       |       |       |
|       |       |       |       |
+-------------------------------+
|       |       |       |       |
|       |       |  a:3  |  a:2  |
|       |       |       |       |
+-------------------------------+
|       |       |       |       |
|       |       |       |       |
|       |       |       |       |
+-------------------------------+
|       |       |       |       |
|       |       |       |       |
|       |       |       |       |
+-------+-------+-------+-------+

For sparse arrays only non-empty cells are materialized and returned. The above array is modeled in Presto as a table of the form:

 dim1 | dim2 | a
------+------+---
    1 |    1 | 1
    2 |    4 | 2
    2 |    3 | 3