This project aims to make it easy to get started with Presto. It is based on Docker and Docker Compose. Currently, the following features are supported:
- Dedicated Presto scheduler node and a variable number of worker nodes
- Function Namespace Manager (for creating functions; see the sketch after this list)
- Hive connector, Hive Metastore, and non-replicated HDFS (i.e., replication factor 1) with a variable number of data nodes
- Reading from S3 without additional configuration (if running in EC2 and with a properly configured instance profile)
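As an illustration of the Function Namespace Manager, a SQL function can be created from the Presto CLI once the services are running (see below). This is only a sketch: the namespace example.default and the function name double_it are assumptions and depend on how the Function Namespace Manager is configured in this project:

CREATE FUNCTION example.default.double_it(x BIGINT)
RETURNS BIGINT
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN x * 2;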
The following should be enough to bring up all required services:
docker-compose up
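Once the containers are up, you can check that all services started; the Presto scheduler node should also expose its HTTP endpoint on localhost:8080 (the same address used by the CLI examples below):

docker-compose ps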
To change the number of Presto worker nodes or HDFS data nodes, use the --scale flag of docker-compose:
docker-compose up --scale datanode=3 --scale presto-worker=3
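To confirm that all workers have registered with the scheduler node, you can query Presto's system.runtime.nodes table, for example via the CLI container described below (adapt the container name if necessary):

docker exec -it docker-presto_presto_1 presto-cli --execute 'SELECT * FROM system.runtime.nodes;'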
The commands above use a pre-built Docker image. If you want the image to be built locally, run the following instead:
docker-compose --file docker-compose-local.yml up
If you are behind a corporate firewall, you will have to configure Maven (which is used to build part of Presto) as follows before running the above command:
export MAVEN_OPTS="-Dhttp.proxyHost=your.proxy.com -Dhttp.proxyPort=3128 -Dhttps.proxyHost=your.proxy.com -Dhttps.proxyPort=3128"
The data/ folder is mounted into the HDFS namenode container, from where you can upload its contents to HDFS using the HDFS client in that container (container names such as docker-presto_namenode_1 may differ on your machine; run docker ps to find out):
docker exec -it docker-presto_namenode_1 hadoop fs -mkdir /dataset
docker exec -it docker-presto_namenode_1 hadoop fs -put /data/file.parquet /dataset/
docker exec -it docker-presto_namenode_1 hadoop fs -ls /dataset
You can use the Presto CLI included in the Docker containers of this project (adapt container name if necessary):
docker exec -it docker-presto_presto_1 presto-cli --catalog hive --schema default
Alternatively, you can download the Presto CLI, rename it, make it executable, and run the following:
./presto-cli --server localhost:8080 --catalog hive --schema default
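For example, the CLI can be fetched from Maven Central as an executable JAR; the version below is only an example and should match the Presto version used by the server:

PRESTO_VERSION=0.283   # example version; adjust to match the server
wget "https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar" -O presto-cli
chmod +x presto-cli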
Suppose you have the following file test.json:
{"s": "hello world", "i": 42}
Upload it to /test/test.json on HDFS as described above. Then run the following in the Presto CLI:
CREATE TABLE test (s VARCHAR, i INTEGER) WITH (EXTERNAL_LOCATION = 'hdfs://namenode/test/', FORMAT = 'JSON');
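If the table was created successfully, a simple query should return the row from test.json:

SELECT * FROM test;

This should produce one row with s = 'hello world' and i = 42.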
For external tables from S3, spin up this service in an EC2 instance, set up an instance profile for that instance, and use the s3a:// protocol instead of hdfs://.
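For illustration, an external S3 table could be created like this (the bucket name is a placeholder; the data layout mirrors the HDFS example above):

CREATE TABLE test_s3 (s VARCHAR, i INTEGER) WITH (EXTERNAL_LOCATION = 's3a://your-bucket/test/', FORMAT = 'JSON');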
In case you need to make manual changes or want to inspect the MySQL databases, you can connect to the MySQL server like this:
docker exec -it docker-presto_mysql_1 mysql -ppassword
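For example, to list the databases managed in that container (such as those backing the Hive Metastore and the Function Namespace Manager):

docker exec -it docker-presto_mysql_1 mysql -ppassword -e 'SHOW DATABASES;'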