There are two ways of deploying Sleeper and interacting with an instance: you can deploy to AWS, or to Docker on your local machine. The Docker version has limited functionality and will only work with small volumes of data, but will allow you to deploy an instance, ingest some files, and run reports and scripts against the instance.
To get started we'll use the Sleeper CLI, which runs in Docker on your local machine.
The Sleeper CLI has the following dependencies:
The Sleeper CLI contains Docker images with the necessary dependencies and scripts to work with Sleeper. Run the following commands to install the CLI. The version can be main or a release in the format v0.16.0.
curl "https://raw.githubusercontent.com/gchq/sleeper/[version]/scripts/cli/install.sh" -o ./sleeper-install.sh
chmod +x ./sleeper-install.sh
./sleeper-install.sh [version]
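As a concrete illustration, the install steps above can be wrapped in a small shell function so the version is set in one place. This is only a sketch: the function names sleeper_install_url and install_sleeper_cli are invented for this example, and v0.16.0 is just the sample release format mentioned above.

```shell
# Hypothetical helpers; only the URL layout and the install steps come from the commands above.
sleeper_install_url() {
  # $1 is the version: "main" or a release tag such as v0.16.0
  echo "https://raw.githubusercontent.com/gchq/sleeper/$1/scripts/cli/install.sh"
}

install_sleeper_cli() {
  version="${1:-main}"   # defaults to main when no version is given
  curl -fsSL "$(sleeper_install_url "$version")" -o ./sleeper-install.sh &&
    chmod +x ./sleeper-install.sh &&
    ./sleeper-install.sh "$version"
}

# Usage (not run automatically): install_sleeper_cli v0.16.0
```

Passing the same version to both the URL and the install script mirrors the two places where [version] appears in the original commands.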
This installs a sleeper command to run other commands inside a Docker container. You can use sleeper aws or sleeper cdk to run aws or cdk commands without needing to install the AWS or CDK CLI on your machine. If you set AWS environment variables or configuration on the host machine, that will be propagated to the Docker container when you use sleeper.
You can also upgrade the CLI to a different version with sleeper cli upgrade.
The quickest way to get an instance of Sleeper is to deploy to LocalStack in Docker on your local machine. Note that the LocalStack version has very limited functionality in comparison to the AWS version, and can only handle small volumes of data. See the documentation on deploying to LocalStack for more information.
The easiest way to deploy a full instance of Sleeper and interact with it is to use the "system test" functionality. This deploys a Sleeper instance with a simple schema, and writes some random data into a table in the instance. You can then use the status scripts to see how much data is in the system, run some example queries, and view logs to help understand what the system is doing. It is best to do this from an EC2 instance as a significant amount of code needs to be uploaded to AWS.
To use the Sleeper CLI against AWS, you need to authenticate against your AWS account. You can do this by running sleeper aws configure, or other sleeper aws commands according to your AWS setup. AWS environment variables will also be propagated to the Sleeper CLI.
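One way to confirm that credentials are reaching the CLI is to ask AWS who you are. The sketch below assumes that sleeper aws passes its arguments straight through to the AWS CLI, as described earlier. The name check_aws_identity is invented for this example, and the profile and region values are placeholders.

```shell
# Placeholder values (assumptions): replace with your own profile and region.
export AWS_PROFILE="my-profile"
export AWS_REGION="eu-west-2"

# Hypothetical helper: prints the account and the user or role the CLI is authenticated as.
check_aws_identity() {
  sleeper aws sts get-caller-identity --output json
}

# Usage (not run automatically): check_aws_identity
```

If the call fails or reports an unexpected account, fix the authentication before bootstrapping or deploying anything.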
If the CDK has never been bootstrapped in your AWS account, this must be done first. This only needs to be done once in a given AWS account.
sleeper cdk bootstrap
Next, you'll need a VPC that is suitable for deploying Sleeper. You'll also want an EC2 instance to deploy from, to avoid lengthy uploads of large jar files and Docker images. You can use the Sleeper CLI to create both of these.
If you'd prefer to use your own, you'll need to install the Sleeper CLI on your EC2, which should run on an x86_64 architecture. You'll need to authenticate with AWS as described above. You'll need to ensure your VPC meets Sleeper's requirements, but you can also deploy a fresh VPC with the CLI. This is documented in the deployment guide.
The Sleeper CLI can create an EC2 instance in a VPC that is suitable for deploying Sleeper. This will automatically configure authentication such that once you're in the EC2 instance you'll have administrator access to your AWS account.
sleeper environment deploy TestEnvironment
The sleeper environment deploy command will wait for the EC2 instance to be deployed.
You can then SSH to it with EC2 Instance Connect and SSM Session Manager, using this command:
sleeper environment connect
Immediately after it's deployed, commands will run on this instance to install the Sleeper CLI. Once you're connected, you can check the progress of those commands like this:
cloud-init status
You can check the output like this (add -f if you'd like to follow the progress):
tail /var/log/cloud-init-output.log
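If you would rather block until the setup finishes than poll by hand, cloud-init status accepts a --wait flag. The following is a minimal sketch to run on the EC2 instance. The name wait_for_setup is invented here, and the log path is the one shown above.

```shell
# Hypothetical helper: waits for cloud-init to finish, then shows the end of its log.
wait_for_setup() {
  cloud-init status --wait &&
    tail -n 20 /var/log/cloud-init-output.log
}

# Usage (not run automatically): wait_for_setup
```

Because the EC2 restarts once the setup has finished, expect the session to drop shortly after the function returns.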
Once it has finished, the EC2 will restart. Once it's restarted you can use the Sleeper CLI. Reconnect to the EC2 with sleeper environment connect.
You can access a built copy of the Sleeper scripts by running sleeper deployment in the EC2. That will get you a shell inside a Docker container inside the EC2. You can run all the deployment scripts there as explained below. If you run it outside of the EC2, you'll get the same thing but in your local Docker host. Use the one in the EC2 to avoid the deployment being slow uploading jars and Docker images.
The Sleeper Git repository will also be cloned, and you can access it by running sleeper builder in the EC2. That will get you a shell inside a Docker container similar to the sleeper deployment one, but with the dependencies for building Sleeper. The whole working directory will be persisted between executions of sleeper builder.
If you want someone else to be able to access the same environment EC2, they can run sleeper environment add <id> with the same environment ID. To begin with you'll both log on as the same user and share a single screen session. You can set up separate users with sleeper environment adduser <username>, and switch users with sleeper environment setuser <username>. If you call sleeper environment setuser with no arguments, you'll switch back to the original default user for the EC2.
To run the system test, set the environment variable ID to a globally unique string. This is the instance id. It is used as part of the name of all the AWS resources that are deployed, such as an S3 bucket, lambdas, etc., and therefore should conform to the naming requirements of those services. In general, stick to lowercase letters, numbers, and hyphens. Using the instance id in resource names makes it easy to find the resources that Sleeper has deployed within each service (go to the service in the AWS console and type the instance id into the search box).
Avoid reusing the same instance id, as log groups from a deleted instance will still be present unless you delete them. An instance will fail to deploy if it would replace log groups from a deleted instance.
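A quick check of the instance id before deploying can catch naming problems early. The sketch below only applies the general advice above (lowercase letters, numbers, and hyphens). It does not enforce the length limits or other rules of individual AWS services. The function name is_valid_instance_id and the example value are invented for illustration.

```shell
# Hypothetical check: accepts only lowercase letters, digits and hyphens.
# It does not enforce the length or other rules of individual AWS services.
is_valid_instance_id() {
  case "$1" in
    "" | *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    *) return 0 ;;
  esac
}

ID="my-sleeper-test-1"   # example value only; choose your own unique id
if is_valid_instance_id "$ID"; then
  echo "ID looks OK: $ID"
else
  echo "ID has unsupported characters: $ID" >&2
fi
```

The allowed characters are listed explicitly, rather than as ranges like a-z, so that the result does not depend on the shell's locale.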
Create an environment variable called VPC which is the id of the VPC you want to deploy Sleeper to, and create an environment variable called SUBNETS with the ids of subnets you wish to deploy Sleeper to (note that this is only relevant to the ephemeral parts of Sleeper - all of the main components use services which naturally span availability zones). Multiple subnet ids can be specified with commas in between, e.g. subnet-a,subnet-b.
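For example, the two variables could be set as follows. The ids here are placeholders, not real resources, and join_subnets is a hypothetical helper that builds the comma-separated list.

```shell
# Placeholder ids (assumptions): substitute the VPC and subnets from your own account.
export VPC="vpc-0123456789abcdef0"

# Hypothetical helper: joins its arguments with commas, the format SUBNETS expects.
join_subnets() {
  (IFS=,; echo "$*")
}

export SUBNETS="$(join_subnets subnet-a subnet-b)"
echo "VPC=$VPC SUBNETS=$SUBNETS"
```

Setting IFS inside a subshell keeps the change to the field separator from leaking into the rest of your session.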
The VPC must have an S3 Gateway endpoint associated with it, otherwise the cdk deploy step will fail.
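You can check for the endpoint before deploying. The sketch below assumes that sleeper aws forwards its arguments to the AWS CLI, and uses the describe-vpc-endpoints command of aws ec2. The function name vpc_has_s3_gateway_endpoint is invented for this example.

```shell
# Hypothetical helper: lists the service names of Gateway endpoints in a VPC
# and succeeds only if one of them is an S3 endpoint (service name ending in ".s3").
vpc_has_s3_gateway_endpoint() {
  sleeper aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$1" "Name=vpc-endpoint-type,Values=Gateway" \
    --query 'VpcEndpoints[].ServiceName' --output text |
    tr '\t' '\n' | tr -d '\r' | grep -q '\.s3$'
}

# Usage (not run automatically):
#   vpc_has_s3_gateway_endpoint "$VPC" || echo "No S3 Gateway endpoint found" >&2
```

The function prints nothing and reports the result through its exit status, so it can guard the deployment command in a script.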
While connected to your EC2 instance run:
sleeper deployment test/deployAll/deployTest.sh ${ID} ${VPC} ${SUBNETS}
An S3 bucket will be created for the jars, and ECR repos will be created and Docker images pushed to them. Note that this script currently needs to be run from an x86_64 machine as we do not yet have cross-architecture Docker builds. Then CDK will be used to deploy a Sleeper instance. This will take around 20 minutes. Once that is complete, some tasks are started on an ECS cluster. These tasks generate some random data and write it to Sleeper. 11 ECS tasks will be created, each writing 40 million records (440 million in total). As all writes to Sleeper are asynchronous, it will take a while before the data appears (around 8 minutes).
You can watch what the ECS tasks that are writing data are doing by going to the ECS cluster named sleeper-${ID}-system-test-cluster, finding a task and viewing the logs.
Run the following command to see how many records are currently in the system:
sleeper deployment utility/filesStatusReport.sh ${ID} system-test
The randomly generated data in the table conforms to the schema given in the file scripts/templates/schema.template. This has a key field called key which is of type string. The code that randomly generates the data will generate keys which are random strings of length 10. To run a query, use:
sleeper deployment utility/query.sh ${ID}
As the data that went into the table is randomly generated, you will need to query for a range of keys, rather than a specific key. The above script can be used to run a range query (i.e. a query for all records where the key is in a certain range): press 'r' and then enter a minimum and a maximum value for the query. Don't choose too large a range or you'll end up with a very large amount of data sent to the console; a reasonably narrow range would be, e.g., a min of 'aaaaaaaaaa' and a max of 'aaaaazzzzz'. Note that the first query is slower than the others due to the overhead of initialising some libraries. Also note that this query is executed directly from a Java class, so data is read directly from S3 to wherever the script is run. It is also possible to execute queries using lambda and have the results written to either S3 or SQS. The lambda-based approach allows for a much greater degree of parallelism in the queries. Use lambdaQuery.sh instead of query.sh to experiment with this.
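To get a feel for how wide a range is sensible, you can estimate how many records it will return. The arithmetic below is a rough sketch. It assumes that keys are drawn uniformly from the 26 lowercase letters, which this guide does not confirm. The record count comes from the 11 tasks of 40 million records each described earlier.

```shell
# Assumptions: 10-character keys, uniformly random over a-z (not confirmed by this guide).
total_records=$((11 * 40000000))   # 11 ECS tasks x 40 million records each

# A range from 'aaaaaaaaaa' to 'aaaaazzzzz' fixes the first 5 characters,
# so it covers roughly 1 in 26^5 of all possible keys.
fraction_denominator=$((26 * 26 * 26 * 26 * 26))

echo "Total records: $total_records"
echo "Expected matches: about $((total_records / fraction_denominator))"
```

Under these assumptions the example range returns a few dozen records. Each character removed from the fixed prefix multiplies the result size by 26.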
Be careful: if you specify SQS as the output and query for a range containing a large number of records, a large number of results could be posted to SQS, which could result in significant charges.
Over time you will see the number of active files (as reported by the filesStatusReport.sh script) decrease. This is due to compaction tasks merging files together. These are executed in ECS clusters (named sleeper-${ID}-merge-compaction-cluster and sleeper-${ID}-splitting-merge-compaction-cluster).
You will also see the number of leaf partitions increase. Partition splitting is performed by lambdas called sleeper-${ID}-find-partitions-to-split and sleeper-${ID}-split-partition.
To ingest more random data, run:
sleeper deployment java -cp jars/system-test-*-utility.jar sleeper.systemtest.drivers.ingest.RunWriteRandomDataTaskOnECS ${ID} system-test
To tear all the infrastructure down, run:
sleeper deployment test/tearDown.sh
Note that this will sometimes fail if there are ECS tasks running. Ensure that there are no compaction tasks running before doing this.
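One way to check for running compaction tasks is to list them with the AWS CLI before tearing down. This sketch assumes that sleeper aws forwards its arguments to the AWS CLI, and uses the compaction cluster names given earlier in this guide. The function name running_compaction_tasks is invented for this example.

```shell
# Hypothetical helper: prints the ARNs of running tasks in both compaction clusters.
# $1 is the instance id; the cluster names follow the pattern given earlier in this guide.
running_compaction_tasks() {
  for cluster in "sleeper-$1-merge-compaction-cluster" \
                 "sleeper-$1-splitting-merge-compaction-cluster"; do
    sleeper aws ecs list-tasks --cluster "$cluster" \
      --desired-status RUNNING --query 'taskArns[]' --output text
  done
}

# Usage (not run automatically):
#   [ -z "$(running_compaction_tasks "$ID")" ] && sleeper deployment test/tearDown.sh
```

Empty output suggests that nothing is running in either cluster, so the tear down is less likely to fail.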
It is possible to run variations on this system-test by editing the system test properties, like this:
sleeper deployment
cd test/deployAll
editor system-test-instance.properties
./buildDeployTest.sh ${ID} ${VPC} ${SUBNETS}
To deploy your own instance of Sleeper with a particular schema, go to the deployment guide.