This repository contains the fully specified deployment files for the prototype TESS platform.
The following is an excerpt from A Proposal for a Science Platform for TESS:
We propose to create a TESS-focused, JupyterHub-based, science platform that will allow users to:
- quickly and easily visualize TESS data and the community-delivered HLSPs.
- explore cloud-based computational resources as a way to make the most of the large volume of TESS FFI data.
- teach the methods and tools for working with MAST's time series data using a stable, collaborative environment and high-quality tutorials.
This prototype has two primary deployments:
- tess-public: An ephemeral, mybinder.org-style JupyterHub that is open to the public and focused on outreach
- tess-private: A persistent, authenticated JupyterHub focused on collaborative research
Both of these deployments will have very similar features, but differ in terms of resources allocated to them.
tess.omgwtf.in is an ephemeral, mybinder.org-style, unauthenticated hub focused on outreach and teaching.
nbgitpuller is installed, so you can make nbgitpuller links to share with users. When clicked, they start an ephemeral session, pull in the linked git repo, and open the appropriate directory or file.
This link opens the spacetelescope/notebooks git repo directly into the TESS-related directory. This link is almost the same, but opens a specific notebook.
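For reference, nbgitpuller links are ordinary URLs against the hub, with the target repository, branch, and path to open encoded as query parameters. A link pointing the public hub at the TESS notebooks might look roughly like the following (the branch and urlpath shown are illustrative, and the query values would normally be URL-encoded)::

    https://tess.omgwtf.in/hub/user-redirect/git-pull?repo=https://github.com/spacetelescope/notebooks&branch=master&urlpath=lab/tree/notebooks

nbgitpuller's documentation includes a link generator that builds these URLs for you.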
private.tess.omgwtf.in is an authenticated JupyterHub with persistent storage, otherwise similar to TESS Public. It currently uses GitHub for authentication, but lets anyone with a GitHub account through.
nbgitpuller links (directory, notebook) work here as well, with the added advantage that nbgitpuller does 'automagic' merging for you: both the author of the git repo and the user on the JupyterHub can make changes to a notebook, and the user's changes are always preserved. This is extremely useful in workshops, since instructors can keep tweaking materials after the workshop has started without worrying about overwriting students' work.
This repository captures the complete system state of all the deployments for this prototype. This includes any AWS resources, the configuration of the JupyterHubs, secrets required to run the JupyterHubs, and the images themselves. This lets us do `continuous deployment <https://www.atlassian.com/continuous-delivery/continuous-deployment>`_ - most changes to the configuration are made via GitHub pull requests to this repository. We run automated tests against the pull request, and when satisfied, merge it, which deploys the changes. This increases the number of people who can safely make changes to the hubs' configuration, empowering more people to make changes while reducing the load on the folks who set up the infrastructure.
This is modelled on the deployment setups of the PANGEO project, the mybinder.org project, UC Berkeley's instructional hubs, and many other projects that use hubploy.
We try to use the same image for the private and public instances, and this image is present in deployments/tess-private/images/default. repo2docker is used to build the actual user image, so you can use any of the supported config files to customize the image as you wish. Currently, the environment.yml file does most of the work.
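As an illustration, a minimal repo2docker-style environment.yml might look something like this; the packages listed below are hypothetical examples, not the actual contents of the file::

    # Conda environment spec read by repo2docker when building the user image
    channels:
      - conda-forge
    dependencies:
      - python=3.7
      - numpy
      - astropy
      - lightkurve   # example TESS time-series package; the real pins live in the actual file
      - pip:
          - nbgitpuller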
All the JupyterHubs are based on Zero to JupyterHub (z2jh). z2jh uses configuration files in YAML format to specify exactly how the hub is configured. For convenience, and to make sure we do not repeat ourselves, this config is split into multiple files that form a hierarchy.
- hub/values.yaml contains config common to all the hubs in this repository.
- deployments/<deployment>/config/common.yaml is the primary config for the hub referred to by <deployment>. The values here override hub/values.yaml.
- deployments/<deployment>/config/staging.yaml and deployments/<deployment>/config/prod.yaml have config that is specific to the staging or production versions of the deployment. These should be as minimal as possible, since we try to keep staging & production as close to each other as possible.
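As a sketch of how the layering works, a default set in hub/values.yaml can be overridden by a deployment's common.yaml. The key below is a standard Zero to JupyterHub option, but the values are made up for illustration::

    # hub/values.yaml - shared default for all hubs (illustrative value)
    singleuser:
      memory:
        guarantee: 1G

    # deployments/tess-private/config/common.yaml - overrides the shared default
    singleuser:
      memory:
        guarantee: 4G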
Further, we use git-crypt to store encrypted secrets in this repository (although we would like to move to sops in the future). Encrypted config (primarily auth tokens and other secret tokens) is stored in deployments/<deployment>/secrets/staging.yaml and deployments/<deployment>/secrets/prod.yaml. There is no common.yaml, since staging & production should not share any secret values.
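For illustration, a secrets file typically carries z2jh values such as the proxy secret token and OAuth credentials. Everything below is a placeholder, and the exact keys depend on the chart version in use::

    # deployments/tess-private/secrets/prod.yaml (placeholder values only)
    proxy:
      secretToken: "<64-character-random-hex-token>"
    auth:
      github:
        clientId: "<github-oauth-app-client-id>"
        clientSecret: "<github-oauth-app-client-secret>"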
We use hubploy to deploy our hubs in a repeatable fashion. hubploy.yaml contains the information hubploy needs to work, such as cluster name, region, and provider. Various secret keys used to authenticate to cloud providers are kept under secrets/ for that deployment and referenced from hubploy.yaml.
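Roughly speaking, hubploy.yaml ties a deployment to a cluster and an image registry. The sketch below only illustrates the kind of information it carries - the field names are assumptions for illustration, so check hubploy's documentation for the actual schema::

    # deployments/tess-private/hubploy.yaml (field names are illustrative, not hubploy's exact schema)
    cluster:
      provider: aws
      aws:
        cluster: tess-prototype   # EKS cluster name
        region: us-east-1
    images:
      image_name: <ecr-registry>/tess-user-image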
We need the following AWS resources set up for the hubs to run properly:
- A Kubernetes cluster via Amazon EKS, with multiple node groups for 'core' and 'user' nodes.
- Home directory storage in Amazon EFS
- Per-cluster tools, such as cluster autoscaler and EFS Provisioner.
- Appropriate IAM User Credentials.
Instead of creating and maintaining these resources manually, we use the popular terraform tool to do so for us. There is an attempt to build a community-wide terraform template that can be used by different domains that need a JupyterHub + Dask analytics cluster at https://github.com/pangeo-data/terraform-deploy. We refer to it via a git submodule in this repo under cloud-infrastructure, with parameters set in infrastructure.tfvars.
This is very much a work in progress, but the hope is that we will eventually have security-, performance- and cost-optimized clusters that can be set up from this template.
- Identify the files to be modified to effect the change you seek.

  - All files related to the user image are in deployments/tess-private/images/default - all deployments share this image. repo2docker is used to build the image, so you can use any of the supported config files to customize the image as you wish. Currently, the environment.yml file has all the packages, while JupyterLab plugins are installed via postBuild.
  - Most JupyterHub related config files are in hub/values.yaml, with per-deployment overrides in deployments/<deployment>/config/. See the section on config files earlier in this document.

- Make a PR with your changes to this repo. This will trigger a GitHub Action on the PR (see the workflow sketch after this list). Note that at this point, it only tests the image to make sure it builds properly. No tests are performed on the configuration. Wait for this test to pass. If it fails, fix it until it passes.
- Merge the PR to the staging branch. This kicks off another GitHub action to deploy the changes to the staging hubs of both deployments. You can follow this in the Actions tab in GitHub.
- Once complete, test out the change you made in staging. Both the staging hubs
use the same image, so you can use either to test image changes. Test config
changes on the appropriate staging hub.
- Staging for Tess Public is https://staging.tess.omgwtf.in/
- Staging for Tess Private is https://staging.private.tess.omgwtf.in/
- If something isn't working like you think it should, repeat the process of making PRs and merging them to staging until it is.
- When you are satisfied with staging, it is time to deploy to production! Make a PR merging the current staging branch to prod - always use this handy link. You shouldn't merge your PR into prod directly - you should only merge staging to prod. This keeps our git histories clean, and helps make reverts easy as well.
- Merging this PR will kick off a GitHub action that'll deploy the change to production. If you already have a running server, you have to restart it to pick up new image changes (File -> Hub Control Panel).
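The continuous deployment described above is driven by GitHub Actions workflows in this repository. The fragment below is only a sketch of the deploy-on-push-to-staging step; the file name, job layout, and exact hubploy invocation are assumptions rather than a copy of the real workflow::

    # .github/workflows/deploy.yaml (illustrative sketch, not the actual workflow)
    name: Deploy to staging
    on:
      push:
        branches: [staging]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Install hubploy
            run: pip install hubploy
          - name: Deploy staging hubs
            # exact hubploy arguments are an assumption; see hubploy's docs
            run: |
              hubploy deploy tess-public hub staging
              hubploy deploy tess-private hub staging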
We shall try to use secure defaults wherever possible, while making sure we do not affect usability too much.
- Use efs-provisioner for setting up NFS home directories. This way, each user's pod only gets to mount their particular home directory, instead of mounting the entire NFS share.
- Use a securityContext to run user pods as a non-root user, and disable any setuid binaries (like sudo) with no-new-privs (see the sketch after this list).
- Disable user access to instance metadata endpoint, which often contains sensitive credentials.
- Set up a PodSecurityPolicy to control what kind of pods dask-kubernetes can create. This is currently a biggish security hole, since the ability to create arbitrary pods can be easily escalated to root. Should be fixed shortly.
- Enable NetworkPolicy to set up internal firewalls, so we only permit whitelisted internal traffic. We could also possibly restrict outbound traffic to only ports 80 and 443.
- Put our worker nodes in a private subnet - currently they are all in a public subnet since EKS managed node groups do not support private subnets. This needs to be fixed by Amazon, or we can use non-managed nodegroups.
- Switch to using dask-gateway instead of dask-kubernetes. This gives us much better multi-tenancy and security isolation. It is currently undergoing a biggish architecture change, and we can switch once that lands.
- Give each user their own uid / gids, to strengthen security boundaries. The best way to do this is to use EFS Access Points. This needs upstream work in the AWS CSI Driver. Switching to the AWS CSI Driver will also give us encryption in transit for home directories. We need the CSI driver to add dynamic provisioning support first, though.
- tess-public and tess-private need to be on completely isolated resources - VPC, Cluster, etc. We can do this when we really open it up to the public.
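As a concrete sketch of the non-root point above, the relevant part of a user pod's container spec ends up looking roughly like this (the UID/GID values are illustrative)::

    # Container-level securityContext applied to user pods (illustrative values)
    securityContext:
      runAsUser: 1000                    # run as a non-root user
      runAsGroup: 1000
      allowPrivilegeEscalation: false    # no-new-privs: neutralizes setuid binaries like sudo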
Ideally, we would be able to put resources into some of these upstream fixes - they are fairly well specified and isolated.