Hi all,
I'm still offline for a bit, but wanted to dump some thoughts on our current setup, as of 2020-11-11. This is primarily focused on the GCP deployment (https://us-central1-b.gcp.pangeo.io/, and http://staging.us-central1-b.gcp.pangeo.io). It's also mainly focused on how things are (especially how they differ from a "stock" JupyterHub / daskhub deployment) rather than how they should be.
The Hub is deployed through CI in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/.circleci/config.yml.
The chart in `pangeo-deploy` is a small wrapper around daskhub, which wraps up Dask Gateway and JupyterHub. We've customized a few things beyond the standard deployment.
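Concretely, "a small wrapper around daskhub" means a Helm chart that declares daskhub as a dependency and layers our values on top. A minimal sketch (not the actual chart definition; the version pins are placeholders):

```yaml
# Sketch of a Chart.yaml for a thin wrapper chart around daskhub (which bundles
# JupyterHub and Dask Gateway). Version numbers are placeholders, not our real pins.
apiVersion: v2
name: pangeo-deploy
version: "0.1.0"
dependencies:
  - name: daskhub
    version: "x.y.z"                    # placeholder
    repository: "https://helm.dask.org"
```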
Kubernetes Cluster
In theory the `cluster` target in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/deployments/gcp-uscentral1b/Makefile controls the creation of the Kubernetes cluster (it may be out of date). The most notable things are:

- A small (autoscaling) core pool for the hub, dask-gateway, and various other service pods (more later).
- Auto-provisioning, auto-scaling node pools for the rest. This uses GCP's node-pool auto-provisioning feature, where node pools are automatically created based on the Kubernetes taints / tolerations (e.g. it'll create a preemptible node pool for Dask workers, since we mark them as preemptible; see the sketch below).
- A Kubernetes Service Account and Google Service Account, `pangeo`, used for various things (e.g. the scratch bucket, more later).

Otherwise, things probably follow zero-to-jupyterhub pretty closely.
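For illustration, here is a minimal sketch of the kind of worker-pod scheduling constraints that drive the auto-provisioning (the chart keys and the taint/label names are assumptions, not our exact values):

```yaml
# Hypothetical sketch: a node selector and toleration on Dask worker pods that
# GKE node auto-provisioning can satisfy by creating a preemptible node pool.
dask-gateway:
  gateway:
    backend:
      worker:
        extraPodConfig:
          nodeSelector:
            cloud.google.com/gke-preemptible: "true"
          tolerations:
            - key: "cloud.google.com/gke-preemptible"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
```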
Authentication
Like the AWS deployment, we use auth0 to authenticate users with the hubs after they fill out the sign-up form; see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.github/workflows/UpdateMembers.yml and https://github.com/pangeo-data/pangeo-cloud-federation/actions?query=workflow%3AUpdateMembers.
Images
The GCP deployment simply uses the Docker images from https://github.com/pangeo-data/pangeo-docker-images with no modifications.
We use dependabot to automatically update our pinned version as tags are pushed in `pangeo-docker-images`.
Testing
We have rudimentary integration tests as part of our CI/CD. #753 provides an overview.
The summary is that pushes to `staging` will:

- log in as a test user (`pangeo-bot`; we manually created a token for it and stored it as a secret in CI)
- copy a `test.py` file to the single-user pod and `kubectl exec` it (sketched below)
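Mechanically, the test step boils down to something like this (a rough sketch, not the actual `.circleci/config.yml`; the paths, namespace, and pod name are placeholders resolved by the real job):

```yaml
# Rough sketch of the CI step: copy test.py into the running single-user pod
# and execute it there. $NAMESPACE / $TEST_POD and the local path are placeholders.
- run:
    name: Run integration test against the hub
    command: |
      kubectl cp tests/test.py "$NAMESPACE"/"$TEST_POD":/tmp/test.py
      kubectl exec -n "$NAMESPACE" "$TEST_POD" -- python /tmp/test.py
```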
This should be expanded in a few directions:

- Better test coverage: right now we just ensure that we can create a Dask cluster.
- Rollback deployments where the tests fail?
- Run the tests against `prod` too (would have caught failing to launch dask worker pods on AWS, #870, a prod-specific issue).

Scratch Bucket
Many workloads benefit from having some scratch space to write intermediate results. https://rechunker.readthedocs.io/en/latest/ is
a prime example. We don't want users writing large intermediates to their home directory. This is slow and expensive. So we've
provided them with the cloud-native alternative: a read / write bucket on GCS, `pangeo-scratch`.
This bucket is created with the `scratch` target in the `Makefile`, which uses `lifecycle.json` to specify that objects are automatically deleted after 7 days.
On GCP, we use Workload Identity for the Kubernetes pods. If the Kubernetes Service Account is associated with a Google Service Account, the pod is able to do the things that the GSA can do. #610 (comment) has the (hopefully up-to-date) commands used to associate the KSA with the GSA.
See #610 for background.
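For reference, the Kubernetes half of that association is just an annotation on the KSA, roughly like this (a sketch; the namespace and project ID are placeholders, and the `gcloud` side of the binding, granting `roles/iam.workloadIdentityUser` on the GSA, isn't shown):

```yaml
# Sketch: the KSA is annotated with the GSA it impersonates under Workload Identity.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pangeo
  namespace: prod                       # placeholder namespace
  annotations:
    iam.gke.io/gcp-service-account: "pangeo@<project-id>.iam.gserviceaccount.com"
```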
One notable downside is that this bucket is globally read/writable by everyone in the pangeo cluster. We set the `PANGEO_SCRATCH` environment variable in the `pangeo-docker-images` to be equal to `gs://pangeo-scratch/{JUPYTERHUB_USERNAME}`; see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/common.yaml#L23.
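As a sketch of the config side (not the actual common.yaml; the variable name, the nesting, and where the username gets substituted are assumptions), an env var like this can be injected through zero-to-jupyterhub's `singleuser.extraEnv`:

```yaml
# Sketch: pass the scratch-bucket location into user pods via singleuser.extraEnv.
# The per-user gs://pangeo-scratch/{JUPYTERHUB_USERNAME} path can then be assembled
# from this plus the JupyterHub username, e.g. in the image's startup scripts.
jupyterhub:                 # top-level key depends on how the wrapper chart nests things
  singleuser:
    extraEnv:
      PANGEO_SCRATCH_PREFIX: "gs://pangeo-scratch"   # variable name is an assumption
```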
Prometheus / Grafana
We have some monitoring of the clusters at http://grafana.us-central1-b.gcp.pangeo.io/grafana/.
We use prometheus to collect metrics from running pods / nodes, and grafana to visualize the metrics.
Finally, we provide an ingress to access the metrics over the internet. The metrics are public to read.
These are deployed separately from `prod` and `staging`, not as part of CI/CD, into the `metrics` namespace. The pods are configured to squeeze into the core pool.
We ensure that the dask worker & scheduler pods export metrics, along with the JupyterHub username, in `pangeo-deploy/values.yaml` (lines 99 to 106, at commit d7deb23).
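The gist (a sketch of the kind of pod metadata involved, not the literal lines from `values.yaml`; the port and label key are assumptions) is that the Dask pods carry Prometheus scrape annotations plus a label identifying the owning JupyterHub user:

```yaml
# Sketch: annotations that let Prometheus discover the Dask /metrics endpoint,
# plus a label used to attribute the metrics to a hub user.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8787"              # Dask serves /metrics on its dashboard port
  labels:
    hub.jupyter.org/username: "<username>"  # assumed label key
```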
To configure the ingress, we reserve the static IP of the LoadBalancer in GCP and then point a DNS entry at it (our DNS is through Hurricane Electric).
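One way to wire that up is sketched below (whether the IP is pinned on the Service like this or reserved after the fact in GCP may differ from what we actually do; the names and address are placeholders):

```yaml
# Sketch: point the LoadBalancer Service at a reserved static IP so the DNS
# record at Hurricane Electric stays valid across redeployments.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller    # placeholder name
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10      # placeholder for the reserved GCP static IP
  ports:
    - name: http
      port: 80
      targetPort: 80
```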
MLFlow / Batch Workflows
There's an incomplete effort to add mlflow / general batch workflow support to our hubs. We have a simple helm chart for mlflow in the `mlflow` directory. This has an mlflow deployment / service running MLFlow, which is registered as a JupyterHub service and is accessible at `https://{HUB_URL}/services/mlflow/`.
Additionally, we set the `MLFLOW_TRACKING_URI` environment variable on the singleuser pod so that users can easily log metrics / artifacts. See https://discourse.pangeo.io/t/pangeo-batch-workflows/804/7 and pangeo-data/pangeo#800 for more.
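Wiring-wise that amounts to roughly the following (a sketch, not the actual chart values; the Service name/port and the exact nesting are assumptions). MLflow clients read `MLFLOW_TRACKING_URI` automatically, so `mlflow.log_metric(...)` in a notebook just works:

```yaml
# Sketch: register MLflow as a JupyterHub service (proxied at /services/mlflow/)
# and tell user pods where to find it. Service name/port are assumptions.
jupyterhub:
  hub:
    services:
      mlflow:
        url: "http://mlflow:5000"     # the in-cluster mlflow Service
  singleuser:
    extraEnv:
      MLFLOW_TRACKING_URI: "https://{HUB_URL}/services/mlflow/"   # {HUB_URL} as in the text above
```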
The biggest outstanding issues are probably around how the execution environment for runs is handled (`--no-conda` kind of works, assuming the single-user env has all the needed packages), and lots of polish. It's not clear to me if we should continue to go down the MLFlow path, but it is an option.