Hi all,
I'm still offline for a bit, but wanted to dump some thoughts on our current setup, as of 2020-11-11. This is primarily focused on the GCP deployment (https://us-central1-b.gcp.pangeo.io/, and http://staging.us-central1-b.gcp.pangeo.io). It's also mainly focused on how things are (especially how they differ from a "stock" JupyterHub / daskhub deployment) rather than how they should be.
The Hub is deployed through CI in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/.circleci/config.yml.
The chart in `pangeo-deploy` is a small wrapper around daskhub, which wraps up Dask Gateway and JupyterHub. We've customized a few things beyond the standard deployment.
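Concretely, "a small wrapper around daskhub" means a Helm chart that declares daskhub as a dependency and layers our values on top. A minimal sketch (not the actual chart definition; the version pins are placeholders):

```yaml
# Sketch of a Chart.yaml for a thin wrapper chart around daskhub (which bundles
# JupyterHub and Dask Gateway). Version numbers are placeholders, not our real pins.
apiVersion: v2
name: pangeo-deploy
version: "0.1.0"
dependencies:
  - name: daskhub
    version: "x.y.z"                    # placeholder
    repository: "https://helm.dask.org"
```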
Kubernetes Cluster
In theory the `cluster` target in https://github.com/pangeo-data/pangeo-cloud-federation/blob/d7deb23224604150b5380946367d0d95d42e45cb/deployments/gcp-uscentral1b/Makefile controls the creation of the Kubernetes cluster (it may be out of date). The most notable things are:

- A small (autoscaling) core pool for the hub, dask-gateway, and various other service pods (more later).
- Auto-provisioning, auto-scaling node pools for the rest. This uses GCP's node-pool auto-provisioning feature, where node pools are automatically created based on the Kubernetes taints / tolerations (e.g. it'll create a preemptible node pool for Dask workers, since we mark them as preemptible; see the sketch below).
- A Kubernetes Service Account and Google Service Account, `pangeo`, used for various things (e.g. the scratch bucket, more later).

Otherwise, things probably follow zero-to-jupyterhub pretty closely.
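For illustration, here is a minimal sketch of the kind of worker-pod scheduling constraints that drive the auto-provisioning (the chart keys and the taint/label names are assumptions, not our exact values):

```yaml
# Hypothetical sketch: a node selector and toleration on Dask worker pods that
# GKE node auto-provisioning can satisfy by creating a preemptible node pool.
dask-gateway:
  gateway:
    backend:
      worker:
        extraPodConfig:
          nodeSelector:
            cloud.google.com/gke-preemptible: "true"
          tolerations:
            - key: "cloud.google.com/gke-preemptible"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
```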
Authentication
Like the AWS deployment, we use auth0 to authenticate users with the hubs after they fill out the sign-up form; see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.github/workflows/UpdateMembers.yml and https://github.com/pangeo-data/pangeo-cloud-federation/actions?query=workflow%3AUpdateMembers.
Images
The GCP deployment simply uses the Docker images from https://github.com/pangeo-data/pangeo-docker-images with no modifications.
We use dependabot to automatically update our pinned version as tags are pushed in `pangeo-docker-images`.
Testing
We have rudimentary integration tests as part of our CI/CD. #753 provides an overview.
The summary is that pushes to `staging` will:

- log in as a test user (`pangeo-bot`; we manually created a token for it and stored it as a secret in CI)
- copy a `test.py` file to the single-user pod and `kubectl exec` it (sketched below)
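Mechanically, the test step boils down to something like this (a rough sketch, not the actual `.circleci/config.yml`; the paths, namespace, and pod name are placeholders resolved by the real job):

```yaml
# Rough sketch of the CI step: copy test.py into the running single-user pod
# and execute it there. $NAMESPACE / $TEST_POD and the local path are placeholders.
- run:
    name: Run integration test against the hub
    command: |
      kubectl cp tests/test.py "$NAMESPACE"/"$TEST_POD":/tmp/test.py
      kubectl exec -n "$NAMESPACE" "$TEST_POD" -- python /tmp/test.py
```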
This should be expanded in a few directions:

- Better test coverage: right now we just ensure that we can create a Dask cluster.
- Rollback deployments where the tests fail?
- Run the tests against `prod` too (would have caught failing to launch dask worker pods on AWS, #870, a prod-specific issue).

Scratch Bucket
Many workloads benefit from having some scratch space to write intermediate results. https://rechunker.readthedocs.io/en/latest/ is
a prime example. We don't want users writing large intermediates to their home directory. This is slow and expensive. So we've
provided them with the cloud-native alternative: a read / write bucket on GCS, `pangeo-scratch`.
This bucket is created with the `scratch` target in the `Makefile`, which uses `lifecycle.json` to specify that objects are automatically deleted after 7 days.
On GCP, we use Workload Identity for the Kubernetes pods. If the Kubernetes Service Account is associated with a Google Service Account, the pod is able to do the things that the GSA can do. #610 (comment) has the (hopefully up-to-date) commands used to associate the KSA with the GSA.
See #610 for background.
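For reference, the Kubernetes half of that association is just an annotation on the KSA, roughly like this (a sketch; the namespace and project ID are placeholders, and the `gcloud` side of the binding, granting `roles/iam.workloadIdentityUser` on the GSA, isn't shown):

```yaml
# Sketch: the KSA is annotated with the GSA it impersonates under Workload Identity.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pangeo
  namespace: prod                       # placeholder namespace
  annotations:
    iam.gke.io/gcp-service-account: "pangeo@<project-id>.iam.gserviceaccount.com"
```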
One notable downside is that this bucket is globally read/writable by everyone in the pangeo cluster. We set the `PANGEO_SCRATCH` environment variable in the `pangeo-docker-images` to be equal to `gs://pangeo-scratch/{JUPYTERHUB_USERNAME}`; see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/gcp-uscentral1b/config/common.yaml#L23.
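As a sketch of the config side (not the actual common.yaml; the variable name, the nesting, and where the username gets substituted are assumptions), an env var like this can be injected through zero-to-jupyterhub's `singleuser.extraEnv`:

```yaml
# Sketch: pass the scratch-bucket location into user pods via singleuser.extraEnv.
# The per-user gs://pangeo-scratch/{JUPYTERHUB_USERNAME} path can then be assembled
# from this plus the JupyterHub username, e.g. in the image's startup scripts.
jupyterhub:                 # top-level key depends on how the wrapper chart nests things
  singleuser:
    extraEnv:
      PANGEO_SCRATCH_PREFIX: "gs://pangeo-scratch"   # variable name is an assumption
```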
Prometheus / Grafana
We have some monitoring of the clusters at http://grafana.us-central1-b.gcp.pangeo.io/grafana/.
We use prometheus to collect metrics from running pods / nodes, and grafana to visualize the metrics.
Finally, we provide an ingress to access the metrics over the internet. The metrics are public to read.
These are deployed separately from `prod` and `staging`, not as part of CI/CD, into the `metrics` namespace. The pods are configured to squeeze into the core pool.
We ensure that the dask worker & scheduler pods export metrics, along with the JupyterHub username, in `pangeo-deploy/values.yaml` (lines 99 to 106, at commit d7deb23).
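The gist (a sketch of the kind of pod metadata involved, not the literal lines from `values.yaml`; the port and label key are assumptions) is that the Dask pods carry Prometheus scrape annotations plus a label identifying the owning JupyterHub user:

```yaml
# Sketch: annotations that let Prometheus discover the Dask /metrics endpoint,
# plus a label used to attribute the metrics to a hub user.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8787"              # Dask serves /metrics on its dashboard port
  labels:
    hub.jupyter.org/username: "<username>"  # assumed label key
```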
To configure the ingress, we reserve the static IP of the LoadBalancer in GCP and then point a DNS entry at it (our DNS is through Hurricane Electric).
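One way to wire that up is sketched below (whether the IP is pinned on the Service like this or reserved after the fact in GCP may differ from what we actually do; the names and address are placeholders):

```yaml
# Sketch: point the LoadBalancer Service at a reserved static IP so the DNS
# record at Hurricane Electric stays valid across redeployments.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller    # placeholder name
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10      # placeholder for the reserved GCP static IP
  ports:
    - name: http
      port: 80
      targetPort: 80
```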
MLFlow / Batch Workflows
There's an incomplete effort to add mlflow / general batch workflow support to our hubs. We have a simple helm chart for mlflow in the `mlflow` directory. This has an mlflow deployment / service running MLFlow, which is registered as a JupyterHub service and is accessible at `https://{HUB_URL}/services/mlflow/`.
Additionally, we set the `MLFLOW_TRACKING_URI` environment variable on the singleuser pod so that users can easily log metrics / artifacts. See https://discourse.pangeo.io/t/pangeo-batch-workflows/804/7 and pangeo-data/pangeo#800 for more.
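Wiring-wise that amounts to roughly the following (a sketch, not the actual chart values; the Service name/port and the exact nesting are assumptions). MLflow clients read `MLFLOW_TRACKING_URI` automatically, so `mlflow.log_metric(...)` in a notebook just works:

```yaml
# Sketch: register MLflow as a JupyterHub service (proxied at /services/mlflow/)
# and tell user pods where to find it. Service name/port are assumptions.
jupyterhub:
  hub:
    services:
      mlflow:
        url: "http://mlflow:5000"     # the in-cluster mlflow Service
  singleuser:
    extraEnv:
      MLFLOW_TRACKING_URI: "https://{HUB_URL}/services/mlflow/"   # {HUB_URL} as in the text above
```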
The biggest outstanding issues are probably around how the execution environment for runs is handled (`--no-conda` kind of works, assuming the single-user env has all the needed packages), and lots of polish. It's not clear to me if we should continue to go down the MLFlow path, but it is an option.