Give all notebook pods a temporary bucket #610

Closed · rabernat opened this issue May 15, 2020 · 45 comments

@rabernat (Member)

I've been thinking about what would help things work more smoothly on our cloud hubs in terms of data storage. One clear need is a place to put temporary data. Filesystem-based solutions are not a good fit because they are hard to share with dask workers.

What if we could create a temporary bucket for each user pod, which is automatically deleted at the end of each session? This would be awesome. We could propagate write credentials to the bucket to the dask workers, so that people could dump as much temporary data there as they want. But by deleting at the end of each session, we avoid blowing up our storage costs.

It seems like this sort of thing should be possible with Kubernetes, but I'm not sure how to do it.

@scottyhq (Member)

This is a great idea. It's been discussed a few times in separate issues. If anyone is willing to spend some time on it, I think what is described here is a viable way forward: #485 (comment)

This would be cloud-provider specific though.

@robfatland (Member)

robfatland commented May 15, 2020 via email

@rabernat (Member Author)

Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome.

AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.

If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.

This makes me miss POSIX permissions.

@scottyhq (Member)

AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level.

True, but you can effectively accomplish this with a bucket policy or user policy granting access to different "object prefixes", which act like folders. Described here for AWS; I imagine there is a similar setup for GCP: https://aws.amazon.com/blogs/security/writing-iam-policies-grant-access-to-user-specific-folders-in-an-amazon-s3-bucket/

The difficulty is mapping hub users to cloud account credentials. That would be accomplished via Auth0. It seems tricky but doable to modify KubeSpawner to link up the jupyter username to a Cloud access token, so using myself as an example I'd get access to s3://pangeo-scratch/scottyhq/ but not s3://pangeo-scratch/robfatland/ while logged in. Ultimately I think this approach will be critical in order to better track usage and costs per user.
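For reference, the prefix-scoped policy pattern from that AWS post looks roughly like the sketch below, written here as a Python dict. The helper name scratch_policy, the bucket name, and the exact set of statements are illustrative assumptions, not the policy we'd actually deploy:

def scratch_policy(username, bucket="pangeo-scratch"):
    """Sketch of a per-user, prefix-scoped S3 policy document (assumptions noted above)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # let the user list only their own prefix
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{username}/*"]}},
            },
            {   # read/write/delete only under that prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{username}/*",
            },
        ],
    }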

If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data.

We can do this very easily. I've tried it on the aws hub s3://pangeo-scratch with an expiration policy of 1 day. Seems to be working.

@scottyhq (Member)

Question for @yuvipanda or @consideRatio. I was trying to follow the approach here berkeley-dsep-infra/datahub#713 to give each user a bucket in GCP. If I understand correctly, the only way to give Kubespawner commands access to additional command line tools or python libraries (such as awscli or gcloud) is to modify the standard Hub image? (https://github.com/berkeley-dsep-infra/datahub/pull/713/files#diff-e71ae0db512a9f529e23dd65da53a262)

Is there any way around that?

@rabernat (Member Author)

I imagine there is a similar setup for GCP

Yes, I think this is called Access Control Lists.

It sounds like we should be able to use ACLs and lifecycle management to provide per-user object storage!

We can't easily provide a size-based quota, which would be ideal. Instead, we can set a time limit on temporary objects. I feel like 7 or 14 days would be reasonable. Inevitably users would lose data unexpectedly until they got used to working this way.
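For the time limit, a lifecycle delete rule on the bucket should do it. A minimal sketch with the google-cloud-storage client (the bucket name and the 7-day window are assumptions):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-scratch")  # assumed bucket name
bucket.add_lifecycle_delete_rule(age=7)       # delete objects older than 7 days
bucket.patch()                                # push the updated lifecycle config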

@rabernat (Member Author)

I'd love to brainstorm a way forward on this at tomorrow's dev meeting. Some questions that come to mind:

  • One scratch bucket per user vs. one global scratch bucket? I'm currently leaning towards one scratch bucket per user. GCP has no limit on the number of buckets you can create. Monitoring and rules are easier to implement at the bucket level.
  • What are the blockers to implementation? @yuvipanda's work referenced above gives us a pretty clear path to implementation.
  • What about binder? Should Pangeo binders get scratch space? I think so. But how do we make it secure?

@scottyhq (Member)

One scratch bucket per user vs. one global scratch bucket? I'm currently leaning towards one scratch bucket per user. GCP has no limit on the number of buckets you can create. Monitoring and rules are easier to implement at the bucket level.

Sounds like the details will differ across cloud-providers, which I think is ok. For AWS the general recommendation seems to be one bucket with differing object permissions per user.

Another question to table:

  • How many users will there be for hubs and binderhubs going forward?

For BinderHubs, keeping them open to any GitHub user is key for outreach and workshops. But I'm concerned about creating many 'things' (buckets, policies, etc.) for an ever-increasing number of users. It might be okay if those things are tied to a session and are deleted automatically... For Hubs, there is at least a cap of several hundred users based on GitHub organization membership.

Another option worth considering is BYOB, where we document for users how to create a bucket on their own account and connect access from the hub or binderhub. (more complicated for users obviously, but more sustainable for those of us administering resources with limited time and credits to go around).

@rabernat (Member Author)

As discussed on today's call, I'm going to try the approach of a global scratch bucket with no formal user-specific credentials, instead using an environment variable to point each user to the appropriate path.

Can someone tell me which service account corresponds to the user notebook pods? I can't figure it out.
https://console.cloud.google.com/iam-admin/serviceaccounts?project=pangeo-181919

Also, dask workers will need access. I assume they are associated with "Dask Worker Service Account" ([email protected])? Is that correct?
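For the environment-variable piece, something along these lines in jupyterhub_config.py (or hub.extraConfig) might work. This is just a sketch; the variable name, bucket name, and hook body are assumptions:

def set_scratch_env(spawner):
    # point each user at a per-user prefix in the shared scratch bucket (hypothetical layout)
    spawner.environment["PANGEO_SCRATCH"] = f"gs://pangeo-scratch/{spawner.user.name}/"

c.Spawner.pre_spawn_hook = set_scratch_env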

@scottyhq (Member)

scottyhq commented May 27, 2020

The helm config should point to a pangeo service account (serviceAccountName: pangeo) for both user notebooks and dask workers.

The linking of account credentials is done via the underlying cluster config. For AWS, kubectl get sa pangeo -n icesat2-prod -o yaml shows the name of the linked role:

kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: ROLEID HERE

I'm guessing it's the same for GCP?

@TomAugspurger (Member)

@scottyhq does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?

For the multicloud demo I went through a much more complicated process of mounting a secrets volume and setting a GOOGLE_APPLICATION_CREDENTIALS file: https://github.com/pangeo-data/multicloud-demo/blob/4421333b72831665fc39b2a3b7e8b4f2f2374e9f/config-gcp.yaml#L12-L23. Documented a bit at https://github.com/pangeo-data/multicloud-demo#notes-on-requester-pays-gcp

@martindurant

Just wanted to add a point on this:

deleting at the end of each session

That should totally be doable, but it shouldn't be counted on as reliable, since Python kernels or kube pods can disappear without warning. If you are going for prefixes within a bucket, then you would need to list all the files and send a delete request for each, which is potentially expensive (compared to nixing a whole bucket). You'd certainly want async and batch deleting, or maybe even a dedicated CLI tool if it does a good job. If you rely on life-cycles alone, you may well end up paying more than you hoped. Is it time to convene a group for building async into fsspec and friends?
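For reference, the per-prefix cleanup described above would look roughly like this with gcsfs (a sketch; the prefix is hypothetical, and under the hood this still lists the keys and deletes them one by one, which is the cost being pointed out):

import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")
# recursively remove everything under one user's prefix in the scratch bucket
fs.rm("pangeo-scratch/some-user/", recursive=True)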

Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates but might be right for output data artefacts. Such archiving can also be done on a life-cycle (e.g., 7 days untouched, archive; 30 days untouched, delete).

@TomAugspurger (Member)

TomAugspurger commented May 28, 2020

Agreed that "at the end of each session" would be hard to do, for the reasons you listed.

Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts.

Why do you think this wouldn't work for scratch / intermediates? I think we could have an aggressive lifecycle policy, like deleting objects older than 1 day.

Something like this:

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "age": 1,
          "isLive": true
        }
      }
    ]
  }
}

I don't think this "scratch" bucket is well-suited to solving output data artifacts, so we can safely ignore that use case.

@martindurant

I mean that archival storage is not so useful for intermediates that are likely to be read in again in the near future - better just delete them. I don't know the specifics of each store, but generally access to archived data is not just slower, it comes with quotas and access limits, and frequent access might even end up costing more. There are probably many options for each backend...

@TomAugspurger (Member)

Gotcha, thanks. I think that's a non-issue here as long as we're willing to treat this solely as scratch space. There are other, better options for making data products.

@martindurant

There are other, better options for making data products.

Agreed.
Wouldn't it be nice to automatically create Intake catalogs for anything that is indeed written as a product? Just a thought.

@scottyhq (Member)

does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)?

The additional step is assigning whatever 'policies' you want to that role (such as S3 read/write), and there are some additional things like OIDC configuration that happen at the cluster level. Full documentation is here - https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

Command-line cluster management tools like eksctl take care of most of the details for AWS, or @salvis2 has this terraform config: https://github.com/ICESAT-2HackWeek/terraform-deploy/blob/master/aws/s3-data-bucket.tf

@TomAugspurger (Member)

Right, I suppose to rephrase my question: is there something special in EKS that looks for eks.amazonaws.com/role-arn and grants the role, or is that a standard Kubernetes thing where I'd just swap in the GCP names for the AWS ones?

@scottyhq (Member)

@TomAugspurger - as far as I can tell GKE has the equivalent approach to EKS documented here ("Workload identity"): https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity

@TomAugspurger (Member)

TomAugspurger commented May 29, 2020 via email

@TomAugspurger (Member)

Trying this out now for an hour.

@rabernat (Member Author)

rabernat commented Jun 1, 2020

I will likely have some time to work on this on Tuesday afternoon. Thanks @TomAugspurger for your work! Let me know where I can pick things up.

@TomAugspurger (Member)

Short status update:

  • Created a bucket pangeo-dev-staging (I think one bucket per namespace)
  • Created a google service account gcs-scratch-sa (this can be global). Granted it read/write permissions (oh, but a big TODO: this needs to be just for the single bucket...)
  • Enabled "Workload Identity" on the dev-pangeo-io cluster
  • Am migrating the existing node pools to use Workload Identity (https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#migrate_workloads_to). This is taking a while.

Remaining work

  1. Finish migrating node-pools
  2. Authorize the KSA:

     K8S_NAMESPACE=dev-staging
     KSA_NAME=pangeo
     GSA_NAME=gcs-scratch-sa
     GSA_PROJECT=pangeo-181919

     gcloud iam service-accounts add-iam-policy-binding \
       --role roles/iam.workloadIdentityUser \
       --member "serviceAccount:cluster_project.svc.id.goog[${K8S_NAMESPACE}/${KSA_NAME}]" \
       ${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com

  3. Annotate the KSA:

     kubectl annotate serviceaccount \
       --namespace ${K8S_NAMESPACE} \
       ${KSA_NAME} \
       iam.gke.io/gcp-service-account=${GSA_NAME}@${GSA_PROJECT}.iam.gserviceaccount.com

  4. Ensure all the pods get that annotation.

Hmm one of the nodes just failed to migrate... Will dig into that a bit more but then I do have to shelve this for a bit :)
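Once a pod lands on a Workload Identity-enabled pool with the annotated KSA, a quick sanity check from inside the pod might look like this (a sketch; it just asks Application Default Credentials who they resolve to):

import google.auth
from google.auth.transport.requests import Request

# inside a user/worker pod: ADC should resolve via the GKE metadata server
credentials, project = google.auth.default()
credentials.refresh(Request())
print(project)
print(getattr(credentials, "service_account_email", "unknown"))  # expect the bound GSA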

@TomAugspurger (Member)

The migration might still be happening in the background. Here's the command that timed out

$ gcloud container node-pools update core-pool          --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA

In the GCP UI things are still updating though. If that's correct, then we'd just need to migrate the remaining pools:

gcloud container node-pools update dask-pool          --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool       --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-pool-small --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update jupyter-gpu-pool   --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA
gcloud container node-pools update scheduler-pool     --cluster=dev-pangeo-io-cluster --workload-metadata=GKE_METADATA

So I'll let that one process a bit longer before kicking off the rest.

@martindurant

( --async should let your process run in the background without the CLI waiting/timing out)

@martindurant

It seems async is only implemented on some gcloud commands, and maybe not this one, although there's beta and alpha... Never mind.

@TomAugspurger (Member)

This doesn't look great

[screenshot: Screen Shot 2020-06-01 at 10 58 37 AM]

Let me know if you experience any issues with the GCP hubs today. In theory, everything I've done so far is reversible.

@scottyhq (Member)

scottyhq commented Jun 1, 2020

@TomAugspurger - remind me where the GCP nodegroup config is listed these days? Are you running k8s 1.15 or 1.16 currently on GKE? I'm not sure what the issue is, but for AWS nodegroup upgrades we typically create a separate nodegroup (e.g. dask-pool-v2), then delete the old one.

@TomAugspurger (Member)

TomAugspurger commented Jun 3, 2020

Currently on k8s 1.15.9-gke.24. Cycling the node pools makes sense. In-place migration seems like more hassle than it's worth.

Sounds like a pretty opportune time to add a terraform config to this repo :) Then we just adjust a few variable names and let it run.

@TomAugspurger (Member)

I have a PoC working on a separate kubernetes cluster.

The bucket pangeo-scratch is not publicly accessible. But it is read / write accessible from within the cluster.

>>> import gcsfs

>>> def check():
...     # read back every object under pangeo-scratch/bar/ using the pod's cloud credentials
...     fs = gcsfs.GCSFileSystem(token="cloud")
...     return {file: fs.open(file).read() for file in fs.ls("pangeo-scratch/bar/")}

>>> def put(dask_worker):
...     # each dask worker writes a small object named after its port
...     fs = gcsfs.GCSFileSystem(token="cloud")
...     name = dask_worker.address.split(":")[-1]
...     with fs.open("pangeo-scratch/bar/{}.txt".format(name), "wb") as f:
...         f.write(b"hi")

>>> client.run(put)
>>> client.run(check)
{'tls://10.52.4.2:42275': {'pangeo-scratch/bar/35899.txt': b'hi',
  'pangeo-scratch/bar/42275.txt': b'hi'},
 'tls://10.52.4.3:35899': {'pangeo-scratch/bar/35899.txt': b'hi',
  'pangeo-scratch/bar/42275.txt': b'hi'}}

The service account can only write to pangeo-scratch. It can't write to other buckets in the project, like pangeo-billing.

OSError: Forbidden: https://www.googleapis.com/upload/storage/v1/b/pangeo-billing/o
[email protected] does not have storage.objects.create access to pangeo-billing/bar/42275.txt.

I couldn't get auto nodepools working. Sorry Joe :)

I think the next step is to roll this out to the binders / hubs by creating new nodepools with this attribute set, and then deleting the old ones. I'll start with dev-staging tomorrow and see how it goes.

@TomAugspurger (Member)

#613 has (I think) all the necessary changes to the helm config. Just using the pangeo Kubernetes service account in more places. I think we want the user, scheduler, and worker pods to all be able to read / write to the bucket.

I've also created some node pools in the GCP cluster with workload identity enabled. We'll just need to remove the old node pools (I think we can do that whenever; no harm in doing it early).

@TomAugspurger (Member)

I think this is working, if people want to try things out. I've enabled it for dev-staging, ocean-staging, dev-prod, ocean-prod. I'll be pushing up docs on the configuration later today or tomorrow. For now I've set the lifecycle policy on pangeo-scratch to be 1 day. Objects older than 1 day are deleted.

We'll also want to provide some docs to users about how to actually use this, but the short version is that

fs = gcsfs.GCSFileSystem(token="cloud")

should let you read / write to the pangeo-scratch bucket.

@scottyhq (Member)

Seems to be working as well on AWS, but not thoroughly tested. NOTE: on aws-uswest2.pangeo.io, objects in s3://pangeo-scratch are wiped 24 hours after they are uploaded.

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()
fs.ls('pangeo-scratch')
lpath = 'ATL06_20190928165055_00270510_003_01.h5'
rpath = 'pangeo-scratch/scottyhq/ATL06_20190928165055_00270510_003_01.h5'
fs.upload(lpath, rpath)
s3obj = fs.open(rpath)
ds = xr.open_dataset(s3obj, engine='h5netcdf')

@TomAugspurger (Member)

Seems to be working with rechunker:

import zarr
import dask
import dask.array as da
import numpy as np
from matplotlib import pyplot as plt
import gcsfs

# `gateway` here is a dask-gateway cluster object created earlier in the session
client = gateway.get_client()
fs = gcsfs.GCSFileSystem(token="cloud")

base_dir = "gcs://pangeo-scratch/taugspurger/rechunker/test_data"

store_source = fs.get_mapper(f'{base_dir}/source.zarr')

shape = (80000, 8000)
source_chunks = (200, 8000)
dtype = 'f4'

fs.rm(f'{base_dir}/source.zarr', recursive=True)
fs.rm(f'{base_dir}/target.zarr', recursive=True)
fs.rm(f'{base_dir}/temp.zarr', recursive=True)


a_source = zarr.ones(shape, chunks=source_chunks,
                     dtype=dtype, store=store_source)

target_store = fs.get_mapper(f'{base_dir}/target.zarr')
temp_store = fs.get_mapper(f'{base_dir}/temp.zarr')
max_mem = 25600000
target_chunks = (8000, 200)

from distributed import performance_report
from rechunker import api

res = api.rechunk_zarr2zarr_w_dask(a_source, target_chunks, max_mem,
                             target_store, temp_store=temp_store)

with performance_report():
    out = res.compute()

https://gistcdn.rawgit.org/TomAugspurger/9150ff7db8e89ba7ed7ce3b0965694e5/230d3db4f16356eb25a94cf9a70f9d2233c27595/dask-report(1).html

@rabernat (Member Author)

Should we add something to the new chart to populate an environment variable with gs://pangeo-scratch/<user_id>/?

@TomAugspurger (Member)

I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.

@rabernat (Member Author)

I don't know if gcs supports prefix-level object lifecycles, so I worry that the <user_id>/ prefix would just be deleted.

Does this matter? These are not actual directories, just keys. You can write to gs://pangeo-scratch/rabernat/deep/nested/path as long as the bucket exists.
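For example, this sketch writes a deeply "nested" key directly; nothing needs to exist beforehand because the path is just a key prefix (the path itself is only an example):

import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")
# the "directories" below are only key prefixes; no mkdir step is needed
with fs.open("pangeo-scratch/rabernat/deep/nested/path/example.txt", "wb") as f:
    f.write(b"written without creating any directories")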

@TomAugspurger (Member)

Sorry, I misread your comment. I thought you were suggesting pre-populating the bucket with the key pangeo-scratch/<user_id>, rather than adding an environment variable. Yes, an environment variable would help with avoiding conflicts.

@TomAugspurger (Member)

Setting $PANGEO_HOME is a bit harder than I expected. We can't just set pangeo.jupyter.singleuser.extraenv.PANGEO_HOME='gcs://pangeo-scratch/$JUPYTERHUB_USER/', since we need the evaluated value of $JUPYTERHUB_USER (jupyterhub/zero-to-jupyterhub-k8s#1255). I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.

@rabernat (Member Author)

I'm not sure if modifying start in https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/start will do the trick or not.

This sounds like the way to go. We could have a short bash script which tries to figure out what cloud we are on and sets PANGEO_SCRATCH appropriately.

Note that I prefer `PANGEO_SCRATCH` rather than `PANGEO_HOME`. We should remind users at every step of the way that the storage is temporary.

@TomAugspurger (Member)

So I think the steps are

  1. PR to pangeo-cloud-federation that defines an environment variable SCRATCH_PREFIX (gs://, s3://, etc.)
  2. PR to pangeo-docker-images in start that checks $SCRATCH_PREFIX and sets PANGEO_SCRATCH to ${SCRATCH_PREFIX}pangeo-scratch/$JUPYTERHUB_USER/ (usage sketch below).
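Once that's in place, user code could pick the variable up with something like this (a sketch; it assumes PANGEO_SCRATCH is set with a trailing slash as in step 2):

import os
import fsspec

scratch = os.environ["PANGEO_SCRATCH"]            # e.g. gs://pangeo-scratch/<username>/
mapper = fsspec.get_mapper(scratch + "tmp.zarr")  # works the same for gs:// or s3:// paths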

@jhamman (Member)

jhamman commented Jun 29, 2020

I'm not sure we're going to be able to expand the JUPYTERHUB_USER environment variable as you are hoping, but the place to try this is either in the start script or in the single-user section of the helm chart: https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#set-environment-variables

@TomAugspurger (Member)

TomAugspurger commented Jun 29, 2020 via email

TomAugspurger added a commit to TomAugspurger/pangeo-stacks-dev that referenced this issue Jul 1, 2020
xref pangeo-data/pangeo-cloud-federation#610,
pangeo-data/pangeo#780. This adds a
PANGEO_SCRATCH environment variable. It relies on the existence of

1. PANGEO_SCRATCH_PROTOCOL
2. JUPYTERHUB_USERNAME

And combines those to form something like
`PANGEO_SCRATCH=gcs://pangeo-scratch/tomaugspurger`
@TomAugspurger (Member)

Pretty sure everything is done here.

yuvipanda added a commit to 2i2c-org/infrastructure that referenced this issue Mar 3, 2021
A temporary bucket that's cleared every 7 days,
and provides full access to all the users on that
hub. See
pangeo-data/pangeo-cloud-federation#610
for reasons why this is very useful.
@yuvipanda (Member)

2i2c-org/infrastructure#283 is the implementation I've ended up with, relying on GKE's cloud connector - there are similar things for AWS & AKS too. I also avoided the need for setting PANGEO_SCRATCH in the docker image with some fuckery here and here. This sets everything up as soon as I create a new hub, without any need for human intervention! YAY!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 3, 2021
Pangeo hubs have a `PANGEO_SCRATCH` env variable that
points to a GCS bucket, used to share data between users.
We implement that here too, but with a more generic `SCRATCH_BUCKET`
env var (`PANGEO_SCRATCH` is also set for backwards compat).
pangeo-data/pangeo-cloud-federation#610
has some more info on the use cases for `PANGEO_SCRATCH`

Right now, we use Google Config Connector
(https://cloud.google.com/config-connector/docs/overview)
to set this up. We create Kubernetes CRDs, and the connector
creates appropriate cloud resources to match them. We use this
to provision a GCP Service account and a Storage bucket for each
hub.

Since these are GCP specific, running them on AWS fails. This
PR puts them behind a switch, so we can work on getting things to
AWS.

Eventually, it should also support AWS resources via the
AWS Service broker (https://aws.amazon.com/partners/servicebroker/)

Ref 2i2c-org#366