Give all notebook pods a temporary bucket #610
This is a great idea. It's been discussed a few times in separate issues. If anyone is willing to spend some time on it, I think what is described here is a viable way forward: #485 (comment). This would be cloud-provider specific, though. |
Perhaps create/delete pod-associated folders within a single persistent temp data bucket? Bucket create/delete might be more cumbersome. Fear zombies.
Rob Fatland
UW Research Computing Director
|
AFAIK, "folders" don't exist in object storage. You either have write privileges for the entire bucket or not. So if we do this, and we care about isolating user data, we have to do it at the bucket level. If we don't care about isolating user data, we could just have a single, global readable / writeable bucket. But we would be open to the possibility that one user could delete everyone else's data. This makes me miss POSIX permissions. |
True, but you can effectively accomplish this by a bucket policy or user policy for access to different "object prefixes" which act like folders. Described here for AWS, I imagine there is a similar setup for GCP: https://aws.amazon.com/blogs/security/writing-iam-policies-grant-access-to-user-specific-folders-in-an-amazon-s3-bucket/ The difficulty is mapping hub users to cloud account credentials. That would be accomplished via Auth0. It seems tricky but doable to modify KubeSpawner to link up the jupyter username to a Cloud access token, so using myself as an example I'd get access to
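For illustration, a rough boto3 sketch of the prefix-scoped policy that the AWS post describes. The bucket name, user name, and policy name here are all placeholders, not anything configured on the hubs:

import json
import boto3

iam = boto3.client("iam")

# placeholder policy: the user may list and read/write only under their own prefix
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::pangeo-scratch"],
            "Condition": {"StringLike": {"s3:prefix": ["some-user/*"]}},
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::pangeo-scratch/some-user/*"],
        },
    ],
}

iam.put_user_policy(
    UserName="some-user",                 # placeholder IAM user
    PolicyName="scratch-prefix-access",   # placeholder policy name
    PolicyDocument=json.dumps(policy),
)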
We can do this very easily. I've tried it on the AWS hub |
Question for @yuvipanda or @consideRatio. I was trying to follow the approach here berkeley-dsep-infra/datahub#713 to give each user a bucket in GCP. If I understand correctly, the only way to give KubeSpawner commands access to additional command line tools or python libraries (such as

Is there any way around that? |
Yes, I think this is called Access Control Lists. It sounds like we should be able to use ACLs and lifecycle management to provide per-user object storage! We can't easily provide a size-based quota, which would be ideal. Instead, we can set a time limit on temporary objects. I feel like 7 or 14 days would be reasonable. Inevitably users would lose data unexpectedly until they got used to working this way. |
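As a concrete sketch of the ACL part with the google-cloud-storage client (the bucket name and email are placeholders, and the hubs may well use IAM rather than ACLs):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-scratch")           # placeholder bucket name

# grant one user write access via a bucket ACL entry (placeholder email)
bucket.acl.user("[email protected]").grant_write()
bucket.acl.save()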
I'd love to brainstorm a way forward on this at tomorrow's dev meeting. Some questions that come to mind:
|
Sounds like the details will differ across cloud-providers, which I think is ok. For AWS the general recommendation seems to be one bucket with differing object permissions per user. Another question to table:
For BinderHubs, keeping open to any GitHub user is key for outreach and workshops. But I'm concerned about creating many 'things' (buckets, policies, etc) for an ever-increasing number of users. It might be okay if those things are tied to a session and are deleted automatically... For Hubs, there is at least a cap of several hundred users based on GitHub organization membership. Another option worth considering is BYOB, where we document for users how to create a bucket on their own account and connect access from the hub or binderhub. (more complicated for users obviously, but more sustainable for those of us administering resources with limited time and credits to go around). |
As discussed on today's call, I'm going to try the approach of a global scratch bucket with no formal user-specific credentials, instead using an environment variable to point each user to the appropriate path. Can someone tell me which service account corresponds to the user notebook pods? I can't figure it out. Also, dask workers will need access. I assume they are associated with "Dask Worker Service Account" ([email protected])? Is that correct? |
the helm config should point to a
The linking of account credentials is done via the underlying cluster config. For AWS,
I'm guessing it's the same for GCP? |
@scottyhq does assigning that annotation magically grant S3 read / write privileges to the pod (assuming they've been granted to that role)? For the multicloud demo I went through a much more complicated process of mounting a secrets volume and setting a GOOGLE_APPLICATION_CREDENTIALS file: https://github.com/pangeo-data/multicloud-demo/blob/4421333b72831665fc39b2a3b7e8b4f2f2374e9f/config-gcp.yaml#L12-L23. Documented a bit at https://github.com/pangeo-data/multicloud-demo#notes-on-requester-pays-gcp |
Just wanted to add a point on this:
That should totally be doable, but should not be counted as reliable, since python kernels or kube pods can disappear without warning. If you are going for prefixes within a bucket, then you would need to list all the files and send delete requests for each, which is potentially expensive to do (compared to nixing a whole bucket). Certainly would like to use async and batch deleting, or maybe even use a dedicated CLI tool if it does a good job. If you rely on life-cycles alone, you may well end up paying more than you hoped. Is it time to convene a group for building async into fsspec and friends? Note that most of the object stores also allow for archiving data to cheaper storage, which is not appropriate for "scratch"/intermediates, but might be right for output data artefacts. Such archiving can also be done on a life-cycle (e.g., 7 days untouched, archive; 30 days untouched, delete). |
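To make the listing-then-deleting point concrete, a minimal gcsfs sketch (the prefix is a placeholder): every object under the prefix has to be enumerated and then removed individually.

import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")

# "deleting a folder" really means listing every key under the prefix...
paths = fs.find("pangeo-scratch/some-user/")
# ...and issuing a delete request for each one
fs.rm(paths)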
Agreed that "at the end of each session" would be hard to do, for the reasons you listed.
Why do you think this wouldn't work for scratch / intermediate? I think we could have an aggressive lifecycle policy, like objects older than 1 day are deleted. Something like this:
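For example, roughly (a sketch with the google-cloud-storage client; the bucket name is assumed and the real policy was configured on the cloud side):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-scratch")   # assumed bucket name
bucket.add_lifecycle_delete_rule(age=1)        # delete objects more than 1 day old
bucket.patch()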
I don't think this "scratch" bucket is well-suited to solving output data artifacts, so we can safely ignore that use case. |
I mean that archival storage is not so useful for intermediates that are likely to be read in again in the near future - better just delete them. I don't know the specifics of each store, but generally access to archived data is not just slower, it comes with quotas and access limits, and frequent access might even end up costing more. There are probably many options for each backend... |
Gotcha, thanks. I think that's a non-issue here as long as we're willing to treat this solely as scratch space. There are other, better options for making data products. |
Agreed. |
The additional step is assigning whatever 'policies' you want to that role (such as S3 read/write), and there are some additional things like OIDC configuration that happen at the cluster level. Full documentation is here - https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

Command line cluster management tools like eksctl take care of most of the details for AWS, or |
Right, I suppose to rephrase my question: is there something special in eks that looks for |
@TomAugspurger - as far as I can tell GKE has the equivalent approach to EKS documented here ("Workload identity"): https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity |
Cool, thanks for digging that up!
@rabernat are you likely to attempt to implement this sometime soon? I'm busy most of next week, but may be able to squeeze in an hour or two on Monday to try things out.
|
Trying this out now for an hour. |
I will likely have some time to work on this on Tuesday afternoon. Thanks @TomAugspurger for your work! Let me know where I can pick things up. |
Short status update:
Remaining work
Hmm one of the nodes just failed to migrate... Will dig into that a bit more but then I do have to shelve this for a bit :) |
The migration might still be happening in the background. Here's the command that timed out
In the GCP UI things are still updating though. If that's correct, then we'd just need to migrate the remaining pools
So I'll let that one process a bit longer before kicking off the rest. |
( |
It seems async is only implemented on some gcloud commands, and maybe not this one, although there's beta and alpha... Never mind. |
@TomAugspurger - remind me where the GCP nodegroup config is listed these days? Are you running k8s 1.15 or 1.16 currently on GKE? I'm not sure what the issue is, but for AWS nodegroup upgrades we typically create a separate nodegroup (e.g. |
Currently on k8s 1.15.9-gke.24. Cycling the node pools makes sense. In-place migration seems like more hassle than it's worth. Sounds like a pretty opportune time to add a terraform config to this repo :) Then we just adjust a few variable names and let it run. |
I have a PoC working on a separate kubernetes cluster. The bucket

>>> def check():
...     fs = gcsfs.GCSFileSystem(token="cloud")
...     return {file: fs.open(file).read() for file in fs.ls("pangeo-scratch/bar/")}
>>> def put(dask_worker):
...     fs = gcsfs.GCSFileSystem(token="cloud")
...     name = dask_worker.address.split(":")[-1]
...     with fs.open("pangeo-scratch/bar/{}.txt".format(name), "wb") as f:
...         f.write(b"hi")
>>> client.run(put)
>>> client.run(check)
{'tls://10.52.4.2:42275': {'pangeo-scratch/bar/35899.txt': b'hi',
  'pangeo-scratch/bar/42275.txt': b'hi'},
 'tls://10.52.4.3:35899': {'pangeo-scratch/bar/35899.txt': b'hi',
  'pangeo-scratch/bar/42275.txt': b'hi'}}

The service account can only write to

OSError: Forbidden: https://www.googleapis.com/upload/storage/v1/b/pangeo-billing/o
[email protected] does not have storage.objects.create access to pangeo-billing/bar/42275.txt.

I couldn't get auto nodepools working. Sorry Joe :) I think the next step is to roll this out to the binders / hubs by creating new nodepools with this attribute set, and then deleting the old ones. I'll start with |
#613 has (I think) all the necessary changes to the helm config. Just using the pangeo Kubernetes service account in more places. I think we want the user, scheduler, and worker pods to all be able to read / write to the bucket. I've also created some node pools in the GCP cluster with workload identity enabled. We'll just need to remove the old node pools (I think we can do that whenever. No harm in doing it early I think). |
I think this is working, if people want to try things out. I've enabled it for dev-staging, ocean-staging, dev-prod, ocean-prod. I'll be pushing up docs on the configuration later today or tomorrow. For now I've set the lifecycle policy on We'll also want to provide some docs to users about how to actually use this, but the short version is that
should let you read / write to the |
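A minimal sketch of the kind of usage being described here, assuming gcsfs and the per-user prefix convention (the path below is made up):

import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")   # credentials come from the pod's service account

path = "pangeo-scratch/<your-username>/demo.txt"   # placeholder; the prefix is a convention, not enforced
with fs.open(path, "wb") as f:
    f.write(b"hello scratch")

print(fs.ls("pangeo-scratch/<your-username>/"))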
Seems to be working as well on AWS, but not thoroughly tested. NOTE on aws-uswest2.pangeo.io objects in

import s3fs
fs = s3fs.S3FileSystem()
fs.ls('pangeo-scratch')
lpath = 'ATL06_20190928165055_00270510_003_01.h5'
rpath = 'pangeo-scratch/scottyhq/ATL06_20190928165055_00270510_003_01.h5'
fs.upload(lpath, rpath)
s3obj = fs.open(rpath)
ds = xr.open_dataset(s3obj, engine='h5netcdf') |
Seems to be working with rechunker:

import zarr
import dask
import dask.array as da
import numpy as np
from matplotlib import pyplot as plt
import gcsfs
from dask_gateway import GatewayCluster  # added: not in the original snippet, which assumed an existing `gateway` cluster object

gateway = GatewayCluster()  # assumed setup for the `gateway` object used below
client = gateway.get_client()
fs = gcsfs.GCSFileSystem(token="cloud")
base_dir = "gcs://pangeo-scratch/taugspurger/rechunker/test_data"
store_source = fs.get_mapper(f'{base_dir}/source.zarr')
shape = (80000, 8000)
source_chunks = (200, 8000)
dtype = 'f4'
fs.rm(f'{base_dir}/source.zarr', recursive=True)
fs.rm(f'{base_dir}/target.zarr', recursive=True)
fs.rm(f'{base_dir}/temp.zarr', recursive=True)
a_source = zarr.ones(shape, chunks=source_chunks,
dtype=dtype, store=store_source)
target_store = fs.get_mapper(f'{base_dir}/target.zarr')
temp_store = fs.get_mapper(f'{base_dir}/temp.zarr')
max_mem = 25600000
target_chunks = (8000, 200)
from distributed import performance_report
from rechunker import api
res = api.rechunk_zarr2zarr_w_dask(a_source, target_chunks, max_mem,
target_store, temp_store=temp_store)
with performance_report():
out = res.compute() |
Should we add something to the new chart to populate an environment variable with |
I don't know if gcs supports prefix-level object lifecycles, so I worry that the |
Does this matter? These are not actual directories, just keys. You can write to |
Sorry, I misread your comment. I thought you were suggesting pre-populating the bucket with the key |
Setting |
This sounds like the way to go. We could have a short bash script which tries to figure out what cloud we are on and sets

Note that I prefer
So I think the steps are
|
I'm not sure we're going to be able to expand the JUPYTERHUB_USER environment variable as you are hoping, but the place to try this is either in the start script or in the single-user section of the helm chart: https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#set-environment-variables |
I think that it has to be the start script. The helm chart is too early on in the process.
|
xref pangeo-data/pangeo-cloud-federation#610, pangeo-data/pangeo#780. This adds a PANGEO_SCRATCH environment variable. It relies on the existence of 1. PANGEO_SCRATCH_PROTOCOL 2. JUPYTERHUB_USERNAME And combines those to form something like `PANGEO_SCRATCH=gcs://pangeo-scratch/tomaugspurger`
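For illustration only, a rough Python equivalent of the combination that PR describes (the actual implementation lives in the image start script; JUPYTERHUB_USER is the variable discussed above, though the PR text calls it JUPYTERHUB_USERNAME):

import os

protocol = os.environ.get("PANGEO_SCRATCH_PROTOCOL", "gcs")   # e.g. "gcs" on GCP, "s3" on AWS
user = os.environ.get("JUPYTERHUB_USER")                      # set by JupyterHub for each user
if user:
    # e.g. PANGEO_SCRATCH=gcs://pangeo-scratch/tomaugspurger
    os.environ["PANGEO_SCRATCH"] = f"{protocol}://pangeo-scratch/{user}"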
Pretty sure everything is done here. |
A temporary bucket that's cleared every 7 days, and provides full access to all the users on that hub. See pangeo-data/pangeo-cloud-federation#610 for reasons why this is very useful.
2i2c-org/infrastructure#283 is the implementation I've ended up with, relying on GKE's cloud connector - there are similar things for AWS & AKS too. I also avoided the need for setting PANGEO_SCRATCH in the docker image with some fuckery here and here. This sets everything up as soon as I create a new hub, without any need for human intervention! YAY! |
Pangeo hubs have a `PANGEO_SCRATCH` env variable that points to a GCS bucket, used to share data between users. We implement that here too, but with a more generic `SCRATCH_BUCKET` env var (`PANGEO_SCRATCH` is also set for backwards compat). pangeo-data/pangeo-cloud-federation#610 has some more info on the use cases for `PANGEO_SCRATCH`. Right now, we use Google Config Connector (https://cloud.google.com/config-connector/docs/overview) to set this up. We create Kubernetes CRDs, and the connector creates appropriate cloud resources to match them. We use this to provision a GCP Service account and a Storage bucket for each hub. Since these are GCP specific, running them on AWS fails. This PR puts them behind a switch, so we can work on getting things to AWS. Eventually, it should also support AWS resources via the AWS Service broker (https://aws.amazon.com/partners/servicebroker/). Ref 2i2c-org#366
I've been thinking about what would help things work more smoothly on our cloud hubs in terms of data storage. One clear need is a place to put temporary data. Filesystem-based solutions are not a good fit because they are hard to share with dask workers.
What if we could create a temporary bucket for each user pod, which is automatically deleted at the end of each session? This would be awesome. We could propagate write credentials to the bucket to the dask workers, so that people could dump as much temporary data there as they want. But by deleting at the end of each session, we avoid blowing up our storage costs.
It seems like this sort of thing should be possible with kubernetes, but I'm not sure how to do it.