Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dask-gateway compatibility #87

Merged
merged 30 commits into from
Dec 29, 2020
Merged

Add dask-gateway compatibility #87

merged 30 commits into from
Dec 29, 2020

Conversation

bolliger32
Copy link
Collaborator

@bolliger32 bolliger32 commented Sep 9, 2020

Reworked some of the kubernetes module to allow for the use of dask-gateway when it is available (currently only set up to run with a coastal image on adrastea). There will be some related PR's to the helm-chart and docker_images repos.

Tried to keep the user-facing api as similar as possible. The get_*_cluster commands should be the exact same for almost all use cases.

Testing is somewhat limited b/c not all of the options you can pass to a cluster object with a kubernetes backend are available for a local backend (which is what is used on travis for the tests). But I think it's a start.

I think there was a black update that changed the autoformatting behavior a bit. I ran black on everything and it changed a few other files as well (sorry about that). I had thought everything had had black already run on it, but I think the version change may have introduced a change in behavior.

EDIT 12/10/20: Added "tag" argument to supersede #80

@brews
Copy link
Member

brews commented Sep 11, 2020

This is some great work, @bolliger32! And it warms my heart to see PRs with unit tests in them.

You mention "The get_*_cluster commands should be the exact same for almost all use cases." Are there any gotchas our use cases that we might need to watch for with this PR?

One potential change on quick look, you have aTODO: More tests in HISTORY. I don't know if that's cruft or intentional, or maybe this is still a work in progress and I need to back off.

Regardless, thanks for putting this together!

@bolliger32
Copy link
Collaborator Author

bolliger32 commented Sep 14, 2020

Haha thanks!

As for the get_*_cluster commands, I think the main difference I see is that not all of the kwargs are available with the current set-up of dask-gateway. They all could be - the main difference is that the cluster config interface is customized within the helm chart. So basically, we can expose any options we want at deployment time. In theory, if we hadn't already gotten people using the rhgk.get_cluster toolkit, I'd say we should just use dask-gateway directly b/c you can basically do the same thing by properly configuring which cluster options you expose to users.

  • Right now I'd say the biggest thing you can't do is specify an arbitrary scaling_factor or cpu/memory sizes. I chose to just allow 4 choices of worker size (what we previously called micro, standard, big, giant). But we could allow this if we wanted to. It yells at you if you are on a notebook image that has dask-gateway installed but try to use one of these kwargs
  • You also no longer have a worker_template.yml file, so the kwarg that points you to a different template is no longer an option.
  • I also removed the dask_config_dict kwarg, since this just set the dask config using this dict. I don't think its used often and instead you can just call dask.config.update(some_config_dict) before calling one of the get_cluster commands. Would not be hard to include this in the gateway get_cluster function as this is actually upstream of instantiating either a KubeCluster or GatewayCluster.

One other concern with dask-gateway of note - you do still end up passing credentials over the cluster which I would imagine has some potential security issues? I don't really know of a better way to do this... and we were already doing this via the GCSFUSE_TOKENS env var. One concern with our current dask-gateway deployment is that if someone displays the HTML repr of a cluster options object in the notebook, it has a text field that contains their credentials. I don't think this information ever makes it into the json of the notebook so wouldn't get saved in a repo accidentally, but it does seem a little strange. I don't see a great way to pass credentials that would avoid this - and users would never interact with one of those options objects unless they are bypassing the rhg_compute_tools framework to use dask_gateway directly. So I don't think it's a big issue. But just something I wanted to flag in case you thought it was a bigger issue.


As for the TODO...I was originally hoping to spin up an actual kubernetes cluster inside the CI (using k3s) and then deploy dask gateway on the kuberentes cluster (rather than using a local backend for dask-gateway). This would allow for some more thorough testing that options were getting passed correctly (some options don't actually have an effect in the local cluster (like changing the assigned worker memory size for some reason, which confuses me...). But that seems like a fair amount of work and potentially something to do for a future PR. In fact, I'll file an issue for that now :) (#88 )

@bolliger32
Copy link
Collaborator Author

@brews and @delgadom now that we're using this branch of rhg_compute_tools on all servers, I think we should merge it in. I know @brews gave it a look a couple months ago and I don't think there are any outstanding issues, but I wanted to check in before I hit merge.

@bolliger32
Copy link
Collaborator Author

And then I think it's worth a v0.3.0 tag when this is merged in.

@bolliger32
Copy link
Collaborator Author

bolliger32 commented Dec 24, 2020

@brews would love your post-holiday take on the GH actions I've set up to migrate from travis, as I'm still figuring these out. I've got two workflows. One for every push and pull request, which builds and runs tests. And then one for every release, which just builds and pushes to pypi.

I'm definitely learning both GH actions and python package publishing as I'm going, so let me know if things don't look right. I think I've tested out the release workflow though and it seems to successfully push to pypi. Right now theres a v1.0.0a2 release up there which was uploaded when I created a release.

I did not include any testing in the release workflow b/c I assume that would only be implemented following a merge into master, so all of the tests would already have been run. But we could change this if you think better to include tests there too.

Just two files to look at really - the two under .github/workflows

@bolliger32 bolliger32 merged commit a0d9071 into master Dec 29, 2020
@bolliger32
Copy link
Collaborator Author

Merged in order to be able to Mark current images as stable. The GH actions still could use a look over. We can update those via a follow up PR if needed

@bolliger32 bolliger32 deleted the gateway branch December 29, 2020 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getting clusters is confusing af
2 participants