Rearchitecture #186

Closed
jcrist opened this issue Dec 12, 2019 · 14 comments · Fixed by #195

Comments

@jcrist
Member

jcrist commented Dec 12, 2019

After a few months of working on this, a few past design decisions are starting to cause issues:

  • High availability story? #148: we're currently limited to a single server model, with the same restrictions as JupyterHub (this is because much of the design is based on JupyterHub).
  • Even though we can support a database for persistence, setting up additional infrastructure to do so places an additional burden on administrators. People are unlikely to do so unless necessary.

At the same time, the external-facing API seems fairly solid and useful. I've been trying to find a way to rearchitect to allow (but not require) running multiple instances of the gateway server, without a mandatory database to synchronize them, and I think I've reached a decent design.

Design Requirements

  • We must be able to support all major backends using the same codebase (as much as possible)
    • Kubernetes
    • Hadoop
    • HPC JobQueues
    • Local processes
    • Extension points for custom backends
  • We must keep install instructions simple. A basic local install must not require installing anything that can't be installed from PyPI/conda-forge.
  • Performance of a single-server deployment must not be overly degraded by these changes. We've currently optimized for the single-server model, so I expect to lose some performance here, but it would be good to keep this use case in mind.
  • Where possible, additional infrastructure (e.g. a database) should not be required. If we can rely on systems already present in each backend, we should.
  • When properly configured, each backend should have a way for the gateway to run without a single point of failure (at least none in the components we write). I don't see HA increasing scalability for most deployments, but I do see it improving robustness.

Below I present a design outline of how I hope to achieve these goals for the various backends.

Authentication

As with JupyterHub, we have an Authenticator class that configures authentication for the gateway server. If a request comes in without a cookie/token, the authenticator is used to authenticate the user, then a cookie is added to the response so the authenticator won't be called again. The decrypted cookie is a uuid that maps back to a user row in our backing database.

I propose keeping the authenticator class much the same, but using a JWT instead of a UUID cookie to store the user information. If a request contains a JWT, it's validated, and (if valid) information like the user name and groups can be extracted from the token. This removes the requirement for a shared database mapping cookies to user names - subsequent requests will already contain this information.
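
As a rough illustration, issuing and validating such a token might look like the following sketch using PyJWT (the claim names, lifetime, and signing-key handling here are illustrative assumptions, not a final design):

import jwt  # PyJWT
from datetime import datetime, timedelta, timezone

SECRET = "shared-signing-key"  # hypothetical key shared by all gateway instances

def issue_token(user_name, groups, lifetime=900):
    """Encode the authenticated user's identity in a signed, expiring JWT."""
    now = datetime.now(timezone.utc)
    claims = {
        "sub": user_name,
        "groups": groups,
        "iat": now,
        "exp": now + timedelta(seconds=lifetime),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def validate_token(token):
    """Return (user_name, groups) if the token is valid, else None."""
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return None
    return claims["sub"], claims.get("groups", [])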

Cluster IDs

Currently we mint a new uuid for every created cluster. This is nice as all backends have the same cluster naming scheme, but means we need to rely on a database to map uuids to cluster backend information (e.g. pod name, etc...).

To remove the need for a database, I propose to encode backend information in the cluster id. This means that each backend will have a different looking id, but means we can parse the cluster id to reference it in the backing resource manager instead of using a database to map between our id and the backend's.

For the following backends, things might look like:

  • Kubernetes: Cluster CRD object name
  • Hadoop: application id
  • HPC job queue/local processes/etc...: requires a database, probably same id scheme as now.
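
As an illustration (the prefixes and helpers below are hypothetical, not an actual scheme), encoding and parsing backend information in the cluster id could be as simple as:

def make_cluster_id(backend_prefix, backend_ref):
    """Embed the backend's native reference (e.g. a CRD object name or a
    YARN application id) directly in the public cluster id."""
    return f"{backend_prefix}-{backend_ref}"

def parse_cluster_id(cluster_id):
    """Recover the backend reference without a database lookup."""
    backend_prefix, _, backend_ref = cluster_id.partition("-")
    return backend_prefix, backend_ref

# e.g. "yarn-application_1576012345678_0042" -> ("yarn", "application_1576012345678_0042")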

Cluster Managers

To support multiple dask-gateway servers, I found it helpful to split our backends into two categories:

Useful database in the resource manager

  • Kubernetes
  • YARN

No useful database in the resource manager

  • Local processes
  • HPC Job queues

The former category could support all our needed functionality without any need for synchronization outside of requests to the resource manager. The latter would require additional infrastructure on our end if we wanted to run multiple instances of the gateway server.

Walking through my proposed ideal implementations for each backend:

Kubernetes

The proposed design for running dask-gateway on kubernetes in an ideal (IMO) deployment is as follows:

  • A dask-gateway deployment, containing one or more instances of our gateway server.
  • Traefik proxy deployment, containing one or more pods running traefik, configured with an IngressRoute provider. Traefik 2.0 added support for proxying TCP connections dispatched on SNIs, which means it can support all the things we needed for our own custom proxy implementation. The old proxy could still be used, but doesn't (currently) support multiple instances. Traefik handles all this for us in a scalable manner.
  • A CRD for a dask cluster, and a backing controller (probably written with metacontroller). Making dask clusters a resource allows us to better use kubernetes to store all our application information, removing the need for an external database. It also removes the need to synchronize our application state and the resource manager's state - they're identical in this case. Operations like scaling a cluster are just a patch on the CRD. The metacontroller web hooks could run in the dask-gateway-server pods, or could be their own pods - right now I'm leaning towards the former for simplicity.

Listing clusters, adding new clusters, removing old clusters, etc... can all be done with single kubernetes api requests, removing a lot of need for synchronization on our end. It also means admins can use kubectl without worrying about messing with a deployment's internal state.
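
For example, scaling would reduce to a single patch on the cluster's CRD object, with the controller reconciling the worker pods. A sketch using the official kubernetes Python client (the group, version, and plural names are assumptions, not an actual CRD definition):

from kubernetes import client, config

def scale_cluster(namespace, cluster_name, n_workers):
    """Scale a cluster by patching its (hypothetical) DaskCluster object;
    the backing controller reconciles the actual worker pods."""
    config.load_incluster_config()  # running inside the gateway pod
    crds = client.CustomObjectsApi()
    crds.patch_namespaced_custom_object(
        group="gateway.dask.org",   # assumed CRD group
        version="v1alpha1",         # assumed CRD version
        namespace=namespace,
        plural="daskclusters",      # assumed plural name
        name=cluster_name,
        body={"spec": {"replicas": n_workers}},
    )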

Hadoop

The Hadoop resource manager could also be used to track all the application state we care about, but I'm not sure if querying it will be performant enough for us. We likely want to support an optional (or maybe mandatory) database here. Some benchmarking before implementation will need to be done.

In this case, an installation would contain:

  • One or more dask-gateway server instances, behind a load balancer (load balancer not needed if only running one instance).
  • Our own existing custom proxy, or traefik. If using traefik we could maybe make use of JupyterHub's work to make it configurable via a file provider.
  • Management of individual clusters will be moved to skein's application master, instead of the gateway server. This will work similarly to the kubernetes version above - scaling a cluster will forward the request to the application master to handle, and querying for the current scale will query the application master (see the sketch after this list). This will likely increase latency (all requests will have to ping an external service), but removes the bottleneck of the gateway server handling everything.
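
A rough sketch of forwarding a scale request to the application master via skein (the service name "dask.worker" and the exact skein calls used here are assumptions):

import skein

def scale_yarn_cluster(app_id, n_workers):
    """Ask the cluster's skein application master to scale the worker service,
    rather than managing containers from the gateway server itself."""
    client = skein.Client()  # connects to (or starts) a local skein driver
    try:
        app = client.connect(app_id)      # ApplicationClient for the running app
        app.scale("dask.worker", n_workers)  # assumed worker service name
    finally:
        client.close()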

HPC job queue systems

HPC job queue systems will require an external DB to synchronize multiple dask-gateway servers when running in HA mode. With some small tweaks, this will mostly look like the existing implementation. I plan to rely on Postgres for this, making use of its SKIP LOCKED feature to implement a work queue for synchronizing spawners, and asyncpg for its fast postgres integration.
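
The SKIP LOCKED pattern would look roughly like this sketch with asyncpg (the tasks table and its columns are assumptions, not the actual schema):

import asyncpg

# pool = await asyncpg.create_pool("postgresql://...")

CLAIM_NEXT_TASK = """
    UPDATE tasks
    SET state = 'running'
    WHERE id = (
        SELECT id FROM tasks
        WHERE state = 'pending'
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload
"""

async def claim_next_task(pool):
    """Atomically claim one pending task; concurrent gateway instances skip
    rows that another instance has already locked instead of blocking."""
    async with pool.acquire() as conn:
        async with conn.transaction():
            return await conn.fetchrow(CLAIM_NEXT_TASK)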

Cluster backend base class

Currently we define a base ClusterManager class that each backend has to implement. Each dask cluster gets its own ClusterManager instance, which manages starting/stopping/scaling the cluster.

With the plan described above we'll need to move the backend-specific abstractions up higher in the stack. I propose the following initial definition (will likely change as I start working on things):

class ClusterBackend:
    async def list_clusters(self, user=None):
        """List all running or pending clusters.

        Parameters
        ----------
        user : str, optional
            If provided, filter on this user name.
        """
        pass

    async def get_cluster(self, user, cluster_id):
        """Get information about a cluster.

        Same output as `list_clusters`, but for one cluster"""
        pass

    async def start_cluster(self, user, **cluster_options):
        """Start a new cluster for this user"""
        pass

    async def stop_cluster(self, user, cluster_id):
        """Stop a cluster.
        
        No-op if already stopped/doesn't exist.
        """
        pass
        
    async def scale_cluster(self, user, cluster_id, n):
        """Scale a cluster to `n` workers."""
        pass

Things like kubernetes and maybe hadoop would implement the ClusterBackend class. We'd also provide an implementation that uses a database to manage the multi-user/cluster state and abstracts over a single-cluster class (probably looking the same as our existing ClusterManager base class).

from traitlets import Type  # dask-gateway configuration is traitlets-based

class DatabaseClusterBackendBase(ClusterBackend):
    """A backend class that uses a database to synchronize between users/clusters.

    HPC job queue backends would use this, as well as other backends lacking a
    sufficiently queryable resource manager."""
    cluster_manager_class = Type(
        klass=ClusterManager,  # the existing per-cluster manager base class
        help="The cluster manager to use to manage individual clusters",
    )
@jcrist
Member Author

jcrist commented Dec 12, 2019

I plan to work on an ideal kubernetes implementation first, following the abstraction laid out above, then add hadoop support, then job queue support, redefining abstractions as needed (working from most -> least common deployment backend). Design changes will almost certainly occur as I work through issues, but I think we can make the kubernetes deployment look more "kubernetes native" without sacrificing our other backends or splitting the codebase.

@jcrist
Member Author

jcrist commented Dec 12, 2019

cc @yuvipanda for thoughts on metacontroller/CRDs (apologies for the poorly spec'd design notes above, I'm still working out my thoughts).

@yuvipanda
Contributor

This looks amazing, and is something I wanna see in the JupyterHub world too!

I'm curious about the choice of JWT. They are awesome with short expiry, but present serious issues if they have longer expiry times. Unlike cookies with backing sessions, there is no way to revoke these if compromised without revoking all tokens. So no 'logout' is possible. http://cryto.net/~joepie91/blog/2016/06/13/stop-using-jwt-for-sessions/ has some more information on the problems with replacing cookies with JWT. But I don't actually know how clients pass this info to dask-gateway...

At a cursory glance the rest seem great!

@jcrist
Member Author

jcrist commented Dec 16, 2019

Glad to hear it!

Unlike cookies with backing sessions, there is no way to revoke these if compromised without revoking all tokens.

I went with JWTs instead of cookies, because I wanted something we could trust that wouldn't require us to keep a user database in our application. Since the username is in the JWT, we only need to validate the JWT instead of doing a DB lookup.

The issues you bring up are valid, but aren't as big of a deal for dask-gateway as they would be for JupyterHub. Authentication for the gateway is designed for no-human-in-the-loop authentication (e.g. no web form, only things like tokens, kerberos, etc...), so forcing a relogin for all users will only consume resources (extra calls to the authentication backend) rather than user time (since login is automated). I think this is ok for applications we care about.

@yuvipanda
Contributor

Thank you for the explanation! The 'no human in the loop' definitely makes things different.

To make sure I understand this correctly, I want to write out the example of how this would work with JupyterHub. To get a JWT you would:

  1. Pass the Authenticator a JupyterHub Token
  2. Authenticator talks to JupyterHub, validates token
  3. Issues JWT, signed with a particular key, with expiration time T seconds from now
  4. JWT is used for further authentication until it expires (or is invalid), at which time GOTO 1

Is this about right?

If you wanna revoke any token, you would need to revoke all tokens. Where would this happen? The way I can think of is to rotate the signing key, but then you would need to rotate the key across everything that is validating JWTs, where it becomes a key distribution problem. It makes deployment a little more complex, at which point it is a tradeoff from having a database. This can also be mitigated by having T be small, to force re-issue of JWTs more frequently.

This makes me wonder - can the Authenticator just hit the external auth source each time it is asked? You can perform transparent caching here appropriate to the backend (different for JupyterHub vs Kerberos, for example) that can be tuned by the admin as they see fit. This takes JWTs (and hence distributed signing keys) out of the equation as well. This doesn't require persistence because the auth info is in memory, and works fine for HA since the caching is per-process, and the 'source of truth' is external to them all.
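
A minimal sketch of that per-process caching idea (the check_external callable is a stand-in for whatever backend lookup a real Authenticator would do):

import time

class CachingAuthenticator:
    """Cache successful auth lookups in memory for a configurable TTL, so each
    gateway process only periodically re-hits the external auth source."""

    def __init__(self, check_external, ttl=300):
        self.check_external = check_external  # e.g. validates a JupyterHub token
        self.ttl = ttl
        self._cache = {}

    async def authenticate(self, token):
        now = time.monotonic()
        hit = self._cache.get(token)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]                        # still fresh, skip the backend call
        user = await self.check_external(token)  # source of truth stays external
        self._cache[token] = (now, user)
        return user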

I might also be talking from a fundamental ignorance of how dask-gateway auth works. If so, let me know and I'll happily trawl through the code instead :)

@jcrist
Member Author

jcrist commented Dec 17, 2019

Is this about right?

Yeah, that looks correct.

This makes me wonder - can the Authenticator just hit the external auth source each time it is asked?

This is an interesting idea. I see the jwt lifetime as set relatively short (15 min by default perhaps, but configurable). Talking through the tradeoffs:

  • JWTs are shareable between instances of the gateway server. If a user has a JWT in the request, it doesn't matter which instance gets it, requests to the authentication backend are cached for the same amount of time globally for all servers. In contrast, if we implement caching per server, then each server has a different cache window, which is harder to reason about. The authentication backend will also be hit on average N times more per user, where N is the number of gateway servers.
  • JWTs can be revoked globally by updating the signing key for all servers. This requires an extra step for admins, but with the helm chart and hubploy shouldn't be too difficult. For caching per server, tokens can be revoked globally for all users by restarting all servers.

I think I like the JWT approach slightly more since the caching in this case is global across all instances of the server, which I find easier to reason about. Thoughts?

@jcrist
Member Author

jcrist commented Jan 23, 2020

I'm currently hacking away on this, progress so far is good. One complication that's come up though is implementing our current user resource limits model (https://gateway.dask.org/resource-limits.html) on kubernetes.

Currently we provide configurable limits on:

  • Clusters per user
  • Cores per user
  • Memory per user

Requests for new clusters/workers that exceed this limit will fail. This is easy to do in a system where all state is kept on a single server (or when using a database), but much harder when trying to keep all state in kubernetes objects. For non-kubernetes backends we could keep or tweak the existing model as needed, but for kubernetes we'll definitely need to change things. A few options I see here:

  1. Drop support for user limits entirely, and rely on kubernetes-native methods of restricting resources. One option here would be creating a namespace per user, each with a ResourceQuota (https://kubernetes.io/docs/concepts/policy/resource-quotas/) limiting objects. Other models could work as well - we could make the user -> namespace mapping configurable, so a namespace could also be shared between users to implement shared resource limits.

  2. The above, but manage these namespaces/resource-quotas in the gateway itself. We'd want to make this configurable, so as not to restrict users into one model.

  3. Drop support for user level limits, instead imposing limits per cluster. It's much easier to make local decisions in a controller (e.g. about a single cluster) rather than global ones about multiple clusters. So we could limit cores/memory/workers per cluster, but not clusters/cores/memory per user. This wouldn't be using ResourceQuota, and wouldn't impose any restrictions on mapping clusters -> namespaces, all the logic for this would be implemented in the controller.

Option 1 (and 2) support user-level (or group-level) resource limits natively, but impose restrictions on namespace usage (ResourceQuota limits are inherently per-namespace). Option 3 doesn't impose these restrictions, but changes limits to only per-cluster rather than per-user.
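
For reference, the per-namespace quota in options 1 and 2 could be managed by the gateway with something like the following sketch using the official kubernetes Python client (the quota name and hard limits are illustrative):

from kubernetes import client

def ensure_user_quota(api, namespace, max_cores, max_memory):
    """Create a ResourceQuota in the user's namespace capping the total
    cores/memory their dask clusters may request."""
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="dask-gateway-user-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": str(max_cores), "requests.memory": max_memory}
        ),
    )
    api.create_namespaced_resource_quota(namespace=namespace, body=quota)

# api = client.CoreV1Api()
# ensure_user_quota(api, "dask-user-alice", max_cores=16, max_memory="64Gi")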

Right now I'm leaning towards Option 2 (or 1), but option 3 also isn't bad.

cc @yuvipanda, @mmccarty @jacobtomlinson for thoughts.

@mmccarty
Member

Great news Jim! We are in favor of option 2. That also gives us the option to fall back to option 1 if needed.
cc @dankerrigan @gforsyth @droctothorpe @zac-hopkinson

@jacobtomlinson
Member

Option 2 also sounds reasonable to me.

@yuvipanda
Contributor

Creating a new namespace per user can get complicated and unwieldy on clusters that aren't solely dedicated to dask-gateway / Jupyter use. I like option 2, assuming you aren't creating namespaces there.

You can also use ResourceQuota without namespaces - see https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-priorityclass.

@jacobtomlinson
Member

@droctothorpe could you expand a little on the drawbacks? I think using traefik to reverse proxy the pod services doesn't have much to do with the cluster's service mesh. In this instance it sounds like traefik is effectively being used at the application level.

@jcrist jcrist mentioned this issue Feb 4, 2020
@droctothorpe
Contributor

Having dug a little deeper into the Traefik documentation, @jacobtomlinson, my concerns were unwarranted. Deleting the comment. Thanks for the appropriate pushback.

@jcrist
Member Author

jcrist commented Feb 11, 2020

The first step of this rearchitecture was done in #195. The backend interface has been defined, and all backends except k8s have been implemented using it. Further discussion of the k8s backend will take place in #198.

@rokroskar

I was looking for a possible way to set up authentication with JWTs and found this issue, but no other mention of JWT-based authentication. Is this work ongoing somewhere?
