Rearchitecture #186

Closed
jcrist opened this issue Dec 12, 2019 · 14 comments · Fixed by #195

Comments

@jcrist
Member

jcrist commented Dec 12, 2019

After a few months of working on this, a few past design decisions are starting to cause issues:

  • High availability story? #148: we're currently limited to a single server model, with the same restrictions as JupyterHub (this is because much of the design is based on JupyterHub).
  • Even though we can support a database for persistence, setting up additional infrastructure to do so places an additional burden on administrators. People are unlikely to do so unless necessary.

At the same time, the external-facing API seems fairly solid and useful. I've been trying to find a way to rearchitect to allow (but not require) running multiple instances of the gateway server, without a mandatory database to synchronize them, and I think I've reached a decent design.

Design Requirements

  • We must be able to support all major backends using the same codebase (as much as possible)
    • Kubernetes
    • Hadoop
    • HPC JobQueues
    • Local processes
    • Extension points for custom backends
  • We must keep install instructions simple. A basic local install must not require installing anything that can't be installed from PyPI/conda-forge.
  • Performance of a single-server deployment must not be overly degraded by these changes. We've currently optimized for the single-server model, so I expect to lose some performance here, but it would be good to keep this use case in mind.
  • Where possible, additional infrastructure (e.g. a database) should not be required. If we can rely on systems already present in each backend, we should.
  • When properly configured, each backend should have a way for the gateway to run without a single point of failure (at least none in the components we write). I don't see HA increasing scalability for most deployments, but I do see it improving robustness.

Below I present a design outline of how I hope to achieve these goals for the various backends.

Authentication

As with JupyterHub, we have an Authenticator class that configures authentication for the gateway server. If a request comes in without a cookie/token, the authenticator is used to authenticate the user, then a cookie is added to the response so the authenticator won't be called again. The decrypted cookie is a uuid that maps back to a user row in our backing database.

I propose keeping the authenticator class much the same, but using a JWT instead of a UUID cookie to store the user information. If a request contains a JWT, it's validated, and (if valid) information like the user name and groups can be extracted from the token. This removes the requirement for a shared database mapping cookies to user names - subsequent requests will already contain this information.
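
As a rough illustration, issuing and validating such a token might look like the following sketch using PyJWT (the claim names, lifetime, and signing-key handling here are illustrative assumptions, not a final design):

import jwt  # PyJWT
from datetime import datetime, timedelta, timezone

SECRET = "shared-signing-key"  # hypothetical key shared by all gateway instances

def issue_token(user_name, groups, lifetime=900):
    """Encode the authenticated user's identity in a signed, expiring JWT."""
    now = datetime.now(timezone.utc)
    claims = {
        "sub": user_name,
        "groups": groups,
        "iat": now,
        "exp": now + timedelta(seconds=lifetime),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def validate_token(token):
    """Return (user_name, groups) if the token is valid, else None."""
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return None
    return claims["sub"], claims.get("groups", [])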

Cluster IDs

Currently we mint a new uuid for every created cluster. This is nice as all backends have the same cluster naming scheme, but means we need to rely on a database to map uuids to cluster backend information (e.g. pod name, etc...).

To remove the need for a database, I propose to encode backend information in the cluster id. This means that each backend will have a different looking id, but means we can parse the cluster id to reference it in the backing resource manager instead of using a database to map between our id and the backend's.

For the following backends, things might look like:

  • Kubernetes: Cluster CRD object name
  • Hadoop: application id
  • HPC job queue/local processes/etc...: requires a database, probably same id scheme as now.
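
As an illustration (the prefixes and helpers below are hypothetical, not an actual scheme), encoding and parsing backend information in the cluster id could be as simple as:

def make_cluster_id(backend_prefix, backend_ref):
    """Embed the backend's native reference (e.g. a CRD object name or a
    YARN application id) directly in the public cluster id."""
    return f"{backend_prefix}-{backend_ref}"

def parse_cluster_id(cluster_id):
    """Recover the backend reference without a database lookup."""
    backend_prefix, _, backend_ref = cluster_id.partition("-")
    return backend_prefix, backend_ref

# e.g. "yarn-application_1576012345678_0042" -> ("yarn", "application_1576012345678_0042")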

Cluster Managers

To support multiple dask-gateway servers, I found it helpful to split our backends into two categories:

Useful database in the resource manager

  • Kubernetes
  • YARN

No useful database in the resource manager

  • Local processes
  • HPC Job queues

The former category could support all our needed functionality without any need for synchronization outside of requests to the resource manager. The latter would require additional infrastructure on our end if we wanted to run multiple instances of the gateway server.

Walking through my proposed ideal implementations for each backend:

Kubernetes

The proposed design for running dask-gateway on kubernetes in an ideal (IMO) deployment is as follows:

  • A dask-gateway deployment, containing one or more instances of our gateway server.
  • Traefik proxy deployment, containing one or more pods running traefik, configured with an IngressRoute provider. Traefik 2.0 added support for proxying TCP connections dispatched on SNIs, which means it can support all the things we needed for our own custom proxy implementation. The old proxy could still be used, but doesn't (currently) support multiple instances. Traefik handles all this for us in a scalable manner.
  • A CRD for a dask cluster, and a backing controller (probably written with metacontroller). Making dask clusters a resource allows us to better use kubernetes to store all our application information, removing the need for an external database. It also removes the need to synchronize our application state and the resource manager's state - they're identical in this case. Operations like scaling a cluster are just a patch on the CRD. The metacontroller web hooks could run in the dask-gateway-server pods, or could be their own pods - right now I'm leaning towards the former for simplicity.

Listing clusters, adding new clusters, removing old clusters, etc... can all be done with single kubernetes api requests, removing a lot of need for synchronization on our end. It also means admins can use kubectl without worrying about messing with a deployment's internal state.
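
For example, scaling would reduce to a single patch on the cluster's CRD object, with the controller reconciling the worker pods. A sketch using the official kubernetes Python client (the group, version, and plural names are assumptions, not an actual CRD definition):

from kubernetes import client, config

def scale_cluster(namespace, cluster_name, n_workers):
    """Scale a cluster by patching its (hypothetical) DaskCluster object;
    the backing controller reconciles the actual worker pods."""
    config.load_incluster_config()  # running inside the gateway pod
    crds = client.CustomObjectsApi()
    crds.patch_namespaced_custom_object(
        group="gateway.dask.org",   # assumed CRD group
        version="v1alpha1",         # assumed CRD version
        namespace=namespace,
        plural="daskclusters",      # assumed plural name
        name=cluster_name,
        body={"spec": {"replicas": n_workers}},
    )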

Hadoop

The Hadoop resource manager could also be used to track all the application state we care about, but I'm not sure if querying it will be performant enough for us. We likely want to support an optional (or maybe mandatory) database here. Some benchmarking before implementation will need to be done.

In this case, an installation would contain:

  • One or more dask-gateway server instances, behind a load balancer (load balancer not needed if only running one instance).
  • Our own existing custom proxy, or traefik. If using traefik we could maybe make use of JupyterHub's work to make it configurable via a file provider.
  • Management of individual clusters will be moved to skein's application master, instead of the gateway server. This will work similarly to the kubernetes version above - scaling a cluster will forward the request to the application master to handle, and querying for the current scale will query the application master (see the sketch after this list). This will likely increase latency (all requests will have to ping an external service), but removes the bottleneck of the gateway server handling everything.
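
A rough sketch of forwarding a scale request to the application master via skein (the service name "dask.worker" and the exact skein calls used here are assumptions):

import skein

def scale_yarn_cluster(app_id, n_workers):
    """Ask the cluster's skein application master to scale the worker service,
    rather than managing containers from the gateway server itself."""
    client = skein.Client()  # connects to (or starts) a local skein driver
    try:
        app = client.connect(app_id)      # ApplicationClient for the running app
        app.scale("dask.worker", n_workers)  # assumed worker service name
    finally:
        client.close()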

HPC job queue systems

HPC job queue systems will require an external DB to synchronize multiple dask-gateway servers when running in HA mode. With some small tweaks, this will mostly look like the existing implementation. I plan to rely on Postgres for this, making use of its SKIP LOCKED feature to implement a work queue for synchronizing spawners, and asyncpg for its fast postgres integration.
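
The SKIP LOCKED pattern would look roughly like this sketch with asyncpg (the tasks table and its columns are assumptions, not the actual schema):

import asyncpg

# pool = await asyncpg.create_pool("postgresql://...")

CLAIM_NEXT_TASK = """
    UPDATE tasks
    SET state = 'running'
    WHERE id = (
        SELECT id FROM tasks
        WHERE state = 'pending'
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, payload
"""

async def claim_next_task(pool):
    """Atomically claim one pending task; concurrent gateway instances skip
    rows that another instance has already locked instead of blocking."""
    async with pool.acquire() as conn:
        async with conn.transaction():
            return await conn.fetchrow(CLAIM_NEXT_TASK)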

Cluster backend base class

Currently we define a base ClusterManager class that each backend has to implement. Each dask cluster gets its own ClusterManager instance, which manages starting/stopping/scaling the cluster.

With the plan described above we'll need to move the backend-specific abstractions up higher in the stack. I propose the following initial definition (will likely change as I start working on things):

class ClusterBackend:
    async def list_clusters(self, user=None):
        """List all running or pending clusters.

        Parameters
        ----------
        user : str, optional
            If provided, filter on this user name.
        """
        pass

    async def get_cluster(self, user, cluster_id):
        """Get information about a cluster.

        Same output as `list_clusters`, but for one cluster"""
        pass

    async def start_cluster(self, user, **cluster_options):
        """Start a new cluster for this user"""
        pass

    async def stop_cluster(self, user, cluster_id):
        """Stop a cluster.
        
        No-op if already stopped/doesn't exist.
        """
        pass
        
    async def scale_cluster(self, user, cluster_id, n):
        """Scale a cluster to `n` workers."""
        pass

Things like kubernetes and maybe hadoop would implement the ClusterBackend class. We'd also provide an implementation that uses a database to manage the multi-user/cluster state and abstracts over a single-cluster class (probably looking the same as our existing ClusterManager base class).

from traitlets import Type  # dask-gateway configuration is traitlets-based

class DatabaseClusterBackendBase(ClusterBackend):
    """A backend class that uses a database to synchronize between users/clusters.

    HPC job queue backends would use this, as well as other backends lacking a
    sufficiently queryable resource manager."""
    cluster_manager_class = Type(
        klass=ClusterManager,  # the existing per-cluster manager base class
        help="The cluster manager to use to manage individual clusters",
    )
@jcrist
Member Author

jcrist commented Dec 12, 2019

I plan to work on an ideal kubernetes implementation first, following the abstraction laid out above, then add hadoop support, then job queue support, redefining abstractions as needed (working from most -> least common deployment backend). Design changes will almost certainly occur as I work through issues, but I think we can make the kubernetes deployment look more "kubernetes native" without sacrificing our other backends or splitting the codebase.

@jcrist
Member Author

jcrist commented Dec 12, 2019

cc @yuvipanda for thoughts on metacontroller/CRDs (apologies for the poorly spec'd design notes above, I'm still working out my thoughts).

@yuvipanda
Contributor

This looks amazing, and is something I wanna see in the JupyterHub world too!

I'm curious about the choice of JWT. They are awesome with short expiry, but present serious issues if they have longer expiry times. Unlike cookies with backing sessions, there is no way to revoke these if compromised without revoking all tokens. So no 'logout' is possible. http://cryto.net/~joepie91/blog/2016/06/13/stop-using-jwt-for-sessions/ has some more information on the problems with replacing cookies with JWT. But I don't actually know how clients pass this info to dask-gateway...

At a cursory glance the rest seem great!

@jcrist
Member Author

jcrist commented Dec 16, 2019

Glad to hear it!

Unlike cookies with backing sessions, there is no way to revoke these if compromised without revoking all tokens.

I went with JWTs instead of cookies, because I wanted something we could trust that wouldn't require us to keep a user database in our application. Since the username is in the JWT, we only need to validate the JWT instead of doing a DB lookup.

The issues you bring up are valid, but aren't as big of a deal for dask-gateway as they would be for JupyterHub. Authentication for the gateway is designed for no-human-in-the-loop authentication (e.g. no web form, only things like tokens, kerberos, etc...), so forcing a relogin for all users will only consume resources (extra calls to the authentication backend) rather than user time (since login is automated). I think this is ok for applications we care about.

@yuvipanda
Contributor

Thank you for the explanation! The 'no human in the loop' definitely makes things different.

To make sure I understand this correctly, I want to write out the example of how this would work with JupyterHub. To get a JWT you would:

  1. Pass the Authenticator a JupyterHub Token
  2. Authenticator talks to JupyterHub, validates token
  3. Issues JWT, signed with a particular key, with expiration time T seconds from now
  4. JWT is used for further authentication until it expires (or is invalid), at which time GOTO 1

Is this about right?

If you wanna revoke any token, you would need to revoke all tokens. Where would this happen? The way I can think of is to rotate the signing key, but then you would need to rotate the key across everything that is validating JWTs, where it becomes a key distribution problem. It makes deployment a little more complex, at which point it is a tradeoff from having a database. This can also be mitigated by having T be small, to force re-issue of JWTs more frequently.

This makes me wonder - can the Authenticator just hit the external auth source each time it is asked? You can perform transparent caching here appropriate to the backend (different for JupyterHub vs Kerberos, for example) that can be tuned by the admin as they see fit. This takes JWTs (and hence distributed signing keys) out of the equation as well. This doesn't require persistence because the auth info is in memory, and works fine for HA since the caching is per-process, and the 'source of truth' is external to them all.
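
A minimal sketch of that per-process caching idea (the check_external callable is a stand-in for whatever backend lookup a real Authenticator would do):

import time

class CachingAuthenticator:
    """Cache successful auth lookups in memory for a configurable TTL, so each
    gateway process only periodically re-hits the external auth source."""

    def __init__(self, check_external, ttl=300):
        self.check_external = check_external  # e.g. validates a JupyterHub token
        self.ttl = ttl
        self._cache = {}

    async def authenticate(self, token):
        now = time.monotonic()
        hit = self._cache.get(token)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]                        # still fresh, skip the backend call
        user = await self.check_external(token)  # source of truth stays external
        self._cache[token] = (now, user)
        return user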

I might also be talking from a fundamental ignorance of how dask-gateway auth works. If so, let me know and I'll happily trawl through the code instead :)

@jcrist
Member Author

jcrist commented Dec 17, 2019

Is this about right?

Yeah, that looks correct.

This makes me wonder - can the Authenticator just hit the external auth source each time it is asked?

This is an interesting idea. I see the jwt lifetime as set relatively short (15 min by default perhaps, but configurable). Talking through the tradeoffs:

  • JWTs are shareable between instances of the gateway server. If a user has a JWT in the request, it doesn't matter which instance gets it, requests to the authentication backend are cached for the same amount of time globally for all servers. In contrast, if we implement caching per server, then each server has a different cache window, which is harder to reason about. The authentication backend will also be hit on average N times more per user, where N is the number of gateway servers.
  • JWTs can be revoked globally by updating the signing key for all servers. This requires an extra step for admins, but with the helm chart and hubploy shouldn't be too difficult. For caching per server, tokens can be revoked globally for all users by restarting all servers.

I think I like the JWT approach slightly more since the caching in this case is global across all instances of the server, which I find easier to reason about. Thoughts?

@jcrist
Member Author

jcrist commented Jan 23, 2020

I'm currently hacking away on this, progress so far is good. One complication that's come up though is implementing our current user resource limits model (https://gateway.dask.org/resource-limits.html) on kubernetes.

Currently we provide configurable limits on:

  • Clusters per user
  • Cores per user
  • Memory per user

Requests for new clusters/workers that exceed this limit will fail. This is easy to do in a system where all state is kept on a single server (or when using a database), but much harder when trying to keep all state in kubernetes objects. For non-kubernetes backends we could keep or tweak the existing model as needed, but for kubernetes we'll definitely need to change things. A few options I see here:

  1. Drop support for user limits entirely, and rely on kubernetes-native methods of restricting resources. One option here would be creating a namespace per user, each with a ResourceQuota (https://kubernetes.io/docs/concepts/policy/resource-quotas/) limiting objects. Other models could work as well - we could make the user -> namespace mapping configurable, so a namespace could also be shared between users to implement shared resource limits.

  2. The above, but manage these namespaces/resource-quotas in the gateway itself. We'd want to make this configurable, so as not to restrict users into one model.

  3. Drop support for user level limits, instead imposing limits per cluster. It's much easier to make local decisions in a controller (e.g. about a single cluster) rather than global ones about multiple clusters. So we could limit cores/memory/workers per cluster, but not clusters/cores/memory per user. This wouldn't be using ResourceQuota, and wouldn't impose any restrictions on mapping clusters -> namespaces, all the logic for this would be implemented in the controller.

Option 1 (and 2) support user-level (or group-level) resource limits natively, but impose restrictions on namespace usage (ResourceQuota limits are inherently per-namespace). Option 3 doesn't impose these restrictions, but changes limits to only per-cluster rather than per-user.
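
For reference, the per-namespace quota in options 1 and 2 could be managed by the gateway with something like the following sketch using the official kubernetes Python client (the quota name and hard limits are illustrative):

from kubernetes import client

def ensure_user_quota(api, namespace, max_cores, max_memory):
    """Create a ResourceQuota in the user's namespace capping the total
    cores/memory their dask clusters may request."""
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="dask-gateway-user-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": str(max_cores), "requests.memory": max_memory}
        ),
    )
    api.create_namespaced_resource_quota(namespace=namespace, body=quota)

# api = client.CoreV1Api()
# ensure_user_quota(api, "dask-user-alice", max_cores=16, max_memory="64Gi")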

Right now I'm leaning towards Option 2 (or 1), but option 3 also isn't bad.

cc @yuvipanda, @mmccarty @jacobtomlinson for thoughts.

@mmccarty
Member

Great news Jim! We are in favor of option 2. That also gives us the option to fall back to option 1 if needed.
cc @dankerrigan @gforsyth @droctothorpe @zac-hopkinson

@jacobtomlinson
Member

Option 2 also sounds reasonable to me.

@yuvipanda
Contributor

Creating a new namespace per user can get complicated and unwieldy on clusters that aren't solely dedicated to dask-gateway / Jupyter use. I like option 2, assuming you aren't creating namespaces there.

You can also use ResourceQuota without namespaces - see https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-priorityclass.

@jacobtomlinson
Member

@droctothorpe could you expand a little on the drawbacks? I think using traefik to reverse proxy the pod services doesn't have much to do with the cluster's service mesh. In this instance it sounds like traefik is effectively being used at the application level.

@jcrist jcrist mentioned this issue Feb 4, 2020
@droctothorpe
Contributor

Having dug a little deeper into the Traefik documentation, @jacobtomlinson, my concerns were unwarranted. Deleting the comment. Thanks for the appropriate pushback.

@jcrist
Member Author

jcrist commented Feb 11, 2020

The first step of this rearchitecture was done in #195. The backend interface has been defined, and all backends except k8s have been implemented using it. Further discussion of the k8s backend will take place in #198.

@rokroskar

I was looking for a possible way to set up authentication with JWTs and found this issue, but no other mention of JWT-based authentication. Is this work ongoing somewhere?
