
Don't expose insecure HTTP API #6407

Open
1 of 2 tasks
Tracked by #247
gjoseph92 opened this issue May 20, 2022 · 13 comments · May be fixed by #6431

Comments

@gjoseph92
Collaborator

gjoseph92 commented May 20, 2022

#6270 exposed a new HTTP API, enabled by default. Copied from #6270 (comment):

I'm concerned about a security regression here. By default, this is opening up an API that allows anyone to change cluster state (via retire_workers currently, but I imagine other things might be added someday too).

Prior to this, the only way to do things that affected cluster state was through the client. All the HTTP routes were effectively read-only. (Whether there is a vulnerability in the bokeh dashboard is another topic; it's pretty possible there is, but I'm just talking here in principle.)

I think it's rather common to expose the HTTP routes to the public internet. For example, I believe dask-cloudprovider does this:

By default a Dask security group will be created with ports 8786 and 8787 exposed to the internet https://cloudprovider.dask.org/en/latest/aws.html#dask_cloudprovider.aws.EC2Cluster

You want those ports exposed for convenience, so you can connect to them. But you don't want anyone to be able to do stuff to the cluster, so you set up TLS using temporary credentials. dask-cloudprovider does this for you as well:

When a cluster is launched with any of these cluster managers a set of temporary keys will be generated and distributed to the cluster nodes via their startup script. All communication between the client, scheduler and workers will then be encrypted and only clients and workers with valid certificates will be able to connect to the scheduler.
https://cloudprovider.dask.org/en/latest/security.html#authentication-and-encryption

Currently, if you set up TLS for your cluster, this is mTLS, meaning the scheduler verifies the client's certificate (docs, code). This serves as a form of authentication and authorization: if you've set up cluster security, you can only tell the scheduler to do things if you hold a valid certificate.
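For reference, a sketch of what that mTLS setup looks like in Dask's configuration. The key names follow distributed's `comm.tls` config section, and the file paths are placeholders:

```yaml
distributed:
  comm:
    require-encryption: true
    tls:
      ca-file: ca.pem            # CA used to verify every peer, clients included
      scheduler:
        cert: scheduler-cert.pem
        key: scheduler-key.pem
      client:
        cert: client-cert.pem    # the client must present a cert signed by the CA
        key: client-key.pem
      worker:
        cert: worker-cert.pem
        key: worker-key.pem
```

Because the scheduler verifies the client certificate against the CA, holding a valid cert/key pair acts as both authentication and authorization for scheduler comms.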

However, the HTTP routes have no authentication (they use standard TLS, not mTLS, because mTLS would be very inconvenient when you want to look at the dashboard with a web browser).

So after this change, someone who had gone to the trouble to set up mTLS for their cluster (or was using the defaults of their cluster deployment system) would, by default, have an unauthenticated endpoint running that allowed anyone with access to :8787 (aka the dashboard) to affect cluster state.


I think we should do two things:

  • Short-term: disable the HTTP API if TLS is specified on the scheduler. This is a reasonable default.
  • Long-term: figure out a security posture and authentication for the HTTP API that's consistent with the security posture of other things that can affect cluster state (aka the client).
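A minimal sketch of what the short-term mitigation could look like in config terms (the module names below reflect distributed's default `http.routes` list; treat them as illustrative rather than a settled proposal):

```yaml
distributed:
  scheduler:
    http:
      routes:
        # Read-only dashboard/info routes stay enabled...
        - distributed.http.scheduler.prometheus
        - distributed.http.scheduler.info
        - distributed.http.health
        - distributed.http.proxy
        - distributed.http.statics
        # ...but the mutating API module is omitted unless explicitly opted in:
        # - distributed.http.scheduler.api
```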
@jakirkham
Member

Thanks Gabe! 🙏

cc @jacobtomlinson @Matt711 @quasiben (for awareness)

@jacobtomlinson
Member

Thanks for raising this @gjoseph92. I totally agree with both the short-term and long-term suggestions.

Also a good point about dask-cloudprovider. This mode is a failsafe for when users do not provide any security rule information, so I would expect most users not to actually expose things like this. But I would like to change things over there to be fail-secure instead, because having a default behaviour of exposing things to the internet is not great.

Thanks for raising #6408 to handle the short-term issue, I've commented there.

For long-term options we could definitely implement some basic auth on the API, perhaps with an API key that can be set in the config. The intended use for the API is for external resource managers to be able to control things like retiring workers; those resource managers generally also create the scheduler in the first place, so it would be easy for them to share a key.

We may also want to think about authentication on the dashboard, but I feel that is a parallel conversation to this.

@psontag

psontag commented May 23, 2022

Short-term: disable the HTTP API if TLS is specified on the scheduler. This is a reasonable default.

Maybe it makes sense to have a separate configuration option to enable/disable the HTTP API and leave it off by default? Coupling this to another option might be a bit confusing.

@jacobtomlinson
Member

@philipp-sontag-by yeah totally agree. I suggested this over in #6408.

@mrocklin
Member

Short term, should we revert the PR for the release?

@jcrist
Member

jcrist commented May 23, 2022

Apologies for sidetracking this issue a bit. I wasn't sure where else to post this comment, so feel free to ignore it.

A different approach (one that is definitely more short-term work for the dask-kubernetes team, but maybe nicer in the end?) is to not rely on an API at all for the scheduler/worker pods in the k8s operator.

Dask-gateway contains an operator that does this by reversing the connection orientation. Schedulers periodically heartbeat to the API (read: the operator, in your case), and the operator takes actions appropriately. The operations here are a bit more complicated (if you're interested, I'd be happy to walk through them, either in an issue or on a call), but they have some nice properties:

  • The k8s operator never makes any outbound network requests, so there's no need to worry about timeouts/retries, and things scale very well
  • There's no need to handle operator -> scheduler authentication on the scheduler at all, since the scheduler receives no inbound API requests from the operator
  • All operator actions scale with the number of clusters, not the number of workers. Schedulers periodically phone home with which workers are active, and this is used in place of a health check on individual workers (or having workers phone home instead).
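The reversed-orientation pattern above can be sketched as a heartbeat handler plus a pure reconciliation step. All names here are illustrative, not dask-gateway's actual API:

```python
import time
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Operator-side record of one cluster, updated only by inbound heartbeats."""
    desired_workers: int
    reported_workers: int = 0
    last_heartbeat: float = field(default_factory=time.monotonic)


def handle_heartbeat(state: ClusterState, reported_workers: int) -> None:
    # The scheduler phones home; the operator never calls the scheduler.
    state.reported_workers = reported_workers
    state.last_heartbeat = time.monotonic()


def reconcile(state: ClusterState, now: float, timeout: float = 60.0) -> str:
    """Decide one action per cluster (not per worker), as in the bullets above."""
    if now - state.last_heartbeat > timeout:
        return "mark-cluster-unhealthy"   # no heartbeat: treat whole cluster as sick
    if state.reported_workers < state.desired_workers:
        return "create-worker-pods"
    if state.reported_workers > state.desired_workers:
        return "delete-worker-pods"
    return "noop"
```

Because `reconcile` only reads state that heartbeats deposited, the operator's work stays proportional to the number of clusters rather than the number of workers.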

There are likely reasons you went with your chosen approach, and I'm not arguing that you should change it. I just wanted to make sure you were aware there are other solutions to this problem.

@jakirkham
Member

The API was disabled in PR #6420, which should address this for the release.

@gjoseph92
Collaborator Author

I'd be okay closing this if you'd like. Or we could keep it open if you'd like to discuss the long-term plan for security.

@jakirkham
Member

I'm OK with keeping this open given there is still follow-up work to address the issue, though maybe we should update the title and OP to reflect this.

@gjoseph92
Collaborator Author

^ that makes it sound like it should probably be a different issue then :)

@jakirkham
Member

Whichever is simplest :)

@jacobtomlinson jacobtomlinson changed the title Don't expose HTTP API for secure clusters Don't expose insecure HTTP API May 24, 2022
@jacobtomlinson
Member

Let's leave this open; I think the fact that we have plans for both a short-term and a long-term solution suggests that it is in fact the same issue.

@jcrist thanks for the feedback here. There are a few reasons why I set up the communication in the operator => scheduler direction:

  • No plugins or modifications to the scheduler are necessary, which simplifies the user config both in terms of Docker images and pod configs. We expose much more of Kubernetes to the user (if they want it) than Gateway does, so I want to keep things as clean as possible there.
  • I can also see challenges in terms of security in the other direction. I'm keen to get Istio working well with the new operator, and adding a security rule to allow schedulers to speak to a Kubernetes system service feels like a bad choice. Especially given that user-provided code can be executed on the scheduler.
  • Many other Kubernetes services probe the scheduler for metrics and health information, so probing for scale information felt natural.
  • We are using kopf to write the operator which supports multiple instances, peer discovery and leadership election, so I'm not too concerned about operator scalability.

To resolve this issue long term the simplest option would probably be adding an API key option to the config which must be provided via a header in calls to protected routes.
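A minimal sketch of what that header check could look like, assuming a bearer-token style `Authorization` header (the header scheme and function names are illustrative, not what #6431 necessarily implements):

```python
import hmac


def check_api_key(headers: dict, expected_key: str) -> bool:
    """Return True when the request carries the configured API key.

    hmac.compare_digest does a constant-time comparison, so an attacker
    cannot recover the key byte-by-byte via response timing.
    """
    provided = headers.get("Authorization", "")
    prefix = "Bearer "
    if not provided.startswith(prefix):
        return False
    return hmac.compare_digest(provided[len(prefix):], expected_key)
```

A protected route handler would call this before acting and return a 403 on failure; the key itself would come from the Dask config shared between the scheduler and the resource manager that launched it.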

@jacobtomlinson jacobtomlinson linked a pull request May 24, 2022 that will close this issue
@jacobtomlinson
Member

I've opened #6431 with a proposal for a long-term solution. Reviews would be much appreciated.
