Don't expose insecure HTTP API #6407
Comments
Thanks Gabe! 🙏 cc @jacobtomlinson @Matt711 @quasiben (for awareness)
Thanks for raising this @gjoseph92. I totally agree with both the short-term and long-term suggestions. Also a good point about Thanks for raising #6408 to handle the short-term issue, I've commented there. For long-term options we could definitely implement some basic auth on the API, perhaps with an API key that can be set in the config. The intended use for the API is for external resource managers to be able to control things like retiring workers. Those resource managers generally also create the scheduler in the first place, so it would be easy for them to share a key. We may also want to think about authentication on the dashboard, but I feel that is a parallel conversation to this.
Maybe it makes sense to have a separate configuration option to enable/disable the HTTP API and leave it off by default? Coupling this to another option might be a bit confusing.
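To make the "separate option, off by default" idea concrete, here is a minimal sketch. It is not distributed's actual configuration: the `http_api_enabled` key, the route lists, and the `enabled_routes` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical sketch: these names are NOT distributed's real config keys.
DEFAULTS = {"http_api_enabled": False}  # state-changing API stays off unless opted in

READ_ONLY_ROUTES = ["/health", "/status"]            # illustrative paths
STATE_CHANGING_ROUTES = ["/api/v1/retire_workers"]   # illustrative path


def enabled_routes(config=None):
    """Return the routes to serve, given user config layered over the defaults."""
    merged = {**DEFAULTS, **(config or {})}
    routes = list(READ_ONLY_ROUTES)
    if merged["http_api_enabled"]:
        routes += STATE_CHANGING_ROUTES
    return routes
```

With a dedicated flag like this, a user who never touches the option gets only the read-only routes, and enabling the API is an explicit, single-purpose decision rather than a side effect of another setting.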
@philipp-sontag-by yeah totally agree. I suggested this over in #6408.
Short term for the release should we revert the PR?
Apologies for sidetracking this issue a bit - I wasn't sure where else to post this comment, so feel free to ignore it. A different approach (one that is definitely more short-term work for the dask-kubernetes team, but maybe nicer in the end?) is to not rely on an API at all for the scheduler/worker pods in the k8s operator. Dask-gateway contains an operator that does this by reversing the connection orientation. Schedulers periodically heartbeat to the API (read: the operator, in your case), and the operator takes actions appropriately. The operations here are a bit more complicated (if you're interested, I'd be happy to walk through them either in an issue or a call), but have some nice properties:
There are likely reasons you went with your chosen approach, and I'm not arguing that you should change it. I just wanted to make sure you were aware there are other solutions to this problem.
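As a rough illustration of the reversed connection orientation described above, the sketch below has schedulers report in and the operator act on what it has heard. It is not dask-gateway's implementation; the `Operator` class, its method names, and the timeout policy are assumptions made for the example.

```python
class Operator:
    """Hypothetical operator that only ever receives connections from schedulers."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a cluster counts as stale
        self.last_seen = {}      # cluster name -> time of the last heartbeat

    def heartbeat(self, cluster, now):
        # Called when a scheduler reports in; the operator never dials the scheduler.
        self.last_seen[cluster] = now

    def stale_clusters(self, now):
        # Clusters whose scheduler has been silent for longer than the timeout.
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)
```

In this orientation the scheduler does not need to expose any state-changing endpoint for the operator to call, which is the property relevant to this issue.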
The API is disabled with PR (#6420), which should address this for the release
I'd be okay closing this if you'd like. Or we could keep it open if you'd like to discuss the long-term plan for security.
Am ok with keeping this open given there is still follow up work to address the issue. Though maybe we should update the title and OP to reflect this
^ that makes it sound like it should probably be a different issue then :)
Whichever is simplest :)
Let's leave this open. I think the fact that we have plans for a long-term and short-term solution suggests that it is in fact the same issue. @jcrist thanks for the feedback here. There are a few reasons why I set up the communication in the operator => scheduler direction:
To resolve this issue long term the simplest option would probably be adding an API key option to the config which must be provided via a header in calls to protected routes.
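A minimal sketch of what such a check could look like, using only the standard library. The header name `X-API-Key`, the `is_authorized` and `handle` helpers, and the protected route path are assumptions for illustration, not what the proposal necessarily implements.

```python
import hmac

API_KEY_HEADER = "X-API-Key"  # hypothetical header name


def is_authorized(headers, configured_key):
    """Check the request's API key against the configured one."""
    if not configured_key:
        return False  # fail closed: no key configured means no access
    supplied = headers.get(API_KEY_HEADER, "")
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(supplied.encode(), configured_key.encode())


def handle(route, headers, configured_key, protected=("/api/v1/retire_workers",)):
    """Return an HTTP status code: 401 for unauthorized calls to protected routes."""
    if route in protected and not is_authorized(headers, configured_key):
        return 401
    return 200
```

Failing closed when no key is configured keeps protected routes unreachable by default, and unprotected routes such as the dashboard are unaffected.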
I've opened #6431 with a proposal for a long-term solution. Reviews would be much appreciated.
#6270 exposed a new HTTP API, enabled by default. Copied from #6270 (comment):
I'm concerned about a security regression here. By default, this is opening up an API that allows anyone to change cluster state (via `retire_workers` currently, but I imagine other things might be added someday too).
Prior to this, the only way to do things that affected cluster state was through the client. All the HTTP routes were effectively read-only. (Whether there is a vulnerability in the bokeh dashboard is another topic; it's pretty possible there is, but I'm just talking here in principle.)
I think it's rather common to expose the HTTP routes to the public internet. For example, I believe dask-cloudprovider does this:
You want those ports exposed for convenience, so you can connect to them. But you don't want anyone to be able to do stuff to the cluster, so you set up TLS using temporary credentials. dask-cloudprovider does this for you as well:
Currently, if you set up TLS for your cluster, this is mTLS, meaning the scheduler verifies the client's certificate (docs, code). This serves as a form of authentication and authorization: if you've set up cluster security, you can only tell the scheduler to do things if you hold a valid certificate.
However, the HTTP routes have no authentication (they use standard TLS, not mTLS, because mTLS would be very inconvenient when you want to look at the dashboard with a web browser).
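The difference between the two TLS modes can be sketched with Python's standard `ssl` module. This is an illustration of the concept only, not distributed's security code; the `server_context` helper is a hypothetical name, and certificate loading is omitted.

```python
import ssl


def server_context(require_client_cert):
    """Build a server-side TLS context; with mTLS the client must present a cert."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # A real deployment would also call ctx.load_cert_chain(...) for the server's
    # own certificate and ctx.load_verify_locations(...) for the trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED if require_client_cert else ssl.CERT_NONE
    return ctx


comm_ctx = server_context(require_client_cert=True)   # mTLS: client is authenticated
http_ctx = server_context(require_client_cert=False)  # standard TLS: any client may connect
```

With `CERT_NONE` the connection is still encrypted, but the server learns nothing about who is on the other end, which is why a state-changing route served this way needs some other form of authentication.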
So after this change, someone who had gone to the trouble to set up mTLS for their cluster (or was using the defaults of their cluster deployment system) would, by default, have an unauthenticated endpoint running that allowed anyone with access to `:8787` (aka the dashboard) to affect cluster state.
I think we should do two things: