client: send node secret with every client-to-server RPC #16799

tgross · 2023-04-05T17:26:50Z

In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the request came through a client node first. But this fix knowingly broke the identification of many client-to-server RPCs, which are now measured as if they were anonymous. The reason is that many client-to-server RPCs do not send the node secret and instead rely on the protection of mTLS.

This changeset ensures that the node secret is sent with every client-to-server RPC request. In a future version of Nomad we can add enforcement on the server side, but this was left out of this changeset to reduce risks to the safe upgrade path.

Sending the node secret as an auth token introduces a new problem during initial introduction of a client. Clients send many RPCs concurrently with Node.Register, but until the node is registered the node secret is unknown to the server and will be rejected as invalid. This causes permission denied errors.

To fix that, this changeset introduces a gate on having successfully made a Node.Register RPC before any other RPCs can be sent (except for Status.Ping, which we need earlier but which also ignores the error because that handler doesn't do an authorization check). This ensures that we only send requests with a node secret already known to the server. This also makes client startup a little easier to reason about because we know Node.Register must succeed first, and it should make for a good place to hook in future plans for secure introduction of nodes. The tradeoff is that an existing client that has running allocs will take slightly longer (a second or two) to transition to ready after a restart, because the transition in Node.UpdateStatus is gated at the server by first submitting Node.UpdateAlloc with client alloc updates.

Fixes: #16798
Fixes: https://github.com/hashicorp/nomad-enterprise/issues/1069

lgfa29

Getting a more predictable node registration flow is a nice benefit 😄

schmichael · 2023-06-16T00:47:12Z

client/client.go

+	// Block until we've registered at least once so that we know the server has
+	// our node secret and we can authenticate
+	select {
+	case <-c.registeredCh:
+	case <-c.shutdownCh:
+		return
+	}


This substantially changes what happens if a disconnected client agent with workloads restarts. Currently, if a workload crashes, it can be restarted locally without reconnecting to the server. With this change, after a restart the client will be blocked here and will not be monitoring workloads for exits, so they will not be restarted.

I think this also breaks disconnected client agents with Consul-based TTL health checks, as Nomad won't heartbeat those checks until the runners start.

Luckily I think there's a pretty easy fix that's also a nice optimization: persist whether first registration has occurred. This means we only ever block on first registration when there are no workloads running anyway! If there are runners running then we know we've successfully registered, so we should start running them again ASAP.
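One way to picture the suggestion is the sketch below. It assumes a hypothetical `stateDB` interface with `HasRegistered` and `PutRegistered` methods and an in-memory `memDB` implementation. None of these names come from the Nomad codebase, and the real client state store may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a hypothetical stand-in for the client's local state store.
type stateDB interface {
	HasRegistered() (bool, error)
	PutRegistered() error
}

// memDB is an in-memory implementation used only for this illustration.
type memDB struct{ registered bool }

func (m *memDB) HasRegistered() (bool, error) { return m.registered, nil }
func (m *memDB) PutRegistered() error         { m.registered = true; return nil }

type client struct {
	db           stateDB
	registeredCh chan struct{}
	once         sync.Once
}

// newClient opens the gate immediately if a previous run of the agent
// already registered, so restored workloads are not held back.
func newClient(db stateDB) (*client, error) {
	c := &client{db: db, registeredCh: make(chan struct{})}
	ok, err := db.HasRegistered()
	if err != nil {
		return nil, err
	}
	if ok {
		c.once.Do(func() { close(c.registeredCh) })
	}
	return c, nil
}

// onRegisterSuccess persists the fact first, then opens the gate.
func (c *client) onRegisterSuccess() error {
	if err := c.db.PutRegistered(); err != nil {
		return err
	}
	c.once.Do(func() { close(c.registeredCh) })
	return nil
}

// isUnblocked reports whether the gate is open, without blocking.
func (c *client) isUnblocked() bool {
	select {
	case <-c.registeredCh:
		return true
	default:
		return false
	}
}

func main() {
	db := &memDB{}

	first, _ := newClient(db)
	fmt.Println("fresh node unblocked:", first.isUnblocked())
	_ = first.onRegisterSuccess()

	// Simulate an agent restart that reuses the same state store.
	restarted, _ := newClient(db)
	fmt.Println("restarted node unblocked:", restarted.isUnblocked())
}
```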

Refactoring

This does make me realize I don't think anyone writing a Nomad Client from scratch would structure the code this way... breaking the currently monolithic NewClient into its 3 actual phases:

NewClient - initialize the struct, no side effects or IO

Register - only blocks on first run

Run - ...the runners! all the non-registration goroutines get spawned here

...would maybe make all of this easier to reason about, but that would be a pretty major refactor and touch innumerable tests. I highly doubt this is worth pursuing, but it's fun to dream.

tgross · 2023-06-16

Gah, I totally missed the restart case 🤦 Persisting registration is simple enough, will fix. And that'll probably make an easy way to block the RPCs specifically as well (as noted above).

I sort of wish the server was what handed out the node secret because that would make it really easy to identify a node that hadn't yet registered because it wouldn't yet have a secret. Maybe we can revisit that workflow when we look at secure introduction of nodes in the near-ish future, and that might be a good time to consider the refactoring you're talking about here as well.

tgross · 2023-06-21T21:08:57Z

Ok @schmichael I think I've got this working as discussed. I do want to do a little more end-to-end testing of this before merging, but I figured in the meantime I'd try to get feedback on the approach.

auth: add client-only ACL #18730

The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By using `nil` as a sentinel value, we have the risk of nil pointer exceptions and improper handling of `nil` when returned from our various auth methods that can lead to privilege escalation bugs. This is the third in a series to eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves creating a new "virtual" ACL object for checking permissions on client operations and a matching `AuthenticateClientOnly` method for client-only RPCs that can produce that ACL. Unlike the server ACLs PR, this also includes a special case for "legacy" client RPCs where the client was not previously sending the secret as it should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but it'll take a while before we can guarantee they'll be present during upgrades.

Ref: hashicorp/nomad-enterprise#1218
Ref: #18703
Ref: #18715
Ref: #16799

auth: remove "mixed auth" special casing for Variables endpoint #18744

The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By using `nil` as a sentinel value, we have the risk of nil pointer exceptions and improper handling of `nil` when returned from our various auth methods that can lead to privilege escalation bugs. This is the third in a series to eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch involves leveraging the refactored `auth` package to remove the weird "mixed auth" helper functions that only support the Variables read/list RPC handlers. Instead, pass the ACL object and claim together into the `AllowVariableOperations` method in the usual `acl` package.

Ref: hashicorp/nomad-enterprise#1218
Ref: #18703
Ref: #18715
Ref: #16799
Ref: #18730
Fixes: #15875

auth: use ACLsDisabledACL when ACLs are disabled #18754

The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By using `nil` as a sentinel value, we have the risk of nil pointer exceptions and improper handling of `nil` when returned from our various auth methods that can lead to privilege escalation bugs. This is the final patch in a series to eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled.

This patch adds a new virtual ACL policy field for when ACLs are disabled and updates our authentication logic to use it.

Included:
* Extends auth package tests to demonstrate that nil ACLs are treated as failed auth and disabled ACLs succeed auth.
* Adds a new `AllowDebug` ACL check for the weird special casing we have for pprof debugging when ACLs are disabled.
* Removes the remaining unexported methods (and repeated tests) from the `nomad/acl.go` file.
* Update the semgrep rules to detect improper nil ACL checking and remove the old invalid ACL checks.
* Update the contributing guide for RPC authentication.

Ref: hashicorp/nomad-enterprise#1218
Ref: #18703
Ref: #18715
Ref: #16799
Ref: #18730
Ref: #18744

tgross self-assigned this Apr 5, 2023

tgross added this to the 1.6.0 milestone Apr 5, 2023

tgross added theme/metrics type/bug theme/auth theme/client labels Apr 5, 2023

tgross force-pushed the auth-node-secret branch from 8d32bf2 to 63ef131 Compare June 8, 2023 20:43

tgross force-pushed the auth-node-secret branch from 63ef131 to 30f6a1c Compare June 12, 2023 17:52

tgross force-pushed the auth-node-secret branch from 33014e3 to 4940c13 Compare June 13, 2023 19:29

tgross marked this pull request as ready for review June 13, 2023 19:48

tgross requested review from schmichael, lgfa29 and jrasell June 13, 2023 19:48

lgfa29 approved these changes Jun 13, 2023

schmichael requested changes Jun 16, 2023

tgross force-pushed the auth-node-secret branch from e96b201 to 219132d Compare June 20, 2023 20:38

tgross added 2 commits June 21, 2023 16:02

persist has-registered status to client state db

1c324e1

tgross force-pushed the auth-node-secret branch from 219132d to 18c5b5a Compare June 21, 2023 20:13

unblock intentionally failed test RPCs

18a6b30

tgross force-pushed the auth-node-secret branch from 18c5b5a to 18a6b30 Compare June 21, 2023 20:47

tgross requested a review from schmichael June 21, 2023 21:07

schmichael approved these changes Jun 22, 2023

Apply suggestions from code review

54f1978

Co-authored-by: Michael Schurter <[email protected]>

tgross merged commit deae9bb into main Jun 22, 2023

tgross deleted the auth-node-secret branch June 22, 2023 15:06

tgross mentioned this pull request Oct 11, 2023

auth: add client-only ACL #18730

Merged

tgross mentioned this pull request Oct 12, 2023

auth: remove "mixed auth" special casing for Variables endpoint #18744

Merged

tgross mentioned this pull request Oct 13, 2023

auth: use ACLsDisabledACL when ACLs are disabled #18754

Merged

tgross mentioned this pull request Feb 22, 2024

Restoring snapshot removes current Nomad Clients, and clients don't re-join the cluster automatically #20024

Closed

tgross mentioned this pull request Aug 16, 2024

auth: enforce use of node secret and remove legacy auth #23838

Merged
