
client: don't use Status RPC for Consul discovery (#16490)

Merged
@tgross merged 7 commits into main from client-access-to-status-members on Mar 16, 2023

Conversation

@tgross (Member) commented Mar 14, 2023

Fixes #16470

In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.

Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
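
For illustration, here's a minimal, self-contained sketch of the discovery flow described above. All names are stand-ins and the Consul API usage is simplified; the real implementation lives in client/client.go.

```go
// Hypothetical sketch: build the client's server list straight from the
// Consul catalog response, with no Nomad Status RPC in the loop.
package discovery

import (
	"fmt"
	"net"

	consulapi "github.com/hashicorp/consul/api"
)

// serversFromConsul converts Consul catalog entries for the Nomad server
// service into RPC addresses. Returning a "no servers" error lets the
// caller kick off discovery again and retry immediately.
func serversFromConsul(entries []*consulapi.CatalogService) ([]net.Addr, error) {
	var out []net.Addr
	for _, e := range entries {
		host := e.ServiceAddress
		if host == "" {
			host = e.Address // fall back to the node address
		}
		addr, err := net.ResolveTCPAddr("tcp",
			fmt.Sprintf("%s:%d", host, e.ServicePort))
		if err != nil {
			continue // skip entries we can't resolve
		}
		out = append(out, addr)
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no servers")
	}
	return out, nil
}
```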

@tgross (Member Author) commented Mar 14, 2023

Moved back to draft, as this is still failing in E2E testing. I'm realizing the problem is a little more complex than I thought at first glance -- the Status.Members RPC call is happening before we register the node, so even if we pass the client secret as an auth token, the server has never seen that secret before and can't authenticate the node, because we haven't registered yet!

@tgross force-pushed the client-access-to-status-members branch from fe2098f to 7dddac8 on March 14, 2023 20:21
@tgross changed the title from "client: pass secret in Status.Members RPC" to "client: use Status.RPCServers RPC for Consul discovery" on Mar 14, 2023
@tgross force-pushed the client-access-to-status-members branch from 7dddac8 to 487d00d on March 14, 2023 20:23
@tgross force-pushed the client-access-to-status-members branch 2 times, most recently from ec611f9 to f8d1dc1 on March 14, 2023 20:56
@tgross (Member Author) commented Mar 14, 2023

I've reworked this PR heavily based on the results of a discussion I had with @shoenig and @gulducat in order to get this working in the pre-registration phase. Because the new RPC handler is substantially the same as ee3e952, which we closed unmerged, I've added Ruslan as a co-author on this commit.

I need to do one more round of testing with this on E2E with the advertise addresses set correctly, and then I'll be ready for review.

In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.

Create a new `Status.RPCServers` endpoint that mirrors the `Status.Peers`
endpoint but provides the RPC server addresses instead of the Raft
addresses. This fixes the authentication bug but also ensures we're only
registering with servers in the client's region and not in any other servers
that might have registered with Consul.

This changeset also expands the test coverage of the RPC endpoint and closes up
potential holes in the `ResolveACL` method that aren't currently bugs but easily
could become bugs if we called the method without ensuring its invariants are
upheld.

Co-authored-by: tantra35 <[email protected]>
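
As a rough, self-contained sketch of the `Status.RPCServers` idea this commit message describes (all types here are stand-ins for Nomad's internal ones, and note this approach was later dropped in favor of using the Consul response directly):

```go
// Hypothetical sketch of a Status.RPCServers-style handler: mirror
// Status.Peers, but return the servers' advertised RPC addresses
// (not their Raft addresses), filtered to the local region.
package status

import "errors"

type serverParts struct {
	Region  string
	RPCAddr string // advertised RPC address, e.g. "172.31.92.121:4647"
}

type endpoint struct {
	region  string        // this server's region
	members []serverParts // parsed serf members
}

// RPCServers has no ACL gate: clients need it before they have
// registered, and it exposes nothing beyond what Status.Peers
// already does.
func (s *endpoint) RPCServers(reply *[]string) error {
	for _, m := range s.members {
		if m.Region != s.region {
			continue // only return servers in the client's region
		}
		*reply = append(*reply, m.RPCAddr)
	}
	if len(*reply) == 0 {
		return errors.New("no servers")
	}
	return nil
}
```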
@tgross (Member Author) commented Mar 15, 2023

Ok, I've tested this out on a cluster on AWS using HCP Consul for discovery and the networking tweaks discussed in #16507 to ensure that we're using the serf advertisement:

$ nomad operator api "/v1/status/peers"
["172.31.119.47:4647","172.31.112.244:4647","172.31.125.250:4647"]

$ nomad operator api "/v1/agent/members" | jq -r '.Members[].Addr'
172.31.125.250
172.31.112.244
172.31.119.47

$ nomad operator api "/v1/agent/members" | jq -r '.Members[].Tags.rpc_addr'
172.31.92.121
172.31.93.95
172.31.88.76

There is no anonymous policy set. And both the servers and client end up happy:

$ nomad server members
Name                     Address         Port  Status  Leader  Raft Version  Build      Datacenter  Region
ip-172-31-88-76.global   172.31.119.47   4648  alive   false   3             1.5.2-dev  dc1         global
ip-172-31-92-121.global  172.31.125.250  4648  alive   true    3             1.5.2-dev  dc1         global
ip-172-31-93-95.global   172.31.112.244  4648  alive   false   3             1.5.2-dev  dc1         global

$ nomad node status
ID        DC   Name              Class   Drain  Eligibility  Status
700d45a8  dc1  ip-172-31-88-231  <none>  false  eligible     ready

[screenshot: hcp-consul]

@tgross marked this pull request as ready for review on March 15, 2023 15:03
@tgross force-pushed the client-access-to-status-members branch from ff33dae to 4010fdb on March 15, 2023 21:33
command/agent/agent.go Outdated (resolved)
command/agent/agent.go Outdated (resolved)
client/client.go Outdated
srv := &servers.Server{Addr: addr}
nomadServers = append(nomadServers, srv)
}
if slices.Contains(s.ServiceTags, region) {
A reviewer (Member) commented:

If we switch to Meta (see below) this will need to change.

However, now I think we've introduced ordering into our otherwise flexible upgrade procedure (at least one Server would need to be upgraded before any Clients for this to work). 🤔

But wait... what actually happens if a Client in Region A calls Node.Register on a Server in Region B?

  • If the Regions are Federated, the Node.Register RPC is Forwarded and replies with the correct server list for Region A!
  • If the Regions are not Federated, the Node.Register RPC errors and the Client will try the next Server, presumably until reaching a Server in its own Region.

...except I don't think that will happen for the following reasons: 😭

  1. The Client.registerNode() code doesn't appear to use the response it gets from Node.Register! We pluck out the HeartbeatTTL but throw away the Server list. 👎
  2. If the Regions aren't Federated, we wait up to 30s before retrying the Node.Register RPC. This means for clusters with <1,000 nodes we risk going down on Client agent restarts just from a single failed Node.Register call! 👎

Have I mentioned lately I hate all of this discovery/registration code and logic? I think it's too clever by half, and it makes the really critical operations (like restarting Clients, which really need to re-register before the HeartbeatTTL expires) extremely difficult to determine/observe/tune. 😞

@tgross (Member Author) replied:

I've updated the PR (and the description above) to drop the extra tags/meta on Consul. And then I've updated the registerNode method to use the servers we get back from the response. If we get no servers or a "no path to region", we'll bubble that state up to the retry loop so that it triggers discovery again and hits the fast-path retry rather than the longer retry of 15s.

We don't really have to worry about regions in the Consul discovery step. If we
hit a different region and they're federated, the request will be forwarded. If
the regions aren't federated (not a very safe topology in general), the node
registration will fail and we'll retry.

Eliminate the region tags we added to Consul. Have `registerNode` update the
server list based on the response we get, and have it return the "no servers"
error if we get no servers so that we kick off discovery again and retry
immediately rather than 15s later.
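
A self-contained sketch of the retry behavior this commit message describes; every name here is a hypothetical stand-in for the real code in client/client.go:

```go
// Sketch of registerNode's retry loop: consume the server list from the
// Node.Register response, and treat "no servers" as a signal to re-run
// discovery and retry immediately.
package client

import (
	"errors"
	"time"
)

var errNoServers = errors.New("no servers")

type registerResponse struct {
	Servers      []string      // RPC addresses returned in Node.Register
	HeartbeatTTL time.Duration // plucked out as before
}

type client struct {
	register func() (*registerResponse, error) // Node.Register RPC
	discover func() []string                   // Consul discovery
	servers  []string
}

// retryRegisterNode registers the node, now using the server list in the
// response. On "no servers" (or "no path to region", folded into the same
// error here) it re-runs discovery and retries immediately instead of
// waiting the usual 15s.
func (c *client) retryRegisterNode() {
	const retryInterval = 15 * time.Second
	for {
		resp, err := c.register()
		if err == nil && len(resp.Servers) > 0 {
			c.servers = resp.Servers // install servers from the response
			return
		}
		if errors.Is(err, errNoServers) || err == nil {
			c.servers = c.discover() // fast path: rediscover, retry now
			continue
		}
		time.Sleep(retryInterval) // other errors: normal retry interval
	}
}
```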
client/client.go Outdated (resolved)
nomad/acl.go Outdated (resolved)
@tgross merged commit 8684183 into main on Mar 16, 2023
@tgross deleted the client-access-to-status-members branch on March 16, 2023 19:38
@tgross changed the title from "client: use Status.RPCServers RPC for Consul discovery" to "client: don't use Status RPC for Consul discovery (#16490)" on Mar 16, 2023
@tgross added the backport/1.3.x and backport/1.4.x labels on Mar 20, 2023
tgross added a commit that referenced this pull request Mar 20, 2023
In #16217 we switched clients using Consul discovery to the `Status.Members`
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have `node:read`. We also can't check the `AuthToken` for the
request for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.

Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the
Consul response directly. Update the `registerNode` method to handle the list of
servers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.
tgross added a commit that referenced this pull request Mar 20, 2023
tgross added a commit that referenced this pull request Mar 20, 2023
tgross added a commit that referenced this pull request Mar 20, 2023
Labels
backport/1.3.x, backport/1.4.x, backport/1.5.x, theme/auth, theme/discovery, type/bug
Development

Successfully merging this pull request may close this issue:
clients require node:read on anon ACL policy to join cluster via Consul discovery (1.5.1)
3 participants