-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: don't use Status
RPC for Consul discovery (#16490)
#16490
Conversation
ba31797
to
fe2098f
Compare
Moved back to draft, as this is still failing in E2E testing. I'm realizing the problem is a little more complex that I thought at first glance -- the |
fe2098f
to
7dddac8
Compare
Status.Members
RPCStatus.RPCServers
RPC for Consul discovery
7dddac8
to
487d00d
Compare
ec611f9
to
f8d1dc1
Compare
I've reworked this PR heavily based on the results of a discussion I had with @shoenig and @gulducat in order to get this working in the pre-registration phase. Because the new RPC handler is substantially the same as ee3e952 which we closed unmerged, I've added Ruslan as a co-author on this commit. I need to do one more round of testing with this on E2E with the advertise addresses set correctly, and then I'll be ready for review. |
In #16217 we switched clients using Consul discovery to the `Status.Members` endpoint for getting the list of servers so that we're using the correct address. This endpoint has an authorization gate, so this fails if the anonymous policy doesn't have `node:read`. We also can't check the `AuthToken` for the request for the client secret, because the client hasn't yet registered so the server doesn't have anything to compare against. Create a new `Status.RPCServers` endpoint that mirrors the `Status.Peers` endpoint but provides the RPC server addresses instead of the Raft addresses. This fixes the authentication bug but also ensures we're only registering with servers in the client's region and not in any other servers that might have registered with Consul. This changeset also expands the test coverage of the RPC endpoint and closes up potential holes in the `ResolveACL` method that aren't currently bugs but easily could become bugs if we called the method without ensuring its invariants are upheld. Co-authored-by: tantra35 <[email protected]>
f8d1dc1
to
2bb67c3
Compare
Ok, I've tested this out on a cluster on AWS using HCP Consul for discovery and the networking tweaks discussed in #16507 to ensure that we're using the serf advertisement:
There is no anonymous policy set. And both the servers and client end up happy:
|
ff33dae
to
4010fdb
Compare
4010fdb
to
25d0361
Compare
client/client.go
Outdated
srv := &servers.Server{Addr: addr} | ||
nomadServers = append(nomadServers, srv) | ||
} | ||
if slices.Contains(s.ServiceTags, region) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we switch to Meta (see below) this will need to change.
However now I think we introduced ordering to our otherwise flexible upgrade procedure (at least one Server would need to be upgraded before any Clients for this to work). 🤔
But wait... what actually happens if a Client in Region A calls Node.Register on a Server in Region B?
- If the Regions are Federated, the Node.Register RPC is Forwarded and replies with the correct server list for Region A!
- If the Regions are not Federated, the Node.Register RPC errors and the Client will try the next Server, presumably until reaching a Server in its own Region.
...except I don't think that will happen for the following reasons: 😭
- The
Client.registerNode()
code doesn't appear to use the response it gets from Node.Register! We pluck out the HeartbeatTTL but throw away the Server list. 👎 - If the Regions aren't Federated, we wait up to 30s before retrying the Node.Register RPC. This means for clusters with <1,000 nodes we risk going
down
on Client agent restarts just from a single failed Node.Register call! 👎
Have I mentioned lately I hate all of this discovery/registration code and logic? I think it's too clever by half and makes the really critical operations like Restarting Clients really need to Register before HeartbeatTTL
extremely difficult to determine/observe/tune. 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR (and the description above) to drop the extra tags/meta on Consul. And then I've updated the registerNode
method to use the servers we get back from the response. If we get no servers or a "no path to region", we'll bubble that state up to the retry loop so that it triggers discovery again and hits the fast-path retry rather than the longer retry of 15s.
We don't really have to worry about regions in the Consul discovery step. If we hit a different region and they're federated, the request will be forwarded. If the regions aren't federated (not a very safe topology in general), the node registration will fail and we'll retry. Eliminate the region tags we added to Consul. Have `registerNode` update the server list based on the response we get, and have it return the "no servers" error if we get no servers so that we kick off discovery again and retry immediately rather than 15s later.
307a0c2
to
a7a6cdf
Compare
Status.RPCServers
RPC for Consul discoveryStatus
RPC for Consul discovery (#16490)
In #16217 we switched clients using Consul discovery to the `Status.Members` endpoint for getting the list of servers so that we're using the correct address. This endpoint has an authorization gate, so this fails if the anonymous policy doesn't have `node:read`. We also can't check the `AuthToken` for the request for the client secret, because the client hasn't yet registered so the server doesn't have anything to compare against. Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the Consul response directly. Update the `registerNode` method to handle the list of servers we get back in the response; if we get a "no servers" or "no path to region" response we'll kick off discovery again and retry immediately rather than waiting 15s.
In #16217 we switched clients using Consul discovery to the `Status.Members` endpoint for getting the list of servers so that we're using the correct address. This endpoint has an authorization gate, so this fails if the anonymous policy doesn't have `node:read`. We also can't check the `AuthToken` for the request for the client secret, because the client hasn't yet registered so the server doesn't have anything to compare against. Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the Consul response directly. Update the `registerNode` method to handle the list of servers we get back in the response; if we get a "no servers" or "no path to region" response we'll kick off discovery again and retry immediately rather than waiting 15s.
In #16217 we switched clients using Consul discovery to the `Status.Members` endpoint for getting the list of servers so that we're using the correct address. This endpoint has an authorization gate, so this fails if the anonymous policy doesn't have `node:read`. We also can't check the `AuthToken` for the request for the client secret, because the client hasn't yet registered so the server doesn't have anything to compare against. Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the Consul response directly. Update the `registerNode` method to handle the list of servers we get back in the response; if we get a "no servers" or "no path to region" response we'll kick off discovery again and retry immediately rather than waiting 15s. Co-authored-by: Tim Gross <[email protected]>
In #16217 we switched clients using Consul discovery to the `Status.Members` endpoint for getting the list of servers so that we're using the correct address. This endpoint has an authorization gate, so this fails if the anonymous policy doesn't have `node:read`. We also can't check the `AuthToken` for the request for the client secret, because the client hasn't yet registered so the server doesn't have anything to compare against. Instead of hitting the `Status.Peers` or `Status.Members` RPC endpoint, use the Consul response directly. Update the `registerNode` method to handle the list of servers we get back in the response; if we get a "no servers" or "no path to region" response we'll kick off discovery again and retry immediately rather than waiting 15s. Co-authored-by: Tim Gross <[email protected]>
Fixes #16470
In #16217 we switched clients using Consul discovery to the
Status.Members
endpoint for getting the list of servers so that we're using the correct
address. This endpoint has an authorization gate, so this fails if the anonymous
policy doesn't have
node:read
. We also can't check theAuthToken
for therequest for the client secret, because the client hasn't yet registered so the
server doesn't have anything to compare against.
Instead of hitting the
Status.Peers
orStatus.Members
RPC endpoint, use theConsul response directly. Update the
registerNode
method to handle the list ofservers we get back in the response; if we get a "no servers" or "no path to
region" response we'll kick off discovery again and retry immediately rather
than waiting 15s.