go-discover is only used during initial client introduction #17872

tgross · 2023-07-10T15:04:52Z

While we were working on #16490, @schmichael and I had a discussion about how we might improve the current process for server discovery and failover with respect to go-discover and Consul. Brainstorming some potential improvements:

1. Punt Consul discovery to go-discover and call back into go-discover whenever a client can't find a single healthy server. (Probably more effort than it's worth, with significant backward compat risks.)
2. Persist the server list to client state for faster registrations after node restarts. (This requires ensuring we don't increase our likelihood of hitting an error + long retry, which might increase the number of down nodes due to restarts instead of decreasing it!)
3. Clients could request that servers extend their heartbeat on graceful shutdown (2x the default? a bit risky)
4. Actually document/design client disco/retry logic?

(4) is appealing as a bare minimum because our current logic predates RFCs and accreted new features in hopes just a bit more code would make it more robust. It's left us in a situation where there's some potential for weird emergent behavior, and it's all very difficult to debug, explain, or test for correctness in suboptimal conditions.
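As a rough illustration of idea (1), here is a minimal sketch of "only re-run discovery when no known server is healthy". Everything in it is hypothetical: `DiscoverFunc`, `ServerList`, and the `healthy` probe are stand-ins invented for this example, not Nomad's or go-discover's actual types, and the sketch ignores concurrency entirely.

```go
package main

import (
	"errors"
	"fmt"
)

// DiscoverFunc is a hypothetical stand-in for any discovery source
// (for example a go-discover lookup or a Consul query).
type DiscoverFunc func() ([]string, error)

// ServerList is a hypothetical holder for the servers a client knows about.
type ServerList struct {
	addrs   []string
	healthy func(addr string) bool // hypothetical health probe
}

// HealthyServers returns the known servers that pass the health probe.
func (s *ServerList) HealthyServers() []string {
	var out []string
	for _, a := range s.addrs {
		if s.healthy(a) {
			out = append(out, a)
		}
	}
	return out
}

// Refresh re-runs discovery only when no known server is healthy, so a
// working list is never overwritten by a possibly stale one. Sources are
// tried in order and the first non-empty result replaces the list.
func (s *ServerList) Refresh(sources ...DiscoverFunc) error {
	if len(s.HealthyServers()) > 0 {
		return nil
	}
	var lastErr error
	for _, src := range sources {
		addrs, err := src()
		if err != nil {
			lastErr = err
			continue
		}
		if len(addrs) > 0 {
			s.addrs = addrs
			return nil
		}
	}
	if lastErr != nil {
		return fmt.Errorf("discovery failed: %w", lastErr)
	}
	return errors.New("discovery returned no servers")
}
```

A caller would invoke `Refresh` from whatever code path notices that RPCs are failing, passing the discovery sources in priority order. Making that ordering explicit is the point of the sketch: it turns "which source wins" into a decision written down in one place rather than the outcome of a race.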

The existing code conflates some distinct properties and operations:

- "Consul Discovery" is just kind of sprinkled throughout... mostly, I think, with the intention of using it when we've run out of other options, but I don't know how precise that is, and we may be placing far more trust in Consul's availability and correctness than we should!
- go-discover runs once, concurrently with Client startup, so I think it races with Consul Discovery and could overwrite a perfectly good server list with a stale one? Again... hard to reason about since there are so many concurrent operations.
- Our discovery is concurrent with registration, which is concurrent with other RPCs, and that causes tons of complexity. See "core: enforce strict steps for clients reconnect" #15808.

This needs some discussion and design.

tgross · 2023-10-26T15:03:20Z

Adding a note here that we're also adding go-netaddrs support in #18745.

tgross added type/enhancement theme/discovery theme/client stage/needs-discussion labels Jul 10, 2023

mmcquillan added the hcc/jira label Jun 24, 2024

tgross mentioned this issue Jul 11, 2024: Nomad client does not connect to the Consul server if it was rebootstrapped #23541 (Closed)