While we were working on #16490, @schmichael and I had a discussion about how we might improve the current process for server discovery and failover with respect to `go-discover` and Consul. Brainstorming some potential improvements:
- Punt Consul discovery to `go-discover` and call back into `go-discover` whenever a client can't find a single healthy server. (Probably more effort than it's worth, with significant backward-compatibility risks.)
- Persist the server list to client state for faster registrations after node restarts. (This requires ensuring we don't increase our likelihood of hitting an error plus a long retry, which might increase the number of `down` nodes due to restarts instead of decreasing it!)
- Clients could request that servers extend their heartbeat on graceful shutdown. (2x the default? A bit risky.)
(4) is appealing as a bare minimum because our current logic predates RFCs and accreted new features in the hope that just a bit more code would make it more robust. It's left us in a situation where there's some potential for weird emergent behavior, and it's all very difficult to debug, explain, or test for correctness in suboptimal conditions.
The existing code conflates some distinct properties and operations:
- "Consul Discovery" is just kind of sprinkled throughout... mostly, I think, with the intention of using it when we've run out of other options, but I don't know how precise that is, and we may be placing far more trust in Consul's availability and correctness than we should!
- `go-discover` runs once, concurrently with Client startup, so I think it races with Consul Discovery and could overwrite a perfectly good server list with a stale one? Again... hard to reason about since there are so many concurrent operations.
This needs some discussion and design.