rpc: grpc-gateway loopback conn mistakenly uses onlyOnceDialer and causes sticky permanent RPC errors #103762
Labels
A-observability-inf
A-server-networking
Pertains to network addressing, routing, initialization
backport-23.1.x
Flags PRs that need to be backported to 23.1
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
regression
Regression from a release.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Describe the problem
The grpc-gateway connection (used by our HTTP interfaces) uses a loopback connector. In v23.1 this connector is mistakenly configured to use "onlyOnceDialer", a mechanism through which a connection is not re-attempted once it has failed.
The result is that when a cluster is overloaded, the loopback connection may fail once (due to a timeout) and then keep failing until the node is restarted, making most of the HTTP interfaces unusable.
xref #103692 (comment)
xref #99261 (comment)
To Reproduce
Overload a v23.1 cluster and keep using the HTTP interfaces until a request fails once.
From then on, the failure persists until the node is restarted.
Expected behavior
The loopback connection should be retried if it fails (i.e. it should not use onlyOnceDialer).
Jira issue: CRDB-28178
Epic: CRDB-28893