kvcoord: fail-fast when all replicas of a range are unavailable #74503

Open
tbg opened this issue Jan 6, 2022 · 1 comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

tbg (Member) commented Jan 6, 2022

With #33007, when a range loses quorum, SQL clients will generally see fail-fast behavior: access to the unavailable range immediately returns an error instead of hanging indefinitely (as it does in 21.2 and earlier). However, when a range has lost all of its replicas (or all of its replicas are unreachable), I believe that DistSender will keep retrying forever:

  • look up descriptor (say r1/1 r1/2 r1/3)
  • try r1/1 (fail)
  • try r1/2 (fail)
  • try r1/3 (fail)
  • hit a SendError
  • eject descriptor & re-lookup, goto beginning

While we do try to be resilient to network blips, there is probably value in a heuristic: once a request has been attempted twice against each possible replica, it's time to give up.
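To make the heuristic concrete, here is a minimal sketch of a retry loop with a per-replica attempt budget. It is not DistSender code: `ReplicaID`, `sendWithBudget`, `errUnreachable`, the lookup and send callbacks, and the fields of `RangeUnavailableError` are all hypothetical names chosen for illustration, and the budget of two attempts per replica simply mirrors the number suggested above.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaID names a replica, e.g. "r1/1". Hypothetical type for this sketch.
type ReplicaID string

// RangeUnavailableError stands in for the error proposed in this issue.
// The Attempts field is an assumption, not the real definition.
type RangeUnavailableError struct {
	Attempts map[ReplicaID]int
}

func (e *RangeUnavailableError) Error() string {
	return fmt.Sprintf("range unavailable: gave up after trying %d replica(s)", len(e.Attempts))
}

// errUnreachable simulates a failed attempt against a single replica.
var errUnreachable = errors.New("replica unreachable")

// maxAttemptsPerReplica is the budget suggested in the issue text.
const maxAttemptsPerReplica = 2

// sendWithBudget looks up the replica set, tries each replica in turn, and
// re-looks-up the descriptor after a full pass of failures. It gives up once
// a lookup yields no replica with remaining budget.
func sendWithBudget(lookup func() []ReplicaID, send func(ReplicaID) error) error {
	attempts := map[ReplicaID]int{}
	for {
		tried := false
		for _, r := range lookup() {
			if attempts[r] >= maxAttemptsPerReplica {
				continue // budget for this replica is spent
			}
			attempts[r]++
			tried = true
			if err := send(r); err == nil {
				return nil
			}
		}
		if !tried {
			// Every known replica has used up its attempts: fail fast.
			return &RangeUnavailableError{Attempts: attempts}
		}
	}
}

func main() {
	lookup := func() []ReplicaID { return []ReplicaID{"r1/1", "r1/2", "r1/3"} }
	sends := 0
	err := sendWithBudget(lookup, func(ReplicaID) error {
		sends++
		return errUnreachable
	})
	var rue *RangeUnavailableError
	fmt.Println(errors.As(err, &rue), sends, err)
}
```

Counting attempts per replica rather than per pass means that a descriptor refresh which surfaces new replicas gives those replicas their own budget, while replicas that have already failed twice are not retried. A real implementation would presumably also need backoff and context cancellation, which this sketch leaves out.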

We will want to return a RangeUnavailableError in this case (similar to #74500) and have similar SQL UX (#74502).

Jira issue: CRDB-12121

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Jan 6, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jan 6, 2022

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 6, 2022
@tbg tbg self-assigned this Jan 14, 2022
@tbg tbg removed their assignment May 31, 2022
@iAaronBuck iAaronBuck self-assigned this Apr 13, 2023
iAaronBuck added a commit to iAaronBuck/cockroach that referenced this issue Apr 26, 2023
…avior

Previously, the RangeUnavailableError did not exist,
and the DistSender would continue infinitely if
no responses from replicas representing a range
were received, as noted in the fail-fast DistSender
work requested in cockroachdb#74503. This fail-fast work also
requires a RangeUnavailableError. Therefore, this PR
resolves that unmet requirement by introducing the error
along with creating the fail-fast behavior within DistSender.

Epic: none
Release note (performance improvement): introduces the
RangeUnavailableError and enables fail-fast behavior in
DistSender.