Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: handle ReplicaUnavailableError #74504

Closed
tbg opened this issue Jan 6, 2022 · 2 comments
Closed

kvcoord: handle ReplicaUnavailableError #74504

tbg opened this issue Jan 6, 2022 · 2 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-kv KV Team X-stale

Comments

@tbg
Copy link
Member

tbg commented Jan 6, 2022

Related to #33007.

Once #71806 lands, replicas (regardless of Raft and Lease status) unable to access the replication layer will fail-fast requests. #74500 will allow DistSender to detect these errors and we could make use of that to improve the UX in cases in which DistSender has speculatively tried a Replica, received a fail-fast error, but has other options.
In particular, we may want to avoid returning fail-fast errors up the stack for follower reads, since we can always contact the leaseholder. On the other hand, one of the common reasons for fail-fast errors is loss of quorum, and this will affect all replicas in the same way and trying the leaseholder only adds latency, which may be undesirable. There is a trade-off to make here between hoping to get a result yet and letting the client handle the fail-fast error as soon as possible. Thinking this through a bit, it is almost as though we really want DistSender to have its own notion of circuit breaking on a per-Range basis, where an async probe periodically checks whether it can contact the Range. This is related

We may want to make sure to try other replicas when receiving a fail-fast error and only return when we've exhausted all potential options. But we must also avoid infinite redirect loops: if the leaseholder returns fail-fast errors, and the other replicas return a NotLeaseholderError pointing back at the leaseholder, we need to propagate the failure upwards. This is somewhat similar to #74503 (the case in which all replicas are gone and DistSender retries forever anyway).

Jira issue: CRDB-12122

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. labels Jan 6, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jan 6, 2022
@tbg
Copy link
Member Author

tbg commented Jan 6, 2022

The error will be returned as br.Error here:

default:
if ambiguousError != nil {
return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
}
// The error received is likely not specific to this
// replica, so we should return it instead of trying other
// replicas.
return br, nil
}

which would be the starting point for making behavioral changes.

tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Extracted from cockroachdb#71806.

For cockroachdb#33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (cockroachdb#74500) that DistSender
(cockroachdb#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (cockroachdb#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed, so reviewers
should mainly focus on that the scaffolding is idiomatic.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in cockroachdb#74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Extracted from cockroachdb#71806.

For cockroachdb#33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (cockroachdb#74500) that DistSender
(cockroachdb#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (cockroachdb#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed, but I had some
trouble implementing the redaction-related interfaces idiomatically
so I'm sending this out for a separate review.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in cockroachdb#74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None
@tbg tbg self-assigned this Jan 14, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 17, 2022
Extracted from cockroachdb#71806.

For cockroachdb#33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (cockroachdb#74500) that DistSender
(cockroachdb#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (cockroachdb#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed, but I had some
trouble implementing the redaction-related interfaces idiomatically
so I'm sending this out for a separate review.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in cockroachdb#74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 17, 2022
Extracted from cockroachdb#71806.

For cockroachdb#33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (cockroachdb#74500) that DistSender
(cockroachdb#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (cockroachdb#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed; we might find a
reason to do so during cockroachdb#74502.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in cockroachdb#74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 18, 2022
Extracted from cockroachdb#71806.

For cockroachdb#33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (cockroachdb#74500) that DistSender
(cockroachdb#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (cockroachdb#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed; we might find a
reason to do so during cockroachdb#74502.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in cockroachdb#74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None
craig bot pushed a commit that referenced this issue Jan 18, 2022
73785: kvserver/loqrecovery: record and post replica recovery events r=tbg,erikgrinaker a=aliher1911

Previously when replica is rewritten offline to recover from loss
of quorum, no trace of this event was kept in the system.
This is not ideal as it is not possible to understand what happened
unless person performing recovery documented its actions and then
informed person doing investigation.
This diff adds records of such actions. Records are created offline
in store and then propagated to server logs when node is restarted.

Release note: None

Fixes #73281

74698: roachpb: introduce ReplicaUnavailableError r=knz a=tbg

Extracted from #71806.

For #33007, we are introducing per-Replica circuit breakers. When
tripped, they should return a structured error (#74500) that DistSender
(#74504) can handle and that the SQL layer can interpret and render
nicely in terms of tables and indexes (#74502).

This PR introduces this error and the necessary plumbing. It will be
easy to adjust the actual data within it with needed; we might find a
reason to do so during #74502.

I want to acknowledge that I am aware of the existence of
`sqlerrors.NewRangeUnavailableError`; there's possibly some interplay
between these two coming up in #74502 but we definitely need a
structured error that is emitted from KV (and in particular doesn't
cause wild import cycles) that is per *Replica*, as we envision (though
not initially) scenarios in which the range as a hole remains available
but individual Replicas are not.

Release note: None

74895: dev: add label-merged-pr to build command r=rail a=jlinder

Give dev the ability to build the label-merged-pr binary.

Release note: None

74925: sql/row: remove descriptor return values from fetcher r=RaduBerinde a=RaduBerinde

The fetcher's next-row methods return the table and index descriptor.
This was only useful for interleaved tables which no longer exist.
This commit cleans up the code to remove these unnecessary return
values.

Release note: None

74927: opt: do not error if distribute expression is a no-op r=rytaft a=rytaft

This commit removes an assertion that caused an internal error when
a distribute operator was unnecessary. This assertion made the incorrect
assumption that it should always be possible to remove no-op enforcers.
However, it is not possible to completely remove no-op distribute
operators when there is also a sort required. This due to the way
physical properties are enforced, in which distribution is enforced
before ordering. Therefore, a no-op distribute operator should not
cause an assertion failure.

Fixes #74890

There is no release note since this bug is not part of any release.

Release note: None

74929: sql: fix ALTER DATABASE help text r=rytaft a=rytaft

Prior to this commit, the CLI help text shown for `ALTER DATABASE`
was incomplete and incorrect. This commit fixes the help text to
include some omitted options and fixes `ADD REGIONS` and `DROP REGIONS`
to say `ADD REGION` and `DROP REGION`.

Release note (cli change): Fixed the CLI help text for `ALTER DATABASE`
to show correct options for `ADD REGION` and `DROP REGION`, and include
some missing options such as `CONFIGURE ZONE`.

74931: label-merged-pr: don't exit on finding ref w/o a version r=rail a=jlinder

Before, if a git ref was in a version tag, label-merged-pr would simply
exit saying it couldn't find a version for the ref. However, it should
be possible to run the command with a set of commits that are not yet in
a released version.

Now, it simply says the ref is not yet in a released version and
continues with processing commits.

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
@tbg tbg removed their assignment May 31, 2022
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-kv KV Team X-stale
Projects
No open projects
Archived in project
Development

No branches or pull requests

1 participant