kvcoord: handle ReplicaUnavailableError #74504

tbg · 2022-01-06T09:48:46Z

Related to #33007.

Once #71806 lands, replicas (regardless of Raft and Lease status) unable to access the replication layer will fail-fast requests. #74500 will allow DistSender to detect these errors and we could make use of that to improve the UX in cases in which DistSender has speculatively tried a Replica, received a fail-fast error, but has other options.
In particular, we may want to avoid returning fail-fast errors up the stack for follower reads, since we can always contact the leaseholder. On the other hand, one of the common reasons for fail-fast errors is loss of quorum, and this will affect all replicas in the same way and trying the leaseholder only adds latency, which may be undesirable. There is a trade-off to make here between hoping to get a result yet and letting the client handle the fail-fast error as soon as possible. Thinking this through a bit, it is almost as though we really want DistSender to have its own notion of circuit breaking on a per-Range basis, where an async probe periodically checks whether it can contact the Range. This is related

We may want to make sure to try other replicas when receiving a fail-fast error and only return when we've exhausted all potential options. But we must also avoid infinite redirect loops: if the leaseholder returns fail-fast errors, and the other replicas return a NotLeaseholderError pointing back at the leaseholder, we need to propagate the failure upwards. This is somewhat similar to #74503 (the case in which all replicas are gone and DistSender retries forever anyway).

Jira issue: CRDB-12122

The text was updated successfully, but these errors were encountered:

tbg · 2022-01-06T10:10:27Z

The error will be returned as br.Error here:

cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go

Lines 2148 to 2157 in d54e0dd

    
           default: 
        
           	if ambiguousError != nil { 
        
           		return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError) 
        
           	} 
        
           	// The error received is likely not specific to this 
        
           	// replica, so we should return it instead of trying other 
        
           	// replicas. 
        
           	return br, nil 
        
           }

which would be the starting point for making behavioral changes.

Extracted from cockroachdb#71806. For cockroachdb#33007, we are introducing per-Replica circuit breakers. When tripped, they should return a structured error (cockroachdb#74500) that DistSender (cockroachdb#74504) can handle and that the SQL layer can interpret and render nicely in terms of tables and indexes (cockroachdb#74502). This PR introduces this error and the necessary plumbing. It will be easy to adjust the actual data within it with needed, so reviewers should mainly focus on that the scaffolding is idiomatic. I want to acknowledge that I am aware of the existence of `sqlerrors.NewRangeUnavailableError`; there's possibly some interplay between these two coming up in cockroachdb#74502 but we definitely need a structured error that is emitted from KV (and in particular doesn't cause wild import cycles) that is per *Replica*, as we envision (though not initially) scenarios in which the range as a hole remains available but individual Replicas are not. Release note: None

Extracted from cockroachdb#71806. For cockroachdb#33007, we are introducing per-Replica circuit breakers. When tripped, they should return a structured error (cockroachdb#74500) that DistSender (cockroachdb#74504) can handle and that the SQL layer can interpret and render nicely in terms of tables and indexes (cockroachdb#74502). This PR introduces this error and the necessary plumbing. It will be easy to adjust the actual data within it with needed, but I had some trouble implementing the redaction-related interfaces idiomatically so I'm sending this out for a separate review. I want to acknowledge that I am aware of the existence of `sqlerrors.NewRangeUnavailableError`; there's possibly some interplay between these two coming up in cockroachdb#74502 but we definitely need a structured error that is emitted from KV (and in particular doesn't cause wild import cycles) that is per *Replica*, as we envision (though not initially) scenarios in which the range as a hole remains available but individual Replicas are not. Release note: None

Extracted from cockroachdb#71806. For cockroachdb#33007, we are introducing per-Replica circuit breakers. When tripped, they should return a structured error (cockroachdb#74500) that DistSender (cockroachdb#74504) can handle and that the SQL layer can interpret and render nicely in terms of tables and indexes (cockroachdb#74502). This PR introduces this error and the necessary plumbing. It will be easy to adjust the actual data within it with needed; we might find a reason to do so during cockroachdb#74502. I want to acknowledge that I am aware of the existence of `sqlerrors.NewRangeUnavailableError`; there's possibly some interplay between these two coming up in cockroachdb#74502 but we definitely need a structured error that is emitted from KV (and in particular doesn't cause wild import cycles) that is per *Replica*, as we envision (though not initially) scenarios in which the range as a hole remains available but individual Replicas are not. Release note: None

73785: kvserver/loqrecovery: record and post replica recovery events r=tbg,erikgrinaker a=aliher1911 Previously when replica is rewritten offline to recover from loss of quorum, no trace of this event was kept in the system. This is not ideal as it is not possible to understand what happened unless person performing recovery documented its actions and then informed person doing investigation. This diff adds records of such actions. Records are created offline in store and then propagated to server logs when node is restarted. Release note: None Fixes #73281 74698: roachpb: introduce ReplicaUnavailableError r=knz a=tbg Extracted from #71806. For #33007, we are introducing per-Replica circuit breakers. When tripped, they should return a structured error (#74500) that DistSender (#74504) can handle and that the SQL layer can interpret and render nicely in terms of tables and indexes (#74502). This PR introduces this error and the necessary plumbing. It will be easy to adjust the actual data within it with needed; we might find a reason to do so during #74502. I want to acknowledge that I am aware of the existence of `sqlerrors.NewRangeUnavailableError`; there's possibly some interplay between these two coming up in #74502 but we definitely need a structured error that is emitted from KV (and in particular doesn't cause wild import cycles) that is per *Replica*, as we envision (though not initially) scenarios in which the range as a hole remains available but individual Replicas are not. Release note: None 74895: dev: add label-merged-pr to build command r=rail a=jlinder Give dev the ability to build the label-merged-pr binary. Release note: None 74925: sql/row: remove descriptor return values from fetcher r=RaduBerinde a=RaduBerinde The fetcher's next-row methods return the table and index descriptor. This was only useful for interleaved tables which no longer exist. This commit cleans up the code to remove these unnecessary return values. Release note: None 74927: opt: do not error if distribute expression is a no-op r=rytaft a=rytaft This commit removes an assertion that caused an internal error when a distribute operator was unnecessary. This assertion made the incorrect assumption that it should always be possible to remove no-op enforcers. However, it is not possible to completely remove no-op distribute operators when there is also a sort required. This due to the way physical properties are enforced, in which distribution is enforced before ordering. Therefore, a no-op distribute operator should not cause an assertion failure. Fixes #74890 There is no release note since this bug is not part of any release. Release note: None 74929: sql: fix ALTER DATABASE help text r=rytaft a=rytaft Prior to this commit, the CLI help text shown for `ALTER DATABASE` was incomplete and incorrect. This commit fixes the help text to include some omitted options and fixes `ADD REGIONS` and `DROP REGIONS` to say `ADD REGION` and `DROP REGION`. Release note (cli change): Fixed the CLI help text for `ALTER DATABASE` to show correct options for `ADD REGION` and `DROP REGION`, and include some missing options such as `CONFIGURE ZONE`. 74931: label-merged-pr: don't exit on finding ref w/o a version r=rail a=jlinder Before, if a git ref was in a version tag, label-merged-pr would simply exit saying it couldn't find a version for the ref. However, it should be possible to run the command with a set of commits that are not yet in a released version. Now, it simply says the ref is not yet in a released version and continues with processing commits. Release note: None Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: James H. Linder <[email protected]> Co-authored-by: Radu Berinde <[email protected]> Co-authored-by: Rebecca Taft <[email protected]>

github-actions · 2023-11-23T11:04:51Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. labels Jan 6, 2022

blathers-crl bot added the T-kv KV Team label Jan 6, 2022

This was referenced Jan 6, 2022

kvserver: simplify lease forwarding constraints #74506

Closed

roachpb: introduce ReplicaUnavailableError #74698

Merged

tbg self-assigned this Jan 14, 2022

tbg mentioned this issue Jan 14, 2022

kvserver: circuit-break requests to unavailable ranges #33007

Closed

tbg removed their assignment May 31, 2022

github-actions bot added the no-issue-activity label Nov 23, 2023

github-actions bot added the X-stale label Dec 4, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2023

exalate-issue-sync bot closed this as completed Dec 4, 2023

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvcoord: handle ReplicaUnavailableError #74504

kvcoord: handle ReplicaUnavailableError #74504

tbg commented Jan 6, 2022 •

edited by cockroach-jira-scripts

Loading

tbg commented Jan 6, 2022

github-actions bot commented Nov 23, 2023

kvcoord: handle ReplicaUnavailableError #74504

kvcoord: handle ReplicaUnavailableError #74504

Comments

tbg commented Jan 6, 2022 • edited by cockroach-jira-scripts Loading

tbg commented Jan 6, 2022

github-actions bot commented Nov 23, 2023

tbg commented Jan 6, 2022 •

edited by cockroach-jira-scripts

Loading