-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: do not report diff in consistency checks #89502
Conversation
b3877df
to
ca91e2c
Compare
This probably can be split into 3 commits:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)
pkg/kv/kvserver/replica_consistency.go
line 243 at r1 (raw file):
if args.Checkpoint { // A diff was already printed. Return because all the code below will do // is request another consistency check, with a diff and with
comment needs update
pkg/kv/kvserver/replica_consistency.go
line 483 at r1 (raw file):
// ChecksumRangeForTesting returns a checksum over the KV data of the given // range. Only for testing.
Shouldn't this live in helpers_test.go
so that it's not in production code?
aa0dd90
to
523901b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvserver/replica_consistency.go
line 243 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
comment needs update
Done. Also updated a few other comments.
pkg/kv/kvserver/replica_consistency.go
line 483 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Shouldn't this live in
helpers_test.go
so that it's not in production code?
Good idea. This PR no longer modifies this func, so will address it in the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for now. We should consider extending the fatal error message with some instructions on how to go about debugging the inconsistency, once we figure out the tooling.
Why aren't we removing replica_consistency_diff.go
? We should no longer need it, and if we want to add similar tooling we can dig it out from the Git history.
I reduced this PR to only remove the diff reporting behaviour (+ proto fields that support it). |
I did that first, but then decided to separate it for ease of review. Also, while thinking about the best tooling support, we may need diffs back; so will remove them a bit later if we indeed don't need them. Agree that it could be restored from git, but decided to take smaller steps here (don't think either way is particularly better). |
This commit removes reporting of the diff between replicas in case of consistency check failures. With the increased range sizes the previous approach has become infeasible. One possible way to inspect an inconsistency after this change is: 1. Run `cockroach debug range-data` tool to extract the range data from each replica's checkpoint. 2. Use standard OS tools like `diff` to analyse them. In the meantime, we are researching the UX of this alternative approach, and seeing if there can be a better tooling support. Release note (sql change): The crdb_internal.check_consistency function now does not include the diff between inconsistent replicas, should they occur. If an inconsistency occurs, the storage engine checkpoints should be inspected. This change is made because previously the range size limit has been increased from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.
523901b
to
ec704f6
Compare
bors r=erikgrinaker,tbg |
Build failed (retrying...): |
Build succeeded: |
90127: roachpb: remove stale `timestamp.proto` import r=erikgrinaker a=erikgrinaker Fixes build warnings of the form: ``` roachpb/internal_raft.proto: warning: Import util/hlc/timestamp.proto but not used. ``` Likely fallout from #89502. Epic: None Release note: None Co-authored-by: Erik Grinaker <[email protected]>
This commit removes reporting of the diff between replicas in case of consistency check failures. With the increased range sizes the previous approach has become infeasible.
One possible way to inspect an inconsistency after this change is:
cockroach debug range-data
tool to extract the range data from each replica's checkpoint.diff
to analyse them.In the meantime, we are researching the UX of this alternative approach, and seeing if there can be a better tooling support.
Part of #21128
Epic: none
Release note (sql change): The
crdb_internal.check_consistency
function now does not include the diff between inconsistent replicas, should they occur. If an inconsistency occurs, the storage engine checkpoints should be inspected. This change is made because previously the range size limit has been increased from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.