Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: do not report diff in consistency checks #89502

Merged
merged 1 commit into from
Oct 10, 2022

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Oct 6, 2022

This commit removes reporting of the diff between replicas in case of consistency check failures. With the increased range sizes the previous approach has become infeasible.

One possible way to inspect an inconsistency after this change is:

  1. Run cockroach debug range-data tool to extract the range data from each replica's checkpoint.
  2. Use standard OS tools like diff to analyse them.

In the meantime, we are researching the UX of this alternative approach, and seeing if there can be a better tooling support.

Part of #21128
Epic: none

Release note (sql change): The crdb_internal.check_consistency function now does not include the diff between inconsistent replicas, should they occur. If an inconsistency occurs, the storage engine checkpoints should be inspected. This change is made because previously the range size limit has been increased from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.

@pav-kv pav-kv requested review from erikgrinaker and tbg October 6, 2022 16:50
@pav-kv pav-kv requested a review from a team as a code owner October 6, 2022 16:50
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv pav-kv force-pushed the rm_consistency_check_diff branch from b3877df to ca91e2c Compare October 6, 2022 17:06
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 6, 2022

This probably can be split into 3 commits:

  • Delete diff reporting in the consistency check report. This is a short one, but the one that changes behaviour.
  • Remove diffing code, as it's used only for testing.
  • Remove the "snapshot" code, as it's used only for testing.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 15 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)


pkg/kv/kvserver/replica_consistency.go line 243 at r1 (raw file):

	if args.Checkpoint {
		// A diff was already printed. Return because all the code below will do
		// is request another consistency check, with a diff and with

comment needs update


pkg/kv/kvserver/replica_consistency.go line 483 at r1 (raw file):

// ChecksumRangeForTesting returns a checksum over the KV data of the given
// range. Only for testing.

Shouldn't this live in helpers_test.go so that it's not in production code?

@pav-kv pav-kv force-pushed the rm_consistency_check_diff branch 2 times, most recently from aa0dd90 to 523901b Compare October 10, 2022 12:12
@pav-kv pav-kv changed the title kvserver: remove consistency check diffs kvserver: do not report diff in consistency checks Oct 10, 2022
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/kv/kvserver/replica_consistency.go line 243 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

comment needs update

Done. Also updated a few other comments.


pkg/kv/kvserver/replica_consistency.go line 483 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Shouldn't this live in helpers_test.go so that it's not in production code?

Good idea. This PR no longer modifies this func, so will address it in the next one.

@pav-kv pav-kv requested a review from tbg October 10, 2022 12:16
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for now. We should consider extending the fatal error message with some instructions on how to go about debugging the inconsistency, once we figure out the tooling.

Why aren't we removing replica_consistency_diff.go? We should no longer need it, and if we want to add similar tooling we can dig it out from the Git history.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 10, 2022

I reduced this PR to only remove the diff reporting behaviour (+ proto fields that support it).

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 10, 2022

Why aren't we removing replica_consistency_diff.go? We should no longer need it, and if we want to add similar tooling we can dig it out from the Git history.

I did that first, but then decided to separate it for ease of review. Also, while thinking about the best tooling support, we may need diffs back; so will remove them a bit later if we indeed don't need them. Agree that it could be restored from git, but decided to take smaller steps here (don't think either way is particularly better).

This commit removes reporting of the diff between replicas in case of
consistency check failures. With the increased range sizes the previous
approach has become infeasible.

One possible way to inspect an inconsistency after this change is:
1. Run `cockroach debug range-data` tool to extract the range data from
   each replica's checkpoint.
2. Use standard OS tools like `diff` to analyse them.

In the meantime, we are researching the UX of this alternative approach, and
seeing if there can be a better tooling support.

Release note (sql change): The crdb_internal.check_consistency function now
does not include the diff between inconsistent replicas, should they occur. If
an inconsistency occurs, the storage engine checkpoints should be inspected.
This change is made because previously the range size limit has been increased
from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.
@pav-kv pav-kv force-pushed the rm_consistency_check_diff branch from 523901b to ec704f6 Compare October 10, 2022 14:06
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 10, 2022

bors r=erikgrinaker,tbg

@craig
Copy link
Contributor

craig bot commented Oct 10, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 10, 2022

Build succeeded:

@craig craig bot merged commit 94e2a2b into cockroachdb:master Oct 10, 2022
@pav-kv pav-kv deleted the rm_consistency_check_diff branch October 10, 2022 17:22
craig bot pushed a commit that referenced this pull request Oct 19, 2022
90127: roachpb: remove stale `timestamp.proto` import r=erikgrinaker a=erikgrinaker

Fixes build warnings of the form:

```
roachpb/internal_raft.proto: warning: Import util/hlc/timestamp.proto but not used.
```

Likely fallout from #89502.

Epic: None

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants