roachtest: inconsistency failed #97337

Closed
cockroach-teamcity opened this issue Feb 19, 2023 · 22 comments · Fixed by #97410
Labels: A-storage, branch-master, C-bug, C-test-failure, GA-blocker, O-roachtest, O-robot, T-storage

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 19, 2023

roachtest.inconsistency failed with artifacts on master @ 3d054f37c7c87f53cb56fac4e5500f0d1130d09a:

test artifacts and logs in: /artifacts/inconsistency/run_1
(cluster.go:1940).Run: output in run_095838.950466055_n1_cockroach-debug-rang: ./cockroach debug range-descriptors $(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_095838.958078383_n1_cockroach-debug-rang.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-24642

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Feb 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 19, 2023
@erikgrinaker
Contributor

@pavelkalinnikov Can you have a look?

(1) ./cockroach debug range-descriptors $(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) returned
  | stderr:
  | ERROR: pebble: database "/mnt/data1/cockroach/auxiliary/checkpoints/r1_at_601" does not exist
  | Failed running "debug range-descriptors"
This node is terminating because a replica inconsistency was detected between [n1,s1,r1/1:‹×›]
and its other replicas: (n1,s1):1,(n2,s2):2,(n3,s3):3. Please check your cluster-wide log files for more
information and contact the CockroachDB support team. It is not necessarily safe
to replace this node; cluster data may still be at risk of corruption.

A checkpoints directory to aid (expert) debugging should be present in:
‹×›

A file preventing this node from restarting was placed at:
‹×›

@cockroach-teamcity
Member Author

roachtest.inconsistency failed with artifacts on master @ e9c96e7179e19aae2f8d386f67eb950db8c3354b:

test artifacts and logs in: /artifacts/inconsistency/run_1
(cluster.go:1940).Run: output in run_094522.363151173_n1_cockroach-debug-rang: ./cockroach debug range-descriptors $(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_094522.370592238_n1_cockroach-debug-rang.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@erikgrinaker erikgrinaker added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 20, 2023
@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

Interesting

ERROR: pebble: database "/mnt/data1/cockroach/auxiliary/checkpoints/r1_at_601" does not exist

indicates that find found the folder, but Pebble did not see it?

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

Reproduces on a GCE worker. I can see the checkpoint folder, but cockroach debug range-descriptors pointed at it reports a "does not exist" error.

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

In my experiment:

$ ls /mnt/data1/cockroach/auxiliary/checkpoints/r1_at_589
000022.log  000025.sst  000026.log  000027.log  000028.log  000029.log  MANIFEST-000023  OPTIONS-000024  auxiliary  checkpoint.txt  marker.format-version.000001.012  marker.manifest.000001.MANIFEST-000023

$ ./cockroach debug range-descriptors /mnt/data1/cockroach/auxiliary/checkpoints/r1_at_589
ERROR: pebble: database "/mnt/data1/cockroach/auxiliary/checkpoints/r1_at_589" does not exist
Failed running "debug range-descriptors"

$ ./cockroach debug pebble db check /mnt/data1/cockroach/auxiliary/checkpoints/r1_at_589
checked 39111 points and 0 tombstone

The Pebble tool (cockroach debug pebble) opens this directory and does some work, but the CRDB tool fails to open it.

@cockroach-teamcity
Member Author

roachtest.inconsistency failed with artifacts on master @ dd2749ae4ab61eed2f99238acb74e8d3c6b4cb1d:

test artifacts and logs in: /artifacts/inconsistency/run_1
(cluster.go:1956).Run: output in run_094554.823242937_n1_cockroach-debug-rang: ./cockroach debug range-descriptors $(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_094554.828989952_n1_cockroach-debug-rang.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

I think we are hitting one of the new error branches from #97054:

return nil, errors.Errorf("pebble: database %q does not exist", cfg.StorageConfig.Dir)

Upd: confirmed that this PR introduced the problem, by running the roachtest before and after the commit.

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

@RaduBerinde Is the version file required in checkpoints as well?

Also, the "database does not exist" error sounds confusing. The directory does exist, but it apparently lacks the required file. Should the error be more specific? If it's a version problem, it should mention the version problem.

At first I thought it was this error. It would be nice if the errors weren't exactly the same.

@pav-kv pav-kv added the T-storage Storage Team label Feb 21, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Feb 21, 2023
@pav-kv pav-kv removed their assignment Feb 21, 2023
@erikgrinaker
Contributor

erikgrinaker commented Feb 21, 2023

Thanks for looking into it Pavel!

Btw, do you know what's up with this weird output here?

A checkpoints directory to aid (expert) debugging should be present in:
‹×›

A file preventing this node from restarting was placed at:
‹×›

Are these perhaps redaction markers or some such?

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

Where did you copy this from?

I'm seeing this:

...
F230221 09:10:43.460445 2509 kv/kvserver/replica_consistency.go:791 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 1738 +A checkpoints directory to aid (expert) debugging should be present in:
F230221 09:10:43.460445 2509 kv/kvserver/replica_consistency.go:791 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 1738 +‹/mnt/data1/cockroach/auxiliary›
F230221 09:10:43.460445 2509 kv/kvserver/replica_consistency.go:791 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 1738 +
F230221 09:10:43.460445 2509 kv/kvserver/replica_consistency.go:791 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 1738 +A file preventing this node from restarting was placed at:
F230221 09:10:43.460445 2509 kv/kvserver/replica_consistency.go:791 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 1738 +‹/mnt/data1/cockroach/auxiliary/_CRITICAL_ALERT.txt›
...

Upd: Ah right, I can see some redacted version of it too in run_1/logs/1.cockroach.log:

A checkpoints directory to aid (expert) debugging should be present in:
‹×›

A file preventing this node from restarting was placed at:
‹×›

Maybe you're seeing this ‹×› nonsense instead of ‹×›? The × is a Unicode symbol.

@erikgrinaker
Contributor

Yeah, thought so. You can unredact it by wrapping the format args in redact.Safe().
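
For readers unfamiliar with the marker scheme, here is a toy model of why the paths showed up as ‹×› and why marking an argument as safe fixes it. It deliberately does not use the real redact package: `markSafe`, `format`, and `redactMarked` are hypothetical stand-ins written only for this illustration, and the real library's behavior may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

// safe is a hypothetical stand-in for a value wrapped as "safe to log".
type safe struct{ v interface{} }

// markSafe plays the role that redact.Safe() plays in the thread.
func markSafe(v interface{}) safe { return safe{v} }

// format wraps every argument that is not marked safe in ‹…› markers.
func format(f string, args ...interface{}) string {
	out := make([]interface{}, len(args))
	for i, a := range args {
		if s, ok := a.(safe); ok {
			out[i] = s.v // safe values are emitted without markers
		} else {
			out[i] = fmt.Sprintf("‹%v›", a)
		}
	}
	return fmt.Sprintf(f, out...)
}

// marked matches one ‹…› span.
var marked = regexp.MustCompile(`‹[^›]*›`)

// redactMarked replaces each marked span with the ‹×› placeholder.
func redactMarked(s string) string {
	return marked.ReplaceAllString(s, "‹×›")
}

func main() {
	const msg = "checkpoints directory: %s"
	path := "/mnt/data1/cockroach/auxiliary"
	fmt.Println(redactMarked(format(msg, path)))           // path is redacted
	fmt.Println(redactMarked(format(msg, markSafe(path)))) // path survives
}
```

In this model the unmarked path is replaced by the placeholder, while the path wrapped with `markSafe` passes through unchanged, which is the effect the fix in #97379 is after.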

craig bot pushed a commit that referenced this issue Feb 21, 2023
97379: kvserver: unredact checkpoint paths in inconsistency message r=erikgrinaker a=pavelkalinnikov

Touches #97337
Release note: none
Epic: none

Co-authored-by: Pavel Kalinnikov <[email protected]>
@RaduBerinde
Member

Yes, we require the min version file to open a store. If the min version file isn't there but the store exists when we try to open it, we assume it is from a very old CRDB version which didn't write the min version file.

I agree the error message should be improved, I'll do that.

What do you think is the right thing here wrt checkpoints? Should we include this file in all checkpoints? Or should we treat debug differently and not require the file when opening?
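
To make the distinction discussed here concrete, below is a minimal sketch of a pre-open check that reports "directory missing" and "min version file missing" as two different errors. It is not the actual CockroachDB code: `checkStoreDir`, the two error values, and `demo` are hypothetical names, and the only detail taken from this thread is the STORAGE_MIN_VERSION file name.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// minVersionFilename is the file name mentioned in this thread.
const minVersionFilename = "STORAGE_MIN_VERSION"

// Hypothetical error values, one per failure mode.
var (
	errStoreMissing      = errors.New("store directory does not exist")
	errMinVersionMissing = errors.New("store directory exists but has no min version file")
)

// checkStoreDir reports which precondition for opening dir is not met.
func checkStoreDir(dir string) error {
	info, err := os.Stat(dir)
	if errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("%q: %w", dir, errStoreMissing)
	}
	if err != nil {
		return err
	}
	if !info.IsDir() {
		return fmt.Errorf("%q is not a directory", dir)
	}
	if _, err := os.Stat(filepath.Join(dir, minVersionFilename)); err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return fmt.Errorf("%q: %w", dir, errMinVersionMissing)
		}
		return err
	}
	return nil
}

// demo exercises the three cases against a temporary directory.
func demo() error {
	dir, err := os.MkdirTemp("", "checkpoint")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	if err := checkStoreDir(filepath.Join(dir, "absent")); !errors.Is(err, errStoreMissing) {
		return fmt.Errorf("absent directory: unexpected result %v", err)
	}
	if err := checkStoreDir(dir); !errors.Is(err, errMinVersionMissing) {
		return fmt.Errorf("missing file: unexpected result %v", err)
	}
	// The file content is a placeholder; this sketch only checks for presence.
	marker := filepath.Join(dir, minVersionFilename)
	if err := os.WriteFile(marker, []byte("placeholder"), 0o644); err != nil {
		return err
	}
	return checkStoreDir(dir)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("all three cases behaved as expected")
}
```

With separate error values, a caller such as a debug command could tell the user that the checkpoint directory is there but lacks the version file, instead of reporting that the database does not exist.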

@RaduBerinde
Member

Btw how was the checkpoint here created? From CRDB code (computeChecksumPostApply -> checkpoint)? Should be easy to copy the file in that code path.

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Feb 21, 2023
We require the min version filename to open a store. If it doesn't
exist, we assume the store doesn't exist which can lead to a confusing
message if just the min version file is missing. This commit makes
this message more useful.

Informs cockroachdb#97337

Release note: none
Epic: none
@RaduBerinde
Member

@pavelkalinnikov I posted #97410. As a workaround for now you can copy the STORAGE_MIN_VERSION from any recent CRDB store to the checkpoint directory.
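
The workaround amounts to a single file copy. A minimal sketch, assuming the file sits at the top level of the store directory: `copyMinVersionFile` and `demo` are hypothetical helpers, and the 0644 file mode is an assumption rather than something stated in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// minVersionFilename is the file name mentioned in this thread.
const minVersionFilename = "STORAGE_MIN_VERSION"

// copyMinVersionFile copies the min version file from storeDir into checkpointDir.
func copyMinVersionFile(storeDir, checkpointDir string) error {
	data, err := os.ReadFile(filepath.Join(storeDir, minVersionFilename))
	if err != nil {
		return fmt.Errorf("reading min version file: %w", err)
	}
	dst := filepath.Join(checkpointDir, minVersionFilename)
	if err := os.WriteFile(dst, data, 0o644); err != nil {
		return fmt.Errorf("writing min version file: %w", err)
	}
	return nil
}

// demo runs the copy between two temporary directories and verifies the result.
func demo() error {
	store, err := os.MkdirTemp("", "store")
	if err != nil {
		return err
	}
	defer os.RemoveAll(store)
	checkpoint, err := os.MkdirTemp("", "checkpoint")
	if err != nil {
		return err
	}
	defer os.RemoveAll(checkpoint)

	// Placeholder bytes; a real store holds an encoded version here.
	want := []byte("placeholder")
	if err := os.WriteFile(filepath.Join(store, minVersionFilename), want, 0o644); err != nil {
		return err
	}
	if err := copyMinVersionFile(store, checkpoint); err != nil {
		return err
	}
	got, err := os.ReadFile(filepath.Join(checkpoint, minVersionFilename))
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("copied content differs: %q", got)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("min version file copied")
}
```

In practice `storeDir` would be any recent CRDB store and `checkpointDir` a directory such as the r1_at_589 checkpoint shown earlier.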

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

Btw how was the checkpoint here created? From CRDB code (computeChecksumPostApply -> checkpoint)?

Yes, this path.

As a workaround for now you can copy the STORAGE_MIN_VERSION from any recent CRDB store to the checkpoint directory.

I think your PR #97410 will fix this, since you're writing the min version file in the checkpoint path that we use. If you don't mind, you could run this to make sure it fixes the issue (on a GCE worker):

dev build --cross cockroach
roachtest run --cockroach=artifacts/cockroach inconsistency --debug

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

Any checkpoints created before this fix will still be "broken", though. But the existence of such a checkpoint indicates a bigger problem (an inconsistency) in the first place, so we would already be aware of any that exist in prod. We don't know of any for now.

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

@RaduBerinde Is there any indication in the manifest etc. of whether the store is a checkpoint? Maybe we could skip requiring the min version file when reading checkpoints.

Alternatively, we could add a "skip version check" kind of option into cockroach debug, to guarantee that even checkpoints from the past can be opened.

@RaduBerinde
Member

I don't think there is. In any case, the main reason for requiring the min version file was to avoid opening the Pebble store at all, because opening old versions can corrupt state in some cases (more details in #42653, #89836).

@pav-kv
Collaborator

pav-kv commented Feb 21, 2023

@RaduBerinde Is this a problem only when the storage is opened in read/write mode, or would read-only be problematic too? Would it be (somewhat) ok to not require this file in read-only mode?

@RaduBerinde
Member

Interesting idea... but I'm not sure. The WAL still needs to be replayed (even if the results aren't flushed), and the WAL replay can call back into Cockroach with certain expectations. I'd guess we would at least be safe from corrupting the store though.

@RaduBerinde RaduBerinde self-assigned this Feb 21, 2023
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Feb 21, 2023
We now write out the min version file with checkpoints generated
through the storage layer. This allows checkpoints to be opened with
the cockroach storage layer (not just with the low level pebble tool).

Informs cockroachdb#97337

Release note: None
Epic: none
@RaduBerinde
Member

If you don't mind, you could run this to make sure it fixes the issue (on a GCE worker):

dev build --cross cockroach
roachtest run --cockroach=artifacts/cockroach inconsistency --debug

Done, thank you!

@cockroach-teamcity
Member Author

roachtest.inconsistency failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:

test artifacts and logs in: /artifacts/inconsistency/run_1
(cluster.go:1956).Run: output in run_101054.536391220_n1_cockroach-debug-rang: ./cockroach debug range-descriptors $(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101054.542482014_n1_cockroach-debug-rang.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 22, 2023
@craig craig bot closed this as completed in 2431b4b Feb 22, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024