Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: acceptance/version-upgrade failed #99894

Closed
cockroach-teamcity opened this issue Mar 29, 2023 · 11 comments
Closed

roachtest: acceptance/version-upgrade failed #99894

cockroach-teamcity opened this issue Mar 29, 2023 · 11 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 29, 2023

roachtest.acceptance/version-upgrade failed with artifacts on master @ bb5c97b5d2073a27e7ec88654b05ac913112f610:

test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
(mixedversion.go:366).Run: mixed-version test failure while running step 29 (run "test schema change step"): output in run_052958.923353501_n1_workload-run-schemac: ./workload run schemachange {pgurl:1-4} --concurrency 2 --max-ops 10 --verbose 1 returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_052959.759759262_n1_workload-run-schemac.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-26117

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 29, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 29, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Mar 29, 2023
@renatolabs
Copy link
Contributor

Same familiar error:

"message": "***UNEXPECTED ERROR; Failed to generate a random operation: ERROR: operation \"show cluster setting version\" timed out after 2m0.001s (given timeout 2m0s): value differs between local setting ([18 8 8 22 16 2 24 0 32 11]) and KV ([18 8 8 22 16 2 24 0 32 10]); try again later (\u003cnil\u003e after 1m59.141621974s) (SQLSTATE XXUUU)",

(and once again, no migrations around the time of failure are actually taking more than 2 minutes).

@fqazi @ajwerner was #99665 expected to stop this kind of failure from happening?

@ajwerner
Copy link
Contributor

That one is interesting!

@ajwerner
Copy link
Contributor

No it has nothing to do with that failure mode. The deal here is that somewhere in the cluster setting land we're getting divergence between the in-memory value and the KV value.

@ajwerner
Copy link
Contributor

To be more concrete, the way the code here works is that it reads the value from KV and checks the value in memory. They do not move in lock-step (the in-memory value moves first). If the in-memory version does not match the kv version, we retry. We keep retrying because of different mismatches. I think we expect mis-matches nowadays while any migrations are on-going.

I don't know quite what the right thing to do here is. Maybe we want to return an in-memory value if we've seen the KV value ever move past it?

@renatolabs
Copy link
Contributor

No it has nothing to do with that failure mode.

So you're saying that #99590 (fixed by #99665) and this issue are unrelated and just happen to have the same error message as outcome? Can you explain how one could have identified the distinction?

I think we expect mis-matches nowadays while any migrations are on-going.

What do you mean by nowadays? The behavior you described (which seems to be the logic in show_cluster_setting.go) is basically unchanged since at least 22.2. I've looked through a few issues where we got this same error message and it seems that it generally pointed towards some other problem that needed fixing.

@ajwerner
Copy link
Contributor

I guess the truth is that they both failed because the migrations too more than 2 minutes. The difference is that here they were making progress, but were sort of slow and there they weren't making progress at all. I only fixed the proximate issue that was causing the migration to be hung. You're right that in effect the issue is unresolved.

@ajwerner
Copy link
Contributor

I wonder if what we should do is note when the version in memory is the fence version successor to the version in KV and then just return the KV version.

@ajwerner
Copy link
Contributor

@renatolabs this is what I have in mind: #99967

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 29, 2023
In `SHOW CLUSTER SETTING version` we wait for the stored version to match
the in-memory version. The in-memory version get pushed to the fence but
the fence is not persisted, so, while migrations are ongoing, we won't see
these values match. In practice this is a problem these days because the
migrations take a long time.

Epic: none

Informs cockroachdb#99894

Release note: None
@ajwerner
Copy link
Contributor

I can't get this one to fail on master, at least after 20 runs.

@ajwerner
Copy link
Contributor

I wonder if some of the fixes merged by Raphael recently changed things.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 31, 2023
In `SHOW CLUSTER SETTING version` we wait for the stored version to match
the in-memory version. The in-memory version get pushed to the fence but
the fence is not persisted, so, while migrations are ongoing, we won't see
these values match. In practice this is a problem these days because the
migrations take a long time.

Epic: none

Informs cockroachdb#99894

Release note (bug fix): Fixed a bug which could cause `SHOW CLUSTER SETTING
version` to hang and return an opaque error while cluster finalization is
ongoing.
craig bot pushed a commit that referenced this issue Mar 31, 2023
99967: sql: allow the in-memory version to be the next fence version r=ajwerner a=ajwerner

In `SHOW CLUSTER SETTING version` we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time.

Epic: none

Informs #99894

Release note: None

100003: sql: add a regression test for virtual table generation hang r=yuzefovich a=yuzefovich

This commit adds a regression test for #99753 that verifies that virtual table generation doesn't hang when the worker goroutine returns the query canceled error.

Informs #99753

Release note: None

Co-authored-by: ajwerner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 31, 2023
In `SHOW CLUSTER SETTING version` we wait for the stored version to match
the in-memory version. The in-memory version get pushed to the fence but
the fence is not persisted, so, while migrations are ongoing, we won't see
these values match. In practice this is a problem these days because the
migrations take a long time.

Epic: none

Informs #99894

Release note (bug fix): Fixed a bug which could cause `SHOW CLUSTER SETTING
version` to hang and return an opaque error while cluster finalization is
ongoing.
@renatolabs
Copy link
Contributor

Closing this one as it's very likely fixed by #99967. We'll take another look if it fails again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team
Projects
None yet
Development

No branches or pull requests

3 participants