-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: acceptance/version-upgrade failed #99894
Comments
Same familiar error:
(and once again, no migrations around the time of failure are actually taking more than 2 minutes). @fqazi @ajwerner was #99665 expected to stop this kind of failure from happening? |
That one is interesting! |
No it has nothing to do with that failure mode. The deal here is that somewhere in the cluster setting land we're getting divergence between the in-memory value and the KV value. |
To be more concrete, the way the code here works is that it reads the value from KV and checks the value in memory. They do not move in lock-step (the in-memory value moves first). If the in-memory version does not match the kv version, we retry. We keep retrying because of different mismatches. I think we expect mis-matches nowadays while any migrations are on-going. I don't know quite what the right thing to do here is. Maybe we want to return an in-memory value if we've seen the KV value ever move past it? |
So you're saying that #99590 (fixed by #99665) and this issue are unrelated and just happen to have the same error message as outcome? Can you explain how one could have identified the distinction?
What do you mean by nowadays? The behavior you described (which seems to be the logic in |
I guess the truth is that they both failed because the migrations too more than 2 minutes. The difference is that here they were making progress, but were sort of slow and there they weren't making progress at all. I only fixed the proximate issue that was causing the migration to be hung. You're right that in effect the issue is unresolved. |
I wonder if what we should do is note when the version in memory is the fence version successor to the version in KV and then just return the KV version. |
@renatolabs this is what I have in mind: #99967 |
In `SHOW CLUSTER SETTING version` we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time. Epic: none Informs cockroachdb#99894 Release note: None
I can't get this one to fail on master, at least after 20 runs. |
I wonder if some of the fixes merged by Raphael recently changed things. |
In `SHOW CLUSTER SETTING version` we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time. Epic: none Informs cockroachdb#99894 Release note (bug fix): Fixed a bug which could cause `SHOW CLUSTER SETTING version` to hang and return an opaque error while cluster finalization is ongoing.
99967: sql: allow the in-memory version to be the next fence version r=ajwerner a=ajwerner In `SHOW CLUSTER SETTING version` we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time. Epic: none Informs #99894 Release note: None 100003: sql: add a regression test for virtual table generation hang r=yuzefovich a=yuzefovich This commit adds a regression test for #99753 that verifies that virtual table generation doesn't hang when the worker goroutine returns the query canceled error. Informs #99753 Release note: None Co-authored-by: ajwerner <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
In `SHOW CLUSTER SETTING version` we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time. Epic: none Informs #99894 Release note (bug fix): Fixed a bug which could cause `SHOW CLUSTER SETTING version` to hang and return an opaque error while cluster finalization is ongoing.
Closing this one as it's very likely fixed by #99967. We'll take another look if it fails again. |
roachtest.acceptance/version-upgrade failed with artifacts on master @ bb5c97b5d2073a27e7ec88654b05ac913112f610:
Parameters:
ROACHTEST_cloud=gce
,ROACHTEST_cpu=4
,ROACHTEST_encrypted=false
,ROACHTEST_fs=ext4
,ROACHTEST_localSSD=true
,ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-26117
The text was updated successfully, but these errors were encountered: