kv/kvserver: TestPromoteNonVoterInAddVoter failed #101519
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4fb894e0398f21a3f7d0ea9f289718ed3ac65fdd:
Tried reproducing:
No success so far.
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 4bbe2faae22d21a650fbdbc280e0f3b2ad81aaf6:
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ adad57859a5281e2846082761c8232890b3d2a68:
Tried again and no success.
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 0d890b56341ca40b5ffb37c51d6b63bc322e80f4:
My bad - the failure wasn't reproducing because of... (cockroach/pkg/kv/kvserver/replicate_queue_test.go, line 2216 at 7a8be52).
Re-stressing now - I'm going to try and get to the bottom of this one.
Stressed again for 60 minutes (sequentially, not actually stressing) and no failure.
Got a failure
Relevant logs from the replicate queue: ...
The behavior the test expects: promote two non-voters. The rebalancing activity moves a replica from ... (log excerpt collapsed under Details),
which believes the range is over-replicated, despite the RF being set to 5 and there being 5 voters. This is then amended (Details collapsed):
which is where the add voter ... So the question is why did ...
We are running into a stale range config here, where the current config was set a while ago. Set here (Details collapsed):
Old config used to remove voters incorrectly on ... (Details collapsed).
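To make the stale-config diagnosis concrete, here is a toy illustration of the decision the store is effectively making. This is not the allocator's real API, and the stale num_replicas value of 3 is only an assumed pre-update default:

```go
package allocsketch

import "fmt"

// replicationStatus is a toy classification of a range's voter count against
// whatever span config the store currently holds, which may be stale.
func replicationStatus(numVoters, configNumReplicas int) string {
	switch {
	case numVoters > configNumReplicas:
		return "over-replicated"
	case numVoters < configNumReplicas:
		return "under-replicated"
	default:
		return "correctly replicated"
	}
}

func illustrate() {
	// With the up-to-date config (RF=5), a range with 5 voters is fine...
	fmt.Println(replicationStatus(5, 5)) // correctly replicated
	// ...but a store still acting on a stale config (assumed here to be a
	// pre-update default of 3) decides the same range is over-replicated and
	// removes voters, producing the extra range log events the test trips on.
	fmt.Println(replicationStatus(5, 3)) // over-replicated
}
```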
Sounds like a mux rangefeed issue. Is ...
We are ... All metamorphic constants: ...
I'm leaning towards skipping this test, and all other allocator-related tests which use a test cluster, under metamorphic constants, like #101882. I was stressing on 506a55a, which has the fixes in #102094 - so this problem doesn't appear solved (if it is the mux rangefeed).
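For reference, a sketch of what the skip approach might look like. This assumes (and I may be wrong) that setting COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true is how metamorphic constants get turned off, so its absence means they may be active; the helper itself is hypothetical:

```go
package kvserver_test

import (
	"os"
	"testing"
)

// skipUnderMetamorphic is a hypothetical helper, not an existing one: it
// skips the test unless metamorphic constants have been explicitly disabled
// via the (assumed) environment variable used by the test harness.
func skipUnderMetamorphic(t *testing.T) {
	if os.Getenv("COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING") != "true" {
		t.Skip("flaky under metamorphic constants; see #101882")
	}
}
```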
CC @miretskiy, this appears to reproduce rather easily.
Btw, the test could also consider just overriding the setting in question (always disabling mux) instead of skipping all metamorphic tests entirely, with a TODO to remove the override once mux rangefeeds are fixed.
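A minimal sketch of that override, assuming the test keeps using a testcluster.TestCluster; the cluster setting name below is a placeholder rather than the real mux rangefeed setting:

```go
package kvserver_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// startClusterWithMuxPinned starts the test cluster the same way the test
// already does, then pins the rangefeed implementation so the metamorphic
// constant can't flip it.
func startClusterWithMuxPinned(t *testing.T) *testcluster.TestCluster {
	ctx := context.Background()
	tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{})
	t.Cleanup(func() { tc.Stopper().Stop(ctx) })

	db := sqlutils.MakeSQLRunner(tc.ServerConn(0))
	// TODO: drop this override once mux rangefeeds are deflaked.
	// NB: the setting name is a placeholder; substitute the actual one.
	db.Exec(t, `SET CLUSTER SETTING kv.rangefeed.use_mux_rangefeeds = false`)
	return tc
}
```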
Yeah, that sounds like a better solution - these failures have created a lot of noise / wasted time spent debugging, but if we're shipping these on by default in 23.2 we need to re-enable them eventually. If these ranges were larger, you could have under-replication for a dangerous amount of time.
Yeah, sorry about that. @miretskiy, do we need to disable this metamorphism again until we've fixed the problems? These constant flakes cause a fair bit of additional work for test triagers.
It would be very sad to disable mux on master -- we would have no signal or tests to know if the problems are fixed. Regardless, this failure happens very infrequently, it seems; 60+ minutes of stressing.
SGTM - I'm stressing atm with mux rangefeed disabled. I'm slightly suspicious that it is mux rangefeed, since these failures began a month after earlier flakes we traced to mux rangefeed in #99207, and this test has failed at least twice a week since the initial failure.
It looks like mux rangefeeds were wrongly being blamed, afaict. I've tested with setting ... Side note: stressing without metamorphic constants is 6x faster.
The issue that comes up is that the existing leaseholder receives the span config update at ... I'm surprised we haven't seen this more in tests which involve span configuration changes. I'm also not sure how we easily get around this without the current LH and new LH agreeing on a min span config sequence number before transferring. I'm sorry we jumped at mux as the root cause here @miretskiy! I've opened #103083 to deflake the test.
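One way a test could dodge this race, sketched under the assumption that each store's applied span config is observable somehow; spanConfigReader below is a placeholder interface, not an existing API:

```go
package kvserver_test

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// spanConfigReader stands in for whatever handle the test has to each store's
// currently applied span config (e.g. a testing knob or a store accessor).
type spanConfigReader interface {
	StoreID() int32
	AppliedNumReplicas() int32
}

// waitForSpanConfigRF blocks until every store reports the expected
// replication factor, so that no leaseholder acts on a stale config while the
// test is asserting on the range log.
func waitForSpanConfigRF(t testing.TB, stores []spanConfigReader, want int32) {
	testutils.SucceedsSoon(t, func() error {
		for _, s := range stores {
			if got := s.AppliedNumReplicas(); got != want {
				return fmt.Errorf("s%d still sees num_replicas=%d, want %d",
					s.StoreID(), got, want)
			}
		}
		return nil
	})
}
```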
Filed #103086 (comment). Fixing the problem seems like it comes with its own issues, since it requires blocking a new leaseholder (or selecting one to begin with) from performing replication changes until it has at least the old leaseholder's config version or greater. Perhaps not a material issue if we still don't allow removing replicas during the period and constrain up-replication to only handle dead nodes? In any case, not a high-priority problem to fix since we expect it to eventually resolve itself.
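Roughly what that mitigation could look like; everything here is hypothetical, including the idea of a span config sequence number handed over at lease transfer, which doesn't exist today:

```go
package kvserver_sketch

// replicationAction is a hypothetical summary of what a new leaseholder would
// be allowed to do while its span config may still lag the outgoing
// leaseholder's.
type replicationAction int

const (
	allowAllChanges              replicationAction = iota // config known to be current
	allowOnlyDeadNodeReplacement                          // defer removals/rebalances
)

// gateReplicationChanges compares a (hypothetical) local config sequence
// number against the minimum sequence number the outgoing leaseholder acted
// on. Until the local config catches up, only up-replication to replace
// replicas on dead nodes is permitted, which is safe under either config.
func gateReplicationChanges(localConfSeq, minConfSeqFromOutgoingLH uint64) replicationAction {
	if localConfSeq >= minConfSeqFromOutgoingLH {
		return allowAllChanges
	}
	return allowOnlyDeadNodeReplacement
}
```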
It is possible for span config updates to arrive at different times between stores. `TestPromoteNonVoterInAddVoter` was flaking when the incoming leaseholder would act upon a stale span config, before receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected promotion events appeared in the range log, as the incoming leaseholder removed voters, then subsequently added them back upon receiving the up to date span configuration.

This commit checks on the prefix of the range log add voter events, to avoid failing the test when this untimely behavior occurs.

Fixes: cockroachdb#101519

Release note: None
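A minimal sketch of what such a prefix check can look like; the event struct and field names are invented for illustration, since the real test decodes structured range log entries:

```go
package kvserver_sketch

import "fmt"

// addVoterEvent is a stand-in for a parsed "add voter" range log entry.
type addVoterEvent struct {
	StoreID int32
	Reason  string
}

// checkAddVoterPrefix requires the expected promotion events to appear as a
// prefix of the observed add-voter events, so that later add-voter activity
// caused by a store briefly acting on a stale span config (voters removed,
// then re-added) no longer fails the test.
func checkAddVoterPrefix(observed, expected []addVoterEvent) error {
	if len(observed) < len(expected) {
		return fmt.Errorf("expected at least %d add voter events, found %d",
			len(expected), len(observed))
	}
	for i, want := range expected {
		if observed[i] != want {
			return fmt.Errorf("add voter event %d: got %+v, want %+v",
				i, observed[i], want)
		}
	}
	return nil
}
```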
103077: docgen: add EXPLAIN, EXPLAIN ANALYZE options to diagrams r=michae2 a=taroface

Add various `EXPLAIN` and `EXPLAIN ANALYZE` options that were documented but not exposed in the SQL diagrams. Also add `REDACT`, which will be documented with cockroachdb/docs#16929.

Epic: none
Release note: none
Release justification: non-production code change

103083: kvserver: deflake TestPromoteNonVoterInAddVoter r=andrewbaptist a=kvoli

It is possible for span config updates to arrive at different times between stores. `TestPromoteNonVoterInAddVoter` was flaking when the incoming leaseholder would act upon a stale span config, before receiving the updated one which the outgoing leaseholder used.

This resulted in the test failing as more than just the two expected promotion events appeared in the range log, as the incoming leaseholder removed voters, then subsequently added them back upon receiving the up to date span configuration. #103086 tracks this issue.

This PR checks on the prefix of the range log add voter events, to avoid failing the test when this untimely behavior occurs.

Stressed overnight, removing the skip under stress flag:

```
dev test pkg/kv/kvserver -f TestPromoteNonVoterInAddVoter -v --stress --stress-args="-p 4"
...
27158 runs so far, 0 failures, over 12h56m55s
```

This PR also adds additional (v=6) logging of the range descriptor and span config, as these come in handy when debugging failures such as this.

Fixes: #101519
Release note: None

103332: cloud/kubernetes: bump version to v23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103335: util/version: update roachtest version for 23.2 to 23.1.0 r=ZhouXing19 a=ZhouXing19

Epic: None
Release note: None

103338: ci: additionally build `workload` and `dev` for all Unix systems r=rail,srosenberg a=rickystewart

This includes ARM machines and macOS systems.

Epic: none
Release note: None

Co-authored-by: Ryan Kuo <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
kv/kvserver.TestPromoteNonVoterInAddVoter failed with artifacts on master @ 8e6cdbd8b3d12ba5d65db101a3abfeb2cb1bf470:
Help
See also: How To Investigate a Go Test Failure (internal)
Jira issue: CRDB-26986