teamcity: failed tests on master: testrace/TestSnapshotAfterTruncation/sameTerm, testrace/TestSnapshotAfterTruncation, testrace/TestSnapshotAfterTruncation/differentTerm #17448
Well, it's a little stupid, but git bisect actually points all the way back to 9a68241. And sure enough, reverting that one-character change on top of HEAD fixes the flakes (although of course we don't actually want to revert that for real clusters). I think the test and/or the …
To make my previous comment more reproducible: all tests I ran were run on a 24 vCPU gceworker, all with the same command.
Hm, I retried that change because I was considering putting it in as a temporary fix to stop the flakiness, and found that it only actually fixes things if I also revert e100e6d and c2dc44b (but notably, reverting those two without changing …). My testing the other day about patching just the … In short, …
I mean, I guess? It seems to me that we've had cases in the past where someone asserted that things were likely mere test issues that were later discovered not to be test-only, so I'm a bit uncomfortable with that assertion. Stopping the bleeding seems good, but we should certainly prioritize getting to the bottom of this. Perhaps @irfansharif's newly acquired familiarity with this area can be of use here.
This isn't a great fix, but it'll stop the flakes we've been seeing (cockroachdb#17448). We should follow up and dig deeper when we're not in a rush to polish off a release.
I agree that our experience has shown test flakiness is often due to a real issue. I just have other issues, which are definitely real, that I need to look into first. Also, it seems that c2dc44b was definitely associated with a spike in flakiness, so getting rid of it seems good until we're able to look into it. If @irfansharif has cycles to look into this stuff sooner, I'm fine with not reverting.
I thought c2dc44b wasn't supposed to change any of the lease renewal configuration, just simplify the calculations. Did we miss something?
It changed any tests that relied on …
I've been trying to re-enable this test so that I can use similar infrastructure to test a theory for #17524. It seems to be nice and easy to reproduce the flake if … (see cockroach/pkg/storage/client_raft_test.go, lines 715 to 725 at 8ef84ea).
I'm a little confused how this was ever supposed to work, so I'm probably misunderstanding something. My theory about the flake is that most of the time we weren't waiting for the next leader election to complete after restarting the other two stores before restarting the third. When we add the sleep, we allow this leader election to finish with much higher probability, so it's much more likely that the third store will get stuck performing a liveness update.

DISCLAIMER: I'm not very confident about this theory because I still think I have something confused. Also, if we're not going to fix this immediately, I'll at least re-enable …
I think that makes sense. Most of the time the blocked node comes up fast enough to still be considered live. If it isn't, though, it won't be able to heartbeat. I'd add …
👍 Thanks for the confirmation. The changes you suggested sound good.
Fixes cockroachdb#17448. This change adds a `restartStoreWithoutHeartbeat` method to `multiTestContext`, which is used in cases like `TestSnapshotAfterTruncation/differentTerm` where a successful heartbeat on store restart is not required.
The following tests appear to have failed:
#315166: