kv/kvserver: TestTenantRateLimiter failed #129612
Related to #127697 (comment) and #125728. Looking at the test, it's almost certain that we're hitting the 120s context cancellation that's set up here:

`/pkg/kv/kvserver/client_tenant_test.go#L243-L246`

```go
// This test shouldn't take forever. If we're going to fail, better to
// do it in minutes than in an hour.
ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
```

This cancellation seems to have been put there because this is a test that tends to deadlock instead of fail. The test wants to see that a certain number of tenant requests is not blocked and that excess requests are blocked. But if background writes sneak in, some of the requests that we think shouldn't get blocked might block, and this would block the test until the context cancellation hits.

In fact, the error above tells us that we're sitting on the block that verifies that "the first burst writes don't get blocked" here:

`/pkg/kv/kvserver/client_tenant_test.go#L254-L258`

```go
// Now ensure that in the same instant the write QPS limit does affect the
// tenant. First issue requests that can happen without blocking.
for i := 0; i < burstWrites; i++ {
	require.NoError(t, ts.DB().Put(ctx, mkKey(), 0))
}
```

But from the looks of it they are certainly getting blocked, perhaps due to an errant request sneaking in. The test has plenty of evidence (various things disabled, etc.) suggesting this kind of thing has been an issue in the past.

I'll send an update to the test that fails it gracefully, i.e. with a clear error and without hanging for two minutes first, but it will remain flaky. I'll also add some debug logging that prints the source of everything that consumes from the limiter, so the next failure should hand it to us on a platter so that we can react.
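To illustrate why background writes can make the "non-blocking" burst writes block: a rate limiter with burst semantics has a fixed budget of tokens available without waiting, and any request that sneaks in consumes from that same budget. This is a minimal sketch with a hypothetical `bucket` type (not the actual kvserver tenant rate limiter, which also refills over time):

```go
package main

import "fmt"

// bucket is a deliberately minimal token bucket: 'tokens' requests may
// proceed immediately; refill is not modeled. It illustrates why an
// errant background write eats into the burst budget and forces a later
// "shouldn't block" request to wait.
type bucket struct{ tokens int }

// tryAcquire reports whether a request can proceed without blocking.
func (b *bucket) tryAcquire() bool {
	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	const burstWrites = 3
	b := &bucket{tokens: burstWrites}

	// The first burstWrites requests succeed without blocking...
	for i := 0; i < burstWrites; i++ {
		fmt.Println(b.tryAcquire()) // true, three times
	}
	// ...but if a background write had consumed a token first, one of
	// the loop iterations above would have found the bucket empty.
	// The request past the burst budget is throttled:
	fmt.Println(b.tryAcquire()) // false
}
```

In the real test, a `false` here corresponds to `Put` blocking inside the limiter until the 2-minute context cancellation fires.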
kv/kvserver.TestTenantRateLimiter failed on release-24.1.4-rc @ 75d6ac620fb2d223454026e220a6999cedbba41c:
Parameters:
Same failure on other branches
129640: kvserver: deflake, fix, speed up TestTenantRateLimiter r=tbg a=tbg

This test would occasionally (rarely) hang for 120s and then fail with an opaque AmbiguousResultError when some of the requests the test doesn't expect to block end up doing so. I'm fairly sure this test can sometimes deadlock because enough extra writes sneak in to cause the test's requests to block. Seeing as this is a tarpit test (once you start looking, your work day is consumed), I addressed the following problems:

- It claimed to verify metrics, but the regexp had rotted, so it wasn't verifying anything.
- It ran into throttling in the client-side tenant rate limiter, so the test took 10s-15s. Now it runs in ~1s.
- It also did >3k requests, so it was quite expensive. Now it does a couple hundred bigger ones, which should be a smidge cheaper.
- It was flaky due to trying to make very strong claims about when requests would block, despite dozens of hard-to-avoid background requests sneaking in on every run.
- It would be difficult to debug when hanging due to unexpected blocking.

Now it's a somewhat weaker but still meaningful test for which we pay a much smaller tax going forward. Should it block again, the test will fail gracefully; it will still be difficult to root cause, but it is also now wholly unexpected that this should ever occur.

See: #129612 (comment)

Fixes #129612.

Release note: None
Epic: None

Co-authored-by: Tobias Grieger <[email protected]>
Based on the specified backports for linked PR #129640, I applied the following new label(s) to this issue: branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
kv/kvserver.TestTenantRateLimiter failed on release-24.1.4-rc @ a1cbd782a33a51f7500444373ccdaa81eb2ed971:
Parameters:
attempt=1
run=21
shard=24
Help
See also: How To Investigate a Go Test Failure (internal)
Same failure on other branches
Jira issue: CRDB-41613