Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestTenantRateLimiter failed #129612

Closed
cockroach-teamcity opened this issue Aug 25, 2024 · 3 comments · Fixed by #129640
Closed

kv/kvserver: TestTenantRateLimiter failed #129612

cockroach-teamcity opened this issue Aug 25, 2024 · 3 comments · Fixed by #129640
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.1.4-rc branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 25, 2024

kv/kvserver.TestTenantRateLimiter failed on release-24.1.4-rc @ a1cbd782a33a51f7500444373ccdaa81eb2ed971:

        	            	  | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestTenantRateLimiter
        	            	  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_tenant_test.go:257
        	            	  | testing.tRunner
        	            	  | 	GOROOT/src/testing/testing.go:1689
        	            	  | runtime.goexit
        	            	  | 	src/runtime/asm_amd64.s:1695
        	            	Wraps: (2) aborted in DistSender
        	            	Wraps: (3) result is ambiguous: context deadline exceeded
        	            	Wraps: (4) secondary error attachment
        	            	  | aborted during DistSender.Send: context deadline exceeded
        	            	  | (1)
        	            	  |   | (opaque error wrapper)
        	            	  |   | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
        	            	  |   | reportable 0:
        	            	  |   |
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:3100
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2168
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1719
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1238
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:229
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1138
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1121
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.sendAndFill
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:970
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:993
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Put
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:426
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestTenantRateLimiter
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_tenant_test.go:257
        	            	  |   | testing.tRunner
        	            	  |   | 	GOROOT/src/testing/testing.go:1689
        	            	  |   | runtime.goexit
        	            	  |   | 	src/runtime/asm_amd64.s:1695
        	            	  | Wraps: (2) aborted during DistSender.Send
        	            	  | Wraps: (3) context deadline exceeded
        	            	  | Error types: (1) *errbase.opaqueWrapper (2) *errutil.withPrefix (3) context.deadlineExceededError
        	            	Wraps: (5) context deadline exceeded
        	            	Error types: (1) *errbase.opaqueWrapper (2) *errutil.withPrefix (3) *kvpb.AmbiguousResultError (4) *secondary.withSecondaryError (5) context.deadlineExceededError
        	Test:       	TestTenantRateLimiter
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantRateLimiter434590919
--- FAIL: TestTenantRateLimiter (120.70s)

Parameters:

  • attempt=1
  • run=21
  • shard=24
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-41613

@cockroach-teamcity cockroach-teamcity added branch-release-24.1.4-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Aug 25, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Aug 25, 2024
@tbg
Copy link
Member

tbg commented Aug 26, 2024

Related to #127697 (comment) and #125728.
Remarkably, in all three failures the test takes almost exactly the same amount of time: 120.x seconds. That seems unlikely to be just overload.

Looking at the test, it's almost certainly that we're hitting the 120s context cancellation that's set up here:

/pkg/kv/kvserver/client_tenant_test.go#L243-L246

	// This test shouldn't take forever. If we're going to fail, better to
	// do it in minutes than in an hour.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

This cancellation seems to have been put there because this is a test that tends to deadlock instead of fail. The test wants to see that a certain number of tenant requests is not blocked and that excess requests are blocked. But if background writes sneak in, maybe some of the ones that we think shouldn't get blocked would block, and this would block the test until the context cancellation hits.

In fact, the error above tells us that we're sitting on a block that verifies that "the first burst writes don't get blocked" here:

/pkg/kv/kvserver/client_tenant_test.go#L254-L258

	// Now ensure that in the same instant the write QPS limit does affect the
	// tenant. First issue requests that can happen without blocking.
	for i := 0; i < burstWrites; i++ {
		require.NoError(t, ts.DB().Put(ctx, mkKey(), 0))
	}

but from the looks of it they are certainly getting blocked, perhaps due to an errant request sneaking in. The test has plenty of evidence (various things disabled, etc) that suggests that this kind of thing has been an issue in the past.

I'll send an update to the test that fails it gracefully - i.e. with a clear error and without hanging for two minute first - but it will remain flaky. I'll add some debug logging that prints the source of everything that consumes from the limiter, so next failure should hand it to us on a platter so that we can react.

@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 29, 2024
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestTenantRateLimiter failed on release-24.1.4-rc @ 75d6ac620fb2d223454026e220a6999cedbba41c:

        	            	  | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestTenantRateLimiter
        	            	  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_tenant_test.go:257
        	            	  | testing.tRunner
        	            	  | 	GOROOT/src/testing/testing.go:1689
        	            	  | runtime.goexit
        	            	  | 	src/runtime/asm_amd64.s:1695
        	            	Wraps: (2) aborted in DistSender
        	            	Wraps: (3) result is ambiguous: context deadline exceeded
        	            	Wraps: (4) secondary error attachment
        	            	  | aborted during DistSender.Send: context deadline exceeded
        	            	  | (1)
        	            	  |   | (opaque error wrapper)
        	            	  |   | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
        	            	  |   | reportable 0:
        	            	  |   |
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:3100
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2168
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1719
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1238
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:229
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1138
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).send
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1121
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.sendAndFill
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:970
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:993
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Put
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:426
        	            	  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestTenantRateLimiter
        	            	  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_tenant_test.go:257
        	            	  |   | testing.tRunner
        	            	  |   | 	GOROOT/src/testing/testing.go:1689
        	            	  |   | runtime.goexit
        	            	  |   | 	src/runtime/asm_amd64.s:1695
        	            	  | Wraps: (2) aborted during DistSender.Send
        	            	  | Wraps: (3) context deadline exceeded
        	            	  | Error types: (1) *errbase.opaqueWrapper (2) *errutil.withPrefix (3) context.deadlineExceededError
        	            	Wraps: (5) context deadline exceeded
        	            	Error types: (1) *errbase.opaqueWrapper (2) *errutil.withPrefix (3) *kvpb.AmbiguousResultError (4) *secondary.withSecondaryError (5) context.deadlineExceededError
        	Test:       	TestTenantRateLimiter
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantRateLimiter389259603
--- FAIL: TestTenantRateLimiter (120.65s)

Parameters:

  • attempt=1
  • run=13
  • shard=24
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Sep 6, 2024
129640: kvserver: deflake, fix, speed up TestTenantRateLimiter r=tbg a=tbg

This test would occasionally (rarely) hang for 120s and then fail with
an opaque AmbiguousResultError in case that some of the requests the
test doesn't expect to block end up doing so.

I'm fairly sure this test can sometimes deadlock because enough extra
writes sneak in to cause the test's requests to block.

Seeing how this is a tarpit test (once you start looking your work day
is consumed), I addressed the following problems:

- it claimed to verify metrics, but the regexp had rotted
  and so it wasn't verifying anything.
- it ran into throttling in the client-side tenant rate limiter,
  so the test took 10s-15s. Now it runs in ~1s.
- it also did >3k requests, so it was quite expensive. Now
  it does a couple hundred bigger ones, which should be a
  smidge cheaper.
- it was flaky due to trying to make very strong claims about
  when requests would block, despite dozens of background requests
  that are hard to avoid sneaking in on every run.
- it would be difficult to debug when hanging due to unexpected
  blocking.

Now it's a somewhat weaker but still meaningful test that going forward
we are paying a much smaller tax for. Should it block again the test
will fail gracefully; it will still be difficult to root cause but
it is also now wholly unexpected that this should ever occur.

See: #129612 (comment)

Fixes #129612.

Release note: None
Epic: None

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in ba875d5 Sep 6, 2024
Copy link

blathers-crl bot commented Sep 13, 2024

Based on the specified backports for linked PR #129640, I applied the following new label(s) to this issue: branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Sep 13, 2024
blathers-crl bot pushed a commit that referenced this issue Sep 13, 2024
This test would occasionally (rarely) hang for 120s and then fail with
an opaque AmbiguousResultError in case that some of the requests the
test doesn't expect to block end up doing so.

I'm fairly sure this test can sometimes deadlock because enough extra
writes sneak in to cause the test's requests to block.

Seeing how this is a tarpit test (once you start looking your work day
is consumed), I addressed the following problems:

- it claimed to verify metrics, but the regexp had rotted
  and so it wasn't verifying anything.
- it ran into throttling in the client-side tenant rate limiter,
  so the test took 10s-15s. Now it runs in ~1s.
- it also did >3k requests, so it was quite expensive. Now
  it does a couple hundred bigger ones, which should be a
  smidge cheaper.
- it was flaky due to trying to make very strong claims about
  when requests would block, despite dozens of background requests
  that are hard to avoid sneaking in on every run.
- it would be difficult to debug when hanging due to unexpected
  blocking.

Now it's a somewhat weaker but still meaningful test that going forward
we are paying a much smaller tax for. Should it block again the test
will fail gracefully; it will still be difficult to root cause but
it is also now wholly unexpected that this should ever occur.

See: #129612 (comment)

Fixes #129612.

Release note: None
Epic: None
blathers-crl bot pushed a commit that referenced this issue Sep 13, 2024
This test would occasionally (rarely) hang for 120s and then fail with
an opaque AmbiguousResultError in case that some of the requests the
test doesn't expect to block end up doing so.

I'm fairly sure this test can sometimes deadlock because enough extra
writes sneak in to cause the test's requests to block.

Seeing how this is a tarpit test (once you start looking your work day
is consumed), I addressed the following problems:

- it claimed to verify metrics, but the regexp had rotted
  and so it wasn't verifying anything.
- it ran into throttling in the client-side tenant rate limiter,
  so the test took 10s-15s. Now it runs in ~1s.
- it also did >3k requests, so it was quite expensive. Now
  it does a couple hundred bigger ones, which should be a
  smidge cheaper.
- it was flaky due to trying to make very strong claims about
  when requests would block, despite dozens of background requests
  that are hard to avoid sneaking in on every run.
- it would be difficult to debug when hanging due to unexpected
  blocking.

Now it's a somewhat weaker but still meaningful test that going forward
we are paying a much smaller tax for. Should it block again the test
will fail gracefully; it will still be difficult to root cause but
it is also now wholly unexpected that this should ever occur.

See: #129612 (comment)

Fixes #129612.

Release note: None
Epic: None
blathers-crl bot pushed a commit that referenced this issue Sep 13, 2024
This test would occasionally (rarely) hang for 120s and then fail with
an opaque AmbiguousResultError in case that some of the requests the
test doesn't expect to block end up doing so.

I'm fairly sure this test can sometimes deadlock because enough extra
writes sneak in to cause the test's requests to block.

Seeing how this is a tarpit test (once you start looking your work day
is consumed), I addressed the following problems:

- it claimed to verify metrics, but the regexp had rotted
  and so it wasn't verifying anything.
- it ran into throttling in the client-side tenant rate limiter,
  so the test took 10s-15s. Now it runs in ~1s.
- it also did >3k requests, so it was quite expensive. Now
  it does a couple hundred bigger ones, which should be a
  smidge cheaper.
- it was flaky due to trying to make very strong claims about
  when requests would block, despite dozens of background requests
  that are hard to avoid sneaking in on every run.
- it would be difficult to debug when hanging due to unexpected
  blocking.

Now it's a somewhat weaker but still meaningful test that going forward
we are paying a much smaller tax for. Should it block again the test
will fail gracefully; it will still be difficult to root cause but
it is also now wholly unexpected that this should ever occur.

See: #129612 (comment)

Fixes #129612.

Release note: None
Epic: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.1.4-rc branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
No open projects
Status: roachtest/unit test backlog
Development

Successfully merging a pull request may close this issue.

2 participants