-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server, statusccl: de-flake span stats fan-out tests #106551
server, statusccl: de-flake span stats fan-out tests #106551
Conversation
@@ -141,7 +141,7 @@ var _ TenantTestHelper = &tenantTestHelper{} | |||
|
|||
// NewTestTenantHelper constructs a TenantTestHelper instance. | |||
func NewTestTenantHelper( | |||
t *testing.T, tenantClusterSize int, knobs base.TestingKnobs, | |||
t *testing.T, tenantClusterSize int, numNodes int, knobs base.TestingKnobs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API is sus...
(it was already somewhat sus before your change)
This parameter "tenantClusterSize" indicates the number of tenant servers to start connecting to a single KV node.
On the one hand I get that the original goal was simply to obtain a test framework where there were multiple SQL pods running side-by-side.
On the other hand, this is not a realistic deployment scenario -- in practice a cluster should have multiple nodes. Multiple tenant servers on top of a one-node CockroachDB cluster is not very representative.
What should we do about this?
I'm not exactly sure, but the API extension you're proposing here is adding a problem on top of another problem: now with your change, if someone specifies "tenantClusterSize=3" and "numNodes=3" they get 9 SQL servers, and I don't think that's what we want.
If I were you I would reason about this backwards: as a CRL developer, what am I trying to achieve with my test code?
It seems to me (if my reading is right) that we want
- some tests in this package want multiple SQL servers
- some tests need multiple KV nodes
- there's no test that really requires to have just 1 KV node or 1 SQL server
So why not... removing these 2 parameters altogether, and always start a 3-node storage cluster with 1 tenant server per node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@herkolategan can I get your input on this too? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/ccl/serverccl/tenant_test_utils.go
line 144 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
@herkolategan can I get your input on this too? Thanks
Something about selecting the first server and then starting tenants against it feels off to me, especially when numNodes > 1
.
Looking at https://github.com/cockroachdb/cockroach/blob/master/pkg/server/testserver.go#L1058 this selects one KV node.
It would be better if tenants had the full list of KV nodes when starting on top of a host test cluster. But I think @knz is right, this is a compounding problem.
Outside the scope of this PR, but in general I think TestCluster
needs an overhaul to have proper virtual cluster support (multiple tenants). It already has methods like WaitForTenantCapabilities
. Instead of having separate utility classes for managing tenants on top of test clusters we should provide a more general solution accessible to other parts of the codebase as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @herkolategan)
pkg/ccl/serverccl/tenant_test_utils.go
line 144 at r1 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Something about selecting the first server and then starting tenants against it feels off to me, especially when
numNodes > 1
.
Looking at https://github.com/cockroachdb/cockroach/blob/master/pkg/server/testserver.go#L1058 this selects one KV node.
It would be better if tenants had the full list of KV nodes when starting on top of a host test cluster. But I think @knz is right, this is a compounding problem.Outside the scope of this PR, but in general I think
TestCluster
needs an overhaul to have proper virtual cluster support (multiple tenants). It already has methods likeWaitForTenantCapabilities
. Instead of having separate utility classes for managing tenants on top of test clusters we should provide a more general solution accessible to other parts of the codebase as well.
Thanks Herko.
In the interest of helping Zach make progress here, could you advise what is the smallest amount of incremental improvement that Zach can do in this PR, ahead of making the testserver API better?
Then, separately, I would welcome you explaining what you recommend for the API into a new (separate) issue under epic CRDB-18499. Cheers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @knz)
pkg/ccl/serverccl/tenant_test_utils.go
line 144 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Thanks Herko.
In the interest of helping Zach make progress here, could you advise what is the smallest amount of incremental improvement that Zach can do in this PR, ahead of making the testserver API better?
Then, separately, I would welcome you explaining what you recommend for the API into a new (separate) issue under epic CRDB-18499. Cheers
Thanks for catching this.
Since the the primary goal of this PR is to de-flake the mentioned tests, I'll revert the change I made to newTestTenantHelper
.
The fan-out path is already covered by the tests in pkg/server
, so I think the status quo (sans the flakes 😄 ) is acceptable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @knz)
pkg/ccl/serverccl/tenant_test_utils.go
line 144 at r1 (raw file):
Previously, zachlite wrote…
Thanks for catching this.
Since the the primary goal of this PR is to de-flake the mentioned tests, I'll revert the change I made to
newTestTenantHelper
.The fan-out path is already covered by the tests in
pkg/server
, so I think the status quo (sans the flakes 😄 ) is acceptable here.
Given the current state of things I think it's best just to fix the flake as @zachlite mentioned (Thanks!).
@knz I'll create an issue under the CRDB-18499 epic.
This commit fixes flaky behavior while running `TestSpanStatsFanOut` and `TestTenantSpanStats` under stress. Both tests have been updated to ensure the following behavior: - The tests make sure range splits occur before proceeding. - The tests will retry their assertions to give the new key-value pairs time to replicate. Additionally, `NewTestTenantHelper` was updated to accept a parameter for the number of nodes in the test host cluster. `TestTenantSpanStats` now uses a 3-node cluster to test a real fan-out. Resolves cockroachdb#99770 Resolves cockroachdb#99559 Epic:none Release note: None
68753c7
to
f1bc66e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-106551 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/106636/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit fixes flaky behavior while running
TestSpanStatsFanOut
andTestTenantSpanStats
under stress.Both tests have been updated to ensure the following behavior:
Additionally,
NewTestTenantHelper
was updated to accept a parameter for the number of nodes in the test host cluster.TestTenantSpanStats
now uses a 3-node cluster to test a real fan-out.Resolves #99770
Resolves #99559
Epic:none
Release note: None