Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server, statusccl: de-flake span stats fan-out tests #106551

Merged
merged 1 commit into from
Jul 11, 2023

Conversation

zachlite
Copy link
Contributor

This commit fixes flaky behavior while running
TestSpanStatsFanOut and TestTenantSpanStats under stress.

Both tests have been updated to ensure the following behavior:

  • The tests make sure range splits occur before proceeding.
  • The tests will retry their assertions to give the new key-value pairs time to replicate.

Additionally, NewTestTenantHelper was updated to accept a parameter for the number of nodes in the test host cluster. TestTenantSpanStats now uses a 3-node cluster to test a real fan-out.

Resolves #99770
Resolves #99559
Epic:none
Release note: None

@zachlite zachlite requested a review from a team July 10, 2023 21:09
@zachlite zachlite requested review from a team as code owners July 10, 2023 21:09
@zachlite zachlite requested review from a team and dhartunian and removed request for a team July 10, 2023 21:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -141,7 +141,7 @@ var _ TenantTestHelper = &tenantTestHelper{}

// NewTestTenantHelper constructs a TenantTestHelper instance.
func NewTestTenantHelper(
t *testing.T, tenantClusterSize int, knobs base.TestingKnobs,
t *testing.T, tenantClusterSize int, numNodes int, knobs base.TestingKnobs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is sus...
(it was already somewhat sus before your change)

This parameter "tenantClusterSize" indicates the number of tenant servers to start connecting to a single KV node.

On the one hand I get that the original goal was simply to obtain a test framework where there were multiple SQL pods running side-by-side.
On the other hand, this is not a realistic deployment scenario -- in practice a cluster should have multiple nodes. Multiple tenant servers on top of a one-node CockroachDB cluster is not very representative.

What should we do about this?

I'm not exactly sure, but the API extension you're proposing here is adding a problem on top of another problem: now with your change, if someone specifies "tenantClusterSize=3" and "numNodes=3" they get 9 SQL servers, and I don't think that's what we want.

If I were you I would reason about this backwards: as a CRL developer, what am I trying to achieve with my test code?

It seems to me (if my reading is right) that we want

  1. some tests in this package want multiple SQL servers
  2. some tests need multiple KV nodes
  3. there's no test that really requires to have just 1 KV node or 1 SQL server

So why not... removing these 2 parameters altogether, and always start a 3-node storage cluster with 1 tenant server per node?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@herkolategan can I get your input on this too? Thanks

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/ccl/serverccl/tenant_test_utils.go line 144 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

@herkolategan can I get your input on this too? Thanks

Something about selecting the first server and then starting tenants against it feels off to me, especially when numNodes > 1.
Looking at https://github.com/cockroachdb/cockroach/blob/master/pkg/server/testserver.go#L1058 this selects one KV node.
It would be better if tenants had the full list of KV nodes when starting on top of a host test cluster. But I think @knz is right, this is a compounding problem.

Outside the scope of this PR, but in general I think TestCluster needs an overhaul to have proper virtual cluster support (multiple tenants). It already has methods like WaitForTenantCapabilities. Instead of having separate utility classes for managing tenants on top of test clusters we should provide a more general solution accessible to other parts of the codebase as well.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @herkolategan)


pkg/ccl/serverccl/tenant_test_utils.go line 144 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Something about selecting the first server and then starting tenants against it feels off to me, especially when numNodes > 1.
Looking at https://github.com/cockroachdb/cockroach/blob/master/pkg/server/testserver.go#L1058 this selects one KV node.
It would be better if tenants had the full list of KV nodes when starting on top of a host test cluster. But I think @knz is right, this is a compounding problem.

Outside the scope of this PR, but in general I think TestCluster needs an overhaul to have proper virtual cluster support (multiple tenants). It already has methods like WaitForTenantCapabilities. Instead of having separate utility classes for managing tenants on top of test clusters we should provide a more general solution accessible to other parts of the codebase as well.

Thanks Herko.

In the interest of helping Zach make progress here, could you advise what is the smallest amount of incremental improvement that Zach can do in this PR, ahead of making the testserver API better?

Then, separately, I would welcome you explaining what you recommend for the API into a new (separate) issue under epic CRDB-18499. Cheers

@dhartunian dhartunian removed their request for review July 11, 2023 14:56
Copy link
Contributor Author

@zachlite zachlite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @knz)


pkg/ccl/serverccl/tenant_test_utils.go line 144 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Thanks Herko.

In the interest of helping Zach make progress here, could you advise what is the smallest amount of incremental improvement that Zach can do in this PR, ahead of making the testserver API better?

Then, separately, I would welcome you explaining what you recommend for the API into a new (separate) issue under epic CRDB-18499. Cheers

Thanks for catching this.

Since the the primary goal of this PR is to de-flake the mentioned tests, I'll revert the change I made to newTestTenantHelper.

The fan-out path is already covered by the tests in pkg/server, so I think the status quo (sans the flakes 😄 ) is acceptable here.

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @knz)


pkg/ccl/serverccl/tenant_test_utils.go line 144 at r1 (raw file):

Previously, zachlite wrote…

Thanks for catching this.

Since the the primary goal of this PR is to de-flake the mentioned tests, I'll revert the change I made to newTestTenantHelper.

The fan-out path is already covered by the tests in pkg/server, so I think the status quo (sans the flakes 😄 ) is acceptable here.

Given the current state of things I think it's best just to fix the flake as @zachlite mentioned (Thanks!).
@knz I'll create an issue under the CRDB-18499 epic.

This commit fixes flaky behavior while running
`TestSpanStatsFanOut` and `TestTenantSpanStats` under stress.

Both tests have been updated to ensure the following behavior:
- The tests make sure range splits occur before proceeding.
- The tests will retry their assertions to give the new key-value pairs
time to replicate.

Additionally, `NewTestTenantHelper` was updated to accept a parameter
for the number of nodes in the test host cluster. `TestTenantSpanStats`
now uses a 3-node cluster to test a real fan-out.

Resolves cockroachdb#99770
Resolves cockroachdb#99559
Epic:none
Release note: None
@zachlite zachlite force-pushed the 230710.span-stats-fan-out-flake branch from 68753c7 to f1bc66e Compare July 11, 2023 16:52
@zachlite zachlite added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jul 11, 2023
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)

@zachlite
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 11, 2023

Build succeeded:

@craig craig bot merged commit 0b628fb into cockroachdb:master Jul 11, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 11, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-106551 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/106636/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
4 participants