Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiregionccl: deflake the multi-region cold start test #103666

Merged

Conversation

jeffswenson
Copy link
Collaborator

@jeffswenson jeffswenson commented May 19, 2023

I ran the test 2500 times with no flakes. The core source of the flakes
is the test depends on every range having an available replica in every
region. Any amount of replica movement causes the test to fail because
many regions only have a single replica and temporary unavailability
causes the sql servers to pick a replica that is available, but
unreachable due to a networking hook injected by the test.

There are a few independent changes in this PR:

  1. It includes the settings from [DNM] multiregionccl: attempts to debug ColdStartLatencyTest #100320. The lease transfer setting
    improves test performance and disabling load based splitting helps
    avoid replica migration.
  2. Based on the insight from multiregionccl: flake in TestColdStartLatency due to timeout during user login #96334 (comment) the test
    checks the span conformance report for all servers instead of only
    one of the servers.
  3. The test no longer triggers scatters while waiting for the span
    config bounds to conform. The scatters increased test flakiness
    and are not required to achieve conformance as we proactively nudge
    replicas through the replicate queue after kvserver: nudge replicate queue on span config update #100349.
  4. WaitForFullReplication is called to ensure replication queues are
    empty and there are no pending range moves.

Fixes #96334

Release Note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jeffswenson jeffswenson force-pushed the jeffswenson-fix-cold-start-latency-test branch 2 times, most recently from 01a99cf to 34a125b Compare May 19, 2023 18:16
@jeffswenson jeffswenson requested review from rafiss and arulajmani May 19, 2023 18:18
@jeffswenson jeffswenson force-pushed the jeffswenson-fix-cold-start-latency-test branch 2 times, most recently from 62c1a72 to 9139b57 Compare May 19, 2023 18:23
@jeffswenson jeffswenson marked this pull request as ready for review May 19, 2023 18:23
@jeffswenson jeffswenson requested a review from a team as a code owner May 19, 2023 18:23
@jeffswenson jeffswenson force-pushed the jeffswenson-fix-cold-start-latency-test branch from 9139b57 to 04ab2fc Compare May 19, 2023 18:24
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: :

Nice job piecing together the all the history here and deflaking this!

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson and @rafiss)


-- commits line 25 at r1:
nit: "and are not required to achieve conformance as we proactively nudge replicas through the replicate queue after ... "

@jeffswenson jeffswenson force-pushed the jeffswenson-fix-cold-start-latency-test branch 3 times, most recently from 655c6b8 to 859b6dd Compare May 19, 2023 18:58
I ran the test 2500 times with no flakes. The core source of the flakes
is the test depends on every range having an available replica in every
region. Any amount of replica movement causes the test to fail because
many regions only have a single replica and temporary unavailability
causes the sql servers to pick a replica that is available, but
unreachable due to a networking hook injected by the test.

There are a few independent changes in this PR:

1. It includes the settings from cockroachdb#100320. The lease transfer setting
   improves test performance and disabling load based splitting helps
   avoid replica migration.
2. Based on the insight from cockroachdb#96334#issuecomment-1492610753 the test
   checks the span conformance report for all servers instead of only
   one of the servers.
3. The test no longer triggers scatters while waiting for the span
   config bounds to conform. The scatters increased test flakiness
   and are not required to achieve conformance as we proactively nudge
   replicas through the replicate queue after cockroachdb#100349.
4. WaitForFullReplication is called to ensure replication queues are
   empty and there are no pending range moves.

Fixes cockroachdb#96334

Release Note: None
@jeffswenson jeffswenson force-pushed the jeffswenson-fix-cold-start-latency-test branch from 859b6dd to 2cbd35e Compare May 19, 2023 18:59
@jeffswenson
Copy link
Collaborator Author

bors r+

@jeffswenson
Copy link
Collaborator Author

Thanks for the review!

@craig
Copy link
Contributor

craig bot commented May 19, 2023

Build succeeded:

@craig craig bot merged commit a269433 into cockroachdb:master May 19, 2023
@jeffswenson jeffswenson deleted the jeffswenson-fix-cold-start-latency-test branch May 19, 2023 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

multiregionccl: flake in TestColdStartLatency due to timeout during user login
3 participants