Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increase ticks per slot for test #3491

Merged
merged 1 commit into from
Nov 6, 2024

Conversation

bw-solana
Copy link

@bw-solana bw-solana commented Nov 6, 2024

Problem

fn test_wait_for_max_stake has been flaky for a while. See #3295 and #3483 for some context.

One of the new problems discovered is that there is a race condition that can lead to 1 or more nodes not participating in voting. The chain of events is supposed to look like this:

  1. Pull requests sent
  2. Gossip votes observed
  3. Insert repair tree
  4. Request orphan repairs
  5. Replay & freeze blocks
  6. Vote (once staked)
  7. OC blocks & make roots
  8. Generate leader schedule
  9. Keep activating stake

However, votes can get filtered out as part of step 2 such that they are never observed and thus we never repair/replay/vote/etc.

The filtering happens for a couple of reasons:

  1. At first, we reject the votes because we don't see the vote account key in epoch authorized voters. E.g. the voting validator doesn't show up as staked until epoch 3 but we're seeing votes for epoch 0.
  2. Later on, we fail because we don't have epoch stakes for the epoch because root bank never advances (because we're not voting and thus not rooting anything) and we only compute leader schedule 3 epochs ahead

This problem got worse with the changes in #3295 that reduced ticks per slot from 64 (default) to 16. This was an attempt to speed up this long running test, but widens the race condition window.

Summary of Changes

Increase ticks per slot from 16 to 32

@bw-solana bw-solana marked this pull request as ready for review November 6, 2024 03:23
Copy link

@AshwinSekar AshwinSekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this test!

@bw-solana bw-solana merged commit 22443e7 into anza-xyz:master Nov 6, 2024
40 checks passed
@bw-solana bw-solana deleted the fix_flaky_test branch November 6, 2024 17:06
@steviez
Copy link

steviez commented Nov 6, 2024

Definitely in favor of the change making the test less flaky. That being said, I'm wondering if we can simplify this test from being a local-cluster test. Looking back at the PR in which this was added (solana-labs#13532), the aim of this test appears to be just confirming the RPC client (which isn't really publicized) is functional.

This would certainly take some effort to build up, but a mock RPC server could seemingly accomplish this as well without spinning up the local cluster

@bw-solana
Copy link
Author

Definitely in favor of the change making the test less flaky. That being said, I'm wondering if we can simplify this test from being a local-cluster test. Looking back at the PR in which this was added (solana-labs#13532), the aim of this test appears to be just confirming the RPC client (which isn't really publicized) is functional.

This would certainly take some effort to build up, but a mock RPC server could seemingly accomplish this as well without spinning up the local cluster

Do we have a test anywhere that verifies stake activation behavior? This is the only one I'm aware of that implicitly covers that behavior.

If we have coverage elsewhere, I'm definitely in favor of simplifying this.

@steviez
Copy link

steviez commented Nov 6, 2024

Do we have a test anywhere that verifies stake activation behavior?

Haven't really looked around so not sure

that implicitly covers

😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants