increase ticks per slot for test #3491

bw-solana · 2024-11-06T00:44:28Z

Problem

fn test_wait_for_max_stake has been flaky for a while. See #3295 and #3483 for some context.

One of the newly discovered problems is a race condition that can leave one or more nodes not participating in voting. The chain of events is supposed to look like this:

1. Pull requests sent
2. Gossip votes observed
3. Insert repair tree
4. Request orphan repairs
5. Replay & freeze blocks
6. Vote (once staked)
7. OC blocks & make roots
8. Generate leader schedule
9. Keep activating stake

However, votes can get filtered out as part of step 2 (gossip votes observed). When that happens the votes are never observed, so the node never repairs, replays, votes, and so on.

The filtering happens for a couple of reasons:

- At first, we reject the votes because we don't see the vote account key in the epoch's authorized voters. E.g. the voting validator doesn't show up as staked until epoch 3, but we're seeing votes for epoch 0.
- Later on, we fail because we don't have epoch stakes for the vote's epoch. The root bank never advances (we're not voting, so we're not rooting anything), and we only compute the leader schedule 3 epochs ahead.
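The two rejection paths can be sketched with a small standalone model. This is only an illustration, not the actual validator code: `EpochVoters`, `VoteRejection`, `check_vote`, and `example_voters` are hypothetical names, and vote accounts are plain strings rather than real pubkeys.

```rust
use std::collections::{HashMap, HashSet};

type Epoch = u64;

/// Why a gossip vote is dropped in this simplified model.
#[derive(Debug, PartialEq)]
enum VoteRejection {
    /// The node has no stake information for the vote's epoch.
    MissingEpochStakes,
    /// The epoch is known, but the vote account is not an authorized voter in it.
    UnknownVoteAccount,
}

/// Hypothetical stand-in for the per-epoch authorized voter sets a node tracks.
struct EpochVoters {
    voters_by_epoch: HashMap<Epoch, HashSet<String>>,
}

impl EpochVoters {
    /// Accept a vote only if the epoch is known and the account is authorized in it.
    fn check_vote(&self, epoch: Epoch, vote_account: &str) -> Result<(), VoteRejection> {
        let voters = self
            .voters_by_epoch
            .get(&epoch)
            .ok_or(VoteRejection::MissingEpochStakes)?;
        if voters.contains(vote_account) {
            Ok(())
        } else {
            Err(VoteRejection::UnknownVoteAccount)
        }
    }
}

/// Assumed scenario: the root is stuck in epoch 0, so only epochs 0..=3 are known,
/// and "validator-b" only becomes an authorized voter in epoch 3.
fn example_voters() -> EpochVoters {
    let mut voters_by_epoch = HashMap::new();
    for epoch in 0..=3u64 {
        let mut voters = HashSet::new();
        voters.insert("validator-a".to_string());
        if epoch == 3 {
            voters.insert("validator-b".to_string());
        }
        voters_by_epoch.insert(epoch, voters);
    }
    EpochVoters { voters_by_epoch }
}
```

In this model, `example_voters().check_vote(0, "validator-b")` hits the first rejection, and a vote for epoch 5 hits the second because the stuck root means stakes for that epoch were never computed.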

This problem got worse with the changes in #3295 that reduced ticks per slot from 64 (default) to 16. That was an attempt to speed up this long-running test, but shorter slots widen the race condition window.

Summary of Changes

Increase ticks per slot from 16 to 32
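
For a rough sense of scale, here is the slot duration for each ticks-per-slot value mentioned in this PR. The sketch assumes the default tick rate of 160 ticks per second; the test's cluster config may use a different rate, and `slot_duration_ms` is a hypothetical helper, not an existing function.

```rust
/// Assumed default tick rate (ticks per second); the test's cluster config may differ.
const ASSUMED_TICKS_PER_SECOND: u64 = 160;

/// Slot duration in milliseconds for a given ticks-per-slot setting.
fn slot_duration_ms(ticks_per_slot: u64) -> u64 {
    ticks_per_slot * 1_000 / ASSUMED_TICKS_PER_SECOND
}
```

Under that assumption, 64 ticks per slot gives 400 ms slots, 16 gives 100 ms, and 32 gives 200 ms, so this change doubles the slot time relative to #3295 while staying at half the default.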

AshwinSekar

Thanks for fixing this test!

steviez · 2024-11-06T17:51:35Z

Definitely in favor of the change making the test less flaky. That being said, I'm wondering if we can simplify this test from being a local-cluster test. Looking back at the PR in which this was added (solana-labs#13532), the aim of this test appears to be just confirming the RPC client (which isn't really publicized) is functional.

This would certainly take some effort to build up, but a mock RPC server could seemingly accomplish this as well without spinning up the local cluster

bw-solana · 2024-11-06T18:21:22Z

> Definitely in favor of the change making the test less flaky. That being said, I'm wondering if we can simplify this test from being a local-cluster test. Looking back at the PR in which this was added (solana-labs#13532), the aim of this test appears to be just confirming the RPC client (which isn't really publicized) is functional.
>
> This would certainly take some effort to build up, but a mock RPC server could seemingly accomplish this as well without spinning up the local cluster

Do we have a test anywhere that verifies stake activation behavior? This is the only one I'm aware of that implicitly covers that behavior.

If we have coverage elsewhere, I'm definitely in favor of simplifying this.

steviez · 2024-11-06T20:32:01Z

> Do we have a test anywhere that verifies stake activation behavior?

Haven't really looked around so not sure

> that implicitly covers

😅

increase ticks per slot for test

ad135c2

bw-solana marked this pull request as ready for review November 6, 2024 03:23

bw-solana requested review from steviez and AshwinSekar November 6, 2024 03:23

AshwinSekar approved these changes Nov 6, 2024

bw-solana merged commit 22443e7 into anza-xyz:master Nov 6, 2024
40 checks passed

bw-solana deleted the fix_flaky_test branch November 6, 2024 17:06
