
fix flaky test #3295

Merged: 4 commits into anza-xyz:master on Oct 24, 2024
Conversation

@bw-solana commented Oct 24, 2024

Problem

test_wait_for_max_stake is notoriously flaky. Even worse, when it fails, it takes an hour to time out.

The following is observed when the test fails:

  1. Stake of the highest staked validator gets stuck, usually around 71%
  2. Stake is stuck because stake from the other 3 validators is not activating
  3. Stake is not activating because epochs are not advancing
  4. Epochs are not advancing because nobody thinks they are leader
  5. Nobody thinks they are leader because the new leader schedule is not generated
  6. New leader schedule is not generated because we have not rooted a new slot in the current epoch
  7. We haven't rooted a new slot in the new epoch because we have skipped at least one slot and have zero margin
  8. We have zero margin because tower height is 32, slots per epoch is 32 (so we can advance epochs quickly), and we only compute the leader slot 1 epoch worth of slots ahead of time (see the margin sketch after this list)
  9. It's not 100% clear why we skip slots, but it may be partially due to running 4 validators on one set of HW. We see replay thread not executing in time and PoH ticking past the leader slot.
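
For item 8, here is a rough back-of-the-envelope sketch of the margin. This is only an interpretation of the numbers above (not code from the PR), assuming that rooting a slot requires roughly a full tower depth of slots built on top of it, and that the leader schedule for an epoch must come from a root landing within the leader-schedule offset preceding that epoch:

    // Rough margin arithmetic only -- an interpretation of items 7/8 above, not PR code.
    const SLOTS_PER_EPOCH: u64 = 32;
    const TOWER_DEPTH: u64 = 32;

    fn skipped_slot_margin(leader_schedule_offset_epochs: u64) -> i64 {
        (leader_schedule_offset_epochs * SLOTS_PER_EPOCH) as i64 - TOWER_DEPTH as i64
    }

    fn main() {
        assert_eq!(skipped_slot_margin(1), 0);  // current setting: a single skipped slot is fatal
        assert_eq!(skipped_slot_margin(3), 64); // proposed setting: room to skip a slot here and there
    }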

Summary of Changes

  1. Compute the leader schedule 3 epochs worth of slots ahead of time. This will provide margin where we can skip a slot here and there and still not get stuck.
  2. Reduce ticks per slot to 16. This isn't necessary to fix the issue (in fact, it potentially makes us skip more slots), but it will reduce the test time by a factor of 4.
  3. Add a 5-minute timeout to the test. This will prevent hanging CI for an hour if we still fail.
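
Taken together, the new test parameters look roughly like the sketch below. The constant names are illustrative (not the identifiers actually used in the test), and the baseline of 64 ticks per slot is an assumption about the default:

    use std::time::Duration;

    // Illustrative constants only -- not the identifiers used in local_cluster.rs.
    const TICKS_PER_SLOT: u64 = 16; // down from the assumed default of 64 => ~4x shorter slots
    const SLOTS_PER_EPOCH: u64 = 32; // small epochs so stake activates quickly
    const LEADER_SCHEDULE_SLOT_OFFSET: u64 = 3 * SLOTS_PER_EPOCH; // 3 epochs of leader-schedule margin
    const TEST_TIMEOUT: Duration = Duration::from_secs(5 * 60); // fail in 5 minutes instead of hanging CI

    fn main() {
        println!(
            "ticks/slot={}, slots/epoch={}, leader schedule offset={} slots, timeout={:?}",
            TICKS_PER_SLOT, SLOTS_PER_EPOCH, LEADER_SCHEDULE_SLOT_OFFSET, TEST_TIMEOUT
        );
    }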

@bw-solana marked this pull request as ready for review October 24, 2024 00:23
@steviez left a comment

Your summary just about covered it. As for why there are skipped slots, yeah, there is a bit of turmoil in the beginning. Inspecting individual logs (like new banks) shows that there is some time discrepancy when new banks are getting made (and thus forking).

Not sure I care to bisect, but this test was added nearly four years ago in solana-labs#13532. I see mention of this test in May 2022 in Discord along with mention of cooldown / warmup, which is the only thing I can think of that would blow the test time up to 4 minutes ... I doubt this took 4 min when it was first implemented

@@ -2197,12 +2199,20 @@ impl RpcClient {
current_percent = 100f32 * max as f32 / total_active_stake as f32;
if current_percent < max_stake_percent {
break;
@ilya-bobyr commented Oct 24, 2024

style

If you are going to be modifying this, I suggest writing it completely in the imperative style or completely in the functional style. A mix of two styles is negatively affecting my ability to read the code, as it does not work fully with either of the existing intuitions that I have.

Imperative style:

use itertools::chain;
/* ... */

            let mut max = 0;
            let mut total = 0;
            for account in chain(&vote_accounts.current, &vote_accounts.delinquent) {
                let activated_stake = account.activated_stake;
                max = std::cmp::max(max, activated_stake);
                total += activated_stake;
            }

Functional style:

use itertools::chain;
/* ... */
            let (max, total) = chain(&vote_accounts.current, &vote_accounts.delinquent)
                .map(|account| account.activated_stake)
                .fold((0, 0), |(max, total), stake| {
                    (std::cmp::max(max, stake), total + stake)
                });

@bw-solana (Author)

Is it okay if we do this in a follow-up? Hoping to limit this commit to the functional change.

@ilya-bobyr

Absolutely.
This was a suggestion in case you were going to change this code.

@steviez left a comment

I'm being a bit pickier than usual given the pub-ness of the rpc-client crate.

Also, I think ilya-bobyr has some valid comments about the impl / test, but I agree with Brennan and am inclined to keep this PR limited to fixing the test and then a separate PR to clean it up

&self,
commitment: CommitmentConfig,
max_stake_percent: f32,
timeout: Option<Duration>,

If someone is specifically using the _with_timeout() variant when there is a no-timeout variant as well, I would think they'd have a timeout in mind and the parameter should be Duration, not Option<Duration>.

As an example, another function from this file:

    pub fn new_with_timeout<U: ToString>(url: U, timeout: Duration) -> Self {
        Self::new_sender(
            HttpSender::new_with_timeout(url, timeout),
            RpcClientConfig::with_commitment(CommitmentConfig::default()),
        )
    }

@@ -2179,8 +2179,19 @@ impl RpcClient {
&self,
commitment: CommitmentConfig,
max_stake_percent: f32,
) -> ClientResult<()> {
self.wait_for_max_stake_below_threshold_with_timeout(commitment, max_stake_percent, None)

I'm definitely pro-code reuse tho; if we want to keep the timeout around, maybe a non-pub helper that takes an Option<Duration>, and then:

  • wait_for_max_stake_below_threshold calls helper with None
  • wait_for_max_stake_below_threshold_with_timeout calls helper with Some(timeout)
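
A minimal sketch of that shape, using a simplified client type. The helper name, the polling body, and the error type below are hypothetical stand-ins, not code from this PR or from RpcClient:

    use std::time::{Duration, Instant};

    struct Client;

    impl Client {
        /// Public no-timeout variant: delegates with `None`.
        pub fn wait_for_max_stake_below_threshold(&self, max_stake_percent: f32) -> Result<(), String> {
            self.wait_for_max_stake_inner(max_stake_percent, None)
        }

        /// Public variant that takes a plain `Duration`, as suggested above.
        pub fn wait_for_max_stake_below_threshold_with_timeout(
            &self,
            max_stake_percent: f32,
            timeout: Duration,
        ) -> Result<(), String> {
            self.wait_for_max_stake_inner(max_stake_percent, Some(timeout))
        }

        /// Non-pub helper that owns the `Option<Duration>` handling.
        fn wait_for_max_stake_inner(
            &self,
            max_stake_percent: f32,
            timeout: Option<Duration>,
        ) -> Result<(), String> {
            let start = Instant::now();
            loop {
                if self.current_max_stake_percent() < max_stake_percent {
                    return Ok(());
                }
                if let Some(timeout) = timeout {
                    if start.elapsed() >= timeout {
                        return Err("timed out waiting for max stake to drop".to_string());
                    }
                }
                std::thread::sleep(Duration::from_secs(1));
            }
        }

        /// Stand-in for the real vote-account query (`get_vote_accounts` in the RPC client).
        fn current_max_stake_percent(&self) -> f32 {
            0.0
        }
    }

    fn main() {
        let client = Client;
        // With the stub above the stake is already "below threshold", so both return immediately.
        client.wait_for_max_stake_below_threshold(33.0).unwrap();
        client
            .wait_for_max_stake_below_threshold_with_timeout(33.0, Duration::from_secs(5 * 60))
            .unwrap();
    }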

@bw-solana requested a review from steviez October 24, 2024 17:23
@AshwinSekar left a comment

test changes look good thanks for tackling this

.is_ok());
// This is based on the percentage of stake that is allowed to be activated
// each epoch.
let num_expected_epochs = 14;
@AshwinSekar commented Oct 24, 2024

If you wanna be fancy, you can derive this from the constant in the sdk:

let num_expected_epochs = 3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;

The +1 on the end is a buffer in case stake doesn't start activating immediately.
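
For reference (not part of the PR), plugging in NEW_WARMUP_COOLDOWN_RATE = 0.09 (the 9% mentioned below): log base 1.09 of 3 is ln 3 / ln 1.09, about 12.75, which rounds up to 13, and the 1-epoch buffer gives 14, matching the hard-coded value being replaced:

    fn main() {
        // Assumes NEW_WARMUP_COOLDOWN_RATE = 0.09 (9%, per the discussion below).
        let new_warmup_cooldown_rate = 0.09_f64;
        let num_expected_epochs = 3f64.log(1. + new_warmup_cooldown_rate).ceil() as u32 + 1;
        assert_eq!(num_expected_epochs, 14); // matches the existing `num_expected_epochs = 14`
    }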

@bw-solana (Author) commented Oct 24, 2024

The 3 here is essentially the 3 non-bootstrap validators, right?

Does this compute the time for the stake to completely activate? Because I think the test just waits for the bootstrap node stake to fall below 1/3, which would be significantly sooner

No, actually I think the 3 is right. It just represents increasing the total amount of stake 3x, right?

@bw-solana (Author)

Also, I'm not sure if we need the +1 because this isn't trying to find the epoch in which we will hit the threshold. It's just computing the number of epochs it will take once we start activating the stake.

@AshwinSekar commented Oct 24, 2024

ya the 3 is from the 1/3. Original equation is
[image of the equation; roughly S / (S * (1 + R)^X) <= 1/3]
Where:

  • S is DEFAULT_NODE_STAKE, the bootstrap node's stake
  • R is 9% aka NEW_WARMUP_COOLDOWN_RATE
  • X is num_expected_epochs

Essentially the denominator represents the total stake of the system at each epoch; it increases by 9% per epoch.

@steviez left a comment

LGTM, I didn't read up on the num-epochs calculation tho, so maybe some confirmation from Ashwin that it looks reasonable would be good.

@bw-solana requested a review from AshwinSekar October 24, 2024 21:27
@bw-solana merged commit b1a5438 into anza-xyz:master Oct 24, 2024
40 checks passed
@bw-solana deleted the fix_test_wait_for_max_stake branch October 24, 2024 22:13
@ilya-bobyr left a comment

Sorry for the slow response.
I was double-checking the formula, and it took me some time.


Comment on lines +1565 to +1568
    let num_expected_epochs = (num_validators_activating_stake as f64)
        .log(1. + NEW_WARMUP_COOLDOWN_RATE)
        .ceil() as u32
        + 1;
@ilya-bobyr commented Oct 25, 2024

I misunderstood the stake activation process.

But I still think it is worth explaining the formula.
Even if it is as simple as

    3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;

Also, I'm not sure that we should not include the validator count.

@ilya-bobyr

The original formula was indeed correct.

Here is a suggested explanation that helps others follow along, I think:

    // `NEW_WARMUP_COOLDOWN_RATE` is the percentage of stake that can be added relative to the
    // currently active stake.  Currently set to `0.09`.
    //
    // Initially the genesis node has `DEFAULT_NODE_STAKE` of the stake, and we want
    // `num_validators_activating_stake` new validators to activate the same amount of stake as the
    // genesis node.  `DEFAULT_NODE_STAKE` is the individual node stake.  In total we want to
    // activate `DEFAULT_NODE_STAKE * num_validators_activating_stake` new stake.
    //
    // Starting with `DEFAULT_NODE_STAKE` and activating `NEW_WARMUP_COOLDOWN_RATE` per epoch, we
    // need the following to happen:
    //
    //   DEFAULT_NODE_STAKE * (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >=
    //     DEFAULT_NODE_STAKE * num_validators_activating_stake
    //
    // Simplifying:
    //
    //   (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >= num_validators_activating_stake
    //
    //   num_expected_epochs * log (1 + NEW_WARMUP_COOLDOWN_RATE) >=
    //      log (num_validators_activating_stake)
    //
    //   num_expected_epochs >=
    //       log (num_validators_activating_stake) / log (1 + NEW_WARMUP_COOLDOWN_RATE) 
    //
    // We can only wait for an integer number of epochs, so we round up.  And we add 1 more epoch in
    // case the stake is not activated in the first epoch.
    let num_expected_epochs = (num_validators_activating_stake as f64)
        .log(1. + NEW_WARMUP_COOLDOWN_RATE)
        .ceil() as u32
        + 1;


@steviez commented Oct 31, 2024

Bit of a bummer, saw this one fail again today:
https://buildkite.com/anza/agave/builds/13057#0192dfde-60c8-44f1-91d9-c0cb38b08374

@bw-solana (Author)

> Bit of a bummer, saw this one fail again today: https://buildkite.com/anza/agave/builds/13057#0192dfde-60c8-44f1-91d9-c0cb38b08374

I've been able to repro some failures on my dev machine. Seems to fail maybe 20% of the time for me.

It looks like the following is happening:

  1. one or more validators are getting stuck (looks like it's never the bootstrap node)
  2. after a while, nodes can't vote because they fail the 8 deep vote threshold check
  3. we can't OC blocks so we stop making roots
  4. we eventually stop generating leader schedule
  5. we stop activating stake

So we need to figure out why these nodes seem to get stuck. I've seen some cases where the validator appears to start up properly, freezes some blocks, and then stops receiving shreds (usually around slot 55-60). I've also seen cases where a validator appears to start up properly but never freezes any blocks.

@bw-solana (Author) commented Oct 31, 2024

Progress update: The validators get stuck because they aren't receiving the shreds. Turbine is completely busted for these local cluster environments, so we rely 100% on repair. In the failing case, some node(s) don't request repairs, so they never get shreds, replay, vote, etc.

I'm assuming turbine is broken because all the nodes are using the same localhost IP.

It's not clear why repair seems to be broken in some cases. I'll keep digging.

EDIT: Turbine is not completely busted. We skip the loopback IP checks because we are using the Unspecified Socket Addr Space. I believe the reason we don't see shreds passed via turbine at the start is that the nodes appear unstaked and thus the leader sends to nobody. After a few epochs, we start sending shreds out via turbine.

@bw-solana (Author) commented Nov 1, 2024

This is the typical happy flow for the non-bootstrap nodes:

  1. Gossip votes observed
  2. Insert repair tree
  3. Request orphan repairs
  4. Replay & freeze blocks
  5. Vote (once staked)
  6. OC blocks & make roots
  7. Generate leader schedule
  8. Keep activating stake

It looks like gossip votes often either:

  1. Stop getting observed
  2. Take a long time to start being observed

In the case of 1, many times the node starts receiving shreds over turbine, which then allows it to fill in the gaps via repair.

However, it seems like the cases where a node (and thus the cluster) gets stuck are those where the distance between the last shreds received and the new shreds received (either via turbine or repair) is more than 2 epochs.

This is because we filter out shreds in TVU fetch when they are more than 2 epochs away from the highest slot. Given epochs are only 32 slots, I see this case happen fairly often.

The 2 different cases I've observed seem to go like this:

  1. Highest bank fork slot is 0, we don't observe votes until slot >64, we issue an orphan repair, we filter out the repair shreds, RIP.
  2. We observe votes, we stop observing votes at slot X, turbine kicks on once the cluster sees us as staked, but the first epoch slot >X+64 and we filter out the shreds, RIP. Note: It's not clear why we stop observing votes via gossip for this case.

I've tested out a quick hack to only filter out shreds that are >500 slots higher than the highest slot observed in bank forks and haven't seen the test fail with this configuration yet.
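
A minimal sketch of the two filtering policies being compared (illustrative constants and function names only; this is not the actual TVU fetch code):

    /// Slots per epoch in this test cluster (32, so epochs advance quickly).
    const SLOTS_PER_EPOCH: u64 = 32;
    /// Slot-distance cap used by the quick hack described above.
    const MAX_SLOT_DISTANCE: u64 = 500;

    /// Current behavior: drop shreds more than 2 epochs past the highest slot
    /// observed in bank forks -- the filter that strands slow nodes here.
    fn keep_shred_epoch_based(shred_slot: u64, highest_slot: u64) -> bool {
        shred_slot <= highest_slot + 2 * SLOTS_PER_EPOCH
    }

    /// Hack: drop shreds only when they are more than 500 slots ahead, giving a
    /// 32-slot-epoch cluster far more headroom.
    fn keep_shred_slot_based(shred_slot: u64, highest_slot: u64) -> bool {
        shred_slot <= highest_slot + MAX_SLOT_DISTANCE
    }

    fn main() {
        // Case 1 above: highest bank-fork slot is 0 and the first repaired shreds
        // arrive past slot 64 -- the epoch-based filter drops them, the hack keeps them.
        let (highest, incoming) = (0, 70);
        assert!(!keep_shred_epoch_based(incoming, highest));
        assert!(keep_shred_slot_based(incoming, highest));
    }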

ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024