
fix flaky test #3295

Merged: 4 commits into anza-xyz:master on Oct 24, 2024
Conversation

@bw-solana commented Oct 24, 2024

Problem

test_wait_for_max_stake is notoriously flaky. Even worse, when it fails, it takes an hour to time out.

The following is observed when the test fails:

  1. Stake of the highest staked validator gets stuck, usually around 71%
  2. Stake is stuck because stake from the other 3 validators is not activating
  3. Stake is not activating because epochs are not advancing
  4. Epochs are not advancing because nobody thinks they are leader
  5. Nobody thinks they are leader because the new leader schedule is not generated
  6. New leader schedule is not generated because we have not rooted a new slot in the current epoch
  7. We haven't rooted a new slot in the new epoch because we have skipped at least one slot and have zero margin
  8. We have zero margin because tower height is 32, slots per epoch is 32 (so we can advance epochs quickly), and we only compute the leader slot 1 epoch worth of slots ahead of time (see the margin sketch after this list)
  9. It's not 100% clear why we skip slots, but it may be partially due to running 4 validators on one set of HW. We see replay thread not executing in time and PoH ticking past the leader slot.
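
For item 8, here is a rough back-of-the-envelope sketch of the margin. This is only an interpretation of the numbers above (not code from the PR), assuming that rooting a slot requires roughly a full tower depth of slots built on top of it, and that the leader schedule for an epoch must come from a root landing within the leader-schedule offset preceding that epoch:

    // Rough margin arithmetic only -- an interpretation of items 7/8 above, not PR code.
    const SLOTS_PER_EPOCH: u64 = 32;
    const TOWER_DEPTH: u64 = 32;

    fn skipped_slot_margin(leader_schedule_offset_epochs: u64) -> i64 {
        (leader_schedule_offset_epochs * SLOTS_PER_EPOCH) as i64 - TOWER_DEPTH as i64
    }

    fn main() {
        assert_eq!(skipped_slot_margin(1), 0);  // current setting: a single skipped slot is fatal
        assert_eq!(skipped_slot_margin(3), 64); // proposed setting: room to skip a slot here and there
    }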

Summary of Changes

  1. Compute the leader schedule 3 epochs worth of slots ahead of time. This will provide margin where we can skip a slot here and there and still not get stuck.
  2. Reduce ticks per slot to 16. This isn't necessary to fix the issue (in fact, it potentially makes us skip more slots), but it will reduce the test time by a factor of 4.
  3. Add a 5-minute timeout to the test. This will prevent hanging CI for an hour if we still fail.
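
Taken together, the new test parameters look roughly like the sketch below. The constant names are illustrative (not the identifiers actually used in the test), and the baseline of 64 ticks per slot is an assumption about the default:

    use std::time::Duration;

    // Illustrative constants only -- not the identifiers used in local_cluster.rs.
    const TICKS_PER_SLOT: u64 = 16; // down from the assumed default of 64 => ~4x shorter slots
    const SLOTS_PER_EPOCH: u64 = 32; // small epochs so stake activates quickly
    const LEADER_SCHEDULE_SLOT_OFFSET: u64 = 3 * SLOTS_PER_EPOCH; // 3 epochs of leader-schedule margin
    const TEST_TIMEOUT: Duration = Duration::from_secs(5 * 60); // fail in 5 minutes instead of hanging CI

    fn main() {
        println!(
            "ticks/slot={}, slots/epoch={}, leader schedule offset={} slots, timeout={:?}",
            TICKS_PER_SLOT, SLOTS_PER_EPOCH, LEADER_SCHEDULE_SLOT_OFFSET, TEST_TIMEOUT
        );
    }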

@bw-solana marked this pull request as ready for review October 24, 2024 00:23
@steviez left a comment

Your summary just about covered it. As for why there are skipped slots, yeah, there is a bit of turmoil in the beginning. Inspecting individual logs (like new banks) shows that there is some time discrepancy when new banks are getting made (and thus forking).

Not sure I care to bisect, but this test was added nearly four years ago in solana-labs#13532. I see mention of this test in May 2022 in Discord along with mention of cooldown / warmup, which is the only thing I can think of that would blow the test time up to 4 minutes ... I doubt this took 4 min when it was first implemented

@@ -2197,12 +2199,20 @@ impl RpcClient {
current_percent = 100f32 * max as f32 / total_active_stake as f32;
if current_percent < max_stake_percent {
break;
@ilya-bobyr commented Oct 24, 2024

style

If you are going to be modifying this, I suggest writing it completely in the imperative style or completely in the functional style. A mix of two styles is negatively affecting my ability to read the code, as it does not work fully with either of the existing intuitions that I have.

Imperative style:

use itertools::chain;
/* ... */

            let mut max = 0;
            let mut total = 0;
            for account in chain(&vote_accounts.current, &vote_accounts.delinquent) {
                let activated_stake = account.activated_stake;
                max = std::cmp::max(max, activated_stake);
                total += activated_stake;
            }

Functional style:

use itertools::chain;
/* ... */
            let (max, total) = chain(&vote_accounts.current, &vote_accounts.delinquent)
                .map(|account| account.activated_stake)
                .fold((0, 0), |(max, total), stake| {
                    (std::cmp::max(max, stake), total + stake)
                });

@bw-solana (Author)

Is it okay if we do this in a follow-up? Hoping to limit this commit to the functional change.

@ilya-bobyr

Absolutely.
This was a suggestion in case you were going to change this code.

@steviez left a comment

I'm being a bit pickier than usual given the pub-ness of the rpc-client crate.

Also, I think ilya-bobyr has some valid comments about the impl / test, but I agree with Brennan and am inclined to keep this PR limited to fixing the test and then a separate PR to clean it up

&self,
commitment: CommitmentConfig,
max_stake_percent: f32,
timeout: Option<Duration>,

If someone is specifically using the _with_timeout() variant when there is a no-timeout variant as well, I would think they'd have a timeout in mind and the parameter should be Duration, not Option<Duration>.

As an example, another function from this file:

    pub fn new_with_timeout<U: ToString>(url: U, timeout: Duration) -> Self {
        Self::new_sender(
            HttpSender::new_with_timeout(url, timeout),
            RpcClientConfig::with_commitment(CommitmentConfig::default()),
        )
    }

@@ -2179,8 +2179,19 @@ impl RpcClient {
&self,
commitment: CommitmentConfig,
max_stake_percent: f32,
) -> ClientResult<()> {
self.wait_for_max_stake_below_threshold_with_timeout(commitment, max_stake_percent, None)

I'm definitely pro-code reuse tho; if we want to keep the timeout around, maybe a non-pub helper that takes an Option<Duration>, and then:

  • wait_for_max_stake_below_threshold calls helper with None
  • wait_for_max_stake_below_threshold_with_timeout calls helper with Some(timeout)
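
A minimal sketch of that shape, using a simplified client type. The helper name, the polling body, and the error type below are hypothetical stand-ins, not code from this PR or from RpcClient:

    use std::time::{Duration, Instant};

    struct Client;

    impl Client {
        /// Public no-timeout variant: delegates with `None`.
        pub fn wait_for_max_stake_below_threshold(&self, max_stake_percent: f32) -> Result<(), String> {
            self.wait_for_max_stake_inner(max_stake_percent, None)
        }

        /// Public variant that takes a plain `Duration`, as suggested above.
        pub fn wait_for_max_stake_below_threshold_with_timeout(
            &self,
            max_stake_percent: f32,
            timeout: Duration,
        ) -> Result<(), String> {
            self.wait_for_max_stake_inner(max_stake_percent, Some(timeout))
        }

        /// Non-pub helper that owns the `Option<Duration>` handling.
        fn wait_for_max_stake_inner(
            &self,
            max_stake_percent: f32,
            timeout: Option<Duration>,
        ) -> Result<(), String> {
            let start = Instant::now();
            loop {
                if self.current_max_stake_percent() < max_stake_percent {
                    return Ok(());
                }
                if let Some(timeout) = timeout {
                    if start.elapsed() >= timeout {
                        return Err("timed out waiting for max stake to drop".to_string());
                    }
                }
                std::thread::sleep(Duration::from_secs(1));
            }
        }

        /// Stand-in for the real vote-account query (`get_vote_accounts` in the RPC client).
        fn current_max_stake_percent(&self) -> f32 {
            0.0
        }
    }

    fn main() {
        let client = Client;
        // With the stub above the stake is already "below threshold", so both return immediately.
        client.wait_for_max_stake_below_threshold(33.0).unwrap();
        client
            .wait_for_max_stake_below_threshold_with_timeout(33.0, Duration::from_secs(5 * 60))
            .unwrap();
    }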

@bw-solana requested a review from steviez October 24, 2024 17:23
@AshwinSekar left a comment

test changes look good thanks for tackling this

.is_ok());
// This is based on the percentage of stake that is allowed to be activated
// each epoch.
let num_expected_epochs = 14;
@AshwinSekar commented Oct 24, 2024

If you wanna be fancy, you can derive this from the constant in the sdk:

let num_expected_epochs = 3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;

The +1 on the end is a buffer in case stake doesn't start activating immediately.
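
For reference (not part of the PR), plugging in NEW_WARMUP_COOLDOWN_RATE = 0.09 (the 9% mentioned below): log base 1.09 of 3 is ln 3 / ln 1.09, about 12.75, which rounds up to 13, and the 1-epoch buffer gives 14, matching the hard-coded value being replaced:

    fn main() {
        // Assumes NEW_WARMUP_COOLDOWN_RATE = 0.09 (9%, per the discussion below).
        let new_warmup_cooldown_rate = 0.09_f64;
        let num_expected_epochs = 3f64.log(1. + new_warmup_cooldown_rate).ceil() as u32 + 1;
        assert_eq!(num_expected_epochs, 14); // matches the existing `num_expected_epochs = 14`
    }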

@bw-solana (Author) commented Oct 24, 2024

The 3 here is essentially the 3 non-bootstrap validators, right?

Does this compute the time for the stake to completely activate? Because I think the test just waits for the bootstrap node stake to fall below 1/3, which would be significantly sooner

No, actually I think the 3 is right. It just represents increasing the total amount of stake 3x, right?

@bw-solana (Author)

Also, I'm not sure if we need the +1 because this isn't trying to find the epoch in which we will hit the threshold. It's just computing the number of epochs it will take once we start activating the stake.

@AshwinSekar commented Oct 24, 2024

ya the 3 is from the 1/3. Original equation is
[image of the equation; roughly S / (S * (1 + R)^X) <= 1/3]
Where:

  • S is DEFAULT_NODE_STAKE, the bootstrap node's stake
  • R is 9% aka NEW_WARMUP_COOLDOWN_RATE
  • X is num_expected_epochs

Essentially the denominator represents the total stake of the system at each epoch; it increases by 9% per epoch.

@steviez left a comment

LGTM, I didn't read up on the num-epochs calculation tho, so maybe some confirmation from Ashwin that it looks reasonable would be good.

@bw-solana requested a review from AshwinSekar October 24, 2024 21:27
@bw-solana merged commit b1a5438 into anza-xyz:master Oct 24, 2024
40 checks passed
@bw-solana deleted the fix_test_wait_for_max_stake branch October 24, 2024 22:13
@ilya-bobyr left a comment

Sorry for the slow response.
I was double-checking the formula, and it took me some time.


Comment on lines +1565 to +1568
    let num_expected_epochs = (num_validators_activating_stake as f64)
        .log(1. + NEW_WARMUP_COOLDOWN_RATE)
        .ceil() as u32
        + 1;
@ilya-bobyr commented Oct 25, 2024

I misunderstood the stake activation process.

But I still think it is worth explaining the formula.
Even if it is as simple as

    3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;

Also, I'm not sure that we should not include the validator count.

@ilya-bobyr

The original formula was indeed correct.

Here is a suggested explanation that helps others follow along, I think:

    // `NEW_WARMUP_COOLDOWN_RATE` is the percentage of stake that can be added relative to the
    // currently active stake.  Currently set to `0.09`.
    //
    // Initially the genesis node has `DEFAULT_NODE_STAKE` of the stake, and we want
    // `num_validators_activating_stake` new validators to activate the same amount of stake as the
    // genesis node.  `DEFAULT_NODE_STAKE` is the individual node stake.  In total we want to
    // activate `DEFAULT_NODE_STAKE * num_validators_activating_stake` new stake.
    //
    // Starting with `DEFAULT_NODE_STAKE` and activating `NEW_WARMUP_COOLDOWN_RATE` per epoch, we
    // need the following to happen:
    //
    //   DEFAULT_NODE_STAKE * (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >=
    //     DEFAULT_NODE_STAKE * num_validators_activating_stake
    //
    // Simplifying:
    //
    //   (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >= num_validators_activating_stake
    //
    //   num_expected_epochs * log (1 + NEW_WARMUP_COOLDOWN_RATE) >=
    //      log (num_validators_activating_stake)
    //
    //   num_expected_epochs >=
    //       log (num_validators_activating_stake) / log (1 + NEW_WARMUP_COOLDOWN_RATE) 
    //
    // We can only wait for an integer number of epochs, so we round up.  And we add 1 more epoch in
    // case the stake is not activated in the first epoch.
    let num_expected_epochs = (num_validators_activating_stake as f64)
        .log(1. + NEW_WARMUP_COOLDOWN_RATE)
        .ceil() as u32
        + 1;


@steviez commented Oct 31, 2024

Bit of a bummer, saw this one fail again today:
https://buildkite.com/anza/agave/builds/13057#0192dfde-60c8-44f1-91d9-c0cb38b08374

@bw-solana (Author)

> Bit of a bummer, saw this one fail again today: https://buildkite.com/anza/agave/builds/13057#0192dfde-60c8-44f1-91d9-c0cb38b08374

I've been able to repro some failures on my dev machine. Seems to fail maybe 20% of the time for me.

It looks like the following is happening:

  1. one or more validators are getting stuck (looks like it's never the bootstrap node)
  2. after a while, nodes can't vote because they fail the 8 deep vote threshold check
  3. we can't OC blocks so we stop making roots
  4. we eventually stop generating leader schedule
  5. we stop activating stake

So we need to figure out why these nodes seem to get stuck. I've seen some cases where the validator appears to start up properly, freezes some blocks, and then stops receiving shreds (usually around slot 55-60). I've also seen cases where a validator appears to start up properly but never freezes any blocks.

@bw-solana (Author) commented Oct 31, 2024

Progress update: The validators get stuck because they aren't receiving the shreds. Turbine is completely busted for these local cluster environments, so we rely 100% on repair. In the failing case, some node(s) don't request repairs, so they never get shreds, replay, vote, etc.

I'm assuming turbine is broken because all the nodes are using the same localhost IP.

It's not clear why repair seems to be broken in some cases. I'll keep digging.

EDIT: Turbine is not completely busted. We skip the loopback IP checks because we are using the Unspecified Socket Addr Space. I believe the reason we don't see shreds passed via turbine at the start is that the nodes appear unstaked and thus the leader sends to nobody. After a few epochs, we start sending shreds out via turbine.

@bw-solana (Author) commented Nov 1, 2024

This is the typical happy flow for the non-bootstrap nodes:

  1. Gossip votes observed
  2. Insert repair tree
  3. Request orphan repairs
  4. Replay & freeze blocks
  5. Vote (once staked)
  6. OC blocks & make roots
  7. Generate leader schedule
  8. Keep activating stake

It looks like gossip votes often either:

  1. Stop getting observed
  2. Take a long time to start being observed

In the case of 1, many times the node starts receiving shreds over turbine, which then allows it to fill in the gaps via repair.

However, it seems like the cases where a node (and thus the cluster) gets stuck are those where the distance between the last shreds received and the new shreds received (either via turbine or repair) is more than 2 epochs.

This is because we filter out shreds in TVU fetch when they are more than 2 epochs away from the highest slot. Given epochs are only 32 slots, I see this case happen fairly often.

The 2 different cases I've observed seem to go like this:

  1. Highest bank fork slot is 0, we don't observe votes until slot >64, we issue an orphan repair, we filter out the repair shreds, RIP.
  2. We observe votes, we stop observing votes at slot X, turbine kicks on once the cluster sees us as staked, but the first epoch slot >X+64 and we filter out the shreds, RIP. Note: It's not clear why we stop observing votes via gossip for this case.

I've tested out a quick hack to only filter out shreds that are >500 slots higher than the highest slot observed in bank forks and haven't seen the test fail with this configuration yet.
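
A minimal sketch of the two filtering policies being compared (illustrative constants and function names only; this is not the actual TVU fetch code):

    /// Slots per epoch in this test cluster (32, so epochs advance quickly).
    const SLOTS_PER_EPOCH: u64 = 32;
    /// Slot-distance cap used by the quick hack described above.
    const MAX_SLOT_DISTANCE: u64 = 500;

    /// Current behavior: drop shreds more than 2 epochs past the highest slot
    /// observed in bank forks -- the filter that strands slow nodes here.
    fn keep_shred_epoch_based(shred_slot: u64, highest_slot: u64) -> bool {
        shred_slot <= highest_slot + 2 * SLOTS_PER_EPOCH
    }

    /// Hack: drop shreds only when they are more than 500 slots ahead, giving a
    /// 32-slot-epoch cluster far more headroom.
    fn keep_shred_slot_based(shred_slot: u64, highest_slot: u64) -> bool {
        shred_slot <= highest_slot + MAX_SLOT_DISTANCE
    }

    fn main() {
        // Case 1 above: highest bank-fork slot is 0 and the first repaired shreds
        // arrive past slot 64 -- the epoch-based filter drops them, the hack keeps them.
        let (highest, incoming) = (0, 70);
        assert!(!keep_shred_epoch_based(incoming, highest));
        assert!(keep_shred_slot_based(incoming, highest));
    }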

ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024