fix flaky test #3295
Conversation
Your summary just about covered it. As for why there are skipped slots, yeah, there is a bit of turmoil in the beginning. Inspecting individual logs (like new banks) shows that there is some time discrepancy when new banks are getting made (and thus forking).
Not sure I care to bisect, but this test was added nearly four years ago in solana-labs#13532. I see mention of this test in May 2022 in Discord along with mention of cooldown / warmup, which is the only thing I can think of that would blow the test time up to 4 minutes ... I doubt this took 4 minutes when it was first implemented.
@@ -2197,12 +2199,20 @@ impl RpcClient {
        current_percent = 100f32 * max as f32 / total_active_stake as f32;
        if current_percent < max_stake_percent {
            break;
style
If you are going to be modifying this, I suggest writing it completely in either the imperative style or the functional style. A mix of the two styles negatively affects my ability to read the code, as it does not work fully with either of my existing intuitions.
Imperative style:
use itertools::chain;
/* ... */
let mut max = 0;
let mut total = 0;
for account in chain(&vote_accounts.current, &vote_accounts.delinquent) {
    let activated_stake = account.activated_stake;
    max = std::cmp::max(max, activated_stake);
    total += activated_stake;
}
Functional style:
use itertools::chain;
/* ... */
let (max, total) = chain(&vote_accounts.current, &vote_accounts.delinquent)
    .map(|account| account.activated_stake)
    .fold((0, 0), |(max, total), stake| {
        (std::cmp::max(max, stake), total + stake)
    });
Is it okay if we do this in a follow-up? Hoping to limit this commit to the functional change.
Absolutely.
This was a suggestion in case you were going to change this code.
I'm being a bit pickier than usual given the pub-ness of the rpc-client crate.
Also, I think ilya-bobyr has some valid comments about the impl / test, but I agree with Brennan and am inclined to keep this PR limited to fixing the test, with a separate PR to clean it up.
rpc-client/src/rpc_client.rs
Outdated
        &self,
        commitment: CommitmentConfig,
        max_stake_percent: f32,
        timeout: Option<Duration>,
If someone is specifically using the _with_timeout() variant when there is a no-timeout variant as well, I would think they'd have a timeout in mind and the parameter should be Duration, not Option<Duration>.
As an example, another function from this file:
agave/rpc-client/src/rpc_client.rs
Lines 297 to 302 in ec781d3
pub fn new_with_timeout<U: ToString>(url: U, timeout: Duration) -> Self {
    Self::new_sender(
        HttpSender::new_with_timeout(url, timeout),
        RpcClientConfig::with_commitment(CommitmentConfig::default()),
    )
}
@@ -2179,8 +2179,19 @@ impl RpcClient {
        &self,
        commitment: CommitmentConfig,
        max_stake_percent: f32,
    ) -> ClientResult<()> {
        self.wait_for_max_stake_below_threshold_with_timeout(commitment, max_stake_percent, None)
I'm definitely pro-code reuse though; if we want to keep the timeout around, maybe a non-pub helper that takes an Option<Duration>, and then:
- wait_for_max_stake_below_threshold calls the helper with None
- wait_for_max_stake_below_threshold_with_timeout calls the helper with Some(timeout)
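A minimal sketch of that shape, using placeholder types so it stands alone; the helper name and loop body are illustrative assumptions, not the code in this PR, and the real methods also take a CommitmentConfig:
use std::time::{Duration, Instant};

// Placeholder types so the sketch compiles on its own; in the real crate these come
// from rpc-client / solana-sdk.
struct RpcClient;
type ClientResult<T> = Result<T, String>;

impl RpcClient {
    // Non-pub helper that owns the Option<Duration>; both public variants delegate to it.
    fn wait_for_max_stake_inner(
        &self,
        max_stake_percent: f32,
        timeout: Option<Duration>,
    ) -> ClientResult<()> {
        let start = Instant::now();
        loop {
            // ... fetch vote accounts and compute the largest stake percentage here ...
            let current_percent = 0.0_f32; // placeholder so this sketch terminates
            if current_percent < max_stake_percent {
                return Ok(());
            }
            if let Some(timeout) = timeout {
                if start.elapsed() > timeout {
                    return Err("timed out waiting for max stake to drop".to_string());
                }
            }
        }
    }

    pub fn wait_for_max_stake_below_threshold(&self, max_stake_percent: f32) -> ClientResult<()> {
        self.wait_for_max_stake_inner(max_stake_percent, None)
    }

    pub fn wait_for_max_stake_below_threshold_with_timeout(
        &self,
        max_stake_percent: f32,
        timeout: Duration,
    ) -> ClientResult<()> {
        self.wait_for_max_stake_inner(max_stake_percent, Some(timeout))
    }
}

fn main() {
    let client = RpcClient;
    // The no-timeout variant blocks until the threshold is met; the timeout variant
    // gives up after the given duration.
    client
        .wait_for_max_stake_below_threshold_with_timeout(33.0, Duration::from_secs(1))
        .unwrap();
}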
Test changes look good, thanks for tackling this.
local-cluster/tests/local_cluster.rs
Outdated
    .is_ok());
// This is based on the percentage of stake that is allowed to be activated
// each epoch.
let num_expected_epochs = 14;
If you wanna be fancy, you can derive this from the constant in the sdk:
let num_expected_epochs = 3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;
The +1 on the end is a buffer in case stake doesn't start activating immediately.
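As a quick sanity check (assuming NEW_WARMUP_COOLDOWN_RATE is 0.09, the value cited later in this thread), that expression reproduces the hard-coded 14 above; the rate is inlined here rather than imported from the sdk:
fn main() {
    // Assumed value of NEW_WARMUP_COOLDOWN_RATE, inlined for a standalone check.
    let new_warmup_cooldown_rate = 0.09_f64;
    // log base 1.09 of 3 = ln(3) / ln(1.09) ≈ 12.75, ceil -> 13, +1 buffer -> 14
    let num_expected_epochs = 3f64.log(1. + new_warmup_cooldown_rate).ceil() as u32 + 1;
    assert_eq!(num_expected_epochs, 14);
    println!("num_expected_epochs = {num_expected_epochs}");
}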
The 3 here is essentially the 3 non-bootstrap validators, right?
Does this compute the time for the stake to completely activate? Because I think the test just waits for the bootstrap node stake to fall below 1/3, which would be significantly sooner
No, actually I think the 3 is right. It just represents increasing the total amount of stake 3x, right?
Also, I'm not sure if we need the +1, because this isn't trying to find the epoch in which we will hit the threshold. It's just computing the number of epochs it will take once we start activating the stake.
LGTM. I didn't read up on the num-epochs calculation though, so some confirmation from Ashwin that it looks reasonable would be good.
Sorry for the slow response.
I was double-checking the formula, and it took me some time.
let num_expected_epochs = (num_validators_activating_stake as f64)
    .log(1. + NEW_WARMUP_COOLDOWN_RATE)
    .ceil() as u32
    + 1;
I misunderstood the stake activation process.
But I still think it is worth explaining the formula, even if it is as simple as
3f64.log(1. + NEW_WARMUP_COOLDOWN_RATE).ceil() as u32 + 1;
Also, I'm not sure that we should leave out the validator count.
The original formula was indeed correct.
Here is a suggested explanation that helps others follow along, I think:
// `NEW_WARMUP_COOLDOWN_RATE` is the percentage of stake that can be added relative to the
// currently active stake. Currently set to `0.09`.
//
// Initially the genesis node has `DEFAULT_NODE_STAKE` of the stake, and we want
// `num_validators_activating_stake` new validators to activate the same amount of stake as the
// genesis node. `DEFAULT_NODE_STAKE` is the individual node stake. In total we want to
// activate `DEFAULT_NODE_STAKE * num_validators_activating_stake` new stake.
//
// Starting with `DEFAULT_NODE_STAKE` and activating `NEW_WARMUP_COOLDOWN_RATE` per epoch, we
// need the following to happen:
//
// DEFAULT_NODE_STAKE * (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >=
// DEFAULT_NODE_STAKE * num_validators_activating_stake
//
// Simplifying:
//
// (1 + NEW_WARMUP_COOLDOWN_RATE) ^ num_expected_epochs >= num_validators_activating_stake
//
// num_expected_epochs * log (1 + NEW_WARMUP_COOLDOWN_RATE) >=
// log (num_validators_activating_stake)
//
// num_expected_epochs >=
// log (num_validators_activating_stake) / log (1 + NEW_WARMUP_COOLDOWN_RATE)
//
// We can only wait for an integer number of epochs, so we round up. And we add 1 more epoch in
// case the stake is not activated in the first epoch.
let num_expected_epochs = (num_validators_activating_stake as f64)
.log(1. + NEW_WARMUP_COOLDOWN_RATE)
.ceil() as u32
+ 1;
Bit of a bummer, saw this one fail again today:
I've been able to repro some failures on my dev machine. Seems to fail maybe 20% of the time for me. It looks like the following is happening:
So we need to figure out why these nodes seem to get stuck. I've seen some cases where the validator appears to start up properly, freezes some blocks, and then stops receiving shreds (usually around slot 55-60). I've also seen cases where the validator appears to start up properly but never freezes any blocks.
Progress update: the validators get stuck because they aren't receiving the shreds.
It's not clear why repair seems to be broken in some cases. I'll keep digging. EDIT: Turbine is not completely busted. We skip the loopback IP checks because we are using the Unspecified Socket Addr Space. I believe the reason we don't see shreds passed via turbine to start is that the nodes appear unstaked, and thus the leader sends to nobody. After a few epochs, we start sending shreds out via turbine.
This is the typical happy flow for the non-bootstrap nodes:
It looks like gossip votes often either:
In the case of 1, many times the node starts receiving shreds over turbine, which then allows them to fill in the gaps via repair. However, it seems like the cases where a node (and thus the cluster) gets stuck are when the distance between the last shreds received and the new shreds received (either via turbine or repair) is > 2 epochs. This is because we filter out shreds in TVU fetch when they are more than 2 epochs away from the highest slot. Given epochs are only 32 slots, I see this case happen fairly often. The 2 different cases I've observed seem to go like this:
I've tested out a quick hack to only filter out shreds that are >500 slots higher than the highest slot observed in bank forks, and I haven't seen the test fail with this configuration yet.
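A minimal sketch of that filtering rule, with hypothetical names and a hand-picked constant; this is not the actual shred-fetch code, just an illustration of the experiment described above:
// Discard a shred only when it is more than MAX_SLOT_DISTANCE slots ahead of the
// highest slot seen in bank forks, instead of the 2-epoch distance check.
const MAX_SLOT_DISTANCE: u64 = 500; // assumed value from the experiment above

fn should_discard_shred(shred_slot: u64, max_bank_forks_slot: u64) -> bool {
    shred_slot > max_bank_forks_slot.saturating_add(MAX_SLOT_DISTANCE)
}

fn main() {
    // With 32-slot epochs, slot 120 is more than 2 epochs past slot 50, so the old
    // rule would drop it, but it is well within 500 slots and is kept here.
    assert!(!should_discard_shred(120, 50));
    assert!(should_discard_shred(600, 50));
    println!("filter behaves as expected");
}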
* fix flaky test
Problem
test_wait_for_max_stake is notoriously flaky. Even worse, when it fails, it takes an hour to time out. The following is observed when the test fails:
Summary of Changes