-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replay: extend last fec set check for 32+ retransmitter signed shreds #2101
replay: extend last fec set check for 32+ retransmitter signed shreds #2101
Conversation
9b3728d
to
ef2185c
Compare
ledger/src/blockstore.rs
Outdated
"incomplete_final_fec_set", | ||
("slot", slot, i64), | ||
("hash", hash.to_string(), String) | ||
); | ||
} | ||
if !self.is_retransmitter_signed { | ||
datapoint_warn!( | ||
"invalid_retransmitter_signature_final_fec_set", | ||
("slot", slot, i64), | ||
("hash", hash.to_string(), String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently these hash
things are pretty expensive for metrics server because they do not compress.
And until chained merkle shreds are rolled out every node is going to submit this metric for every slot which is too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good i've gated this with should_chain_merkle_shreds
so that we don't wastefully submit this metric prematurely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.
I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.
So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the should_chain_merkle_shreds
function is the one that controls the chained merkle rollout?
As in, you will slowly update this function to target different cluster types and increasing slot ranges until it is always true
.
By gating this metric behind should_chain_merkle_shreds
we can verify if it is performing correctly on chained merkle shred slots only, and once should_chain_merkle_shreds
is always true and the metric looks correct we can turn on feature_set::vote_only_retransmitter_signed_fec_sets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds
as there would be disagreement on whether this slot should use the new shred format.
What do you think about using feature_set::verify_retransmitter_signature::id()
to gate the metric instead? since I assume we will activate that feature flag first (after chained merkle roots are fully rolled out) but before turning on this one.
If not, I can do as you suggest and remove the metric entirely and add it back before we intend to activate the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.
yeah, that is the issue with using should_chain_merkle_shreds.
What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead?
I think the code is over crowded with feature gate branches already, and I would lean to keep it simple and not to add more branches.
ef2185c
to
509cdb9
Compare
ledger/src/shred.rs
Outdated
@@ -1267,6 +1283,10 @@ pub fn verify_test_data_shred( | |||
} | |||
} | |||
|
|||
pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved from standard_broadcast_run
in order to resolve dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a note to move this back if the metric gate on should_chain_merkle_shreds
is removed.
ledger/src/blockstore.rs
Outdated
"incomplete_final_fec_set", | ||
("slot", slot, i64), | ||
("hash", hash.to_string(), String) | ||
); | ||
} | ||
if !self.is_retransmitter_signed { | ||
datapoint_warn!( | ||
"invalid_retransmitter_signature_final_fec_set", | ||
("slot", slot, i64), | ||
("hash", hash.to_string(), String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.
I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.
So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.
ledger/src/blockstore.rs
Outdated
// These metrics are expensive to send because hash does not compress well. | ||
// Only send these metrics when we are sure the appropriate shred format is being sent | ||
if !self.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) { | ||
datapoint_warn!( | ||
"invalid_retransmitter_signature_final_fec_set", | ||
("slot", slot, i64), | ||
("bank_hash", bank_hash.to_string(), String) | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in the other comment, maybe we should remove this metric entirely for now and add it back once chained merkle shreds are fully rolled out.
The replay check that introduced this metric: On testnet however we do observe it quite frequently: It seems that something is wrong with the check. I will hold off on this change and investigate the previous check and get back to you. |
False alarm, I checked the block producer for blocks that have been failing the check on testnet today and they are all running firedancer:
It seems that FD is not padding the last fec set yet, so this is to be expected. Also I realized that there is no reason to send the bank hash in metrics, as this check is happening before the bank is frozen so the hash is not yet populated 🤦 EDIT:
|
7daf78c
to
acac298
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, aside from minor comments.
&bank.feature_set, | ||
) { | ||
Ok(block_id) => block_id, | ||
Err(result_err) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of match
, it might be more readable to use inspect_err
:
https://doc.rust-lang.org/std/result/enum.Result.html#method.inspect_err
followed by unwrap_or_default
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work because we can't continue
from the inspect_err
closure.
ledger/src/blockstore.rs
Outdated
} | ||
// These metrics are expensive to send because hash does not compress well. | ||
// Only send these metrics when we are sure the appropriate shred format is being sent | ||
if !results.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the other comment, gating on should_chain_merkle_shreds
has cluster upgrade issues.
ledger/src/blockstore.rs
Outdated
@@ -170,6 +174,31 @@ impl<T> AsRef<T> for WorkingEntry<T> { | |||
} | |||
} | |||
|
|||
#[derive(Clone, Copy, PartialEq, Eq, Debug)] | |||
pub struct LastFECSetCheckResults { | |||
block_id: Option<Hash>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place this block_id
is defined ?
Would it make sense to call this last_fec_set_merkle_root
or something similar to be more expressive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agave/sdk/program/src/vote/state/mod.rs
Lines 235 to 238 in a2eaf6d
/// the unique identifier for the chain up to and | |
/// including this block. Does not require replaying | |
/// in order to compute. | |
pub block_id: Hash, |
This is where it is defined. I'll keep it as
last_fec_set_merkle_root
here and we can start calling it block_id
in the replay pipeline.
ledger/src/shred.rs
Outdated
@@ -1267,6 +1283,10 @@ pub fn verify_test_data_shred( | |||
} | |||
} | |||
|
|||
pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a note to move this back if the metric gate on should_chain_merkle_shreds
is removed.
…nt, false for legacy
acac298
to
df32e7b
Compare
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root (cherry picked from commit 93edb65) # Conflicts: # ledger/src/shred.rs # sdk/src/feature_set.rs
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root
…shreds (backport of #2101) (#2192) replay: extend last fec set check for 32+ retransmitter signed shreds (#2101) * replay: extend last fec set check for 32+ retransmitter signed shreds * pr feedback: use separate feature flag * pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy * pr feedback: update doc comment fail -> error * pr feedback: hash -> bank_hash for report metrics * refactor metrics inside blockstore fn, return block_id for future use * pr feedback: gate metrics reporting * pr feedback: do not distinguish impossible combos, simplify check code * pr feedback: remove report_metrics helper fn * pr feedback: remove metric * pr feedback: block_id -> last_fec_set_merkle_root Co-authored-by: Ashwin Sekar <[email protected]>
Problem
When the feature flag is turned on, we need to verify that the last fec set uses the new shred format.
Because the last fec set is not known when receiving shreds, and the new shred format is only used for the last set, we cannot perform this check until replay has completed.
Summary of Changes
Extend the check which compares the merkle roots of the last fec set to additionally check that they are all of the retransmitter shred variant.