-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Export number of expected chunks/blocks in epoch to prometheus #8759
Conversation
only focus on the last commit. The first two will be merged in this PR.
|
5740721
to
41a5d13
Compare
chain/client/src/info.rs
Outdated
.map_or(0..0, |epoch_start_height| { | ||
epoch_start_height..(epoch_start_height + blocks_in_epoch) | ||
}) | ||
.fold( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be easier to read if we just change it to a regular for loop; the fold is only used to iterate over the indexes, not to, say, fold some function over the actual elements in the container.
@@ -177,6 +181,45 @@ impl InfoHelper { | |||
} | |||
} | |||
|
|||
fn record_epoch_settlement_info(head: &Tip, client: &crate::client::Client) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea how long this takes? For 43200 heights in an epoch I just wanna make sure this won't take too long and make validators miss block/chunk production, since I assume this is run on ClientActor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran perf on the RPC node and the function does not show up in the graph. I even looked for log_summary
call. I believe they are negligible.
The function is also called once per epoch to avoid unnecessary computation.
Check the CPU flamegraph below.
https://drive.google.com/file/d/1x7HSYTFWBZOgHlh1g7_3ev5r1IPsIT-L/view?usp=sharing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well my concern isn't about the overall performance impact on CPU. My concern is about exactly that one time latency on epoch transitions. Validators are very keen on not missing even a single chunk or block, so if this introduces say 100ms delay then this may cause that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a localnet with 4 nodes to run perf during the epoch transition. The results are here but I cannot see a difference.
Let me know if you have any suggestion on how to test this more accurately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option here to avoid overloading the ClientActor is to compute this metric async when a new epoch begins. The drawback here is that a lot of stuff happens at epoch start as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I am not sure how long something runs, I run it 1000 times. Just loop your function 1000 times, run your tests like that, see the huge delay and estimate the real delay by dividing by 1000.
41a5d13
to
f21019f
Compare
Looks like sample_chunk_producer hashes 48 bytes and then uses that to index into an array. The bottleneck would seem to be the hash rate on 48 byte inputs. If we can benchmark/calculate how long that takes on a typical machine then that'd be enough of a justification. |
I measured the time spent in the sample functions for chunk and blocks. Setup:
Result:The call costs 1.3 ms for a 43200 iteration
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the benchmark! Seems fine for now, but when we scale to 100 shards this will become 25x more expensive and then it may be a potential issue.
Bad news... I ran the test again because my node was out of date and the result does not look good.
|
I see, that's unfortunate :( Let's take a step back. Do we really need to have such an exact count? Is it OK to have an approximate? Because we know the exact sampling weights so we can calculate the expected ("expected" as in probability theory) number of blocks or chunks. |
The benchmark that you based your approve on was not accurate.
I got some time to get back to this. In this paste you can see the comparison. If we chose to go with the approximation, @nikurt are these values going to be helpful or misleading? This is the code from my fork that I used to generate the metrics. |
In the provided sample I see the approximations being off by 40% (6 expected instead of 10 expected). I vote for the approximation. |
f21019f
to
d022978
Compare
chain/client/src/info.rs
Outdated
@@ -243,6 +284,11 @@ impl InfoHelper { | |||
InfoHelper::record_tracked_shards(&head, &client); | |||
InfoHelper::record_block_producers(&head, &client); | |||
InfoHelper::record_chunk_producers(&head, &client); | |||
if self.epoch_id.as_ref().map_or(true, |epoch_id| epoch_id != &head.epoch_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.epoch_id.as_ref().map_or(true, |epoch_id| epoch_id != &head.epoch_id) { | |
if self.epoch_id.ne(&head.epoch_id) { |
let blocks_in_epoch = client.config.epoch_length; | ||
let number_of_shards = | ||
client.runtime_adapter.num_shards(&head.epoch_id).unwrap_or_default(); | ||
if let Ok(epoch_info) = epoch_info { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments that this is an approximation and how it's computed.
d022978
to
5f9abe1
Compare
The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch.
5f9abe1
to
33296dc
Compare
…8759) The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch. Tested on mainnet and localnet. On localnet: - started 4 nodes - stopped one of them for a delta of 200 blocks - started it again
…8759) The producer to block/chunk assignment is known ahead of time so we can estimate what we expect to see at the end of the epoch. We are computing these numbers only once per epoch. Tested on mainnet and localnet. On localnet: - started 4 nodes - stopped one of them for a delta of 200 blocks - started it again
The producer to block/chunk assignment is known ahead of time so we can
estimate what we expect to see at the end of the epoch.
We are computing these numbers only once per epoch.
Tested on mainnet and localnet.
On localnet: