[metrics] add metrics to monitor concurrent tasks in network #472
Conversation
```rust
use prometheus::{default_registry, register_int_gauge_vec_with_registry, IntGaugeVec, Registry};
use std::sync::Arc;

pub trait NetworkMetrics {
```
Ideally I would use only one struct to report the metrics for both worker and primary nodes, as those would normally be deployed as separate nodes. However, currently the worker & primary are deployed as part of the same binary in SUI, and this would lead to re-registering the same metric (which would lead to an error). So I had to split those, but to minimise the implementation I used a common interface.
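A minimal sketch of that shape, under the assumption that the primary and worker each register a gauge under a distinct metric name against the shared registry (the struct names and the `primary_network_concurrent_tasks` name are illustrative, modelled on the snippets quoted in this conversation, not copied from the PR):

```rust
use prometheus::{register_int_gauge_vec_with_registry, IntGaugeVec, Registry};

/// Common interface so callers do not care whether they hold the primary's
/// or the worker's metrics struct.
pub trait NetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec;
}

pub struct PrimaryNetworkMetrics {
    network_concurrent_tasks: IntGaugeVec,
}

impl PrimaryNetworkMetrics {
    pub fn new(registry: &Registry) -> Self {
        Self {
            // A distinct metric name per node type, so primary and worker can
            // share one registry inside a single binary without a
            // re-registration error.
            network_concurrent_tasks: register_int_gauge_vec_with_registry!(
                "primary_network_concurrent_tasks",
                "The number of concurrent tasks running in the network connector",
                &["module", "network"],
                registry
            )
            .unwrap(),
        }
    }
}

impl NetworkMetrics for PrimaryNetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec {
        &self.network_concurrent_tasks
    }
}

// A WorkerNetworkMetrics struct would mirror this, registering
// "worker_network_concurrent_tasks" instead.
```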
I'd be OK with the theory of what we're doing here (keeping a finger on the pulse of concurrency), but I'm wary of the very extensive code changes this is introducing and actually concerned we may never manage to either remove those metrics or maintain them well.
- is there another tack we could take to reduce the maintenance burden, using e.g. the recently introduced tokio-metrics? https://tokio.rs/blog/2022-02-announcing-tokio-metrics
- if not, is there a reduced, simpler variant of the metrics reporting that would e.g. wake up a background task every second and report the concurrency value in the bounded executors of each of the worker & primary? This would not require instrumentation of the whole code base, I feel (a rough sketch of this idea is below).
@velvia may have other ideas.
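A minimal sketch of the second suggestion, assuming the bounded executor could hand its semaphore and total capacity to a sampling task (both are assumptions here; the function name is made up and the real `BoundedExecutor` may not expose either):

```rust
use std::sync::Arc;
use std::time::Duration;

use prometheus::IntGauge;
use tokio::sync::Semaphore;

/// Wakes up once per second and records how many permits of the bounded
/// executor's semaphore are currently taken, i.e. how many tasks are running.
async fn report_concurrency(semaphore: Arc<Semaphore>, capacity: usize, gauge: IntGauge) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        let in_use = capacity.saturating_sub(semaphore.available_permits());
        gauge.set(in_use as i64);
    }
}
```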
@huitseeker @akichidis just had a look at this. Here's my take: in terms of manually maintaining our own metrics, most of the change in this PR really has to do with adding an inner Metrics struct to each network facility. My guess is that this is going to be hard to avoid with this approach. Is there a higher-level place which could collect this metric periodically? Then each network (e.g. primary) only has to expose a method, and the higher-level thing could call down to get the metric. This would be less intrusive in that one would not need to pass in a separate Metrics struct just for monitoring this one thing. As for tokio-metrics, I think that is a good approach, though it requires some investment too. It should return many more metrics overall and be really rich, but it will likely take a few days.
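A sketch of that pull-based alternative; the trait, its methods, and the single-label gauge are illustrative only, not existing APIs in the codebase:

```rust
use std::sync::Arc;
use std::time::Duration;

use prometheus::IntGaugeVec;

/// Each network facility exposes one accessor instead of owning a metrics struct.
pub trait ConcurrencyReporter: Send + Sync {
    /// Label value identifying the facility (e.g. "primary" or "worker").
    fn name(&self) -> &'static str;
    /// Number of tasks currently running in this facility.
    fn concurrent_tasks(&self) -> usize;
}

/// One higher-level task polls every facility and updates a shared gauge, so
/// nothing metrics-related has to be threaded through the network code itself.
async fn collect(reporters: Vec<Arc<dyn ConcurrencyReporter>>, gauge: IntGaugeVec) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        for reporter in &reporters {
            gauge
                .with_label_values(&[reporter.name()])
                .set(reporter.concurrent_tasks() as i64);
        }
    }
}
```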
@huitseeker @velvia thanks both for your comments. So my thoughts regarding the above:
That being said, I am open to giving the approach (2) mentioned above a quick try and comparing the complexity.
OK, after the experience of #559 I recognize how wise this is and I am now in favor of introducing this PR with minor changes.
I think this mostly needs a rebase.
&["module", "network"], | ||
registry | ||
) | ||
.unwrap(), |
I'd like to not panic the node if this fails: metrics are not enough to justify a crash, IMHO.
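One non-panicking shape this could take, as a sketch only (the fallback-to-a-detached-gauge idea and the use of tracing for the log line are assumptions, not the PR's code):

```rust
use prometheus::{register_int_gauge_vec_with_registry, IntGaugeVec, Opts, Registry};
use tracing::warn;

fn concurrent_tasks_gauge(registry: &Registry) -> IntGaugeVec {
    register_int_gauge_vec_with_registry!(
        "worker_network_concurrent_tasks",
        "The number of concurrent tasks running in the network connector",
        &["module", "network"],
        registry
    )
    .unwrap_or_else(|e| {
        // Keep the node alive: log the failure and fall back to a gauge that
        // is simply not attached to the registry (it records values but is
        // never scraped).
        warn!("failed to register metric: {}", e);
        IntGaugeVec::new(
            Opts::new(
                "worker_network_concurrent_tasks",
                "The number of concurrent tasks running in the network connector",
            ),
            &["module", "network"],
        )
        .expect("static gauge definition is valid")
    })
}
```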
Ah, that should happen primarily if we have already registered the same metric somewhere else in the codebase. Having this here protects us from accidentally declaring the same metric in two different places and it going unnoticed - which I believe is quite important. So my expectation is that this will be caught early in our e2e tests before it even hits staging - or, worst case, production.
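A standalone toy illustrating that failure mode (not project code): the second registration of the same metric name against the same registry returns an error, so an `.unwrap()` at construction time panics immediately rather than letting the duplicate go unnoticed.

```rust
use prometheus::{register_int_gauge_with_registry, Registry};

fn main() {
    let registry = Registry::new();

    let first = register_int_gauge_with_registry!("my_gauge", "help text", registry);
    assert!(first.is_ok());

    // Same name, same registry: the second call fails (the crate reports an
    // "already registered" error), which an .unwrap() turns into a startup
    // panic that e2e tests catch early.
    let second = register_int_gauge_with_registry!("my_gauge", "help text", registry);
    assert!(second.is_err());
}
```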
network/src/metrics.rs (outdated diff)
```rust
network_concurrent_tasks: register_int_gauge_vec_with_registry!(
    "worker_network_concurrent_tasks",
    "The number of concurrent tasks running in the network connector",
    &["module", "network"],
    registry
)
.unwrap(),
```
see above re: the avoidable panic
Similar to this #472 (comment)
@akichidis you may need to rebase this one to pass checks: we changed a linter in #485
Force-pushed from c6a5b00 to 04a2e75
@huitseeker @asonnino @bmwill @velvia I've rebased the PR onto the latest master; if you could please re-review.
Thank you very much 🎉
Force-pushed from 25a0d25 to 51b02a0
Force-pushed from 51b02a0 to e48a271
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706). PR MystenLabs#472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in a recovery situation), so the spammy logging is not useful. This removes the logging of the concurrency bound being hit. Fixes MystenLabs#759
Following the work on #463, I believe @velvia made a good point about adding metrics to monitor when we reach maximum capacity on concurrent tasks. I took the opportunity to add those since we are still actively adding metrics; we can keep them until we stabilise things. Once we gain strong confidence in our nodes and consider this an implementation detail, we can probably remove them.
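In miniature, the instrumentation pattern the PR is after looks roughly like this (a sketch with made-up names, not the actual network/BoundedExecutor changes):

```rust
use prometheus::IntGaugeVec;

/// Wraps a network task so the gauge tracks how many such tasks are in flight.
async fn run_instrumented<F, T>(gauge: &IntGaugeVec, module: &str, network: &str, task: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let in_flight = gauge.with_label_values(&[module, network]);
    in_flight.inc();
    let result = task.await;
    in_flight.dec();
    result
}
```

In real code the decrement would likely live in a Drop guard so that cancelled or panicking tasks still release their slot in the count.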