[metrics] add metrics to monitor concurrent tasks in network #472
Conversation
```rust
use prometheus::{default_registry, register_int_gauge_vec_with_registry, IntGaugeVec, Registry};
use std::sync::Arc;

pub trait NetworkMetrics {
```
Ideally I would use only one struct to report the metrics for both worker and primary nodes, as those would normally be deployed as separate nodes. However, currently the worker & primary are deployed as part of the same binary in SUI, and this would lead to re-registering the same metric (which would lead to an error). So I had to split those, but to minimise the implementation I used a common interface.
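A minimal sketch of that shape, under the assumption that the primary and worker each register a gauge under a distinct metric name against the shared registry (the struct names and the `primary_network_concurrent_tasks` name are illustrative, modelled on the snippets quoted in this conversation, not copied from the PR):

```rust
use prometheus::{register_int_gauge_vec_with_registry, IntGaugeVec, Registry};

/// Common interface so callers do not care whether they hold the primary's
/// or the worker's metrics struct.
pub trait NetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec;
}

pub struct PrimaryNetworkMetrics {
    network_concurrent_tasks: IntGaugeVec,
}

impl PrimaryNetworkMetrics {
    pub fn new(registry: &Registry) -> Self {
        Self {
            // A distinct metric name per node type, so primary and worker can
            // share one registry inside a single binary without a
            // re-registration error.
            network_concurrent_tasks: register_int_gauge_vec_with_registry!(
                "primary_network_concurrent_tasks",
                "The number of concurrent tasks running in the network connector",
                &["module", "network"],
                registry
            )
            .unwrap(),
        }
    }
}

impl NetworkMetrics for PrimaryNetworkMetrics {
    fn network_concurrent_tasks(&self) -> &IntGaugeVec {
        &self.network_concurrent_tasks
    }
}

// A WorkerNetworkMetrics struct would mirror this, registering
// "worker_network_concurrent_tasks" instead.
```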
I'd be OK with the theory of what we're doing here (keeping a finger on the pulse of concurrency), but I'm wary of the very extensive code changes this is introducing and actually concerned we may never manage to either remove those metrics or maintain them well.
- is there another tack we could take to reduce the maintenance burden, using e.g. the recently introduced tokio-metrics? https://tokio.rs/blog/2022-02-announcing-tokio-metrics
- if not, is there a reduced, simpler variant of the metrics reporting that would e.g. wake up a background task every second and report the concurrency value in the bounded executors of each of the worker & primary? This would not require instrumentation of the whole code base, I feel (a rough sketch of this idea is below).
@velvia may have other ideas.
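A minimal sketch of the second suggestion, assuming the bounded executor could hand its semaphore and total capacity to a sampling task (both are assumptions here; the function name is made up and the real `BoundedExecutor` may not expose either):

```rust
use std::sync::Arc;
use std::time::Duration;

use prometheus::IntGauge;
use tokio::sync::Semaphore;

/// Wakes up once per second and records how many permits of the bounded
/// executor's semaphore are currently taken, i.e. how many tasks are running.
async fn report_concurrency(semaphore: Arc<Semaphore>, capacity: usize, gauge: IntGauge) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        let in_use = capacity.saturating_sub(semaphore.available_permits());
        gauge.set(in_use as i64);
    }
}
```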
@huitseeker @akichidis just had a look at this. Here's my take: in terms of manually maintaining our own metrics, most of the change in this PR really has to do with adding an inner Metrics struct to each network facility. My guess is that this is going to be hard to avoid with this approach. Is there a higher-level place which could collect this metric periodically? Then each network (e.g. primary) only has to expose a method, and the higher-level thing could call down to get the metric. This would be less intrusive in that one would not need to pass in a separate Metrics struct just for monitoring this one thing. As for tokio-metrics, I think that is a good approach, though it requires some investment too. It should return many more metrics overall and be really rich, but it will likely take a few days.
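A sketch of that pull-based alternative; the trait, its methods, and the single-label gauge are illustrative only, not existing APIs in the codebase:

```rust
use std::sync::Arc;
use std::time::Duration;

use prometheus::IntGaugeVec;

/// Each network facility exposes one accessor instead of owning a metrics struct.
pub trait ConcurrencyReporter: Send + Sync {
    /// Label value identifying the facility (e.g. "primary" or "worker").
    fn name(&self) -> &'static str;
    /// Number of tasks currently running in this facility.
    fn concurrent_tasks(&self) -> usize;
}

/// One higher-level task polls every facility and updates a shared gauge, so
/// nothing metrics-related has to be threaded through the network code itself.
async fn collect(reporters: Vec<Arc<dyn ConcurrencyReporter>>, gauge: IntGaugeVec) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        for reporter in &reporters {
            gauge
                .with_label_values(&[reporter.name()])
                .set(reporter.concurrent_tasks() as i64);
        }
    }
}
```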
@huitseeker @velvia thanks both for your comments. So my thoughts regarding the above:
That being said, I am open to giving the approach (2) mentioned above a quick try and comparing the complexity.
OK, after the experience of #559 I recognize how wise this is and I am now in favor of introducing this PR with minor changes.
I think this mostly needs a rebase.
&["module", "network"], | ||
registry | ||
) | ||
.unwrap(), |
I'd like to not panic the node if this fails: metrics are not enough to justify a crash, IMHO.
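One non-panicking shape this could take, as a sketch only (the fallback-to-a-detached-gauge idea and the use of tracing for the log line are assumptions, not the PR's code):

```rust
use prometheus::{register_int_gauge_vec_with_registry, IntGaugeVec, Opts, Registry};
use tracing::warn;

fn concurrent_tasks_gauge(registry: &Registry) -> IntGaugeVec {
    register_int_gauge_vec_with_registry!(
        "worker_network_concurrent_tasks",
        "The number of concurrent tasks running in the network connector",
        &["module", "network"],
        registry
    )
    .unwrap_or_else(|e| {
        // Keep the node alive: log the failure and fall back to a gauge that
        // is simply not attached to the registry (it records values but is
        // never scraped).
        warn!("failed to register metric: {}", e);
        IntGaugeVec::new(
            Opts::new(
                "worker_network_concurrent_tasks",
                "The number of concurrent tasks running in the network connector",
            ),
            &["module", "network"],
        )
        .expect("static gauge definition is valid")
    })
}
```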
Ah, that should happen primarily if we have already registered the same metric somewhere else in the codebase. Having this here protects us from accidentally declaring the same metric in two different places and it going unnoticed - which I believe is quite important. So my expectation is that this will be caught early in our e2e tests before it even hits staging - or, worst case, production.
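A standalone toy illustrating that failure mode (not project code): the second registration of the same metric name against the same registry returns an error, so an `.unwrap()` at construction time panics immediately rather than letting the duplicate go unnoticed.

```rust
use prometheus::{register_int_gauge_with_registry, Registry};

fn main() {
    let registry = Registry::new();

    let first = register_int_gauge_with_registry!("my_gauge", "help text", registry);
    assert!(first.is_ok());

    // Same name, same registry: the second call fails (the crate reports an
    // "already registered" error), which an .unwrap() turns into a startup
    // panic that e2e tests catch early.
    let second = register_int_gauge_with_registry!("my_gauge", "help text", registry);
    assert!(second.is_err());
}
```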
network/src/metrics.rs (outdated diff)
```rust
network_concurrent_tasks: register_int_gauge_vec_with_registry!(
    "worker_network_concurrent_tasks",
    "The number of concurrent tasks running in the network connector",
    &["module", "network"],
    registry
)
.unwrap(),
```
see above re: the avoidable panic
Similar to this #472 (comment)
@akichidis you may need to rebase this one to pass checks: we changed a linter in #485
Force-pushed from c6a5b00 to 04a2e75
@huitseeker @asonnino @bmwill @velvia I've rebased the PR onto the latest master; if you could please re-review.
Thank you very much 🎉
Force-pushed from 25a0d25 to 51b02a0
Force-pushed from 51b02a0 to e48a271
We operate an executor with a bound on the concurrent number of messages (see MystenLabs#463, MystenLabs#559, MystenLabs#706). PR MystenLabs#472 added logging for the bound being hit. We expect the executors to operate for a long time at this limit (e.g. in a recovery situation), so the spammy logging is not useful. This removes the logging of the concurrency bound being hit. Fixes MystenLabs#759
Following the work on #463, I believe @velvia made a good point about adding metrics to monitor when we reach maximum capacity on concurrent tasks. I took the opportunity to add those since we are still actively adding metrics; we can keep them until we stabilise things. Once we gain strong confidence in our nodes and consider this an implementation detail, we can probably remove them.
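In miniature, the instrumentation pattern the PR is after looks roughly like this (a sketch with made-up names, not the actual network/BoundedExecutor changes):

```rust
use prometheus::IntGaugeVec;

/// Wraps a network task so the gauge tracks how many such tasks are in flight.
async fn run_instrumented<F, T>(gauge: &IntGaugeVec, module: &str, network: &str, task: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let in_flight = gauge.with_label_values(&[module, network]);
    in_flight.inc();
    let result = task.await;
    in_flight.dec();
    result
}
```

In real code the decrement would likely live in a Drop guard so that cancelled or panicking tasks still release their slot in the count.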