expose more libp2p performance and queuing metrics #678
Conversation
I like the idea of having a libp2p Grafana dashboard, though maybe it could go in the
My problem with this has always been that you can't really use it directly with libp2p, since there is no executable and there doesn't seem to be any way of importing definitions from a top-level project, which is a bit of a shame.
We could have some demo DummyApp that e.g. gossips a bit. Obviously not as an integral part of the code, but in a separate folder. Or maybe some network performance testing app that sends data around.
Sure, but that could be a nice "first dashboard" for anyone using libp2p for their project, or even a dedicated dashboard for troubleshooting libp2p, which could be used by any project.
+1 for a libp2p metrics dashboard that exposes as many metrics as possible - all apps using libp2p will benefit from it because it shows the full measurement capabilities available - they can import it in a "subsection" of their "app dashboard" or use it as a copy-paste source.
Should I rebase it on top of some current branch? I've intentionally started from the commit used in the Nimbus-eth2 stable branch.
Nimbus unstable should currently point to master/unstable (except the last commit).
Oh! So you can import the definitions in Grafana? That is indeed awesome.
https://grafana.com/docs/grafana/latest/panels/panel-library/ or you can just copy-paste the JSON snippet that is available via buttons in the Grafana UI.
@Menduist, I would say this is ready to merge.
libp2p/muxers/mplex/lpchannel.nim (outdated)
@@ -23,6 +23,10 @@ logScope:
when defined(libp2p_network_protocols_metrics):
  declareCounter libp2p_protocols_bytes, "total sent or received bytes", ["protocol", "direction"]
  declareHistogram libp2p_protocols_qlen, "message queue length", ["protocol"],
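For orientation, a minimal sketch of how these declarations could be fed at the point where a message is queued for sending; `protocol`, `msgLen`, and `queueLen` are placeholder parameters, not actual lpchannel fields:

```nim
import metrics

when defined(libp2p_network_protocols_metrics):
  declareCounter libp2p_protocols_bytes, "total sent or received bytes", ["protocol", "direction"]
  declareHistogram libp2p_protocols_qlen, "message queue length", ["protocol"]

  proc trackOutgoing(protocol: string, msgLen, queueLen: int) =
    # count outgoing bytes per protocol
    libp2p_protocols_bytes.inc(msgLen.int64, labelValues = [protocol, "out"])
    # record the queue length seen at enqueue time
    libp2p_protocols_qlen.observe(queueLen.float64, labelValues = [protocol])
```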
Does it really make sense to have these metrics be per protocol?
Since the queue is global, they should be about the same for every protocol, no?
The queue should be per stream though, which can be mapped to a protocol.
Having it be per-protocol means 12 metrics instead of one (for nimbus), so we have to be sure it makes sense.
You're right, there are limits to how many metrics we can collect in practice, even for tracing/debugging purposes; for example, Prometheus quickly gets overwhelmed if too many metrics are pushed to it.
Anything stopping this from being merged?
We talked about this a little bit with tanguy - libp2p is clearly layered and it would make sense for the exposed metrics to have a structure that follows these layers "naturally". With that in mind, it's good to be mindful that it's easier to add metrics than it is to remove them, though it's not a strict requirement that obsolete metrics hang around - it's a "soft" compatibility thing.
the other point about the layering was that one should be able to read a story from the metrics - just like one API uses another and translates the abstraction levels, so should the metrics relate to each other layer by layer.
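As an illustration of that layering idea, a sketch with purely hypothetical, layer-prefixed metric names (none of these are actual nim-libp2p metrics), so a message can be followed down the stack:

```nim
import metrics

# Hypothetical, layer-prefixed names: a reader can follow a message
# from the application-facing layer down to the wire.
declareCounter libp2p_gossipsub_messages_published, "messages handed to gossipsub"
declareCounter libp2p_mplex_messages_queued, "messages queued by mplex for writing"
declareCounter libp2p_transport_bytes_written, "bytes written to the wire"
```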
Though, the metric added here is a bit different. The goal of this metric is only to know how long our queue is, and this queue will be per-peer, basically. Hence my question. My gut feeling is that, if anything, having it be per-protocol will be error inducing, since the buffering is not actually per-protocol, but per-peer. But if it exhibits interesting behaviors, I'm happy to keep it that way.
oh, sorry,
Good point, in this context it does make sense to make it per peer rather than per protocol. I agree with @arnetheduck about layering and in general would like to see our metrics re-engineered, both at the level of information we collect as well as the overall architecture. We're quickly reaching the limits of our simplistic approach and it's only becoming harder to keep track as we keep adding more.
Indeed, it would be great to have metrics per layer with some naming convention, and feature flags for anything that is somewhat heavy in processing.
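A minimal sketch of that feature-flag pattern, using a made-up metric name (`libp2p_example_queue_time`); both the declaration and the observation site are gated, so a build without the flag pays no runtime cost:

```nim
import metrics

when defined(libp2p_network_protocols_metrics):
  # hypothetical metric, only compiled in when the flag is set
  declareHistogram libp2p_example_queue_time, "time spent in the send queue (seconds)"

proc onMessageSent(queuedSeconds: float) =
  when defined(libp2p_network_protocols_metrics):
    libp2p_example_queue_time.observe(queuedSeconds)
  else:
    discard
```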
Sorry, I don't remember: what was the state of this? @cskiraly
I think we've resolved everything, and it can be merged.
3 groups of metrics:
I haven't seen anything I would consider an overlap.
@@ -25,6 +25,7 @@ declareGauge(libp2p_gossipsub_no_peers_topics, "number of topics in mesh with no
declareGauge(libp2p_gossipsub_low_peers_topics, "number of topics in mesh with at least one but below dlow peers")
declareGauge(libp2p_gossipsub_healthy_peers_topics, "number of topics in mesh with at least dlow peers (but below dhigh)")
declareCounter(libp2p_gossipsub_above_dhigh_condition, "number of above dhigh pruning branches ran", labels = ["topic"])
declareSummary(libp2p_gossipsub_mcache_hit, "ratio of successful IWANT message cache lookups", labels = ["topic"])
Is the "topic" label used?
It's not set. This would make sense as a per-topic metric, but while IHAVE and other messages have a topicID field, IWANT does not.
If we still have the message in mcache, we could get the topic ID. But if it is missing, we simply don't know, I'm afraid.
Consequence: if you agree, let's just remove the labels here.
If you don't set declared labels, the metrics will crash at runtime.
Yeah, let's remove it
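For reference, a sketch of the de-labelled version; with labels = ["topic"] declared, every observe() call would have to pass a matching labelValues, which the IWANT path cannot supply:

```nim
import metrics

# No "topic" label: plain observe() calls are enough.
declareSummary libp2p_gossipsub_mcache_hit, "ratio of successful IWANT message cache lookups"

proc recordMcacheLookup(hit: bool) =
  # observing 1.0 for hits and 0.0 for misses makes sum/count act as the hit ratio
  let value = if hit: 1.0 else: 0.0
  libp2p_gossipsub_mcache_hit.observe(value)
```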
Adding counters for received deduplicated messages and for duplicates recognized by the seen cache. Note that duplicates that are not recognized (arrive after seenTTL) are not counted as duplicates here either.
It is generally assumed that IWANT messages arrive when mcache still has the message. These stats are to verify this assumption.
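A rough sketch of how such counters might be bumped; the identifiers and wrapper procs below are illustrative, not the exact ones added by this PR:

```nim
import metrics

declareCounter libp2p_example_messages_received, "deduplicated messages received"
declareCounter libp2p_example_duplicates, "duplicates recognized by the seen cache"
declareCounter libp2p_example_iwant_mcache_miss, "IWANT requests not found in mcache"

proc onMessage(alreadySeen: bool) =
  # duplicates arriving after seenTTL are not recognized and thus not counted here
  if alreadySeen:
    libp2p_example_duplicates.inc()
  else:
    libp2p_example_messages_received.inc()

proc onIWant(foundInMcache: bool) =
  # checks the assumption that IWANT arrives while mcache still holds the message
  if not foundInMcache:
    libp2p_example_iwant_mcache_miss.inc()
```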
Messages are queued on the TX side before being written to the stream, but we have no statistics about these queues. This patch adds some queue length and queuing time related statistics.
Adding a Grafana dashboard with the newly exposed metrics.
libp2p_protocols_qtime keeps metrics per protocol, so we put it behind the same compile-time flag as similar per-protocol metrics.
Since the default queue size is 1024, it is better to have buckets up to 512.
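For illustration, explicit buckets topping out at 512 might be declared as below; the exact bucket values are made up, and the buckets parameter is assumed to be the one accepted by nim-metrics' declareHistogram:

```nim
import metrics

# Powers of two up to 512; values above the last bucket still land in
# the implicit +Inf bucket, so a full 1024-slot queue remains visible.
declareHistogram(libp2p_example_qlen, "message queue length",
  buckets = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0])
```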
Queuing currently happens in the mplex layer. We'd better make this clear in the names of the metrics and the associated flags.
This would make sense as a per-topic metric, but while IHAVE and other messages have a topicID field, IWANT does not. If we still have the message in mcache, we could get the topic ID. But if it is missing, we just simply don't know the topic.
Signed-off-by: Csaba Kiraly <[email protected]>
The goal of this PR is to add metrics at the libp2p level, allowing deeper insight into message dynamics.
To see all the metrics, compile with:
-d:libp2p_network_protocols_metrics
The PR also includes a Grafana dashboard (in the tools folder) to show some libp2p-level metrics.