expose more libp2p performance and queuing metrics #678

Merged
merged 13 commits into unstable from metrics-wip on Apr 6, 2022

Conversation

Contributor

@cskiraly cskiraly commented Dec 20, 2021

The goal of this PR is to add metrics at the libp2p level, allowing deeper insight into message dynamics.

To see all the metrics, compile with: -d:libp2p_network_protocols_metrics

The PR also includes a Grafana dashboard (in the tools folder) to show some libp2p level metrics.
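
For context, a minimal sketch of how such a compile-time gated metric is declared and updated with nim-metrics, mirroring the pattern this PR uses; the libp2p_example_bytes name and the recordWrite helper are hypothetical, purely for illustration:

import metrics

when defined(libp2p_network_protocols_metrics):
  # Only declared when the flag is passed at compile time,
  # e.g. nim c -d:libp2p_network_protocols_metrics ...
  declareCounter libp2p_example_bytes,
    "total sent or received bytes", ["protocol", "direction"]

proc recordWrite(protocol: string, bytes: int) =
  # Compiles to a no-op when the flag is not defined
  when defined(libp2p_network_protocols_metrics):
    libp2p_example_bytes.inc(bytes.int64, labelValues = [protocol, "out"])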

@Menduist Menduist changed the base branch from master to unstable December 20, 2021 16:20
@Menduist
Contributor

I like the idea of having a libp2p grafana dashboard, though maybe it could go in the tools folder?

@dryajov
Contributor

dryajov commented Dec 20, 2021

I like the idea of having a libp2p grafana dashboard, though maybe it could go in the tools folder?

My problem with this has always been that you can't really use it directly with libp2p since there is no executable and there doesn't seem to be any way of importing definitions from a top level project, which is a bit of a shame.

@cskiraly
Contributor Author

We could have some demo DummyApp that e.g. gossips a bit. Obviously not as integral part of the code, but in a separate folder. Or maybe some network performance testing app that sends around data.

@Menduist
Contributor

can't really use it directly with libp2p

Sure, but that could be a nice "first dashboard" for anyone using libp2p for their project, or even a dedicated dashboard for troubleshooting libp2p, which could be used by any project

@arnetheduck
Contributor

+1 for a libp2p metrics dashboard that exposes as many metrics as possible - all apps using libp2p will benefit from it because it shows the full measurement capabilities available - they can import it in a "subsection" of their "app dashboard" or use it as copy-paste source

@cskiraly
Contributor Author

Should I rebase it on top of some current branch? I've intentionally started from the commit used in Nimbus-eth2 stable branch.

@Menduist
Contributor

Nimbus unstable should currently point to master/unstable (except the last commit)
Use unstable as a target here

@dryajov
Contributor

dryajov commented Dec 24, 2021

+1 for a libp2p metrics dashboard that exposes as many metrics as possible - all apps using libp2p will benefit from it because it shows the full measurement capabilities available - they can import it in a "subsection" of their "app dashboard" or use it as copy-paste source

Oh! So you can import the definitions in grafana? That is indeed awesome.

@arnetheduck
Contributor

import the definitions in grafana

https://grafana.com/docs/grafana/latest/panels/panel-library/

or you can just copy-paste the json snippet that is available via buttons in the grafana ui.

@cskiraly cskiraly changed the title from "expose more libp2p performance and queuing metrics (WIP)" to "expose more libp2p performance and queuing metrics" Jan 10, 2022
@cskiraly
Contributor Author

@Menduist , I would say this is ready for merge.

@@ -23,6 +23,10 @@ logScope:

when defined(libp2p_network_protocols_metrics):
  declareCounter libp2p_protocols_bytes, "total sent or received bytes", ["protocol", "direction"]
  declareHistogram libp2p_protocols_qlen, "message queue length", ["protocol"],
Contributor

Does it really make sense to have these metrics be per protocol?
Since the queue is global, they should be about the same for every protocol, no?

Contributor

The queue should be per stream though, which can be mapped to a protocol.

Contributor

@Menduist Menduist Jan 14, 2022

Having it be per-protocol means 12 metrics instead of one (for Nimbus), so we have to be sure it makes sense

Contributor

You're right, there are limits to how many metrics we can collect in practice, even for tracing/debugging purposes; for example, Prometheus gets quickly overwhelmed if too many metrics are pushed to it.
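
To make the cardinality trade-off concrete, here is a rough sketch (assuming nim-metrics, with illustrative metric names that are not the PR's): a labeled histogram produces one time series per protocol, while an unlabeled one stays a single series.

import metrics

# Per-protocol variant: one time series per "protocol" label value
# (roughly a dozen for an app like Nimbus).
declareHistogram example_qlen_per_protocol, "message queue length", ["protocol"]

# Global variant: a single time series.
declareHistogram example_qlen_global, "message queue length"

proc onEnqueue(protocol: string, qlen: int) =  # hypothetical hook
  example_qlen_per_protocol.observe(qlen.int64, labelValues = [protocol])
  example_qlen_global.observe(qlen.int64)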

dryajov previously approved these changes Jan 14, 2022
Contributor

@dryajov dryajov left a comment

Anything stopping this from being merged?

@arnetheduck
Contributor

We talked about this a little bit with tanguy - libp2p is clearly layered and it would make sense that the exposed metrics have a structure that follows these layers "naturally" - for example, mplex is a layer and metrics that relate to it would have "mplex" in their name - the "layering" would also help determine the granularity of metrics etc.

With that in mind, it's good to be mindful that it's easier to add metrics than it is to remove them, though it's not a strict requirement that obsolete metrics hang around - it's a "soft" compatibility thing.

@arnetheduck
Contributor

the other point about the layering was that one should be able to read a story from the metrics - just like one API uses another and translates the abstraction levels, so should the metrics relate to each other layer by layer

@Menduist
Contributor

Though, the metric added here is a bit different
It's not really tied to mplex, we just put it there because that's where the "buffering" happens to be

The goal of this metric is only to know how long our queue is. And this queue will be per-peer, basically

Hence my question. My gut feel is that if anything, having it be per-protocol will be error inducing, since the buffering is not actually per-protocol, but per-peer. But if it exhibits interesting behaviors, happy to keep it that way

@arnetheduck
Contributor

oh, sorry, mplex was just the first layer that came to mind, mostly because it's in the middle :)

@dryajov
Contributor

dryajov commented Jan 14, 2022

Though, the metric added here is a bit different It's not really tied to mplex, we just put it there because that's where the "buffering" happens to be

The goal of this metric is only to know how long our queue is. And this queue will be per-peer, basically

Hence my question. My gut feel is that if anything, having it be per-protocol will be error inducing, since the buffering is not actually per-protocol, but per-peer. But if it exhibits interesting behaviors, happy to keep it that way

Good point, in this context it does make sense to make it per peer rather than per protocol.

I agree with @arnetheduck about layering and in general would like to see our metrics re-engineered both at the level of information we collect as well as the overall architecture. We're quickly reaching the limits of our simplistic approach and it's only becoming harder to keep track as we keep adding more.

@cskiraly
Contributor Author

Though, the metric added here is a bit different It's not really tied to mplex, we just put it there because that's where the "buffering" happens to be
The goal of this metric is only to know how long our queue is. And this queue will be per-peer, basically
Hence my question. My gut feel is that if anything, having it be per-protocol will be error inducing, since the buffering is not actually per-protocol, but per-peer. But if it exhibits interesting behaviors, happy to keep it that way

Good point, in this context it does make sense to make it per peer rather than per protocol.

I agree with @arnetheduck about layering and in general would like to see our metrics re-engineered both at the level of information we collect as well as the overall architecture. We're quickly reaching the limits of our simplistic approach and it's only becoming harder to keep track as we keep adding more.

Indeed, it would be great to have metrics per layer with some naming convention, and feature flags for anything that is somewhat heavy in processing.

  • For the queuing-related metrics: they are in mplex, so I would rename them libp2p_mplex_* (or libp2p_muxers_*?). Per-protocol queue metrics indeed do not make sense here. Per-peer is instead heavy, so that should definitely be behind a flag. I'll change it to a simple metric by default and see whether a per-peer version behind a compile-time flag makes sense (see the sketch after this list).
  • Duplicate-related metrics are already called libp2p_gossipsub_*
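
A rough sketch of the direction described in the first bullet, assuming nim-metrics; the libp2p_mplex_metrics flag matches the one this PR ends up using, while the per-peer flag, metric names and the recordQueueLen helper are hypothetical:

import metrics

when defined(libp2p_mplex_metrics):
  # Simple, label-free metric by default
  declareHistogram libp2p_mplex_qlen, "message queue length"
  when defined(libp2p_mplex_peer_metrics):  # hypothetical extra flag
    declareHistogram libp2p_mplex_qlen_peer, "message queue length per peer",
      ["peer"]

proc recordQueueLen(peerId: string, qlen: int) =  # hypothetical helper
  when defined(libp2p_mplex_metrics):
    libp2p_mplex_qlen.observe(qlen.int64)
    when defined(libp2p_mplex_peer_metrics):
      libp2p_mplex_qlen_peer.observe(qlen.int64, labelValues = [peerId])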

@Menduist
Contributor

Sorry I don't remember, what was the state of this? @cskiraly

@cskiraly
Contributor Author

cskiraly commented Mar 8, 2022

Sorry I don't remember, what was the state of this? @cskiraly

I think we've resolved everything, and it can be merged.

@Menduist Menduist self-requested a review March 8, 2022 14:24
Menduist previously approved these changes Mar 8, 2022
@cskiraly
Contributor Author

cskiraly commented Mar 15, 2022

3 groups of metrics:

  • libp2p_mplex_: here we add metrics related to the queuing that happens in the mplex code. These metrics are behind when defined(libp2p_mplex_metrics) to avoid overhead when not looking into queuing behavior.
  • libp2p_gossipsub_: new metrics here are to verify assumptions on overhead due to the blind push behaviour, a protocol choice that has its reasons and costs (a small sketch follows below).
  • libp2p_gossipsub_mcache_: a new metric to provide insight into the pull behaviour of the protocol.

I haven't seen anything I would consider an overlap.
Naming is the best I can come up with. I could imagine other names for the mplex part, like stream for example.
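
To illustrate the duplicate-tracking group, a small sketch of how such counters could be maintained, assuming nim-metrics; the metric names and the onMessage hook are illustrative, not necessarily the PR's exact identifiers:

import metrics

declareCounter example_gossipsub_received, "received (deduplicated) messages"
declareCounter example_gossipsub_duplicate,
  "duplicate messages recognized by the seen cache"

proc onMessage(seenBefore: bool) =  # hypothetical hook
  # Duplicates arriving after seenTTL are not recognized by the seen
  # cache, so they would land in the "received" branch.
  if seenBefore:
    example_gossipsub_duplicate.inc()
  else:
    example_gossipsub_received.inc()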

@@ -25,6 +25,7 @@ declareGauge(libp2p_gossipsub_no_peers_topics, "number of topics in mesh with no
declareGauge(libp2p_gossipsub_low_peers_topics, "number of topics in mesh with at least one but below dlow peers")
declareGauge(libp2p_gossipsub_healthy_peers_topics, "number of topics in mesh with at least dlow peers (but below dhigh)")
declareCounter(libp2p_gossipsub_above_dhigh_condition, "number of above dhigh pruning branches ran", labels = ["topic"])
declareSummary(libp2p_gossipsub_mcache_hit, "ratio of successful IWANT message cache lookups", labels = ["topic"])
Contributor

Is the "topic" label used?

Contributor Author

not set. This would make sense as a per-topic metric, but I see IHAVE and other messages have a topicID field, while IWANT does not.
If we still have the message in mcache, we could get the topic ID. But if it is missing, we simply don't know, I'm afraid.
Consequence: if you agree, let's just remove the labels here

Contributor

if you don't set declared labels, metrics will crash at runtime

Contributor

Yeah, let's remove it
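
Per the agreement above, the declaration would simply drop the unused label. A sketch of what that could look like, assuming nim-metrics, with a hypothetical observation point:

import metrics

declareSummary(libp2p_gossipsub_mcache_hit,
  "ratio of successful IWANT message cache lookups")

proc onIWantLookup(hit: bool) =  # hypothetical hook
  # Observing 1.0 for a hit and 0.0 for a miss lets the summary's
  # sum/count ratio approximate the mcache hit rate.
  let sample = if hit: 1.0 else: 0.0
  libp2p_gossipsub_mcache_hit.observe(sample)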

Menduist previously approved these changes Mar 17, 2022
cskiraly added 12 commits April 6, 2022 01:07
  • Adding counters for received deduplicated messages and for duplicates recognized by the seen cache. Note that duplicates that are not recognized (arriving after seenTTL) are not counted as duplicates here either.
  • It is generally assumed that IWANT messages arrive while mcache still has the message. These stats are to verify this assumption.
  • Messages are queued in TX before getting written on the stream, but we have no statistics about these queues. This patch adds some queue length and queuing time related statistics.
  • Adding a Grafana dashboard with the newly exposed metrics.
  • libp2p_protocols_qtime keeps metrics per protocol, thus we put it behind the same compile-time flag as similar per-protocol metrics.
  • Since the default queue size is 1024, it is better to have buckets up to 512 (see the sketch after this list).
  • Queuing currently happens in the mplex layer. We better make this clear in the names of the metrics and associated flags.
  • This would make sense as a per-topic metric, but while IHAVE and other messages have a topicID field, IWANT does not. If we still have the message in mcache, we could get the topic ID. But if it is missing, we simply don't know the topic.
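
Regarding the bucket-sizing commit above: nim-metrics' declareHistogram accepts an explicit buckets parameter, so the upper bucket can be capped at 512. A sketch under the assumption that the metric keeps the libp2p_mplex_qlen name; the exact bucket boundaries used in the PR may differ:

import metrics

when defined(libp2p_mplex_metrics):
  declareHistogram libp2p_mplex_qlen, "message queue length",
    buckets = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]
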
@cskiraly cskiraly requested a review from Menduist April 6, 2022 10:46
@cskiraly cskiraly merged commit 9973b94 into unstable Apr 6, 2022
@cskiraly cskiraly deleted the metrics-wip branch April 6, 2022 14:00