expose more libp2p performance and queuing metrics #678
Conversation
I like the idea of having a libp2p Grafana dashboard, though maybe it could go in the
My problem with this has always been that you can't really use it directly with libp2p, since there is no executable and there doesn't seem to be any way of importing definitions from a top-level project, which is a bit of a shame.
We could have some demo DummyApp that e.g. gossips a bit. Obviously not as an integral part of the code, but in a separate folder. Or maybe some network performance testing app that sends data around.
Sure, but that could be a nice "first dashboard" for anyone using libp2p for their project, or even a dedicated dashboard for troubleshooting libp2p, which could be used by any project.
+1 for a libp2p metrics dashboard that exposes as many metrics as possible - all apps using libp2p will benefit from it because it shows the full measurement capabilities available - they can import it in a "subsection" of their "app dashboard" or use it as a copy-paste source.
Should I rebase it on top of some current branch? I've intentionally started from the commit used in the Nimbus-eth2 stable branch.
Nimbus unstable should currently point to master/unstable (except the last commit).
Oh! So you can import the definitions in Grafana? That is indeed awesome.
https://grafana.com/docs/grafana/latest/panels/panel-library/ or you can just copy-paste the JSON snippet that is available via buttons in the Grafana UI.
@Menduist, I would say this is ready to merge.
libp2p/muxers/mplex/lpchannel.nim (outdated)
@@ -23,6 +23,10 @@ logScope:
when defined(libp2p_network_protocols_metrics):
  declareCounter libp2p_protocols_bytes, "total sent or received bytes", ["protocol", "direction"]
  declareHistogram libp2p_protocols_qlen, "message queue length", ["protocol"],
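For orientation, a minimal sketch of how these declarations could be fed at the point where a message is queued for sending; `protocol`, `msgLen`, and `queueLen` are placeholder parameters, not actual lpchannel fields:

```nim
import metrics

when defined(libp2p_network_protocols_metrics):
  declareCounter libp2p_protocols_bytes, "total sent or received bytes", ["protocol", "direction"]
  declareHistogram libp2p_protocols_qlen, "message queue length", ["protocol"]

  proc trackOutgoing(protocol: string, msgLen, queueLen: int) =
    # count outgoing bytes per protocol
    libp2p_protocols_bytes.inc(msgLen.int64, labelValues = [protocol, "out"])
    # record the queue length seen at enqueue time
    libp2p_protocols_qlen.observe(queueLen.float64, labelValues = [protocol])
```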
Does it really make sense to have these metrics be per protocol?
Since the queue is global, they should be about the same for every protocol, no?
The queue should be per stream though, which can be mapped to a protocol.
Having it be per-protocol means 12 metrics instead of one (for nimbus), so we have to be sure it makes sense.
You're right, there are limits to how many metrics we can collect in practice, even for tracing/debugging purposes; for example, Prometheus quickly gets overwhelmed if too many metrics are pushed to it.
Anything stopping this from being merged?
We talked about this a little bit with tanguy - libp2p is clearly layered and it would make sense for the exposed metrics to have a structure that follows these layers "naturally". With that in mind, it's good to be mindful that it's easier to add metrics than it is to remove them, though it's not a strict requirement that obsolete metrics hang around - it's a "soft" compatibility thing.
the other point about the layering was that one should be able to read a story from the metrics - just like one API uses another and translates the abstraction levels, so should the metrics relate to each other layer by layer.
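As an illustration of that layering idea, a sketch with purely hypothetical, layer-prefixed metric names (none of these are actual nim-libp2p metrics), so a message can be followed down the stack:

```nim
import metrics

# Hypothetical, layer-prefixed names: a reader can follow a message
# from the application-facing layer down to the wire.
declareCounter libp2p_gossipsub_messages_published, "messages handed to gossipsub"
declareCounter libp2p_mplex_messages_queued, "messages queued by mplex for writing"
declareCounter libp2p_transport_bytes_written, "bytes written to the wire"
```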
Though, the metric added here is a bit different. The goal of this metric is only to know how long our queue is, and this queue will be per-peer, basically. Hence my question. My gut feeling is that, if anything, having it be per-protocol will be error inducing, since the buffering is not actually per-protocol, but per-peer. But if it exhibits interesting behaviors, I'm happy to keep it that way.
oh, sorry,
Good point, in this context it does make sense to make it per peer rather than per protocol. I agree with @arnetheduck about layering and in general would like to see our metrics re-engineered, both at the level of information we collect as well as the overall architecture. We're quickly reaching the limits of our simplistic approach and it's only becoming harder to keep track as we keep adding more.
Indeed, it would be great to have metrics per layer with some naming convention, and feature flags for anything that is somewhat heavy in processing.
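A minimal sketch of that feature-flag pattern, using a made-up metric name (`libp2p_example_queue_time`); both the declaration and the observation site are gated, so a build without the flag pays no runtime cost:

```nim
import metrics

when defined(libp2p_network_protocols_metrics):
  # hypothetical metric, only compiled in when the flag is set
  declareHistogram libp2p_example_queue_time, "time spent in the send queue (seconds)"

proc onMessageSent(queuedSeconds: float) =
  when defined(libp2p_network_protocols_metrics):
    libp2p_example_queue_time.observe(queuedSeconds)
  else:
    discard
```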
Sorry, I don't remember: what was the state of this? @cskiraly
I think we've resolved everything, and it can be merged.
3 groups of metrics:
I haven't seen anything I would consider an overlap.
@@ -25,6 +25,7 @@ declareGauge(libp2p_gossipsub_no_peers_topics, "number of topics in mesh with no
declareGauge(libp2p_gossipsub_low_peers_topics, "number of topics in mesh with at least one but below dlow peers")
declareGauge(libp2p_gossipsub_healthy_peers_topics, "number of topics in mesh with at least dlow peers (but below dhigh)")
declareCounter(libp2p_gossipsub_above_dhigh_condition, "number of above dhigh pruning branches ran", labels = ["topic"])
declareSummary(libp2p_gossipsub_mcache_hit, "ratio of successful IWANT message cache lookups", labels = ["topic"])
Is the "topic" label used?
It's not set. This would make sense as a per-topic metric, but while IHAVE and other messages have a topicID field, IWANT does not.
If we still have the message in mcache, we could get the topic ID. But if it is missing, we simply don't know, I'm afraid.
Consequence: if you agree, let's just remove the labels here.
If you don't set declared labels, the metrics will crash at runtime.
Yeah, let's remove it
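For reference, a sketch of the de-labelled version; with labels = ["topic"] declared, every observe() call would have to pass a matching labelValues, which the IWANT path cannot supply:

```nim
import metrics

# No "topic" label: plain observe() calls are enough.
declareSummary libp2p_gossipsub_mcache_hit, "ratio of successful IWANT message cache lookups"

proc recordMcacheLookup(hit: bool) =
  # observing 1.0 for hits and 0.0 for misses makes sum/count act as the hit ratio
  let value = if hit: 1.0 else: 0.0
  libp2p_gossipsub_mcache_hit.observe(value)
```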
Adding counters for received deduplicated messages and for duplicates recognized by the seen cache. Note that duplicates that are not recognized (arrive after seenTTL) are not counted as duplicates here either.
It is generally assumed that IWANT messages arrive when mcache still has the message. These stats are to verify this assumption.
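A rough sketch of how such counters might be bumped; the identifiers and wrapper procs below are illustrative, not the exact ones added by this PR:

```nim
import metrics

declareCounter libp2p_example_messages_received, "deduplicated messages received"
declareCounter libp2p_example_duplicates, "duplicates recognized by the seen cache"
declareCounter libp2p_example_iwant_mcache_miss, "IWANT requests not found in mcache"

proc onMessage(alreadySeen: bool) =
  # duplicates arriving after seenTTL are not recognized and thus not counted here
  if alreadySeen:
    libp2p_example_duplicates.inc()
  else:
    libp2p_example_messages_received.inc()

proc onIWant(foundInMcache: bool) =
  # checks the assumption that IWANT arrives while mcache still holds the message
  if not foundInMcache:
    libp2p_example_iwant_mcache_miss.inc()
```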
Messages are queued on the TX side before being written to the stream, but we have no statistics about these queues. This patch adds some queue length and queuing time related statistics.
Adding a Grafana dashboard with the newly exposed metrics.
libp2p_protocols_qtime keeps metrics per protocol, so we put it behind the same compile-time flag as similar per-protocol metrics.
Since the default queue size is 1024, it is better to have buckets up to 512.
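For illustration, explicit buckets topping out at 512 might be declared as below; the exact bucket values are made up, and the buckets parameter is assumed to be the one accepted by nim-metrics' declareHistogram:

```nim
import metrics

# Powers of two up to 512; values above the last bucket still land in
# the implicit +Inf bucket, so a full 1024-slot queue remains visible.
declareHistogram(libp2p_example_qlen, "message queue length",
  buckets = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0])
```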
Queuing currently happens in the mplex layer. We'd better make this clear in the names of the metrics and the associated flags.
This would make sense as a per-topic metric, but while IHAVE and other messages have a topicID field, IWANT does not. If we still have the message in mcache, we could get the topic ID. But if it is missing, we just simply don't know the topic.
Signed-off-by: Csaba Kiraly <[email protected]>
The goal of this PR is to add metrics at the libp2p level, allowing deeper insight into message dynamics.
To see all the metrics, compile with:
-d:libp2p_network_protocols_metrics
The PR also includes a Grafana dashboard (in the tools folder) to show some libp2p-level metrics.