-
Notifications
You must be signed in to change notification settings - Fork 2.6k
R4R: prometheus exporter in substrate #4511
Conversation
It looks like @nodebreaker0-0 signed our Contributor License Agreement. 👍 Many thanks, Parity Technologies CLA Bot |
6686b92
to
1974f94
Compare
Hi, thanks for your PR! Have you considered integrating these changes with the existing metrics recording set up? #4187 |
@expenses To expand, it would be nice to be able to do record_metrics!(
a => 1,
b => 2,
...
) and have it go to either the built-in metrics server or the Prometheus server depending on a
|
Integrating the metric server is a good idea. but, The reason for using prometheus is due to grafana extensibility and to use the function completely, you must use prometheus code as below. But the heart of the project is v0.4, which will be integrated later. Further consideration is needed for integrations that take into account extended metrics. First, If the code runs without any problem, |
This shouldn't be a problem, as I think that the metrics recording will only happen outside the runtime. |
Hi. I think the metric serving part should be separated from grafana-data-source so that we can connect our metric into the separated module. If you can separate it soon(2weeks from now)(we spent too much time on this project as now), we can migrate the metric part to newly separated metric serving module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am one of the maintainers of the Prometheus project, thus naturally I am very happy about this effort.
I don't think we should expose both a Granfana endpoint as well as a Prometheus endpoint. Given that Grafana can connect to a Prometheus server and given that the Prometheus metric format is adopted by many software projects and is pretty much the industry standard, I would prefer a Prometheus endpoint. But before we rush anything, this should be properly discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks decent in general. A few nitpicks and things to tidy up around gauge creation etc.
I've made a PR on this fork with a few changes :^) nodebreaker0-0#2 |
client/tracing/src/lib.rs
Outdated
log::warn!("Unable to send metrics to grafana: {:?}", e); | ||
} | ||
fn send_prometheus(span_datum: SpanDatum) { | ||
unimplemented!() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mxinden do you know of a way that we could implement this API? It seems like prometheus gauges etc are intended to created as static refs etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that you can't implement it without creating a macro.
Or you can simply make a static reference.
https://github.com/tikv/rust-prometheus/blob/master/examples/example_edition_2018.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't yet understand fully how prometheus works, but to properly support this I think we would need to be able to:
- batch together all values (spans) collected since last scrape
- partition by target and name (by the use of labels perhaps).
With all targets currently in the code enabled, we can be approaching 200 measurements per second just across the runtime. Is this feasible to do while preserving timestamp (which we would need to explicitly provide for this receiver)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I'll have a go at this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be approaching 200 measurements
@mattrutherford 200 measurements are not an issue for Prometheus. Prometheus can easily handle couple of mega bytes of metric data per scrape. See kubernetes/kube-state-metrics#498 for some numbers.
batch together all values (spans) collected since last scrape
This sounds more like we want a Histogram, right?
Would it be possible to tackle the effort within client/tracing
as a follow up pull request? I would like to keep this one small. What do you think @mattrutherford @expenses @nodebreaker0-0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's good to be clear on terminology, but that wasn't my main concern - only that if we're ripping out the Grafana server as part of the introduction of this Prometheus feature, is whether we'd lose this functionality to plot individual observations* - however that seems unavoidable based on what you said.
*For my use-case it's OK because I use the Telemetry Receiver in substrate-tracing to send the data to substrate-analytics which uses a PostgreSQL datasource, which in turn can be queried by Grafana (because I want to archive the data); however this is not trivial to set up, so the question is - do we really want to kill substrate Grafana server yet? It's a relatively new feature so maybe not many people use it, but I think it's something we should consider as part of our decision to remove it. Particularly as the current primary use-case for substrate-tracing
is for profiling and that might be handy for developers to have any easy way to look at this.
Pinging @marcio-diaz @shawntabrizi @DarkEld3r in case you have opinion on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo the Grafana server in its current state isn't really good at doing anything. If Prometheus isn't going to work for tracing, we should probably find something that is. I'm happy enough to leave the Grafana server in for the moment until that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, good to know - and just to be clear that I have no problem removing it, if it's not going to be a big detriment to people - just wanted to make sure everyone was aware of full implications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tracing other's nodes is not (should not) the purpose of the prometheus exporter. It is purely for monitoring operator's own nodes. Extracting information from nodes operated by others is not sounding right as a perspective of information privacy.
I guess tracing functionality should remain in telemetry, and it should be also "off as default" for privacy of node operators.
And storing and querying historical data for own nodes can be done on monitoring server(grafana), and it should not burden the node itself.
Please let me know if I misunderstood this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I can see what the misunderstanding is here and it warrants some more explanation - tracing is only capturing local data; which we then have the option to send somewhere, either Log (output tracing data via logger), Grafana server (built-in) or Telemetry. When using telemetry to send the tracing data, it is not expected (or desired) to send to the default telemetry url, so we override that via the cli, with eg: --telemetry-url 'ws://localhost:8080 9'
to send the data to an analytics server (often running on the same machine, but can be anywhere you want).
Co-Authored-By: Max Inden <[email protected]>
Fixed input validation in / metrics and enabling --no-prometheus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for introducing the --no-prometheus
flag and restricting exposition to /metrics
!
utils/prometheus/src/lib.rs
Outdated
.map_err(Error::Http) | ||
} else { | ||
Response::builder() | ||
.status(StatusCode::NOT_FOUND) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line and the ones below would need to be indented. See Response::builder().status(...)
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
client/cli/src/params.rs
Outdated
#[structopt(long = "prometheus-port", value_name = "PORT")] | ||
pub prometheus_port: Option<u16>, | ||
|
||
/// Disable connecting to the Substrate prometheus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Disable connecting to the Substrate prometheus. | |
/// Do not expose a Prometheus metric endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One does not disable others to connect to it, but the exposition in itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
client/cli/src/params.rs
Outdated
|
||
/// Disable connecting to the Substrate prometheus. | ||
/// | ||
/// prometheus is on by default on global chains. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// prometheus is on by default on global chains. | |
/// Prometheus metric endpoint is enabled by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
As far as I can tell ccb3179 is missing the following change in order to compile: diff --git a/client/cli/src/params.rs b/client/cli/src/params.rs
index 128c66b30..1a4b28443 100644
--- a/client/cli/src/params.rs
+++ b/client/cli/src/params.rs
@@ -299,7 +299,6 @@ arg_enum! {
pub enum TracingReceiver {
Log,
Telemetry,
- Prometheus,
}
}
@@ -308,7 +307,6 @@ impl Into<sc_tracing::TracingReceiver> for TracingReceiver {
match self {
TracingReceiver::Log => sc_tracing::TracingReceiver::Log,
TracingReceiver::Telemetry => sc_tracing::TracingReceiver::Telemetry,
- TracingReceiver::Prometheus => sc_tracing::TracingReceiver::Prometheus,
}
}
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me once:
-
The branch compiles (see small diff above).
-
The merge conflicts have been addressed.
Thanks for the hard work and again thanks for bearing with us.
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to go from my side. (One small comment, but that can be deferred to a follow up.)
@expenses do you have any further comments?
client/service/src/builder.rs
Outdated
|
||
let _ = to_spawn_tx.unbounded_send(( | ||
Box::pin(future), | ||
From::from("prometheus-on-block") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what "prometheus-on-block" relates to. I would expect something like prometheus-server
or prometheus-endpoint
. Can you explain your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand a little bit about what this was, so I copied the Telemetry, but I think it's the prometheus-endpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I have a few nitpicks around style (missing newlines at EOF etc) but I don't want to delay the PR further.
@nodebreaker0-0 you need to update Cargo.lock so that CI happy run |
Real Done!. |
please merge master and we are ready to merge. |
We are B-Harvest working on w3f granted project
(https://github.com/w3f/Web3-collaboration/blob/master/grants/speculative/substrate%20x%20(prometheus%20%2B%20grafana)%20by%20B-Harvest.md).
This PR is a draft implementation to request a review to be merged from parity team. We will do necessary modification if raised.
Please refer to README.md for detailed function information.
Create new due to error:
[#4505 ]