-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Expose REST API to fetch worker stats in Prometheus format #22360
[native] Expose REST API to fetch worker stats in Prometheus format #22360
Conversation
f218fc1
to
6470454
Compare
Codenotify: Notifying subscribers in CODENOTIFY files for diff bccb13c...c73c5aa. No notifications. |
6470454
to
711cfb6
Compare
#include <gtest/gtest.h> | ||
|
||
namespace facebook::presto::prometheus { | ||
class PrometheusReporterTest : public testing::Test { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will not run in the CI since that flag is off by default, can we do anything else to make it run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have to create a new CI job that builds code with prometheus enabled and run this test. Or, always build the code in existing CI jobs with prometheus enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just add a comment stating that this test is NoOp in CI for now
// the ::prometheus::Registry allows us to Add new metric object each time | ||
// which overwrites the existing metric object in its internal mapping. | ||
mutable CounterMap countersMap_; | ||
mutable GaugeMap gaugeMap_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to use folly::ConcurrentHashMaps directly instead of having a custom mutex? That way concurrency will be better since all maps do not need to share a single mutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to using Synchronous<> decorator.
mutable std::unordered_map<std::string, ::facebook::velox::StatType> | ||
registeredMetrics_; | ||
// @TODO: may be add StatType::HISTOGRAM to avoid this member. | ||
mutable std::unordered_set<std::string> registeredHistogramMetrics_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this set? Can we just use histogramMap_ instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we don't need it. I wanted to isolate the registered metrics Map which is modified once at startup from the actual metrics map (which can be modified concurrently).
// If percentiles are provided, create a Summary type metric and register. | ||
if (pcts.size() > 0) { | ||
auto& summaryFamily = ::prometheus::BuildSummary() | ||
.Name(keyStr.append(kSummarySuffix)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wont the metric name changed if we add a ksummarySuffix? Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is it so that maybe we don't need the histogram if pcts are present and we can just register summary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you check prometheus, Histograms and Summaries are 2 different metric types. If the incoming metric has quantiles (pcts) specified, then we use summary type as well.
https://prometheus.io/docs/concepts/metric_types/#histogram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://prometheus.io/docs/practices/histograms/#:~:text=The%20essential%20difference%20between%20summaries,the%20server%20side%20using%20the
The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the [histogram_quantile() function](https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile).
So maybe we do not need to register histogram if quantiles are present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this^ just fyi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can revisit this. Not sure at this point if both give different viewpoints or Summary is sufficient over histograms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool thats fine, just check the kSummarySuffix
i think the code here should be keyStr + "_Suffix" by value and we should not change the reference here
case facebook::velox::StatType::SUM: | ||
case facebook::velox::StatType::AVG: | ||
case facebook::velox::StatType::RATE: { | ||
auto& gaugeFamily = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely sure about the code style but if we don't write std::string as ::std::string
then why should we write ::prometheus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a required thing to resolve global namespace. Compiler complains if we remove this. I don't have an exact reason why we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karteekmurthys : Can you add "using namespace prometheus" statement to avoid this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
// Prometheus format requires '.' to be replaced. | ||
std::string keyStr = std::string(key); | ||
std::replace(keyStr.begin(), keyStr.end(), '.', '_'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the replacement of '.' being used in addMetricValue
function do we need this there?
presto-native-execution/presto_cpp/main/common/prometheus-metrics/PrometheusReporter.cpp
Outdated
Show resolved
Hide resolved
// If percentiles are provided, create a Summary type metric and register. | ||
if (pcts.size() > 0) { | ||
auto& summaryFamily = ::prometheus::BuildSummary() | ||
.Name(keyStr.append(kSummarySuffix)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have keyStr.append(kSummarySuffix)
which means that the reference changed, then you have summaryMap_.emplace(std::string(key), summaryMetric);
. So i think you are storing the key+suffix in the summary map
.Register(*registry_); | ||
::prometheus::Summary::Quantiles quantiles; | ||
for (auto pct : pcts) { | ||
quantiles.push_back( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can just use quantiles.emplace_back(pct / (double)100, 0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments + I think we need to setup a CI job for this (maybe make unittest with flag on just for this unit test temoporariliy)
711cfb6
to
a4a7bd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @karteekmurthys. Only few minor comments.
// Initialize singleton for the reporter | ||
folly::Singleton<facebook::velox::BaseStatsReporter> reporter([]() { | ||
auto nodeConfig = facebook::presto::NodeConfig::instance(); | ||
std::string cluster = nodeConfig->nodeEnvironment(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const for cluster, hostName and worker variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -1093,6 +1109,10 @@ void PrestoServer::populateMemAndCPUInfo() { | |||
|
|||
// Fill global pool fields. | |||
poolInfo.maxBytes = nodeMemoryGb * 1024 * 1024 * 1024; | |||
// TODO(sperhsin): If 'current bytes' is the same as we get by summing up all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit spelling spershin
@@ -1146,6 +1165,20 @@ void PrestoServer::reportServerInfo(proxygen::ResponseHandler* downstream) { | |||
http::sendOkResponse(downstream, json(serverInfo)); | |||
} | |||
|
|||
void PrestoServer::reportWorkerMetrics(proxygen::ResponseHandler* downstream) { | |||
auto nodeConfig = facebook::presto::NodeConfig::instance(); | |||
std::string cluster = nodeConfig->nodeEnvironment(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
char* hostName = std::getenv("HOSTNAME"); | ||
std::string worker = !hostName ? "" : hostName; | ||
#ifdef PRESTO_ENABLE_PROMETHEUS_REPORTER | ||
auto reporter = std::dynamic_pointer_cast< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe all variables above should be within #ifdef. Else it can be empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -328,6 +336,14 @@ void PrestoServer::run() { | |||
proxygen::ResponseHandler* downstream) { | |||
server->reportServerInfo(downstream); | |||
}); | |||
httpServer_->registerGet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this will always be registered irrespective of whether you are using Prometheus. Is that desired ? We could avoid it in the Meta environment where they are not using Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now it is registered and if the Prometheus is not enabled it is a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd to make it a no-op. It is better to return an invalid HTTP request error than an empty result. The later would be confusing.
Let's move the above if(systemConfig->enableRuntimeMetricsCollection())
check here and include this.
auto bound = min + bucketWidth; | ||
std::string keyStr = std::string(key); | ||
std::replace(keyStr.begin(), keyStr.end(), '.', '_'); | ||
auto& histogramFamily = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : Add newline before
} | ||
|
||
void PrometheusReporter::registerHistogramMetricExportType( | ||
folly::StringPiece key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not const& ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defined in interface.
"test_key4{" + labelsSerialized + "} 0"}; | ||
|
||
auto verifySerializedResult = [&](std::string& fullSerializedResult, | ||
std::vector<std::string>& expected) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"# TYPE test_key4 gauge", | ||
"test_key4{" + labelsSerialized + "} 0"}; | ||
|
||
auto verifySerializedResult = [&](std::string& fullSerializedResult, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel you should send a const& here and make a copy in the verify routine. It makes it more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"# TYPE test_key4 gauge", | ||
"test_key4{" + labelsSerialized + "} 0"}; | ||
|
||
auto verifySerializedResult = [&](std::string& fullSerializedResult, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can even make this a separate function instead of a lambda
22c1820
to
12302f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments related to build integration.
presto-native-execution/presto_cpp/main/common/prometheus-metrics/PrometheusReporter.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karteekmurthys some high-level comments.
@@ -328,6 +336,14 @@ void PrestoServer::run() { | |||
proxygen::ResponseHandler* downstream) { | |||
server->reportServerInfo(downstream); | |||
}); | |||
httpServer_->registerGet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd to make it a no-op. It is better to return an invalid HTTP request error than an empty result. The later would be confusing.
Let's move the above if(systemConfig->enableRuntimeMetricsCollection())
check here and include this.
presto-native-execution/presto_cpp/main/common/prometheus-metrics/PrometheusReporter.h
Outdated
Show resolved
Hide resolved
12302f1
to
1d56b00
Compare
@tanjialiang, @amitkdutta Can you make another pass? All your feedback has been addressed. |
fdf16dc
to
2600215
Compare
presto-native-execution/presto_cpp/main/runtime-metrics/PrometheusStatsReporter.cpp
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/runtime-metrics/PrometheusStatsReporter.cpp
Show resolved
Hide resolved
presto-native-execution/presto_cpp/main/runtime-metrics/PrometheusStatsReporter.cpp
Show resolved
Hide resolved
2600215
to
b3a24cc
Compare
@amitkdutta, please import this PR and let us know if you see any other issues inside Meta. Thanks! |
@amitkdutta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me. The only issue is renaming the virtual function from enableRuntimeMetricReporting
to enableWorkerStatsReporting
, which we can fix inside Meta.
Thanks @karteekmurthys for all the iterations.
@karteekmurthys Let's fix the commit name to match the new naming
|
@amitkdutta @tanjialiang Thanks for the review. |
b3a24cc
to
7e49c3f
Compare
@tdcmeehan can you please approve this PR? The circleci change requires a committer approval. Thanks! |
nit: Should use "[native]"(lowercase) as a tag instead of "[Native]". |
Should we add documentation on how to enable reporting? |
I have opened another PR for that https://github.com/prestodb/presto/pull/23107/files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward to this feature.
presto-native-execution/presto_cpp/main/runtime-metrics/PrometheusStatsReporter.cpp
Show resolved
Hide resolved
size_t value) const { | ||
auto metricIterator = registeredMetricsMap_.find(key); | ||
if (metricIterator == registeredMetricsMap_.end()) { | ||
VLOG(1) << "addMetricValue for unregistered metric " << key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addMetricValue
-> addHistogramMetricValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Add Prometheus Reporter using the prometheus-cpp library. Add a CMake flag PRESTO_ENABLE_PROMETHEUS_REPORTER to enable Prometheus Reporter. Add REST API '/v1/info/metrics' to fetch worker metrics from Prometheus Reporter in prometheus format. This endpoint is only enabled if the Prometheus Reporter is enabled. Co-authored-by:jaystarshot <[email protected]>
7e49c3f
to
c73c5aa
Compare
Description
Related RFC.