-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added histogram latency information to Hystrix dashboard stream #3986
Conversation
Signed-off-by: trabetti <[email protected]>
Adding timing information from upstream_rq_time histogram.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks for working on this!! Some general comments. In general, we seem to be creating a bunch of temporary strings in the concatenation process. It would be nice to refactor to try and concatenate everything directly to the underlying stringstream
if you think that's doable. But it might make sense to save this refactor for a later PR.
@@ -41,6 +42,16 @@ void ClusterStatsCache::printRollingWindow(absl::string_view name, RollingWindow | |||
out_str << std::endl; | |||
} | |||
|
|||
std::string ClusterStatsCache::printTimingHistogram() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this addHistogramToStream
and let it take in a stringstream
?
std::unordered_map<std::string, double> hist_map; | ||
for (size_t i = 0; i < histogram->cumulativeStatistics().supportedQuantiles().size(); ++i) { | ||
if (histogram->cumulativeStatistics().supportedQuantiles()[i] == 0.999) { | ||
// Envoy provide 99.9 percentile, while Hystrix shows 99.5, so there is a small |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just add that to the supported quantiles? It seems like a reasonable percentile to care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can? I have no idea if this will effect anything else.. I'll try and see if any test fails.
// i.e. "cluster.service1.upstream_rq_time". | ||
std::vector<std::string> split_name = absl::StrSplit(histogram->name(), "."); | ||
if (split_name[0] == "clusters" && split_name[2] == "upstream_rq_time") { | ||
std::unordered_map<std::string, double> hist_map; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid the copy later by doing the following:
std::unordered_map<std::string, double>& hist_map = time_histograms[split_name[1]];
for (const Stats::ParentHistogramSharedPtr histogram : server_.stats().histograms()) { | ||
// histogram->name() on clusters of the format "cluster.cluster_name.histogram_name" | ||
// i.e. "cluster.service1.upstream_rq_time". | ||
std::vector<std::string> split_name = absl::StrSplit(histogram->name(), "."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set the return type to std::vector<absl::string_view>
(weird StrSplit
return type allows casting to a bunch of different types), this won't require us to do a bunch of string copies.
Also, I think it's slightly better to pass in the the delimiter as a char ('.'
) rather than a string (really it's a char*
) if it's only a single character because it will allow the library to choose a non-multi character overload.
@@ -34,6 +36,8 @@ struct ClusterStatsCache { | |||
RollingWindow total_; | |||
RollingWindow timeouts_; | |||
RollingWindow rejected_; | |||
|
|||
std::unordered_map<std::string, double> timing_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this to be cached if we're just going to regenerate the map each time?
Also, why not make this a double to double instead by storing the quantile as a double rather than a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this to be cached if we're just going to regenerate the map each time?
I still have an open question if it worth to do any kind of calculation over the percentile values, instead of giving the values since the beginning of the cluster's operation. Otherwise, I can pass it directly without using the cache.
why not make this a double to double instead by storing the quantile as a double rather than a string?
Just to overcome the inconsistent representation - "99.5" vs. all other values which are integers (25, 50, 99, etc..), and if I pass them as doubles, dashboard do not accept. It was convenient to handle this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just mapping "99.5" to 99.5? and "25" to 25.0?
Unless there's a big efficiency issue I'd vote for doing this type coercion closer to where it's consumed. Otherwise commenting why you are doing this would be really helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is mapping the string representation of the quantile to its current value from the histogram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean in the PR; comments; I mean adding comments in the code. Document both what it's doing and also why you think this is helpful from a code-structure or performance perspective.
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
docs/root/operations/admin.rst
Outdated
@@ -352,6 +352,7 @@ The fields are: | |||
In Envoy, service unavailable response will cause **outlier detection** - removing a node off the | |||
load balancer pool, but requests are not rejected as a result. Therefore, this counter is always | |||
set to '0'. | |||
* Latency information is currently unavailable. | |||
* Latency information represent cumulative data from start of the clusters operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/represent/represents/
s/start/the start/
@@ -34,6 +36,8 @@ struct ClusterStatsCache { | |||
RollingWindow total_; | |||
RollingWindow timeouts_; | |||
RollingWindow rejected_; | |||
|
|||
std::unordered_map<std::string, double> timing_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just mapping "99.5" to 99.5? and "25" to 25.0?
Unless there's a big efficiency issue I'd vote for doing this type coercion closer to where it's consumed. Otherwise commenting why you are doing this would be really helpful.
@@ -41,6 +43,14 @@ void ClusterStatsCache::printRollingWindow(absl::string_view name, RollingWindow | |||
out_str << std::endl; | |||
} | |||
|
|||
void ClusterStatsCache::addHistogramToStream(std::stringstream& ss) { | |||
bool is_first = true; | |||
for (std::pair<std::string, double> element : timing_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::pair<std::string, double>&
@@ -65,7 +75,8 @@ uint64_t HystrixSink::getRollingValue(RollingWindow rolling_window) { | |||
} | |||
|
|||
void HystrixSink::updateRollingWindowMap(const Upstream::ClusterInfo& cluster_info, | |||
ClusterStatsCache& cluster_stats_cache) { | |||
ClusterStatsCache& cluster_stats_cache, | |||
std::unordered_map<std::string, double>& histogram) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a 'using' shortcut for this map type, declared in the HystrixSink scope, I think.
ss << ", \"latencyExecute\": {"; | ||
cluster_stats_cache.addHistogramToStream(ss); | ||
ss << "}"; | ||
// addInfoToStream("latencyExecute", "{" + cluster_stats_cache.printTimingHistogram() + "}", ss); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this line?
// histogram->name() on clusters of the format "cluster.cluster_name.histogram_name" | ||
// i.e. "cluster.service1.upstream_rq_time". | ||
const std::vector<absl::string_view> split_name = absl::StrSplit(histogram->name(), '.'); | ||
if (split_name[0] == "cluster" && split_name[2] == "upstream_rq_time") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check split_name.size()>2 first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean there could be a case where the histogram name is made up of more parts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more parts would be OK. less parts would crash. I would prefer to be locally paranoid and have an assertion rather than a SEGV if that occurs.
histogram->cumulativeStatistics().supportedQuantiles()[i]) != | ||
hystrix_quantiles.end()) { | ||
if (histogram->cumulativeStatistics().supportedQuantiles()[i] == 0.995) { | ||
hist_map["99.5"] = histogram->cumulativeStatistics().computedQuantiles()[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this require special-casing? add comment in code.
std::unordered_map<std::string, double>& hist_map = time_histograms[split_name[1]]; | ||
for (size_t i = 0; i < histogram->cumulativeStatistics().supportedQuantiles().size(); ++i) { | ||
if (std::find(hystrix_quantiles.begin(), hystrix_quantiles.end(), | ||
histogram->cumulativeStatistics().supportedQuantiles()[i]) != |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make a temp ref for supportedQuartiles()? Would make this long std::find call easier to read.
for (size_t i = 0; i < histogram->cumulativeStatistics().supportedQuantiles().size(); ++i) { | ||
if (std::find(hystrix_quantiles.begin(), hystrix_quantiles.end(), | ||
histogram->cumulativeStatistics().supportedQuantiles()[i]) != | ||
hystrix_quantiles.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be handling for a case where the find() fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't it suppose to return hystrix_quantiles.end() if it fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, but is that expected? Should you handle that case?
"latencyExecute", | ||
"{\"0\":0,\"25\":0,\"50\":0,\"75\":0,\"90\":0,\"95\":0,\"99\":0,\"99.5\":0,\"100\":0}", ss); | ||
ss << ", \"latencyExecute\": {"; | ||
cluster_stats_cache.addHistogramToStream(ss); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass the key into the method as a string_view as you do for the others? This would allow the method to handle the brackets, etc. You can still concatenate them directly to the underlying stringstream directly IIUC.
Signed-off-by: trabetti <[email protected]>
hystrix_quantiles.end()) { | ||
if (supported_quantiles[i] == 0.995) { | ||
// The only non int quantile value | ||
hist_map["99.5"] = histogram->cumulativeStatistics().computedQuantiles()[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the Hystrix formatting requirement for these strings? Could we possibly take advantage of sprintf
formatting options to format all of them the same way instead of using an if statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the 99.5 is not an int. all others are integers and it didn't work when i tried sending these as double.
{"95": 12.500000, "100": 16000.000000, "99": 15300.000000, "90": 7.100000, "0": 1.000000, "99.5": 15650.000000, "25": 1.060345, "50": 2.040000, "75": 3.070833}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the requirements? Would "95.0" be okay? Or does it have to be exactly "95"? There are some float formatting options in sprintf
that allow you to limit the number of decimals on floats, which might allow you to get what you want without the if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sprintf with %g
might do what you want. But are you using that? Or fmt::format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also you might want to consider using integers scaled by 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing, you have the solution for everything 😀 it worked. I am keeping it as a double to make it more clear, does it improve a lot to scale by 10 and use integer?
@@ -304,6 +320,33 @@ void HystrixSink::flush(Stats::Source&) { | |||
incCounter(); | |||
std::stringstream ss; | |||
Upstream::ClusterManager::ClusterInfoMap clusters = server_.clusterManager().clusters(); | |||
|
|||
// Save a map of the relevant histograms per cluster in a convenient format. | |||
std::unordered_map<absl::string_view, std::unordered_map<std::string, double>, StringViewHash> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a string_view
as the key for a map is a little dangerous. You need to be sure that the string that's being used to generate the string_view
lasts longer than that element in the map. This is mostly true here, but I think there's a small edge case where it might be possible to cause a segfault. You're grabbing a vector of histogram shared pointers, but that goes away at the end of the scope, which means it's possible for one of the histograms that's holding one of the referenced strings to be destroyed by the time you do the lookup in the map. I'd suggest just using a regular string as the key.
Also, can't we use QuantileLatencyMap here instead of typing out the entire type?
// Save a map of the relevant histograms per cluster in a convenient format. | ||
std::unordered_map<absl::string_view, std::unordered_map<std::string, double>, StringViewHash> | ||
time_histograms; | ||
for (const Stats::ParentHistogramSharedPtr histogram : server_.stats().histograms()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we should be using the Stats::Source::cachedHistograms()
method on the Source
that's being passed in (
envoy/include/envoy/stats/stats.h
Line 301 in 3460595
virtual const std::vector<ParentHistogramSharedPtr>& cachedHistograms() PURE; |
…lative extreme values stay forever, real problems can't be distinguihsed from temporary Signed-off-by: trabetti <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! LGTM once the comments below are resolved.
if (callbacks_list_.empty()) { | ||
return; | ||
} | ||
incCounter(); | ||
std::stringstream ss; | ||
Upstream::ClusterManager::ClusterInfoMap clusters = server_.clusterManager().clusters(); | ||
|
||
// Save a map of the relevant histograms per cluster in a convenient format. | ||
std::unordered_map<std::string, QuantileLatencyMap, StringViewHash> time_histograms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the StringViewHash
template argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from when it was absl::string_view..
@@ -34,6 +38,9 @@ struct ClusterStatsCache { | |||
RollingWindow total_; | |||
RollingWindow timeouts_; | |||
RollingWindow rejected_; | |||
|
|||
// Map string representation of the quantile to its recently read value from the histogram | |||
QuantileLatencyMap timing_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we decided whether we need to store this between flushes? My vote would be to not store it until we implement the use-case for it.
… out of cache Signed-off-by: trabetti <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we testing this new feature?
// histogram->name() on clusters of the format "cluster.cluster_name.histogram_name" | ||
// i.e. "cluster.service1.upstream_rq_time". | ||
const std::vector<std::string> split_name = absl::StrSplit(histogram->name(), '.'); | ||
if (split_name.size() >= 2 && split_name[0] == "cluster" && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that you're essentially repeating work already done by Envoy. Rather than manually parsing the string, you could check the tag and tag_extracted_name since Envoy parses those for you. It may be a bit more code, but I think it's a bit safer since the regexes are a little more specific than splitting on .
. It's up to you if you'd rather use it or not, but I figured I'd present the alternative. For example (this may not compile since I haven't tried, but the rough idea is there :) ):
if (histogram->tag_extracted_name() == "cluster.upstream_rq_time") {
// TODO(mrice32): add an Envoy utility function to look up and return a tag for a metric.
auto it = std::find_if(histogram->tags().begin(), histogram->tags().end(), [](const Stats::Tag& tag) {
return tag.name_ == Config::TagNames::get().CLUSTER_NAME;
});
// Should always be true, so we probably want to log or something if not.
if (it == histogram.tags().end()) {
// We didn't find the cluster name despite finding the correct stat.
} else {
QuantileLatencyMap& hist_map = time_histograms[it->value_];
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we should log if a cluster_name tag is not found. What could be the reason? Anything the user can do about it? If not, I think we can just verify it is not histogram.tags().end().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would imply that something is broken since we expect that this should never happen. I wouldn't say it's a catastrophic failure, so it's not a crashable event, but logging an error to the console might catch someone's attention so that they could file a bug against Envoy. I'm not super opinionated, so it's ultimately up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is what ASSERT is for. It should stop tests from passing so you can investigate. I suspect a log would just slow things down and no one would notice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for ASSERT
for (const Stats::ParentHistogramSharedPtr histogram : source.cachedHistograms()) { | ||
// histogram->name() on clusters of the format "cluster.cluster_name.histogram_name" | ||
// i.e. "cluster.service1.upstream_rq_time". | ||
const std::vector<std::string> split_name = absl::StrSplit(histogram->name(), '.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can still be a vector of string_view
s. If we need to put them into a map, it's easy to convert them when needed. This means we only need to store them as strings if we need them to be permanent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took your suggestion to use tags, removed this vector.
const std::vector<std::string> split_name = absl::StrSplit(histogram->name(), '.'); | ||
if (split_name.size() >= 2 && split_name[0] == "cluster" && | ||
split_name[2] == "upstream_rq_time") { | ||
QuantileLatencyMap& hist_map = time_histograms[split_name[1]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we only expect to do this emplacement once per flush? What happens if you get multiple histograms that happen to satisfy your format checks and produce the same cluster name on the same flush (because of the dynamic nature of stats strings, things like this are always possible :) )? Won't the second overwrite the first? How do you want to handle that case?
If you want to do something other than overwrite, this case can be checked pretty easily by using time_histograms.emplace()
and checking the return value. See http://www.cplusplus.com/reference/unordered_map/unordered_map/emplace/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's no problem with overwriting. We want to show latency information in dashboard, it doesn't have to be exact data. If there is an extreme value (very high latency) it could be either a temporary situation that has already recovered, or a consistent problem the user wants to find and fix. In the case it was temporary, if we missed it because of the overwrite, it doesn't matter. If it is consistent, user will see it on next flush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I think overwriting is fine. However, it seems like the code assumes that each cluster should only have one histogram that matches this filter (a pretty good assumption IMO). If there's another histogram that somehow gets inside the if statement, it might be worthwhile to log an error because it means that something is probably broken here. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem, I can log in both cases, can you point me to a similar example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
@@ -26,6 +28,7 @@ struct ClusterStatsCache { | |||
void printToStream(std::stringstream& out_str); | |||
void printRollingWindow(absl::string_view name, RollingWindow rolling_window, | |||
std::stringstream& out_str); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good; just a few nits remaining.
// histogram->name() on clusters of the format "cluster.cluster_name.histogram_name" | ||
// i.e. "cluster.service1.upstream_rq_time". | ||
const std::vector<std::string> split_name = absl::StrSplit(histogram->name(), '.'); | ||
if (split_name.size() >= 2 && split_name[0] == "cluster" && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split_name.size() > 2
You are looking at split_name[2] so if size==2 that's a crasher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the code to use tags instead of this split, following @mrice32's suggestion
const std::vector<double>& supported_quantiles = | ||
histogram->cumulativeStatistics().supportedQuantiles(); | ||
for (size_t i = 0; i < supported_quantiles.size(); ++i) { | ||
if (std::find(hystrix_quantiles.begin(), hystrix_quantiles.end(), supported_quantiles[i]) != |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is hystrix_quantiles sorted? If so, binary-search (via std::lower_bound and an equality-test) might be faster, but it's probably a small enough N that it isn't a big deal I guess. Still it's worth a comment that this is a deliberate choice. E.g.
// binary-search here is likely not worth it, as hystrix_quantiles has <10 elements.
static void addIntToStream(absl::string_view key, uint64_t value, std::stringstream& info, | ||
bool is_first = false); | ||
|
||
static void addHistogramToStream(QuantileLatencyMap latency_map, absl::string_view key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const QuantileLatencyMap&
@@ -41,6 +44,18 @@ void ClusterStatsCache::printRollingWindow(absl::string_view name, RollingWindow | |||
out_str << std::endl; | |||
} | |||
|
|||
void HystrixSink::addHistogramToStream(QuantileLatencyMap latency_map, absl::string_view key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const QuantileLatencyMap&
std::stringstream& ss) { | ||
ss << ", \"" << key << "\": {"; | ||
bool is_first = true; | ||
for (const std::pair<double, double> element : latency_map) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::pair<double, double>&
Signed-off-by: trabetti <[email protected]>
Pushed a new version with fixes. Will look at testing tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than a few small comments and missing tests.
@@ -72,7 +74,8 @@ class HystrixSink : public Stats::Sink, public Logger::Loggable<Logger::Id::hyst | |||
void addClusterStatsToStream(ClusterStatsCache& cluster_stats_cache, | |||
absl::string_view cluster_name, uint64_t max_concurrent_requests, | |||
uint64_t reporting_hosts, | |||
std::chrono::milliseconds rolling_window_ms, std::stringstream& ss); | |||
std::chrono::milliseconds rolling_window_ms, | |||
QuantileLatencyMap& histogram, std::stringstream& ss); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we make this const
?
@@ -228,10 +245,10 @@ void HystrixSink::addClusterStatsToStream(ClusterStatsCache& cluster_stats_cache | |||
uint64_t max_concurrent_requests, | |||
uint64_t reporting_hosts, | |||
std::chrono::milliseconds rolling_window_ms, | |||
std::stringstream& ss) { | |||
QuantileLatencyMap& histogram, std::stringstream& ss) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we make this const
?
@@ -129,7 +150,7 @@ void HystrixSink::addHystrixCommand(ClusterStatsCache& cluster_stats_cache, | |||
absl::string_view cluster_name, | |||
uint64_t max_concurrent_requests, uint64_t reporting_hosts, | |||
std::chrono::milliseconds rolling_window_ms, | |||
std::stringstream& ss) { | |||
QuantileLatencyMap& histogram, std::stringstream& ss) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const
?
/** | ||
* Generate HystrixCommand event stream. | ||
*/ | ||
void addHystrixCommand(ClusterStatsCache& cluster_stats_cache, absl::string_view cluster_name, | ||
uint64_t max_concurrent_requests, uint64_t reporting_hosts, | ||
std::chrono::milliseconds rolling_window_ms, std::stringstream& ss); | ||
std::chrono::milliseconds rolling_window_ms, QuantileLatencyMap& histogram, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const QuantileLatencyMap&
?
// Make sure we found the cluster name tag | ||
ASSERT(it != histogram->tags().end()); | ||
// Make sure histogram with this name was not already added | ||
ASSERT(time_histograms.find(it->value_) == time_histograms.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can refactor to remove the double lookup:
auto it_bool_pair = time_histograms.emplace(std::make_pair(it->value_, QuantileLatencyMap()));
// Make sure histogram with this name was not already added
ASSERT(it_bool_pair.second);
QuantileLatencyMap& hist_map = it_bool_pair.first->second;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to do it->value_
in the std::make_pair
without the assert first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be, depending on your definition of 'safe' :). A false return in it_bool_pair.second just means an entry was already in the map for the specified key. In that case, the assert will fire and you'll crash. Is that safe?
Are we sure that duplicates are not possible?
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
Any updates? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits. LGTM. Passing it to @jmarantz and a maintainer for final review.
}); | ||
|
||
// Make sure we found the cluster name tag | ||
ASSERT(it != histogram->tags().end()); // Can remove this?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the "Can remove this??" comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to ask you if that's what you meant
// Data was added such that the value equals the quantile: | ||
// "latencyExecute": {"99.5": 99.500000, "95": 95.000000, "90": 90.000000, "100": 100.000000, "0": | ||
// 0.000000, "25": 25.000000, "99": 99.000000, "50": 50.000000, "75": 75.000000}. | ||
for (auto it = begin(hystrix_quantiles); it != end(hystrix_quantiles); ++it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not a simpler range-based loop:
for (const double quantile : hystrix_quantiles) {
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm modulo Matt's comments.
Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
Any idea why the intervalStatistics histogram latency data for lower quantiles is almost always '0' or nan?
@ramaraochavali maybe you know? |
…rd freezes with stale values + sync to upstream and resolved conflicts Signed-off-by: trabetti <[email protected]>
Signed-off-by: trabetti <[email protected]>
@trabetti |
thanks @ramaraochavali , i don't have raw values but that makes sense. By the way, is there any reasonable way for me to calculate histogram of accumulated latency values across several flush intervals (e.g. i want to accumulate and show accumulated data of the last 10 intervals)? |
No. There is no way currently that gives the accumulated value of last n flush cycles. It has only values computed in that the interval or from envoy start. |
@ramaraochavali I thought so.. thanks for confirming. |
Signed-off-by: trabetti <[email protected]>
Hi, can we finish with this one? (or are you all on vacations..?) Thanks! |
Sorry, this dropped off my radar. Thanks for all the hard work - assigning @htuch for senior review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, mostly a nits pass.
void HystrixSink::addHistogramToStream(const QuantileLatencyMap& latency_map, absl::string_view key, | ||
std::stringstream& ss) { | ||
ss << ", \"" << key << "\": {"; | ||
bool is_first = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a place where a join operation would make sense, but I can see that there is already a stylized approach to building these strings in this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to consider changing the code to use join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you, TODO is fine as well.
|
||
// Save a map of the relevant histograms per cluster in a convenient format. | ||
std::unordered_map<std::string, QuantileLatencyMap> time_histograms; | ||
for (const Stats::ParentHistogramSharedPtr histogram : source.cachedHistograms()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this be const Stats::ParentHistogramSharedPtr&
, since we don't want to take an additional refcount in the body below?
// Save a map of the relevant histograms per cluster in a convenient format. | ||
std::unordered_map<std::string, QuantileLatencyMap> time_histograms; | ||
for (const Stats::ParentHistogramSharedPtr histogram : source.cachedHistograms()) { | ||
if (histogram->tagExtractedName() == "cluster.upstream_rq_time") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be hardcoding strings like this? @mrice32 do we have a better way of handling this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not find anything searching the code
}); | ||
|
||
// Make sure we found the cluster name tag | ||
ASSERT(it != histogram->tags().end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed? I.e. is this truly an invariant, or should we have error handling behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we discussed it here #3986 (comment)
// binary-search here is likely not worth it, as hystrix_quantiles has <10 elements. | ||
if (std::find(hystrix_quantiles.begin(), hystrix_quantiles.end(), supported_quantiles[i]) != | ||
hystrix_quantiles.end()) { | ||
double value = histogram->intervalStatistics().computedQuantiles()[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: const double
ss << ", \"" << key << "\": {"; | ||
bool is_first = true; | ||
for (const std::pair<double, double>& element : latency_map) { | ||
std::string quantile = fmt::sprintf("%g", element.first * 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: const std::string
@@ -209,6 +212,14 @@ class HystrixSinkTest : public testing::Test { | |||
return cluster_message_map; | |||
} | |||
|
|||
histogram_t* makeHistogram(const std::vector<uint64_t>& values) { | |||
histogram_t* histogram = hist_alloc(); // The histogram needs to be freed later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a RAII wrapper for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it something like
class HistogramWrapper {
public:
HistogramWrapper() {
histogram = hist_alloc();
}
~HistogramWrapper() {
hist_free(histogram);
}
histogram_t* histogram;
};
and
HistogramWrapper hist1_interval;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; if you use hist_alloc()
in other places in your source, I recommend the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't, but I am tempted to use it in the test I borrowed from, thread_local_store_test.
Signed-off-by: trabetti <[email protected]>
@htuch pushed a new version with fixes to your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo tiny nits.
} | ||
|
||
private: | ||
histogram_t* histogram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: s/histogram/histogram_/
@@ -47,6 +47,24 @@ class StatsThreadLocalStoreTest : public testing::Test { | |||
std::unique_ptr<ThreadLocalStoreImpl> store_; | |||
}; | |||
|
|||
class HistogramWrapper { | |||
public: | |||
HistogramWrapper() { histogram = hist_alloc(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: HIstogramWrapper() : histogram_(hist_alloc()) {}
|
||
const histogram_t* getHistogram() { return histogram; } | ||
|
||
void initHistogram(const std::vector<uint64_t>& values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not in constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just since it can be called several times to set different values (IIUC). Although we don't do that in code. So maybe I'll change the name to setHistogramValues
Signed-off-by: trabetti <[email protected]>
fixed comments in 3d6395a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Pulling the following changes from github.com/envoyproxy/envoy: f936fc6 ssl: serialize accesses to SSL socket factory contexts (envoyproxy#4345) e34dcd6 Fix crash in tcp_proxy (envoyproxy#4323) ae6a252 router: fix matching when all domains have wildcards (envoyproxy#4326) aa06142 test: Stop fake_upstream methods from accidentally succeeding (envoyproxy#4232) 5d73187 rbac: update the authenticated.user to a StringMatcher. (envoyproxy#4250) c6bfc7d time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (envoyproxy#4257) 752483e Fixing the fix (envoyproxy#4333) 83487f6 tls: update BoringSSL to ab36a84b (3497). (envoyproxy#4338) 7bc210e test: fixing interactions between waitFor and ignore_spurious_events (envoyproxy#4309) 69474b3 admin: order stats in clusters json admin (envoyproxy#4306) 2d155f9 ppc64le build (envoyproxy#4183) 07efc6d fix static initialization fiasco problem (envoyproxy#4314) 0b7e3b5 test: Remove declared but undefined class methods (envoyproxy#4297) 1485a13 lua: make sure resetting dynamic metadata wrapper when request info is marked dead d243cd6 test: set to zero when start_time exceeds limit (envoyproxy#4328) 0a1e92a test: fix heap use-after-free in ~IntegrationTestServer. (envoyproxy#4319) cddc732 CONTRIBUTING: Document 'kick-ci' trick. (envoyproxy#4335) f13ef24 docs: remove reference to deprecated value field (envoyproxy#4322) e947a27 router: minor doc fixes in stream idle timeout (envoyproxy#4329) 0c2e998 tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (envoyproxy#4296) 00ffe44 utility: fix strftime overflow handling. (envoyproxy#4321) af1183c Re-enable TcpProxySslIntegrationTest and make the tests pass again. (envoyproxy#4318) 3553461 fuzz: fix H2 codec fuzzer post envoyproxy#4262. (envoyproxy#4311) 42f6048 Proto string issue fix (envoyproxy#4320) 9c492a0 Support Envoy to fetch secrets using SDS service. (envoyproxy#4256) a857219 ratelimit: revert `revert rate limit failure mode config` and add tests (envoyproxy#4303) 1d34172 dns: fix exception unsafe behavior in c-ares callbacks. (envoyproxy#4307) 1212423 alts: add gRPC TSI socket (envoyproxy#4153) f0363ae fuzz: detect client-side resets in H2 codec fuzzer. (envoyproxy#4300) 01aa3f8 test: hopefully deflaking echo integration test (envoyproxy#4304) 1fc0f4b ratelimit: link legacy proto when message is being used (envoyproxy#4308) aa4481e fix rare List::remove(&target) segfault (envoyproxy#4244) 89e0f23 headers: fixing fast fail of size-validation (envoyproxy#4269) 97eba59 build: bump googletest version. (envoyproxy#4293) 0057e22 fuzz: avoid false positives in HCM fuzzer. (envoyproxy#4262) 9d094e5 Revert ac0bd74 (envoyproxy#4295) ddb28a4 Add validation context provider (envoyproxy#4264) 3b47cba added histogram latency information to Hystrix dashboard stream (envoyproxy#3986) cf87d50 docs: update SNI FAQ. (envoyproxy#4285) f952033 config: fix update empty stat for eds (envoyproxy#4276) 329e591 router: Add ability of custom headers to rely on per-request data (envoyproxy#4219) 68d20b4 thrift: refactor build files and imports (envoyproxy#4271) 5fa8192 access_log: log requested_server_name in tcp proxy (envoyproxy#4144) fa45bb4 fuzz: libc++ clocks don't like nanos. (envoyproxy#4282) 53f8944 stats: add symbol table for future stat name encoding (envoyproxy#3927) c987b42 test infra: Remove timeSource() from the ClusterManager api (envoyproxy#4247) cd171d9 websocket: tunneling websockets (and upgrades in general) over H2 (envoyproxy#4188) b9dc5d9 router: disallow :path/host rewriting in request_headers_to_add. (envoyproxy#4220) 0c91011 network: skip socket options and source address for UDS client connections (envoyproxy#4252) da1857d build: fixing a downstream compile error by noting explicit fallthrough (envoyproxy#4265) 9857cfe fuzz: cleanup per-test environment after each fuzz case. (envoyproxy#4253) 52beb06 test: Wrap proto string in std::string before comparison (envoyproxy#4238) f5e219e extensions/thrift_proxy: Add header matching to thrift router (envoyproxy#4239) c9ce5d2 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (envoyproxy#4260) 35103b3 fuzz: use nanoseconds for SystemTime in RequestInfo. (envoyproxy#4255) ba6ba98 fuzz: make runtime root hermetic in server_fuzz_test. (envoyproxy#4258) b0a9014 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (envoyproxy#4248) 8567460 access_log: support beginning of epoch in START_TIME. (envoyproxy#4254) 28d5f41 proto: unify envoy_proto_library/api_proto_library. (envoyproxy#4233) f7d3cb6 http: fix allocation bug introduced in envoyproxy#4211. (envoyproxy#4245) Fixes istio/istio#8310 (once pulled into istio/istio). Signed-off-by: Piotr Sikora <[email protected]>
Signed-off-by: trabetti [email protected]
Description: Adding the latency information from existing histogram to Hystrix event stream that was added in #3425
Risk Level: Low
Testing:
Docs Changes: admin.rst
Release Notes: No addition to release note of original Hystrix stream PR, which is still pending for 1.8.0
Fixes #3753