Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Allow synapse_http_server_response_time_seconds Grafana histogram quantiles to show values bigger than 10s #13478

Closed
wants to merge 6 commits into from

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Aug 9, 2022

Allow synapse_http_server_response_time_seconds Grafana histogram quantiles to show values bigger than 10s

If you use the default histogram buckets, or guess poorly (likely) the first time around, you will probably see a straight line at 10 (or your highest bucket boundary) when you calculate quantiles on this histogram. This is due to the fact that your quantiles or most of your observations are found in the +Inf bucket. No linear approximation saves you here, the only “sane” thing to do is return the lower known bound.

-- https://linuxczar.net/blog/2017/06/15/prometheus-histogram-2/

By default, these buckets are: .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10. This is very much tuned to measuring request durations below 10 seconds, so if you’re measuring something else you may need to configure custom buckets.

-- https://tomgregory.com/the-four-types-of-prometheus-metrics/

you have to pre-choose your buckets, and the costs moves from the client to Prometheus itself due to bucket cardinality. The default ten buckets cover a typical web service with latency in the millisecond to second range, and on occasion you will want to adjust them.

-- https://www.robustperception.io/how-does-a-prometheus-histogram-work/

Part of #13356

Before

Purple line >99% percentile has false max ceiling of 10s because the values don't go above 10.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152

After

I don't know if this actually fixes it (haven't tested).

Dev notes

Docs:

https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544

DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)

  • synapse_http_server_response_time_seconds_bucket
  • synapse_http_server_response_time_seconds_sum
  • synapse_http_server_response_time_seconds_count

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

Comment on lines 47 to 60
0.005,
0.01,
0.025,
0.05,
0.075,
0.1,
0.25,
0.5,
0.75,
1.0,
2.5,
5.0,
7.5,
10.0,
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section matches the default buckets: https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544 (0.005 - 10.0)

I chose the default as a base because that is what it was using before. Do we want to tune these or eliminate any to reduce cardinality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding ~30% more buckets seems like a step in the wrong direction for #11082.

I've just noticed this comment. Perhaps we could drop the 0.075, 0.75 and 7.5 metrics? Then the remaining ones would be separated by roughly factors of two.

We'd still be growing the number of buckets by 2 in that case though. If we wanted to avoid growing the cardinality we'd have to pick 2 more to drop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could drop 200.0 as well since above 180 probably hit the timeout

@MadLittleMods MadLittleMods added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. A-Metrics metrics, measures, stuff we put in Prometheus labels Aug 9, 2022
@MadLittleMods MadLittleMods changed the title Allow Grafana histogram quantiles to show values bigger than 10s Allow synapse_http_server_response_time_seconds Grafana histogram quantiles to show values bigger than 10s Aug 9, 2022
@MadLittleMods MadLittleMods marked this pull request as ready for review August 9, 2022 16:40
@MadLittleMods MadLittleMods requested a review from a team as a code owner August 9, 2022 16:40
@squahtx
Copy link
Contributor

squahtx commented Aug 11, 2022

I'm not sure if we really want this. There have been complaints in the past that the synapse_http_server_response_time collection of metrics is too big and causes performance issues with Prometheus:

Adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?

@@ -43,6 +43,28 @@
"synapse_http_server_response_time_seconds",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular insight we're hoping to gain by raising the 10 second cap?

I'm trying to optimize the slow /messages requests, #13356, specifically those that take more than 10s.

In order to track progress there, I'd like the metrics to capture them.

Copy link
Contributor

@squahtx squahtx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

We only add one extra bucket overall, which isn't too bad.

@erikjohnston
Copy link
Member

Hang on two secs, I'm a bit concerned by removing the 75 buckets in terms of losing the definition in the common cases

@erikjohnston
Copy link
Member

Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insights into response time. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms means we can't easily differentiate.

Since this a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the /messages API. This would also allow us to differentiate between local and remote /messages requests for example.

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.f. comment about losing fidelity

120.0,
180.0,
"+Inf",
),
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insights into response time. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms means we can't easily differentiate.

Since this a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the /messages API. This would also allow us to differentiate between local and remote /messages requests for example.

-- @erikjohnston, #13478 (comment)

I can create a separate PR to add a specific metric for /messages -> #13533

But do we have any interest in adjusting the buckets for the general case? @erikjohnston mentioned if anything maybe wanting even more fidelity in the lower ranges. @richvdh do you have any interest in increasing for another endpoint? Our limiting factor is cardinality since this multiplies out to all of our servlets.

I think @MadLittleMods is right in that the top bucket should be more than 10s given how often some of our endpoints take longer than that

-- @richvdh, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$CLJ5oioD_DO1A_zSGmYtCd-yToSyA6EiOwOsClvfdcs?via=matrix.org&via=element.io&via=beeper.com

https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1660574475300&to=1660596075300&viewPanel=47

Slow servlets shown in grafana

Slow KeyChangesServlets shown in grafana

Slow FederationStateIdsServlet shown in grafana


In terms of reducing cardinality, we could remove code. I think for timing, we really just need the method and servlet name. Response code can be useful but maybe we just need to change it to a successful_response boolean (with a cardinality of 2, [true|false]) since we only ever use it as code=~"2..". Or maybe more useful as error_response: true/false so that success or timeout can still be false while an actual error would be true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richvdh do you have any interest in increasing for another endpoint?

I'm not sure really. I guess? I don't feel like it's a huge priority to me right now.

In terms of reducing cardinality, we could remove code.

Probably, yes. but that's a conversation we should have over at #11082.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Metrics metrics, measures, stuff we put in Prometheus T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants