*: Use exponential buckets for histogram metrics #1545
394e1b2
to
2a2826a
Compare
@kakkoyun how is this PR going?
@GiedriusS I had to park this one for a while. But I haven't abandoned it, I'll have another look at it soon. I have also discovered similar issues with Store GW histograms, I may include those improvements in this PR as well.
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
536109e
to
a3568a5
Compare
Signed-off-by: Kemal Akkoyun <[email protected]>
a3568a5
to
1855ace
Compare
- grpc_prometheus.WithHistogramBuckets([]float64{
-     0.001, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4,
- }),
+ grpc_prometheus.WithHistogramBuckets(prometheus.ExponentialBuckets(0.001, 2, 15)),
Before:
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.01"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.05"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.1"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.8"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.6"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="3.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="6.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0
After:
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.002"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.004"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.008"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.016"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.032"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.064"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.128"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.256"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.512"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.024"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="2.048"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="4.096"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="8.192"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="16.384"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0
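The new boundaries follow directly from the three arguments in the diff: start 0.001, factor 2, count 15. The sketch below re-implements that arithmetic with only the standard library so the upper bounds can be checked without a running component. The helper name `exponentialBuckets` is ours; it is a stand-in for `prometheus.ExponentialBuckets`, not the library's source.

```go
package main

import "fmt"

// exponentialBuckets is a stand-in for prometheus.ExponentialBuckets:
// it returns count upper bounds, starting at start and multiplying by
// factor at each step. The implicit +Inf bucket is not included.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	for v := start; len(buckets) < count; v *= factor {
		buckets = append(buckets, v)
	}
	return buckets
}

func main() {
	// Print the finite upper bounds used for the gRPC histogram in this PR.
	for _, le := range exponentialBuckets(0.001, 2, 15) {
		fmt.Printf("le=%q\n", fmt.Sprint(le))
	}
}
```

The last finite bound is 0.001 × 2^14 = 16.384, which matches the highest finite `le` in the After listing.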
An example distribution for the existing buckets, from a real-life system:
sum(grpc_server_handling_seconds_bucket{job=~"thanos-store.*", grpc_type="server_stream"}) by (le)
{le="6.4"} | 158
{le="0.05"} | 2
{le="0.1"} | 5
{le="0.2"} | 13
{le="0.4"} | 34
{le="0.8"} | 62
{le="+Inf"} | 187
{le="0.001"} | 0
{le="0.01"} | 0
{le="1.6"} | 103
{le="3.2"} | 133
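Because `le` buckets are cumulative and the query result is unsorted, the shape of the distribution is easier to see after sorting by upper bound and subtracting neighbouring counts. The following is a minimal sketch of that step using the numbers quoted above; the type and function names (`bucket`, `perBucket`, `storeSeries`) are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket pairs an upper bound (le) with its cumulative count.
type bucket struct {
	le    float64
	count float64
}

// perBucket sorts cumulative buckets by upper bound and returns the
// number of observations that fall into each individual bucket.
func perBucket(cumulative []bucket) []bucket {
	sorted := append([]bucket(nil), cumulative...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].le < sorted[j].le })
	out := make([]bucket, len(sorted))
	prev := 0.0
	for i, b := range sorted {
		out[i] = bucket{le: b.le, count: b.count - prev}
		prev = b.count
	}
	return out
}

// storeSeries returns the cumulative counts from the query result above,
// in the same (unsorted) order.
func storeSeries() []bucket {
	return []bucket{
		{6.4, 158}, {0.05, 2}, {0.1, 5}, {0.2, 13}, {0.4, 34}, {0.8, 62},
		{math.Inf(1), 187}, {0.001, 0}, {0.01, 0}, {1.6, 103}, {3.2, 133},
	}
}

func main() {
	for _, b := range perBucket(storeSeries()) {
		fmt.Printf("le=%v\t%v\n", b.le, b.count)
	}
}
```

With these numbers, 29 of the 187 observations (about 16%) took longer than 6.4 s and land in the `+Inf` bucket, while none completed in under 0.01 s.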
pkg/store/gate.go
},
Name: "gate_duration_seconds",
Help: "How many seconds it took for queries to wait at the gate.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
An example distribution for the existing buckets, from a real-life system:
sum(thanos_bucket_store_series_gate_duration_seconds_bucket{job="thanos-store"}) by (le)
{le="10"} | 0
{le="5"} | 0
{le="+Inf"} | 187
{le="0.6"} | 0
{le="1"} | 0
{le="0.25"} | 0
{le="2"} | 0
{le="3.5"} | 0
{le="0.01"} | 0
{le="0.05"} | 0
{le="0.1"} | 0
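Here every observation sits in the `+Inf` bucket, so any quantile computed from these buckets is pinned to the highest finite bound. The sketch below illustrates why, using a simplified linear-interpolation estimator. It approximates the idea behind bucket-based quantile estimation (to our knowledge Prometheus' `histogram_quantile` behaves similarly when the rank falls in the highest bucket) and is not a copy of any real implementation; all names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// bucket pairs an upper bound (le) with its cumulative count.
type bucket struct {
	le    float64
	count float64
}

// estimateQuantile finds the bucket containing the target rank and
// interpolates linearly inside it. buckets must be sorted by le and end
// with +Inf. If the rank falls in the +Inf bucket, the highest finite
// bound is returned, because nothing more precise is known.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	rank := q * buckets[len(buckets)-1].count
	lower, prev := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank && b.count > prev {
			if math.IsInf(b.le, 1) {
				return lower
			}
			return lower + (b.le-lower)*(rank-prev)/(b.count-prev)
		}
		lower, prev = b.le, b.count
	}
	return math.NaN() // no observations at all
}

// gateBuckets returns the gate duration counts quoted above, sorted by le.
func gateBuckets() []bucket {
	return []bucket{
		{0.01, 0}, {0.05, 0}, {0.1, 0}, {0.25, 0}, {0.6, 0}, {1, 0},
		{2, 0}, {3.5, 0}, {5, 0}, {10, 0}, {math.Inf(1), 187},
	}
}

func main() {
	fmt.Println("p50:", estimateQuantile(0.5, gateBuckets()))
	fmt.Println("p99:", estimateQuantile(0.99, gateBuckets()))
}
```

Both calls return 10 for this data: the estimate says only that waits exceeded the top bucket, not by how much.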
Signed-off-by: Kemal Akkoyun <[email protected]>
bb49247
to
3aac86e
Compare
Signed-off-by: Kemal Akkoyun <[email protected]>
I’m expecting that we will need even higher buckets, but this is better than what we have and will clarify the need for more, so lgtm.
I think the higher buckets have to depend on the query timeout, so we probably need higher ones, but do we need so many lower-level buckets? Do we really care whether a request takes 0.001 (seconds!) or 0.128 seconds? :thinking_face:
I'm happy to re-address all the issues after we know more about the distribution. What we have does not provide much; I can do another iteration to tune them.
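The trade-off discussed here (top bucket tied to the query timeout, fewer low buckets) can be explored with a little arithmetic. The sketch below counts how many exponential buckets are needed to reach a given upper limit. The helper `bucketsToCover` and the 120 s timeout are assumptions for illustration only; neither comes from this PR.

```go
package main

import "fmt"

// bucketsToCover returns how many exponential buckets are needed so that
// the largest finite upper bound is at least limit. It returns 0 for
// arguments that would never reach the limit.
func bucketsToCover(start, factor, limit float64) int {
	if start <= 0 || factor <= 1 {
		return 0
	}
	count := 1
	for v := start; v < limit; v *= factor {
		count++
	}
	return count
}

func main() {
	// 120 s is a hypothetical query timeout, used only for illustration.
	fmt.Println("start 0.001:", bucketsToCover(0.001, 2, 120))
	// Starting at 0.128 s instead drops the sub-100ms buckets.
	fmt.Println("start 0.128:", bucketsToCover(0.128, 2, 120))
}
```

Under these assumptions, covering 120 s from 0.001 s takes 18 buckets, while starting at 0.128 s takes 11, which is one way to quantify the cost of the low-end resolution questioned above.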
* Use exponential buckets for compactor histogram metrics
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Update buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust histogram buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust store gate bucket
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust http duration buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: suntianyuan <[email protected]>
* Use exponential buckets for compactor histogram metrics
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Update buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust histogram buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust store gate bucket
  Signed-off-by: Kemal Akkoyun <[email protected]>
* Adjust http duration buckets
  Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Aleksey Sin <[email protected]>
This PR changes the existing bucket configurations to fix issues that were observed in latency graphs.
For example, there are large differences between the mean and P50 latencies for the following metrics:
thanos_compact_garbage_collection_duration_seconds_bucket
thanos_compact_sync_meta_duration_seconds_bucket
This increases the number of buckets for most of the histograms. For certain metrics, it significantly affects cardinality. However, it's needed to properly instrument the components.
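To put a rough number on the cardinality impact: a classic Prometheus histogram exposes one series per finite bucket, one for `+Inf`, plus `_sum` and `_count`, for every label combination. The sketch below applies that to a change from 10 to 15 buckets, as in the gRPC histogram. The label-set count of 20 is a made-up figure for illustration, and the function names are ours.

```go
package main

import "fmt"

// seriesPerLabelSet returns the number of time series a classic Prometheus
// histogram exposes for one label combination: one per finite bucket,
// one for the +Inf bucket, plus the _sum and _count series.
func seriesPerLabelSet(finiteBuckets int) int {
	return finiteBuckets + 3
}

// totalSeries multiplies that by the number of distinct label combinations.
func totalSeries(finiteBuckets, labelSets int) int {
	return seriesPerLabelSet(finiteBuckets) * labelSets
}

func main() {
	const labelSets = 20 // hypothetical number of label combinations
	fmt.Println(totalSeries(10, labelSets), "->", totalSeries(15, labelSets))
}
```

Under that assumption the histogram grows from 260 to 360 series, so the increase scales with the number of label combinations rather than with the bucket count alone.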
Changes
Uses exponential buckets to provide a more even distribution (number of buckets, before -> after):
- grpc_server_handling_seconds_bucket: 10 -> 15 (+exposes multiple labels)
- http_request_duration_seconds_bucket: 11 -> 17 (+exposes 3 labels: code, method, handler)
- thanos_compact_sync_meta_duration_seconds_bucket: 14 -> 15
- thanos_compact_garbage_collection_duration_seconds_bucket: 14 -> 15
- thanos_objstore_bucket_operation_duration_seconds_bucket: 15 -> 17
- thanos_bucket_store_series_get_all_duration_seconds_bucket: 14 -> 15
- thanos_bucket_store_series_gate_duration_seconds_bucket: 14 -> 15
- thanos_bucket_store_series_merge_duration_seconds_bucket: 10 -> 15
Verification
- make test
- MINIO_ENABLED=1 ./scripts/quickstart.sh and curl to /metrics.