Write dashboard: allow using cortex_request_duration_seconds native histogram #8757

duricanikolic · 2024-07-17T18:54:02Z

What this PR does

Allow using the `cortex_request_duration_seconds` native histogram
everywhere in the Writes dashboard.

Which issue(s) this PR fixes or relates to

Related to #7154
Depends on grafana/jsonnet-libs#1285

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

Signed-off-by: Yuri Nikolic <[email protected]>

krajorama · 2024-07-18T06:07:46Z

Generated the relevant diff with `_config.gateway_enabled: true`:

***************
*** 468,474 ****
                    "steppedLine": false,
                    "targets": [
                       {
!                         "expr": "sum(rate(cortex_request_duration_seconds_count{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]))",
                          "format": "time_series",
                          "instant": true,
                          "refId": "A"
--- 468,480 ----
                    "steppedLine": false,
                    "targets": [
                       {
!                         "expr": "sum (rate(cortex_request_duration_seconds_count{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval])) < ($latency_metrics * +Inf)",
!                         "format": "time_series",
!                         "instant": true,
!                         "refId": "A_classic"
!                      },
!                      {
!                         "expr": "sum (histogram_count(rate(cortex_request_duration_seconds{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]))) < ($latency_metrics * -Inf)",
                          "format": "time_series",
                          "instant": true,
                          "refId": "A"
***************
*** 697,703 ****
                    "span": 4,
                    "targets": [
                       {
!                         "expr": "sum by (status) (\n  label_replace(label_replace(rate(cortex_request_duration_seconds_count{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]),\n  \"status\", \"${1}xx\", \"status_code\", \"([0-9])..\"),\n  \"status\", \"${1}\", \"status_code\", \"([a-zA-Z]+)\"))\n",
                          "format": "time_series",
                          "legendFormat": "{{status}}",
                          "refId": "A"
--- 703,715 ----
                    "span": 4,
                    "targets": [
                       {
!                         "expr": "sum by (status) (\n  label_replace(label_replace(rate(cortex_request_duration_seconds_count{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]),\n  \"status\", \"${1}xx\", \"status_code\", \"([0-9])..\"),\n  \"status\", \"${1}\", \"status_code\", \"([a-zA-Z]+)\"))\n < ($latency_metrics * +Inf)",
!                         "format": "time_series",
!                         "legendFormat": "{{status}}",
!                         "refId": "A_classic"
!                      },
!                      {
!                         "expr": "sum by (status) (\n  label_replace(label_replace(histogram_count(rate(cortex_request_duration_seconds{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval])),\n  \"status\", \"${1}xx\", \"status_code\", \"([0-9])..\"),\n  \"status\", \"${1}\", \"status_code\", \"([a-zA-Z]+)\"))\n < ($latency_metrics * -Inf)",
                          "format": "time_series",
                          "legendFormat": "{{status}}",
                          "refId": "A"
***************
*** 746,767 ****
                    "span": 4,
                    "targets": [
                       {
!                         "expr": "histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3",
                          "format": "time_series",
                          "legendFormat": "99th percentile",
!                         "refId": "A"
                       },
                       {
!                         "expr": "histogram_quantile(0.50, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3",
                          "format": "time_series",
                          "legendFormat": "50th percentile",
!                         "refId": "B"
                       },
                       {
!                         "expr": "1e3 * sum(cluster_job_route:cortex_request_duration_seconds_sum:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}) / sum(cluster_job_route:cortex_request_duration_seconds_count:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})",
                          "format": "time_series",
                          "legendFormat": "Average",
!                         "refId": "C"
                       }
                    ],
                    "title": "Latency",
--- 758,797 ----
                    "span": 4,
                    "targets": [
                       {
!                         "expr": "histogram_quantile(0.99, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3 < ($latency_metrics * +Inf)",
                          "format": "time_series",
                          "legendFormat": "99th percentile",
!                         "refId": "A_classic"
!                      },
!                      {
!                         "expr": "histogram_quantile(0.99, sum (cluster_job_route:cortex_request_duration_seconds:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3 < ($latency_metrics * -Inf)",
!                         "format": "time_series",
!                         "legendFormat": "99th percentile",
!                         "refId": "A_native"
                       },
                       {
!                         "expr": "histogram_quantile(0.50, sum by (le) (cluster_job_route:cortex_request_duration_seconds_bucket:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3 < ($latency_metrics * +Inf)",
                          "format": "time_series",
                          "legendFormat": "50th percentile",
!                         "refId": "B_classic"
                       },
                       {
!                         "expr": "histogram_quantile(0.50, sum (cluster_job_route:cortex_request_duration_seconds:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) * 1e3 < ($latency_metrics * -Inf)",
!                         "format": "time_series",
!                         "legendFormat": "50th percentile",
!                         "refId": "B_native"
!                      },
!                      {
!                         "expr": "1e3 * sum(cluster_job_route:cortex_request_duration_seconds_sum:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}) /\nsum(cluster_job_route:cortex_request_duration_seconds_count:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})\n < ($latency_metrics * +Inf)",
                          "format": "time_series",
                          "legendFormat": "Average",
!                         "refId": "C_classic"
!                      },
!                      {
!                         "expr": "1e3 * sum(histogram_sum(cluster_job_route:cortex_request_duration_seconds:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"})) /\nsum(histogram_count(cluster_job_route:cortex_request_duration_seconds:sum_rate{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}))\n < ($latency_metrics * -Inf)",
!                         "format": "time_series",
!                         "legendFormat": "Average",
!                         "refId": "C_native"
                       }
                    ],
                    "title": "Latency",
***************
*** 808,814 ****
                    "targets": [
                       {
                          "exemplar": true,
!                         "expr": "histogram_quantile(0.99, sum by(le, pod) (rate(cortex_request_duration_seconds_bucket{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval])))",
                          "format": "time_series",
                          "legendFormat": "",
                          "legendLink": null
--- 838,851 ----
                    "targets": [
                       {
                          "exemplar": true,
!                         "expr": "histogram_quantile(0.99, sum by (le,pod) (rate(cortex_request_duration_seconds_bucket{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]))) < ($latency_metrics * +Inf)",
!                         "format": "time_series",
!                         "legendFormat": "",
!                         "legendLink": null
!                      },
!                      {
!                         "exemplar": true,
!                         "expr": "histogram_quantile(0.99, sum by (pod) (rate(cortex_request_duration_seconds{cluster=~\"$cluster\", job=~\"($namespace)/((gateway|cortex-gw.*))\", route=~\"api_(v1|prom)_push|otlp_v1_metrics\"}[$__rate_interval]))) < ($latency_metrics * -Inf)",
                          "format": "time_series",
                          "legendFormat": "",
                          "legendLink": null

The diff looks OK.
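
The `< ($latency_metrics * +Inf)` and `< ($latency_metrics * -Inf)` suffixes in the diff act as an on/off switch for each target. Below is a minimal Python sketch of that arithmetic. It assumes the `$latency_metrics` dashboard variable evaluates to `1` for classic histograms and `-1` for native histograms (the diff does not show the variable's values, so this is an assumption), and `target_visible` is a hypothetical helper, not part of the mixin.

```python
import math


def target_visible(value: float, latency_metrics: int, native: bool) -> bool:
    """Mimic the PromQL filter `value < ($latency_metrics * +/-Inf)`.

    Assumption: latency_metrics is 1 (classic) or -1 (native).
    Classic targets compare against +Inf, native targets against -Inf.
    """
    bound = latency_metrics * (-math.inf if native else math.inf)
    return value < bound


# Classic setting: only the classic target keeps its samples.
classic_mode = (
    target_visible(0.25, 1, native=False),
    target_visible(0.25, 1, native=True),
)
# Native setting: only the native target keeps its samples.
native_mode = (
    target_visible(0.25, -1, native=False),
    target_visible(0.25, -1, native=True),
)
print(classic_mode, native_mode)  # (True, False) (False, True)
```

Under that assumption, any finite sample passes `< +Inf` and fails `< -Inf`, so flipping the sign of the variable swaps which of the two targets returns data.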

krajorama

LGTM, I don't see classic-only mentions of the metric.
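
One way to double-check for classic-only mentions is to scan the generated dashboard JSON for targets that reference the metric without the `$latency_metrics` guard. The sketch below is illustrative only: `unguarded_exprs` is a hypothetical helper, the embedded dashboard is a tiny stand-in rather than the real file, and the check assumes that every converted target contains the literal string `$latency_metrics`, as in the diff above.

```python
import json

METRIC = "cortex_request_duration_seconds"
GUARD = "$latency_metrics"


def unguarded_exprs(node):
    """Recursively collect "expr" strings that mention METRIC without GUARD."""
    found = []
    if isinstance(node, dict):
        expr = node.get("expr")
        if isinstance(expr, str) and METRIC in expr and GUARD not in expr:
            found.append(expr)
        for child in node.values():
            found.extend(unguarded_exprs(child))
    elif isinstance(node, list):
        for child in node:
            found.extend(unguarded_exprs(child))
    return found


# Tiny stand-in for a generated dashboard (not the real file).
dashboard = json.loads("""
{"rows": [{"panels": [{"targets": [
  {"expr": "sum(rate(cortex_request_duration_seconds_count[5m]))", "refId": "A"},
  {"expr": "sum(rate(cortex_request_duration_seconds_count[5m])) < ($latency_metrics * +Inf)", "refId": "A_classic"}
]}]}]}
""")

leftovers = unguarded_exprs(dashboard)
print(leftovers)
```

An empty result for a real dashboard file would support the claim that no classic-only query remains, although it only checks for the guard string and does not validate the PromQL itself.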

krajorama · 2024-07-18T06:09:01Z

CHANGELOG.md

-  * Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric.
+* [ENHANCEMENT] Dashboards: allow switching between using classic or native histograms in dashboards.
+  * Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric. #7674 #8502
+  * Write dashboard: `cortex_request_duration_seconds` metric. #8757


nit:

Suggested change

-  * Write dashboard: `cortex_request_duration_seconds` metric. #8757
+  * Writes dashboard: `cortex_request_duration_seconds` metric. #8757

krajorama · 2024-07-18T06:09:10Z

operations/helm/charts/mimir-distributed/CHANGELOG.md

-  * Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric.
+* [ENHANCEMENT] Dashboards: allow switching between using classic or native histograms in dashboards.
+  * Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric. #7674
+  * Write dashboard: `cortex_request_duration_seconds` metric. #8757


Suggested change

-  * Write dashboard: `cortex_request_duration_seconds` metric. #8757
+  * Writes dashboard: `cortex_request_duration_seconds` metric. #8757

duricanikolic added 2 commits July 17, 2024 19:12

Write dashboard: qps and latency w/ cortex_request_duration_seconds

89a81c1

Signed-off-by: Yuri Nikolic <[email protected]>

Fix the instance label

356265b

duricanikolic self-assigned this Jul 17, 2024

duricanikolic changed the title ~~Yuri/native hist write dashboard~~ Write dashboard: allow using cortex_request_duration_seconds native histogram Jul 17, 2024

duricanikolic force-pushed the yuri/native-hist-write-dashboard branch from 3715e6e to 49b9e52 Compare July 17, 2024 18:57

Distributor and Ingester panels

e0e37d6

Signed-off-by: Yuri Nikolic <[email protected]>

duricanikolic force-pushed the yuri/native-hist-write-dashboard branch from 49b9e52 to e0e37d6 Compare July 17, 2024 19:02

duricanikolic requested a review from krajorama July 17, 2024 19:08

duricanikolic marked this pull request as ready for review July 17, 2024 20:07

duricanikolic requested a review from a team as a code owner July 17, 2024 20:07

krajorama approved these changes Jul 18, 2024

duricanikolic added 2 commits July 18, 2024 10:47

Fix review findings

06cc801

Signed-off-by: Yuri Nikolic <[email protected]>

Merge remote-tracking branch 'origin/main' into yuri/native-hist-writ…

067651e

…e-dashboard

duricanikolic merged commit e10027d into main Jul 18, 2024
31 checks passed

duricanikolic deleted the yuri/native-hist-write-dashboard branch July 18, 2024 09:32
