Tarantool 3 "replacements" for Cartridge #491

DifferentialOrange · 2024-07-04T11:49:11Z

Tarantool Grafana dashboard has some unique Cartridge panels. This PR is a part of "make Grafana dashboard work with Tarantool 3 with minimal efforts" activity (tarantool/grafana-dashboard#224).

The state is as follows. Cartridge had the following metrics:

tnt_cartridge_issues
tnt_cartridge_failover_trigger_total
tnt_clock_delta (from membership)

After this patch, the metrics module will expose the following metrics for Tarantool 3:

tnt_config_alerts
tnt_config_status

How does one cover the other?

tnt_config_alerts is similar to tnt_cartridge_issues, but covers only configuration apply. It reports everything that went wrong while applying configuration to the instance, split into warnings and errors as well. It is a bit more detailed, since Cartridge could simply reject a wrong configuration during two-phase commit, while Tarantool 3 applies configuration asynchronously on each instance. tnt_config_status extends this information by covering non-initialized instances and in-between states.

On the other hand, config alerts do not cover other cluster issues, such as "replication is broken due to conflict" or "instance 1 cannot ping instance 2 due to network issues". AFAIK, Tarantool 3 does not have any built-in mechanism for that.

Also, Tarantool 3 (including the newest EE failover) does not have any failover counters, so there is no analogue of tnt_cartridge_failover_trigger_total. However, the existing tnt_read_only metric already provides a lot of information on this question and is already a part of the Grafana dashboard for both Cartridge and Tarantool 3. There doesn't seem to be any replacement for tnt_clock_delta either.
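To make the warnings/errors split more concrete, here is a minimal plain-Lua sketch of how alerts could be counted per level. It is not the code from this patch: the `sample_info` table only mimics the assumed shape of a Tarantool 3 `config:info()` result (a `status` field and an `alerts` list whose entries carry a `type`), and the `level` label name is an assumption as well.

```lua
-- Sketch only: count config alerts per level.
-- `info` is assumed to look like a Tarantool 3 config:info() result:
-- a table with an `alerts` list whose entries have a `type` field
-- equal to 'warn' or 'error'.
local function count_alerts(info)
    local counts = {warn = 0, error = 0}
    for _, alert in ipairs(info.alerts or {}) do
        -- Ignore alert types this sketch does not know about.
        if counts[alert.type] ~= nil then
            counts[alert.type] = counts[alert.type] + 1
        end
    end
    return counts
end

-- Hypothetical sample data, not real instance output.
local sample_info = {
    status = 'check_warnings',
    alerts = {
        {type = 'warn', message = 'example warning 1'},
        {type = 'warn', message = 'example warning 2'},
    },
}

local counts = count_alerts(sample_info)
-- Print the counts in a Prometheus-like text form (label name is assumed).
for _, level in ipairs({'warn', 'error'}) do
    print(('tnt_config_alerts{level="%s"} %d'):format(level, counts[level]))
end
```

Both levels are always reported, even when the count is zero, so a dashboard panel would not have to treat a missing series as "no alerts".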

TCM also has a "replacement" for Cartridge issues (see its source code). It covers two things: config alerts and vshard bootstrap success. We cover the first one. The second one relies on some additional assumptions and non-trivial logic, so I decided not to cover it here for now. It would be better to add a proper check to vshard or tarantool so that it can be exposed here, if required. So I think it's safe to say that here we cover almost the same things as TCM.

Visualization example:

Tests
Changelog
Documentation (README and rst)
Rockspec and rpm spec (not needed, they already build metrics/*)

Part of tarantool/grafana-dashboard#224

oleg-jukovec

It looks like we will need two releases for the grafana-dashboard: without these metrics and with them.

This is because we're currently aiming for Tarantool 3.1, and the metrics package will only be updated with Tarantool 3.2 in September.

doc/monitoring/metrics_reference.rst

metrics/utils.lua

test/tarantool/cpu_metrics_test.lua

DifferentialOrange · 2024-07-05T07:13:44Z

> It looks like we will need two releases for the grafana-dashboard: without these metrics and with them.

Yeah, that's the plan. We already have four dashboards: Tarantool for Prometheus, Tarantool for InfluxDB, TDG for Prometheus, TDG for InfluxDB. After https://github.com/tarantool/grafana-dashboard/, there will be six of them. Thanks to the jsonnet nature of grafana-dashboard, it's rather easy to build different dashboards (it's just a bit bothersome to support different release pages).

Since there is no full support of Tarantool 3 config instances in luatest yet (only treegen support in master), I have borrowed some test helpers from tarantool/crud [1].

1. https://github.com/tarantool/crud/blob/98b120ef7095fa34525ef9d335a1458a2edf0cca/test/tarantool3_helpers

Part of tarantool/grafana-dashboard#224

The approach used to represent the enum metric here is similar to the one discussed in [1, 2]. It allows supporting a new status later, if required. For example, one can visualize it with a hack like [3].

1. prometheus/client_python#416
2. open-telemetry/opentelemetry-specification#1711
3. https://stackoverflow.com/a/75761900/11646599

Part of tarantool/grafana-dashboard#224

DifferentialOrange · 2024-07-05T08:31:32Z

> and the metrics package will only be updated with Tarantool 3.2 in September

AFAIU, we will bump it for all supported 3.x series, so it will be built-in to 3.1.1 as well

oleg-jukovec

Thank you for the patch!

oleg-jukovec · 2024-07-05T08:48:06Z

> AFAIU, we will bump it for all supported 3.x series, so it will be built-in to 3.1.1 as well

To be honest, I don't know anything about plans for 3.1.1

DifferentialOrange · 2024-07-05T15:56:41Z

I have found one more Cartridge-exclusive metric: tnt_clock_delta. There doesn't seem to be any replacement for it in Tarantool 3 either.

(All other panels work fine, so I guess the list is complete.)

yngvar-antonsson · 2024-07-08T09:13:59Z

metrics/tarantool/config.lua

+        ready = 0,
+    }
+
+    config_status[config_info.status] = 1


Maybe it's better to choose a unique numerical value for each status instead of passing the status name to labels?

DifferentialOrange · 2024-07-08

Such an approach scales badly: if a new status is introduced, we will have two options:

- add a value in-between: very bad, it will break existing dashboards;
- add a value to the tail: bad unless it is one more "healthy" status (which is unlikely), since the healthy "ready" status will always be visualized as something between "reload_in_progress" and "new_not_healthy_status".

Such metrics are also impossible to read without documentation. An approach similar to the one in this patch is already discussed in [1, 2]. So I think the current label-based approach is a bit better, even though it's harder to visualize.

1. Enum: custom label for state prometheus/client_python#416
2. Does OpenTelemetry need "enum" metric type? open-telemetry/opentelemetry-specification#1711
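The label-based representation discussed in this thread can be sketched in plain Lua as follows. This is an illustration and not the code from metrics/tarantool/config.lua: the status list and the `status` label name are assumptions, and `status_to_series` is a hypothetical helper.

```lua
-- Sketch only: represent an enum status as one 0/1 series per status name.
-- The list below is an assumed set of config statuses; treat it as
-- illustrative rather than authoritative.
local KNOWN_STATUSES = {
    'uninitialized',
    'check_errors',
    'check_warnings',
    'startup_in_progress',
    'reload_in_progress',
    'ready',
}

-- Hypothetical helper: build a map "status name -> 0 or 1".
local function status_to_series(current_status)
    local series = {}
    for _, status in ipairs(KNOWN_STATUSES) do
        series[status] = 0
    end
    -- A status unknown to this sketch simply becomes one more series;
    -- the existing series keep their names and meaning.
    series[current_status] = 1
    return series
end

local series = status_to_series('reload_in_progress')
for _, status in ipairs(KNOWN_STATUSES) do
    print(('tnt_config_status{status="%s"} %d'):format(status, series[status]))
end
```

With a single numeric gauge, a new status would need a slot in the existing value ordering. With one series per status, it only adds a new label value, at the cost of a less convenient visualization.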

This patch renames the `cluster` and `replication` overview panel sections to `cluster_cartridge` and `replication_cartridge`, since they use Cartridge-specific metrics [1]. This is a breaking change for all custom-built dashboards that used them. It has no effect on our published dashboards.

1. tarantool/metrics#491

Part of #224

Bump metrics package submodule. Commits from PRs [1-4] affect Tarantool, the other ones are related to module infrastructure.

1. tarantool/metrics#482
2. tarantool/metrics#483
3. tarantool/metrics#484
4. tarantool/metrics#491

NO_DOC=doc is a part of submodule

DifferentialOrange marked this pull request as ready for review July 4, 2024 13:01

DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch from 9d5b5d8 to d42a27b Compare July 4, 2024 13:03

DifferentialOrange requested a review from oleg-jukovec July 4, 2024 13:03

oleg-jukovec reviewed Jul 4, 2024

doc/monitoring/metrics_reference.rst (outdated)

metrics/utils.lua (outdated)

test/tarantool/cpu_metrics_test.lua (outdated)

DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch 2 times, most recently from 6fc93ae to 75bc41b Compare July 5, 2024 08:26

DifferentialOrange added 3 commits July 5, 2024 11:27

test: skip flaky test for all runs

c1b3bce

DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch from 75bc41b to c1b3bce Compare July 5, 2024 08:27

oleg-jukovec approved these changes Jul 5, 2024

DifferentialOrange requested a review from yngvar-antonsson July 5, 2024 08:47

oleg-jukovec approved these changes Jul 7, 2024

yngvar-antonsson reviewed Jul 8, 2024

DifferentialOrange mentioned this pull request Jul 8, 2024

Tarantool 3 support: phase 1 tarantool/grafana-dashboard#228

Merged

yngvar-antonsson approved these changes Jul 9, 2024

DifferentialOrange merged commit e7e49a6 into master Jul 9, 2024
83 of 84 checks passed

DifferentialOrange deleted the DifferentialOrange/tarantool-3-cartridge-replacements branch July 9, 2024 09:42

DifferentialOrange mentioned this pull request Jul 10, 2024

lua: bump metrics module tarantool/tarantool#10220

Merged
