Tarantool 3 "replacements" for Cartridge #491
It looks like we will need two releases of grafana-dashboard, one without these metrics and one with them, because we're currently aiming for Tarantool 3.1 and the metrics package will only be updated with Tarantool 3.2 in September.
Yeah, that's the plan. We already have 4 dashboards: Tarantool for Prometheus, Tarantool for InfluxDB, TDG for Prometheus, TDG for InfluxDB. After https://github.com/tarantool/grafana-dashboard/, there will be six of them. Since grafana-dashboard is built with jsonnet, it's rather easy to build different dashboards (it's just a bit bothersome to support different release pages).
Since there is no full support of Tarantool 3 config instances in luatest yet (only treegen support in master), I have borrowed some test helpers from tarantool/crud [1].

1. https://github.com/tarantool/crud/blob/98b120ef7095fa34525ef9d335a1458a2edf0cca/test/tarantool3_helpers

Part of tarantool/grafana-dashboard#224
The approach used here to represent an enum metric is similar to the one discussed in [1, 2]. It makes it possible to support a new status later, if required. For example, one can visualize it with a hack like [3].

1. prometheus/client_python#416
2. open-telemetry/opentelemetry-specification#1711
3. https://stackoverflow.com/a/75761900/11646599

Part of tarantool/grafana-dashboard#224
AFAIU, we will bump it for all supported 3.x series, so it will be built into 3.1.1 as well.
Thank you for the patch!
To be honest, I don't know anything about plans for 3.1.1.
I have found one more Cartridge-exclusive metric: (All other panels work fine, so I guess the list is complete.)
    ready = 0,
}

config_status[config_info.status] = 1
Maybe it's better to choose a unique numerical value for each status instead of passing the status name to labels?
Such an approach scales badly: if a new status is introduced, we will have two options:
- add a value in between: very bad, it will break existing dashboards;
- add a value to the tail: bad unless the new status is one more "healthy" status (which is unlikely), since the healthy "ready" status will then always be visualized as something between "reload_in_progress" and "new_not_healthy_status".
Such metrics are also impossible to read without documentation. A similar approach is already discussed in [1, 2]. So I think the label-based approach is a bit better, even though it's harder to visualize.
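To make the comparison concrete, here is a minimal plain-Lua sketch of the label-based encoding: one sample per status, with 1 for the current status and 0 for the rest. The status list, the `status` label name, and the `encode_status` helper are assumptions for illustration; only `ready` and `reload_in_progress` are named in this thread, and the real patch may differ.

```lua
-- Assumed list of statuses; only `ready` and `reload_in_progress`
-- are mentioned in this thread, the rest are illustrative.
local KNOWN_STATUSES = {
    'uninitialized',
    'check_errors',
    'check_warnings',
    'startup_in_progress',
    'reload_in_progress',
    'ready',
}

-- Hypothetical helper: build one sample per known status,
-- 1 for the current status and 0 for every other one.
-- `current` stands in for `config_info.status` from the patch.
-- In this sketch an unknown status yields all zeros.
local function encode_status(current)
    local samples = {}
    for _, status in ipairs(KNOWN_STATUSES) do
        samples[status] = (status == current) and 1 or 0
    end
    return samples
end

-- Print the samples in a Prometheus-like text form.
-- The `status` label name is an assumption.
local samples = encode_status('ready')
for _, status in ipairs(KNOWN_STATUSES) do
    print(('tnt_config_status{status="%s"} %d'):format(status, samples[status]))
end
```

With this shape, supporting a new status only means appending one more name to the list. Existing series keep their meaning, which is not the case when every status is mapped to a number.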
This patch renames the `cluster` and `replication` overview panel sections to `cluster_cartridge` and `replication_cartridge`, since they use Cartridge-specific metrics [1]. This is a breaking change for all custom-built dashboards that used them. It has no effect on our published dashboards.

1. tarantool/metrics#491

Part of #224
Bump metrics package submodule. Commits from PRs [1-4] affect Tarantool, the other ones are related to module infrastructure.

1. tarantool/metrics#482
2. tarantool/metrics#483
3. tarantool/metrics#484
4. tarantool/metrics#491

NO_DOC=doc is a part of submodule
Tarantool Grafana dashboard has some unique Cartridge panels. This PR is a part of the "make Grafana dashboard work with Tarantool 3 with minimal effort" activity (tarantool/grafana-dashboard#224).

The state is as follows. Cartridge had the following metrics:

After this patch, `metrics` will have the following metrics for Tarantool 3:

How does one cover the other?

`tnt_config_alerts` is similar to `tnt_cartridge_issues`, but only for configuration apply. It covers everything that went wrong while applying configuration to the instance, separated into warnings and errors as well. It is a bit more detailed, since Cartridge could just reject a wrong configuration on two-phase commit, while Tarantool 3 has asynchronous per-instance apply. `tnt_config_status` extends this info, covering non-initialized instances and in-between states.

On the other hand, config alerts do not cover other cluster issues, like "replication is broken due to conflict" and "instance 1 cannot ping instance 2 due to network issues". AFAIK, Tarantool 3 does not have any built-in mechanism for it. Also, Tarantool 3 (including the newest EE failover) does not have any failover counters, so there are no analogues to `tnt_cartridge_failover_trigger_total` (yet the existing `tnt_read_only` already provides a lot of information related to this question and is also a part of the Grafana dashboard for both Cartridge and Tarantool 3). There doesn't seem to be any replacement for `tnt_clock_delta` either.

TCM also has some "replacement" for Cartridge issues, see source code. It covers two things: config alerts and vshard bootstrap success. We cover the first one. The second one has some additional assumptions and non-trivial logic, so I decided not to cover it here for now. It would be better if a proper check were added to `vshard` or `tarantool` so it can be exposed here, if required. So I think it's safe to say that here we cover almost the same things that are covered by TCM.

Visualization example:
Part of tarantool/grafana-dashboard#224
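For illustration, the "separated into warnings and errors" part of `tnt_config_alerts` can be sketched in plain Lua as a per-level count over a list of alerts. This is only a sketch under assumptions: the `type` field with `warn`/`error` values, the `level` label name, the sample alerts, and the `count_alerts` helper are all hypothetical and may not match the actual patch.

```lua
-- Hypothetical helper: count alerts per level.
-- The `type` field and its `warn`/`error` values are assumptions.
local function count_alerts(alerts)
    local counts = {warn = 0, error = 0}
    for _, alert in ipairs(alerts) do
        -- Ignore alert types this sketch does not know about.
        if counts[alert.type] ~= nil then
            counts[alert.type] = counts[alert.type] + 1
        end
    end
    return counts
end

-- Made-up sample data, not taken from a real instance.
local sample_alerts = {
    {type = 'warn', message = 'hypothetical warning'},
    {type = 'error', message = 'hypothetical error'},
    {type = 'warn', message = 'another hypothetical warning'},
}

-- The `level` label name is an assumption.
local counts = count_alerts(sample_alerts)
print(('tnt_config_alerts{level="warn"} %d'):format(counts.warn))
print(('tnt_config_alerts{level="error"} %d'):format(counts.error))
```

Both levels are always reported, even when the count is zero, so a dashboard panel can tell "no alerts" apart from "no data".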