Tarantool 3 "replacements" for Cartridge #491

DifferentialOrange
Member

@DifferentialOrange DifferentialOrange commented Jul 4, 2024

The Tarantool Grafana dashboard has some Cartridge-specific panels. This PR is a part of the "make the Grafana dashboard work with Tarantool 3 with minimal effort" activity (tarantool/grafana-dashboard#224).

The current state is as follows. Cartridge provided the following metrics:

  • tnt_cartridge_issues
  • tnt_cartridge_failover_trigger_total
  • tnt_clock_delta (from membership)

After this patch, the metrics module will provide the following metrics for Tarantool 3:

  • tnt_config_alerts
  • tnt_config_status

How does one cover the other? tnt_config_alerts is similar to tnt_cartridge_issues, but only for configuration apply. It covers everything that went wrong while applying the configuration to the instance, split into warnings and errors as well. It is a bit more detailed, since Cartridge could simply reject a wrong configuration during two-phase commit, while Tarantool 3 applies configuration asynchronously per instance. tnt_config_status extends this information by covering non-initialized instances and in-between states.

On the other hand, config alerts do not cover other cluster issues, like "replication is broken due to conflict" and "instance 1 cannot ping instance 2 due to network issues". AFAIK, Tarantool 3 does not have any built-in mechanism for that. Also, Tarantool 3 (including the newest EE failover) does not have any failover counters, so there is no analogue of tnt_cartridge_failover_trigger_total (yet the existing tnt_read_only already provides a lot of information on this question and is also a part of the Grafana dashboard for both Cartridge and Tarantool 3). There doesn't seem to be any replacement for tnt_clock_delta either.
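
To illustrate how alerts split into warnings and errors could turn into gauge values, here is a minimal Python sketch (it is not the Lua code of this patch). The alert shape (a `type` field holding `warn` or `error`), the `level` label name and the helper `count_alerts` are assumptions made for illustration only.

```python
from collections import Counter

# Assumed alert levels; the real set is defined by Tarantool, not by this sketch.
LEVELS = ("warn", "error")


def count_alerts(alerts):
    """Return the number of alerts per level.

    Every known level is always present in the result, so a level with
    no alerts is reported as 0 instead of disappearing.
    """
    counts = Counter(alert["type"] for alert in alerts)
    return {level: counts[level] for level in LEVELS}


# Hypothetical alerts, shaped like the assumption described above.
sample_alerts = [
    {"type": "warn", "message": "example warning"},
    {"type": "error", "message": "example error"},
    {"type": "warn", "message": "another warning"},
]

for level, value in count_alerts(sample_alerts).items():
    # Printed in a Prometheus-like text form purely for readability.
    print(f'tnt_config_alerts{{level="{level}"}} {value}')
```

Reporting an explicit 0 for a level without alerts keeps the series continuous, which tends to make "no errors" easier to tell apart from "no data" on a panel.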

TCM also has some "replacement" for Cartridge issues, see source code. It covers two things: config alerts and vshard bootstrap success. We cover the first one. The second one has some additional assumptions and non-trivial logic, so I decided not to cover it here for now. It would be better if a proper check were added to vshard or tarantool so it could be exposed here, if required. So I think it's safe to say that here we cover almost the same things that TCM covers.

Visualization example:
image

  • Tests
  • Changelog
  • Documentation (README and rst)
  • Rockspec and rpm spec (not needed, they already build metrics/*)

Part of tarantool/grafana-dashboard#224

@DifferentialOrange DifferentialOrange marked this pull request as ready for review July 4, 2024 13:01
@DifferentialOrange DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch from 9d5b5d8 to d42a27b Compare July 4, 2024 13:03
@oleg-jukovec oleg-jukovec left a comment

It looks like we will need two releases for the grafana-dashboard: without these metrics and with them.

This is because we're currently aiming for Tarantool 3.1, and the metrics package will only be updated with Tarantool 3.2 in September.

doc/monitoring/metrics_reference.rst (outdated, resolved)
metrics/utils.lua (outdated, resolved)
test/tarantool/cpu_metrics_test.lua (outdated, resolved)
@DifferentialOrange
Member Author

It looks like we will need two releases for the grafana-dashboard: without these metrics and with them.

Yeah, that's the plan. We already have four dashboards: Tarantool for Prometheus, Tarantool for InfluxDB, TDG for Prometheus, TDG for InfluxDB. After https://github.com/tarantool/grafana-dashboard/, there will be six of them. Thanks to the jsonnet nature of grafana-dashboard, it's rather easy to build different dashboards (it's just a bit bothersome to support different release pages).

@DifferentialOrange DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch 2 times, most recently from 6fc93ae to 75bc41b Compare July 5, 2024 08:26
Since there is no full support for Tarantool 3 config instances in
luatest yet (only treegen support in master), I have borrowed some test
helpers from tarantool/crud [1].

1. https://github.com/tarantool/crud/blob/98b120ef7095fa34525ef9d335a1458a2edf0cca/test/tarantool3_helpers

Part of tarantool/grafana-dashboard#224
The approach used to represent the enum metric here is similar to the
one discussed in [1, 2]. It allows supporting a new status later, if
required. For example, one can visualize it with a hack like [3].

1. prometheus/client_python#416
2. open-telemetry/opentelemetry-specification#1711
3. https://stackoverflow.com/a/75761900/11646599

Part of tarantool/grafana-dashboard#224
@DifferentialOrange DifferentialOrange force-pushed the DifferentialOrange/tarantool-3-cartridge-replacements branch from 75bc41b to c1b3bce Compare July 5, 2024 08:27
@DifferentialOrange
Member Author

DifferentialOrange commented Jul 5, 2024

and the metrics package will only be updated with Tarantool 3.2 in September

AFAIU, we will bump it for all supported 3.x series, so it will be built into 3.1.1 as well

@oleg-jukovec oleg-jukovec left a comment

Thank you for the patch!

@oleg-jukovec

AFAIU, we will bump it for all supported 3.x series, so it will be built into 3.1.1 as well

To be honest, I don't know anything about plans for 3.1.1

@DifferentialOrange
Member Author

DifferentialOrange commented Jul 5, 2024

I have found one more Cartridge-exclusive metric: tnt_clock_delta. There doesn't seem to be any replacement for it in Tarantool 3 either.

(All other panels work fine, so I guess the list is complete.)

ready = 0,
}

config_status[config_info.status] = 1
Collaborator

Maybe it's better to choose a unique numerical value for each status instead of passing the status name to labels?

Member Author

Such an approach scales poorly: if a new status is introduced, we will have two options:

  • add a value in between: very bad, it will break existing dashboards;
  • add a value to the tail: bad unless it is one more "healthy" status (which is unlikely), since the healthy "ready" status will always be visualized as something between "reload_in_progress" and "new_not_healthy_status".

Such metrics are also impossible to read without documentation. An approach similar to the one used here has already been discussed in [1, 2]. So I think this approach is a bit better, even though it's harder to visualize.

  1. Enum: custom label for state prometheus/client_python#416
  2. Does OpenTelemetry need "enum" metric type? open-telemetry/opentelemetry-specification#1711
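
The trade-off discussed in this thread can be sketched in a few lines of Python (again, not the Lua code under review). The status list is an assumption: only "ready" and "reload_in_progress" are named in this conversation, the remaining names are illustrative. `encode_status` and `decode_status` are hypothetical helpers.

```python
# Assumed status names; only "ready" and "reload_in_progress" come from
# the discussion above, the rest are placeholders for illustration.
KNOWN_STATUSES = (
    "uninitialized",
    "check_errors",
    "check_warnings",
    "startup_in_progress",
    "reload_in_progress",
    "ready",
)


def encode_status(current, known=KNOWN_STATUSES):
    """Encode a status as one sample per status label: 1 if active, else 0."""
    samples = {status: 0 for status in known}
    # A status missing from the known list simply becomes one more series,
    # so existing samples and dashboards keep their meaning.
    samples[current] = 1
    return samples


def decode_status(samples):
    """Return the single active status, or None if the samples are ambiguous."""
    active = [status for status, value in samples.items() if value == 1]
    return active[0] if len(active) == 1 else None


print(encode_status("reload_in_progress"))
print(decode_status(encode_status("new_not_healthy_status")))
```

With a numeric encoding, introducing a new status means either renumbering or appending past "ready"; with one sample per label, it only adds a series, at the cost of needing a query or transformation to show a single status on a panel.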

DifferentialOrange added a commit to tarantool/grafana-dashboard that referenced this pull request Jul 8, 2024
This patch renames the `cluster` and `replication` overview panel
sections to `cluster_cartridge` and `replication_cartridge` since they
use Cartridge-specific metrics [1]. This is a breaking change for all
custom-built dashboards that used them. It has no effect on our
published dashboards.

1. tarantool/metrics#491

Part of #224
@DifferentialOrange DifferentialOrange merged commit e7e49a6 into master Jul 9, 2024
83 of 84 checks passed
@DifferentialOrange DifferentialOrange deleted the DifferentialOrange/tarantool-3-cartridge-replacements branch July 9, 2024 09:42
DifferentialOrange added a commit to DifferentialOrange/tarantool that referenced this pull request Jul 10, 2024
Bump metrics package submodule. Commits from PRs [1-4] affect
Tarantool, the other ones are related to module infrastructure.

1. tarantool/metrics#482
2. tarantool/metrics#483
3. tarantool/metrics#484
4. tarantool/metrics#491

NO_DOC=doc is a part of submodule
DifferentialOrange added a commit to DifferentialOrange/tarantool that referenced this pull request Jul 16, 2024
Bump metrics package submodule. Commits from PRs [1-4] affect
Tarantool, the other ones are related to module infrastructure.

1. tarantool/metrics#482
2. tarantool/metrics#483
3. tarantool/metrics#484
4. tarantool/metrics#491

NO_DOC=doc is a part of submodule
Totktonada pushed a commit to tarantool/tarantool that referenced this pull request Jul 22, 2024
Bump metrics package submodule. Commits from PRs [1-4] affect
Tarantool, the other ones are related to module infrastructure.

1. tarantool/metrics#482
2. tarantool/metrics#483
3. tarantool/metrics#484
4. tarantool/metrics#491

NO_DOC=doc is a part of submodule
3 participants