Feature(new_metrics): migrate meta metrics #1331

empiredan · 2023-01-29T07:58:35Z

The meta-related metrics migrated to the new framework will be attached to the server entity. The classes involved are listed below.

The following metrics are members of server_state (server_state.cpp), which is created when meta_service is constructed:

Variables	Types/Computations
_dead_partition_count	Gauge
_unreadable_partition_count	Gauge
_unwritable_partition_count	Gauge
_writable_ill_partition_count	Gauge
_healthy_partition_count	Gauge
_recent_update_config_count	increase(Counter)
_recent_partition_change_unwritable_count	increase(Counter)
_recent_partition_change_writable_count	increase(Counter)
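
In these tables, increase(Counter) is read here as the growth of a monotonically increasing counter over an observation window, which takes the place of the old "recent" value. A minimal sketch of that computation, assuming two samples of the same counter; `counter_increase` is a hypothetical helper, not part of Pegasus:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Pegasus API): returns how much a monotonically
// increasing counter grew between two samples of the same counter.
inline std::uint64_t counter_increase(std::uint64_t previous_sample,
                                      std::uint64_t current_sample)
{
    // Assumption: the counter was not reset between the two samples.
    assert(current_sample >= previous_sample);
    return current_sample - previous_sample;
}
```

For example, samples of 10 and 25 taken at the start and end of a window give a "recent" value of 15.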

The following metrics are members of greedy_load_balancer (greedy_load_balancer.cpp), which is created in meta_service::start():

Variables	Types/Computations
_balance_operation_count	Gauge
_recent_balance_move_primary_count	increase(Counter)
_recent_balance_copy_primary_count	increase(Counter)
_recent_balance_copy_secondary_count	increase(Counter)

The following metrics are members of meta_service (meta_service.cpp), which is created when meta_service_app is constructed:

Variables	Types/Computations
_recent_disconnect_count	increase(Counter)
_unalive_nodes_count	Gauge
_alive_nodes_count	Gauge

The following metric is a member of policy_context and is created in policy_context::start() (meta_service.cpp). policy_context is created in meta_service::start() once cold backup is enabled:

Variables	Types/Computations
_counter_policy_recent_backup_duration_ms	Gauge

The following metric is a member of partition_guardian (meta_service.cpp), which is created in meta_service::start():

Variables	Types/Computations
_recent_choose_primary_fail_count	increase(Counter)

…vel metrics for server_state of meta (#1431) #1331 In perf counters, all metrics of server_state are server-level, for example the number of healthy partitions across all tables of a Pegasus cluster. However, this is sometimes not enough. For example, when the metric shows 4 unwritable partitions, they might belong to different tables or all to one table. Therefore, these server-level metrics are changed to table-level, which provides the status of each table. When server-level metrics are needed, they can be obtained by aggregating the table-level ones. The metrics of server_state that are migrated and changed to table-level include: the number of dead, unreadable, unwritable, writable-ill, and healthy partitions among all partitions of a table; the number of times the configuration has been changed; and the number of times the status of a partition has been changed to unwritable or writable for a table. To implement table-level metrics, a table-level metric entity is also added.
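
The aggregation from table-level to server-level mentioned in this commit message can be sketched as a plain sum over tables. This is only an illustration: `table_partition_gauges` and `aggregate_to_server_level` are hypothetical names, and the real metric framework may aggregate elsewhere (for example in the monitoring system).

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-table gauges; the field names are illustrative only.
struct table_partition_gauges
{
    std::int64_t dead = 0;
    std::int64_t unreadable = 0;
    std::int64_t unwritable = 0;
    std::int64_t writable_ill = 0;
    std::int64_t healthy = 0;
};

// Sums table-level gauges (keyed by table id) into one server-level set.
inline table_partition_gauges
aggregate_to_server_level(const std::map<std::int32_t, table_partition_gauges> &per_table)
{
    table_partition_gauges total;
    for (const auto &kv : per_table) {
        const table_partition_gauges &g = kv.second;
        total.dead += g.dead;
        total.unreadable += g.unreadable;
        total.unwritable += g.unwritable;
        total.writable_ill += g.writable_ill;
        total.healthy += g.healthy;
    }
    return total;
}
```

With 3 unwritable partitions in one table and 1 in another, the server-level value is 4, while the table-level values still show where the problem is concentrated.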

…ition-level metrics for greedy_load_balancer of meta (#1435) #1331 In perf counters, all metrics of greedy_load_balancer are server-level, for example the number of each kind of operation performed by the greedy balancer, including moving primaries, copying primaries, and copying secondaries. The new metrics are intended to be fine-grained, since sometimes we want to know which primaries are moved. It is also convenient to compute table-level or server-level metrics by aggregating the partition-level ones. The metrics of greedy_load_balancer that are changed to partition-level and migrated to the new framework include: the number of balance operations that the greedy balancer recently needed to execute, and the numbers of move-primary, copy-primary, and copy-secondary operations. In addition to the metrics of greedy_load_balancer, some metrics of server_state that were migrated to table-level in #1431 are changed again, to partition-level, because partition-level is considered more appropriate for them than table-level. These include the number of times the configuration has been changed and the number of times the status of a partition has been changed to unwritable or writable. To implement partition-level metrics, a partition-level metric entity is also added.
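
Rolling partition-level counters up to table-level amounts to grouping by table id. A minimal sketch, assuming a partition is identified by a (table id, partition index) pair; `partition_key` and `aggregate_to_table_level` are hypothetical names, not the types used in Pegasus:

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical partition key: (table id, partition index).
using partition_key = std::pair<std::int32_t, std::int32_t>;

// Groups partition-level counter values by table id and sums each group.
inline std::map<std::int32_t, std::uint64_t>
aggregate_to_table_level(const std::map<partition_key, std::uint64_t> &per_partition)
{
    std::map<std::int32_t, std::uint64_t> per_table;
    for (const auto &kv : per_partition) {
        // kv.first.first is the table id; operator[] starts each table at 0.
        per_table[kv.first.first] += kv.second;
    }
    return per_table;
}
```

Summing the resulting per-table values once more gives the server-level figure, so nothing is lost by recording at the finest granularity.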

#1331 Migrate metrics of meta_service to the new framework, including the number of disconnections from replica servers and the numbers of unalive and alive replica servers. All of these metrics are server-level and maintained in the meta server. In perf counters, the number of disconnections is a volatile counter; it is changed to a non-volatile counter, while the other 2 metrics remain gauges.
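
The difference between the two counter kinds can be sketched as follows, assuming a volatile counter resets to zero whenever it is read while a non-volatile counter only ever grows. Both classes are hypothetical models written for illustration, not the Pegasus implementations:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model of a volatile counter: reading it resets it to zero,
// so each read returns only what accumulated since the previous read.
class volatile_counter_sketch
{
public:
    void increment() { _value.fetch_add(1, std::memory_order_relaxed); }
    std::int64_t read_and_reset() { return _value.exchange(0, std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> _value{0};
};

// Hypothetical model of a non-volatile counter: reads have no side effects,
// so any number of readers can derive their own "recent" increase.
class counter_sketch
{
public:
    void increment() { _value.fetch_add(1, std::memory_order_relaxed); }
    std::int64_t value() const { return _value.load(std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> _value{0};
};
```

Under this assumption, two consecutive reads of the volatile model after three increments return 3 and then 0, whereas the non-volatile model returns 3 both times and leaves the windowing to whoever consumes the samples.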

…backup-policy-level metrics for meta_backup_service (#1438) #1331 In perf counters, there is only one metric for meta_backup_service, namely the recent backup duration for each policy, which means this metric is policy-level. Therefore, a policy-level entity is also implemented in the new metrics.

…dian (#1440) #1331 In perf counters, there is only one metric for partition_guardian, namely the number of operations that fail to choose the primary replica, which is server-level. It is changed to partition-level in the new metrics, since this shows which partitions fail to choose primaries and how frequently that happens. As before, table-level or server-level metrics can be computed by aggregating the partition-level ones.