Feature(new_metrics): migrate meta metrics #1331

Open
Tracked by #1328
empiredan opened this issue Jan 29, 2023 · 0 comments
Labels
type/enhancement Indicates new feature requests

Comments

@empiredan
Contributor

The meta-related metrics migrated to the new framework will be attached to the server entity. The classes involved are listed below.


The following metrics are members of server_state (server_state.cpp), which is created when meta_service is constructed:

| Variable | Type/Computation |
| --- | --- |
| _dead_partition_count | Gauge |
| _unreadable_partition_count | Gauge |
| _unwritable_partition_count | Gauge |
| _writable_ill_partition_count | Gauge |
| _healthy_partition_count | Gauge |
| _recent_update_config_count | increase(Counter) |
| _recent_partition_change_unwritable_count | increase(Counter) |
| _recent_partition_change_writable_count | increase(Counter) |

The following metrics are members of greedy_load_balancer (greedy_load_balancer.cpp), which is created in meta_service::start():

| Variable | Type/Computation |
| --- | --- |
| _balance_operation_count | Gauge |
| _recent_balance_move_primary_count | increase(Counter) |
| _recent_balance_copy_primary_count | increase(Counter) |
| _recent_balance_copy_secondary_count | increase(Counter) |

The following metrics are members of meta_service (meta_service.cpp), which is created when meta_service_app is constructed:

| Variable | Type/Computation |
| --- | --- |
| _recent_disconnect_count | increase(Counter) |
| _unalive_nodes_count | Gauge |
| _alive_nodes_count | Gauge |

The following metric is a member of policy_context and is created in policy_context::start() (meta_service.cpp). policy_context is created in meta_service::start() if cold backup is enabled:

| Variable | Type/Computation |
| --- | --- |
| _counter_policy_recent_backup_duration_ms | Gauge |

The following metric is a member of partition_guardian (meta_service.cpp), which is created in meta_service::start():

| Variable | Type/Computation |
| --- | --- |
| _recent_choose_primary_fail_count | increase(Counter) |
empiredan added the type/enhancement label Jan 29, 2023
empiredan changed the title from "Feature(new_metrics): migrate meta metrics" to "Feature(new_metrics): add table-level metric entity migrate meta metrics" on Apr 8, 2023
empiredan changed the title from "Feature(new_metrics): add table-level metric entity migrate meta metrics" to "Feature(new_metrics): add table-level metric entity and migrate meta metrics" on Apr 10, 2023
acelyc111 pushed a commit that referenced this issue Apr 10, 2023
…vel metrics for server_state of meta (#1431)

#1331

In perf counters, all metrics of server_state are server-level, for example,
the number of healthy partitions among all tables of a Pegasus cluster.

However, sometimes this is not enough. For example, if the metric shows
that there are 4 unwritable partitions, those 4 partitions might belong to
different tables, or they might all belong to one table; the server-level
metric cannot tell which.

Therefore, these server-level metrics are changed to table-level, which
provides the status of each table. If server-level metrics are still
needed, they can be obtained by aggregating the table-level ones.

The metrics of server_state that are migrated and changed to table-level
include: the number of dead, unreadable, unwritable, writable-ill, and
healthy partitions among all partitions of a table, the number of times
the configuration has been changed, and the number of times the status
of a partition has been changed to unwritable or writable for a table.

To implement table-level metrics, a table-level metric entity is also added.
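
The aggregation mentioned above could look roughly like the following. This is only a sketch under assumptions: `table_partition_gauges` and `aggregate_to_server_level` are hypothetical names made up for illustration and are not part of the Pegasus metrics framework, and tables are keyed by name purely for readability.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical snapshot of the table-level partition gauges listed in the
// commit message. This is an illustration, not the Pegasus metric API.
struct table_partition_gauges
{
    int64_t dead = 0;
    int64_t unreadable = 0;
    int64_t unwritable = 0;
    int64_t writable_ill = 0;
    int64_t healthy = 0;
};

// Sum the gauges of every table to recover the old server-level view.
table_partition_gauges
aggregate_to_server_level(const std::map<std::string, table_partition_gauges> &tables)
{
    table_partition_gauges total;
    for (const auto &kv : tables) {
        total.dead += kv.second.dead;
        total.unreadable += kv.second.unreadable;
        total.unwritable += kv.second.unwritable;
        total.writable_ill += kv.second.writable_ill;
        total.healthy += kv.second.healthy;
    }
    return total;
}
```

With 3 unwritable partitions in one table and 1 in another, the aggregate reports 4 unwritable partitions, while the per-table values still show where they are.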
empiredan added a commit that referenced this issue Apr 11, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Apr 11, 2023
…vel metrics for server_state of meta (#1431)

acelyc111 pushed a commit that referenced this issue Apr 14, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

#1331

In perf counters, all metrics of greedy_load_balancer are server-level, for
example, the number of operations of each kind performed by the greedy
balancer, including moving primaries, copying primaries and copying secondaries.

The new metrics are intended to be fine-grained, since sometimes we
want to know which primaries are moved. It is also convenient to calculate
table-level or server-level metrics by aggregating the partition-level ones.

The metrics of greedy_load_balancer that are changed to partition-level and
migrated to the new framework include: the number of balance operations that
the greedy balancer recently needed to execute, and the numbers of
move-primary, copy-primary, and copy-secondary operations.

In addition to the metrics of greedy_load_balancer, some metrics of
server_state that were migrated to table-level in #1431 are changed again,
to partition-level, because partition-level is considered more appropriate
for them than table-level. These metrics include the number of times the
configuration has been changed and the number of times the status of a
partition has been changed to unwritable or writable.

To implement partition-level metrics, a partition-level metric entity is also added.
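
As a rough illustration of rolling partition-level metrics up to table-level and server-level, consider the sketch below. The types and functions (`partition_id`, `balance_counters`, `aggregate_by_table`, `aggregate_to_server`) are hypothetical and only assume that a partition can be identified by a table id plus a partition index; they are not the actual Pegasus entities.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical partition identifier: (table id, partition index).
using partition_id = std::pair<int32_t, int32_t>;

// Hypothetical partition-level counters for balance operations.
struct balance_counters
{
    int64_t move_primary = 0;
    int64_t copy_primary = 0;
    int64_t copy_secondary = 0;
};

using partition_metrics = std::map<partition_id, balance_counters>;

// Add the counters of `from` into `to`.
void accumulate(balance_counters &to, const balance_counters &from)
{
    to.move_primary += from.move_primary;
    to.copy_primary += from.copy_primary;
    to.copy_secondary += from.copy_secondary;
}

// Roll partition-level counters up to one entry per table id.
std::map<int32_t, balance_counters> aggregate_by_table(const partition_metrics &partitions)
{
    std::map<int32_t, balance_counters> tables;
    for (const auto &kv : partitions) {
        accumulate(tables[kv.first.first], kv.second);
    }
    return tables;
}

// Roll all partition-level counters up to a single server-level entry.
balance_counters aggregate_to_server(const partition_metrics &partitions)
{
    balance_counters total;
    for (const auto &kv : partitions) {
        accumulate(total, kv.second);
    }
    return total;
}
```

Keeping the finest granularity at the source means the coarser views are always derivable, whereas the reverse is not possible.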
empiredan added a commit that referenced this issue Apr 14, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Apr 14, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Apr 14, 2023
#1331

Migrate metrics to the new framework for meta_service, including the number
of disconnections from replica servers, and the numbers of unalive and alive
replica servers. All of these metrics are server-level and are maintained in
the meta server.

In perf counters, the number of disconnections is a volatile counter; it is
changed to a non-volatile counter, while the other 2 metrics remain gauges.
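
To illustrate the difference, here is a minimal sketch that assumes "volatile" means the counter is reset each time it is read, and that a "recent" value can instead be derived on the reader side as the increase of a monotonic counter between two samples (the `increase(Counter)` computation in the tables of this issue). The classes `monotonic_counter` and `increase_tracker` are hypothetical and are not the Pegasus implementation.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical non-volatile counter: reading it never resets it.
class monotonic_counter
{
public:
    void increment() { _value.fetch_add(1, std::memory_order_relaxed); }
    int64_t value() const { return _value.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> _value{0};
};

// Hypothetical reader-side helper: reports how much the counter has grown
// since the previous sample, which plays the role of the old "recent" value.
class increase_tracker
{
public:
    int64_t sample(const monotonic_counter &counter)
    {
        const int64_t current = counter.value();
        const int64_t delta = current - _last;
        _last = current;
        return delta;
    }

private:
    int64_t _last = 0;
};
```

Because the counter itself is never reset, several independent readers can each compute their own increase without interfering with one another.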
empiredan added a commit that referenced this issue Apr 15, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

#1331

In perf counters, there's only one metric for meta_backup_service, namely
the recent backup duration for each policy, which means this metric is
policy-level. Therefore, a policy-level entity is also implemented in the
new metrics.
empiredan changed the title from "Feature(new_metrics): add table-level metric entity and migrate meta metrics" to "Feature(new_metrics): migrate meta metrics" on Apr 15, 2023
acelyc111 pushed a commit that referenced this issue Apr 16, 2023
…dian (#1440)

#1331

In perf counters, there's only one metric for partition_guardian, namely
the number of operations that fail to choose the primary replica, which
is server-level. It is changed to partition-level in the new metrics,
since this shows which partitions fail to choose primaries and how
frequently this happens. As before, table-level or server-level metrics
can be computed by aggregating the partition-level ones.
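
One way a partition-level failure count could be used to see which partitions fail and how often is sketched below. The names `partition_id`, `partition_failures`, and `rank_by_failures` are hypothetical, and the input map stands in for whatever values a metrics query would return.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical partition identifier: (table id, partition index).
using partition_id = std::pair<int32_t, int32_t>;
using partition_failures = std::pair<partition_id, int64_t>;

// Return the partitions that failed to choose a primary at least once,
// ordered from the most to the least frequent failures.
std::vector<partition_failures>
rank_by_failures(const std::map<partition_id, int64_t> &fail_counts)
{
    std::vector<partition_failures> ranked;
    for (const auto &kv : fail_counts) {
        if (kv.second > 0) {
            ranked.emplace_back(kv.first, kv.second);
        }
    }
    // stable_sort keeps partitions with equal counts in partition-id order.
    std::stable_sort(ranked.begin(),
                     ranked.end(),
                     [](const partition_failures &a, const partition_failures &b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

A server-level counter could only say that some failures happened; the ranking above needs the per-partition values.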
empiredan added a commit that referenced this issue Apr 18, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Apr 18, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Apr 18, 2023
empiredan added a commit that referenced this issue Apr 18, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

empiredan added a commit that referenced this issue Apr 18, 2023
…dian (#1440)

empiredan added a commit that referenced this issue Apr 27, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Apr 27, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Apr 27, 2023
empiredan added a commit that referenced this issue Apr 27, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

empiredan added a commit that referenced this issue Apr 27, 2023
…dian (#1440)

empiredan added a commit that referenced this issue May 5, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Jun 5, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Jun 5, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Jun 5, 2023
empiredan added a commit that referenced this issue Jun 5, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

empiredan added a commit that referenced this issue Jun 5, 2023
…dian (#1440)

empiredan added a commit that referenced this issue Jun 21, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Jun 21, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Jun 21, 2023
empiredan added a commit that referenced this issue Jun 21, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

empiredan added a commit that referenced this issue Jun 21, 2023
…dian (#1440)

empiredan added a commit that referenced this issue Aug 9, 2023
…vel metrics for server_state of meta (#1431)

empiredan added a commit that referenced this issue Aug 9, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

empiredan added a commit that referenced this issue Aug 9, 2023
empiredan added a commit that referenced this issue Aug 9, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

empiredan added a commit that referenced this issue Aug 9, 2023
…dian (#1440)

#1331

In perf counters, there is only one metric for partition_guardian, namely
the number of operations that fail to choose the primary replica, which
is server-level. It is changed to partition-level in the new metrics,
since this shows which partitions fail to choose primaries and how
frequently that happens. Table-level or server-level metrics can still be
computed by aggregating the partition-level ones.
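
Aggregating partition-level values upward could look roughly like the sketch below. The type alias `partition_id` (a pair of table id and partition index) and the functions `aggregate_by_table` and `aggregate_to_server` are hypothetical names chosen for this illustration; in practice the aggregation may be done by whatever system collects the metrics rather than in the server.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical partition identifier: (table id, partition index).
using partition_id = std::pair<int32_t, int32_t>;

// Sum partition-level counts into one value per table.
inline std::map<int32_t, int64_t>
aggregate_by_table(const std::map<partition_id, int64_t> &per_partition)
{
    std::map<int32_t, int64_t> per_table;
    for (const auto &kv : per_partition) {
        // kv.first.first is the table id of this partition.
        per_table[kv.first.first] += kv.second;
    }
    return per_table;
}

// Sum partition-level counts into a single server-level value.
inline int64_t aggregate_to_server(const std::map<partition_id, int64_t> &per_partition)
{
    int64_t total = 0;
    for (const auto &kv : per_partition) {
        total += kv.second;
    }
    return total;
}
```

Because the partition-level values are kept, the finer breakdown (which partition failed and how often) remains available alongside the aggregated totals.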
empiredan added a commit that referenced this issue Aug 11, 2023
…vel metrics for server_state of meta (#1431)

#1331

empiredan added a commit that referenced this issue Aug 11, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

#1331

empiredan added a commit that referenced this issue Aug 11, 2023
#1331

empiredan added a commit that referenced this issue Aug 11, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

#1331

empiredan added a commit that referenced this issue Aug 11, 2023
…dian (#1440)

#1331

empiredan added a commit to empiredan/pegasus that referenced this issue Dec 6, 2023
…vel metrics for server_state of meta (apache#1431)

apache#1331

empiredan added a commit to empiredan/pegasus that referenced this issue Dec 6, 2023
…ition-level metrics for greedy_load_balancer of meta (apache#1435)

apache#1331

empiredan added a commit to empiredan/pegasus that referenced this issue Dec 6, 2023
…che#1437)

apache#1331

empiredan added a commit to empiredan/pegasus that referenced this issue Dec 6, 2023
…backup-policy-level metrics for meta_backup_service (apache#1438)

apache#1331

empiredan added a commit to empiredan/pegasus that referenced this issue Dec 6, 2023
…dian (apache#1440)

apache#1331

empiredan added a commit that referenced this issue Dec 11, 2023
…vel metrics for server_state of meta (#1431)

#1331

empiredan added a commit that referenced this issue Dec 11, 2023
…ition-level metrics for greedy_load_balancer of meta (#1435)

#1331

empiredan added a commit that referenced this issue Dec 11, 2023
#1331

empiredan added a commit that referenced this issue Dec 11, 2023
…backup-policy-level metrics for meta_backup_service (#1438)

#1331

empiredan added a commit that referenced this issue Dec 11, 2023
…dian (#1440)

#1331
