executor: add diagnosis rule to detect cluster critical errors #14743
Signed-off-by: Lonng <[email protected]>
LGTM
LGTM
	},
}

for _, cas := range cases {
Suggested change:
- for _, cas := range cases {
+ for _, case := range cases {
`case` is a keyword in Go, so it cannot be used as the loop variable name.
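To illustrate the point about the keyword, here is a minimal sketch of a table-driven loop. The `double` function, the `runCases` helper, and the case values are hypothetical and are not taken from this PR; `tc` is just one common alternative to `cas` as the loop variable name.

```go
package main

import "fmt"

// double is a hypothetical function under test, used only for illustration.
func double(n int) int { return n * 2 }

// runCases walks a small table of cases and returns the number of failures.
func runCases() int {
	cases := []struct{ in, want int }{
		{in: 1, want: 2},
		{in: 3, want: 6},
	}

	failures := 0
	// Writing `for _, case := range cases` here would be a syntax error,
	// because `case` is reserved for switch and select statements.
	for _, tc := range cases {
		if got := double(tc.in); got != tc.want {
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Println("failures:", runCases())
}
```

Either `cas` or `tc` compiles; the choice is purely a naming preference.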
/merge
/run-all-tests
What problem does this PR solve?
This PR adds a new diagnosis rule that detects whether critical errors have occurred in the cluster. The current implementation checks the following metrics tables:
tidb_failed_query_opm
tikv_critical_error
tidb_panic_count
tidb_binlog_error_count
pd_cmd_fail_ops
tidb_kv_region_error_ops
tidb_lock_resolver_ops
tikv_scheduler_is_busy
tikv_coprocessor_is_busy
tikv_channel_full_total
tikv_coprocessor_request_error
tidb_schema_lease_error_opm
tidb_transaction_retry_error_ops
tikv_grpc_errors
What is changed and how it works?
The rule queries each of the metrics tables listed above and checks whether any errors were recorded in the past.
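The check can be pictured with a small standalone sketch. This is not the code in this PR: the `sample` type, the `detectCriticalErrors` function, and the in-memory map standing in for the metrics tables are all assumptions made for illustration, and the "any value greater than zero" condition is a simplification of whatever threshold the rule actually applies.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical, simplified row from one metrics table.
type sample struct {
	Instance string
	Value    float64
}

// detectCriticalErrors returns the sorted names of the tables that contain
// at least one sample with a value greater than zero.
func detectCriticalErrors(tables map[string][]sample) []string {
	var hits []string
	for name, rows := range tables {
		for _, r := range rows {
			if r.Value > 0 {
				hits = append(hits, name)
				break // one non-zero sample is enough to flag the table
			}
		}
	}
	sort.Strings(hits) // map iteration order is random, so sort for stable output
	return hits
}

func main() {
	// Made-up sample data keyed by two of the table names listed above.
	tables := map[string][]sample{
		"tidb_panic_count":    {{Instance: "tidb-0", Value: 0}},
		"tikv_critical_error": {{Instance: "tikv-0", Value: 2}},
	}
	fmt.Println(detectCriticalErrors(tables))
}
```

In this sketch only `tikv_critical_error` is flagged, since it is the only table with a non-zero sample.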
Check List
Tests
Release note
Add the critical-error diagnosis rule, which is used to detect cluster critical errors.