From 753408883af38ab9f249c6865672284ba4399962 Mon Sep 17 00:00:00 2001
From: shichun-0415 <89768198+shichun-0415@users.noreply.github.com>
Date: Sat, 9 Oct 2021 11:07:23 +0800
Subject: [PATCH] tiflash, metric: add alert for TiFlash down (#6590)

---
 alert-rules.md                 | 22 +++++++++++++++++++++-
 tiflash/tiflash-alert-rules.md |  2 +-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/alert-rules.md b/alert-rules.md
index 1ed724b2cf2d1..b6d482271238e 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -8,7 +8,7 @@ aliases: ['/docs/dev/alert-rules/','/docs/dev/reference/alert-rules/']
 
 # TiDB Cluster Alert Rules
 
-This document describes the alert rules for different components in a TiDB cluster, including the rule descriptions and solutions of the alert items in TiDB, TiKV, PD, TiDB Binlog, Node_exporter and Blackbox_exporter.
+This document describes the alert rules for different components in a TiDB cluster, including the rule descriptions and solutions of the alert items in TiDB, TiKV, PD, TiFlash, TiDB Binlog, Node_exporter and Blackbox_exporter.
 
 According to the severity level, alert rules are divided into three categories (from high to low): emergency-level, critical-level, and warning-level. This division of severity levels applies to all alert items of each component below.
 
@@ -781,6 +781,10 @@ This section gives the alert rules for the TiKV component.
 
     The speed of splitting Regions is slower than the write speed. To alleviate this issue, you’d better update TiDB to a version that supports batch-split (>= 2.1.0-rc1). If it is not possible to update temporarily, you can use `pd-ctl operator add split-region --policy=approximate` to manually split Regions.
 
+## TiFlash alert rules
+
+For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md).
+
 ## TiDB Binlog alert rules
 
 For the detailed descriptions of TiDB Binlog alert rules, see [TiDB Binlog monitoring document](/tidb-binlog/monitor-tidb-binlog-cluster.md#alert-rules).
@@ -954,6 +958,22 @@ This section gives the alert rules for the Blackbox_exporter TCP, ICMP, and HTTP
     * Check whether the TiDB process exists.
     * Check whether the network between the monitoring machine and the TiDB machine is normal.
 
+#### `TiFlash_server_is_down`
+
+* Alert rule:
+
+    `probe_success{group="tiflash"} == 0`
+
+* Description:
+
+    Failure to probe the TiFlash service port.
+
+* Solution:
+
+    * Check whether the machine that provides the TiFlash service is down.
+    * Check whether the TiFlash process exists.
+    * Check whether the network between the monitoring machine and the TiFlash machine is normal.
+
 #### `Pump_server_is_down`
 
 * Alert rule:
diff --git a/tiflash/tiflash-alert-rules.md b/tiflash/tiflash-alert-rules.md
index 4f9a6d1f423fc..e6750760f2f9c 100644
--- a/tiflash/tiflash-alert-rules.md
+++ b/tiflash/tiflash-alert-rules.md
@@ -34,7 +34,7 @@ This document introduces the alert rules of the TiFlash cluster.
 
 - Solution:
 
-    It might be caused by the internal problems of the TiFlash TMT engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
+    It might be caused by the internal problems of the TiFlash storage engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
 
 ## `TiFlash_raft_read_index_duration`
 