-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: tidb built-in sql diagnostics #13481
Conversation
Signed-off-by: Lonng <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #13481 +/- ##
===========================================
Coverage 80.1103% 80.1103%
===========================================
Files 474 474
Lines 117257 117257
===========================================
Hits 93935 93935
Misses 15948 15948
Partials 7374 7374 |
Signed-off-by: Lonng <[email protected]>
Signed-off-by: Lonng <[email protected]>
Signed-off-by: Lonng <[email protected]>
Signed-off-by: Lonng <[email protected]>
Signed-off-by: Lonng <[email protected]>
| 33 | pd | pd-0 | 127.0.0.1:2379 | log.format | text | | ||
| 34 | pd | pd-0 | 127.0.0.1:2379 | log.level | | | ||
| 35 | pd | pd-0 | 127.0.0.1:2379 | log.sampling | <nil> | | ||
| 114 | tidb | tidb-0 | 127.0.0.1:4000 | log.disable-error-stack | <nil> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tidb-0? can we name a TiDB instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we cannot, it's a temporary name generated by TIDB_CLUSTER_INFO
. Do we need to remote it?
Signed-off-by: Lonng <[email protected]>
|
||
#### 监控信息系统表 | ||
|
||
由于监控指标会随着程序的迭代添加和删除监控指标,对于同一个监控指标,可能有不同的表达式获取监控不同维度的信息。鉴于以上两个需求,需要设计一个有弹性的监控系统表框架,本提案暂时才采取以下方案:将表达式映射为 `metrics_schema` 数据库中的系统表,表达式与系统表的关系可以通过以下方式关联: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also need to provide a mechanism that checking the query is valid.
| 126 | tidb | tidb-0 | 127.0.0.1:4000 | log.record-plan-in-slow-log | 1 | | ||
| 127 | tidb | tidb-0 | 127.0.0.1:4000 | log.slow-query-file | tidb-slow.log | | ||
| 128 | tidb | tidb-0 | 127.0.0.1:4000 | log.slow-threshold | 300 | | ||
| 213 | tikv | tikv-0 | 127.0.0.1:20160 | log-file | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of the tikv name suffix, store id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been removed and I have not updated this part yet.
See: #13587
+------+------+--------+-----------------+-----------------------------+---------------+ | ||
| ID | TYPE | NAME | ADDRESS | KEY | VALUE | | ||
+------+------+--------+-----------------+-----------------------------+---------------+ | ||
| 21 | pd | pd-0 | 127.0.0.1:2379 | log-file | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of ID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been removed and I have not updated this part yet.
See: #13587
- 劣势:增加集群运维难度,第三方组件不容易与 TiDB 内部 SQL 集成;日志收集工具会收集全量日志,收集过程占用各个系统资源(磁盘 IO、网络 IO) | ||
- 各个节点提供日志服务,TiDB 通过各个节点的接口将谓词下推到日志检索接口,直接对各个节点返回的日志进行归并 | ||
- 优势:不引入三方组件,谓词下推后只返回过滤后的日志,能轻易的与 TiDB SQL 进行集成,并能复用 SQL 引擎的过滤、聚合等 | ||
- 劣势:如果节点日志删除后,不能检索到对应日志 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be resolved by allowing TiDB to query other logging systems. Making the log storage pluggable. And it can also be applied to the monitoring system. If users want more persistent monitoring/logging data, they can replace the builtin storage engines for this kind of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We choose a lightweight way for the time being.
|
||
#### 节点配置信息 | ||
|
||
所有节点都包含当前节点的生效配置,不需要额外的步骤既可拿到配置信息。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
所有节点都包含当前节点的生效配置,不需要额外的步骤既可拿到配置信息。 | |
所有节点都包含当前节点的生效配置,不需要额外的步骤即可拿到配置信息。 |
- rxcmp/s:每秒钟接收的压缩数据包 | ||
- txcmp/s:每秒钟发送的压缩数据包 | ||
- rxmcst/s:每秒钟接收的多播数据包 | ||
- 常用的系统配置:sysctl -a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ulimit -a
1. PD 中添加 `remote-metrics-storage` 配置,暂时配置为 Prometheus Server 的地址。PD 作为 proxy,将请求转移到 Prometheus 上执行,主要有以下考量: | ||
- 后续 PD 实现查询接口实现自举,TiDB 不需要做其他改动 | ||
- 用户不使用 TiDB 部署的 Prometheus 而使用自建的监控服务,依然可以使用 SQL 查询监控信息以及诊断框架 | ||
2. 将 Prometheus 时序数据保存和查询相应的模块抽离出来,并嵌入到 PD 中 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to embed Prometheus into PD, how about shim TiKV as the storage of metrics via https://github.com/bragfoo/TiPrometheus?
|
||
- 使用 Prometheus client 和 PromQL 查询 Prometheus server 的数据 | ||
- 优势:有现成解决方案,只需要将 Prometheus server 的地址注册到 TiDB 即可,实现简单 | ||
- 劣势:增强了 TiDB 对 Prometheus 的依赖,为后续完全移除 Prometheus 增加了困难 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about how to integrate the alert and graphing solutions from community, like alertmanager and grafana after we remove Prometheus.
Signed-off-by: Lonng <[email protected]>
Co-Authored-By: Tennix <[email protected]>
Signed-off-by: Lonng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/run-all-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
+---------+-------------+------+------+---------+-------+ | ||
5 rows in set (0.00 sec) | ||
|
||
mysql> select * from tidb_cluster_log where content like '%412134239937495042%'; -- 查询 TSO 为 412134239937495042 全链路日志 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mysql> -- 查询 TSO 为 412134239937495042 全链路日
mysql> select * from tidb_cluster_log where content like '%412134239937495042%';
may look better
+------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| TYPE | ADDRESS | LEVEL | CONTENT | | ||
+------+------------------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ||
| tidb | 10.9.120.251:10080 | INFO | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:501.60574ms txnStartTS:412134239937495042 region_id:180 store_addr:10.9.82.29:20160 kv_process_ms:416 scan_total_write:340807 scan_processed_write:340806 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the log is too verbose here, can you simply them and only list a few?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
Signed-off-by: Lonng <[email protected]>
Signed-off-by: Lonng [email protected]
Summary
Currently, TiDB diagnostic information acquisition relies mainly on external tools (perf/iosnoop/iotop/iostat/vmstat/sar/...), monitoring systems (Prometheus/Grafana), log files, HTTP APIs, and system tables provided by TiDB. The decentralized toolchains and cumbersome acquisition methods lead to high barriers to the use of TiDB clusters, difficulty in operation and maintenance, failure to detect problems in advance, and failure to timely investigate, diagnose, and recover clusters.
This proposal proposes a new method of acquiring diagnostic information in TiDB and exposing diagnostic information by the system tables so that users can query using SQL.
Motivation
This proposal mainly solves the following problems in TiDB's process of obtaining diagnostic information:
The efficiency of the cluster-based information query, state acquisition, log retrieval, one-click inspection, and fault diagnosis will be improved after the multi-dimensional cluster-level system table and the cluster's diagnostic rule framework is provided. And provide basic data for the subsequent abnormal early warning function.