*: implement inspection_summary system table which organizes metrics by link/module #14810

Merged
merged 6 commits into from
Feb 18, 2020

Conversation

@lonng lonng commented Feb 17, 2020

What problem does this PR solve?

A TiDB cluster exposes many metrics, and users often do not know which ones to use to locate a problem in the cluster, because the metrics have not been organized in an efficient way.

What is changed and how it works?

This PR introduces a system table, inspection_summary, which organizes our metrics by read/write link or by module, e.g.:

  • read-link organizes all metrics on the read query execution path, from top to bottom.
  • write-link organizes all metrics on the write query execution path, from top to bottom.
  • wait-events organizes all duration metrics.
  • ddl/txn/rocksdb/raftstore...
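
The grouping described above can be pictured as a map from a rule name to an ordered list of metric names. The following is only a minimal sketch under stated assumptions: the names summaryRules and metricsForRule are hypothetical and are not taken from this PR's code, and the metric names are just the few read-link entries visible in the manual test output, not the full list.

```go
package main

// summaryRules is a hypothetical mapping from a rule name to the metrics it
// groups, listed from the top of the execution path to the bottom.
// The metric names are copied from the sample output of this PR; the real
// rule contains many more entries.
var summaryRules = map[string][]string{
	"read-link": {
		"tidb_get_token_duration",
		"tidb_parse_duration",
		"tidb_compile_duration",
		"pd_tso_rpc_duration",
		"pd_tso_wait_duration",
		"tidb_execute_duration",
	},
}

// metricsForRule returns the metrics of one rule in path order and reports
// whether the rule is known.
func metricsForRule(rule string) ([]string, bool) {
	metrics, ok := summaryRules[rule]
	return metrics, ok
}
```

Keeping each rule as an ordered slice, rather than a set, preserves the top-to-bottom order in which the rows are meant to be read.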

NOTE

  • Some predicates are pushed down to InspectionSummaryRetriever, so rules that a query does not use add no extra cost. E.g., select * from inspection_summary where rule='read-link' retrieves only the metrics related to the read link. The columns that support push-down are rule and metric_name.
  • The user can specify one or more arbitrary positive numbers as the quantile, e.g. select * from inspection_summary where rule='read-link' and quantile=0.999 or select * from inspection_summary where rule='read-link' and quantile in (0.80, 0.90, 0.99, 0.999, 1).
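
To illustrate why pushed-down predicates avoid the cost of unused rules, here is a rough sketch of the idea, not the code in this PR. The types summaryFilter and target and the functions plan and uniqueQuantiles are hypothetical names, and treating an empty set as "no restriction" is an assumption made for the example.

```go
package main

import "sort"

// summaryFilter is a hypothetical stand-in for the predicates extracted from
// the WHERE clause. An empty set means the column is not restricted.
type summaryFilter struct {
	Rules       map[string]struct{}
	MetricNames map[string]struct{}
}

// target is one (rule, metric) pair that actually has to be queried.
type target struct {
	Rule   string
	Metric string
}

// allowed reports whether key passes the filter set.
func allowed(set map[string]struct{}, key string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[key]
	return ok
}

// plan lists only the (rule, metric) pairs that match the filter, so rules
// and metrics excluded by the predicates are never fetched at all.
func plan(rules map[string][]string, f summaryFilter) []target {
	names := make([]string, 0, len(rules))
	for rule := range rules {
		names = append(names, rule)
	}
	sort.Strings(names) // deterministic rule order for the sketch

	var out []target
	for _, rule := range names {
		if !allowed(f.Rules, rule) {
			continue
		}
		for _, metric := range rules[rule] {
			if allowed(f.MetricNames, metric) {
				out = append(out, target{Rule: rule, Metric: metric})
			}
		}
	}
	return out
}

// uniqueQuantiles deduplicates and sorts the values of a
// "quantile in (...)" list.
func uniqueQuantiles(qs []float64) []float64 {
	seen := make(map[float64]struct{}, len(qs))
	out := make([]float64, 0, len(qs))
	for _, q := range qs {
		if _, dup := seen[q]; dup {
			continue
		}
		seen[q] = struct{}{}
		out = append(out, q)
	}
	sort.Float64s(out)
	return out
}
```

With a filter on rule='read-link', plan skips every other rule before any metric is read, which is the effect the first note describes.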

TODO

The value columns in the result should be Double(22,6), but fixing this requires many changes. It will be addressed in the next PR, because this PR is already large and would become harder to review as the diff grows.

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)
mysql> select * from inspection_summary where rule='read-link' and quantile=0.99;
+-----------+-----------------+----------------------------------------+-----------------------------------------+----------+------------------------+------------------------+------------------------+
| RULE      | INSTANCE        | METRIC_NAME                            | LABEL                                   | QUANTILE | AVG_VALUE              | MIN_VALUE              | MAX_VALUE              |
+-----------+-----------------+----------------------------------------+-----------------------------------------+----------+------------------------+------------------------+------------------------+
| read-link | localhost:10080 | tidb_get_token_duration                |                                         |     0.99 |                      0 |                      0 |                      0 |
| read-link | localhost:10080 | tidb_parse_duration                    | general                                 |     0.99 |                      0 |                      0 |                      0 |
| read-link | localhost:10080 | tidb_parse_duration                    | internal                                |     0.99 | 0.00024318023875114783 | 0.00015546666666666671 | 0.00031222857142857124 |
| read-link | localhost:10080 | tidb_compile_duration                  | general                                 |     0.99 |  0.0005602036363636362 | 0.00031885714285714286 |  0.0008511999999999967 |
| read-link | localhost:10080 | tidb_compile_duration                  | internal                                |     0.99 |  0.0005602036363636362 | 0.00031885714285714286 |  0.0008511999999999967 |
| read-link |                 | pd_tso_rpc_duration                    |                                         |     0.99 |  0.0006046491087783595 |               0.000495 |  0.0016449999999999941 |
| read-link |                 | pd_tso_wait_duration                   |                                         |     0.99 |  0.0007610674231760213 |  0.0004950000000000001 |  0.0019399999999999817 |
| read-link | localhost:10080 | tidb_execute_duration                  | general                                 |     0.99 |                      0 |                      0 |                      0 |
| read-link | localhost:10080 | tidb_execute_duration                  | internal                                |     0.99 |   0.006195393939393939 |    0.00432000000000001 |   0.009813333333333346 |
...
| read-link | localhost:20180 | tikv_per_read_max_bytes                | bytes_per_read_percentile95, raft       |        0 |                      0 |                      0 |                      0 |
| read-link | localhost:20180 | tikv_per_read_max_bytes                | bytes_per_read_percentile99, kv         |        0 |      9.931350874431612 |                9.92375 |      9.998446601941748 |
| read-link | localhost:20180 | tikv_per_read_max_bytes                | bytes_per_read_percentile99, raft       |        0 |                      0 |                      0 |                      0 |
| read-link | localhost:20180 | tikv_per_read_max_bytes                | bytes_per_read_standard_deviation, kv   |        0 |      9.116433007264865 |       8.65275822534578 |     13.146375472013052 |
| read-link | localhost:20180 | tikv_per_read_max_bytes                | bytes_per_read_standard_deviation, raft |        0 |                      0 |                      0 |                      0 |
+-----------+-----------------+----------------------------------------+-----------------------------------------+----------+------------------------+------------------------+------------------------+
387 rows in set (0.21 sec)

Release note

  • Add a new system table, inspection_summary, which organizes metrics by link/module

@lonng lonng added the sig/execution SIG execution label Feb 17, 2020
@lonng lonng added this to the v4.0.0-beta.1 milestone Feb 17, 2020
@lonng lonng requested a review from a team as a code owner February 17, 2020 01:17
@ghost ghost requested review from alivxxx and lzmhhh123 and removed request for a team February 17, 2020 01:18
@lonng lonng changed the title [WIP] *: inspection summary *: implement inspection_summary system table which organizes metrics by link/module Feb 18, 2020
@lonng lonng requested review from crazycs520 and Deardrops and removed request for alivxxx and lzmhhh123 February 18, 2020 05:15

@Deardrops Deardrops left a comment

LGTM

@Deardrops Deardrops added the status/LGT1 Indicates that a PR has LGTM 1. label Feb 18, 2020

@crazycs520 crazycs520 left a comment

LGTM

@crazycs520 crazycs520 added the status/can-merge Indicates a PR has been approved by a committer. label Feb 18, 2020

sre-bot commented Feb 18, 2020

/run-all-tests

@sre-bot sre-bot merged commit 9fbefc5 into pingcap:master Feb 18, 2020
@lonng lonng deleted the inspection-summary branch February 19, 2020 02:38
@lonng lonng added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Feb 19, 2020
Labels
sig/execution SIG execution status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.