Slow Log Enhancement Tracking Issue #34487

Open · 17 tasks

cfzjywxk opened this issue May 9, 2022 · 1 comment
Labels: sig/diagnosis (SIG: Diagnosis), type/enhancement (The issue or PR belongs to an enhancement.)

cfzjywxk (Contributor) commented May 9, 2022

Enhancement

This is an enhancement task following #34106 and tikv/tikv#12362. As more information will be collected in those two enhancement tasks, the slow query log can also be enhanced so that users can locate the root cause of slowness within 5 minutes. Previously it was quite difficult to tell what really slowed down a specific query, for example in the issue described in #28937.

Background

The current slow log mainly contains two parts: the first-level fields such as Query_time and Compile_time, and the executor runtime information, which can be decoded using select tidb_decode_plan(). The problems are:

  • Some important execution paths are not tracked, for example the KV request processing in the TiDB batch client.
  • The information lacks a user-perceivable latency perspective. The execution model is a waterfall driven by the Next calls, while the coprocessor tasks and the 2PC prewrite processing run in separate pipelines, so recording the duration of a single cop task does not directly reflect the latency the user perceives.

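For reference, a current slow log entry looks roughly like the following; the values are illustrative rather than taken from a real run, and only a subset of the first-level fields is shown:

# Time: 2022-05-09T10:00:00.000000+08:00
# Query_time: 1.35
# Parse_time: 0.00005
# Compile_time: 0.0001
# Process_time: 9.47 Wait_time: 0.051 Request_count: 17 Process_keys: 1280000
# Plan: tidb_decode_plan('...')
select count(*) from T_CUSTOMER;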

Target

Users should be able to locate the root cause of slowness from the slow log record within 5 minutes, and the TiDB diagnose engine could provide improvement advice accordingly.

Design

The overall idea

We could split the query execution into three phases and record their main durations as slow log first-level fields, so it is easy to tell which phase is the major source of latency. The task details could then be recorded in the execution plan tree; there is already an execution plan tree in the slow log in which the details of the root and cop tasks are recorded, for example:

select tidb_decode_plan('hQeIMAk1XzIwCTAJMQlmdW5jczpjb3VudChDb2x1bW4jMjApLT4NDPBGMTUJMQl0aW1lOjEuM3MsIGxvb3BzOjIJMzgwIEJ5dGVzCU4vQQoxCTMyXzIxCTAJMQlpbmRleDpTdHJlYW1BZ2dfOAkxNQlGQgBsLCBjb3BfdGFzazoge251bTogMTcsIG1heDogMQVqWG1pbjogMy4xM21zLCBhdmc6IDU2MS43AQ4IcDk1FSZUYXhfcHJvY19rZXlzOiAxMjgwMDAwLAEjUhgACHRvdAUYHDogOS40N3MsBREcd2FpdDogNTEBXBRycGNfbnUNjwENBbsUIDkuNTVzBbRscl9jYWNoZV9oaXRfcmF0aW86IDAuMDB9CTM4Mj0TCDIJNQH7BF8wOmQBADE5XAgyMAklGwRrdikIAHsBySUDFDEuMjlzLCUDLDBzLCBwODA6MS4xOAELBDk1ER6oaXRlcnM6OTc3OCwgdGFza3M6MTd9LCBzY2FuX2RldGFpbDoge3RvdGFsXwFZCGVzczElISMAMCEmRh4AIF9zaXplOiA4OQEhHSQtXwETdDAxNywgcm9ja3NkYjoge2RlbGV0ZV9za2lwcGVkX0VKHDogMCwga2V5PhYACUABegxibG9jIfo5TA05GDI1LCByZWEVSRAxNDM0OQ0TUGJ5dGU6IDI4My45IE1CfX19CU4vQQEEIAozCTQ2XzE5CSVwADAJAUwJdGFibGU6VF9DVVNUT01FUiwgaUWcTFBSSU1BUlkoY3VzdG9tZXJfaWQpAa84ZXAgb3JkZXI6ZmFsc2UJGUj+ogEyogEgCU4vQQlOL0EK');
id | task | estRows | operator info | actRows | execution info | memory | disk
StreamAgg_20 | root | 1 | funcs:count(Column#20)->Column#15 | 1 | time:1.3s, loops:2 | 380 Bytes | N/A
└─IndexReader_21 | root | 1 | index:StreamAgg_8 | 15 | time:1.3s, loops:2, cop_task: {num: 17, max: 1.3s, min: 3.13ms, avg: 561.7ms, p95: 1.3s, max_proc_keys: 1280000, p95_proc_keys: 1280000, tot_proc: 9.47s, tot_wait: 51ms, rpc_num: 17, rpc_time: 9.55s, copr_cache_hit_ratio: 0.00} | 382 Bytes | N/A
  └─StreamAgg_8 | cop[tikv] | 1 | funcs:count(1)->Column#20 | 15 | tikv_task:{proc max:1.29s, min:0s, p80:1.18s, p95:1.29s, iters:9778, tasks:17}, scan_detail: {total_process_keys: 10000000, total_process_keys_size: 890000000, total_keys: 10000017, rocksdb: {delete_skipped_count: 0, key_skipped_count: 10000000, block: {cache_hit_count: 25, read_count: 14349, read_byte: 283.9 MB}}} | N/A | N/A
    └─IndexFullScan_19 | cop[tikv] | 10000000 | table:T_CUSTOMER, index:PRIMARY(customer_id), keep order:false | 10000000 | tikv_task:{proc max:1.29s, min:0s, p80:1.18s, p95:1.29s, iters:9778, tasks:17} | N/A | N/A

The execution preparing phase

  • Add Protocol_time: the duration between receiving the MySQL packet and starting to process the MySQL command.
  • Add Token_wait_time: a sub-part of Protocol_time, the duration spent waiting for the token that allows the MySQL command to be processed.
  • Add Load_variable_time: the time spent reading the global variables for the current execution. Usually it should be small.

So the duration of this phase is:

protocol process -> parse -> load global variables -> compile -> wait ts

These are user-perceivable durations, and all of these steps run within a single TiDB goroutine.
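A minimal Go sketch of where the first two of these durations could be measured in the connection handling loop; PrepareStats, getToken and the dispatch callback are made-up names rather than the actual TiDB server code, and Load_variable_time would be recorded in the same way inside the dispatch path between parsing and compiling:

package sketch

import "time"

// Hypothetical stats holder for the preparing phase; field names mirror the
// proposed slow log fields.
type PrepareStats struct {
    ProtocolTime  time.Duration // Protocol_time
    TokenWaitTime time.Duration // Token_wait_time, a sub-part of Protocol_time
}

func handleCommand(getToken func(), dispatch func(stats *PrepareStats)) PrepareStats {
    var s PrepareStats
    recv := time.Now() // the mysql packet has just been read from the connection

    tokenStart := time.Now()
    getToken() // wait until this session is allowed to process the command
    s.TokenWaitTime = time.Since(tokenStart)

    s.ProtocolTime = time.Since(recv) // everything before command processing starts

    dispatch(&s) // parse -> load global variables -> compile -> wait ts -> execute
    return s
}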

The execution phase

  • Add Executor_build_time: the time spent building the executors. In some situations the for_update_ts needs to be refreshed, so there may be a synchronous wait-ts duration in this stage.
  • Add Executor_open_time: the time spent opening the executors. For example, the CopClient may start a specific number of cop workers to send cop requests and process the responses.
  • Add Loop_time: the actual time spent driving the Next calls (see the sketch after this list). This is the most important part of the execution handling, and different executors run different kinds of tasks here. For pessimistic transaction statements there is an additional phase that locks the related keys.
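A rough Go sketch of where the three durations above could be taken; Executor, Chunk and the callbacks are simplified stand-ins for the real executor interfaces, not the actual implementation:

package sketch

import (
    "context"
    "time"
)

// Hypothetical stand-ins for the executor interfaces used in this sketch.
type Chunk struct{ rows int }

func (c *Chunk) NumRows() int { return c.rows }

type Executor interface {
    Open(ctx context.Context) error
    Next(ctx context.Context, chk *Chunk) error
    Close() error
}

// Field names mirror the proposed slow log fields.
type ExecPhaseStats struct {
    ExecutorBuildTime time.Duration // Executor_build_time
    ExecutorOpenTime  time.Duration // Executor_open_time
    LoopTime          time.Duration // Loop_time
}

func runStatement(ctx context.Context, build func() Executor, send func(*Chunk)) (ExecPhaseStats, error) {
    var stats ExecPhaseStats

    start := time.Now()
    exec := build() // may refresh for_update_ts and synchronously wait for a new ts
    stats.ExecutorBuildTime = time.Since(start)

    start = time.Now()
    if err := exec.Open(ctx); err != nil { // e.g. the CopClient starts its cop workers here
        return stats, err
    }
    stats.ExecutorOpenTime = time.Since(start)

    start = time.Now()
    for {
        chk := &Chunk{}
        if err := exec.Next(ctx, chk); err != nil {
            return stats, err
        }
        if chk.NumRows() == 0 {
            break // the executor is drained
        }
        send(chk) // stream the chunk back to the client
    }
    stats.LoopTime = time.Since(start)
    return stats, exec.Close()
}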

The execution finishing phase

  • Add Response_time: the time spent sending response packets back to the MySQL client.
  • Add Close_time: the time spent finishing the current query processing; for a select statement this includes closing the internal result set. If the close processing is slow, the current session or connection cannot process the next MySQL command or packet in time. (An example entry combining all the proposed fields follows this list.)
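Putting the three phases together, a slow log entry with all the proposed first-level fields might look roughly like this; the names follow the proposal above and the values are made up:

# Protocol_time: 0.0004
# Token_wait_time: 0.0001
# Load_variable_time: 0.0002
# Executor_build_time: 0.0005
# Executor_open_time: 0.003
# Loop_time: 1.28
# Response_time: 0.002
# Close_time: 0.0005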

The detailed durations within Loop_time

As mentioned above, the most important duration is the Loop_time phase, in which there can be complex calculations and KV request round trips. We also need to figure out how to display this information from a user-perceivable perspective.

Note that these detail enhancements could be implemented in the execution plan tree that the slow log already records. They would not change the slow log first-level fields, only the content returned by select tidb_decode_plan().
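Purely as an illustration of the direction, and not a committed format, a root operator's execution info in the decoded plan could grow from

time:1.3s, loops:2

to something like

time:1.3s, loops:2, user_wait:1.1s, bg_task:{workers:4, tot_exec:3.2s}

where user_wait and bg_task are hypothetical names for the user-perceivable wait and the background task details discussed below.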

The read path

Each time a Next is triggered, the root type tasks and the cop type tasks behave differently.

The root tasks

The time used to prepare a new row chunk. There can be a producer-consumer model doing complex computation such as aggregation or join.

  • Refactor the time field. Try to separate the user-perceivable duration from the background task processing duration, and also record the data preparation duration when consumption is faster than production.
  • Add the task detail duration for background tasks; for example, there can be parallel processing of aggregation or join (see the sketch below).
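A Go sketch of how the user-perceivable wait could be separated from the background worker time in such a producer-consumer root task; all names here are made up for illustration and do not correspond to the actual executor code:

package sketch

import (
    "sync"
    "time"
)

// RootTaskStats splits the latency of a parallel root task (e.g. a parallel
// aggregation) into the part the caller of Next actually waits for and the
// total work done by the background workers.
type RootTaskStats struct {
    mu       sync.Mutex
    UserWait time.Duration // time the Next caller spent blocked waiting for data
    BgWork   time.Duration // total time the workers spent producing and handing off data
}

func (s *RootTaskStats) addBgWork(d time.Duration) {
    s.mu.Lock()
    s.BgWork += d
    s.mu.Unlock()
}

// startWorkers launches the producer side; each worker records the time it
// spent producing and delivering its result.
func startWorkers(n int, produce func() []byte, out chan<- []byte, stats *RootTaskStats) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            start := time.Now()
            out <- produce()
            stats.addBgWork(time.Since(start))
        }()
    }
    go func() { wg.Wait(); close(out) }()
}

// next models the consumer side: only the blocking receive counts towards the
// user-perceivable duration, while BgWork reflects the hidden pipeline cost.
func next(out <-chan []byte, stats *RootTaskStats) ([]byte, bool) {
    start := time.Now()
    data, ok := <-out
    stats.UserWait += time.Since(start)
    return data, ok
}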
The cop tasks

The write path

Split the execution into two phases: the transaction ongoing phase and the transaction committing phase.

Transaction ongoing phase
Transaction committing phase
cfzjywxk (Contributor, Author) commented:

The execution details from TiKV are now displayed in the slow log execdetails:
