From 03e50755d1bcbc1eac544e6c299a93bcbd0d6647 Mon Sep 17 00:00:00 2001
From: Wenqi Mou
Date: Tue, 6 Aug 2024 20:30:07 -0400
Subject: [PATCH 1/6] add metrics for restore

Signed-off-by: Wenqi Mou
---
 grafana-tikv-dashboard.md | 41 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index b4ef20b9e866f..61cd11b13780e 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -484,6 +484,47 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Initial Scanning Trigger Reason: The reason for triggering incremental scanning
 - Region Checkpoint Key Putting: The number of checkpoint operations logged to the PD

+### Snapshot restore
+
+- Import CPU Utilization: CPU utilization aggregated by sst importer.
+- Import Thread Count: number of threads used by sst importer.
+- Import Errors: error counts during sst import.
+- Import RPC Duration: time spent for various RPC calls in sst importer.
+- Import RPC Ops: number of total RPC calls in sst importer.
+- Import RPC Count: number of inflight RPC calls in sst importer.
+- Import Write/Download RPC Duration: RPC time for write/download in sst importer.
+- Import Wait Duration: time spent on downloading task waiting in queue for execution.
+- Import Read SST Duration: time spent on reading from external storage of a file and download it to TiKV.
+- Import Rewrite SST Duration: time spent on rewriting SST based on rewrite rules.
+- Import Ingest RPC Duration: time spent on sending ingest response in RPC call.
+- Import Ingest SST Duration: time spent on ingesting SST into RocksDB.
+- Import Ingest SST Bytes: number of bytes ingested.
+- Import Download SST Throughput: SST download throughput in bytes per second.
+- Import Local Write keys: ???
+- Import Local Write bytes: ???
+- TTL Expired: number of expired items after TTL in backup files.
+- cloud request: number of request to cloud providers.
+
+### Point-in-Time Restore
+
+- CPU Usage: CPU utilization by PITR.
+- P99 RPC Duration: 99 percentile of the RPC requests time.
+- Import RPC Ops: number of total RPC calls in sst importer.
+- Import RPC Count: number of inflight RPC calls in sst importer.
+- Cache Events: number of events on file cache during sst import.
+- Overall RPC Duration: time spent on RPC calls.
+- Read File into Memory Duration: time spent on downloading files from external storage and loaded in to memory.
+- Queuing Time: time spent on waiting to get scheduled on a thread.
+- Apply Request Throughput: Apply request rate in bytes.
+- Downloaded File Size: downloaded file size in bytes.
+- Apply Batch Size: number of bytes for applying to raft engine in one batch.
+- Blocked by Concurrency Time: time spent on waiting to get executed due to concurrency constraint.
+- Apply Request Speed: speed of applying request to raft engine.
+- Cached File in Memory: files cached by the applying requests of importer.
+- Engine Requests Unfinished: number of pending requests to raft engine.
+- Apply Time: time spent on writing data to the Raft engine.
+- Raft Store Memory Usage: memory usage for raft store.
+
 > **Note:**
 >
 > The following monitoring metrics all use TiDB nodes as their data source, but they have some impact on the log backup process. Therefore, they are placed in the **TiKV Details** dashboard for ease of reference. TiKV actively pushes progress most of the time, but it is normal for some of the following monitoring metrics to occasionally not have sampled data.
From 3040e480924f5ae6a2a8be5bd1b8d6d90d979dd5 Mon Sep 17 00:00:00 2001
From: Wenqi Mou
Date: Wed, 7 Aug 2024 11:40:50 -0400
Subject: [PATCH 2/6] address comments

Signed-off-by: Wenqi Mou
---
 grafana-tikv-dashboard.md | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 61cd11b13780e..fa74f2e51cea9 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -496,12 +496,10 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Import Wait Duration: time spent on downloading task waiting in queue for execution.
 - Import Read SST Duration: time spent on reading from external storage of a file and download it to TiKV.
 - Import Rewrite SST Duration: time spent on rewriting SST based on rewrite rules.
-- Import Ingest RPC Duration: time spent on sending ingest response in RPC call.
+- Import Ingest RPC Duration: time spent on handling ingest RPC request on TiKV.
 - Import Ingest SST Duration: time spent on ingesting SST into RocksDB.
 - Import Ingest SST Bytes: number of bytes ingested.
 - Import Download SST Throughput: SST download throughput in bytes per second.
-- Import Local Write keys: ???
-- Import Local Write bytes: ???
 - TTL Expired: number of expired items after TTL in backup files.
 - cloud request: number of request to cloud providers.

@@ -517,13 +515,13 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Queuing Time: time spent on waiting to get scheduled on a thread.
 - Apply Request Throughput: Apply request rate in bytes.
 - Downloaded File Size: downloaded file size in bytes.
-- Apply Batch Size: number of bytes for applying to raft engine in one batch.
+- Apply Batch Size: number of bytes for applying to Raft store in one batch.
 - Blocked by Concurrency Time: time spent on waiting to get executed due to concurrency constraint.
-- Apply Request Speed: speed of applying request to raft engine.
+- Apply Request Speed: speed of applying request to Raft store.
 - Cached File in Memory: files cached by the applying requests of importer.
-- Engine Requests Unfinished: number of pending requests to raft engine.
-- Apply Time: time spent on writing data to the Raft engine.
-- Raft Store Memory Usage: memory usage for raft store.
+- Engine Requests Unfinished: number of pending requests to Raft store.
+- Apply Time: time spent on writing data to the Raft store.
+- Raft Store Memory Usage: memory usage for Raft store.

 > **Note:**
 >

From 2528780e489208714173f8ab0e56c1e1082cc6b6 Mon Sep 17 00:00:00 2001
From: Wenqi Mou
Date: Sun, 11 Aug 2024 20:37:49 -0400
Subject: [PATCH 3/6] address comments

Signed-off-by: Wenqi Mou
---
 grafana-tikv-dashboard.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index fa74f2e51cea9..9f5df717b0bf3 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -484,7 +484,7 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Initial Scanning Trigger Reason: The reason for triggering incremental scanning
 - Region Checkpoint Key Putting: The number of checkpoint operations logged to the PD

-### Snapshot restore
+### Import & Snapshot restore

 - Import CPU Utilization: CPU utilization aggregated by sst importer.
 - Import Thread Count: number of threads used by sst importer.
@@ -500,7 +500,6 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Import Ingest SST Duration: time spent on ingesting SST into RocksDB.
 - Import Ingest SST Bytes: number of bytes ingested.
 - Import Download SST Throughput: SST download throughput in bytes per second.
-- TTL Expired: number of expired items after TTL in backup files.
 - cloud request: number of request to cloud providers.

 ### Point-in-Time Restore

From 1af44780ba7824682c2042093216e9b952fe378d Mon Sep 17 00:00:00 2001
From: Wenqi Mou
Date: Mon, 19 Aug 2024 18:37:38 -0400
Subject: [PATCH 4/6] address comments

---
 grafana-tikv-dashboard.md | 76 +++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 9f5df717b0bf3..8c01856b24e7f 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -484,44 +484,6 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Initial Scanning Trigger Reason: The reason for triggering incremental scanning
 - Region Checkpoint Key Putting: The number of checkpoint operations logged to the PD

-### Import & Snapshot restore
-
-- Import CPU Utilization: CPU utilization aggregated by sst importer.
-- Import Thread Count: number of threads used by sst importer.
-- Import Errors: error counts during sst import.
-- Import RPC Duration: time spent for various RPC calls in sst importer.
-- Import RPC Ops: number of total RPC calls in sst importer.
-- Import RPC Count: number of inflight RPC calls in sst importer.
-- Import Write/Download RPC Duration: RPC time for write/download in sst importer.
-- Import Wait Duration: time spent on downloading task waiting in queue for execution.
-- Import Read SST Duration: time spent on reading from external storage of a file and download it to TiKV.
-- Import Rewrite SST Duration: time spent on rewriting SST based on rewrite rules.
-- Import Ingest RPC Duration: time spent on handling ingest RPC request on TiKV.
-- Import Ingest SST Duration: time spent on ingesting SST into RocksDB.
-- Import Ingest SST Bytes: number of bytes ingested.
-- Import Download SST Throughput: SST download throughput in bytes per second.
-- cloud request: number of request to cloud providers.
-
-### Point-in-Time Restore
-
-- CPU Usage: CPU utilization by PITR.
-- P99 RPC Duration: 99 percentile of the RPC requests time.
-- Import RPC Ops: number of total RPC calls in sst importer.
-- Import RPC Count: number of inflight RPC calls in sst importer.
-- Cache Events: number of events on file cache during sst import.
-- Overall RPC Duration: time spent on RPC calls.
-- Read File into Memory Duration: time spent on downloading files from external storage and loaded in to memory.
-- Queuing Time: time spent on waiting to get scheduled on a thread.
-- Apply Request Throughput: Apply request rate in bytes.
-- Downloaded File Size: downloaded file size in bytes.
-- Apply Batch Size: number of bytes for applying to Raft store in one batch.
-- Blocked by Concurrency Time: time spent on waiting to get executed due to concurrency constraint.
-- Apply Request Speed: speed of applying request to Raft store.
-- Cached File in Memory: files cached by the applying requests of importer.
-- Engine Requests Unfinished: number of pending requests to Raft store.
-- Apply Time: time spent on writing data to the Raft store.
-- Raft Store Memory Usage: memory usage for Raft store.
-
 > **Note:**
 >
 > The following monitoring metrics all use TiDB nodes as their data source, but they have some impact on the log backup process. Therefore, they are placed in the **TiKV Details** dashboard for ease of reference. TiKV actively pushes progress most of the time, but it is normal for some of the following monitoring metrics to occasionally not have sampled data.
@@ -533,6 +495,44 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Get Region Operation Count: The number of times the coordinator requests Region information from the PD
 - Try Advance Trigger Time: The time taken for the coordinator to attempt to advance the checkpoint

+### Backup & Import
+
+- Import CPU Utilization: The CPU utilization aggregated by SST importer.
+- Import Thread Count: The number of threads used by SST importer.
+- Import Errors: The number of errors encountered during SST import.
+- Import RPC Duration: The time spent on various RPC calls in SST importer.
+- Import RPC Ops: The total number of RPC calls in SST importer.
+- Import RPC Count: The number of in-flight RPC calls in SST importer.
+- Import Write/Download RPC Duration: The RPC time for write or download operations in SST importer.
+- Import Wait Duration: The time spent waiting in queue for download task execution.
+- Import Read SST Duration: The time spent reading a file from external storage and downloading it to TiKV.
+- Import Rewrite SST Duration: The time spent rewriting SST based on rewrite rules.
+- Import Ingest RPC Duration: The time spent handling ingest RPC requests on TiKV.
+- Import Ingest SST Duration: The time spent ingesting SST into RocksDB.
+- Import Ingest SST Bytes: The number of bytes ingested.
+- Import Download SST Throughput: The SST download throughput in bytes per second.
+- cloud request: The number of requests to cloud providers.
+
+### Point In Time Restore
+
+- CPU Usage: The CPU utilization by point-in-time recovery (PITR).
+- P99 RPC Duration: The 99th percentile of RPC request time.
+- Import RPC Ops: The total number of RPC calls in SST importer.
+- Import RPC Count: The number of inf-light RPC calls in SST importer.
+- Cache Events: The number of events on file cache during SST import.
+- Overall RPC Duration: The time spent on RPC calls.
+- Read File into Memory Duration: The time spent downloading files from external storage and loading them into memory.
+- Queuing Time: The time spent waiting to be scheduled on a thread.
+- Apply Request Throughput: The rate of applying requests in bytes.
+- Downloaded File Size: The size of downloaded file in bytes.
+- Apply Batch Size: The number of bytes for applying to Raft store in one batch.
+- Blocked by Concurrency Time: The time spent waiting for execution due to concurrency constraints.
+- Apply Request Speed: The speed of applying request to Raft store.
+- Cached File in Memory: The files cached by the applying requests of importer.
+- Engine Requests Unfinished: The number of pending requests to Raft store.
+- Apply Time: The time spent writing data to the Raft store.
+- Raft Store Memory Usage: The memory usage for Raft store.
+
 ### Explanation of Common Parameters

 #### gRPC Message Type

From 620bca09e927d6f658ae8f3c3742ea3a4ff70917 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Tue, 20 Aug 2024 12:36:42 +0800
Subject: [PATCH 5/6] Apply suggestions from code review

Co-authored-by: Grace Cai
---
 grafana-tikv-dashboard.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 8c01856b24e7f..02170d21dfe2f 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -502,13 +502,13 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Import Errors: The number of errors encountered during SST import.
 - Import RPC Duration: The time spent on various RPC calls in SST importer.
 - Import RPC Ops: The total number of RPC calls in SST importer.
-- Import RPC Count: The number of in-flight RPC calls in SST importer.
+- Import RPC Count: The number of RPC calls being processed by SST importer.
 - Import Write/Download RPC Duration: The RPC time for write or download operations in SST importer.
 - Import Wait Duration: The time spent waiting in queue for download task execution.
-- Import Read SST Duration: The time spent reading a file from external storage and downloading it to TiKV.
-- Import Rewrite SST Duration: The time spent rewriting SST based on rewrite rules.
+- Import Read SST Duration: The time spent reading an SST file from external storage and downloading it to TiKV.
+- Import Rewrite SST Duration: The time spent rewriting the SST file based on rewrite rules.
 - Import Ingest RPC Duration: The time spent handling ingest RPC requests on TiKV.
-- Import Ingest SST Duration: The time spent ingesting SST into RocksDB.
+- Import Ingest SST Duration: The time spent ingesting the SST file into RocksDB.
 - Import Ingest SST Bytes: The number of bytes ingested.
 - Import Download SST Throughput: The SST download throughput in bytes per second.
 - cloud request: The number of requests to cloud providers.
@@ -516,10 +516,10 @@ This section provides a detailed description of these key metrics on the **TiKV-
 ### Point In Time Restore

 - CPU Usage: The CPU utilization by point-in-time recovery (PITR).
-- P99 RPC Duration: The 99th percentile of RPC request time.
+- P99 RPC Duration: The 99th percentile of RPC request duration.
 - Import RPC Ops: The total number of RPC calls in SST importer.
-- Import RPC Count: The number of inf-light RPC calls in SST importer.
-- Cache Events: The number of events on file cache during SST import.
+- Import RPC Count: The number of RPC calls being processed by SST importer.
+- Cache Events: The number of events in the file cache during SST import.
 - Overall RPC Duration: The time spent on RPC calls.
 - Read File into Memory Duration: The time spent downloading files from external storage and loading them into memory.
 - Queuing Time: The time spent waiting to be scheduled on a thread.
@@ -528,7 +528,7 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Apply Batch Size: The number of bytes for applying to Raft store in one batch.
 - Blocked by Concurrency Time: The time spent waiting for execution due to concurrency constraints.
 - Apply Request Speed: The speed of applying request to Raft store.
-- Cached File in Memory: The files cached by the applying requests of importer.
+- Cached File in Memory: The files cached by the applying requests of SST importer.
 - Engine Requests Unfinished: The number of pending requests to Raft store.
 - Apply Time: The time spent writing data to the Raft store.
 - Raft Store Memory Usage: The memory usage for Raft store.

From 350fa2aaaa0ec3c18573771d4bc9cfcb1c4b1000 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Tue, 20 Aug 2024 13:09:57 +0800
Subject: [PATCH 6/6] Apply suggestions from code review

---
 grafana-tikv-dashboard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 02170d21dfe2f..801408895b7ab 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -530,7 +530,7 @@ This section provides a detailed description of these key metrics on the **TiKV-
 - Apply Request Speed: The speed of applying request to Raft store.
 - Cached File in Memory: The files cached by the applying requests of SST importer.
 - Engine Requests Unfinished: The number of pending requests to Raft store.
-- Apply Time: The time spent writing data to the Raft store.
+- Apply Time: The time spent writing data to Raft store.
 - Raft Store Memory Usage: The memory usage for Raft store.

 ### Explanation of Common Parameters