ticdc: add description about gc-ttl (#6835) (#6963)
ti-chi-bot authored Nov 25, 2021
1 parent 88793e5 commit b94e95c
Showing 4 changed files with 46 additions and 4 deletions.
Binary file added media/ticdc-state-transfer.png
2 changes: 1 addition & 1 deletion ticdc/deploy-ticdc.md
@@ -50,7 +50,7 @@ cdc server --pd=http://10.0.10.25:2379 --log-file=ticdc_3.log --addr=0.0.0.0:830

The following are descriptions of options available in the `cdc server` command:

- `gc-ttl`: The TTL (Time To Live) of the service level `GC safepoint` in PD set by TiCDC, in seconds. The default value is `86400`, which means 24 hours.
- `gc-ttl`: The TTL (Time To Live) of the service-level `GC safepoint` that TiCDC sets in PD, and the duration for which a replication task can stay suspended, in seconds. The default value is `86400`, which means 24 hours. Note: Suspending a TiCDC replication task holds back the TiCDC GC safepoint, which in turn holds back the GC progress of the upstream TiDB cluster, as detailed in [Complete Behavior of TiCDC GC safepoint](/ticdc/troubleshoot-ticdc.md#what-is-the-complete-behavior-of-ticdc-garbage-collection-gc-safepoint).
- `pd`: The URL of the PD client.
- `addr`: The listening address of TiCDC, the HTTP API address, and the Prometheus address of the service.
- `advertise-addr`: The access address of TiCDC to the outside world.
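
For illustration, a minimal `cdc server` invocation that sets `gc-ttl` explicitly might look like the sketch below, assuming a typical single-node setup; the PD address, listening addresses, and log file path are placeholders, and `86400` simply restates the default.

```shell
# Sketch: start a TiCDC server with an explicit gc-ttl (all values are placeholders).
cdc server \
    --pd=http://10.0.10.25:2379 \
    --addr=0.0.0.0:8300 \
    --advertise-addr=10.0.10.25:8300 \
    --log-file=ticdc.log \
    --gc-ttl=86400   # 24 hours; raise this if replication tasks may stay suspended longer
```
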
27 changes: 27 additions & 0 deletions ticdc/manage-ticdc.md
@@ -73,6 +73,33 @@ If you deploy TiCDC using TiUP, replace `cdc cli` in the following commands with

### Manage replication tasks (`changefeed`)

#### State transfer of replication tasks

This feature is available in TiDB v4.0.16 and later versions.

The state of a replication task represents its running status. While TiCDC is running, a replication task might fail with an error, be manually paused or resumed, or reach the specified `TargetTs`. Each of these behaviors can change the state of the replication task. This section describes the states of TiCDC replication tasks and the transfer relationships between states.

![TiCDC state transfer](/media/ticdc-state-transfer.png)

The states in the above state transfer diagram are described as follows:

- `Normal`: The replication task runs normally and the `checkpoint-ts` proceeds normally.
- `Stopped`: The replication task is stopped because the user has manually paused the changefeed. The changefeed in this state blocks GC operations.
- `Error`: The replication task reports an error and cannot continue due to recoverable errors. The changefeed in this state keeps trying to resume until the state transfers to `Normal`. The changefeed in this state blocks GC operations.
- `Finished`: The replication task is finished and has reached the preset `TargetTs`. The changefeed in this state does not block GC operations.
- `Failed`: The replication task fails due to unrecoverable errors and cannot be resumed or recovered. The changefeed in this state does not block GC operations.

The numbers in the above state transfer diagram are described as follows:

- ① Execute the `changefeed pause` command to pause the replication task.
- ② Execute the `changefeed resume` command to resume the replication task.
- ③ A recoverable error occurs during the `changefeed` operation, and the replication resumes automatically.
- ④ Execute the `changefeed resume` command to resume the replication task.
- ⑤ A recoverable error occurs during the `changefeed` operation.
- ⑥ The `changefeed` reaches the preset `TargetTs`, and the replication stops automatically.
- ⑦ The `changefeed` stays suspended longer than the duration specified by `gc-ttl` and cannot be resumed.
- ⑧ The `changefeed` encounters an unrecoverable error when attempting automatic recovery.
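
As a sketch of transitions ① and ②, a changefeed is paused and resumed with the `cdc cli changefeed pause` and `cdc cli changefeed resume` subcommands; the PD address and the changefeed ID below are placeholders.

```shell
# ① Pause a running changefeed (Normal -> Stopped); the changefeed ID is a placeholder.
cdc cli changefeed pause --pd=http://10.0.10.25:2379 --changefeed-id=simple-replication-task

# ② Resume the paused changefeed (Stopped -> Normal).
cdc cli changefeed resume --pd=http://10.0.10.25:2379 --changefeed-id=simple-replication-task
```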

#### Create a replication task

Execute the following commands to create a replication task:
21 changes: 18 additions & 3 deletions ticdc/troubleshoot-ticdc.md
@@ -139,13 +139,28 @@ cdc cli changefeed update -c <changefeed-id> --sort-engine="unified" --sort-dir=
## What is `gc-ttl` in TiCDC?

Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.
Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC.

When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

When starting the TiCDC server, you can specify the Time To Live (TTL) duration of the GC safepoint by configuring `gc-ttl`. The default value is 24 hours. In TiCDC, this value means:

- The maximum time the GC safepoint is retained in PD after the TiCDC service is stopped.
- The maximum time a replication task can stay suspended after the task is interrupted or manually stopped. If a replication task stays suspended longer than the value set by `gc-ttl`, the task enters the `failed` status, cannot be resumed, and no longer blocks the progress of the GC safepoint.

The second behavior above is introduced in TiCDC v4.0.13 and later versions. It prevents a replication task from staying suspended for so long that the GC safepoint of the upstream TiKV cluster stops advancing for an extended period, which would retain too many outdated data versions and degrade the performance of the upstream cluster.

> **Note:**
>
> In some scenarios, for example, when you use TiCDC for incremental replication after full replication with Dumpling/BR, the default 24 hours of `gc-ttl` may not be sufficient. You need to specify an appropriate value for `gc-ttl` when you start the TiCDC server.
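
As a sketch of that scenario, if a full import with Dumpling/BR is expected to take up to two days, you might start the TiCDC server with a correspondingly larger `gc-ttl`; the PD address is a placeholder, and `172800` (48 hours) is only an example value, not a recommendation.

```shell
# Sketch: allow replication tasks to stay suspended for up to 48 hours (example value).
cdc server \
    --pd=http://10.0.10.25:2379 \
    --addr=0.0.0.0:8300 \
    --gc-ttl=172800
```
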
## What is the complete behavior of TiCDC garbage collection (GC) safepoint?

If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest value of `checkpoint-ts` among all replication tasks. The service GC safepoint ensures that TiCDC does not delete data generated at that time and after that time. If the replication task is interrupted, the `checkpoint-ts` of this task does not change and PD's corresponding service GC safepoint is not updated either. The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted.
If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest `checkpoint-ts` value among all replication tasks. The service GC safepoint ensures that the data generated at that time and later is not cleaned by GC. If a replication task is interrupted or manually stopped, its `checkpoint-ts` does not change, and PD's corresponding service GC safepoint is not updated either.

If a replication task stays suspended longer than the time specified by `gc-ttl`, the task enters the `failed` status, cannot be resumed, and no longer blocks the corresponding service GC safepoint in PD from advancing.

The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted.
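
To check which service GC safepoints are currently registered in PD, including the one set by TiCDC, recent `pd-ctl` versions provide a `service-gc-safepoint` command; treat its availability and output format as an assumption to verify against your PD version.

```shell
# Assumption: the pd-ctl build in your deployment supports the service-gc-safepoint command.
# It lists the registered service GC safepoints (including TiCDC's) together with their TTLs.
pd-ctl -u http://10.0.10.25:2379 service-gc-safepoint
```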

## How do I handle the `Error 1298: Unknown or incorrect time zone: 'UTC'` error when creating the replication task or replicating data to MySQL?

