streamingccl: only return lag replanning error if lagging node has not advanced #115000

msbutler · 2023-11-22T20:28:59Z

Previously, the frontier processor would return a lag replanning error if it detected a lagging node after the high-water mark (hwm) had advanced during the flow. This meant the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm while still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw a replanning error in that case, because the lagging node is making progress and because replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:

- the hwm has advanced in the flow, and
- two consecutive lagging node checks detected a lagging node and the hwm has not advanced during those two checks.

Informs #114706

Release note: none
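The rule described above can be sketched as a small state machine. This is a minimal illustration and not the code in this patch: `lagChecker`, `check`, `errNodeLagging`, and the integer timestamps are all hypothetical names and simplifications chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeLagging is a hypothetical stand-in for the lag replanning error.
var errNodeLagging = errors.New("lagging node has not advanced; replanning")

// lagChecker is a hypothetical, simplified model of the check. Timestamps
// are plain integers here rather than real cluster timestamps.
type lagChecker struct {
	initialHWM      int64 // hwm when the flow started
	sawLagLastCheck bool  // did the previous check detect a lagging node?
	hwmAtLastLag    int64 // hwm observed by that previous positive check
}

// check runs one lagging node check. It returns errNodeLagging only when the
// hwm has advanced in the flow and this is the second consecutive check that
// found a lagging node with no hwm movement in between.
func (c *lagChecker) check(hwm int64, lagging bool) error {
	if !lagging {
		c.sawLagLastCheck = false
		return nil
	}
	advancedInFlow := hwm > c.initialHWM
	stalled := c.sawLagLastCheck && hwm == c.hwmAtLastLag
	c.sawLagLastCheck, c.hwmAtLastLag = true, hwm
	if advancedInFlow && stalled {
		return errNodeLagging
	}
	return nil
}

func main() {
	c := &lagChecker{initialHWM: 10}
	// The lagging node finishes its catchup scan and bumps the hwm to 20,
	// keeps advancing to 25, and then stalls.
	for _, hwm := range []int64{20, 25, 25} {
		fmt.Printf("hwm=%d err=%v\n", hwm, c.check(hwm, true))
	}
}
```

In this sketch, the first two checks return no error because the hwm moved between them. Only the third check, which sees a lagging node at the same hwm as the previous check, returns the replanning error.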



msbutler · 2023-11-22T21:12:08Z

unrelated ci flake

msbutler · 2023-11-27T16:54:33Z

TFTR!

bors r=stevendanna

craig · 2023-11-27T17:47:25Z

Build succeeded:

Bazel Essential CI (Cockroach)


115459: streamingccl: double the frequency we check for lagging nodes r=stevendanna a=msbutler

PR #115000 refactored the lagging node checker to trigger a replanning event only if it detects a lagging node twice in a row without hwm advancement, but left the frequency at which the checker runs unchanged. As a result, the checker takes twice as long to trigger a replanning event, relative to the `stream_replication.replan_flow_frequency` setting. This patch doubles the frequency at which the checker runs, so that a replanning event triggers after `stream_replication.replan_flow_frequency` has elapsed.

Informs #115415

Release note: none

Co-authored-by: Michael Butler
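The timing argument in the follow-up patch comes down to simple arithmetic, sketched here. The function and constant names are hypothetical and are not taken from the patch. The sketch assumes the checker must see two consecutive positive checks before replanning, as described for #115000.

```go
package main

import (
	"fmt"
	"time"
)

// consecutiveLagChecks is the number of back-to-back positive checks assumed
// to be required before a replanning event (hypothetical constant).
const consecutiveLagChecks = 2

// lagCheckInterval returns how often the checker should run so that the
// required consecutive checks fit within one replan frequency period.
func lagCheckInterval(replanFrequency time.Duration) time.Duration {
	return replanFrequency / consecutiveLagChecks
}

// timeToReplan returns how long after the checker starts a stalled lagging
// node triggers a replan, given the interval between checks.
func timeToReplan(checkInterval time.Duration) time.Duration {
	return consecutiveLagChecks * checkInterval
}

func main() {
	freq := 10 * time.Minute // example value for the replan frequency setting

	// Checking once per replan frequency: two checks take twice as long.
	fmt.Println("unchanged interval:", timeToReplan(freq))

	// Checking twice per replan frequency restores the intended delay.
	fmt.Println("halved interval:", timeToReplan(lagCheckInterval(freq)))
}
```

With the interval halved, the two required checks complete within one `stream_replication.replan_flow_frequency` period, which is the behavior the setting name suggests.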


msbutler added the T-disaster-recovery label Nov 22, 2023

msbutler requested a review from stevendanna November 22, 2023 20:28

msbutler self-assigned this Nov 22, 2023

msbutler requested review from a team as code owners November 22, 2023 20:29

msbutler requested review from herkolategan and srosenberg and removed request for a team November 22, 2023 20:29

msbutler removed request for herkolategan and srosenberg November 22, 2023 20:29

msbutler added the backport-23.2.x label Nov 22, 2023

stevendanna approved these changes Nov 27, 2023


msbutler mentioned this pull request Nov 27, 2023

roachtest: c2c/BulkOps/full failed #115052

Closed

craig bot merged commit b464c1c into cockroachdb:master Nov 27, 2023
8 checks passed

blathers-crl bot mentioned this pull request Nov 27, 2023

release-23.2: streamingccl: only return lag replanning error if lagging node has not advanced #115103

Merged

msbutler mentioned this pull request Dec 1, 2023

streamingccl: double the frequency we check for lagging nodes #115459

Merged

blathers-crl bot mentioned this pull request Dec 4, 2023

release-23.2: streamingccl: double the frequency we check for lagging nodes #115543

Merged
