Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streamingccl: only return lag replanning error if lagging node has not advanced #115000

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

msbutler
Copy link
Collaborator

Previously, the frontier processor would return a lag replanning error if it detected a lagging node and after the hwm had advanced during the flow. This implies the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm, but was still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw this replanning error because the lagging node is making progress and because replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:

  • the hwm has advanced in the flow
  • two consecutive lagging node checks detected a lagging node and the hwm has not advanced during those two checks.

Informs #114706

Release note: none

…t advanced

Previously, the frontier processor would return a lag replanning error if it
detected a lagging node and after the hwm had advanced during the flow. This
implies the frontier processor could replan as soon as a lagging node finished
its catchup scan and bumped the hwm, but was still far behind the other nodes,
as we observed in cockroachdb#114706. Ideally, the frontier processor should not throw
this replanning error because the lagging node is making progress and because
replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only
throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has
  not advanced during those two checks.

Informs cockroachdb#114706

Release note: none
@msbutler msbutler requested a review from stevendanna November 22, 2023 20:28
@msbutler msbutler self-assigned this Nov 22, 2023
@msbutler msbutler requested review from a team as code owners November 22, 2023 20:29
@msbutler msbutler requested review from herkolategan and srosenberg and removed request for a team November 22, 2023 20:29
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler msbutler added the backport-23.2.x Flags PRs that need to be backported to 23.2. label Nov 22, 2023
@msbutler
Copy link
Collaborator Author

unrelated ci flake

@msbutler
Copy link
Collaborator Author

TFTR!

bors r=stevendanna

@craig
Copy link
Contributor

craig bot commented Nov 27, 2023

Build succeeded:

@craig craig bot merged commit b464c1c into cockroachdb:master Nov 27, 2023
8 checks passed
msbutler added a commit to msbutler/cockroach that referenced this pull request Dec 1, 2023
PR cockroachdb#115000 refactored the lagging node checker to only trigger a replanning
event if the checker detects a lagging node 2 times in a row without hwm
advancement, but maintained the frequency the checker runs. This implies that
it takes twice as long for the checker to trigger a replanning event, relative
to the `stream_replication.replan_flow_frequency` setting. This patch doubles
the frequency the checker runs, implying a replanning event would trigger after
`stream_replication.replan_flow_frequency` time has elapsed.

Informs cockroachdb#115415

Release note: none
craig bot pushed a commit that referenced this pull request Dec 4, 2023
115459: streamingccl: double the frequency we check for lagging nodes r=stevendanna a=msbutler

PR #115000 refactored the lagging node checker to only trigger a replanning event if the checker detects a lagging node 2 times in a row without hwm advancement, but maintained the frequency the checker runs. This implies that it takes twice as long for the checker to trigger a replanning event, relative to the `stream_replication.replan_flow_frequency` setting. This patch doubles the frequency the checker runs, implying a replanning event would trigger after `stream_replication.replan_flow_frequency` time has elapsed.

Informs #115415

Release note: none

Co-authored-by: Michael Butler <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Dec 4, 2023
PR #115000 refactored the lagging node checker to only trigger a replanning
event if the checker detects a lagging node 2 times in a row without hwm
advancement, but maintained the frequency the checker runs. This implies that
it takes twice as long for the checker to trigger a replanning event, relative
to the `stream_replication.replan_flow_frequency` setting. This patch doubles
the frequency the checker runs, implying a replanning event would trigger after
`stream_replication.replan_flow_frequency` time has elapsed.

Informs #115415

Release note: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2. T-disaster-recovery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants