-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.2: rangefeed: cancel async tasks on processor stop #118411
release-23.2: rangefeed: cancel async tasks on processor stop #118411
Conversation
Epic: none Release note: None
Previously, async tasks spawned by the rangefeed processor (typically txn pushes and resolved timestamp scans) were not cancelled when the processor was stopped or the stopper quiesced. If these operations stalled, this could lead to goroutine leaks and node shutdown stalls. However, this was mitigated to some extent by the intent resolver itself detecting stopper quiescence. This patch uses a separate task context for async tasks, which is cancelled either when the processor is stopped or the stopper quiesces. In general, the rangefeed scheduler shutdown logic could use some improvement, but this patch does not attempt a broader cleanup in the interest of backportability. Epic: none Release note: None
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rharding6373)
pkg/kv/kvserver/rangefeed/task.go
line 64 at r2 (raw file):
if err := s.iterateAndConsume(ctx); err != nil { err = errors.Wrap(err, "initial resolved timestamp scan failed") if ctx.Err() == nil { // cancellation probably caused the error
Out of curiosity, why do we not want to log these errors when it wasn't caused by a cancellation?
We log the error when it wasn't caused by cancellation. If it's cancelled, the system is shutting down, so these are just noise -- potentially hundreds of them. |
This is part of a #117612 backport.
This issue only affects the new scheduler-based rangefeed processor in 23.2 (off by default), the legacy processor handled this properly.
Backport 2/2 commits from #117859.
Release justification: prerequisite for backporting #117612.
/cc @cockroachdb/release
Previously, async tasks spawned by the rangefeed processor (typically txn pushes and resolved timestamp scans) were not cancelled when the processor was stopped or the stopper quiesced. If these operations stalled, this could lead to goroutine leaks and node shutdown stalls. However, this was mitigated to some extent by the intent resolver itself detecting stopper quiescence.
This patch uses a separate task context for async tasks, which is cancelled either when the processor is stopped or the stopper quiesces.
In general, the rangefeed scheduler shutdown logic could use some improvement, but this patch does not attempt a broader cleanup in the interest of backportability.
Epic: none
Release note: None