Fix persistence liveness check deadlock #269
Self review
@@ -189,7 +216,7 @@ protected override void PreStart()

 private void ScheduleProbeRestart()
 {
-    Context.System.Scheduler.ScheduleTellOnce(_delay, Self, CreateProbe.Instance, Self, _shutdownCancellable);
+    Timers.StartSingleTimer(CreateProbeTimerKey, CreateProbe.Instance, _delay);
Replace scheduler timer calls with IWithTimers to prevent leaks.
 private readonly string _id;
-private readonly Cancelable _shutdownCancellable;
Not needed anymore because we drop the scheduler timer
     ScheduleProbeRestart();
     return true;

 case CreateProbe:
     if(_logInfo)
         _log.Debug("Recreating persistence probe.");

+    Timers.StartSingleTimer(TimeoutTimerKey, CheckTimeout.Instance, _timeout);
Guarding against suicide actor deadlock
We're doing it here inside the probe actor instead of inside the suicide actor because mailbox processing can be blocked inside a persistence actor, making timers unreliable.
Good call
+case CheckTimeout:
+    const string errMsg = "Timeout while checking persistence liveness. Recovery status is undefined.";
+    _log.Warning(errMsg);
+    _currentLivenessStatus = new PersistenceLivenessStatus(errMsg);
+    PublishStatusUpdates();
+
+    if(_probe is not null)
+        Context.Stop(_probe);
Force stop the suicide actor if it stalls/deadlocks
Note that we also emit a new liveness status with a timeout message to let subscribers know about the problem.
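For readers following along, here is a minimal standalone sketch of the watchdog pattern discussed in this thread: the parent arms a single-shot timer via `IWithTimers` and stops the child from the outside if it has not terminated in time. Only `Timers.StartSingleTimer`, `Context.Stop`, and the `CheckTimeout` name come from the diff; `ProbeWatchdog`, the `probeProps` parameter, and the timer key value are hypothetical names for illustration. It assumes the Akka NuGet package and C# 9 or later, and it leaves out the status publishing and probe restart that the real actor performs.

```csharp
using System;
using Akka.Actor;

// Hypothetical stand-in for the PR's timeout message (illustration only).
public sealed class CheckTimeout
{
    public static readonly CheckTimeout Instance = new();
    private CheckTimeout() { }
}

// Hypothetical parent actor that supervises one probe child with a deadline.
public sealed class ProbeWatchdog : ReceiveActor, IWithTimers
{
    private const string TimeoutTimerKey = "probe-timeout"; // assumed key value
    private readonly Props _probeProps;
    private readonly TimeSpan _timeout;
    private IActorRef? _probe;

    // Akka.NET assigns this property for actors that implement IWithTimers.
    public ITimerScheduler Timers { get; set; } = null!;

    public ProbeWatchdog(Props probeProps, TimeSpan timeout)
    {
        _probeProps = probeProps;
        _timeout = timeout;

        Receive<CheckTimeout>(_ =>
        {
            // Deadline passed: stop the child from the parent, because the
            // child may not be processing its own user messages or timers.
            if (_probe is not null)
                Context.Stop(_probe);
        });

        Receive<Terminated>(t =>
        {
            // The child finished or was stopped: disarm the deadline timer.
            if (t.ActorRef.Equals(_probe))
            {
                Timers.Cancel(TimeoutTimerKey);
                _probe = null;
            }
        });
    }

    protected override void PreStart()
    {
        _probe = Context.ActorOf(_probeProps);
        Context.Watch(_probe);
        // Timers started through IWithTimers are tied to this actor's
        // lifecycle and are cancelled when the actor stops.
        Timers.StartSingleTimer(TimeoutTimerKey, CheckTimeout.Instance, _timeout);
    }
}
```

Keeping the timer in the parent means the deadline still fires when the child is stuck in recovery or waiting on a slow journal, which is the reasoning given above for not placing it inside the suicide actor.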
+# Defines the timeout for each liveness check operation
+timeout = 3s
Add new setting to configure check timeout value
LGTM
Fix potential persistence liveness check deadlock if the suicide actor never completes due to persistence actor / storage problems.