Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix persistence liveness check deadlock #269

Merged

Conversation

Arkatufus
Copy link
Contributor

Fix potential persistence liveness check deadlock if the suicide actor never completes due to peristence actor / storage problems.

Copy link
Contributor Author

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review

@@ -189,7 +216,7 @@ protected override void PreStart()

private void ScheduleProbeRestart()
{
Context.System.Scheduler.ScheduleTellOnce(_delay, Self, CreateProbe.Instance, Self, _shutdownCancellable);
Timers.StartSingleTimer(CreateProbeTimerKey, CreateProbe.Instance, _delay);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace scheduler timer calls with IWithTimer to prevent leaks.

private readonly string _id;
private readonly Cancelable _shutdownCancellable;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed anymore because we drop the scheduler timer

ScheduleProbeRestart();
return true;

case CreateProbe:
if(_logInfo)
_log.Debug("Recreating persistence probe.");

Timers.StartSingleTimer(TimeoutTimerKey, CheckTimeout.Instance, _timeout);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guarding against suicide actor deadlock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing it here inside the probe actor instead of inside the suicide actor because mailbox processing can be blocked inside a persistence actor, making timers unreliable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call

Comment on lines +187 to +194
case CheckTimeout:
const string errMsg = "Timeout while checking persistence liveness. Recovery status is undefined.";
_log.Warning(errMsg);
_currentLivenessStatus = new PersistenceLivenessStatus(errMsg);
PublishStatusUpdates();

if(_probe is not null)
Context.Stop(_probe);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force stop the suicide actor if it stalls/deadlocks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we also emit a new liveness status with a timeout message to let subscriber know about the problem.

Comment on lines +37 to +38
# Defines the timeout for each liveness check operation
timeout = 3s
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add new setting to configure check timeout value

Copy link
Member

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ScheduleProbeRestart();
return true;

case CreateProbe:
if(_logInfo)
_log.Debug("Recreating persistence probe.");

Timers.StartSingleTimer(TimeoutTimerKey, CheckTimeout.Instance, _timeout);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call

@Aaronontheweb Aaronontheweb merged commit b6e3aa1 into petabridge:dev Feb 23, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants