Implement detection and potential mitigation of recovery failure cycles #435
Conversation
…o boost tracing and disable prefetch during replay
Two nits, otherwise looks fantastic!
src/DurableTask.Netherite/StorageLayer/Faster/PartitionStorage.cs
if (this.CheckpointInfo.RecoveryAttempts > 0 || DateTimeOffset.UtcNow - lastModified > TimeSpan.FromMinutes(5))
{
    this.CheckpointInfo.RecoveryAttempts++;

    this.TraceHelper.FasterProgress($"Incremented recovery attempt counter to {this.CheckpointInfo.RecoveryAttempts} in {this.checkpointCompletedBlob.Name}.");

    await this.WriteCheckpointMetadataAsync();

    if (this.CheckpointInfo.RecoveryAttempts > 3 && this.CheckpointInfo.RecoveryAttempts < 30)
    {
        this.TraceHelper.BoostTracing = true;
    }
}

return true;
Given that this could fail indefinitely - should we have a cap on how big this integer can grow? Maybe if it's larger than ~100, it's not worth increasing it further, or is it?
I don't see why capping the counter itself would be useful. No matter how large, it will still give us useful information (also in the traces).
Or did you mean to cap the actual recovery attempts?
In the extreme: I just worry about the integer getting too large to represent, and then causing another class of issues. In general, I think there's no benefit in increasing this counter past ~10k, for example. I'd prefer to have an upper limit here. After ~10k, we know it is simply "too many" anyways. I do feel a bit strongly about this.
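For illustration, a cap like the one suggested here could look roughly as follows (the constant name and its value are hypothetical and not part of this PR):

```csharp
// Hypothetical upper bound on the persisted counter; beyond this, the exact
// count carries no extra information ("too many" is all we need to know).
const int MaxRecoveryAttempts = 10_000;

if (this.CheckpointInfo.RecoveryAttempts < MaxRecoveryAttempts)
{
    this.CheckpointInfo.RecoveryAttempts++;
}
```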
I am running a final test. If that works we can merge and release.
In light of recent issues with FASTER crashing repeatedly during recovery while replaying the commit log, this PR implements several steps that should help us troubleshoot the issue (and possibly mitigate it).
We are adding a recovery attempt counter to the last-checkpoint.json file so we can detect if partition recovery is repeatedly failing.
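Conceptually, the counter lives alongside the other checkpoint metadata. A minimal sketch of the idea, assuming a CheckpointInfo class that is serialized to last-checkpoint.json (the actual class in Netherite has more fields and its own serialization conventions):

```csharp
// Sketch only: the real CheckpointInfo in Netherite contains additional
// checkpoint metadata and uses its own serialization attributes.
class CheckpointInfo
{
    // ... existing checkpoint metadata (log positions, checkpoint tokens, etc.) ...

    // Number of recovery attempts made from this checkpoint. Incremented and
    // persisted back to last-checkpoint.json before replay starts, so repeated
    // failures across process restarts remain visible.
    public int RecoveryAttempts { get; set; }
}
```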
If the number of recovery attempts is between 3 and 30, we are boosting the tracing for the duration of the recovery. This may help us narrow down where the crash happens.
If the number of recovery attempts is larger than 6, we are disabling prefetch during the replay. This means FASTER executes fetch operations sequentially during replay, which slows down the replay A LOT but makes it more deterministic, so we can better pinpoint the failure. It is also possible that this eliminates the failure altogether (e.g. if the bug is a race condition). Slowing the replay down would be a bad idea in general, but it is actually desirable in this situation since it also reduces the frequency of crashes caused by the struggling partition.
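Taken together, the thresholds described above amount to gating logic along these lines. This is a sketch only: BoostTracing appears in the diff above, but the property used to disable prefetch is hypothetical and may be named differently in the actual change.

```csharp
int attempts = this.CheckpointInfo.RecoveryAttempts;

if (attempts > 3 && attempts < 30)
{
    // Verbose tracing while the failure cycle is still worth tracing in detail.
    this.TraceHelper.BoostTracing = true;
}

if (attempts > 6)
{
    // Hypothetical setting name: fall back to sequential fetches during log
    // replay. Much slower, but more deterministic, and it may avoid the crash
    // entirely if the underlying bug is a race condition.
    this.settings.DisablePrefetchDuringReplay = true;
}
```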