
Implement detection and potential mitigation of recovery failure cycles #435

Merged
4 commits merged into main on Oct 31, 2024

Conversation

sebastianburckhardt (Member)

In light of recent issues with FASTER crashing repeatedly during recovery while replaying the commit log, this PR implements several steps that should help us troubleshoot the issue (and possibly mitigate it).

  1. We are adding a recovery attempt counter to the last-checkpoint.json file so we can detect whether partition recovery is failing repeatedly.

  2. If the number of recovery attempts is between 3 and 30, we are boosting the tracing for the duration of the recovery. This may help us pinpoint where the crash happens.

  3. If the number of recovery attempts is larger than 6, we are disabling the prefetch during the replay. This means FASTER executes fetch operations sequentially during replay, which slows down the replay A LOT but makes it more deterministic, so we can better pinpoint the failure. It is also possible that this eliminates the failure (e.g. if the bug is a race condition). Slowing the replay down would be a bad idea in general, but it is actually desirable in this situation, since it also reduces the frequency of crashes caused by the struggling partition. A simplified sketch of the gating logic for steps 2 and 3 follows below.
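As a rough sketch of how steps 2 and 3 gate on the persisted counter (CheckpointInfo.RecoveryAttempts and TraceHelper.BoostTracing appear in the diff discussed below; UsePrefetchDuringReplay is a hypothetical flag used here only for illustration):

    // Sketch only, not the exact PR code: gate tracing and prefetch on the recovery attempt counter.
    int attempts = this.CheckpointInfo.RecoveryAttempts;

    // Step 2: boost tracing while recovery keeps failing, but stop once the partition
    // has clearly been stuck for a long time.
    this.TraceHelper.BoostTracing = attempts > 3 && attempts < 30;

    // Step 3 (hypothetical flag): after repeated failures, replay without prefetch so that
    // fetches run sequentially, which is much slower but more deterministic.
    this.UsePrefetchDuringReplay = attempts <= 6;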

…o boost tracing and disable prefetch during replay
@davidmrdavid (Member) left a comment


Two nits, otherwise looks fantastic!

Comment on lines 1135 to 1149
if (this.CheckpointInfo.RecoveryAttempts > 0 || DateTimeOffset.UtcNow - lastModified > TimeSpan.FromMinutes(5))
{
    // count this recovery attempt and persist the counter, so that repeated
    // recovery failures can be detected across process restarts
    this.CheckpointInfo.RecoveryAttempts++;

    this.TraceHelper.FasterProgress($"Incremented recovery attempt counter to {this.CheckpointInfo.RecoveryAttempts} in {this.checkpointCompletedBlob.Name}.");

    await this.WriteCheckpointMetadataAsync();

    // after several failed attempts (but not indefinitely), boost tracing
    // to help locate where recovery crashes
    if (this.CheckpointInfo.RecoveryAttempts > 3 && this.CheckpointInfo.RecoveryAttempts < 30)
    {
        this.TraceHelper.BoostTracing = true;
    }
}

return true;
davidmrdavid (Member)

Given that this could fail indefinitely, should we have a cap on how big this integer can grow? Maybe if it's larger than ~100, it's not worth increasing it further, or is it?

sebastianburckhardt (Member Author)

I don't see why capping the counter itself would be useful. No matter how large, it will still give us useful information (also in the traces).

Or did you mean to cap the actual recovery attempts?

davidmrdavid (Member)

In the extreme: I just worry about the integer getting too large to represent, and then causing another class of issues. In general, I think there's no benefit in increasing this counter past ~10k, for example. I'd prefer to have an upper limit here. After ~10k, we know it is simply "too many" anyway. I do feel a bit strongly about this.
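For illustration, the capped increment being suggested could look roughly like the following sketch (the constant name and the ~10k value are placeholders, not part of the PR):

    // Sketch of a capped increment; constant name and cap value are illustrative only.
    const int MaxRecoveryAttemptCount = 10_000;

    if (this.CheckpointInfo.RecoveryAttempts < MaxRecoveryAttemptCount)
    {
        this.CheckpointInfo.RecoveryAttempts++;
    }
    // Past the cap the counter stops growing: it already signals "too many" attempts,
    // and it can never grow large enough to cause representation issues.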

@sebastianburckhardt (Member Author)

I am running a final test. If that works we can merge and release.

@sebastianburckhardt merged commit 68fb3e7 into main on Oct 31, 2024
4 checks passed