Implement detection and potential mitigation of recovery failure cycles #435
Conversation
…o boost tracing and disable prefetch during replay
Two nits, otherwise looks fantastic!
src/DurableTask.Netherite/StorageLayer/Faster/PartitionStorage.cs
if (this.CheckpointInfo.RecoveryAttempts > 0 || DateTimeOffset.UtcNow - lastModified > TimeSpan.FromMinutes(5))
{
    this.CheckpointInfo.RecoveryAttempts++;

    this.TraceHelper.FasterProgress($"Incremented recovery attempt counter to {this.CheckpointInfo.RecoveryAttempts} in {this.checkpointCompletedBlob.Name}.");

    await this.WriteCheckpointMetadataAsync();

    if (this.CheckpointInfo.RecoveryAttempts > 3 && this.CheckpointInfo.RecoveryAttempts < 30)
    {
        this.TraceHelper.BoostTracing = true;
    }
}

return true;
Given that this could fail indefinitely - should we have a cap on how big this integer can grow? Maybe if it's larger than ~100, it's not worth increasing it further, or is it?
I don't see why capping the counter itself would be useful. No matter how large, it will still give us useful information (also in the traces).
Or did you mean to cap the actual recovery attempts?
In the extreme: I just worry about the integer getting too large to represent, and then causing another class of issues. In general, I think there's no benefit in increasing this counter past ~10k, for example. I'd prefer to have an upper limit here. After ~10k, we know it is simply "too many" anyways. I do feel a bit strongly about this.
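For illustration, a cap like the one suggested here could look roughly as follows (the constant name and its value are hypothetical and not part of this PR):

```csharp
// Hypothetical upper bound on the persisted counter; beyond this, the exact
// count carries no extra information ("too many" is all we need to know).
const int MaxRecoveryAttempts = 10_000;

if (this.CheckpointInfo.RecoveryAttempts < MaxRecoveryAttempts)
{
    this.CheckpointInfo.RecoveryAttempts++;
}
```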
I am running a final test. If that works we can merge and release.
In light of recent issues with FASTER crashing repeatedly during recovery while replaying the commit log, this PR implements several steps that should help us troubleshoot the issue (and possibly mitigate it).
We are adding a recovery attempt counter to the last-checkpoint.json file so we can detect if partition recovery is repeatedly failing.
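Conceptually, the counter lives alongside the other checkpoint metadata. A minimal sketch of the idea, assuming a CheckpointInfo class that is serialized to last-checkpoint.json (the actual class in Netherite has more fields and its own serialization conventions):

```csharp
// Sketch only: the real CheckpointInfo in Netherite contains additional
// checkpoint metadata and uses its own serialization attributes.
class CheckpointInfo
{
    // ... existing checkpoint metadata (log positions, checkpoint tokens, etc.) ...

    // Number of recovery attempts made from this checkpoint. Incremented and
    // persisted back to last-checkpoint.json before replay starts, so repeated
    // failures across process restarts remain visible.
    public int RecoveryAttempts { get; set; }
}
```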
If the number of recovery attempts is between 3 and 30, we are boosting the tracing for the duration of the recovery. This may help us narrow down where the crash happens.
If the number of recovery attempts is larger than 6, we are disabling prefetch during the replay. This means FASTER executes fetch operations sequentially during replay, which slows down the replay A LOT but makes it more deterministic, so we can better pinpoint the failure. It is also possible that this eliminates the failure altogether (e.g. if the bug is a race condition). Slowing the replay down would be a bad idea in general, but it is actually desirable in this situation since it also reduces the frequency of crashes caused by the struggling partition.
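Taken together, the thresholds described above amount to gating logic along these lines. This is a sketch only: BoostTracing appears in the diff above, but the property used to disable prefetch is hypothetical and may be named differently in the actual change.

```csharp
int attempts = this.CheckpointInfo.RecoveryAttempts;

if (attempts > 3 && attempts < 30)
{
    // Verbose tracing while the failure cycle is still worth tracing in detail.
    this.TraceHelper.BoostTracing = true;
}

if (attempts > 6)
{
    // Hypothetical setting name: fall back to sequential fetches during log
    // replay. Much slower, but more deterministic, and it may avoid the crash
    // entirely if the underlying bug is a race condition.
    this.settings.DisablePrefetchDuringReplay = true;
}
```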