Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling for FASTER IO completion callbacks #349

Merged
merged 7 commits into from
Mar 18, 2024

Conversation

sebastianburckhardt
Copy link
Member

Two changes:

  • wrap FASTER callbacks so that exceptions are noticed
  • release semaphore prior to calling back into FASTER (to avoid losing the semaphore should FASTER hang)

Copy link
Member

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor questions

}
catch (Exception ex)
{
this.BlobManager.StorageTracer?.FasterStorageError($"{nameof(CancelAllRequests)} for access id={id} failed during FASTER completion callback", ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to re-throw here? Aren't we swallowing the exception

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no one to meaningfully catch exceptions here. And the partition has already been terminated (that's why the requests are being cancelled). So it is just about tracing at this point.

Comment on lines -384 to +400
.ContinueWith((Task t) =>
{
if (this.pendingReadWriteOperations.TryRemove(id, out ReadWriteRequestInfo request))
{
if (t.IsFaulted)
{
this.BlobManager?.StorageTracer?.FasterStorageProgress($"StorageOpReturned AzureStorageDevice.ReadAsync id={id} (Failure)");
request.Callback(uint.MaxValue, request.NumBytes, request.Context);
}
else
{
this.BlobManager?.StorageTracer?.FasterStorageProgress($"StorageOpReturned AzureStorageDevice.ReadAsync id={id}");
request.Callback(0, request.NumBytes, request.Context);
}
}
}, TaskContinuationOptions.ExecuteSynchronously);
// we are not awaiting this task because it uses FASTER's callback mechanism
// when the access is completed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't immediately see why it's safe to remove this logic. Mind elaborating? Perhaps it's related to your comment that this task isn't awaited, but I did not fully get that either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic was not removed, just placed somewhere else. These callbacks now happen right after the access completes. It should be functionally mostly the same but I prefer to not use the tricky task stuff and instead just write out the straight line code when possible.

Comment on lines 649 to 652
if (this.underLease)
{
this.SingleWriterSemaphore.Release();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add comment on why this is important to do at this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we not have a forceful semaphore release if it hangs for too long?

Copy link
Member Author

@sebastianburckhardt sebastianburckhardt Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lease add comment on why this is important to do at this point

ok

do we not have a forceful semaphore release if it hangs for too long?

the hang check is somewhere else. We don't need it here since the entire object (incl the semaphore) gets disposed when the partition is terminated. Also, have not seen Azure Storage hanging recently.

@sebastianburckhardt
Copy link
Member Author

Sorry for the churn but I did a substantial refactor of this. The error handling path is simplified compared to before. A significant change is that I am no longer calling cancellation callbacks into FASTER at all. I noticed that those cancellation callbacks were causing a lot of exceptions and it turns out they are not needed.

@sebastianburckhardt sebastianburckhardt changed the title Catch and log exceptions in FASTER IO completion callbacks Improve error handling for FASTER IO completion callbacks Feb 14, 2024
@davidmrdavid davidmrdavid self-requested a review February 14, 2024 21:40
@sebastianburckhardt
Copy link
Member Author

After running more tests I added two more changes:

  • there were issues with performance when canceling large numbers of requests under semaphore pressure (because each canceled request had to acquire semaphore). This is easily preventable by canceling the waiters on the semaphore directly when the partition is terminated, as opposed to having to acquire the semaphore first just to find out it is canceled.
  • removing the cancellation callbacks to faster caused hangs in the dispose paths because faster is blocking waiting for the IO operations to complete. I re-added the cancellations, but now they are being called during the dispose, which makes sure they are properly logged and are not interfering with the cancellation token callbacks.

@sebastianburckhardt sebastianburckhardt marked this pull request as draft February 15, 2024 23:04
@sebastianburckhardt sebastianburckhardt added this to the 1.4.3 milestone Mar 5, 2024
Base automatically changed from dev to main March 15, 2024 20:21
# Conflicts:
#	src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs
@sebastianburckhardt sebastianburckhardt marked this pull request as ready for review March 15, 2024 20:56
@sebastianburckhardt sebastianburckhardt merged commit 7e16a84 into main Mar 18, 2024
2 checks passed
@sebastianburckhardt sebastianburckhardt deleted the pr/catch-callback-exceptions branch March 18, 2024 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants