Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always release partition lease when a partition shuts down #335

Merged
merged 5 commits into from
Jan 3, 2024

Conversation

sebastianburckhardt
Copy link
Member

While analyzing some of traces showing high client latencies (see Azure/azure-functions-durable-extension#2603) we discovered a problem: in some cases the partition lease is not released when the partition is terminated, because the request is cancelled. This causes an excessive delay before the partitition recovery, since the lease has to expire first. Waiting for the lease to expire can take up to 40s and delays all pending client requests to that partition.

This PR fixes this problem by not passing the termination cancellation token to the lease renewal (which caused the lease release to not happen).

Comment on lines 739 to 742

// we always release the lease here, whether the partition has already been terminated or not.
await this.leaseClient.ReleaseAsync(conditions: null, CancellationToken.None).ConfigureAwait(false);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// we always release the lease here, whether the partition has already been terminated or not.
await this.leaseClient.ReleaseAsync(conditions: null, CancellationToken.None).ConfigureAwait(false);
// we always release the lease here, whether the partition has already been terminated or not.
// otherwise, a pending client request for that partition will be delayed until the lease expires
await this.leaseClient.ReleaseAsync(conditions: null, CancellationToken.None).ConfigureAwait(false);

@sebastianburckhardt
Copy link
Member Author

Added two more changes (because of related issues observed when testing this PR).
These are small "tuning" changes that may help a little when dealing with OOM, for example.

  • recognize another transient storage exception (so it can be retried instead of failing the partition immediately)
  • change the default setting so partitions do NOT take checkpoints when shutting down (taking many checkpoints at once is a typical cause for OOM)
  • do not release lease if it is already lost

@sebastianburckhardt sebastianburckhardt merged commit b47eae7 into dev Jan 3, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants