Change Feed Processor: Fixes LeaseLostException on Notifications API for Renewer #4276
#3401 fixed all the scenarios where the LeaseLostException is generated from a logical condition (not a network request), so that it contains a CosmosException that is reportable to the Notifications API.
The PartitionSupervisor was modified in that same PR to only report the inner CosmosException to the Notifications API.
However, there was a missing case:
The PartitionController starts a PartitionSupervisor after acquiring a lease, and the PartitionSupervisor spawns a PartitionProcessor and a Renewer.
The Renewer can hit a LeaseLostException while the PartitionProcessor is not processing (idle partition) but the lease is still being renewed. In that case, the LeaseLostException goes up, stops the PartitionSupervisor as expected, and bubbles up to the PartitionController, which releases the lease.
The problem was that the PartitionController called the Notifications API to report errors without checking that a LeaseLostException can only be reported when it has a linked inner exception (CosmosException). This handling had already been added to other flows (lease acquisition, lease release) but not to renewal.
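The rule described above can be illustrated with a small, language-neutral sketch. This is not the SDK's C# code: the classes are simplified stand-ins, and `reportable_error`, `notify_renewal_failure`, and the `notify` callback are hypothetical names used only for illustration. It also assumes that exceptions other than LeaseLostException are reported unchanged.

```python
class CosmosException(Exception):
    """Simplified stand-in for the SDK's CosmosException."""


class LeaseLostException(Exception):
    """Simplified stand-in; `inner` plays the role of the linked inner exception."""

    def __init__(self, message, inner=None):
        super().__init__(message)
        self.inner = inner


def reportable_error(error):
    """Return the exception that should reach the notification callback, or None."""
    if isinstance(error, LeaseLostException):
        # A lease loss is only reportable through its linked CosmosException.
        if isinstance(error.inner, CosmosException):
            return error.inner
        return None
    # Assumption: any other error is reported as-is.
    return error


def notify_renewal_failure(error, notify):
    """Hypothetical controller-side hook: apply the same unwrapping on renewal."""
    to_report = reportable_error(error)
    if to_report is not None:
        notify(to_report)


# Usage: the callback sees the inner CosmosException, never the wrapper.
seen = []
notify_renewal_failure(
    LeaseLostException("lease lost during renewal", CosmosException("lease conflict")),
    seen.append,
)
notify_renewal_failure(LeaseLostException("lease lost, no inner exception"), seen.append)
```

With this shape, a consumer that switches on the reported exception type only ever receives CosmosException for lease-loss cases, which matches the behavior already in place for the acquire and release flows.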
Symptoms
This was reported in Azure Functions, where the Functions Extension has a switch statement to log CosmosExceptions:
https://github.com/Azure/azure-webjobs-sdk-extensions/blob/cf0fc8022b230aa1e19325638881a56210474189/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerHealthMonitor.cs#L23-L37
But it was receiving the OuterType as LeaseLostException.
Verification
With this fix, the emitted Notifications/Logs have the proper Severity Level (Information) and the proper Type (CosmosException).