Message duplication causes orchestrator messages dead-letter #6624
Problem
It appears that the message duplication is caused by the Orchestrator taking so long to process a message that Service Bus delivers the message again to the Orchestrator. Our Service Bus locks are set to 1 minute, meaning that message duplication will happen every time we take longer than 1 minute to process a message. See this Application Insights query:
Here, messages clearly exceeded 1 minute, causing a surge of dead-lettered messages. The following query shows that a validation set is guaranteed to dead-letter due to copy failures if one of its messages' locks expires:
In the last week, 165 validation sets had messages that expired their lease:
Potential solutions
First, we should add a metric that tracks how long the Orchestrator takes to process Service Bus messages, and we should create a dashboard for it.
Option 1: Auto-renew the current message's lock
We could renew the lock that the Orchestrator holds on the current message. This would reduce the likelihood of Service Bus redelivering the current message. This isn't a perfect solution, as message processing can still exceed the auto-renewal lifetime. See: https://docs.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.onmessageoptions.autorenewtimeout?view=azure-dotnet
Option 2: Don't send a follow-up message if the current message's lock has expired
We could ensure that the Orchestrator doesn't send a new message if it no longer holds the lock on the message it is processing (a rough sketch of both options follows the proposal below). Today, the Orchestrator does the following steps after processing a message:
I propose that we change the Orchestrator to do the following steps instead:
With this change, a message that takes a long time to process will be redelivered by Service Bus (so long as we haven't reached our redelivery quota).
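To make the two options concrete, here is a minimal sketch, assuming the Orchestrator consumes messages through the older Microsoft.ServiceBus.Messaging client (BrokeredMessage/OnMessageOptions). The handler and helper names are hypothetical, and the timeouts are illustrative rather than the values we would actually pick.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public class OrchestratorMessageHandler
{
    private readonly SubscriptionClient _client;

    public OrchestratorMessageHandler(SubscriptionClient client) => _client = client;

    public void Start()
    {
        // Option 1 (sketch): have the client auto-renew the 1-minute lock while the handler runs.
        var options = new OnMessageOptions
        {
            AutoComplete = false,
            AutoRenewTimeout = TimeSpan.FromMinutes(10)
        };
        _client.OnMessageAsync(HandleAsync, options);
    }

    private async Task HandleAsync(BrokeredMessage message)
    {
        await ProcessValidationSetAsync(message); // hypothetical processing step

        // Option 2 (sketch): if the lock has already expired, Service Bus will redeliver
        // this message anyway, so sending a follow-up now would create a duplicate.
        // Bail out and let the redelivery drive the next attempt instead.
        if (message.LockedUntilUtc <= DateTime.UtcNow)
        {
            return;
        }

        await SendFollowUpMessageAsync(message); // hypothetical re-enqueue step
        await message.CompleteAsync();
    }

    private Task ProcessValidationSetAsync(BrokeredMessage message) => Task.CompletedTask;
    private Task SendFollowUpMessageAsync(BrokeredMessage message) => Task.CompletedTask;
}
```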
Message processing time on PROD
In the last week, here is how long the Orchestrator took to process a message:
It seems that 1 out of every 100 messages exceeds its lock duration. Here is the query I ran:
Causes for long message processing
In the last week:
Validation sets with multiple messages exceeding the lock
There is a concern that the solution could leave a validation set "stuck" if every delivery of its message exceeds the lock duration. Say a package takes 30 minutes to download, no matter what. If all 20 deliveries of the message hit the 30-minute download, none of them will send a follow-up message, effectively leaving the validation set "stuck". In the last week on PROD:
From the data, there is no concern about validation sets getting "stuck" after the change. Here is the query I ran:
Orchestrator does not seem to set
The Orchestrator used the default connection limit of 2 per server. The Orchestrator processes messages in parallel, each of which may be downloading/uploading large files. Part of NuGet/NuGetGallery#6624
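For reference, a minimal sketch of the kind of change this describes, assuming it is applied once at service startup; the class name, method name, and the value 64 are illustrative, not the actual fix.

```csharp
using System.Net;

internal static class OrchestratorStartup
{
    // Hypothetical startup hook; the real Orchestrator wires this up elsewhere.
    public static void ConfigureHttp()
    {
        // The .NET Framework default is 2 connections per host, which throttles a
        // process that downloads/uploads many large packages in parallel.
        ServicePointManager.DefaultConnectionLimit = 64;
    }
}
```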
…cted duration (#230) Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration. I also fixed a bunch of build warnings as part of this change. Part of: NuGet/NuGetGallery#6624
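As a rough illustration of what such a metric could look like (not the actual change in ServerCommon#230), assuming Application Insights' TelemetryClient is available; the wrapper class, metric name, and expected-duration parameter are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

public class MessageHandlerTelemetry
{
    private readonly TelemetryClient _telemetry = new TelemetryClient();

    // Times a handler and emits a metric, rather than relying on log parsing.
    // Returns false when the handler exceeded its expected duration, which is the
    // condition that correlates with lock expiry and eventual dead-lettering.
    public async Task<bool> TrackDurationAsync(Func<Task> handler, TimeSpan expectedDuration)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await handler();
        }
        finally
        {
            stopwatch.Stop();
            _telemetry.TrackMetric("MessageHandlerDurationSeconds", stopwatch.Elapsed.TotalSeconds);
        }

        return stopwatch.Elapsed <= expectedDuration;
    }
}
```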
Package backups exceeding the expected duration may be causing message duplication in the Orchestrator. This adds more telemetry around package backups for future investigations. Part of NuGet/NuGetGallery#6624
Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration. Depends on NuGet/ServerCommon#230 Part of NuGet/NuGetGallery#6624
@agr should we keep this open until we've verified there's no more dead-lettering? I would wait a few days and maybe run functional tests on PROD.
The Orchestrator has logic that clones messages and re-enqueues them so that we can retry a message more than the maximum number of attempts set on the subscription (20). In short, the Orchestrator does this:
If the "complete the message" step fails, that means there are now two messages for the one validation set. When the validation set finally completes, one message "wins" by copying the package from to the public container. The other message "loses" because it can't copy the package from the validation set location to the public container because the first message deleted the validation set copy.
This causes dead-lettering.
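For context, the sequence described above looks roughly like the following, assuming the older BrokeredMessage API; the class, method names, and delay are illustrative, not the Orchestrator's actual code.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public class ValidationMessageRequeuer
{
    private readonly MessageSender _sender;

    public ValidationMessageRequeuer(MessageSender sender) => _sender = sender;

    // 1. Process the validation set.
    // 2. Enqueue a clone so the set can be retried beyond the subscription's 20 attempts.
    // 3. Complete the original delivery.
    public async Task ProcessAndRequeueAsync(BrokeredMessage message)
    {
        await ProcessValidationSetAsync(message); // hypothetical processing step

        var clone = message.Clone();
        clone.ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddMinutes(5); // delayed retry
        await _sender.SendAsync(clone);

        // If this step fails (for example because the lock already expired), the original
        // message is redelivered *and* the clone exists: two messages now track the same
        // validation set, and whichever one loses the final copy dead-letters.
        await message.CompleteAsync();
    }

    private Task ProcessValidationSetAsync(BrokeredMessage message) => Task.CompletedTask;
}
```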
This issue is a continuation of #6515.
In my on-call week I had 34+ dead-letter messages.