Message duplication causes orchestrator messages dead-letter #6624

Closed · joelverhagen opened this issue Nov 5, 2018 · 5 comments

@joelverhagen (Member)

Orchestrator has logic that clones messages and re-enqueues them so that we can retry a message more than the maximum number of delivery attempts set on the subscription (20). In short, Orchestrator does this:

  1. Get a message.
  2. Process the message.
  3. If the validation set is not done, re-enqueue a clone of the message.
  4. Complete the original message.

If the "complete the message" step fails, that means there are now two messages for the one validation set. When the validation set finally completes, one message "wins" by copying the package from to the public container. The other message "loses" because it can't copy the package from the validation set location to the public container because the first message deleted the validation set copy.

This causes dead-lettering.
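For illustration, here is a minimal sketch of that flow, assuming a handler roughly shaped like Orchestrator's; the class, field, and helper names below are hypothetical, not the actual Orchestrator code.

using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// Hypothetical sketch of the flow described above; not the real Orchestrator code.
public class ValidationMessageHandlerSketch
{
    private readonly QueueClient _queueClient;

    public ValidationMessageHandlerSketch(QueueClient queueClient) => _queueClient = queueClient;

    public async Task HandleAsync(BrokeredMessage message)
    {
        // Steps 1 and 2: get and process the message. ProcessAsync stands in for the
        // real validation work and returns whether the validation set is done.
        bool validationSetIsDone = await ProcessAsync(message);

        // Step 3: if the validation set is not done, enqueue a clone so the set can be
        // retried beyond the subscription's 20-delivery limit.
        if (!validationSetIsDone)
        {
            await _queueClient.SendAsync(message.Clone());
        }

        // Step 4: complete the original message. If this step fails (for example, the
        // lock expired), Service Bus redelivers the original message and the validation
        // set now has two in-flight messages -- the duplication described above.
        await message.CompleteAsync();
    }

    private Task<bool> ProcessAsync(BrokeredMessage message) => Task.FromResult(false);
}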

This issue is a continuation of #6515.

In my on-call week I had 34+ dead-letter messages.

@loic-sharma (Contributor) commented Nov 21, 2018

Problem

It appears that the message duplication is caused by the Orchestrator taking so long to process a message that Service Bus delivers the message again to the Orchestrator. Our Service Bus locks are set to 1 minute, meaning that message duplication happens every time we take longer than 1 minute to process a message. See this Application Insights query:

(Timechart of message processing duration percentiles, rendered from the query below.)

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| summarize percentiles(DurationSeconds, 50, 90, 99) by bin(min_timestamp, 10m)
| render timechart 

Here messages clearly exceeded 1 minute, causing a surge of dead-lettered messages. The following query shows that a validation set is guaranteed to dead-letter due to copy failures if one of its messages' locks expires:

exceptions
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| where outerMessage contains "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue."
| extend CallGuid = tostring(customDimensions.CallGuid)
| project CallGuid
| join (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
  | where ValidationSetId <> ""
  | summarize max(timestamp), min(timestamp), any(ValidationSetId) by CallGuid
  | extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
  | extend ValidationSetId = any_ValidationSetId
  | project ValidationSetId, DurationSeconds, CallGuid
) on CallGuid 
| join (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | where message contains "Before calling FetchAttributesAsync(), the source blob 'validation-sets/"
  | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
  | summarize count() by ValidationSetId
  | extend CopyFailures = count_
) on ValidationSetId 
| project ValidationSetId, CallGuid, DurationSeconds, CopyFailures
| order by DurationSeconds desc

In the last week, 165 validation sets had messages whose locks expired:

exceptions
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| where outerMessage contains "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue."
| extend CallGuid = tostring(customDimensions.CallGuid)
| join (
    traces
    | extend CallGuid = tostring(customDimensions.CallGuid)
    | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
    | summarize max(ValidationSetId) by CallGuid
) on CallGuid 
| extend ValidationSetId = max_ValidationSetId 
| summarize count() by ValidationSetId
| order by count_ desc

Potential solutions

First, we should add a metric that tracks how long it takes the Orchestrator to process Service Bus messages. We should also create a dashboard for this.
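A minimal sketch of what such a metric could look like, assuming the handler can reach an Application Insights TelemetryClient; the metric name and wrapper type are illustrative assumptions, not the actual implementation.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

// Illustrative only: times a message handler and reports the duration as a metric
// that can be charted on a dashboard and alerted on.
public class HandlerDurationMetricSketch
{
    private readonly TelemetryClient _telemetry = new TelemetryClient();

    public async Task HandleWithTimingAsync(Func<Task> handler)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await handler();
        }
        finally
        {
            stopwatch.Stop();
            // "MessageHandlerDurationSeconds" is a hypothetical metric name.
            _telemetry.TrackMetric("MessageHandlerDurationSeconds", stopwatch.Elapsed.TotalSeconds);
        }
    }
}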

Option 1: Auto renewal of the current message

We could automatically renew the lock that the Orchestrator holds on the current message. This would reduce the likelihood of Service Bus redelivering that message. This isn't a perfect solution, as message processing can still exceed the auto-renewal lifetime. See: https://docs.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.onmessageoptions.autorenewtimeout?view=azure-dotnet
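A minimal sketch of what this could look like, assuming the Orchestrator uses the OnMessage pump from Microsoft.ServiceBus.Messaging; the timeout and concurrency values are illustrative assumptions, not recommendations.

using System;
using Microsoft.ServiceBus.Messaging;

// Illustrative only: registers a message pump whose lock is renewed automatically
// for up to AutoRenewTimeout, reducing (but not eliminating) redelivery of messages
// that take longer than the 1-minute lock to process.
public static class AutoRenewSketch
{
    public static void Register(SubscriptionClient client)
    {
        var options = new OnMessageOptions
        {
            AutoComplete = false,                        // the handler completes the message itself
            AutoRenewTimeout = TimeSpan.FromMinutes(10), // illustrative value
            MaxConcurrentCalls = 16                      // illustrative value
        };

        client.OnMessageAsync(async message =>
        {
            // ... process the validation set, re-enqueue a clone if needed ...
            await message.CompleteAsync();
        }, options);
    }
}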

Option 2: Don't send a follow-up message if the current message's lock has expired

We could ensure that the Orchestrator doesn't send a new message if it no longer holds the lock on the message it is processing. Today, the Orchestrator does the following after processing a message:

  1. If the current validation set is incomplete:
    1. The Orchestrator sends itself a new message to continue processing.
  2. It completes the current message.

I propose that we change the Orchestrator to do the following steps instead:

  1. If the message took longer than a minute to process:
    1. If the current validation set is incomplete:
      1. The Orchestrator throws an exception.
    2. Else, if the current validation set is complete:
      1. The Orchestrator completes the message.
  2. Else, if the message took less than a minute to process:
    1. If the current validation set is incomplete:
      1. The Orchestrator sends itself a new message to continue processing.
    2. It completes the current message.

With this change, a message that takes a long time to process will be redelivered by Service Bus (as long as we haven't reached our redelivery quota).
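A minimal sketch of the proposed check, using the same hypothetical handler shape as the sketch in the issue description; the one-minute threshold mirrors the lock duration, and all names are illustrative.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// Illustrative only: never send a follow-up message once the original lock has
// likely expired; instead throw so Service Bus redelivers the original message.
public class LockAwareHandlerSketch
{
    private static readonly TimeSpan LockDuration = TimeSpan.FromMinutes(1);

    private readonly QueueClient _queueClient;

    public LockAwareHandlerSketch(QueueClient queueClient) => _queueClient = queueClient;

    public async Task HandleAsync(BrokeredMessage message)
    {
        var stopwatch = Stopwatch.StartNew();
        bool validationSetIsDone = await ProcessAsync(message);

        if (stopwatch.Elapsed > LockDuration)
        {
            if (!validationSetIsDone)
            {
                // The lock has probably expired; let Service Bus redeliver the original
                // message instead of creating a duplicate with a follow-up message.
                throw new InvalidOperationException(
                    "Processing exceeded the lock duration; aborting so the message is redelivered.");
            }

            // The validation set is complete, so just attempt to complete the message.
            await message.CompleteAsync();
            return;
        }

        // Fast path: the lock should still be held.
        if (!validationSetIsDone)
        {
            await _queueClient.SendAsync(message.Clone());
        }

        await message.CompleteAsync();
    }

    private Task<bool> ProcessAsync(BrokeredMessage message) => Task.FromResult(false);
}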

@loic-sharma (Contributor) commented Nov 21, 2018

Message processing time on PROD

In the last week, here is how long the Orchestrator took to process a message:

50%:    0.7187417441s
90%:    4.2273579406s
99%:    14.3209690600s
99.9%:  64.0709640868s
99.99%: 435.44250065s

It seems that roughly 1 out of every 1,000 messages exceeds its lock duration (the 99.9th percentile is just over a minute). Here is the query I ran:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| summarize percentiles(DurationSeconds, 50, 90, 99, 99.9, 99.99)

Causes for long message processing

⚠️ I'm still working on this bit

In the last week:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp - min_timestamp)/1s
| where DurationSeconds >= 60
| join kind=leftouter (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | where message contains "Downloaded" and message contains "bytes in" and message contains "seconds for request"
  | extend DownloadSeconds = todouble(customDimensions.DownloadElapsedTime) 
  | extend LongDownload = iif(DownloadSeconds > 50, 1, 0)
) on CallGuid
| join kind=leftouter (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | where message contains "Backing up package" or message contains "Adding validation set entry"
  | summarize min(timestamp), max(timestamp) by CallGuid
  | extend BackupSeconds = (max_timestamp  - min_timestamp)/1s
  | extend LongBackup = iif(BackupSeconds > 50, 1, 0)
) on CallGuid 
| project CallGuid, DurationSeconds, LongDownload, LongBackup, DownloadSeconds, BackupSeconds
| summarize count(), countif(LongDownload == 1), countif(LongBackup == 1)

Validation Sets with multiple messages exceeding the lock

There is a concern that the solution could make a validation set "stuck" if every delivery of its message exceeds the lock duration. Say a package takes 30 minutes to download, no matter what. If all 20 deliveries hit the 30-minute download, none of them will send a follow-up message, effectively making the validation set "stuck".

In the last week on PROD:

  1. 911 validation sets had 1 message take longer than 1 minute to process
  2. 21 validation sets had 2 messages take longer than 1 minute to process
  3. 0 validation sets had 3 or more messages take longer than 1 minute to process

Based on this data, validation sets are unlikely to get "stuck" after the change. Here is the query I ran:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| extend ValidationSetId = tostring(customDimensions.ValidationSetId)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp), max(ValidationSetId) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| where DurationSeconds >= 60
| extend ValidationSetId = max_ValidationSetId 
| project ValidationSetId, DurationSeconds, CallGuid
| summarize count() by ValidationSetId
| order by count_ desc

@loic-sharma (Contributor) commented Nov 21, 2018

Orchestrator does not seem to set ServicePointManager.DefaultConnectionLimit.
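A minimal sketch of the kind of fix this implies; the value below is an illustrative assumption, not the number the team chose.

using System.Net;

public static class ConnectionLimitSketch
{
    public static void Configure()
    {
        // Raise the per-host connection limit before the Orchestrator starts processing
        // messages; the .NET Framework default of 2 throttles parallel blob downloads
        // and uploads. The value 64 is illustrative only.
        ServicePointManager.DefaultConnectionLimit = 64;
    }
}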

@loic-sharma (Contributor) commented Nov 22, 2018

loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 26, 2018
The Orchestrator used the default connection limit of 2 per server. The Orchestrator processes messages in parallel, each of which may be downloading/uploading large files.

Part of NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/ServerCommon that referenced this issue Nov 26, 2018
…cted duration (#230)

Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

I also fixed a bunch of build warnings as part of this change.

Part of: NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 27, 2018
Package backups exceeding the expected duration may be causing message duplication in the Orchestrator. This adds more telemetry around package backups for future investigations.

Part of NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 27, 2018
Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

Depends on NuGet/ServerCommon#230
Part of NuGet/NuGetGallery#6624
@joelverhagen joelverhagen removed this from the S145 - 2018.11.26 milestone Dec 12, 2018
@loic-sharma loic-sharma assigned agr and unassigned loic-sharma Mar 9, 2019
@loic-sharma loic-sharma added this to the S150 - 2018.03.11 milestone Mar 9, 2019
@agr agr removed the ops grabs label Mar 29, 2019
@loic-sharma (Contributor)

@agr should we keep this open until we’ve verified there’s no more dead lettering? I would wait a few days and maybe run functional tests on PROD.

joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
The Orchestrator used the default connection limit of 2 per server. The Orchestrator processes messages in parallel, each of which may be downloading/uploading large files.

Part of NuGet/NuGetGallery#6624
joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
Package backups exceeding the expected duration may be causing message duplication in the Orchestrator. This adds more telemetry around package backups for future investigations.

Part of NuGet/NuGetGallery#6624
joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

Depends on NuGet/ServerCommon#230
Part of NuGet/NuGetGallery#6624