Message duplication causes orchestrator messages dead-letter #6624

Closed · joelverhagen opened this issue Nov 5, 2018 · 5 comments

@joelverhagen (Member)

Orchestrator has logic that clones messages and re-enqueues them so that we can retry a message more than the maximum number of delivery attempts set on the subscription (20). In short, Orchestrator does this:

  1. Get a message.
  2. Process the message.
  3. If the validation set is not done, re-enqueue a clone of the message.
  4. Complete the original message.

If the "complete the message" step fails, that means there are now two messages for the one validation set. When the validation set finally completes, one message "wins" by copying the package from to the public container. The other message "loses" because it can't copy the package from the validation set location to the public container because the first message deleted the validation set copy.

This causes dead-lettering.
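For illustration, here is a minimal sketch of that flow, assuming a handler roughly shaped like Orchestrator's; the class, field, and helper names below are hypothetical, not the actual Orchestrator code.

using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// Hypothetical sketch of the flow described above; not the real Orchestrator code.
public class ValidationMessageHandlerSketch
{
    private readonly QueueClient _queueClient;

    public ValidationMessageHandlerSketch(QueueClient queueClient) => _queueClient = queueClient;

    public async Task HandleAsync(BrokeredMessage message)
    {
        // Steps 1 and 2: get and process the message. ProcessAsync stands in for the
        // real validation work and returns whether the validation set is done.
        bool validationSetIsDone = await ProcessAsync(message);

        // Step 3: if the validation set is not done, enqueue a clone so the set can be
        // retried beyond the subscription's 20-delivery limit.
        if (!validationSetIsDone)
        {
            await _queueClient.SendAsync(message.Clone());
        }

        // Step 4: complete the original message. If this step fails (for example, the
        // lock expired), Service Bus redelivers the original message and the validation
        // set now has two in-flight messages -- the duplication described above.
        await message.CompleteAsync();
    }

    private Task<bool> ProcessAsync(BrokeredMessage message) => Task.FromResult(false);
}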

This issue is a continuation of #6515.

In my on-call week I had 34+ dead-letter messages.

@loic-sharma (Contributor) commented Nov 21, 2018

Problem

It appears that the message duplication is caused by the Orchestrator taking so long to process a message that Service Bus delivers the message again to the Orchestrator. Our Service Bus locks are set to 1 minute, meaning that message duplication happens every time we take longer than 1 minute to process a message. See this Application Insights query:

(Timechart of message processing duration percentiles, rendered from the query below.)

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| summarize percentiles(DurationSeconds, 50, 90, 99) by bin(min_timestamp, 10m)
| render timechart 

Here messages clearly exceeded 1 minute, causing a surge of dead-lettered messages. The following query shows that a validation set is guaranteed to dead-letter due to copy failures if one of its messages' locks expires:

exceptions
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| where outerMessage contains "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue."
| extend CallGuid = tostring(customDimensions.CallGuid)
| project CallGuid
| join (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
  | where ValidationSetId <> ""
  | summarize max(timestamp), min(timestamp), any(ValidationSetId) by CallGuid
  | extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
  | extend ValidationSetId = any_ValidationSetId
  | project ValidationSetId, DurationSeconds, CallGuid
) on CallGuid 
| join (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | where message contains "Before calling FetchAttributesAsync(), the source blob 'validation-sets/"
  | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
  | summarize count() by ValidationSetId
  | extend CopyFailures = count_
) on ValidationSetId 
| project ValidationSetId, CallGuid, DurationSeconds, CopyFailures
| order by DurationSeconds desc

In the last week, 165 validation sets had messages whose locks expired:

exceptions
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| where outerMessage contains "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue."
| extend CallGuid = tostring(customDimensions.CallGuid)
| join (
    traces
    | extend CallGuid = tostring(customDimensions.CallGuid)
    | extend ValidationSetId = tostring(customDimensions.ValidationSetId)
    | summarize max(ValidationSetId) by CallGuid
) on CallGuid 
| extend ValidationSetId = max_ValidationSetId 
| summarize count() by ValidationSetId
| order by count_ desc

Potential solutions

First, we should add a metric that tracks how long it takes the Orchestrator to process Service Bus messages. We should also create a dashboard for this.
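A minimal sketch of what such a metric could look like, assuming the handler can reach an Application Insights TelemetryClient; the metric name and wrapper type are illustrative assumptions, not the actual implementation.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

// Illustrative only: times a message handler and reports the duration as a metric
// that can be charted on a dashboard and alerted on.
public class HandlerDurationMetricSketch
{
    private readonly TelemetryClient _telemetry = new TelemetryClient();

    public async Task HandleWithTimingAsync(Func<Task> handler)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await handler();
        }
        finally
        {
            stopwatch.Stop();
            // "MessageHandlerDurationSeconds" is a hypothetical metric name.
            _telemetry.TrackMetric("MessageHandlerDurationSeconds", stopwatch.Elapsed.TotalSeconds);
        }
    }
}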

Option 1: Auto renewal of the current message

We could automatically renew the lock that the Orchestrator holds on the current message. This would reduce the likelihood of Service Bus redelivering that message. This isn't a perfect solution, as message processing can still exceed the auto-renewal lifetime. See: https://docs.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.onmessageoptions.autorenewtimeout?view=azure-dotnet
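A minimal sketch of what this could look like, assuming the Orchestrator uses the OnMessage pump from Microsoft.ServiceBus.Messaging; the timeout and concurrency values are illustrative assumptions, not recommendations.

using System;
using Microsoft.ServiceBus.Messaging;

// Illustrative only: registers a message pump whose lock is renewed automatically
// for up to AutoRenewTimeout, reducing (but not eliminating) redelivery of messages
// that take longer than the 1-minute lock to process.
public static class AutoRenewSketch
{
    public static void Register(SubscriptionClient client)
    {
        var options = new OnMessageOptions
        {
            AutoComplete = false,                        // the handler completes the message itself
            AutoRenewTimeout = TimeSpan.FromMinutes(10), // illustrative value
            MaxConcurrentCalls = 16                      // illustrative value
        };

        client.OnMessageAsync(async message =>
        {
            // ... process the validation set, re-enqueue a clone if needed ...
            await message.CompleteAsync();
        }, options);
    }
}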

Option 2: Don't send a follow-up message if the current message's lock has expired

We could ensure that the Orchestrator doesn't send a new message if it no longer holds the lock on the message it is processing. Today, the Orchestrator does the following after processing a message:

  1. If the current validation set is incomplete:
    1. The Orchestrator sends itself a new message to continue processing.
  2. It completes the current message.

I propose that we change the Orchestrator to do the following steps instead:

  1. If the message took longer than a minute to process:
    1. If the current validation set is incomplete:
      1. The Orchestrator throws an exception.
    2. Else, if the current validation set is complete:
      1. The Orchestrator completes the message.
  2. Else, if the message took less than a minute to process:
    1. If the current validation set is incomplete:
      1. The Orchestrator sends itself a new message to continue processing.
    2. It completes the current message.

With this change, a message that takes a long time to process will be redelivered by Service Bus (as long as we haven't reached our redelivery quota).
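A minimal sketch of the proposed check, using the same hypothetical handler shape as the sketch in the issue description; the one-minute threshold mirrors the lock duration, and all names are illustrative.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

// Illustrative only: never send a follow-up message once the original lock has
// likely expired; instead throw so Service Bus redelivers the original message.
public class LockAwareHandlerSketch
{
    private static readonly TimeSpan LockDuration = TimeSpan.FromMinutes(1);

    private readonly QueueClient _queueClient;

    public LockAwareHandlerSketch(QueueClient queueClient) => _queueClient = queueClient;

    public async Task HandleAsync(BrokeredMessage message)
    {
        var stopwatch = Stopwatch.StartNew();
        bool validationSetIsDone = await ProcessAsync(message);

        if (stopwatch.Elapsed > LockDuration)
        {
            if (!validationSetIsDone)
            {
                // The lock has probably expired; let Service Bus redeliver the original
                // message instead of creating a duplicate with a follow-up message.
                throw new InvalidOperationException(
                    "Processing exceeded the lock duration; aborting so the message is redelivered.");
            }

            // The validation set is complete, so just attempt to complete the message.
            await message.CompleteAsync();
            return;
        }

        // Fast path: the lock should still be held.
        if (!validationSetIsDone)
        {
            await _queueClient.SendAsync(message.Clone());
        }

        await message.CompleteAsync();
    }

    private Task<bool> ProcessAsync(BrokeredMessage message) => Task.FromResult(false);
}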

@loic-sharma (Contributor) commented Nov 21, 2018

Message processing time on PROD

In the last week, here is how long the Orchestrator took to process a message:

50%:    0.7187417441s
90%:    4.2273579406s
99%:    14.3209690600s
99.9%:  64.0709640868s
99.99%: 435.44250065s

It seems that roughly 1 out of every 1,000 messages exceeds its lock duration (the 99.9th percentile is just over a minute). Here is the query I ran:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| summarize percentiles(DurationSeconds, 50, 90, 99, 99.9, 99.99)

Causes for long message processing

⚠️ I'm still working on this bit

In the last week:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp) by CallGuid
| extend DurationSeconds = (max_timestamp - min_timestamp)/1s
| where DurationSeconds >= 60
| join kind=leftouter (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | where message contains "Downloaded" and message contains "bytes in" and message contains "seconds for request"
  | extend DownloadSeconds = todouble(customDimensions.DownloadElapsedTime) 
  | extend LongDownload = iif(DownloadSeconds > 50, 1, 0)
) on CallGuid
| join kind=leftouter (
  traces
  | where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
  | extend CallGuid = tostring(customDimensions.CallGuid)
  | where message contains "Backing up package" or message contains "Adding validation set entry"
  | summarize min(timestamp), max(timestamp) by CallGuid
  | extend BackupSeconds = (max_timestamp  - min_timestamp)/1s
  | extend LongBackup = iif(BackupSeconds > 50, 1, 0)
) on CallGuid 
| project CallGuid, DurationSeconds, LongDownload, LongBackup, DownloadSeconds, BackupSeconds
| summarize count(), countif(LongDownload == 1), countif(LongBackup == 1)

Validation Sets with multiple messages exceeding the lock

There is a concern that the solution could make a validation set "stuck" if every delivery of its message exceeds the lock duration. Say a package takes 30 minutes to download, no matter what. If all 20 deliveries hit the 30-minute download, none of them will send a follow-up message, effectively making the validation set "stuck".

In the last week on PROD:

  1. 911 validation sets had 1 message take longer than 1 minute to process
  2. 21 validation sets had 2 messages take longer than 1 minute to process
  3. 0 validation sets had 3 or more messages take longer than 1 minute to process

Based on this data, validation sets are unlikely to get "stuck" after the change. Here is the query I ran:

traces
| where cloud_RoleName == "NuGet.Services.Validation.Orchestrator" 
| extend CallGuid = tostring(customDimensions.CallGuid)
| extend ValidationSetId = tostring(customDimensions.ValidationSetId)
| where CallGuid <> ""
| summarize max(timestamp), min(timestamp), max(ValidationSetId) by CallGuid
| extend DurationSeconds = (max_timestamp  - min_timestamp)/1s
| where DurationSeconds >= 60
| extend ValidationSetId = max_ValidationSetId 
| project ValidationSetId, DurationSeconds, CallGuid
| summarize count() by ValidationSetId
| order by count_ desc

@loic-sharma (Contributor) commented Nov 21, 2018

Orchestrator does not seem to set ServicePointManager.DefaultConnectionLimit.
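A minimal sketch of the kind of fix this implies; the value below is an illustrative assumption, not the number the team chose.

using System.Net;

public static class ConnectionLimitSketch
{
    public static void Configure()
    {
        // Raise the per-host connection limit before the Orchestrator starts processing
        // messages; the .NET Framework default of 2 throttles parallel blob downloads
        // and uploads. The value 64 is illustrative only.
        ServicePointManager.DefaultConnectionLimit = 64;
    }
}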

@loic-sharma (Contributor) commented Nov 22, 2018

loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 26, 2018
The Orchestrator used the default connection limit of 2 per server. The Orchestrator processes messages in parallel, each of which may be downloading/uploading large files.

Part of NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/ServerCommon that referenced this issue Nov 26, 2018
…cted duration (#230)

Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

I also fixed a bunch of build warnings as part of this change.

Part of: NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 27, 2018
Package backups exceeding the expected duration may be causing message duplication in the Orchestrator. This adds more telemetry around package backups for future investigations.

Part of NuGet/NuGetGallery#6624
loic-sharma added a commit to NuGet/NuGet.Jobs that referenced this issue Nov 27, 2018
Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

Depends on NuGet/ServerCommon#230
Part of NuGet/NuGetGallery#6624
@joelverhagen joelverhagen removed this from the S145 - 2018.11.26 milestone Dec 12, 2018
@loic-sharma loic-sharma assigned agr and unassigned loic-sharma Mar 9, 2019
@loic-sharma loic-sharma added this to the S150 - 2018.03.11 milestone Mar 9, 2019
@agr agr removed the ops grabs label Mar 29, 2019
@loic-sharma (Contributor)

@agr should we keep this open until we’ve verified there’s no more dead lettering? I would wait a few days and maybe run functional tests on PROD.

joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
The Orchestrator used the default connection limit of 2 per server. The Orchestrator processes messages in parallel, each of which may be downloading/uploading large files.

Part of NuGet/NuGetGallery#6624
joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
Package backups exceeding the expected duration may be causing message duplication in the Orchestrator. This adds more telemetry around package backups for future investigations.

Part of NuGet/NuGetGallery#6624
joelverhagen pushed a commit to NuGet/NuGet.Jobs that referenced this issue Oct 26, 2020
Today, we have to parse Application Insights logs to figure out a message handler's duration. This makes it impossible to dashboard and monitor. This metric will let us detect messages that caused dead-lettering by exceeding their expected duration.

Depends on NuGet/ServerCommon#230
Part of NuGet/NuGetGallery#6624