Deadlocks/timeouts occur on SQL outbox table cleanup if it is under high load #778
Comments
One option you might try as a mitigation strategy is to disable the built-in cleanup and run a custom SQL Agent job that uses smaller batches. This could improve the situation in the short term. In the long term it seems like the best option would be to partition the Outbox table and have a custom SQL Agent job that would run e.g. daily and drop the partitions that fall outside the retention window.
Dropping a partition should be a much faster operation than deleting individual rows. Unfortunately we don't yet have guidance on how to automate a partition-based expiration policy, but we are working on this as we speak.
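In case it helps anyone hitting this, a minimal sketch of such a SQL Agent cleanup job could look like the one below. The table name (dbo.MyEndpoint_OutboxData), the retention window, and the batch size are assumptions based on the default SQL persistence naming convention and would need to be adjusted to your environment.

```sql
-- Minimal sketch of a custom cleanup job (hypothetical table name, adjust as needed).
-- Deletes dispatched outbox records older than the retention window in small batches.
DECLARE @cutoff    datetime = DATEADD(DAY, -7, GETUTCDATE());
DECLARE @batchSize int      = 2000;   -- stays well below SQL Server's ~5,000-lock escalation threshold
DECLARE @rows      int      = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (@batchSize)
    FROM dbo.MyEndpoint_OutboxData
    WHERE Dispatched = 1
      AND DispatchedAt < @cutoff;

    SET @rows = @@ROWCOUNT;

    -- Brief pause between batches so concurrent message processing is not starved.
    WAITFOR DELAY '00:00:01';
END
```

Scheduling something like this every few minutes keeps each individual delete transaction small, which is what avoids the lock escalation described under "Root cause" below.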
I second what @SzymonPobiega says, though I doubt you'll need partitioning. You just have a lot of activity and multiple endpoint instances fighting over who gets to do the cleanup. Ideally all users would do Outbox cleanup using SQL Agent—that's the logical choice for a task like that. We just have the app do it for the simple cases so that by default it Just Works® without having to set up and manage SQL Agent tasks. You just happen not to be a simple case.
We can improve here. We can catch the exception type and provide a link to guidance in the docs explaining how a user can benefit from using SQL Agent to prevent endpoint instances from competing over outbox cleanup.
For reference, the hard-coded batch size was decreased from 10_000 to 4_000 by: Also, here's a request to make the batch size configurable in:
Symptoms
NServiceBus.dll 7.5.0
NServiceBus.Persistence.Sql 6.2.0
We have a highly loaded system that produces around 12M events per day. With a retention policy of 7 days for deduplication data, we can have around 80M rows in our outbox table. This amount of data leads to a lot of deadlock and timeout exceptions when the cleanup query runs against the outbox table. There are multiple instances of the endpoint, and each of them deletes outbox data. Please see the attached exception screenshots below.
While researching this issue we tried setting the deduplication retention policy to 1 day (so we had around 12M rows in the table). After that we started getting fewer deadlock and timeout exceptions, but we still got a lot of them.
During our research we also tried rebuilding the indexes on the outbox table, since their fragmentation was 90% and higher. After the rebuild everything was okay for some time, but fragmentation increased very quickly and we got errors again after an hour or so. We then decided to rebuild the indexes on a regular basis. Unfortunately, the primary key index rebuild took around 5 minutes, and during that time SQL Server performance degraded and we received a lot of timeout exceptions (from the outbox cleanup process).
Our next idea was to decrease the size of the delete batch (which is 10,000 right now). Unfortunately, that value is hard-coded in the source code of the NServiceBus.Persistence.Sql package and we couldn't change it, since there's no option for that.
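For context, the built-in cleanup boils down to a single batched delete roughly like the one below (an approximation, not the exact SQL the package emits, and the table name is again assumed). With ~80M rows and several endpoint instances issuing it concurrently, each 10,000-row batch is large enough to escalate locks and collide with the others.

```sql
-- Approximation of the built-in cleanup statement (table name assumed).
DECLARE @dispatchedBefore datetime = DATEADD(DAY, -7, GETUTCDATE());

DELETE TOP (10000)
FROM dbo.MyEndpoint_OutboxData
WHERE Dispatched = 1
  AND DispatchedAt < @dispatchedBefore;
```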
Who's affected
SQL Server dialect users, meaning Microsoft SQL Server, Azure SQL, or any other hosted Microsoft SQL Server instance type.
Root cause
Too many rows are deleted in a single cleanup query, resulting in lock escalation, which in turn causes deadlocks. SQL Server attempts to escalate to a table-level lock once a single statement holds roughly 5,000 locks on a table, so a 10,000-row delete batch routinely escalates and blocks the other endpoint instances running the same cleanup.
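One way to confirm that escalation is actually happening is to look at the lock-promotion counters for the outbox table (table name assumed, as above):

```sql
-- Non-zero index_lock_promotion_count values confirm that lock escalation has occurred.
SELECT  i.name AS index_name,
        s.index_lock_promotion_attempt_count,
        s.index_lock_promotion_count
FROM    sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID(N'dbo.MyEndpoint_OutboxData'), NULL, NULL) AS s
JOIN    sys.indexes AS i
    ON  i.object_id = s.object_id
    AND i.index_id  = s.index_id;
```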
Workarounds
When the instances are hosted in Kubernetes and scaled using StatefulSets, it is possible to get a stable instance id (a number from 0 to n) via the pod.beta.kubernetes.io/petset-index annotation. This can be mapped to an environment variable in the container and later used to disable Outbox clean-up on all instances except instance 0.