Deadlocks/timeouts occur on SQL outbox table cleanup if it is under high load #778
Comments
One option you might try as a mitigation strategy is to disable the built-in cleanup and run a custom SQL Agent job that uses smaller batches. This could improve the situation in the short term. In the long term it seems like the best option would be to partition the Outbox table and have a custom SQL Agent job that would run e.g. daily and drop the partitions that fall outside the retention window.
Dropping a partition should be a much faster operation than deleting individual rows. Unfortunately we don't yet have guidance on how to automate a partition-based expiration policy, but we are working on this as we speak.
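In case it helps anyone hitting this, a minimal sketch of such a SQL Agent cleanup job could look like the one below. The table name (dbo.MyEndpoint_OutboxData), the retention window, and the batch size are assumptions based on the default SQL persistence naming convention and would need to be adjusted to your environment.

```sql
-- Minimal sketch of a custom cleanup job (hypothetical table name, adjust as needed).
-- Deletes dispatched outbox records older than the retention window in small batches.
DECLARE @cutoff    datetime = DATEADD(DAY, -7, GETUTCDATE());
DECLARE @batchSize int      = 2000;   -- stays well below SQL Server's ~5,000-lock escalation threshold
DECLARE @rows      int      = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (@batchSize)
    FROM dbo.MyEndpoint_OutboxData
    WHERE Dispatched = 1
      AND DispatchedAt < @cutoff;

    SET @rows = @@ROWCOUNT;

    -- Brief pause between batches so concurrent message processing is not starved.
    WAITFOR DELAY '00:00:01';
END
```

Scheduling something like this every few minutes keeps each individual delete transaction small, which is what avoids the lock escalation described under "Root cause" below.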
I second what @SzymonPobiega says, though I doubt you'll need partitioning. You just have a lot of activity and multiple endpoint instances fighting over who gets to do the cleanup. Ideally all users would do Outbox cleanup using SQL Agent—that's the logical choice for a task like that. We just have the app do it for the simple cases so that by default it Just Works® without having to set up and manage SQL Agent tasks. You just happen not to be a simple case.
We can improve here. We can catch the exception type and provide a link to guidance in the docs explaining how a user can benefit from using SQL Agent to prevent endpoint instances from competing over outbox cleanup.
For reference, the hard-coded batch size was decreased from 10_000 to 4_000 by: Also, here's a request to make the batch size configurable in:
Symptoms
NServiceBus.dll 7.5.0
NServiceBus.Persistence.Sql 6.2.0
We have a highly loaded system that produces around 12M events per day. With a retention policy of 7 days for deduplication data, we can have around 80M rows in our outbox table. This amount of data leads to a lot of deadlock and timeout exceptions when the cleanup query runs against the outbox table. There are multiple instances of the endpoint, and each of them deletes outbox data. Please see the attached exception screenshots below.
While researching this issue we tried setting the deduplication retention policy to 1 day (so we had around 12M rows in the table). After that we started getting fewer deadlock and timeout exceptions, but we still got a lot of them.
During our research we also tried rebuilding the indexes on the outbox table, since their fragmentation was 90% and higher. After the rebuild everything was okay for some time, but fragmentation increased very quickly and we got errors again after an hour or so. We then decided to rebuild the indexes on a regular basis. Unfortunately, the primary key index rebuild took around 5 minutes, and during that time SQL Server performance degraded and we received a lot of timeout exceptions (from the outbox cleanup process).
Our next idea was to decrease the size of the delete batch (which is 10,000 right now). Unfortunately, that value is hard-coded in the source code of the NServiceBus.Persistence.Sql package and we couldn't change it, since there's no option for that.
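For context, the built-in cleanup boils down to a single batched delete roughly like the one below (an approximation, not the exact SQL the package emits, and the table name is again assumed). With ~80M rows and several endpoint instances issuing it concurrently, each 10,000-row batch is large enough to escalate locks and collide with the others.

```sql
-- Approximation of the built-in cleanup statement (table name assumed).
DECLARE @dispatchedBefore datetime = DATEADD(DAY, -7, GETUTCDATE());

DELETE TOP (10000)
FROM dbo.MyEndpoint_OutboxData
WHERE Dispatched = 1
  AND DispatchedAt < @dispatchedBefore;
```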
Who's affected
SQL Server dialect users, meaning Microsoft SQL Server, Azure SQL, or any other hosted Microsoft SQL Server instance type.
Root cause
Too many rows are deleted in a single cleanup query, resulting in lock escalation, which in turn causes deadlocks. SQL Server attempts to escalate to a table-level lock once a single statement holds roughly 5,000 locks on a table, so a 10,000-row delete batch routinely escalates and blocks the other endpoint instances running the same cleanup.
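One way to confirm that escalation is actually happening is to look at the lock-promotion counters for the outbox table (table name assumed, as above):

```sql
-- Non-zero index_lock_promotion_count values confirm that lock escalation has occurred.
SELECT  i.name AS index_name,
        s.index_lock_promotion_attempt_count,
        s.index_lock_promotion_count
FROM    sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID(N'dbo.MyEndpoint_OutboxData'), NULL, NULL) AS s
JOIN    sys.indexes AS i
    ON  i.object_id = s.object_id
    AND i.index_id  = s.index_id;
```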
Workarounds
When the instances are hosted in Kubernetes and scaled using StatefulSets, it is possible to get a stable instance id (a number from 0 to n) via the pod.beta.kubernetes.io/petset-index annotation. This can be mapped to an environment variable in the container and later used to disable Outbox clean-up on all instances except instance 0.