-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TransactionAbortedException when performing queries in parallel inside a transaction scope #1675
Comments
@joostmeijles Thank you for the repro. I'll back to you if I had any questions to reproduce the issue. |
@DavoudEshtehari any update on an ETA to verify that this is a bug and get it fixed? |
@joostmeijles I confirm it's reproducible with MDS and we'll return to it after the next release. I'm afraid we can't provide any estimation for a fix while it's in the investigation stage. We'll keep updating here with any findings, and we welcome any contribution as this is an open-source driver. |
@joostmeijles can you test with a reduced MaxPoolSize to 20? I wonder if this is something similar to #422? |
Changing the pool size to 20 does not change much, but setting it very large (e.g. 1000) seems to trigger the problem less often. |
@joostmeijles one more question, were you able to see the same problem on Windows 10? |
@joostmeijles regarding the comment in the repro // NB. Using a single connection and opening it once does NOT trigger the error does this mean if you open a connection first and then try the rest of the code is working fine or just a single connection? |
It means changing the code to:
does not trigger the error. |
Yes, same. |
@joostmeijles couple of things I noticed:
|
Edit: Timeout isn't the issue here.
Useful blog post: https://docs.microsoft.com/en-za/archive/blogs/dbrowne/using-new-transactionscope-considered-harmful |
An error possibly occur within seconds, so doesn't seem to hit a timeout. Also, the 100K parallel operations are just to make it easier to reproduce. I have seen the issue also with less (~100) parallel operations. Creating the transaction scope as follows (as mentioned in the blog post + async option) results in the same error.
|
Without connection pooling (on the connection string or by clearing the pool before the second connection) |
Just posting an update: I could reproduce this issue with This issue has been introduced by PR #543 for issue #237, which explains why it doesn't occur against SDS. |
@joostmeijles Using the NuGet package from the pipeline that is introduced with PR #1814 I experienced some improvement with this issue. May I ask you to verify it with the provided link, please? |
@DavoudEshtehari thanks for the update! Sounds as a good change, but unfortunately I am still getting an error, although its a different one:
|
@joostmeijles thank you for sharing the result. |
It fluctuates, doing 100K iterations. Some runs pass, others still bring up the error 1 or 2 times. |
I can reproduce it under both MDS-5.1.0-preview2.22314.2 and SDS-4.8.5 on Windows10, under NET5, NET6, and NET7. Making a query is not even required to reproduce (might take a few runs). Uncommenting LinqPad: async Task Main()
{
var tasks = Enumerable.Range(0, 100_000).Select(PerformWork);
await Task.WhenAll(tasks);
"DONE!".Dump();
}
const string connString = "Server=.\\SQL2019;Database=TempDB;Application Name=TestApp;Trusted_Connection=True;TrustServerCertificate=True;Pooling=True;Encrypt=True;Max Pool Size = 5000;";
static TransactionScope CreateTransactionScope() =>
new TransactionScope(TransactionScopeOption.Required,
new TransactionOptions { IsolationLevel = System.Transactions.IsolationLevel.ReadCommitted }, TransactionScopeAsyncFlowOption.Enabled);
static async Task PerformWork(int num)
{
try
{
using (var ts = CreateTransactionScope())
{
for (int i = 0; i < 2; ++i)
{
await using (var dbConn = new SqlConnection(connString))
{
await dbConn.OpenAsync();
await Task.Delay(1);
} // Connection is disposed (and thus closed)
}//for
//ts.Complete();
}//ts
}//try
catch (Exception e)
{
Console.WriteLine($"Failed {num}");
e.Dump(); Environment.Exit(-1);
throw;
}
}
|
What you're trying to do is unsafe and I don't agree this is bug. Old behavior before v1.1.4 was a bug where accounting was not done for externally aborted transactions. I would request a read on #237 (comment) and #237 (comment) to understand the root cause. How distributed transactions work is that they are decoupled from your application and are running within the MS DTC service. Which means if your app tries to be super-fast and expect MS DTC to keep coordinating with SQL Server for the MISSED commit/rollbacks, it needs time to do so, as it needs to abort the transaction on the server. From driver end, if you closed a POOLED connection, server-side connection is kept open. So now if you happen to re-open the connection BEFORE MSDTC could terminate your previous transaction, you find that AFTER opening the connection, transaction gets aborted. It should never happen that the same OLD transaction continues to be used, and I hope you'd agree with me on that! So, someone must keep things in order. Either it's your app, or the driver. If you want driver to consistently terminate your connection alongwith Distributed transaction, DON'T use connection pool! Likewise, in the above repro, if you do call I would certainly recommend documentation on this somewhere so it's not a surprise to end users as the default behavior of driver is with pooling enabled, which is clearly not intuitive to users. cc @David-Engel |
Also, to address @DavoudEshtehari 's PR #1814 and why it made visible difference: When the PR changed Pool design from Stack to a Queue, it's clear that instead of fetching the latest connection from stack, you'd always get the oldest connection, which gives more time to the recently added connection in the pool to terminate transaction. Since it's just pushing the latest connection to end of queue, it's not a solution IMO. You can still reproduce this behavior with a single connection going in and out of the pool if you limit your pool size. |
As far as I understand the provided example code should not result in a distributed transaction as the connection gets closed before the next connection is opened. Please note that I am not using the MS DTC service (its not running on my machine). And I think that the .NET version I use does not support distributed transactions? |
@cheenamalhotra There are no distributed transactions in my LinqPad example. MSDTC escalation does not happen because only one connection-string is used. If you replace When the 1st connection-open request occurs, it should enlist that connection into the Transaction. Then 1st-connection-close occurs (but the inner-connection is not really closed - just returns to the pool). Then the 2nd connection-open request should receive the exact same already-open inner-connection back from the pool (ie. the same one durable resource is enlisted). This is exactly what happens when |
This is interesting and I'd agree I didn't look into that to be the possibility, so it certainly points to something missing out with
I possibly was imagining .NET 7, but never mind. It's still a case of delegated transaction as we're looking at Let me dive into it and will get back. |
I reproed again with
The problem seems to be the fact the driver or System.Transactions (to be precise) even attempts to promote transaction, where it should not and does not in a normal flow. The trigger scenario is when below call fails to enlist connection into delegated transaction and returns SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlInternalConnection.cs Line 529 in eb1b922
The transaction promotion is triggered by System.Transactions when the driver attempts to grab transaction cookie here: SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlInternalConnection.cs Line 586 in eb1b922
As you can see it leads to calling cc @roji do you know why would System.Transactions try to make the transaction distributed, calling promote, in .NET 6? |
@DavoudEshtehari Tried setting the timeout to 180 seconds, issue still occurs. Also tried 60 seconds and 10 minutes ( |
Mostly noting this down as another point of data for the team. After upgrading our project to EF 7/SqlClient 5.1.0 from EF 6/SqlClient 5.0.1, we've started receiving this exception intermittently on our CI while running our test suite. It happens roughly once in 3 runs, which is annoying enough to be noticeable given we have had a fairly stable build up until this point. Setup:
|
I can't see the issue with the repro by upgrading to Windows 11 version 21H2 (OS Build 22000.1936). Please, share your results. Update: |
@DavoudEshtehari I have tested Nuget I have also tested the exact same repro-code with Nuget Can this fix please be ported to the |
IMHO this is a significant bug, and Microsoft should fix it in S.D.SqlClient as well (which a vast number of enterprise software still depends on). |
I'll bring up your concern in the next bug triage meeting and will update here if there's a different outcome. |
Verified on my machine: Works like a charm! Fantastic work @DavoudEshtehari 🥇 |
First thank you very much for prioritizing this issue appropriately. I just wanted to share what we've encountered till now in our tests; Some time ago we migrated from Recently, due to https://msrc.microsoft.com/update-guide/vulnerability/CVE-2024-0056, we tried to upgrade from 3.1.1 to 3.1.5; We started to encounter two different issues from time to time (~20 times per day out of ~2M requests & tens of thousands background tasks):
We also tried to migrate to 5.1.x/5.2.x, but I think that these issues were much more common (~30 times per day); Of course all these metrics might be influenced by actual usage (we tried the migration in various environments which are more or less similar in terms of infrastructure, etc); Based on my understanding this PR: #1801 (3.1.x: #1912) introduced a race condition between I saw that I wanted to ask if there are plans to backport #2301 to older versions like 3.1.x. @DavoudEshtehari Do you know something about it at this moment, or is it too early? Setup: |
Hi @razvalex, This fix is slated to be considered for backporting to all the supported servicing versions, with the initial release expected to be 5.1.5. However, there is no specific timeline set for these releases as of now. |
@DavoudEshtehari, I wanted to inform you that we evaluated the performance of https://github.com/dotnet/SqlClient/releases/tag/v5.2.0-preview5 in one of our environments, which handles approximately 2 million daily requests and tens of thousands of background tasks. The results were very positive. Thanks again! For us it would be helpful to also have #2301 backported to 3.1.x, but I think we can also manage to update everything to 5.1.5 if that won't be the case. |
@razvalex 5.1.5 has been released |
@ErikEJ Thank you. This week, we're aiming to upgrade the majority of our projects to version 5.1.5. I'll provide you with an update on the outcomes soon (mid-February). Backporting #2301 to version 3.1.x would help us (for example) to avoid extra handling for Encrypt (#1210) in certain older projects. |
@razvalex eventually the fix will be backported to 3.1, date yet to be determined. |
@razvalex Thanks for sharing the result. It sounds promising. |
We were able to upgrade the majority of our projects from 3.1.5 to version 5.1.5 with great success; This week we are planning to upgrade from 5.1.5 to 5.2.0, mainly due to increased performance in some key areas (ex: #1544). I was wondering if there are any plans for backporting #2301 to 3.1.x as some of our older projects might benefit having this fix in place. Also, upgrading them to a newer version means dealing with breaking changes (ex: Encrypt) and might be riskier. |
Describe the bug
When performing multiple queries in parallel, and each pair of 2 queries are inside a transaction scope, an unexpected transaction error is thrown (see below).
To reproduce
Below code reproduces the error. It runs 100K times 2 queries in parallel where each 2 queries are inside a single transaction scope.
Note that after each query the database connection is disposed (and thus closed). This is important to trigger the error.
When we change the code, and use 1 connection for both queries and only open 1 connection the error does not occur (see also code comments below).
Expected behavior
Being able to use multiple connections (with the same connection string) in sequence within the same transaction without errors.
Further technical details
Microsoft.Data.SqlClient version: 4.1.0
.NET target: .NET 6
SQL Server version: SQL Server 2019
Operating system: Windows 11
The text was updated successfully, but these errors were encountered: