Concurrent connection usage and MARS #1234
Comments
Windows or Linux will be an important factor. |
Good point, this happened on my machine running on Ubuntu 21.04. |
I will do a test run on my Windows laptop. |
I can't get any of the SQL Server tests to run on my machine using 4.0.0-preview1.21237.2. I get:
Is there something else I need to configure on my machine to make these tests run? Full output:
|
Yes, you need to add Encrypt=false to your connection string; SqlClient 4.0.0 has Encrypt set to true by default (#1210). |
Preview 4 has a security change which forces encryption and certificate trust. You either need to disable encryption (not recommended, because encryption is what users will have enabled) or add |
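For reference, a minimal example of both ways around the new default (server, database, and credentials are placeholders; the TrustServerCertificate alternative is my assumption about the elided option above):

    // Option 1: turn encryption off for local testing (as suggested above)
    "Server=localhost;Database=test;User ID=sa;Password=<your password>;Encrypt=false"

    // Option 2 (assumed alternative): keep encryption on and trust the server's self-signed certificate
    "Server=localhost;Database=test;User ID=sa;Password=<your password>;Encrypt=true;TrustServerCertificate=true"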
Not seeing a regression on Windows: With 2.1:
With 4.0 preview:
|
Ok, Managed SNI it is. @roji were your tests with encryption off? And do you know if MARS was on or off? |
Yeah, Encrypt=false and no MARS. Can also run any other tests here - just let me know. |
No MARS is bad news. That means it's not just the usual thread pool exhaustion, because with MARS off, sync IO is really synchronous in managed SNI. |
Okay, so ran with managed SNI on Windows using:

static SqlServerTestStore()
{
    var ManagedNetworkingAppContextSwitch = "Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows";
    AppContext.SetSwitch(ManagedNetworkingAppContextSwitch, true);
}

My connection string is:
Now the test runs for about 10-15 seconds, but then consistently fails with:
|
Sounds very similar/related... |
That does smell like threadpool exhaustion. The immediate easy test is to artificially inflate the thread pool min threads to the number of connections you expect to open, in this case 100, and then see if it still fails. If it all works then we know what the problem is. Fixing it, though, is a different matter. |
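A minimal sketch of the suggested check (the figure of 100 comes from the comment above; adjust to the number of connections the test opens):

    // Keep at least 100 worker and IO completion threads available so blocked
    // sync-over-async work can't starve the pool while the test runs.
    ThreadPool.GetMinThreads(out var workerThreads, out var completionPortThreads);
    ThreadPool.SetMinThreads(Math.Max(workerThreads, 100), Math.Max(completionPortThreads, 100));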
Some more info: this fails in exactly the same way with SqlClient 2.1 on Windows with managed SNI. Increasing thread pool size doesn't make any difference.

static SqlServerTestStore()
{
    var ManagedNetworkingAppContextSwitch = "Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows";
    AppContext.SetSwitch(ManagedNetworkingAppContextSwitch, true);
    ThreadPool.GetMaxThreads(out var workerThreads, out var completionPortThreads);
    ThreadPool.SetMaxThreads(workerThreads * 20, completionPortThreads * 20);
} |
@Wraith2 I agree this indeed smells like TP exhaustion... But @ajcvickers I don't think this test was slow for me on Linux with SqlClient 2.1 (which is what we're currently using in EF Core)? Weird... |
@roji There is clearly some Windows/Linux difference here, not just managed/native difference. |
The areas that are affected by platform are mainly how pool keys are derived and how networking is implemented; both are switched on the AppContext switch you've used, so you're effectively running in Linux mode on Windows when using that script. I've been able to repro Linux-only problems on Windows using that switch. Of course there could be some API call which isn't behaving the same way, but that would be a bug in its own right that needs addressing. For the original reported issue, would it be possible to get an EF-free repro? |
Hi @roji, @ajcvickers and @Wraith2 |
Sure, I'll do that in the coming days. |
Sat down to do a minimal repro (see below), and found some interesting things - I was unfamiliar with the environment of the test in question:
Stepping back... if this (IMHO questionable!) practice is officially supported, then you guys seem to have a bug in current versions, which manifests much more reliably/often in 4.0.0-preview.1. Since this seems to work flawlessly on Windows with unmanaged networking, I suspect users porting to non-Windows platforms are hitting this from time to time. See below for the repro: leave it running against 3.0.0 and 4.0.0-preview.1, and also compare with unmanaged networking.

Repro

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows", true);
ThreadPool.SetMinThreads(200, 200);
const string ConnectionString =
"Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false;Multiple Active Result Sets=true";
using (var conn = new SqlConnection(ConnectionString))
{
    conn.Open();

    using var cmd = new SqlCommand
    {
        Connection = conn,
        CommandText = @"
DROP TABLE IF EXISTS [Lists];
CREATE TABLE [Lists] (
    [Id] INTEGER,
    [IsDeleted] BIT
)"
    };

    cmd.ExecuteNonQuery();
}

while (true)
{
    var stopWatch = Stopwatch.StartNew();

    using var conn = new SqlConnection(ConnectionString);
    conn.Open();

    Parallel.For(
        0, 100,
        i =>
        {
            using var cmd = new SqlCommand
            {
                Connection = conn,
                CommandText = @"
SELECT [l].[Id], [l].[IsDeleted]
FROM [Lists] AS [l]
WHERE ([l].[IsDeleted] = CAST(0 AS bit)) AND [l].[Id] IN (1, 2, 3)"
            };

            using var reader = cmd.ExecuteReader();

            // using var context = contextFactory.CreateContext();
            // var query = context.Lists.Where(l => !l.IsDeleted && ids.Contains(l.Id)).ToList();
        });

    Console.WriteLine($"Completed in {stopWatch.ElapsedMilliseconds}ms");
} |
That's a relief, sort of. It means that longstanding behaviour I don't know how to fix is still the problem and that we don't have another.
Ew. Not supported. If you were actually making concurrent calls to anything other than Open, I expect you'd see really scary exceptions that warn about concurrent usage, but since this is about Open it's not unsafe or easily detectable until you've opened.
That should make it easier to trace which is good. |
FWIW I completely agree it shouldn't (and it's even documented that way) - so this is our bad on the EF Core side. But I'm a little worried by the fact that the scenario works perfectly on unmanaged Windows - it may indicate that thread-safety was somehow an original goal. But more importantly, the distinction between thread-safety and multiple result sets is a bit subtle for the unsuspecting new user, and I'm guessing there's lots of code out there which does concurrent usage (and would break when ported).
Definitely agree. |
From what I remember, MARS is there to allow sloppy programming. COM ADO had a fun behaviour whereby if you had an open ResultSet and tried to open another, it would use a new thread to open a second one for you silently in the background and handle the marshalling for you (at least in VB6), which let you do things like open a transaction and then start updating one resultset while opening a second resultset inside that same transaction to look up data. People liked this because it meant that the code to do such things was a lot smaller than having to think and code fully transactionally and deal with rollbacks etc. As you can imagine, it got into all sorts of Line Of Business and ASP programs. When the migration to .NET was being pushed, it was a major upgrade blocker if you couldn't do the same thing on .NET, so it had to be implemented. I think MARS is a terrible idea and that no-one should use it. Sadly there's no way anyone will ever let me take it out of the driver.

MARS in SqlClient strictly applies to allowing multiple SqlDataReader instances to be active at one time on the same connection, and works by multiplexing the traffic using the MC-SMP protocol. In practical terms it changes the way data is received by decoupling the packet receipt loop from the reader, because otherwise you can deadlock the two SqlDataReaders. The receive loop with MARS enabled is implemented with async callbacks, so just having MARS enabled, even if you aren't using it, puts you at the mercy of threadpool exhaustion. TLDR: just don't use MARS, it's not the right solution to whatever problem you're having.

As far as thread safety goes, I'm not aware of anything being thread safe in SqlClient, but also nothing cares about what thread it's running on (apart from connecting with impersonation), so as long as your access is sequential you're unlikely to have a problem. I suspect the test case is serializing connection attempts at the pool level, so even though the caller is concurrent the internals are likely sequential and you see no error. You'll probably want an official answer on this, I'm just recalling what I've seen in the code. |
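For anyone unfamiliar with the feature under discussion, a minimal sketch of the one thing MARS enables: two SqlDataReader instances open at the same time on a single connection (the connection string and table are placeholders borrowed from the repro below):

    using Microsoft.Data.SqlClient;

    // Placeholder connection string; MultipleActiveResultSets=true is the MARS switch.
    const string connectionString =
        "Server=localhost;Database=test;User ID=sa;Password=<your password>;Encrypt=false;MultipleActiveResultSets=true";

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    using var cmd1 = new SqlCommand("SELECT [Id], [IsDeleted] FROM [Lists]", conn);
    using var reader1 = cmd1.ExecuteReader();
    reader1.Read();

    // Without MARS, this second ExecuteReader throws because reader1 is still open;
    // with MARS, both result sets are multiplexed over the same physical connection.
    using var cmd2 = new SqlCommand("SELECT COUNT(*) FROM [Lists]", conn);
    var count = (int)cmd2.ExecuteScalar();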
This is indeed #422.
I wouldn't say that, because this is a client design issue; I have a feeling it's related to the SNIMarsHandle queue design. The problem is not with the multiple data reader design. We need to remember that MARS reads are done using "async" reads, which use the blocking queue design, and that design has been my suspicion as the culprit. But the bigger challenge is that we've only known this one design pattern in managed SNI, so we need to really break down the design pattern and come up with something that would solve this sync/async contradiction, or do things synchronously in that space. |
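To make the pattern being described concrete, here is my own illustration (not the actual SNIMarsHandle code) of a blocking-queue receive design: packets arrive via async callbacks on thread pool threads and are pushed into a queue, while a synchronous reader blocks waiting on that queue. If every pool thread is already blocked in Receive, the callback that would unblock them may never get scheduled.

    using System.Collections.Concurrent;

    sealed class PacketQueue
    {
        private readonly BlockingCollection<byte[]> _packets = new();

        // Invoked from an async IO completion callback, i.e. on a thread pool thread.
        public void OnPacketReceived(byte[] packet) => _packets.Add(packet);

        // Invoked by the synchronous code path; blocks the calling thread until a packet arrives.
        public byte[] Receive() => _packets.Take();
    }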
I agree. IMHO, I think we should separate sync and |
It immediately hits an assert in debug mode. Add this to the list of reasons we need to be able to run tests in debug mode. |
@roji what exception are we expecting to see here? Or aren't we? In debug mode the only problem it's having is caused by the fast re-use of a single session. The session identifier is limited to a 16-bit integer, and if you create and destroy sessions fast enough you eventually wrap around and hit a number that's already in use, which will bubble up as a connection failure. |
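An illustrative sketch of the wraparound just described (the counter and collision check here are mine, not the real MARS session-id bookkeeping): a 16-bit id allocator collides as soon as it wraps onto an id that is still alive.

    using System;
    using System.Collections.Generic;

    ushort nextId = 0;
    var liveIds = new HashSet<ushort> { 0 }; // pretend session 0 is still open

    for (var i = 0; i < 70_000; i++)
    {
        nextId++; // wraps from 65535 back to 0
        if (liveIds.Contains(nextId))
        {
            Console.WriteLine($"Collision after {i + 1} allocations: id {nextId} is already in use");
            break;
        }
    }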
If you run the repro from Shay, you won't get an exception using v3.0.0, and queries continue (even though slowly) without exception. I'd say compare behavior before and after your change to understand its impact. |
I'm using Main which is 4.0 preview currently. No crash. I get a slow start because I'm not forcing the threadpool min as I wanted to see any long blocking behaviour but the only crash is from exceeding 65000+ sessions. |
Ok, that's just normal timing out, which is what you'd expect. It looks like I can just run it fast enough to succeed most of the time. I've seen large numbers of threads waiting at SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNIMarsHandle.cs line 530 (commit fbfd5d7).
|
I don't think that's a fair analysis or "normal" behavior. The exception is consistently reproducible with the new change, and the code also "consistently" passes with the previous design. There's likely a deadlock in the SNI layer if threads are waiting and not receiving packets. |
So I think it's OK to say "if you use SqlConnection concurrently, the behavior is undefined" (though again, keep in mind this seems totally supported on unmanaged Windows). A better option is to actually detect these cases and throw a "connection used concurrently" InvalidOperationException, which would really help users find errors in their code. Npgsql does this, but in a way that isn't 100% reliable, and I've generally lost a lot of time over the years helping users find their accidental concurrent usage.
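As a rough illustration of the kind of detection being suggested here (my own sketch, not SqlClient's or Npgsql's actual implementation), a connection can flag overlapping use with an interlocked guard taken at the start of each operation and released when it completes:

    using System;
    using System.Threading;

    sealed class SingleUseGuard
    {
        private int _inUse; // 0 = idle, 1 = an operation is in flight

        public void Enter()
        {
            if (Interlocked.CompareExchange(ref _inUse, 1, 0) != 0)
                throw new InvalidOperationException(
                    "This connection is already in use by another operation; concurrent use is not supported.");
        }

        public void Exit() => Volatile.Write(ref _inUse, 0);
    }

Calling Enter() at the start of Open/ExecuteReader and Exit() when the operation (or the returned reader) completes catches most accidental concurrent use, though, as noted above, not 100% reliably.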
This was true for me - but only on unmanaged Windows. 3.0.0 with managed does eventually throw an exception for me, though it took a lot more time than 4.0.0; if that's true, then there may not be an actual regression here...
Yeah, this is how Npgsql is architected, and it has worked quite well - it's a good way to fully support sync and async without code duplication, and you can still split implementations on a per-function basis where you need to. |
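For context, the Npgsql approach referred to here is, to my understanding, a single implementation that takes an async flag, so the synchronous path never blocks on async machinery; a hedged sketch (the helper name and signature are made up for illustration):

    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    static class IoHelper
    {
        // One shared implementation; the caller decides whether IO is awaited or done synchronously.
        private static async Task<int> ReadSomeCore(Stream stream, byte[] buffer, bool async, CancellationToken ct)
        {
            return async
                ? await stream.ReadAsync(buffer, 0, buffer.Length, ct).ConfigureAwait(false)
                : stream.Read(buffer, 0, buffer.Length); // truly synchronous, no sync-over-async
        }

        // Thin public wrappers; the sync one only ever sees an already-completed task.
        public static int ReadSome(Stream stream, byte[] buffer)
            => ReadSomeCore(stream, buffer, async: false, CancellationToken.None).GetAwaiter().GetResult();

        public static Task<int> ReadSomeAsync(Stream stream, byte[] buffer, CancellationToken ct = default)
            => ReadSomeCore(stream, buffer, async: true, ct);
    }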
I look forward to seeing your analysis.
I can't see one, or any evidence of anything other than behaviour consistent with not being able to schedule a task. It took me months to get it this far and then more months to get it reviewed. If there is evidence of a problem with it then I'm happy to help out, but I can't see any. There's a performance problem, but lacking a reproduction of that problem (and not just #422) I can't do much about it. I'm pretty sure the assertions of demuxer lock entry in the SNIMarsHandle methods are outdated and can be removed; the demuxer lock only has to guard the annotated variables and anything that sends a packet, including the SNIMarsHandle ctor, because someone was stupid when they wrote it. |
I have tried @roji's repro with version 3.0.0 and it works fine, but with the newest preview it fails right away as below:

@Wraith2 I am afraid I believe this is a regression, and it has eliminated the only workaround we had for issue #422. Something interesting: I was trying to compare behavior with PLINQ (the code below), as the original issue #422 was with PLINQ and

Repro

static void Main(string[] args)
{
    AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows", true);
    //ThreadPool.SetMinThreads(200, 200);

    SqlConnectionStringBuilder builder = new()
    {
        DataSource = "localhost",
        InitialCatalog = <database name>,
        UserID = "sa",
        Password = <your password>,
        ConnectTimeout = 60,
        ConnectRetryCount = 0,
        Encrypt = false,
        MultipleActiveResultSets = true
    };

    while (true)
    {
        RunWithParallelFor(builder.ConnectionString);
    }
    //RunWithPLinq(builder.ConnectionString);
}

private static void RunWithPLinq(string connString)
{
    using SqlConnection conn = new(connString);
    conn.Open();

    Enumerable.Range(0, count).
        AsParallel().
        WithDegreeOfParallelism(count).
        ForAll(n => RunSqlCommand(conn));
}

private static void RunWithParallelFor(string connString)
{
    using var conn = new SqlConnection(connString);
    conn.Open();

    Parallel.For(
        0, 100,
        i =>
        {
            RunSqlCommand(conn);
        });
}

private static void RunSqlCommand(SqlConnection conn)
{
    var stopWatch = Stopwatch.StartNew();

    using var cmd = new SqlCommand
    {
        Connection = conn,
        CommandText = @"
SELECT [l].[Id], [l].[IsDeleted]
FROM [Lists] AS [l]
WHERE ([l].[IsDeleted] = CAST(0 AS bit)) AND [l].[Id] IN (1, 2, 3)"
    };

    try
    {
        using SqlDataReader reader = cmd.ExecuteReader();
        Console.WriteLine($"Current thread {Thread.CurrentThread.ManagedThreadId} Completed in {stopWatch.ElapsedMilliseconds}ms");
    }
    catch (SqlException ex)
    {
        Console.WriteLine(ex.Message);
    }
}

private static void TableMaintainer(string connString)
{
    using var conn = new SqlConnection(connString);

    try
    {
        conn.Open();

        using var cmd = new SqlCommand()
        {
            Connection = conn,
            CommandText = @"
DROP TABLE IF EXISTS [Lists];
CREATE TABLE [Lists] (
    [Id] INTEGER,
    [IsDeleted] BIT
)"
        };

        cmd.ExecuteNonQuery();
    }
    catch (SqlException ex)
    {
        Console.WriteLine(ex.Message);
    }
} |
Yes, it is a regression; I thought that much was clear from the very start of the discussion. Assuming it's from the MARS rework is unproven. I agree that it's possible and maybe even likely, but so far there is no evidence. It won't crash with anything other than a duplicate key under a debugger for me, and we're going to need a debugger to investigate.

What workaround for #422? The only workaround I knew of was to not starve async of threads. I managed to implement a custom scheduler and get it working so that all MARS async IO is done on dedicated threads. Still incredibly slow, and the profiler shows the contention time allocated to the _parserLock, which makes sense to me since we've got up to 100 threads contending for the same parser. |
Just to note that running my repro above against 3.0.0 (with managed networking) does fail - it just takes a bit of time (the program eventually blocks and then throws unpredictably). So I'm not sure there's a regression here in 4.0.0-preview.1. |
Interestingly, if you manage to force MARS async IO off thread pool threads and onto dedicated threads, the number of thread pool threads required (and created) drops significantly. The time taken to execute doesn't drop, because the majority of it is taken up on the parser lock, but in general the profile looks a lot cleaner. I couldn't work out how to do this before, but working through some of the Postgres issues you had with the context switcher let me learn what I needed. |
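One way to experiment with the dedicated-threads idea described above (my own sketch, not the custom scheduler referred to in these comments): a TaskScheduler that runs queued work on its own background threads instead of the thread pool.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    sealed class DedicatedThreadScheduler : TaskScheduler, IDisposable
    {
        private readonly BlockingCollection<Task> _queue = new();

        public DedicatedThreadScheduler(int threadCount)
        {
            for (var i = 0; i < threadCount; i++)
            {
                var thread = new Thread(() =>
                {
                    // Drain the queue on a dedicated thread; TryExecuteTask runs each task here.
                    foreach (var task in _queue.GetConsumingEnumerable())
                        TryExecuteTask(task);
                })
                { IsBackground = true };
                thread.Start();
            }
        }

        protected override void QueueTask(Task task) => _queue.Add(task);

        // Never run tasks inline on the caller's (potentially thread pool) thread.
        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

        protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

        public void Dispose() => _queue.CompleteAdding();
    }

Work can then be pushed onto it with, for example, Task.Factory.StartNew(work, CancellationToken.None, TaskCreationOptions.DenyChildAttach, scheduler).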
As part of running some compatibility checks between EF Core 6.0 and SqlClient 3.0.0 and 4.0.0-preview1.21237.2 (dotnet/efcore#25766), I came across what looks like a considerable perf regression. While the EF test suite completed in around 4m30s with SqlClient 3.0.0, it took around 12 minutes with 4.0.0-preview1.21237.2.
Some digging seems to point at test QueryBugsTest.Thread_safety_in_relational_command_cache; with 3.0.0 the total run completes in 20 seconds (including all startup), whereas with 4.0.0-preview1.21237.2 it takes around 5m15s. This test runs 100 parallel queries - no big data is being transferred, and purely sync I/O is used.
To run this, you can check out the EF Core code base, run build.cmd to get the latest dotnet core, go inside test/EFCore.SqlServer.FunctionalTests and run
dotnet test --filter Thread_safety_in_relational_command_cache