
After SqlException application state is corrupted #97

Closed
epignosisx opened this issue Oct 26, 2018 · 25 comments
Labels
🐛 Bug! Issues that are bugs in the drivers we maintain.


@epignosisx

Environment:
OS - Windows Server 2012 R2
.NET Core: 2.0.5
SQL Server 2016

Exception:

System.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.) ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at bool System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, uint waitForMultipleObjectsTimeout, bool allowCreate, bool onlyOneCheckConnection, DbConnectionOptions userOptions, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions)
   at bool System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource<DbConnectionInternal> retry)
   at Task System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)
   at async Task<byte[]> Microsoft.Extensions.Caching.SqlServer.DatabaseOperations.GetCacheItemAsync(string key, bool includeValue, CancellationToken token)
   at async Task Microsoft.Extensions.Caching.SqlServer.DatabaseOperations.RefreshCacheItemAsync(string key, CancellationToken token)
   at async Task Microsoft.Extensions.Caching.SqlServer.SqlServerCache.RefreshAsync(string key, CancellationToken token)
   at async Task Microsoft.AspNetCore.Session.DistributedSession.CommitAsync(CancellationToken cancellationToken)
   at async Task Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
ClientConnectionId:f8356379-665e-484b-9093-e8e99ae10a72
Error Number:258,State:0,Class:20

This issue has come up in production on two separate occasions this week, and once the ASP.NET Core app enters this state, only restarting the app resolves the problem. A few points I want to share:

1. It seems like something in the connection pool is getting corrupted once this happens, since a restart fixes it.

2. The error seems related to an SSL issue, based on the exception details (provider: SSL). However, our connection strings do not specify the Encrypt attribute.

3. There are a few issues reported with the same exception, but in Linux/Mac environments. In this case it is an all-Windows environment.

@jhudsoncedaron

jhudsoncedaron commented Oct 26, 2018

Connections that come out of the connection pool can be bad and have a trashed transaction isolation level. I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.

The solution for both is the same darn thing:

    for (;;) {
        var connection = new SqlConnection(connectionString);
        try {
            connection.Open();
            using (var cmd = connection.CreateCommand()) {
                // Default; use whatever isolation level your connections should be at
                cmd.CommandText = "SET TRANSACTION ISOLATION LEVEL READ COMMITTED;";
                cmd.ExecuteNonQuery();
            }
            break; // got a good connection
        } catch (DbException) {
            // bad connection from the pool; dispose it and pull the next one
            connection.Dispose();
        }
    }

The bad connections don't get returned to the pool a second time, so the for loop drains all the bad ones out and you're ready to go again.

@Wraith2
Contributor

Wraith2 commented Oct 26, 2018

I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.

It's odd behaviour. Do you remember the details of why it was intended?

@jhudsoncedaron

Back in the .NET Framework 1.0/2.0 days, SqlConnection was a wrapper around a native library that did all the work, including the pooling, and had the "inherit transaction isolation level" behavior they dared not fix in the native code for backwards compatibility. (I think a native caller could tell whether it came out of the pool or not.) I filed the bug on MS Connect against .NET Framework 2.0/3.5 and was given this explanation.

The other half is about bad connections: connections that were good when Close() was called might not be when you get them back out, which is a problem inherent to pooling unless active code is written to verify the state. Just about every pool API from that era had the same problem. This led to "object cesspool" appearing as an anti-pattern for a while before some other people obliterated it off the lists. Once you know about it (which the documentation doesn't tell you), it's easy to fix.

When #12072 was left unfixed too long I went through a bunch of reflection work to implement my own pool. My code depends on the exact implementation of SqlConnection now. I don't have to care anymore.

@epignosisx
Author

@saurabh500, we got a memory dump we can share, hopefully you can get something out of it.

The only change in the application is that it started to get more load than it used to.

The issue is happening weekly, so if you need anything to help troubleshoot we could probably provide it.

@saurabh500
Contributor

@afsanehr @keeratsingh to help on this issue.

@saurabh500
Contributor

@epignosisx It would be good to provide the version of SqlClient that you are using as well, since SqlClient is a library that doesn't ship as part of the .NET Core SDK.

1- It seems like something in the Connection Pool is getting corrupted once this happens. Since a restart fixes it.

I am curious: why do you think that the connection pool is corrupted? Is it because subsequent connections after this failure start failing?

2- The error seems related to some SSL issue based on the exception details (Provider SSL). However, our connection strings do not specify the Encrypt attribute.

The login in SQL Server happens over SSL/TLS. What is happening in this case is that a connection is being retrieved from the pool, i.e. a new TCP connection is being created. After the TCP connection is established, we have a pre-login phase and a login phase. At the end of the pre-login phase, the SSL handshake is established with the server, which doesn't seem to have completed in time in the above error message. Are you seeing that, after one connection failure, all subsequent connections fail quickly?

@epignosisx
Author

epignosisx commented Oct 30, 2018

The System.Data.SqlClient version is 4.4.2.

I am curious, why do you think that the connection pool is corrupted? Is it because the subsequent connections after this failure, start failing?

Once this error occurs, subsequent requests start failing consistently; it goes away once the app pool is recycled. This behavior, in addition to the stack trace, is what makes me think it might be something getting corrupted in the connection pool. There are 3 other instances of the app running on other physical servers that did not experience the issue. This helps rule out suspicions that SQL Server was acting up.

The SQL operations that are going through this server are very simple. In fact, it is not our code, it's part of the Microsoft.AspNetCore.Caching.SqlServer package.

SQL queries:
https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/SqlQueries.cs

I did find that when making the queries, the code was not disposing of the SqlCommand and the SqlDataReader, but the SqlConnection is disposed; I'm not sure if this could be related or if it is even needed. Here are a few samples:

https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/DatabaseOperations.cs#L281

https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/DatabaseOperations.cs#L288

The Login in Sql Server happens over SSL/TLS. What is happening in this case is, a new connection from the pool is being retrieved, i.e. a new TCP connection is being created. However after TCP connection establishment, we have a pre-login phase and login phase. At the end of Pre-Login phase, the SSL handshake is established with the server, which doesn't seem to have gone through in time in the above error message. Are you seeing, that after one connection failure, all subsequent connections are failing quickly?

Understood. Thanks for the explanation. Yes, subsequent connections are failing quickly. HTTP requests are taking 15 ms to complete (IIS logs).

@saurabh500
Contributor

There are a couple of nuances of the connection pool that should be mentioned here.

  1. If an error occurs while getting a connection from the pool, then subsequent attempts to establish new connections to SQL Server fail for a period of time. More information here:
    https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql-server-connection-pooling
    (screenshot of the linked documentation)

  2. About the timeouts in the SqlClient connection pool: the requests for connections in the pool are queued serially. The timeout is computed from when the connection is requested to when it is established. Assume that there are 20 connection requests in the pool, all requested at the same time, i.e. t=0, and the connection timeout is 15 seconds per connection. Let's say the first 3 connections took 5 seconds each to be established. In that case, the 4th connection gets its turn at t=15, at which point the timeout has already been reached, so the connection establishment will time out.

It is possible that you are hitting a combination of the above two issues. Do you have a Minimum Pool Size configured for your connection pool? This makes the connection pool ensure that at least Min Pool Size connections are maintained in the pool.
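The queue arithmetic above can be sketched as follows (illustrative constants mirroring the example, not SqlClient internals):

    // Requests queue serially: the 4th request only gets its turn after the
    // first three connections have been established.
    const double ConnectTimeoutSeconds = 15.0; // per-connection timeout
    const double SetupSeconds = 5.0;           // time each of the first 3 connections took
    int position = 4;                          // 1-based place in the queue

    double turnStartsAt = (position - 1) * SetupSeconds;    // 3 * 5 = 15 s
    bool timedOut = turnStartsAt >= ConnectTimeoutSeconds;  // true: it times out before its turn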

@epignosisx
Author

epignosisx commented Oct 31, 2018

The connection string does not specify a Minimum Pool Size. It looks like this:

SERVER=XXXX;DATABASE=XXXX;UID=XXXX;PWD=XXXX;PACKET SIZE=4096

Our issue did not last a few minutes; it lasted hours, so the timeout logic would have had enough time to recover:

(screenshot)

@saurabh500
Contributor

Thanks @epignosisx for providing the information. Can you change your config to use Min Pool Size so that you have ready TCP connections? Do you know how many concurrent connections are open to SQL Server when this issue happens? I am trying to rule out the possibility of the connection pool's max size being reached (100 by default) and a connection not being obtained in a timely manner.
If you are not reaching the Max Pool Size of SqlConnection, can you email me (email in GitHub bio) pointers to the memory dump? Even if Min Pool Size mitigates your scenario and max pool size is not reached, we need to solve this problem.
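For illustration, Min Pool Size and Max Pool Size can be set directly in the connection string; the values below are hypothetical, not recommendations:

    SERVER=XXXX;DATABASE=XXXX;UID=XXXX;PWD=XXXX;PACKET SIZE=4096;Min Pool Size=10;Max Pool Size=100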

@jhudsoncedaron

At one time, the behavior on reaching the pool size was to throw InvalidOperationException. The last time I tested this was on .NET Framework 2.0, though. We thought we were hitting the pool size legitimately; much later we discovered we had a bug that leaked connections.

@epignosisx
Author

@saurabh500 I'm trying to get the number of concurrent connections to SQL Server when it happened. In the meantime, I can share the memory dump, but I checked your GitHub profile and it is not showing an email address.

@saurabh500
Contributor

@epignosisx Something changed and my email was private. I have made the email available on my profile.

@epignosisx
Author

epignosisx commented Nov 1, 2018

@saurabh500 I emailed a link to the memory dump to you early today.

@saurabh500
Contributor

Ack. I replied back some time ago. I will take a look at the dump.

@epignosisx
Author

Hi @saurabh500, the issue keeps happening at least weekly. Have you been able to take a look at the memory dumps I shared with you via email?

@epignosisx
Author

I took another stab at the memory dumps with my limited Windbg knowledge and I found something curious:

0:007> !DumpHeap -type System.Data.SqlClient.SqlConnection  -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff8fefa8938        1           32 System.Data.SqlClient.SqlConnection+<>c__DisplayClass107_0`1[[System.Data.SqlClient.SqlDataReader, System.Data.SqlClient]]
00007ff8fe6f2000        1           48 System.Data.SqlClient.SqlConnectionFactory
00007ff8fefa5058        1           80 System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib],[System.Data.SqlClient.SqlConnectionStringBuilder+Keywords, System.Data.SqlClient]]
00007ff8fe6f8dd0        2           96 System.Data.SqlClient.SqlConnectionPoolGroupProviderInfo
00007ff8fe6d4540        2          336 System.Data.SqlClient.SqlConnectionString
00007ff8fe764bb0       12          576 System.Data.SqlClient.SqlConnectionTimeoutErrorInternal
00007ff8fe764f18       47         1128 System.Data.SqlClient.SqlConnectionTimeoutPhaseDuration
00007ff8fefa6280        1         1152 System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[System.Data.SqlClient.SqlConnectionStringBuilder+Keywords, System.Data.SqlClient]][]
00007ff8fe764fa8       12         1152 System.Data.SqlClient.SqlConnectionTimeoutPhaseDuration[]
00007ff8fef8fee0       49         3136 System.Data.SqlClient.SqlConnection+OpenAsyncRetry
00007ff8fe6f0e38      155         4960 System.Data.SqlClient.SqlConnectionPoolKey
00007ff8fef8fdb0      154         6160 System.Data.SqlClient.SqlConnection+<>c__DisplayClass85_0
00007ff8fe6d06e0     2758       507472 System.Data.SqlClient.SqlConnection   //this number looks suspicious

This is the breakdown of the SqlConnection's innerConnection:

System.Data.ProviderBase.DbConnectionClosedPreviouslyOpened: 2629
System.Data.ProviderBase.DbConnectionClosedNeverOpened: 128
System.Data.SqlClient.SqlInternalConnectionTds: 1

@saurabh500 @Wraith2 Do these numbers say anything to you?

@Code-DJ

Code-DJ commented Mar 26, 2019

We were getting the same exception for a console app.

Windows Server 2019
SQL Server 2016 SP1
Application: dotnet.exe
CoreCLR Version: 4.6.27110.4
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Data.SqlClient.SqlException: A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.) ---> System.ComponentModel.Win32Exception: The wait operation timed out

Tried various things including:

  1. Upgrading to SQL Server 2016 SP2 with latest CU.
  2. Creating a tool that runs the code provided by @jhudsoncedaron above and running it right before running the tool that was failing. It seemed to work for a couple of days before failing again.

In the end, what worked (at least for the last 4 days) was changing the connection string to use the IP address instead of the server name. If this continues to hold, would it point to a DNS issue? I'm not sure how IP vs. server name affects the "pre-login handshake".

@jhudsoncedaron

@Code-DJ: You know connection pools are per-process, right? The trick is to replace your code that opens a connection with that loop.

@Code-DJ

Code-DJ commented Mar 26, 2019

@jhudsoncedaron I didn't know that. That makes sense as to why it didn't work. If the tool fails again, I will move your logic into the actual tool and try it. Thanks!

@divega transferred this issue from dotnet/corefx May 16, 2019
@divega

divega commented May 16, 2019

As recently announced on the .NET Blog, focus on new SqlClient features and improvements is moving to the new Microsoft.Data.SqlClient package. For this reason, we are moving this issue to the new repo at https://github.com/dotnet/SqlClient. We will still use https://github.com/dotnet/corefx to track issues on other providers like System.Data.Odbc and System.Data.OleDb, and general ADO.NET and .NET data access issues.

@divega added this to the 1.0.0 milestone May 16, 2019
@divega added the 🐛 Bug! label May 16, 2019
@divega modified the milestone: 1.0.0 May 17, 2019
@cheenamalhotra modified the milestones: 1.0.0, 1.1.0 Jul 25, 2019
@cheenamalhotra modified the milestones: 1.1.0, 1.2.0 Nov 20, 2019
@cheenamalhotra removed this from the 2.0.0 milestone Jun 18, 2020
@jzabroski

jzabroski commented Jul 19, 2022

@Code-DJ: You know connection pools are per-process right? The trick is to replace your code to open a connection with that loop.
-- @jhudsoncedaron

Incorrect. Connection pools are per connection string. For example, changing the Application Name in the connection string will cause a second pool to be created.
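For instance, these two strings are identical except for Application Name, so they get separate pools (illustrative values):

    Server=XXXX;Database=XXXX;Application Name=ServiceA
    Server=XXXX;Database=XXXX;Application Name=ServiceB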

Connections that come out of the connection pool are bad and have a trashed transaction isolation. I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.
-- @jhudsoncedaron

What you're describing isn't a client bug, but a bug in sp_reset_connection (which the client calls when it fetches a connection from the pool - and the client doesn't need to be the SQL Driver, since in Java-land, the JDBC driver for SQL Server does not call sp_reset_connection, but instead delegates that to the client caller, which makes sense, as calling sp_reset_connection many times can increase LOCK_HASH spinlock count to the point of creating resource contention issues on a busy SQL Server.). The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014.

The workaround you're describing could be made better by checking the @@VERSION value, caching it and checking it to see if you're on a version that doesn't require this chicanery.
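A sketch of that idea (a hypothetical helper, assuming the SQL Server 2014 cutoff discussed here; it reads SqlConnection.ServerVersion from an open connection rather than querying @@VERSION directly, and caches the result per connection string):

    using System.Collections.Concurrent;
    using System.Data.SqlClient;

    static class ServerVersionCache
    {
        // Cache the major version per connection string so the check is paid once.
        private static readonly ConcurrentDictionary<string, int> Majors =
            new ConcurrentDictionary<string, int>();

        // SQL Server 2014 reports major version 12; older servers would still
        // need the isolation-level reset workaround.
        public static bool NeedsIsolationReset(SqlConnection openConnection)
        {
            int major = Majors.GetOrAdd(openConnection.ConnectionString,
                _ => int.Parse(openConnection.ServerVersion.Split('.')[0]));
            return major < 12;
        }
    }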

#97 (comment)
-- @saurabh500

@saurabh500 , are you saying that connections that wait too long to go through the queue can in turn re-trigger scenario 1 (If an error occurs while getting a connection from the pool, then subsequent attempts to establish new connections to SQL Server fail for a period of time), resulting in a possible meta-stable failure state when connections consistently take 5 seconds to handshake?

@jhudsoncedaron

jhudsoncedaron commented Jul 19, 2022

"The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014."

It wasn't. I reproduced it on SQL 2014 back when I used to run that.

The workaround needs to remain even if that particular issue is fixed because it's still fixing the stale connection in pool problem.

@jzabroski

jzabroski commented Jul 19, 2022

"The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014."

It wasn't. I reproduced it on SQL 2014 back when I used to run that.

It appears there are two types of isolation-level issues, and the changes to sp_reset_connection only address the case where xact enforcement isn't triggered by creating a SqlTransaction.

#96 and #1098 (duplicate) independently verify your statement on SQL Server 2017 or later.

So, both statements are true: sp_reset_connection was improved in SQL Server 2014 with respect to transaction isolation levels, but it did not address inner transaction blocks committing or rolling back with a different isolation level.

@cheenamalhotra
Member

Closing this issue as stale.
Please open a new issue with repro details if a similar problem occurs with the latest version of MDS.

@cheenamalhotra closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 8, 2024