
After SqlException application state is corrupted #97

Closed
epignosisx opened this issue Oct 26, 2018 · 25 comments
Labels
🐛 Bug! Issues that are bugs in the drivers we maintain.


@epignosisx

Environment:
OS - Windows Server 2012 R2
.NET Core: 2.0.5
SQL Server 2016

Exception:

System.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.) ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at bool System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, uint waitForMultipleObjectsTimeout, bool allowCreate, bool onlyOneCheckConnection, DbConnectionOptions userOptions, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, out DbConnectionInternal connection)
   at bool System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource<DbConnectionInternal> retry, DbConnectionOptions userOptions)
   at bool System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource<DbConnectionInternal> retry)
   at Task System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)
   at async Task<byte[]> Microsoft.Extensions.Caching.SqlServer.DatabaseOperations.GetCacheItemAsync(string key, bool includeValue, CancellationToken token)
   at async Task Microsoft.Extensions.Caching.SqlServer.DatabaseOperations.RefreshCacheItemAsync(string key, CancellationToken token)
   at async Task Microsoft.Extensions.Caching.SqlServer.SqlServerCache.RefreshAsync(string key, CancellationToken token)
   at async Task Microsoft.AspNetCore.Session.DistributedSession.CommitAsync(CancellationToken cancellationToken)
   at async Task Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
ClientConnectionId:f8356379-665e-484b-9093-e8e99ae10a72
Error Number:258,State:0,Class:20

This issue has come up in production on two separate occasions this week, and once the ASP.NET Core app enters this state, only restarting the app resolves the problem. A few points I want to share:

1. It seems like something in the connection pool is getting corrupted once this happens, since a restart fixes it.

2. The error seems related to an SSL issue, based on the exception details (provider: SSL). However, our connection strings do not specify the Encrypt attribute.

3. There are a few issues reported with the same exception, but in Linux/Mac environments. In this case it is an all-Windows environment.

@jhudsoncedaron

jhudsoncedaron commented Oct 26, 2018

Connections that come out of the connection pool can be bad and have a trashed transaction isolation level. I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.

The solution for both is the same darn thing:

    for (;;) {
        var connection = new SqlConnection(connectionString);
        try {
            connection.Open();
            using (var cmd = connection.CreateCommand()) {
                // Default; use whatever isolation level your connections should be at
                cmd.CommandText = "SET TRANSACTION ISOLATION LEVEL READ COMMITTED;";
                cmd.ExecuteNonQuery();
            }
            break; // got a good connection
        } catch (DbException) {
            // bad connection from the pool; dispose it and pull the next one
            connection.Dispose();
        }
    }

The bad connections don't get returned to the pool a second time, so the for loop drains all the bad ones out and you're ready to go again.

@Wraith2
Contributor

Wraith2 commented Oct 26, 2018

I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.

It's odd behaviour. Do you remember the details of why it was intended?

@jhudsoncedaron

Back in the .NET Framework 1.0/2.0 days, SqlConnection was a wrapper around a native library that did all the work, including the pooling, and had the "inherit transaction isolation level" behavior they dared not fix in the native code for backwards compatibility. (I think a native caller could tell whether it came out of the pool or not.) I filed the bug on MS Connect against .NET Framework 2.0/3.5 and was given this explanation.

The other half is about bad connections: connections that were good when Close() was called might not be when you get them back out, which is a problem inherent to pooling unless active code is written to verify the state. Just about every pool API from that era had the same problem. This led to "object cesspool" appearing as an anti-pattern for a while before some other people obliterated it off the lists. Once you know about it (which the documentation doesn't tell you), it's easy to fix.

When #12072 was left unfixed too long I went through a bunch of reflection work to implement my own pool. My code depends on the exact implementation of SqlConnection now. I don't have to care anymore.

@epignosisx
Author

@saurabh500, we got a memory dump we can share, hopefully you can get something out of it.

The only change in the application is that it started to get more load than it used to.

The issue is happening weekly, so if you need anything to help troubleshoot we could probably provide it.

@saurabh500
Contributor

@afsanehr @keeratsingh to help on this issue.

@saurabh500
Contributor

@epignosisx It would be good to provide the version of SqlClient that you are using as well, since SqlClient is a library that doesn't ship as part of the .NET Core SDK.

1- It seems like something in the Connection Pool is getting corrupted once this happens. Since a restart fixes it.

I am curious: why do you think that the connection pool is corrupted? Is it because subsequent connections after this failure start failing?

2- The error seems related to some SSL issue based on the exception details (Provider SSL). However, our connection strings do not specify the Encrypt attribute.

The login in SQL Server happens over SSL/TLS. What is happening in this case is that a connection is being retrieved from the pool, i.e. a new TCP connection is being created. After the TCP connection is established, we have a pre-login phase and a login phase. At the end of the pre-login phase, the SSL handshake is established with the server, which doesn't seem to have completed in time in the above error message. Are you seeing that, after one connection failure, all subsequent connections fail quickly?

@epignosisx
Author

epignosisx commented Oct 30, 2018

The System.Data.SqlClient version is 4.4.2.

I am curious, why do you think that the connection pool is corrupted? Is it because the subsequent connections after this failure, start failing?

Once this error occurs, subsequent requests start failing consistently; it goes away once the app pool is recycled. This behavior, in addition to the stack trace, is what makes me think it might be something getting corrupted in the connection pool. There are 3 other instances of the app running on other physical servers that did not experience the issue. This helps rule out suspicions that SQL Server was acting up.

The SQL operations that are going through this server are very simple. In fact, it is not our code, it's part of the Microsoft.AspNetCore.Caching.SqlServer package.

SQL queries:
https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/SqlQueries.cs

I did find that when making the queries, the code was not disposing of the SqlCommand and the SqlDataReader, but the SqlConnection is disposed; I'm not sure if this could be related or if it is even needed. Here are a few samples:

https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/DatabaseOperations.cs#L281

https://github.com/aspnet/Caching/blob/master/src/Microsoft.Extensions.Caching.SqlServer/DatabaseOperations.cs#L288

The Login in Sql Server happens over SSL/TLS. What is happening in this case is, a new connection from the pool is being retrieved, i.e. a new TCP connection is being created. However after TCP connection establishment, we have a pre-login phase and login phase. At the end of Pre-Login phase, the SSL handshake is established with the server, which doesn't seem to have gone through in time in the above error message. Are you seeing, that after one connection failure, all subsequent connections are failing quickly?

Understood. Thanks for the explanation. Yes, subsequent connections are failing quickly. HTTP requests are taking 15 ms to complete (IIS logs).

@saurabh500
Contributor

There are a couple of nuances of the connection pool that should be mentioned here.

  1. If an error occurs while getting a connection from the pool, then subsequent attempts to establish new connections to SQL Server fail for a period of time. More information here:
    https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql-server-connection-pooling
    (screenshot of the linked documentation)

  2. About the timeouts in the SqlClient connection pool: the requests for connections in the pool are queued serially. The timeout is computed from when the connection is requested to when it is established. Assume that there are 20 connection requests in the pool, all requested at the same time, i.e. t=0, and the connection timeout is 15 seconds per connection. Let's say the first 3 connections took 5 seconds each to be established. In that case, the 4th connection gets its turn at t=15, at which point the timeout has already been reached, so the connection establishment will time out.

It is possible that you are hitting a combination of the above two issues. Do you have a Minimum Pool Size configured for your connection pool? This makes the connection pool ensure that at least Min Pool Size connections are maintained in the pool.
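The queue arithmetic above can be sketched as follows (illustrative constants mirroring the example, not SqlClient internals):

    // Requests queue serially: the 4th request only gets its turn after the
    // first three connections have been established.
    const double ConnectTimeoutSeconds = 15.0; // per-connection timeout
    const double SetupSeconds = 5.0;           // time each of the first 3 connections took
    int position = 4;                          // 1-based place in the queue

    double turnStartsAt = (position - 1) * SetupSeconds;    // 3 * 5 = 15 s
    bool timedOut = turnStartsAt >= ConnectTimeoutSeconds;  // true: it times out before its turn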

@epignosisx
Author

epignosisx commented Oct 31, 2018

The connection string does not specify a Minimum Pool Size. It looks like this:

SERVER=XXXX;DATABASE=XXXX;UID=XXXX;PWD=XXXX;PACKET SIZE=4096

Our issue did not last a few minutes; it lasted hours, so the timeout logic would have had enough time to recover:

(screenshot)

@saurabh500
Contributor

Thanks @epignosisx for providing the information. Can you change your config to use Min Pool Size so that you have ready TCP connections? Do you know how many concurrent connections are open to SQL Server when this issue happens? I am trying to rule out the possibility of the connection pool's max size being reached (100 by default) and a connection not being obtained in a timely manner.
If you are not reaching the Max Pool Size of SqlConnection, can you email me (email in GitHub bio) pointers to the memory dump? Even if Min Pool Size mitigates your scenario and max pool size is not reached, we need to solve this problem.
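For illustration, Min Pool Size and Max Pool Size can be set directly in the connection string; the values below are hypothetical, not recommendations:

    SERVER=XXXX;DATABASE=XXXX;UID=XXXX;PWD=XXXX;PACKET SIZE=4096;Min Pool Size=10;Max Pool Size=100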

@jhudsoncedaron

At one time, the behavior on reaching the pool size was to throw InvalidOperationException. The last time I tested this was on .NET Framework 2.0, though. We thought we were hitting the pool size legitimately; much later we discovered we had a bug that leaked connections.

@epignosisx
Author

@saurabh500 I'm trying to get the number of concurrent connections to SQL Server when it happened. In the meantime, I can share the memory dump, but I checked your GitHub profile and it is not showing an email address.

@saurabh500
Contributor

@epignosisx Something changed and my email was private. I have made the email available on my profile.

@epignosisx
Author

epignosisx commented Nov 1, 2018

@saurabh500 I emailed a link to the memory dump to you early today.

@saurabh500
Contributor

Ack. I replied back some time ago. I will take a look at the dump.

@epignosisx
Author

Hi @saurabh500, the issue keeps happening at least weekly. Have you been able to take a look at the memory dumps I shared with you via email?

@epignosisx
Author

I took another stab at the memory dumps with my limited Windbg knowledge and I found something curious:

0:007> !DumpHeap -type System.Data.SqlClient.SqlConnection  -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff8fefa8938        1           32 System.Data.SqlClient.SqlConnection+<>c__DisplayClass107_0`1[[System.Data.SqlClient.SqlDataReader, System.Data.SqlClient]]
00007ff8fe6f2000        1           48 System.Data.SqlClient.SqlConnectionFactory
00007ff8fefa5058        1           80 System.Collections.Generic.Dictionary`2[[System.String, System.Private.CoreLib],[System.Data.SqlClient.SqlConnectionStringBuilder+Keywords, System.Data.SqlClient]]
00007ff8fe6f8dd0        2           96 System.Data.SqlClient.SqlConnectionPoolGroupProviderInfo
00007ff8fe6d4540        2          336 System.Data.SqlClient.SqlConnectionString
00007ff8fe764bb0       12          576 System.Data.SqlClient.SqlConnectionTimeoutErrorInternal
00007ff8fe764f18       47         1128 System.Data.SqlClient.SqlConnectionTimeoutPhaseDuration
00007ff8fefa6280        1         1152 System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[System.Data.SqlClient.SqlConnectionStringBuilder+Keywords, System.Data.SqlClient]][]
00007ff8fe764fa8       12         1152 System.Data.SqlClient.SqlConnectionTimeoutPhaseDuration[]
00007ff8fef8fee0       49         3136 System.Data.SqlClient.SqlConnection+OpenAsyncRetry
00007ff8fe6f0e38      155         4960 System.Data.SqlClient.SqlConnectionPoolKey
00007ff8fef8fdb0      154         6160 System.Data.SqlClient.SqlConnection+<>c__DisplayClass85_0
00007ff8fe6d06e0     2758       507472 System.Data.SqlClient.SqlConnection   //this number looks suspicious

This is the breakdown of the SqlConnection's innerConnection:

System.Data.ProviderBase.DbConnectionClosedPreviouslyOpened: 2629
System.Data.ProviderBase.DbConnectionClosedNeverOpened: 128
System.Data.SqlClient.SqlInternalConnectionTds: 1

@saurabh500 @Wraith2 Do these numbers say anything to you?

@Code-DJ

Code-DJ commented Mar 26, 2019

We were getting the same exception for a console app.

Windows Server 2019
SQL Server 2016 SP1
Application: dotnet.exe
CoreCLR Version: 4.6.27110.4
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Data.SqlClient.SqlException: A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.) ---> System.ComponentModel.Win32Exception: The wait operation timed out

Tried various things including:

  1. Upgrading to SQL Server 2016 SP2 with latest CU.
  2. Creating a tool that runs the code provided by @jhudsoncedaron above and running it right before running the tool that was failing. It seemed to work for a couple of days before failing again.

In the end, what worked (at least for the last 4 days) was changing the connection string to use the IP address instead of the server name. If this continues to hold, would it point to a DNS issue? I'm not sure how IP vs. server name affects the "pre-login handshake".

@jhudsoncedaron

@Code-DJ: You know connection pools are per-process, right? The trick is to replace your code that opens a connection with that loop.

@Code-DJ

Code-DJ commented Mar 26, 2019

@jhudsoncedaron I didn't know that. That makes sense as to why it didn't work. If the tool fails again, I will move your logic into the actual tool and try it. Thanks!

@divega transferred this issue from dotnet/corefx May 16, 2019
@divega

divega commented May 16, 2019

As recently announced on the .NET Blog, focus on new SqlClient features and improvements is moving to the new Microsoft.Data.SqlClient package. For this reason, we are moving this issue to the new repo at https://github.com/dotnet/SqlClient. We will still use https://github.com/dotnet/corefx to track issues on other providers like System.Data.Odbc and System.Data.OleDb, and general ADO.NET and .NET data access issues.

@divega added this to the 1.0.0 milestone May 16, 2019
@divega added the 🐛 Bug! label May 16, 2019
@divega modified the milestone: 1.0.0 May 17, 2019
@cheenamalhotra modified the milestones: 1.0.0, 1.1.0 Jul 25, 2019
@cheenamalhotra modified the milestones: 1.1.0, 1.2.0 Nov 20, 2019
@cheenamalhotra removed this from the 2.0.0 milestone Jun 18, 2020
@jzabroski

jzabroski commented Jul 19, 2022

@Code-DJ: You know connection pools are per-process right? The trick is to replace your code to open a connection with that loop.
-- @jhudsoncedaron

Incorrect. Connection pools are per connection string. For example, changing the Application Name in the connection string will cause a second pool to be created.
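For instance, these two strings are identical except for Application Name, so they get separate pools (illustrative values):

    Server=XXXX;Database=XXXX;Application Name=ServiceA
    Server=XXXX;Database=XXXX;Application Name=ServiceB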

Connections that come out of the connection pool are bad and have a trashed transaction isolation. I raised an issue on this almost ten years ago and eventually found out it's considered intended behavior.
-- @jhudsoncedaron

What you're describing isn't a client bug, but a bug in sp_reset_connection (which the client calls when it fetches a connection from the pool - and the client doesn't need to be the SQL Driver, since in Java-land, the JDBC driver for SQL Server does not call sp_reset_connection, but instead delegates that to the client caller, which makes sense, as calling sp_reset_connection many times can increase LOCK_HASH spinlock count to the point of creating resource contention issues on a busy SQL Server.). The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014.

The workaround you're describing could be made better by checking the @@VERSION value, caching it and checking it to see if you're on a version that doesn't require this chicanery.
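A sketch of that idea (a hypothetical helper, assuming the SQL Server 2014 cutoff discussed here; it reads SqlConnection.ServerVersion from an open connection rather than querying @@VERSION directly, and caches the result per connection string):

    using System.Collections.Concurrent;
    using System.Data.SqlClient;

    static class ServerVersionCache
    {
        // Cache the major version per connection string so the check is paid once.
        private static readonly ConcurrentDictionary<string, int> Majors =
            new ConcurrentDictionary<string, int>();

        // SQL Server 2014 reports major version 12; older servers would still
        // need the isolation-level reset workaround.
        public static bool NeedsIsolationReset(SqlConnection openConnection)
        {
            int major = Majors.GetOrAdd(openConnection.ConnectionString,
                _ => int.Parse(openConnection.ServerVersion.Split('.')[0]));
            return major < 12;
        }
    }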

#97 (comment)
-- @saurabh500

@saurabh500 , are you saying that connections that wait too long to go through the queue can in turn re-trigger scenario 1 (If an error occurs while getting a connection from the pool, then subsequent attempts to establish new connections to SQL Server fail for a period of time), resulting in a possible meta-stable failure state when connections consistently take 5 seconds to handshake?

@jhudsoncedaron

jhudsoncedaron commented Jul 19, 2022

"The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014."

It wasn't. I reproduced it on SQL 2014 back when I used to run that.

The workaround needs to remain even if that particular issue is fixed because it's still fixing the stale connection in pool problem.

@jzabroski

jzabroski commented Jul 19, 2022

"The transaction isolation level issue with sp_reset_connection was fixed/changed in SQL Server 2014."

It wasn't. I reproduced it on SQL 2014 back when I used to run that.

It appears there are two types of isolation-level issues, and the changes to sp_reset_connection only address the case where xact enforcement isn't triggered by creating a SqlTransaction.

#96 and #1098 (duplicate) independently verify your statement on SQL Server 2017 or later.

So, both statements are true: sp_reset_connection was improved in SQL Server 2014 with respect to transaction isolation levels, but it did not address inner transaction blocks committing or rolling back with a different isolation level.

@cheenamalhotra
Member

Closing this issue as stale.
Please open a new issue with repro details if a similar problem occurs with the latest version of MDS.

@cheenamalhotra closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 8, 2024