Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.NET Core 2.1 new HttpClient based on .NET sockets and Span<T>. seems to have a problem #27611

Closed
dkhanaferov opened this issue Oct 11, 2018 · 15 comments
Labels
bug tenet-compatibility Incompatibility with previous versions or .NET Framework
Milestone

Comments

@dkhanaferov
Copy link

I have a strange issue I'm trying to triage having to do with the new HttpClient on .NET Core 2.1. Since the HttpClient has been completely rewritten to use a different low level library for handling HTTP requests. I'm wondering if anyone has seen any issues with the new implementation.

What I'm seeing is a strange case where my application (.NET Core 2.1) which sends a POST request to some API periodically (every 10 seconds) a few of times every 15 min it will throw an exception with the error: An error occurred while sending the request.; The server returned an invalid or unrecognized response.

No other details are available, it's just an exception when I make a call like this:

using (var res = await _httpClient.PostAsync(uriBuilder.Uri, new StringContent(serializedRequestBody, Encoding.UTF8, "application/json")))
            {
               //Do something here
            }

The exception caught is a System.Net.Http.HttpRequestException and it has some inner exception with the above error message.

So as I mentioned this does NOT happen all time, it happens seemingly at random, or at least I can not discern any particular pattern. All I can say is these POST requests are made once every 10 seconds 24/7 and anywhere between 5% and 10% of the POST requests fail with the above exception.

So used tcpdump and piped it into wireshark to examine the requests to see what's actually happening when the requests fail and what i see is the following:

On a good POST I see: my app sends the request to server, server sends response back, my app sends ACK to server and server responds with FIN,ACK. Done. Good Stuff.

On POST which gets the above exception I see the following: my app sends the request to server, and almost immediately after (like a few milliseconds after) my application sends FIN, ACK to server.

This seems consistent with what I see in my application logs, which show that the request duration is 0 before the exception is thrown.

So what it looks like to me is, my application sends the request and then immediately after closes the connection for some reason. However, I don't understand why this happens. I tried comparing the raw HTTP requests (good POST vs bad POST) to see any differences and I can not see any difference.

One last thing to mention, is that I ONLY see this in applications running on .NET Core 2.1. When I run my application on .NET 2.0 I do not see this problem. Also when I use the same library (where the HTTP call is being made) in the .NET 4.5.1 application (I use multi-targeting to compile the library targeting .net standard and net451) I also do NOT see this problem. So it seems to affect only .NET Core 2.1

@LuizAdolphs
Copy link

I don't know if is possible, but maybe you can investigate that part of framework and try to understand what is happening (assuming that the httpclient wasn't completely rewritten)

@stephentoub
Copy link
Member

@dkhanaferov, any chance you'd be able to simplify your case down to a runnable repro we could try? Could you share the wireshark trace of the failing request?

@dkhanaferov
Copy link
Author

image

In the screenshot after the very last POST there is a FIN, ACK and it shows Ack=7187, so I look back I see the previous FIN,ACK had sequence=7186. Now, I'm by far not an expect in TCP or networking so I maybe saying something completely dumb, but does that mean that the last FIN,ACK (which comes from my host to server) is essentially my host FIN,ACK'ing the previous FIN,ACK (from server to my host) and essentially closing the connection.

So since the next POST is made to the same host:port, using the same connection and yet the connection is closed (via that last FIN,ACK) that's why I never get a response back?

Since my host FIN,ACK'ing the previous FIN,ACK just happens to be by chance right after the next POST, the connection is closed as it's supposed to be because FIN,ACK means we're done. But the next POST is attempting to re-use the same connection and it just gets screwed there.

This would explain the apparent randomness of this issue.

@dkhanaferov
Copy link
Author

Some additional evidence:

I was able to fix the issue i.e. prevent the connection from being closed and exception being thrown by setting the following appcontext switch: AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", false); Which according to documentation will revert to using the older libcurl implementation of HttpClient on Linux (I'm only running on Linux so I haven't tested this on Windows)

I then ran the application for 48 hours and did not see a single error. While without setting that switch the error appear a few times every 15 min or so.

I will try to put together a small repo which could be used to reproduce this issue next, apologies for not having done this earlier but this is affecting a prod app so I'm trying to fix it as best as I can first.

@geoffkizer
Copy link
Contributor

There's an unavoidable timing issue with these sorts of scenarios. The server is going to close an idle connection sooner or later. If it closes the connection just as we are trying to reuse it for a subsequent request, then things get complicated. We will retry the request on another connection if we are sure it is safe to do so. But if we've sent part of the request body then we won't resend then request, because we can't be sure it's safe to do so.

I'm guessing that's what is happening here.

I'm not sure why you see different behavior with the libcurl implementation. It could be that libcurl just has a different connection reuse policy and so is not hitting this issue. Or it could be that libcurl is more aggressive about retrying requests. It would be interesting to see a trace of libcurl and compare how it is handling this situation.

@karelz
Copy link
Member

karelz commented Nov 2, 2018

@dkhanaferov any luck on smaller repro we could verify locally?

@wfurt
Copy link
Member

wfurt commented Feb 22, 2019

Should we consider using TCP keep-alive @geoffkizer?
I know it wont completely fix the race condition but we added it to curl handler to help with getting through firewalls and load-ballancers.

Aside from that this does not seems actionable and perhaps should be closed?

@geoffkizer
Copy link
Contributor

I don't see how TCP keep-alive would help here. The server is closing the connection because it's decided the connection is idle.

@geoffkizer
Copy link
Contributor

I'll echo what I said above:

It would be interesting to see a trace of libcurl and compare how it is handling this situation.

This seems like the best way to investigate further.

@wfurt
Copy link
Member

wfurt commented Feb 22, 2019

In some cases I saw, it was not server it self but networking gear along the way. From client tcpdump, you cannot tell the difference.

@geoffkizer
Copy link
Contributor

I honestly don't have a good sense for whether TCP keep alive would help much. I don't know how common it is for a router to drop a connection before the server does. My guess is it's not very common, because servers usually have pretty small connection timeouts (e.g. the default for Apache is 5 seconds).

@geoffkizer
Copy link
Contributor

That said, I do think it's worthwhile to understand why libcurl doesn't seem to hit the same issue.

@wfurt
Copy link
Member

wfurt commented Feb 22, 2019

can you isolate runable repro @dkhanaferov? I can investigate more.

@davidsh
Copy link
Contributor

davidsh commented Apr 28, 2019

@dkhanaferov

We've added more diagnostics and better error messages as of PR dotnet/corefx#37251. We very rarely now have a simple error message of "The server returned an invalid or unrecognized response." The errors will be more descriptive about what part of the server response might be "invalid or unrecognized".

Please try out the latest daily build or wait until .NET Core 3.0 Preview 5. If you are able to repro the problem, please open a new issue. Thanks.

@davidsh davidsh closed this as completed Apr 28, 2019
@stephentoub
Copy link
Member

wait until .NET Core 3.0 Preview 5

Just FYI, the error message changes from dotnet/corefx#37251 will be in Preview 6.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug tenet-compatibility Incompatibility with previous versions or .NET Framework
Projects
None yet
Development

No branches or pull requests

8 participants