
Respect the Keep-Alive response header on HTTP/1.1 as well #73585

Merged · 2 commits into dotnet:main · Aug 12, 2022

Conversation

MihaZupan (Member)

Followup after #73020
Contributes to #72958

This changes #73020 to apply to all 1.x responses, not just 1.0.

It also brings back the 1 second offset (discussed at #73020 (comment)) that we apply to the timeout, to avoid reusing the connection just as it's about to time out.
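For illustration, here's a minimal sketch of the handling this describes; the names are hypothetical, not the actual SocketsHttpHandler internals:

```csharp
// Illustrative sketch only: parse a "Keep-Alive: timeout=N" response header
// and compute when the connection must stop being reused, now for any
// HTTP/1.x response and with a 1 second safety offset subtracted from the
// server's deadline.
static long? GetConnectionExpiry(Version httpVersion, string? keepAliveHeader)
{
    const long OffsetMs = 1_000;

    // Previously this only applied to HTTP/1.0; the change covers all 1.x.
    if (httpVersion.Major != 1 || keepAliveHeader is null)
        return null;

    foreach (string part in keepAliveHeader.Split(','))
    {
        string token = part.Trim();
        if (token.StartsWith("timeout=", StringComparison.OrdinalIgnoreCase) &&
            int.TryParse(token.AsSpan("timeout=".Length), out int seconds))
        {
            // Stop handing the connection out 1 second before the deadline.
            return Environment.TickCount64 + seconds * 1000L - OffsetMs;
        }
    }

    return null;
}
```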
An example from a real service closing the connection right as we've sent the request, about 5 seconds after the last one:

[screenshot: the server closing the connection just as the request is sent, ~5 seconds after the previous one]

With these changes, an internal service I was testing against no longer sees "Connection reset by peer" request failures.

cc: @halter73

@MihaZupan MihaZupan added this to the 7.0.0 milestone Aug 8, 2022
@ghost ghost assigned MihaZupan Aug 8, 2022
@ghost commented Aug 8, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


Author: MihaZupan
Labels: area-System.Net.Http
Milestone: 7.0.0

@MihaZupan MihaZupan marked this pull request as ready for review August 8, 2022 20:26
@MihaZupan MihaZupan requested a review from a team August 8, 2022 20:26
@MihaZupan (Member, Author)

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephentoub (Member) commented Aug 9, 2022

> It also brings back the 1 second offset

I still don't understand why this is the "right" offset. Why 1 second rather than 0.5 seconds or 2 seconds? Why is 1 second somehow magic? Isn't it likely you could come up with similar examples for any offset you selected? It seems like we're really just saying "that timeout you asked us to respect, we're going to choose a much smaller value in order to minimize the chances of a race condition," which raises the question of why it's ok not to respect the keep-alive value the server is sending (when the whole point of the original PR was to respect it). If we're not going to keep the connection alive for the requested duration, are we really respecting it?

@MihaZupan (Member, Author)

The race we are trying to avoid spans the time from when we decide to reuse a specific HttpConnection, to when the server starts reading the request and decides that the connection is no longer idle.
This includes network latency (including possible time for TCP retransmission) and processing delays at the client and server.

The exact value of the offset we choose has to balance between:

  1. Avoiding as many of these races as possible, since they otherwise lead to request failures
  2. Reusing the connection as much as possible to avoid wasting resources and latency

1 second is an arbitrary choice I made that should cover the vast majority of such races while not completely preventing connection reuse.
I'm happy to change it to any other suggested value, though. I don't see the need to go beyond a second, but even 0.5 seconds would likely be good enough in this case.

While arguably the server should be the one building in this leeway, that does not appear to be the case in the wild, so I do think having some leeway on the client to mitigate that is valuable.
For servers that close the connection right at the timeout, #73020 effectively does nothing without the offset.
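As a rough illustration of where that leeway lands (hypothetical names, not the actual connection pool code), the reuse check effectively becomes:

```csharp
// Sketch of an idle-connection reuse check with a safety offset;
// idleSinceTicks and keepAliveTimeoutMs stand in for whatever the
// connection actually tracks.
static bool IsSafeToReuse(long idleSinceTicks, long keepAliveTimeoutMs)
{
    // Leeway for network latency plus client/server processing delays.
    const long OffsetMs = 1_000;

    // Treat the connection as expired OffsetMs *before* the server's
    // deadline, so the request can still reach the server while it
    // still considers the connection alive.
    return Environment.TickCount64 - idleSinceTicks < keepAliveTimeoutMs - OffsetMs;
}
```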

> it begs the question why it's ok to not respect the keep-alive value the server is sending
> if we're not going to keep it alive for the requested duration, are we really respecting it?

We are respecting it in the sense that we will never send a request after the timeout.
We are also attempting to never send it so late that the server will receive it after the timeout.

@stephentoub (Member) commented Aug 9, 2022

> We are also attempting to never send it so late that the server will receive it after the timeout.

There is no positive value we could pick that would make this a reality.

@MihaZupan (Member, Author)

It's not a guarantee, but it should cover 99% of cases.

@stephentoub (Member)

> It's not a guarantee, but it should cover 99% of cases.

Does it? In the tests you've done for 1 second, were the client and server co-located, or have you validated this 99% with client and server spanning the globe, on slower connections, mobile devices, etc.?

@rzikm (Member) left a comment

LGTM

@MihaZupan (Member, Author)

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MihaZupan (Member, Author)

Testing HTTP RTT to multiple regions:
[chart: HTTP round-trip time measurements to servers in multiple regions]

The worst P999 was 720 ms in my case, meaning a 360 ms offset (half the round trip, roughly the one-way delay the offset has to cover) would have been sufficient 99.9% of the time when talking to the worst server. The 1 second offset still leaves a ~3x buffer over that for even worse network conditions, maintaining 99.9% in the worst-case delay scenarios.
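For context, a hypothetical sketch of this kind of measurement (the method name, endpoint, and sample count are placeholders; this is not the harness used for these charts):

```csharp
using System.Diagnostics;

// Hypothetical measurement harness: sample HTTP round-trip times against
// one endpoint and report the 99.9th percentile in milliseconds.
static async Task<double> MeasureP999Async(HttpClient client, Uri url, int samples = 10_000)
{
    var rtts = new List<double>(samples);
    var sw = new Stopwatch();

    for (int i = 0; i < samples; i++)
    {
        sw.Restart();
        // Measure until response headers arrive, not the full body download.
        using HttpResponseMessage response =
            await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
        rtts.Add(sw.Elapsed.TotalMilliseconds);
    }

    rtts.Sort();
    return rtts[(int)(rtts.Count * 0.999)]; // P999 round-trip time
}
```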

Adjusting the delay between pings for the worst-case failure rate:
[chart: worst-case request failure rate before this change]
And then applying this change:
[chart: request failure rate after applying this change]

We do believe a 1 second offset will be sufficient for the substantial majority of scenarios.

@MihaZupan MihaZupan merged commit 4f1c56f into dotnet:main Aug 12, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 11, 2022