Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

websockets: fix ping_timeout #3376

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

oliver-sanders
Copy link

@oliver-sanders oliver-sanders commented Apr 30, 2024

Closes #3258
Closes #2905

Fixes an issue with the calculation of ping timeout interval which caused connections to be erroneously closed from the server end. This could happen shortly after the connection was opened, before the ping was even sent (as reported in #3258) as the result of startup logic. It could also happen when pong responses were sent back from the client within the configured timeout as the result of a race condition.

To test this fix, try repeating the steps in this example: #3258 (comment)

I found a TODO to implement testing for ping timeout. I tried to fill this in but couldn't get it to work, any pointers appreciated (note, it has to be tested from the server side).

* Closes tornadoweb#3258
* Fixes an issue with the calculation of ping timeout interval that
  could cause connections to be erroneously timed out and closed
  from the server end.
Copy link
Member

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this is probably going to involve some ugly hacks one way or another. My first thought is to just overwrite the write_ping method: self.ws_connection.write_ping = lambda x: None (and factor out a write_pong method for the other side where it currently calls self._write_frame(True, 0xA, data)).

I think there are multiple distinct problems with this ping timeout code:

  • If ping_timeout == ping_interval, the connection breaks before the first ping is sent, and if that is avoided, it is likely to break other times due to race conditions. This was your finding in websocket disconnect and reconnect every 10 minutes #3258 (comment) and is fixed by this PR
  • If ping_timeout > ping_interval, we are slow to detect failure - we don't check for timeout until the next ping is sent (websocket: Fix ping_timeout < ping_interval #2905). This is also fixed by your PR (and it's a good reason to restructure the code instead of a small change to the since_last_(ping,pong) logic in periodic_ping.
  • Other folks, such as the original message in websocket disconnect and reconnect every 10 minutes #3258, have reported problems when ping_interval and ping_timeout are not the same. It's not clear to me what exactly the problem is here, or whether this PR fixes it.
  • Does it actually make sense to set ping_interval < ping_timeout? That's the default, but since websockets go over TCP I'm not sure it makes sense to allow multiple pings in flight before detecting failure. I'm not sure what I was thinking when this was written/merged, but now I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.


if self.ping_timeout and self.ping_timeout > 0:
# wait for the pong
await asyncio.sleep(self.ping_timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditionally sleeping for ping_timeout will cause problems if ping_timeout is greater than ping_interval (which is the default) - we won't start the next ping until the old timeout sleep has passed.

Unless we want to effectively require that ping_interval >= ping_timeout, we can't use a sleep here. (Since websockets are TCP, I don't think it really makes sense to use a short ping interval with a long timeout, so maybe we'd want to make that policy change, but we have the existing defaults and documentation that suggest otherwise). We could use other synchronization primitives (like an asyncio.Condition) that would allow for an early wakeup, or we could go back to the original pattern that checks for a timeout at the ping interval.

I also find the combination of PeriodicCallback and coroutines kind of non-obvious and if we're going to use a coroutine here I think I'd rather get rid of the PeriodicCallback and send the pings from the same coroutine. This would make the interaction of the sending of pings and waiting for pongs more explicit.

@oliver-sanders
Copy link
Author

Does it actually make sense to set ping_interval < ping_timeout?

I did find it a bit strange that this is possible.


Since websockets are TCP, I don't think it really makes sense to use a short ping interval with a long timeout,

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.


I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.

👍


Unconditionally sleeping for ping_timeout will cause problems if ping_timeout is greater than ping_interval (which is the default) - we won't start the next ping until the old timeout sleep has passed.

I did consider this, but thought of it more as the result of a quirky configuration rather than a bug in the implementation.

For configurations where ping_timeout > ping_interval:

  • Under normal circumstances, it will ping at the ping_interval.
  • If the client gets laggy, the ping frequency will drop to ~2x ping time.
  • If the client goes quiet, the server will stop pinging until the timeout is hit (or a pong is received).

Unless we want to effectively require that ping_interval >= ping_timeout

One option would be to reject such configurations:

assert ping_timeout is None or ping_timeout <= ping_interval

However, that might be disruptive. Another option would be to raise a warning and fall back to the ping_interval:

if ping_timeout > ping_interval:
    app_log.warn(
        'The configured websocket_ping_timeout is longer than the websocket_ping_interval.'
        '\nSetting websocket_ping_timeout = websocket_ping_interval'
    )
    ping_timeout = ping_interval

Other folks, such as the original message in #3258, have reported problems when ping_interval and ping_timeout are not the same.

Setting ping_timeout = ping_interval is an easy way to reveal the flaws with the current logic, however, it can fail when these values are not the same too.



Here are my suggestions:

  1. Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).
  2. Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).
  3. If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

What do you think?

@bdarnell
Copy link
Member

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.

They can, actually. RFC 6455 section 5.5.2 and 5.5.3 say that a ping frame may include "application data", and a pong frame must copy the application data from the ping frame it is responding to. So the information is there to match up pings and pongs. But even then, would it be useful to send a flurry of pings faster than the RTT? I'm not seeing any use case for that (at least in HTTP/1 and HTTP/2 - maybe in HTTP/3 depending on how they model the websocket stream in a reorderable connection)

For configurations where ping_timeout > ping_interval:

Under normal circumstances, it will ping at the ping_interval.

I don't think that's true - there's no early exit from the sleep, so it will ping at ping_timeout instead of ping_interval. That's what I was getting at when I said "unconditional sleep". Am I missing something? I'd be OK with the behavior you're describing in this message but not what I think the code is doing.

One option would be to reject such configurations:

This is probably the right answer; I only hesitate because the default ping_timeout property gets this backwards. We can change this default, but the question is whether people have made explicit configurations based on this default too. Overall I think it's probably best to set ping_timeout = ping_interval if they're inverted just to avoid transition pains.

  1. Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).
  2. Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).
  3. If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

Yes, except that I'd rather not have a warning at all than have a warning every time a websocket is opened. I'm fine without a warning; if you'd like to keep it put it behind some kind of flag so it only logs once (I don't think we have an existing idiom for this).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

websocket disconnect and reconnect every 10 minutes websocket: Fix ping_timeout < ping_interval
2 participants