Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix problems with timeouts in graphql_transport_ws #2703

Merged

Conversation

kristjanvalur
Copy link
Contributor

This fixes an issue mentioned in #2702

Description

After merging uniform websocket tests for Starlite extension, sporadic deadlocks were observed.
It turned out that the timeout trigger, part of the graphql_transport_ws protocol, could trigger too early,
when the initial websockets handshake was still being done. This caused the whole websocket connection
attempt to be rejected and this triggered a deadlock in Starlite, which still appears a bit un-robust.

This PR does a few things

  1. provides an API to start the timeout, after the integration specific handler has completed the Websockets handshake.
  2. Add initial synchronization (connection_timed_out:bool) to ensure that there is never a race between a timeout and accepting a connection.
  3. Robustly terminate the timeout task when it is done, or no longer needed.
  4. Add error handing around the timeout task.
  5. Add a Task error handling hook, to report unhandled errors in Tasks. This is not yet in use, need to consult on which log handler to use, perhaps "strawberry.task"
  6. Provide a shutdown() api to the integrations, rather than the former, rather unwieldly, boilerplate code.

Types of Changes

  • Core
  • Bugfix
  • New feature
  • Enhancement/optimization
  • Documentation

Issues Fixed or Closed by This PR

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • I have tested the changes and verified that they work and don't break anything (as well as I can manage).

@codecov
Copy link

codecov bot commented Apr 10, 2023

Codecov Report

Merging #2703 (4669480) into main (e9faa9c) will increase coverage by 0.01%.
The diff coverage is 92.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2703      +/-   ##
==========================================
+ Coverage   96.48%   96.49%   +0.01%     
==========================================
  Files         194      197       +3     
  Lines        7988     8070      +82     
  Branches     1449     1457       +8     
==========================================
+ Hits         7707     7787      +80     
+ Misses        181      180       -1     
- Partials      100      103       +3     

@kristjanvalur kristjanvalur marked this pull request as ready for review April 10, 2023 19:27
@botberry
Copy link
Member

botberry commented Apr 10, 2023

Thanks for adding the RELEASE.md file!

Here's a preview of the changelog:


This release improves the graphql-transport-ws implementation by starting the sub-protocol timeout only when the connection handshake is completed.


Here's the preview release card for twitter:

Here's the tweet text:

🆕 Release (next) is out! Thanks to @kristjanvalur for the PR 👏

Get it here 👉 https://github.com/strawberry-graphql/strawberry/releases/tag/(next)

@DoctorJohn DoctorJohn self-requested a review April 13, 2023 13:38
@kristjanvalur
Copy link
Contributor Author

kristjanvalur commented Apr 17, 2023

Because the timeout is started as soon as the 'handle' method is called, it is possible for it to trigger before an 'accept' message is sent. This is valid Websockets, it will reject the connection.

However, this causes a race with the app, which subsequently is trying to send a websockets accept message. This is likely what is causing the race condition here, and the client fails to shut down in some weird deadlock.

This pr makes sure that the trigger is only started after the base websocket connection is accepted and then maintains a synchronized, and race-free state wrt the sub-protocol handshake.

@kristjanvalur kristjanvalur force-pushed the kristjan/timeouts branch 2 times, most recently from 40b351d to 265cf47 Compare April 21, 2023 09:23
@kristjanvalur kristjanvalur changed the title Fix problems with timeouts in strawberry_transport_ws Fix problems with timeouts in graphql_transport_ws May 2, 2023
RELEASE.md Outdated Show resolved Hide resolved
Co-authored-by: Patrick Arminio <[email protected]>
@kristjanvalur
Copy link
Contributor Author

One question:
I added a skeleton error handler for background tasks. The only sensible thing to do is to log them. but I didn't implement that for this PR. I could do that now, if we agree on a log channel to use. I suggested strawberry.task, maybe that is ok?
These are the only tasks created by strawberry, all other top level task logging is done by the web framework in question.

@patrick91
Copy link
Member

One question: I added a skeleton error handler for background tasks. The only sensible thing to do is to log them. but I didn't implement that for this PR. I could do that now, if we agree on a log channel to use. I suggested strawberry.task, maybe that is ok? These are the only tasks created by strawberry, all other top level task logging is done by the web framework in question.

when are these errors happening?

@kristjanvalur
Copy link
Contributor Author

kristjanvalur commented May 2, 2023

when are these errors happening?

Never, hopefully. But whenever you create a background task, it is prudent to install a top-level error handler to catch and log it. If you don't, Python will create a warning about an task with an exception not being "awaited" but that warning may end up anywhere. In case of the timeout thread, there really aren't many things which can go wrong. But for the subscription thread, all kinds of errors can occur and it is best to handle them in-task by a top level error handler.

The alternative is to have the main task "await" all background tasks and catch and log any errors which occur there.

(asyncio.CancelledErrors don't need to be handled and are ignored if they are raised to the top, but all other errors will cause a warning somewhere)

@patrick91
Copy link
Member

when are these errors happening?

Never, hopefully. But whenever you create a background task, it is prudent to install a top-level error handler to catch and log it. If you don't, Python will create a warning about an task with an exception not being "awaited" but that warning may end up anywhere. In case of the timeout thread, there really aren't many things which can go wrong. But for the subscription thread, all kinds of errors can occur and it is best to handle them in-task by a top level error handler.

The alternative is to have the main task "await" all background tasks and catch and log any errors which occur there.

(asyncio.CancelledErrors don't need to be handled and are ignored if they are raised to the top, but all other errors will cause a warning somewhere)

ok, let's add a log then!

let's maybe do strawberry.ws.task?

@patrick91 patrick91 merged commit 6e730d9 into strawberry-graphql:main May 2, 2023
@kristjanvalur kristjanvalur deleted the kristjan/timeouts branch May 2, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants