Fix problems with timeouts in graphql_transport_ws #2703

kristjanvalur · 2023-04-10T18:51:34Z

This fixes an issue mentioned in #2702

Description

After the uniform websocket tests for the Starlite extension were merged, sporadic deadlocks were observed.
It turned out that the connection timeout, which is part of the graphql_transport_ws protocol, could fire too early,
while the initial WebSocket handshake was still in progress. This caused the whole WebSocket connection
attempt to be rejected, which in turn triggered a deadlock in Starlite, which still appears to be not very robust.

This PR does a few things:

Provide an API to start the timeout after the integration-specific handler has completed the WebSocket handshake.
Add initial synchronization (connection_timed_out: bool) to ensure that there is never a race between a timeout and accepting a connection.
Robustly terminate the timeout task when it is done or no longer needed.
Add error handling around the timeout task.
Add a Task error handling hook to report unhandled errors in Tasks. This is not yet in use; we need to agree on which log channel to use, perhaps "strawberry.task".
Provide a shutdown() API to the integrations, replacing the former, rather unwieldy, boilerplate code.
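To make the ordering concrete, here is a minimal sketch of the idea, not the actual code in handlers.py. The class `SketchHandler` and the methods `on_request_accepted()` and `handle_connection_init()` are hypothetical names chosen for illustration; only `connection_timed_out` and `shutdown()` are named in this PR. It assumes close code 4408, which the graphql-transport-ws protocol uses for a connection initialisation timeout.

```python
import asyncio
from typing import Optional, Tuple


class SketchHandler:
    """Hypothetical stand-in for the graphql_transport_ws handler."""

    def __init__(self, connection_init_wait_timeout: float) -> None:
        self.timeout = connection_init_wait_timeout
        self.connection_init_received = False
        self.connection_timed_out = False
        self.closed_with: Optional[int] = None
        self.timeout_task: Optional[asyncio.Task] = None

    def on_request_accepted(self) -> None:
        # Called by the integration only after the WebSocket handshake is
        # complete, so the timeout can never fire before 'accept' is sent.
        self.timeout_task = asyncio.create_task(self._timeout_handler())

    async def _timeout_handler(self) -> None:
        await asyncio.sleep(self.timeout)
        if self.connection_init_received:
            return
        # Set the flag before closing, so a late connection_init sees it.
        self.connection_timed_out = True
        await self.close(4408)

    async def handle_connection_init(self) -> bool:
        if self.connection_timed_out:
            return False  # the timeout won; ignore the late message
        self.connection_init_received = True
        if self.timeout_task is not None:
            self.timeout_task.cancel()  # no longer needed
        return True

    async def close(self, code: int) -> None:
        # A real handler would close the socket; the sketch records the code.
        self.closed_with = code

    async def shutdown(self) -> None:
        # Single entry point for integrations to clean up the timeout task.
        if self.timeout_task is not None:
            self.timeout_task.cancel()
            try:
                await self.timeout_task
            except asyncio.CancelledError:
                pass
            self.timeout_task = None


async def demo_client_in_time() -> SketchHandler:
    handler = SketchHandler(connection_init_wait_timeout=0.05)
    handler.on_request_accepted()
    assert await handler.handle_connection_init()
    await handler.shutdown()
    return handler


async def demo_client_too_late() -> Tuple[SketchHandler, bool]:
    handler = SketchHandler(connection_init_wait_timeout=0.01)
    handler.on_request_accepted()
    await asyncio.sleep(0.05)  # let the timeout fire first
    accepted = await handler.handle_connection_init()
    await handler.shutdown()
    return handler, accepted
```

Because both the timeout task and the message handler run on the same event loop and only touch the two flags between awaits, whichever runs first decides the outcome and the other one sees that decision.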

Types of Changes

Issues Fixed or Closed by This PR

Issue with starlite websockets test #2702

Checklist

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
I have tested the changes and verified that they work and don't break anything (as well as I can manage).

codecov · 2023-04-10T18:55:40Z

Codecov Report

Merging #2703 (4669480) into main (e9faa9c) will increase coverage by 0.01%.
The diff coverage is 92.50%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2703      +/-   ##
==========================================
+ Coverage   96.48%   96.49%   +0.01%     
==========================================
  Files         194      197       +3     
  Lines        7988     8070      +82     
  Branches     1449     1457       +8     
==========================================
+ Hits         7707     7787      +80     
+ Misses        181      180       -1     
- Partials      100      103       +3

botberry · 2023-04-10T19:30:30Z

Thanks for adding the RELEASE.md file!

Here's a preview of the changelog:

This release improves the graphql-transport-ws implementation by starting the sub-protocol timeout only when the connection handshake is completed.

Here's the preview release card for twitter:

Here's the tweet text:

🆕 Release (next) is out! Thanks to @kristjanvalur for the PR 👏

Get it here 👉 https://github.com/strawberry-graphql/strawberry/releases/tag/(next)

kristjanvalur · 2023-04-17T22:05:23Z

Because the timeout is started as soon as the 'handle' method is called, it can trigger before an 'accept' message is sent. This is valid WebSocket behaviour: the connection is simply rejected.

However, this races with the app, which is subsequently trying to send a WebSocket accept message. This is likely what causes the failure here, where the client fails to shut down and ends up in some odd deadlock.

This PR makes sure that the timeout is only started after the base WebSocket connection is accepted, and then maintains a synchronized, race-free state with respect to the sub-protocol handshake.

kristjanvalur · 2023-05-02T10:53:00Z

One question:
I added a skeleton error handler for background tasks. The only sensible thing to do with such errors is to log them, but I didn't implement that in this PR. I could do that now if we agree on a log channel to use. I suggested strawberry.task; maybe that is OK?
These are the only tasks created by strawberry; all other top-level task logging is done by the web framework in question.

patrick91 · 2023-05-02T10:54:51Z

One question: I added a skeleton error handler for background tasks. The only sensible thing to do with such errors is to log them, but I didn't implement that in this PR. I could do that now if we agree on a log channel to use. I suggested strawberry.task; maybe that is OK? These are the only tasks created by strawberry; all other top-level task logging is done by the web framework in question.

when are these errors happening?

kristjanvalur · 2023-05-02T12:38:10Z

when are these errors happening?

Never, hopefully. But whenever you create a background task, it is prudent to install a top-level error handler to catch and log errors. If you don't, Python will emit a warning about a task whose exception was never retrieved, but that warning may end up anywhere. In the case of the timeout task, there really aren't many things which can go wrong. But for the subscription task, all kinds of errors can occur, and it is best to handle them in-task with a top-level error handler.

The alternative is to have the main task "await" all background tasks and catch and log any errors which occur there.

(asyncio.CancelledError doesn't need to be handled and is ignored if it is raised to the top, but all other errors will cause a warning somewhere.)
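As a rough illustration of the in-task handler described above, here is a minimal sketch using only the standard library. The wrapper `run_logged()`, the coroutine `failing_subscription()` and the `demo()` helper are hypothetical, and the logger name "strawberry.task" is only the channel suggested in this thread, not a settled choice.

```python
import asyncio
import logging
from typing import Awaitable, List

# Assumption: the channel name is the one proposed in this thread.
logger = logging.getLogger("strawberry.task")


async def run_logged(coro: Awaitable[None], name: str) -> None:
    """Top-level error handler for a background task (hypothetical helper)."""
    try:
        await coro
    except asyncio.CancelledError:
        # Cancellation is normal shutdown, so let it propagate untouched.
        raise
    except Exception:
        # Log here, inside the task, so nothing is left unretrieved.
        logger.exception("Unhandled error in background task %r", name)


async def failing_subscription() -> None:
    raise RuntimeError("resolver blew up")


async def demo() -> List[str]:
    messages: List[str] = []

    class Collect(logging.Handler):
        def emit(self, record: logging.LogRecord) -> None:
            messages.append(record.getMessage())

    handler = Collect()
    logger.addHandler(handler)
    try:
        task = asyncio.create_task(
            run_logged(failing_subscription(), "subscription")
        )
        await task  # does not raise: the wrapper already handled the error
    finally:
        logger.removeHandler(handler)
    return messages
```

With a wrapper like this the error is reported on a known log channel at the point where it happens, instead of depending on when the task object is garbage collected.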

patrick91 · 2023-05-02T13:00:48Z

when are these errors happening?

Never, hopefully. But whenever you create a background task, it is prudent to install a top-level error handler to catch and log errors. If you don't, Python will emit a warning about a task whose exception was never retrieved, but that warning may end up anywhere. In the case of the timeout task, there really aren't many things which can go wrong. But for the subscription task, all kinds of errors can occur, and it is best to handle them in-task with a top-level error handler.

The alternative is to have the main task "await" all background tasks and catch and log any errors which occur there.

(asyncio.CancelledError doesn't need to be handled and is ignored if it is raised to the top, but all other errors will cause a warning somewhere.)

ok, let's add a log then!

let's maybe do strawberry.ws.task?

kristjanvalur force-pushed the kristjan/timeouts branch from 20c1429 to a7b1c88 Compare April 10, 2023 19:04

kristjanvalur marked this pull request as ready for review April 10, 2023 19:27

DoctorJohn self-requested a review April 13, 2023 13:38

kristjanvalur force-pushed the kristjan/timeouts branch 2 times, most recently from 40b351d to 265cf47 Compare April 21, 2023 09:23

kristjanvalur added 5 commits April 22, 2023 10:42

Fix race condition with connection timeouts

8602aab

Start protocol timeout only after websocket handshake is done.

b68971e

Provide a special shutdown() method for the graphql_transport_ws handler

a0b50ff

mark coverage

03da051

Modify connection timeout test now that timeout is immediately cancelled.

89e1865

kristjanvalur force-pushed the kristjan/timeouts branch from 265cf47 to c20bdc1 Compare April 22, 2023 12:23

Add unit test for the error handling in the timeout task

3717320

kristjanvalur force-pushed the kristjan/timeouts branch from c20bdc1 to 3717320 Compare April 22, 2023 14:28

rjwills28 mentioned this pull request Apr 25, 2023

Warnings when running tests using the graphql-transport-ws websocket protocol #2720

Open

patrick91 reviewed May 1, 2023

strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py (outdated, resolved)

Update strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py

f49fa85

Co-authored-by: Patrick Arminio

kristjanvalur changed the title ~~Fix problems with timeouts in strawberry_transport_ws~~ Fix problems with timeouts in graphql_transport_ws May 2, 2023

add RELEASE.md

8db98ad

botberry added bot:has-release-file labels May 2, 2023

patrick91 reviewed May 2, 2023

RELEASE.md (outdated, resolved)

Update RELEASE.md

e54848d

Co-authored-by: Patrick Arminio

Log errors in the timeout background task

4669480

patrick91 approved these changes May 2, 2023

patrick91 merged commit 6e730d9 into strawberry-graphql:main May 2, 2023

kristjanvalur deleted the kristjan/timeouts branch May 2, 2023 14:00
