Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent garbage collection of main lifespan task #972

Merged
merged 1 commit into from
Mar 4, 2021
Merged

Prevent garbage collection of main lifespan task #972

merged 1 commit into from
Mar 4, 2021

Conversation

MatthewScholefield
Copy link
Contributor

@MatthewScholefield MatthewScholefield commented Mar 2, 2021

This maintains a reference to the created task in the lifespan to fix some subtle bugs that arise from the entire app being garbage collected due to no references of the running coroutine being kept.

TL;DR:

  • asyncio.create_task(foo()) = bad
  • task = asyncio.create_task(foo()) = good

This bug can be illustrated with a relatively simple app using aioredis (I could probably find an example without aioredis, but this example should suffice):

import aioredis
pool = None

async def app(scope, receive, send):
    global pool
    if scope['type'] == 'lifespan':
        message = await receive()  # On startup
        pool = await aioredis.create_redis_pool('redis://localhost:6379')
        await send({"type": "lifespan.startup.complete"})
        message = await receive()  # Wait until shutdown
    else:
        await pool.ping()  # (Use pool during requests)

When running this with uvicorn example:app it seems like everything works (the app starts up correctly), but if we force garbage collection on a specific line within the event loop, we consistently encounter the following error:

Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<RedisConnection._read_data() running at .../site-packages/aioredis/connection.py:186> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fb4af031f70>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at .../site-packages/aioredis/connection.py:168]>
Task was destroyed but it is pending!
task: <Task pending name='Task-2' coro=<LifespanOn.main() running at .../site-packages/uvicorn/lifespan/on.py:55> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fb4aefbf070>()]>>

While a bit hacky, to debug we can force this garbage collection in the event loop as follows:

  • Edit /usr/lib/python3.*/asyncio/base_events.py and within def _run_once, near the bottom immediately within the for i in range(ntodo):, add import gc; gc.collect().
  • Force Uvicorn to use the asyncio event loop so that it uses this modified code by running with: uvicorn example:app --loop asyncio

After this change, every execution of uvicorn should result in the error shown above.

If we apply the changes from this PR we can see this will no longer error.

Note, this was initially discovered and reported within aioredis. However, I've since realized that the error lied in the uvicorn code that ran above it.

@euri10
Copy link
Member

euri10 commented Mar 3, 2021

it would help having a reproducible example without aioredis and that doesn't require to edit the standard library, bit hacky is quite an understatement here.
I get that you stop seeing this error with the fix proposed, however that statement feels awkward to me :

the entire app being garbage collected due to no references of the running coroutine being kept.

a very simple example with the "bad" way of creating tasks like the below does not show this error, you can even add a gc.collect() in the counter task :

async def index(request: Request):
    return JSONResponse({"hello": "e"})


routes = [
    Route("/", index),
    WebSocketRoute("/ws", Echo)
]


async def counter(msg, delay):
    i = 0
    while True:
        print(msg, i)
        i += 1
        await asyncio.sleep(delay)


async def on_startup():
    logger.debug('startup lifespan')
    asyncio.create_task(counter('count', 1))


async def on_shutdown():
    logger.debug('shutdown lifespan')


app = Starlette(routes=routes, on_startup=[on_startup], on_shutdown=[on_shutdown])

so we need to understand why it is gc in the first place imho, quick google reveals some interesting links, another redis issue that looks very similar : jonathanslenders/asyncio-redis#56

https://bugs.python.org/issue21163 is also interesting and does not suggest anything related to hard-references.

@MatthewScholefield
Copy link
Contributor Author

MatthewScholefield commented Mar 3, 2021

Great questions! I went ahead and did a little more digging and have some more solid examples.

So, a simple example that fails with garbage collection is an app that uses StreamReader (for example, to do a web request):

import asyncio
import gc


async def request_google():
    reader, writer = await asyncio.open_connection('google.com', 80)
    writer.write(b'GET / HTTP/2\n\n')
    await writer.drain()
    response = await reader.read()
    return response

def app(scope):
    async def asgi(receive, send):
        google_response = await request_google()
        await send({"type": "http.response.start", "status": 200, "headers": [[b"content-type", b"text/plain"]]})
        await send({"type": "http.response.body", "body": google_response})

    return asgi

If we run this with the garbage collection modification, we see that it immediately errors with:

Task was destroyed but it is pending!
task: <Task pending name='Task-2' coro=<LifespanOn.main() running at .../lib/python3.9/site-packages/uvicorn/lifespan/on.py:55> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f2162fe2970>()]>>

You may notice that if we replace the request_google() with asyncio.sleep, this error no longer happens. The reason for this is only by coincidence. I looked into it and asyncio.sleep works by scheduling a call to submitting a future via asyncio.call_at which saves a strong reference to the arguments (in this case the future created by asyncio.sleep) within the loop._scheduled list. However, as Guido mentions in that link you found:

Most likely your program is simply relying on undefined behavior and the right way to fix it is to keep strong references to all tasks until they self-destruct.

So, it sounds like asyncio.sleep keeping a strong reference isn't a guarantee and only happens to be the case.

Note, we can replace asyncio.sleep with request_google in the Starlette example you provided and with the garbage collection modification in place, we will see the same issue:

Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<counter() running at ./example2.py:43> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fd0fabbfa30>()]>>

If, however, we would like to create an example where this happens without modifying the standard library code, we simply need to create an example where our application waits on a web request (request_google) and another function on the event loop runs garbage collection:

import asyncio
import gc


async def request_google():
    reader, writer = await asyncio.open_connection('google.com', 80)
    writer.write(b'GET / HTTP/2\n\n')
    await writer.drain()
    response = await reader.read()
    return response


async def do_gc():
    for i in range(10):
        await request_google()
        gc.collect()


def app(scope):
    async def asgi(receive, send):
        google_responses = b''
        task = asyncio.create_task(do_gc())
        for i in range(10):
            google_responses += await request_google()
        await task
        await send({"type": "http.response.start", "status": 200, "headers": [[b"content-type", b"text/plain"]]})
        await send({"type": "http.response.body", "body": google_responses})

    return asgi

Here, even without the standard library modification in place, we consistently see:

Task was destroyed but it is pending!
task: <Task pending name='Task-2' coro=<LifespanOn.main() running at .../lib/python3.9/site-packages/uvicorn/lifespan/on.py:55> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fd42a390040>()]>>

Interestingly, perhaps related to when Python performs garbage collection, we can also see this in the starlette example when we replace asyncio.sleep with request_google right after we hit ctrl+c:

count 19
INFO:     Shutting down
count 20
count 21
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [6039]
Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<counter() running at ./example2.py:43> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f7d2fbe2490>()]>>

In all of these cases, we don't see any instances of these errors with the modification proposed in this PR.

Anyways, let me know if this all makes sense. Would be happy to provide more details.

@euri10
Copy link
Member

euri10 commented Mar 4, 2021

ok @MatthewScholefield your example was definitely very helpful
I tweaked it a little and can trigger the exception pretty much every time, it still can run fine sometimes so it's a little bit flaky.
I wish there was a non-flaky way we could incorporate as a test.
However what I can attest is that the fix seems to work.
So I'm leaving comments on the implementation soon, you also can run ./scripts/lint to follow our formatting and we'll be set

thanks

import asyncio
import gc


async def request_google(x):
    reader, writer = await asyncio.open_connection('google.com', 80)
    writer.write(b'GET / HTTP/2\n\n')
    await writer.drain()
    response = await reader.read()
    print(x)
    return response


async def do_gc():
    for i in range(10):
        await request_google(f"do_gc: {i}")
        gc.collect()


async def startup():
    task = asyncio.create_task(do_gc())
    for i in range(10):
        await request_google(f"startup: {i}")
    await task


async def app(scope, receive, send):
    if scope['type'] == 'lifespan':
        message = await receive()
        if message['type'] == 'lifespan.startup':
            await startup()
            await send({'type': 'lifespan.startup.complete'})
    elif scope['type'] == 'http':
        await send({"type": "http.response.start", "status": 200, "headers": [[b"content-type", b"text/plain"]]})
        await send({"type": "http.response.body", "body": b"1"})

traceback:

/home/lotso/PycharmProjects/uvicorn/venv/bin/python -m uvicorn 972:app --log-level=debug
INFO:     Started server process [32655]
INFO:     Waiting for application startup.
startup: 0
do_gc: 0
startup: 1
do_gc: 1
startup: 2
do_gc: 2
startup: 3
do_gc: 3
startup: 4
do_gc: 4
startup: 5
do_gc: 5
do_gc: 6
Task was destroyed but it is pending!
source_traceback: Object created at (most recent call last):
  File "/home/lotso/.asdf/installs/python/3.9.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lotso/.asdf/installs/python/3.9.1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/__main__.py", line 4, in <module>
    uvicorn.main()
  File "/home/lotso/PycharmProjects/uvicorn/venv/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/lotso/PycharmProjects/uvicorn/venv/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/lotso/PycharmProjects/uvicorn/venv/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lotso/PycharmProjects/uvicorn/venv/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/main.py", line 362, in main
    run(**kwargs)
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/main.py", line 386, in run
    server.run()
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/server.py", line 49, in run
    loop.run_until_complete(self.serve(sockets=sockets))
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/server.py", line 66, in serve
    await self.startup(sockets=sockets)
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/server.py", line 81, in startup
    await self.lifespan.startup()
  File "/home/lotso/PycharmProjects/uvicorn/uvicorn/lifespan/on.py", line 30, in startup
    loop.create_task(self.main())
task: <Task pending name='Task-2' coro=<LifespanOn.main() running at /home/lotso/PycharmProjects/uvicorn/uvicorn/lifespan/on.py:64> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6e578c6e50>()] created at /home/lotso/.asdf/installs/python/3.9.1/lib/python3.9/asyncio/streams.py:515> created at /home/lotso/PycharmProjects/uvicorn/uvicorn/lifespan/on.py:30>
INFO:     ASGI 'lifespan' protocol appears unsupported.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
do_gc: 7
do_gc: 8
do_gc: 9

@@ -22,12 +22,14 @@ def __init__(self, config: Config) -> None:
self.startup_failed = False
self.shutdown_failed = False
self.should_exit = False
self.main_task = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wont be used anywhere, let's remove that


async def startup(self) -> None:
self.logger.info("Waiting for application startup.")

loop = asyncio.get_event_loop()
loop.create_task(self.main())
task = loop.create_task(self.main())
self.main_task = task # Keep a hard reference to prevent garbage collection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.main_task = task # Keep a hard reference to prevent garbage collection
main_lifespan_task = loop.create_task(self.main()) # noqa: F841 # Keep a hard reference to prevent garbage collection, see https://github.com/encode/uvicorn/pull/972

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is enough just to create the variable, let's also add a comment pointing to the PR as it's definitely non-trivial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up, I realized recently that this isn't enough to just create the variable. If the app waits on any io immediately after sending the startup complete message it will still get collected by GC. Probably not something many apps do though so relatively low impact.

@MatthewScholefield
Copy link
Contributor Author

Hmm, intuitively I would have expected that the task would continue running outside of startup() and so we would need to hold a reference to it within the parent scope, but you're definitely right, performing only this change does indeed fix the issue in all cases.

Anyways, I've made the suggested changes and the linting errors should be fixed.

Let me know if you'd like me to do anything else.

- Matthew

Copy link
Member

@euri10 euri10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for this.

@euri10 euri10 merged commit 83303ff into encode:master Mar 4, 2021
@MatthewScholefield MatthewScholefield deleted the bugfix/main-task-reference branch March 4, 2021 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants