
ASGIRef 3.4.1 + Channels 3.0.3 causes non-deterministic 500 errors serving static files #1722

Closed
rdmrocha opened this issue Jul 2, 2021 · 34 comments · Fixed by #1890

@rdmrocha

rdmrocha commented Jul 2, 2021

So this was the very first scenario in which the Single thread executor error was found, and that led to me opening django/asgiref#275

While trying to get a simple repro case for it, we figured out a way to trigger a related error in a very simple way, and that was fixed in https://github.com/django/asgiref/releases/tag/3.4.1

But testing the new 3.4.1 version against our code-base still yielded the same 500 errors while serving static files (at least) in the dev environment.

I've updated https://github.com/rdmrocha/asgiref-thread-bug with this new repro case, which loads a crapload of JS files (1,500), though that number can be changed in views.py.

It doesn't ALWAYS happen (so you might need a hard-refresh or two) but when it does, you'll be greeted with something like this:
(Screenshots: 500 error pages from the dev server.)

I believe this is still related to django/asgiref@13d0b82, as reverting to v3.3.4 via requirements.txt makes the error go away.

Looking at the offending code inside channels/http.py, it looks like this might be a thread-exhaustion issue, but this is pure speculation.

Since handle is decorated with sync_to_async, the send callable is forced to become sync, and we await it like this: await self.handle(scope, async_to_sync(send), body_stream).
If there are no more threads available, I speculate that they might end up in a deadlock waiting for this nested sync_to_async/async_to_sync call to unwind, eventually triggering the protection introduced in django/asgiref@13d0b82
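
For illustration, the wrapping pattern being described looks roughly like this (a simplified sketch, not the actual channels/http.py code):

from asgiref.sync import async_to_sync, sync_to_async

# Simplified sketch of the pattern described above: a sync `handle` wrapped
# with sync_to_async is awaited, and the async `send` callable is passed into
# it wrapped with async_to_sync, so every send() hops back to the event loop.
class SketchHandler:
    @sync_to_async
    def handle(self, scope, send, body_stream):
        # Runs on a worker thread; send() is the async_to_sync-wrapped callable.
        send({"type": "http.response.start", "status": 200, "headers": []})
        send({"type": "http.response.body", "body": b"ok"})

    async def __call__(self, scope, receive, send):
        body_stream = None  # body handling elided in this sketch
        await self.handle(scope, async_to_sync(send), body_stream)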

But take this last part with a grain of salt as this is pure speculation without diving into the code and debugging it.
Hope it helps

@andrewgodwin
Member

Would you say this is an asgiref issue? It would seem in this case that the deadlock detection is doing the thing it's meant to do, which is preventing a thread-exhaustion deadlock (if that is indeed the cause)? What in asgiref do you think needs changing to fix it?

@rdmrocha
Author

rdmrocha commented Jul 2, 2021

Whether it's Channels or ASGIRef, one of them needs a fix to be able to handle large amounts of fast requests, which is what's currently breaking our SPA served by Django using Channels 3.0.3 and (currently pinned) ASGIRef 3.3.4 (the behaviour mentioned was introduced later by django/asgiref@13d0b82).

In a fresh "app-root" request we serve hundreds of small HTML/script files, as our architecture is component based.
The provided example shows how the bug manifests by serving an absurd number of files (since the failure was random, this maximizes the chance of it happening). Serving the files should never yield a 500 error just because the thread pool/queue is full (if that is indeed the case); the requests need to be kept on hold.
Since it's ASGIRef that's throwing the exception, and this only happens AFTER the mentioned django/asgiref@13d0b82, I'm assuming that there's either a bug or an edge case that the current code doesn't handle.

If you feel that this should be a Channels issue, I'm happy to recreate this ticket over there.

@carltongibson
Member

I'm happy to take this over to Channels. If we're hitting an issue here then the new code is doing the right thing.

I'll comment on this:

...needs a fix to be able to handle large amounts of fast requests, which is what's currently breaking our SPA app being served by...

The static files handler is a development convenience. It's not at all intended for production. Rather you should run collectstatic and configure your web server or CDN or ... to serve the staticfiles. (See the Django docs for some guidance there.)

Also, if you're still hitting this after that, if you serve any plain Django views via WSGI (gunicorn say) and leave only the Consumers to be handled by ASGI, you'll not trigger this code path at all. (That's not a fix, but it will get you going in a robust way.)

@carltongibson carltongibson transferred this issue from django/asgiref Jul 3, 2021
@rdmrocha
Author

rdmrocha commented Jul 3, 2021

Thanks for the feedback. Our static files in production are correctly served by S3. The issue did not trigger in production, but we saw it in the dev environment in a scenario directly analogous to the provided example.

I understand that static-file serving is a convenience and is not representative of how production works. I'm just afraid that this specific behavior may be triggered in other scenarios. And investigating this already led to ASGIRef 3.4.1, so I still believe we are on to something.

@carltongibson
Member

Hi @rdmrocha — Yes, I think it's worth looking into, but I think the error is showing we have a potential issue in the static files handler, rather than an issue in asgiref… — If you're happy to keep digging that's super. Thanks!

@ckcollab

Btw, if anyone gets this error during Selenium tests, you can turn off static file serving. We use Whitenoise anyway, so disabling static file serving fixed it for us:

from channels.testing import ChannelsLiveServerTestCase


class OurSpecialTestCases(ChannelsLiveServerTestCase):
    serve_static = False


class OurTestCase(OurSpecialTestCases):
    ...

(ps tyvm for Channels, it has made our lives much easier!)

@codekiln

codekiln commented Sep 1, 2021

We're consistently seeing this in development with runserver, often enough for it to be a major inconvenience, ever since we upgraded our Channels dependencies:

asgiref==3.3.4 => 3.4.1
channels-redis==3.2.0 => 3.3.0
channels==3.0.3 => 3.0.4
Twisted==21.2.0 => 21.7.0

@blayzen-w

blayzen-w commented Sep 8, 2021

I've been seeing it in development too, with around 0.5% to 1% of requests failing on asgiref 3.4.1 and channels 3.0.4. It's not a big deal when an image fails, but we use code splitting, so when it hits one of our JS files the entire app fails to load. When testing a page that makes around 20 static file requests at a time, we observed the exception being thrown as early as the 5th request.

I have not seen the error in production, but we serve all static files through either S3 or uWSGI.

@Simanas

Simanas commented Nov 11, 2021

My two cents here!

I really appreciate everyone who contributed to Django, ASGI and Channels, but this needs very serious investigation. We all know that static files must be served separately, and we do that in production, but that's not the point here.

Essentially the stack is no longer capable of handling the same amount of load as previous versions of Channels and asgiref, so whoever implemented these updates needs to go back and make sure we do not degrade Django-ASGI-Channels performance; otherwise all the great things introduced with the updates become pointless, even harmful, and can cause a lot of trouble for many.

For example, I see this exact behavior in a fairly standard Django admin, where the admin tries to load a few additional script and style files, but the dev server seems unable to handle it. It feels like it can't handle more than ~20 concurrent requests at a time to the /static/ endpoint.

@karatemir

Hi,

What is the recommended course of action regarding this problem?
I followed the advice here and downgraded asgiref from 3.4.1 to 3.3.4 and the problem disappeared but I was wondering whether this is the right way to go.

Thanks.

@ShaheedHaque

We are seeing this in development too.

@LinkkG

LinkkG commented Dec 25, 2021

Right now I'm using the latest versions, with Python 3.9.9:

django==4.0
channels==3.0.4 
asgiref==3.4.1

I hope I can be of help:

I don't know if it's just me, but I only saw this happen with the /static folder and not with /media... If that's the case, it would mean that Django's default STATIC_URL serving may have a buggy difference compared with adding a custom static() route in urls.py (the django.views.static.serve method works OK).

To test this,

  1. I collected /static/ files to another folder and then...
  2. in settings.py
    NEW_ROOT = BASE_DIR / "public"
  3. like media, added it to urls.py this way at the end
    urlpatterns += static('/public/', document_root=settings.NEW_ROOT)
  4. Of course that would mean changing all the template references to the new /public/ URL value...

A simpler way to test this, without changing the templates, is:

  1. Adding static as a custom route in urls.py:
from django.conf import settings
from django.conf.urls.static import static
# your urlpatterns here
urlpatterns += static(settings.STATIC_URL, document_root=settings.STATIC_ROOT)
  2. Then running the server with the --nostatic flag:
    python manage.py runserver --nostatic

(I tried both ways, reloading many times, and the 500 error is gone.)
Note: remember to remove/comment out the STATIC_ROOT folder from the STATICFILES_DIRS list to avoid an error.

@tony

tony commented Jan 3, 2022

Before this, staying below asgiref 3.4 would work. Recent releases expose systems to this issue:

  • uvicorn: with Python 3.10, uvicorn 16.0 is needed for websockets, but it requires asgiref 3.4.1, so 3.3.4 is no longer an option. asgiref is used there for typings; the 3.4.1 constraint was introduced in "⬆️ Bump asgiref to 3.4.0" (encode/uvicorn#1100)
  • django 4.0: requires asgiref >=3.4.1

@tony

tony commented Jan 3, 2022

Also, if you're still hitting this after that, if you serve any plain Django views via WSGI (gunicorn say) and leave only the Consumers to be handled by ASGI, you'll not trigger this code path at all. (That's not a fix, but it will get you going in a robust way.)

@carltongibson regarding this

Hi and thank you! I know this was from a few months back, could you perhaps provide an example of what this would look like and where it'd be delegated (gunicorn command, asgi application config, django settings)?

e.g. assuming the most basic setup with gunicorn and uvicorn, e.g.

gunicorn --worker-class=uvicorn.workers.UvicornWorker project.asgi:application, with the ASGI_APPLICATION being the most basic possible, application = ProtocolTypeRouter({}). Beyond that, how could/would it be possible to serve static files via WSGI, or otherwise avoid this code path?

In my case I'm also using whitenoise (not sure if that's a plus or not)

@tony

tony commented Jan 6, 2022

@rdmrocha, @ckcollab, @codekiln, @blayzen-w @Simanas @karatemir @ShaheedHaque @LinkkG

Some questions about when you experience this:

  • Python version, asgiref, channels version? (If not python 3.10.1, asgiref 3.4.1, channels 3.0.3)
  • Where is server running? local shell, docker, github action, heroku, ec2, etc?
  • Server? e.g. ./manage.py runserver? Running gunicorn, daphne etc? If so, what sort of arguments do you pass?
  • When does it happen? e.g. Normal browser usage, automated testing?
  • Is this affecting you in production, test cases, etc?
  • What views/files does it happen with? Static files, normal views?
  • Any other things relevant / worth mentioning?

My case:

  • python 3.10.1, asgiref 3.4.1, channels 3.0.3

    • can happen with < channels 3.0.3
  • Server: github action in docker container

  • gunicorn 20.0.4 w/ uvicorn 16.0

    gunicorn \
        --workers=1 \
        --bind=0.0.0.0:8000 \
        --max-requests=0 \
        --max-requests-jitter=0 \
        --worker-class=uvicorn.workers.UvicornWorker \
        path.to.asgi:application
    

    Tried variations of workers, max requests and max requests jitter

  • Happens during selenium testing

  • Static files w/ whitenoise (with and without)

@JeroenvO

same issue.

  • python 3.10.1 in pipenv, django 3.2.11, channels 3.0.4, asgiref 3.4.1 (problem disappears with asgiref 3.3.4)
  • localhost dev server in jetbrains pycharm manage.py runserver localhost:8000
  • including websockets (synchronous)
  • static files are not served: RuntimeError: Single thread executor already being used, would deadlock

@ShaheedHaque
Copy link

ShaheedHaque commented Jan 13, 2022

  • Python 3.8.10, asgiref 3.4.1, channels 3.0.4.
  • Where is server running? local shell
  • Server? ./manage.py runserver
  • When does it happen? Selenium automated testing
  • Is this affecting you in production, test cases, etc? Test cases. In production, we use gunicorn with uvicorn behind nginx (which serves the static files), so I am hoping I have correctly read that this should not be an issue there.
  • What views/files does it happen with? Static files AFAIK.
  • Any other things relevant / worth mentioning? The problem is new since upgrading from asgiref 3.3.x, but I assume that is because the check was not present in that code.

@fmgoncalves

The underlying problem is channels' StaticFilesWrapper.
This implementation is used to serve static files on the development server (runserver). However, its use of sync_to_async and async_to_sync is prone to deadlocks in asgiref.
One fix would be to pass thread_sensitive=False to the sync_to_async decorator on Channels' handler, although, not being intimately familiar with Channels, I can't be sure there aren't undesirable side effects (my tests did not hit any, though).
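
For clarity, the change being suggested would look roughly like this (a simplified sketch only; the real handler lives in channels/http.py and may differ):

from asgiref.sync import sync_to_async

class AsgiHandler:
    # Sketch of the proposed change: opt the sync handler out of the (new)
    # thread_sensitive=True default so requests aren't all funnelled through
    # the single shared thread.
    @sync_to_async(thread_sensitive=False)
    def handle(self, scope, send, body_stream):
        ...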

Fortunately, there is a workaround that doesn't involve patching channels:

  1. Remove django.contrib.staticfiles from the installed apps in DEBUG (runserver) runs. This confuses channels' runserver and keeps it from short-circuiting the request flow through StaticFilesWrapper. In your settings.py:
if DEBUG:
    INSTALLED_APPS = [a for a in INSTALLED_APPS if a != 'django.contrib.staticfiles']
  2. Since StaticFilesWrapper is no longer intercepting static files, we need to add regular routes for them. Luckily, the infrastructure for that is already available. In your top-level urls.py:
from django.conf import settings

if settings.DEBUG:
    from django.apps import apps
    if not apps.is_installed('django.contrib.staticfiles'):
        from django.contrib.staticfiles.urls import staticfiles_urlpatterns
        urlpatterns = staticfiles_urlpatterns() + urlpatterns

And you're done. Local tests indicate that the problems with concurrent requests for static files are gone and all works as intended.

@kezabelle

Fair warning, I don't use channels so I could be entirely barking up the wrong tree, but here I am anyway, pointing things out (and teaching grandmother to suck eggs, possibly) in case they are relevant in sparking further discussion of solutions.

From what I can see of the git blame for the channels runserver and staticfiles bits linked above by @fmgoncalves, the commits are predominantly from before Django got its own ASGIStaticFilesHandler by mixing together the StaticFilesHandlerMixin and ASGIHandler classes in django/django@a415ce7 ... I'm wondering if perhaps wholly subsuming that would affect things.

i.e. Command.get_application could return ASGIStaticFilesHandler(get_default_application() ... ), where ASGIStaticFilesHandler is either the Django-provided one or, if that's not appropriate (I don't know the implementation), a mixed-together version based on channels.http.AsgiHandler and django.contrib.staticfiles.handlers.StaticFilesHandlerMixin.

Apologies if I'm just muddying things further or not proving useful...

@carltongibson
Member

carltongibson commented Jan 27, 2022

@kezabelle Yes. Using Django’s implementation and removing the Channels one is the way to go.

Channels’ AsgiHandler predates Django’s own ASGI support, which arrived in Django 3.0. From then on, providing such a handler is not part of Channels’ job. Much better is to defer to the Django implementation, so Channels can focus on the Consumers, the Channel Layer, and the other bits that aren’t going to make their way into Django.

The proximate reason that’s not been done was the need to maintain Django 2.2 support, but we can let that go now. (See also #1795 — time for a major version bump, again.)

We need to keep runserver, since Django doesn’t have an ASGI simple server available (and likely never will for websockets), but I’d happily take PRs dropping AsgiHandler and StaticWrapper and such.

In the meantime (responding to various points above):

  • It’s not worth putting time into fixing this issue — we need to migrate to the Django implementation, not patch up this one.
  • using WSGI to serve Django views (unless you’re actively using ASGI) is still the proven way.
  • I’d imagine Whitenoise would just work™ if you’re using that.
  • Using (say) nginx allows you to route some requests to an ASGI process, and others to a WSGI one. (Define two upstreams, and have the Channels consumers at /channels/… for example — you can Google the rest.)
  • @fmgoncalves’s suggestion would very likely work (that’s how we used to do it).
  • If you’re hitting issues here, I have to think you’re pushing runserver too hard. Maybe it could go faster, but switching to a proper setup will serve you better. (I imagine pushback on this, but if you’re hitting this issue, that’s the clearest way to work around it, short of an actual PR to channels.)
  • Sorry if it’s degraded from previously. The deadlocks were real; they were just hidden. The change to asgiref just brought them out. Everything in the async domain is still very new: even the experts are learning.
  • Development is slow. It’s all volunteer. But it’s inexorable. Year-by-year it comes on. (Contributions welcome!)
  • That applies double during the pandemic. My own bandwidth for Channels had been exceedingly constrained since (approx) last summer.

@tony

tony commented Jan 27, 2022

  • Development is slow. It’s all volunteer. But it’s inexorable. Year-by-year it comes on. (Contributions welcome!)
  • That applies double during the pandemic. My own bandwidth for Channels had been exceedingly constrained since (approx) last summer.

@carltongibson @andrewgodwin Thank you for being courteous, and for your efforts on the project in general!

@tony

tony commented Jan 27, 2022

  • It’s not worth putting time into fixing this issue — we need to migrate to the Django implementation, not patch up this one.

Does this imply there will be a change to channels itself and this issue will/may be overcome by events?

As for the other points: I hope to see some examples materialize from this. Maybe it warrants a Q&A discussion to share those if this gets closed?

@ShaheedHaque

ShaheedHaque commented Jan 27, 2022 via email

@dr-luke

dr-luke commented Feb 8, 2022

As @fmgoncalves has mentioned, one way is to alter how the files are served, but I have found a slightly more reliable patch, which I have implemented in my Django Channels setup based on the information in their post.

It seems that the change of the thread_sensitive default from False to True is what causes these deadlock-detection messages. As far as I can see, Django Channels doesn't seem to mind the occurrences of the deadlocks.

That being said, I felt it would be safe to monkey patch the sync_to_async function for my Django installation.

project_app/monkey_patches.py

from asgiref.sync import sync_to_async

def patch_sync_to_async(*args, **kwargs):
    """
    Monkey patch the sync_to_async decorator
    ----------------------------------------
    ASGIRef made a change in their defaults that has caused major problems
    for channels. The decorator below needs to be updated to use
    thread_sensitive=False; that's why we are patching it on our side for now.
    https://github.com/django/channels/blob/main/channels/http.py#L220
    """
    kwargs['thread_sensitive'] = False
    return sync_to_async(*args, **kwargs)

Then you just overwrite the existing instance of asgiref's sync_to_async with our patched wrapper that enforces thread_sensitive=False:

project_app/__init__.py

from . import monkey_patches
import asgiref

### Monkey Patches
asgiref.sync.sync_to_async = monkey_patches.patch_sync_to_async

This makes Channels, and all of Django, run in a thread-insensitive manner, like it did before the ASGIRef update.

@brownan

brownan commented May 12, 2022

Having run into this issue and dug into its root cause, I think I can provide some insight. As I understand it, the deadlock detection in asgiref works like this:

  1. A context variable is set when entering the single threaded executor used for sync tasks. https://github.com/django/asgiref/blob/3.5.1/asgiref/sync.py#L399-L401
  2. That context variable is un-set when leaving the single threaded executor. https://github.com/django/asgiref/blob/3.5.1/asgiref/sync.py#L430-L431
  3. That context variable is checked before entering, and if it's already set, an error is raised. https://github.com/django/asgiref/blob/3.5.1/asgiref/sync.py#L393-L396
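
A minimal sketch of that guard pattern (simplified for illustration; the real implementation is in asgiref/sync.py, linked above) might look like this:

import asyncio
import contextvars

# Simplified illustration of the guard described in (1)-(3) above; not the
# actual asgiref code.
executor_in_use = contextvars.ContextVar("single_thread_executor_in_use", default=False)

async def run_on_single_thread_executor(func, *args):
    # (3) Refuse to enter if this context has already marked the executor busy.
    if executor_in_use.get():
        raise RuntimeError("Single thread executor already being used, would deadlock")
    # (1) Mark the context before handing the sync work to the executor.
    executor_in_use.set(True)
    try:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, func, *args)
    finally:
        # (2) Un-set the marker once the sync work completes.
        executor_in_use.set(False)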

The issue here is that contexts may be re-used by daphne / twisted in the case of persistent connections. When a second HTTP request is sent on the same TCP connection, twisted re-uses the same context from the existing connection instead of creating a new one.

So in twisted, context variables are per connection, not per HTTP request. This subtle difference then causes a problem because of how the StaticFilesHandler, built on the channels AsgiHandler, works. It uses async_to_sync(send) to pass the send method into self.handle(), which is itself decorated with sync_to_async.

So what I think is happening is this sequence of events:

  1. The sync thread is busy handling a static files request
  2. The sync thread finishes its request by making its last call to send() (but the sync thread does not yet exit!)
  3. Async thread finishes the response
  4. Async thread sees there's another request on the connection
  5. Async thread launches a new application in the same context
  6. ASGIHandler and SyncToAsync try to use the single thread executor but it's busy, so it errors
  7. The first static request in the sync thread finishes successfully, which it would have anyways

If step 6 blocked instead of erroring, all would be fine, since the sync thread would have finished anyway. I don't think there's a deadlock here, and I don't think the deadlock detection code in asgiref is working properly.

@tony

tony commented Jun 7, 2022

So what I think is happening is this sequence of events:

@brownan and anyone else:

Would it be possible to reproduce this behavior from these steps in a test? Would such a test need to be written in asgiref, channels, or django itself?

@mpasternak

mpasternak commented Jul 10, 2022

@tony @brownan My experience is getting 500 errors from Daphne while using Firefox to debug a test running against a live server, on macOS. Some files work, some don't -- I guess Firefox is opening a few connections to the live server to get the static files as soon as possible.

So I guess I would try some command-line tool that opens multiple connections to a single server and tries to download something. Tools for testing HTTP, like ab ("Apache Bench"), can be parametrized. Perhaps there is a tool like that in Python that we could use to run in tests...
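
For example, a quick-and-dirty reproduction script might look roughly like this (a sketch under assumptions: the requests library is available, the dev server is on localhost:8000, and the generated static paths exist in your project):

import concurrent.futures
import requests

# Hammer the dev server's /static/ URLs over keep-alive connections from
# several threads, mimicking a browser loading many assets, and report any
# non-200 responses. BASE and PATHS are assumptions; adjust to your project.
BASE = "http://localhost:8000"
PATHS = [f"/static/js/chunk-{i}.js" for i in range(200)]

def fetch_all(paths):
    # One Session per worker so each worker re-uses a persistent connection.
    with requests.Session() as session:
        return [(path, session.get(BASE + path).status_code) for path in paths]

def main(workers=8):
    chunks = [PATHS[i::workers] for i in range(workers)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for results in pool.map(fetch_all, chunks):
            for path, status in results:
                if status != 200:
                    print(status, path)

if __name__ == "__main__":
    main()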

I'll be examining this bug this week, I think, as I'd like it to be fixed.

@brownan

brownan commented Jul 11, 2022

I haven't had any time lately to dig into this further, but keep me updated and I'll help out how I can.

@carltongibson
Member

carltongibson commented Jul 25, 2022

Hi @brownan — nice investigation #1722 (comment)

This will be resolved by moving to Django's static files handler, which doesn't exhibit the same issue, but if you think this is true...

... I don't think the deadlock detection code in asgiref is working properly.

... and you can reduce that to a minimal example (without Twisted and all that, if possible), then a report to django/asgiref would be worthwhile!
Thanks.

@brownan

brownan commented Jul 25, 2022

yeah, I realize that's a bold claim to make without a minimal test case 😀 I'll see if I can get the time to do that soon

@carltongibson
Member

OK, I've started work on what will be v4.0. #1890 moves to using django.contrib.staticfiles, and should address this.

If anyone wants to give that a run, or follow main over the next few weeks to help spot any issues, that would be great.

Once I've made a bit more progress, I'll open a tracking issue for v4.0 as well.

@JulienPalard
Contributor

If anyone wants to give that a run, or follow main over the next few weeks to help spot any issues, that would be great.

I tried the main channels branch today on my project (just on my laptop, not in prod!) and it works as expected! Thanks!

@carltongibson
Member

Thanks for confirming @JulienPalard. I'm pulling together the releases now, so it'll be a few days before the final versions are live.

@khamaileon

4.0.0b1 fixed it for me as well. Can't wait for v4! Thanks!
