Various gunicorn/timeout related upload failure bugs. #2640
Comments
Since you are using k8s, have you considered using nginx-ingress? If so, nginx will do the caching you are looking for. I have done it once, in the past, for a bare-metal server.
Thanks for your persistence, @ttys0dev. I think AWS's application load balancer tracks timeouts. I'll check when I get a sec, and do more research on switching the gunicorn worker model. Many of the fibers of my being resist adding another layer to our architecture, if we can avoid it, @troglodite2 :)
Agreed, that's a good option; nginx has good performance and lots of flexibility for handling this sort of thing.
This is a good stopgap, but it probably isn't the best long-term option. An nginx cache is going to have much better performance overall, since it doesn't use the relatively inefficient thread-per-connection model that non-asyncio django uses for handling IO.
One potential option is to migrate the django application to an asyncio request model, which is conceptually similar to how something like nginx processes requests. That is not a trivial change, however: you would effectively need to migrate from WSGI to ASGI and likely make other changes to avoid the synchronous codepaths.
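For context, a minimal sketch of what the WSGI-to-ASGI entrypoint switch looks like (module and settings names here are assumptions for illustration, not CourtListener's actual layout):

```python
# Hypothetical asgi.py: the ASGI counterpart of the existing cl_wsgi module.
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "cl.settings")  # assumed name

# An ASGI callable that an async-capable server/worker (e.g. uvicorn, or
# gunicorn with uvicorn's worker class) can serve instead of the WSGI app.
application = get_asgi_application()
```

The entrypoint itself is the easy part; the work discussed above is in auditing the synchronous codepaths behind it.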
So I've been hesitant to fix this issue for a few reasons:
So I guess I'm sort of +0 on this. It seems like a good idea, and I'm pretty convinced y'all know what you're talking about, but I'd like to either have instructions to test it myself or to have somebody say they built the image with the tweak and it worked fine. I hope that makes sense and explains my trepidation here.
From my reading of the docs, a 504 error is an idle-connection timeout. That sort of timeout is, from my understanding, unrelated to the gunicorn 3-minute worker timeout issue, since the timeout there is for a non-idle connection (i.e., the upload timeout does not trigger because the connection is idle, but because the worker timeout terminates a non-idle connection).
How are changes to the gunicorn docker build normally tested?
I'll try and test things out locally.
I could be wrong here, but I wasn't sure, so I checked what status code it was by going to a page on CL that I know times out, and it returned a 504?
Just by hand. They're rarely changed.
That'll be great. If you can say how you tested, that'd be helpful to future reviewers too.
I mean, you could have a page with a 504 timeout issue, but it's probably not the same issue as the 3-minute upload timeout (since the upload is not idle when the connection times out). Does the 504 come back after 3 minutes, or does it come back earlier on the page you tested with? If the 504 error response comes back before 3 minutes, it's unlikely to be related, as that's well before the worker timeout should trigger.
OK, good call. I reproduced the issue by visiting the slow URL, and it returned a 504, but when I throttled the RECAP extension itself (in Chrome) while attempting to upload a very big document, that threw a 502. Looking at 502 errors instead, there are more like 1,000/day. That seems like enough to qualify as a problem!
Did some testing by tweaking the docker compose to use gunicorn. Static assets don't seem to be set up right, but it seems to be enough to verify that basic django app functionality works as before.

```diff
diff --git a/docker/courtlistener/docker-compose.yml b/docker/courtlistener/docker-compose.yml
index ac3e7a8f6..248769d18 100644
--- a/docker/courtlistener/docker-compose.yml
+++ b/docker/courtlistener/docker-compose.yml
@@ -62,7 +62,7 @@ services:
cl-django:
container_name: cl-django
- image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-dev}
+ image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-prod}
depends_on:
- cl-postgresql
- cl-redis
diff --git a/docker/django/Dockerfile b/docker/django/Dockerfile
index eb0b9b961..503b46a48 100644
--- a/docker/django/Dockerfile
+++ b/docker/django/Dockerfile
@@ -111,12 +111,16 @@ CMD python /opt/courtlistener/manage.py migrate && \
#freelawproject/courtlistener:latest-web-prod
FROM python-base as web-prod
-CMD gunicorn cl_wsgi:application \
+CMD python /opt/courtlistener/manage.py migrate && \
+ python /opt/courtlistener/manage.py createcachetable && \
+ gunicorn cl_wsgi:application \
--chdir /opt/courtlistener/docker/django/wsgi-configs/ \
--user www-data \
--group www-data \
# Set high number of workers. Docs recommend 2-4× core count`
--workers ${NUM_WORKERS:-48} \
+ --worker-class gthread \
+ --threads 10 \
# Allow longer queries to solr.
--limit-request-line 6000 \
     # Reset each worker once in a while
```

I then rebuilt the image and ran the docker compose env normally:
Yep, that sounds a lot more plausible based on what I've seen.
Cool, thank you, this is reassuring. I did some more reading and I'm sold. Right now we use k8s with:
And:
Spread across a bunch of pod replicas. From my reading, it seems like we might as well keep the workers and CPU where they are, and add some threads. The PR doesn't have threads set up, and I haven't researched yet what that value should be, but it seems like we should do that at the same time, right?
One other comment: the gunicorn docs and defaults are not great. I get it if you're stuck behind a bad default from ages ago, but the docs seem like they should be louder about recommending gthread. It feels like only upside. I found this useful too, when thinking about workers vs. threads: https://stackoverflow.com/questions/38425620/gunicorn-workers-and-threads
Yeah, I think that makes the most sense. We probably want a good number of threads (I now set it to 10 threads per worker without changing the total workers in the PR), since we need to handle a potentially large number of concurrent uploads, which only need their own threads and not their own workers (uploads are IO-bound workloads until complete).
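For reference, this is roughly what that setup looks like expressed as a gunicorn.conf.py (a sketch using the values from this thread; the actual deployment passes these as CLI flags in the Dockerfile):

```python
# Sketch of the worker/thread configuration discussed above; numbers are the
# ones mentioned in this thread, not a tuned recommendation.
import os

workers = int(os.environ.get("NUM_WORKERS", 48))  # worker count left as-is
worker_class = "gthread"  # threaded workers: a slow upload occupies a thread,
                          # not a whole worker process
threads = 10              # per worker; rough concurrency = workers * threads
```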
Cool. I merged it. It'll auto-deploy next time a staff member's PR gets merged (unfortunately non-staff PRs don't have the needed permissions). Thank you for pushing on this and making sure it happened. I guess the next step is to switch all our microservices to use threads too... Anything else to do here though?
Is anything needed for that or should that get picked up automatically?
Well, having a retry mechanism in the extension would probably be a good idea, maybe with some sort of upload success/failure log that can be accessed by clicking on the extension icon. I think most users are unlikely to pull up the developer network/console logs, so they likely won't report issues unless there is an easy way for them to see that they have an issue.
I'm not opposed to an auto-retry mechanism, except for the thundering-herd problem it could create, I guess. I'm also +1 for some sort of upload log the user can see, though that has always seemed harder. If you want to pursue these ideas, let's get issues opened in freelawproject/recap so they can be hashed out. For the microservices, no, they'd need to be tweaked the same as the stuff here. I think the only two we really use are doctor and the disclosure extractor. I think we can close this, then, right?
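The usual way to get retries without the thundering-herd effect is exponential backoff with jitter. A minimal sketch of the pattern (shown in Python purely for illustration; the extension itself is JavaScript, and the upload callable here is hypothetical):

```python
import random
import time


def upload_with_retry(do_upload, max_attempts=5, base_delay=1.0):
    """Call do_upload(), retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return do_upload()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially
            # growing cap, so many clients don't all retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```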
Opened tracking issue #338.
I think so.
Just a heads up: this deployed yesterday, so I repeated [my test from above using "Slow 3G" throttling to upload about 30MB](#2640 (comment)). After about four minutes, it was still going, so I disabled throttling. We're now several hours later and it's still hanging, uncompleted. It seems like "Slow 3G" means about 52kBps, which would take about 10 minutes to do 30MB, so something went wrong. Meanwhile, I looked at our 502 logs in the ALB and they don't seem to have changed much. The line is flatter, but our 502s aren't fixed: (Units are 502s per hour.)
Yeah, from the looks of it there's an overall reliability improvement and no longer a hard 3-minute timeout, but there are still reliability issues for larger uploads on slower connections, so we must be hitting another issue. How does resource usage look? Is there any memory pressure that might be killing worker processes?
I'm not sure how to answer this, actually. Memory pressure on the nodes is fine. Individual pods could have memory pressure, but if they were running out, I think they'd get killed, right, and I'd see that?
Hard to say; often stuff just gets respawned automatically when it gets OOM-killed. What's the breakdown for how memory is provisioned?
Each gunicorn pod has 1G. The nodes are at about 49% memory utilization.
Our memory limit == our memory request. I just bumped it to 1.5G. We'll see if that helps. I think htop inside a container only shows the processes in that container; I don't see celery in there, for example. I did notice that when looking at the pods, they were all using all their memory, so that's more evidence of that issue. I'm surprised they need so much just to run gunicorn though, that's for sure. I wonder if some dependency is pulling in way too much code (we do have a lot of dependencies).
Well, it's up to 96% of that. Something ain't right here. Could it be
Well gunicorn itself shouldn't use much memory.
I'm thinking it might be a memory leak.
I don't think it would be that; however, if there was a preexisting memory leak, there might be some impact. Did increasing the memory reduce the worker OOM kills?
Yeah, something is wrong. I put it at 2GB per pod and all the pods are hovering around 90+% while killing workers. I'd be pretty surprised if we have a memory leak unless it was a gthread thing. We don't use threading or mess with memory otherwise.
This still seems strange to me. If we have three workers with three threads each, shouldn't that show up in htop as nine things? OTOH, maybe htop shows 16GB of memory because it's showing all processes/threads for all pods on the node (but that's still weird, because where are the celery pods, or the doctor pods?). Anyway, something is very wrong here now. I don't know if we had the memory issue before, but we should really see if we can get on top of it. Any ideas?
I think it's fairly unlikely gthread itself has a memory leak; it's fairly widely used, and that would be something that would get fixed quickly if it were an issue. I'm thinking we must have some dependencies using threading somewhere. We are definitely using memory, though; by that I mean our python objects use memory and can effectively cause memory leaks if references somehow get held when they should be dropped.
Yeah, I think there are more threads coming from somewhere else, guessing some library.
Well, memory is allocated per process; threads within the process share that memory.
Well, my ASGI PR may help since it does a bit of refactoring, but it's hard to say. The django ASGI implementation should avoid threading when possible in some request codepaths (it uses a threadpool when required, rather than all the time like with gthread). But if there's a memory leak, there's a good chance it would get hit there as well. I made an attempt at memory profiling the django application but wasn't able to get much of anything useful from that; it's apparently quite tricky to trace django memory leaks in general.
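For what it's worth, one stdlib-only way to approach that kind of profiling (a sketch of the general technique, not what was actually run here) is to diff tracemalloc snapshots taken before and after exercising the app:

```python
# Compare heap snapshots around a batch of requests to see which call sites
# are responsible for the growth.
import tracemalloc

tracemalloc.start(25)  # record up to 25 stack frames per allocation

before = tracemalloc.take_snapshot()
# ... exercise the app here, e.g. replay a few hundred uploads ...
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # largest allocation-size differences, by source line
```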
Hm, well, I guess the first step is to reproduce it. On a dev machine, we use the Django webserver. I suppose we should try using gunicorn with gthread and see if we can make memory go up?
I'm going to regret saying this, but I can't think of anything fancy that we're using that could cause something like this.
I mean, from what I could tell memory goes up with the normal django dev webserver as well, so it doesn't seem to be gthread-specific. Although the dev server apparently leaks memory somewhat intentionally, by storing things like the SQL query history in a buffer.
Ehhhh... with a codebase of this size and complexity, I think there are a lot of potential places we could be leaking memory even if we're not really doing something fancy/unusual. We're moving around and serializing/deserializing/parsing a lot of reasonably large requests/files in memory, which could in theory cause massive memory leaks if there's a bug somewhere that holds an object reference when it should be dropped.
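To make that "held reference" failure mode concrete, a generic illustration (not actual CourtListener code) of how it happens:

```python
# A module-level container outlives every request, so anything appended to it
# behaves like leaked memory inside a long-lived gunicorn worker.
_recent_payloads = []  # lives as long as the worker process does


def handle_upload(body: bytes) -> int:
    # Keeping a reference here means the payload can never be garbage
    # collected, even though the request finished long ago.
    _recent_payloads.append(body)
    return len(body)
```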
There could also just be a random bug in a dependency of ours. Once #2744 is merged I'll try to go through all our deps and see what's still on old versions with updates available, in case there's a fix.
gunicorn has a setting for serving only so many requests per worker. Maybe we set that to something lower and see if that helps.
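That's gunicorn's max_requests setting, usually paired with max_requests_jitter so workers don't all recycle at once. A sketch with placeholder numbers:

```python
# gunicorn.conf.py sketch (placeholder values): recycle each worker after a
# bounded number of requests so a slow leak can't accumulate forever.
max_requests = 1000        # restart a worker after roughly this many requests
max_requests_jitter = 100  # each worker adds random(0..100), staggering restarts
```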
Well, I upped the pods to 3GB and they seem stable without dropping workers. I still think something is off though. We're at about 9GB of memory to serve the site right now, just consumed by gunicorn. That feels like a lot. I wish I had the stats for this from before we switched to gthread.
I think that's more of a workaround than a proper fix.
From the profiling I was able to do, the baseline application memory usage does seem to be rather high. It seems some of the memory usage may be due to pulling in large dependencies/libraries like numpy/scipy.
Are we actually pulling those dependencies into gunicorn though? I think we have them installed, but are they actually imported?
From my understanding they get pulled into the worker processes, but not the parent gunicorn process.
Yes, it appears they are imported.
Are you sure? I always understood that if you don't do a python import, the module doesn't get loaded.
Yes, that's true; they are imported via a python import.
The memory profiler flags these imports as using relatively large amounts of memory:
Oh. Great. Can we stop that? I can't think of any features that use these except management commands?
I'm not sure; I listed the locations of some of the flagged imports above that get pulled in when running under gunicorn.
I inlined pandas (see: #2782), but I don't think there's much we can do about datasketch. Do we feel like this issue is more or less fixed for now? It seems like what's left is covered by other bugs at this point?
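For anyone following along, the "inlining" change looks roughly like this (a hypothetical function for illustration, not the actual diff in #2782):

```python
# Moving a heavy import inside the function means only the code path that
# actually needs it pays the memory cost, instead of every gunicorn worker
# paying it at module import time.

def export_report(rows):
    import pandas as pd  # loaded lazily, on first call

    return pd.DataFrame(rows).to_csv(index=False)
```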
I think the issues are still present, although things are improving.
Hm, if we spin out the problem with big packages sucking up memory, do we feel good about closing this?
Well, there are some secondary issues that probably should also be looked into, such as the lack of a retry mechanism and of proper server-side success confirmation to ensure transient network/application failures don't result in lost uploads. I think #2720 may help fix the lack of proper server-side upload confirmation (right now celery, I think, masks failures from the recap extension).
Hm, well, today's drama is that we've got a LOT of pods using all our memory, and the auto-scaler doesn't seem to know to create nodes if we run out of memory. CPU is fine and the site is up, but at least one pod has been trashed so far when the node ran out of memory. Another node is hovering at about 93% memory usage. We were at 60% or so before all this.
Hm, I tweaked the auto-scaler back to 70% and I set the CPU limit to 1000 instead of 300. That seemed to help, but the memory was still maxing out, so I bumped it to 3.5GB. Now the limit is:
These changes brought us down from 29 pods to 5 (for now).
Sorry, I'm crossing the beams a bit posting about performance stuff here, but responding to your other comments:
I think this is covered by freelawproject/recap#338. So I think that means we can close this one. I'd like to, since it's getting very long.
You mean CPU requests, not limit, right? We shouldn't be using the CPU limit.
Yeah, may as well close this for now, I guess.
Yes, sorry, the CPU limit is off!
Creating an issue to track the known timeout-related upload failure bugs, as they can be difficult to diagnose.
If anyone thinks they are affected by this bug, please add a comment, as most users are unlikely to report this sort of transient failure. If you have an upload speed that is not reliably above 2.22Mbit/s, you will definitely be affected by this issue for larger uploads, so please comment here as well if that is the case for you.
Various known issues preventing reliable uploads:
Due to the lack of a retry mechanism in the recap extension, transient connection and server-side failures can also cause permanent upload failures.
There is a known gunicorn misconfiguration issue that prevents large uploads over slower connections: gunicorn workers are killed prematurely after 3 minutes, regardless of whether there is an in-progress upload. This is because gunicorn sync workers are not suitable for production use without a caching reverse proxy in front.
Workers/upload connections occasionally appear to get killed before the 3-minute timeout. This, coupled with the lack of a retry mechanism in recap, causes the upload to fail entirely. It's less clear why this occurs, but it appears to coincide with error messages indicating the server is overloaded, so workers being killed is likely related.
There is no obvious way for users to properly identify/debug these issues, as they are primarily caused by a server-side problem and are often transient. The issue can be seen in the network log of the extension.
It appears some larger zip uploads which initially upload successfully then fail to fully process before the worker gets killed (for example, only half of the files in the zip process successfully on the first try).
Various potential fixes:
Put a reverse proxy in front of gunicorn that caches all uploads before sending them to gunicorn; this would ensure that slow uploads don't fail due to workers being killed.
Add an upload retry-on-failure mechanism to the recap extension, with appropriate HTTP response status code handling.
Reconfigure gunicorn to be less prone to workers being killed while uploads are active.
Additional recommendations to make future diagnostics easier:
Add server side monitoring which tracks upload timeouts and worker timeouts so that these issues aren't missed.
Add client side upload failure monitoring/reporting mechanisms to the recap extension so that upload failure rates can be tracked appropriately.