
Various gunicorn/timeout related upload failure bugs. #2640

Closed
ttys0dev opened this issue Apr 12, 2023 · 66 comments

Comments

@ttys0dev
Contributor

Creating an issue to track the known timeout related upload failure bugs as they can be difficult to diagnose.

If you think you are affected by this bug, please add a comment; most users are unlikely to report these sorts of transient failures. If your upload speed is not reliably above 2.22 Mbit/s, you will definitely be affected by this issue for larger uploads, so please comment here as well if that is the case for you.

Various known issues preventing reliable uploads:

  • Due to the lack of a retry mechanism in the RECAP extension, transient connection and server-side failures can cause permanent upload failures.

  • There is a known gunicorn misconfiguration that prevents large uploads over slower connections: gunicorn workers are killed prematurely after 3 minutes, regardless of whether an upload is in progress. This is because gunicorn sync workers are not suitable for production use without a caching reverse proxy in front.

  • Workers/upload connections occasionally appear to get killed before the 3 minute timeout; combined with the lack of a retry mechanism in RECAP, this causes the upload to fail entirely. It's less clear why this occurs, but it appears to coincide with error messages indicating the server is overloaded, so workers being killed is likely related.

  • There is no obvious way for users to properly identify/debug these issues, as they are primarily caused on the server side and are often transient. The failures can be seen in the network log of the extension.

  • It appears some larger zip uploads that initially upload successfully fail to fully process before the worker gets killed (for example, only half of the files in the zip are processed successfully on the first try).

Various potential fixes:

  • Put a reverse proxy in front of gunicorn that buffers (caches) each upload before passing it on to gunicorn; this would ensure that slow uploads don't fail due to workers being killed.

  • Add an upload retry-on-failure mechanism to the RECAP extension, with appropriate HTTP response status code handling.

  • Reconfigure gunicorn to be less prone to killing workers while uploads are active (a minimal config sketch is included after these lists).

Additional recommendations to make future diagnostics easier:

  • Add server side monitoring which tracks upload timeouts and worker timeouts so that these issues aren't missed.

  • Add client side upload failure monitoring/reporting mechanisms to the recap extension so that upload failure rates can be tracked appropriately.
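As a rough illustration of the gunicorn reconfiguration idea above, a threaded-worker setup could look something like the following. This is a minimal sketch with illustrative values (gunicorn config files are plain Python), not the project's actual settings.

# gunicorn.conf.py -- sketch only; values are illustrative.
import multiprocessing

# Rule of thumb from the gunicorn docs: 2-4 workers per core.
workers = multiprocessing.cpu_count() * 2 + 1

# Threaded workers: the worker's main thread keeps heartbeating while request
# threads handle slow uploads, so long transfers no longer trip the timeout.
worker_class = "gthread"
threads = 10

# With gthread, timeout acts as a liveness check on the worker rather than a
# hard per-request limit, unlike the sync-worker behavior described above.
timeout = 180
graceful_timeout = 90
keepalive = 5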

@troglodite2
Contributor

Since you are using k8s, have you considered using nginx-ingress? If so, nginx will do the caching you are looking for. I have done it once, in the past, for a bare-metal server.

@mlissner
Member

Thanks for your persistence, @ttys0dev. I think AWS's application load balancer tracks timeouts. I'll check when I get a sec, and do more research on switching the gunicorn worker model.

Many of the fibers of my being resist adding another layer to our architecture, if we can avoid it, @troglodite2 :)

@ttys0dev
Contributor Author

Since you are using k8s, have you considered using nginx-ingress? If so, nginx will do the caching you are looking for.

Agree that's a good option, nginx has good performance and lots of flexibility for handling this sort of thing.

I'll check when I get a sec, and do more research on switching the gunicorn worker model.

This is a good stopgap but probably isn't the best long-term option; an nginx cache is going to have much better performance overall, since it doesn't use the relatively inefficient thread-per-connection model that non-asyncio Django uses for handling I/O.

Many of the fibers of my being resist adding another layer to our architecture, if we can avoid it, @troglodite2 :)

One potential option is to migrate the Django application to an asyncio request model, which is conceptually similar to how something like nginx processes requests. This is not a trivial change, however, as you would effectively need to migrate from WSGI to ASGI and likely make other changes to avoid the synchronous codepaths.
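For reference, the ASGI entry point itself is small; most of the work is in auditing the synchronous codepaths. A minimal sketch, where the cl_asgi module name and settings path are hypothetical, mirroring the existing cl_wsgi layout:

# cl_asgi.py -- hypothetical ASGI counterpart to the existing cl_wsgi config.
import os

from django.core.asgi import get_asgi_application

# Adjust to the project's real settings module; "cl.settings" is a guess.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "cl.settings")

application = get_asgi_application()

# This would then be served by an ASGI-capable worker instead of the WSGI
# sync/gthread workers, e.g.:
#   gunicorn cl_asgi:application -k uvicorn.workers.UvicornWorker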

@mlissner
Member

So I've been hesitant to fix this issue for a few reasons:

  1. I don't have evidence from the field that it's a real problem. For example, when I look at our 504 errors from the ALB, there are about nine of them per week, unless the server is overwhelmed (but that's another story). Nine isn't great, and I'd like for it to not be an issue, but it's not a lot, I'll say that much.

  2. It's a risky change that's hard to test in advance. The Gunicorn docker build isn't normally part of our docker compose, so testing it takes a bit of work.

  3. And if we do make a mistake, it's likely to crash the front end, which is scary.

So I guess I'm sort of +0 on this. It seems like a good idea, and I'm pretty convinced y'all know what you're talking about, but I'd like to either have instructions to test it myself or to have somebody say they built the image with the tweak and it worked fine.

I hope that makes sense and explains my trepidation here.

@mlissner mlissner moved this from 🆕 New to ✅ Done in @mlissner's backlog Apr 18, 2023
@ttys0dev
Contributor Author

  1. For example, when I look at our 504 errors from the ALB, there are about nine of them per week, unless the server is overwhelmed (but that's another story). Nine isn't great, and I'd like for it to not be an issue, but it's not a lot, I'll say that much.

From my reading of the docs, a 504 error is an idle connection timeout. That sort of timeout is, as I understand it, unrelated to the gunicorn 3 minute worker timeout issue, since the timeout there is on a non-idle connection (i.e. the upload doesn't time out because the connection is idle, but because the worker timeout terminates a non-idle connection).

  1. It's a risky change that's hard to test in advance. The Gunicorn docker build isn't normally part of our docker compose, so testing it takes a bit of work.

How are changes to the gunicorn docker build normally tested?

I'd like to either have instructions to test it myself or to have somebody say they built the image with the tweak and it worked fine.

I'll try and test things out locally.

@mlissner
Member

From my reading of the docs a 504 error is an idle connection timeout error

I could be wrong here, but I wasn't sure, so I checked what status code it was by going to a page on CL that I know times out, and it returned a 504?

How are changes to the gunicorn docker build normally tested?

Just by hand. They're rarely changed.

I'll try and test things out locally.

That'll be great. If you can say how you tested, that'd be helpful to future reviews too.

@ttys0dev
Contributor Author

I checked what status code it was by going to a page on CL that I know times out, and it returned a 504?

I mean, you could have a page with a 504 timeout issue, but it's probably not the same issue as the 3 minute upload timeout (since the upload is not idle when the connection times out). Does the 504 come back after 3 minutes, or does it come back earlier on the page you tested with? If the 504 response comes back before 3 minutes, it's unlikely to be related, as that's well before the worker timeout should trigger.

@mlissner
Member

OK, good call. I reproduced the issue by visiting the slow URL, and it returned a 504, but I throttled the RECAP extension itself (in Chrome) while attempting to upload a very big document, and that threw a 502.

So looking at 502 errors, there's more like 1,000/day. That seems like enough to qualify as a problem!

@ttys0dev
Contributor Author

If you can say how you tested, that'd be helpful to future reviews too.

Did some testing by tweaking the docker compose to use gunicorn. Static assets don't seem to be set up right, but it seems to be enough to verify that basic Django app functionality works as before.

diff --git a/docker/courtlistener/docker-compose.yml b/docker/courtlistener/docker-compose.yml
index ac3e7a8f6..248769d18 100644
--- a/docker/courtlistener/docker-compose.yml
+++ b/docker/courtlistener/docker-compose.yml
@@ -62,7 +62,7 @@ services:
 
   cl-django:
     container_name: cl-django
-    image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-dev}
+    image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-prod}
     depends_on:
       - cl-postgresql
       - cl-redis
diff --git a/docker/django/Dockerfile b/docker/django/Dockerfile
index eb0b9b961..503b46a48 100644
--- a/docker/django/Dockerfile
+++ b/docker/django/Dockerfile
@@ -111,12 +111,16 @@ CMD python /opt/courtlistener/manage.py migrate && \
 #freelawproject/courtlistener:latest-web-prod
 FROM python-base as web-prod
 
-CMD gunicorn cl_wsgi:application \
+CMD python /opt/courtlistener/manage.py migrate && \
+    python /opt/courtlistener/manage.py createcachetable && \
+    gunicorn cl_wsgi:application \
     --chdir /opt/courtlistener/docker/django/wsgi-configs/ \
     --user www-data \
     --group www-data \
     # Set high number of workers. Docs recommend 2-4× core count`
     --workers ${NUM_WORKERS:-48} \
+    --worker-class gthread \
+    --threads 10 \
     # Allow longer queries to solr.
     --limit-request-line 6000 \
     # Reset each worker once in a while

I then rebuilt the image and ran the docker compose env normally:

make image --file docker/django/Makefile -e VERSION=$(git rev-parse --short HEAD)
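For anyone wanting to reproduce the slow-upload case against a local container, something along these lines should work (this was not part of my test above; the endpoint, port, and file name are illustrative):

# throttle_upload.py -- simulate a slow client uploading a large file.
import time

import requests

def throttled_body(path, chunk_size=16 * 1024, delay=0.5):
    """Yield the file in small chunks, sleeping between them (~32 kB/s here)."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
            time.sleep(delay)

# requests sends generator bodies with chunked transfer encoding.
resp = requests.post(
    "http://localhost:8000/api/rest/v3/recap/",  # illustrative endpoint
    data=throttled_body("large_test_upload.zip"),
    headers={"Content-Type": "application/zip"},
    timeout=600,
)
print(resp.status_code, resp.text[:200])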

So looking at 502 errors, there's more like 1,000/day. That seems like enough to qualify as a problem!

Yep, that sounds a lot more plausible based on what I've seen.

@mlissner
Member

Cool, thank you, this is re-assuring. I did some more reading and I'm sold. Right now we use k8s with:

          # Gunicorn recommends 2-4 workers per CPU so this number is tied
          # directly to the CPU requests key, below.
          - name: NUM_WORKERS
            value: "3"

And:

          resources:
            requests:
              cpu: 300m
            limits:
              cpu: 600m

Spread across a bunch of pod replicas. From my reading, it seems like we might as well keep the workers and CPU where they are and add some threads. The PR doesn't have threads set up, and I haven't researched yet what that value should be, but it seems like we should do that at the same time, right?

@mlissner
Member

One other comment: the gunicorn docs and defaults are not great. I get it if you're stuck with a bad default from ages ago, but the docs seem like they should be louder about recommending gthread. It feels like only upside.

I found this article useful too, when thinking about workers vs. threads: https://stackoverflow.com/questions/38425620/gunicorn-workers-and-threads

@ttys0dev
Contributor Author

From my reading, it seems like we might as well keep the workers and CPU where they are and add some threads. The PR doesn't have threads set up, and I haven't researched yet what that value should be, but it seems like we should do that at the same time, right?

Yeah, I think that makes the most sense. We probably want a good number of threads (I've now set it to 10 threads per worker in the PR, without changing the total workers), since we need to handle a potentially large number of concurrent uploads, which only need their own threads rather than their own workers (uploads are I/O-bound workloads until complete).
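A quick back-of-the-envelope on what that buys (the replica count below is made up; all we know is that the pods are spread across a bunch of replicas):

workers_per_pod = 3      # NUM_WORKERS from the k8s config above
threads_per_worker = 10  # value proposed in the PR

per_pod_concurrency = workers_per_pod * threads_per_worker  # 30 in-flight requests per pod
replicas = 10                                               # hypothetical
fleet_concurrency = per_pod_concurrency * replicas          # 300 across the deployment
print(per_pod_concurrency, fleet_concurrency)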

@mlissner
Member

Cool. I merged it. It'll auto-deploy next time a staff member's PR gets merged (unfortunately non-staff PRs don't have the needed permissions).

Thank you for pushing on this and making sure it happened. I guess the next step is to switch all our microservices to use threads too....

Anything else to do here though?

@ttys0dev
Contributor Author

I guess the next step is to switch all our microservices to use threads too....

Is anything needed for that or should that get picked up automatically?

Anything else to do here though?

Well, having a retry mechanism in the extension would probably be a good idea, maybe with some sort of upload success/failure log that can be accessed by clicking on the extension icon. I think most users are unlikely to pull up the developer network/console logs, so they likely won't report issues unless there is an easy way for them to see that they have one.
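The retry logic itself could be fairly simple. Here's a sketch of the idea (the extension is JavaScript; Python is used here only to illustrate the flow, and the URL/payload handling are placeholders): retry transient network errors and 5xx responses with exponential backoff plus jitter, so a server hiccup doesn't turn into a synchronized retry storm.

import random
import time

import requests

RETRYABLE_STATUSES = {502, 503, 504}

def upload_with_retry(url, files, attempts=5, base_delay=2.0):
    """POST an upload, retrying transient network and 5xx failures."""
    for attempt in range(attempts):
        try:
            resp = requests.post(url, files=files, timeout=300)
            if resp.status_code not in RETRYABLE_STATUSES:
                return resp  # success, or a non-retryable error worth surfacing
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network failure; fall through and retry
        # Exponential backoff with jitter to avoid thundering-herd retries.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError(f"upload failed after {attempts} attempts")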

@mlissner
Member

I'm not opposed to an auto-retry mechanism, except for the thundering herds problem it could create, I guess. I'm also +1 for some sort of upload log the user can see, though that has always seemed harder. If you want to pursue these ideas, let's get issues opened in freelawproject/recap so they can be hashed out.

For the microservices, no, they'd need to be tweaked the same as the stuff here. I think the only two we really use are doctor and the disclosure extractor.

I think we can close this, then, right?

@ttys0dev
Contributor Author

If you want to pursue these ideas, let's get issues opened in freelawproject/recap so they can be hashed out.

Opened tracking issue #338.

I think we can close this, then, right?

I think so.

@mlissner
Member

Just a heads up, this deployed yesterday, so I repeated my test from above, using "Slow 3G" throttling to upload about 30MB (#2640 (comment)). After about four minutes it was still going, so I disabled throttling. We're now several hours later and it's still hanging, uncompleted.

It seems like "Slow 3G" means about 52 kB/s, which would take about 10 minutes to do 30MB, so something went wrong.

Meanwhile, I looked at our 502 logs in the ALB and they don't seem to have changed much. The line is flatter, but it's not like our 502s are fixed; they're still present:

[chart: 502 errors from the ALB, before and after the deploy]

(Units are 502's per hour.)

@ttys0dev
Contributor Author

Meanwhile, I looked at our 502 logs in the ALB and they don't seem to have changed much. The line is flatter, but it's not like our 502s are fixed.

Yeah, from the looks of it there's an overall reliability improvement and no longer a hard 3 minute timeout...but there are still reliability issues for larger uploads on slower connections...so we must be hitting another issue.

How does resource usage look? Is there any memory pressure that might be killing worker processes?

@ttys0dev ttys0dev reopened this Apr 19, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to 🆕 New in @mlissner's backlog Apr 19, 2023
@mlissner
Member

How does resource usage look? Is there any memory pressure that might be killing worker processes?

I'm not sure how to answer this, actually. Memory pressure on the nodes is fine. Individual pods could have memory pressure, but if they were running out, I think they'd get killed, right, and I'd see that?

@ttys0dev
Contributor Author

Individual pods could have memory pressure, but if they were running out, I think they'd get killed, right, and I'd see that?

Hard to say; often stuff just gets respawned automatically after being OOM-killed. What's the breakdown for how memory is provisioned?

@mlissner
Member

Each gunicorn pod has a 1G memory limit: memory: 1G.

The nodes are about 49% memory utilization.

@mlissner mlissner moved this from 🆕 New to 🏗 In progress in @mlissner's backlog Apr 20, 2023
@mlissner mlissner moved this from 🏗 In progress to ✅ Done in @mlissner's backlog Apr 25, 2023
@mlissner
Member

Our memory limit == our memory request. I just bumped it to 1.5G. We'll see if that helps. I think htop inside a container only shows the processes in that container; I don't see celery in there, for example.

I did notice that when looking at the pods, they were all using all their memory, so that's more evidence of that issue. I'm surprised they need so much just to run gunicorn though, that's for sure. I wonder if some dependency is pulling in way too much code (we do have a lot of dependencies).

@mlissner
Member

Within a few minutes of writing that, some pods were already hitting their 1.5G limit:

[screenshot: pod memory usage at the 1.5G limit]

So I gave 'em 2G! Let's see if that helps.

@mlissner
Member

Well, it's up to 96% of that. Something ain't right here. Could it be gthread eating memory somehow? Three workers with a few threads shouldn't be going nuts like this, right?

@ttys0dev
Contributor Author

I'm surprised they need so much just to run gunicorn though, that's for sure.

Well gunicorn itself shouldn't use much memory.

I wonder if some dependency is pulling in way too much code (we do have a lot of dependencies).

Might be a memory leak I'm thinking.

Could it be gthread eating memory somehow?

I don't think it would be that; however, if there was a preexisting memory leak, there might be some impact.

Did increasing the memory reduce the worker OOM kills?

@mlissner
Member

Might be a memory leak I'm thinking.

Yeah, something is wrong. I put it at 2GB per pod and all the pods are hovering around 90+% while killing workers.

I'd be pretty surprised if we have a memory leak unless it was a gthread thing. We don't use threading or mess with memory otherwise.

Oh, it looks like htop defaults to showing both threads and processes, so I think that's normal; it looks like there are only three actual gunicorn processes (the ones colored white).

This still seems strange to me. If we have three workers with three threads each, shouldn't that show up in htop as nine things? OTOH, maybe the reason it shows 16GB of memory in htop is because it's showing all processes/threads for all pods on the node (but it's still weird because where are the celery pods, or the doctor pods?)

Anyway, something is very wrong here now. I don't know if we had the memory issue before, but we should really see if we can get on top of it. Any ideas?

@ttys0dev
Contributor Author

I'd be pretty surprised if we have a memory leak unless it was a gthread thing. We don't use threading or mess with memory otherwise.

I think it's fairly unlikely that gthread itself has a memory leak; it's fairly widely used, and a leak would be something that would get fixed quickly if it were an issue. I'm thinking we must have some dependencies using threading somewhere. We are definitely using memory, by which I mean our Python objects use memory and can effectively cause memory leaks if references somehow get held when they should be dropped.

This still seems strange to me. If we have three workers with three threads each, shouldn't that show up in htop as nine things?

Yeah, I think there are more threads coming from somewhere else, probably some library.

OTOH, maybe the reason it shows 16GB of memory in htop is because it's showing all processes/threads for all pods on the node (but it's still weird because where are the celery pods, or the doctor pods?)

Well, memory is allocated per process; threads within a process share its memory.

Anyway, something is very wrong here now. I don't know if we had the memory issue before, but we should really see if we can get on top of it. Any ideas?

Well, my ASGI PR may help since it does a bit of refactoring, but it's hard to say. The Django ASGI implementation should avoid threading when possible in some request codepaths (it uses a threadpool only when required, rather than all the time as with gthread). But if there's a memory leak, there's a good chance it would get hit there as well.

I made an attempt at memory profiling the Django application but wasn't able to get much of anything useful from it; it's apparently quite tricky to trace Django memory leaks in general.
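One relatively low-effort option, if we want to try again, is the stdlib tracemalloc module: snapshot the heap at startup and periodically log which call sites keep growing. A sketch (not the profiling setup I actually used):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
_baseline = tracemalloc.take_snapshot()

def log_heap_growth(top_n=10):
    """Print the call sites whose allocations have grown the most since startup."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(_baseline, "lineno")[:top_n]:
        print(f"{stat.size_diff / 1024:+9.1f} KiB  {stat.traceback.format()[-1].strip()}")

# Call log_heap_growth() from a periodic hook, e.g. every N requests in a middleware.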

@mlissner
Member

Hm, well, I guess the first step is to reproduce it. On a dev machine, we use the Django webserver. I suppose we should try using gunicorn with gthread and see if we can make memory go up?

I'm thinking we must have some dependencies using threading somewhere. We are definitely using memory, by that I mean our python objects use memory and can effectively cause memory leaks if references somehow get held when they should be dropped.

I'm going to regret saying this, but I can't think of anything fancy that we're using that could cause something like this.

@ttys0dev
Contributor Author

Hm, well, I guess the first step is to reproduce it. On a dev machine, we use the Django webserver. I suppose we should try using gunicorn with gthread and see if we can make memory go up?

I mean, from what I could tell memory goes up with the normal Django dev webserver as well, so it doesn't seem to be gthread-specific. Although the dev server does, in a sense, intentionally leak memory by storing things like the SQL query history in a buffer.
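For reference, the "intentional leak" in question is the per-connection query log Django keeps whenever DEBUG=True; it's easy to see (and clear) from a Django shell:

from django.db import connection, reset_queries

# With DEBUG=True, every executed query is appended to connection.queries,
# so a long-running dev server grows in memory even without a real leak.
print(len(connection.queries))
reset_queries()                 # clears the per-connection query log
print(len(connection.queries))  # back to 0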

I'm going to regret saying this, but I can't think of anything fancy that we're using that could cause something like this.

Ehhhh...with a codebase of this size and complexity...I think there are a lot of potential places we could be leaking memory even if we're not really doing anything fancy/unusual. We're moving around and serializing/deserializing/parsing a lot of reasonably large requests/files in memory, which could in theory cause massive memory leaks if there's a bug somewhere that holds an object reference when it should be dropped.

@ttys0dev
Contributor Author

I'm thinking we must have some dependencies using threading somewhere.

There could also just be a random bug in a dependency of ours. Once #2744 is merged, I'll try to go through all our deps and see which are still on old versions with updates available, in case there's a fix.

@mlissner
Member

gunicorn has a setting for only serving so many requests per worker before recycling it. Maybe we set that to something lower and see if that helps.
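For reference, the setting is max_requests (the "Reset each worker once in a while" comment in the Dockerfile above appears to be for the same flag). In a gunicorn config file it would look roughly like this, with jitter so all the workers don't recycle at once; the values here are illustrative:

max_requests = 1000        # recycle a worker after this many requests
max_requests_jitter = 100  # randomize per worker so restarts are staggered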

@mlissner
Member

Well, I upped the pods to 3GB and they seem stable without dropping workers. I still think something is off though. We're at about 9GB of memory to serve the site right now, just consumed by gunicorn. That feels like a lot. I wish I had the stats for this from before we switched to gthread.

@ttys0dev
Contributor Author

gunicorn has a setting for only serving so many requests per worker before recycling it. Maybe we set that to something lower and see if that helps.

I think that's more of a workaround rather than a proper fix.

Well, I upped the pods to 3GB and they seem stable without dropping workers. I still think something is off though. We're at about 9GB of memory to serve the site right now, just consumed by gunicorn. That feels like a lot. I wish I had the stats for this from before we switched to gthread.

From the profiling I was able to do, the baseline application memory usage does seem to be rather high; some of it may be due to pulling in large dependencies/libraries like numpy/scipy.

@mlissner
Member

Are we actually pulling those dependencies into gunicorn though? I think we have them installed, but are they actually imported?

@ttys0dev
Contributor Author

Are we actually pulling those dependencies into gunicorn though?

They get pulled into the worker processes from my understanding but not the parent gunicorn process.

I think we have them installed, but are they actually imported?

Yes, it appears they are imported.

@mlissner
Member

Are you sure? I always understood that if you don't do a python import, you can install whatever you want without it being in the interpreter's memory?

@ttys0dev
Contributor Author

I always understood that if you don't do a python import, you can install whatever you want without it being in the interpreter's memory?

Yes, that's true; they are imported via a Python import in this case, however.
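One way to double-check which heavyweight packages actually end up in a worker's memory is to load the app the way a worker would and inspect sys.modules. A sketch (the settings module name is a guess; note that view modules may only import on the first request, so this can under-count):

import os
import sys

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "cl.settings")  # adjust to the real module

import django
django.setup()  # loads settings and app/model modules, roughly what a worker does at boot

for name in ("numpy", "scipy", "pandas", "datasketch"):
    print(f"{name:12s}", "imported" if name in sys.modules else "not imported")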

@ttys0dev
Contributor Author

The memory profiler flags these imports as using relatively large amounts of memory:

  • pandas
  • datasketch > scipy
  • datasketch > numpy

@mlissner
Member

they are imported via a python import in this case however.

Oh. Great. Can we stop that? I can't think of any features that use these except management commands?

@ttys0dev
Contributor Author

Can we stop that? I can't think of any features that use these except management commands?

I'm not sure; I listed the locations of some of the flagged imports above, which get pulled in when running under gunicorn.

@mlissner
Member

I inlined pandas (see: #2782), but I don't think there's much we can do about datasketch.
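The pattern is just deferring the import into the code path that needs it, so workers don't pay the memory cost at boot. Roughly (the function below is made up for illustration, not the actual change in #2782):

def export_report_as_csv(queryset):
    # Deferred import: pandas is only loaded if this feature actually runs,
    # instead of at module import time in every gunicorn worker.
    import pandas as pd

    df = pd.DataFrame.from_records(queryset.values())
    return df.to_csv(index=False)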

Do we feel like this issue is more or less fixed for now? Seems like it's repeated across other bugs at this point?

@ttys0dev
Contributor Author

Do we feel like this issue is more or less fixed for now?

I think the issues are still present, although things are improving.

@mlissner
Member

mlissner commented Jun 6, 2023

Hm, if we spin out the problem with big packages sucking up memory, do we feel good about closing this?

@ttys0dev
Contributor Author

ttys0dev commented Jun 6, 2023

Hm, if we spin out the problem with big packages sucking up memory, do we feel good about closing this?

Well, there are some secondary issues, such as the lack of a retry mechanism and of proper server-side success confirmation to ensure transient network/application failures don't result in lost uploads, that should probably also be looked into.

I think #2720 may help fix the lack of proper server-side upload confirmation (right now Celery, I think, masks failures from the RECAP extension).

@mlissner
Member

mlissner commented Jun 6, 2023

Hm, well, today's drama is that we've got a LOT of pods using all our memory, and the auto-scaler doesn't seem to know to create nodes if we run out of memory.

CPU is fine and the site is up, but at least one pod has been trashed so far when the node ran out of memory. Another node is hovering at about 93% of memory usage.

We were at 60% or so before all this.

@mlissner
Member

mlissner commented Jun 6, 2023

Hm, I tweaked the auto-scaler back to 70% and I set the CPU limit to 1000 instead of 300. That seemed to help, but the memory was still maxing out, so I bumped it to 3.5GB. Now the limit is:

          resources:
            requests:
              cpu: 1000m
            limits:
              # See: https://home.robusta.dev/blog/stop-using-cpu-limits
              #cpu: 600m
              memory: 3.5G

These changes brought us down from 29 pods to 5 (for now).

@mlissner
Member

mlissner commented Jun 6, 2023

Sorry, I'm crossing the beams a bit posting about performance stuff here, but responding to your other comments:

Well, there are some secondary issues, such as the lack of a retry mechanism and of proper server-side success confirmation to ensure transient network/application failures don't result in lost uploads, that should probably also be looked into.

I think this is that: freelawproject/recap#338

So I think that means we can close this one. I'd like to since it's getting very long.

@ttys0dev
Contributor Author

ttys0dev commented Jun 7, 2023

I tweaked the auto-scaler back to 70% and I set the CPU limit to 1000 instead of 300.

You mean cpu requests not limit right? We shouldn't be using the CPU limit.

So I think that means we can close this one. I'd like to since it's getting very long.

Yeah, may as well close this for now I guess.

@ttys0dev ttys0dev closed this as completed Jun 7, 2023
@mlissner
Member

mlissner commented Jun 7, 2023

You mean cpu requests not limit right? We shouldn't be using the CPU limit.

yes, sorry, the CPU limit is off!
