Production thumbnails memory leak #3353

Closed
sarayourfriend opened this issue Nov 14, 2023 · 3 comments
Labels
💻 aspect: code (Concerns the software code in the repository)
🛠 goal: fix (Bug fix)
🟧 priority: high (Stalls work on the project or its dependents)
🧱 stack: api (Related to the Django API)

Comments

@sarayourfriend
Collaborator

Description

We are seeing a memory leak pattern in the production thumbnails service since deploying the ASGI worker for thumbnails.

[Screenshot: thumbnails service memory usage climbing after the ASGI deployment]

After we reduced task resources, which caused a redeployment, memory usage has climbed to a maximum of 29% and appears to still be climbing.
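If we want to pin down where that growth is coming from, something along these lines could work as a diagnostic (a sketch only, not anything we run today; the module, the one-hour interval, and the log format are assumptions). It could be started from an ASGI lifespan/startup hook and only logs the call sites whose allocations grew the most between samples:

```python
# Sketch: periodic tracemalloc snapshots to see which allocations grow
# between samples. Hypothetical diagnostic, not part of the codebase;
# the interval and log format are assumptions.
import asyncio
import logging
import tracemalloc

logger = logging.getLogger(__name__)

SNAPSHOT_INTERVAL_SECONDS = 3600  # assumed sampling interval


async def log_memory_growth() -> None:
    tracemalloc.start()
    previous = tracemalloc.take_snapshot()
    while True:
        await asyncio.sleep(SNAPSHOT_INTERVAL_SECONDS)
        current = tracemalloc.take_snapshot()
        # Compare by source line to find the call sites that grew the most.
        stats = current.compare_to(previous, "lineno")
        for stat in stats[:10]:
            logger.info("memory growth: %s", stat)
        previous = current
```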

@sarayourfriend
Collaborator Author

sarayourfriend commented Nov 19, 2023

Update on thumbnails memory usage: it looks like there is a leak, but a very slow one. Memory jumps up in cliffs roughly every 1.2 days and is more or less stable the rest of the time. If it spikes, it comes back down to the level of the last big jump. The last three big jumps in maximum usage are each almost exactly 2% (~20 MB). Outside those, there are also some smaller 1% jumps, but no more than one a day in the last four days, since things really stabilised after the last deployment.

At this rate, if we deploy the thumbnails service once a week, it would be safe to lower it to half its current memory allocation.
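Rough back-of-envelope behind that (a sketch; the ~1 GiB allocation is inferred from 2% ≈ 20 MB and the baseline is read off the graph, so all inputs are approximations):

```python
# Back-of-envelope projection of weekly memory growth; all inputs are
# approximations read off the graphs above, not measured constants.
allocation_mb = 1024                      # 2% ≈ 20 MB implies roughly a 1 GiB task allocation
baseline_usage_mb = 0.29 * allocation_mb  # ~29% max usage ≈ 297 MB

big_jump_mb = 20    # ~2% jumps, roughly every 1.2 days
small_jump_mb = 10  # ~1% jumps, at most once a day

weekly_growth_mb = (7 / 1.2) * big_jump_mb + 7 * small_jump_mb  # ≈ 187 MB, worst case
projected_peak_mb = baseline_usage_mb + weekly_growth_mb        # ≈ 484 MB

# A halved allocation would be ~512 MB, so a weekly deploy cadence
# keeps the projected peak just under it.
print(round(projected_peak_mb), "MB projected peak vs", allocation_mb // 2, "MB halved allocation")
```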

Another interesting observation is that in the last 18 hours the maximum memory of the thumbnails service has actually been spiking down and then back up, staying within 0.5% of the highest it's reached. That's in contrast to the stable periods before, where the maximum was pretty much pinned at one value with a few outlier spikes up and then back down to the stable usage. Maybe that's a sign the service has reached its actual stable point? Or maybe there's another explanation. It's due for another 2% jump within the next 8 hours or so; maybe it'll happen before I log off for the day, after the team meeting in my evening. Curious to see if the pattern continues. For now, though, nothing bad is happening, so it's fine to let it keep rolling to see how it goes until we have changes to deploy again.

[Screenshot: thumbnails service maximum memory usage over time]

The regular API, which handles all other requests, looks stable as well, but with some interesting spikes:

[Screenshot: API service memory usage, stable with occasional spikes]

I think that is fine though. It is also not causing any issues.

@sarayourfriend added the 🟧 priority: high label and removed the 🟥 priority: critical label Nov 20, 2023
@sarayourfriend
Collaborator Author

Downgrading to high priority. This isn't causing any immediate issues in our thumbnails service. We'll work on getting more ASGI-fication changes out to see if we can improve this situation that way.

#3024 and #3020 in particular.

@sarayourfriend sarayourfriend moved this from 📋 Backlog to 🏗 In progress in Openverse Backlog Nov 20, 2023
@sarayourfriend sarayourfriend moved this from 🏗 In Progress to ⛔ Blocked in Openverse Backlog Dec 4, 2023
@sarayourfriend
Collaborator Author

I'm setting this to blocked because there's not much we can or want to do about this right now. The memory does increase steadily but so far it's never caused an issue and doesn't get much beyond 40% maximum usage, which is a perfectly comfortable place to be (we won't suddenly run out).

I actually think it would be better to close this, and to open a new issue if we identify it again after we merge the two API services back together.

@sarayourfriend closed this as not planned Dec 4, 2023
@openverse-bot openverse-bot moved this from ⛔ Blocked to 🗑 Discarded in Openverse Backlog Dec 4, 2023