
SystemExit #658

Closed
sarayourfriend opened this issue Jan 22, 2023 · 9 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🐛 tooling: sentry Sentry issue

Comments

@sarayourfriend
Collaborator

sarayourfriend commented Jan 22, 2023

Sentry link

https://sentry.io/share/issue/877df096da704c489499fec10762d95b/
https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/
https://sentry.io/share/issue/b02d92b6b1f64b7987a1d817ab298ce9/

The second is associated with an already resolved issue but has come back.

Description

The errors recorded in the Sentry issues above started a couple of days after the latest API deployment, which was this version: https://github.com/WordPress/openverse-api/releases/tag/v2.7.2

However, note that 2.7.1 was skipped for deployment, so this deployment also includes these changes: https://github.com/WordPress/openverse-api/releases/tag/v2.7.1

The errors only started a few days after this was deployed, however, so I'm not sure whether they're actually related. If they are, it might be a dependency change, as the errors are being thrown in random-ish places in search_controller rather than in any specific area related to the code changes we made in those releases.
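For background (my own note, not from the Sentry reports): `SystemExit` is the exception raised by `sys.exit()`, and gunicorn-style workers call `sys.exit()` from their signal handlers when they are shut down or aborted after a timeout. Because the handler runs in the main thread, the exception surfaces wherever the request happens to be executing, which would explain tracebacks in scattered places like `search_controller`. A minimal, self-contained sketch of that behaviour (assuming sync workers of that kind; this is not Openverse code):

```python
# Demonstration: a signal handler that calls sys.exit() raises SystemExit
# inside whatever code the main thread is running at that moment.
import os
import signal
import sys
import threading
import time


def handle_term(signum, frame):
    sys.exit(1)  # gunicorn-style shutdown: raises SystemExit in the main thread


signal.signal(signal.SIGTERM, handle_term)

# Simulate the worker being told to shut down mid-request.
threading.Timer(0.1, lambda: os.kill(os.getpid(), signal.SIGTERM)).start()

try:
    while True:  # stand-in for request handling, e.g. a search query
        time.sleep(0.01)
except SystemExit:
    print("SystemExit raised mid-request by the shutdown handler")
    raise
```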

To confirm: neither CPU usage nor traffic to the production API has changed. We do not currently have visibility into memory usage on the EC2 boxes (I may try to rectify this over the next couple of weeks).

If the issue is related to the changes linked above, we could try a revert. This would be slightly more complicated: it would require reverting to serving thumbnails via our local thumbnail service rather than Photon, which in turn requires a change to the infrastructure repository (a selective revert of the changes that removed the thumbnail service).

Setting as critical because there are a lot of 500s for this.

Reproduction

@sarayourfriend sarayourfriend added 🟥 priority: critical Must be addressed ASAP 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🐛 tooling: sentry Sentry issue labels Jan 22, 2023
@sarayourfriend
Collaborator Author

I haven't set up a revert yet because I'm the only one around. The API is still serving the vast majority of requests successfully. The error numbers are high compared to our typical service health, but not disastrous quite yet.

@sarayourfriend
Collaborator Author

I haven't been able to reproduce a memory leak locally, but we're working with a much smaller data set, and it is hard to automatically test the local instance with the load testing tools we have.

I'm going to work on getting the CloudWatch agent installed on the API boxes so we can get memory monitoring on the production API and see whether memory really is part of the problem here.
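Purely as an illustration of the metric we're after (the real plan is the CloudWatch agent itself, not custom code), host memory usage could be published as a custom CloudWatch metric like this; the namespace and dimension names are made up:

```python
# Hypothetical sketch -- not the CloudWatch agent setup described above.
import boto3  # assumes AWS credentials and region are already configured
import psutil

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="Openverse/API",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "Service", "Value": "api"}],
            "Value": psutil.virtual_memory().percent,  # system-wide memory %
            "Unit": "Percent",
        }
    ],
)
```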

@sarayourfriend
Collaborator Author

Locally I can sometimes get memory to increase steadily, in roughly half-megabyte increments per search, but the growth fluctuates between about 0.1 and 0.5 MB per search and sometimes doesn't happen at all. The increase sometimes occurs at the actual search execution (the call to the Elasticsearch API) and sometimes elsewhere. When I profile the entire search route, there are no discernible places where the memory usage increases.
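For anyone who wants to repeat this kind of per-search measurement, here is a minimal sketch using the standard library's `tracemalloc` (hypothetical helper, not necessarily the exact tooling used above):

```python
# Measure how much traced memory a single search retains, and (commented out)
# how to find where growth happens by comparing snapshots around a batch.
import tracemalloc

tracemalloc.start()


def measure_search(search_fn, *args, **kwargs):
    """Run one search and report how much traced memory it retained."""
    before, _ = tracemalloc.get_traced_memory()
    result = search_fn(*args, **kwargs)
    after, _ = tracemalloc.get_traced_memory()
    print(f"retained ~{(after - before) / 1024:.1f} KiB after this search")
    return result


# snapshot_before = tracemalloc.take_snapshot()
# ...run many searches...
# snapshot_after = tracemalloc.take_snapshot()
# for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
#     print(stat)
```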

@sarayourfriend
Collaborator Author

I'm not able to spend the time required each day to make much progress on this issue. I've got a draft PR in the private infrastructure repository with CloudWatch set up on the API EC2 boxes, but it's not going to help us find the issue, just identify whether it really is a memory leak or something else.

I would appreciate it if someone else from @WordPress/openverse would take on the role of finishing that PR this week to help diagnose this issue.

@sarayourfriend
Collaborator Author

This issue turned out to be caused by a (potentially) malicious actor making some weird requests. I still think we have an issue, because the boxes shouldn't crash just from some big requests being sent, but @zackkrida was able to mitigate the issue in other ways, so I'm downgrading this to medium. We will still add the CloudWatch agent to the boxes so that we get memory monitoring. The work to add it to the API EC2 boxes can be reused for our other EC2 boxes while we still have them.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟥 priority: critical Must be addressed ASAP labels Jan 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023
@AetherUnbound AetherUnbound added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels May 15, 2023
@AetherUnbound
Collaborator

AetherUnbound commented Jun 23, 2023

We are continuing to see these SystemExit errors. Sentry suspects they are the result of d78abb9, but the transactions show that the issue is not exclusive to the thumbnails endpoint:

[Screenshot of the Sentry transactions for these SystemExit events]

In the most recent alert, this has happened 540+ times since May 18th, 2023.

Sentry link: https://openverse.sentry.io/share/issue/6f89f6b794154e0c8f135bb55e681d19/

@sarayourfriend
Collaborator Author

We're working on https://github.com/WordPress/openverse-infrastructure/pull/545 to help debug whether this is a thumbnails issue.

@sarayourfriend
Collaborator Author

We confirmed this is a thumbnails issue and are working on the Django ASGI milestone, with the #2843 project thread tracking that work.
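For context, my own gloss on why ASGI addresses this: a slow upstream fetch in the thumbnail route presumably ties up a whole sync worker until it is killed, which is where the SystemExit comes from; under ASGI the fetch can be awaited instead. A hypothetical sketch of an async thumbnail proxy view (not the actual Openverse implementation; `httpx` and the URL handling are assumptions):

```python
# Hypothetical async thumbnail proxy under ASGI: the upstream image fetch is
# awaited, so a slow upstream no longer blocks the worker for the duration.
import httpx
from django.http import HttpResponse


async def thumbnail(request, image_url: str):
    async with httpx.AsyncClient(timeout=10.0) as client:
        upstream = await client.get(image_url)
    return HttpResponse(
        upstream.content,
        content_type=upstream.headers.get("content-type", "image/jpeg"),
    )
```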

@AetherUnbound Do you think we can close this issue now that the investigation is over?

@AetherUnbound
Collaborator

Great call @sarayourfriend, I believe so!
