SystemExit #658
Comments
I haven't set up a revert yet because I'm the only one around. The API is still serving the vast majority of requests successfully. The error numbers are high compared to our typical service health, but not quite disastrous yet.
I haven't been able to reproduce a memory leak locally, but we're working with a much smaller data set, and it is hard to automatically test the local instance with the load testing tools we have. I'm going to work on getting the CloudWatch agent installed on the API boxes so we can get memory monitoring on the production API and confirm whether memory really is part of the problem here.
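For illustration only (not the actual infrastructure change, which installs the CloudWatch agent): one hedged way to get a memory metric into CloudWatch from an EC2 box is to publish it with boto3 and psutil. The namespace and dimension names below are hypothetical.

```python
# Hedged illustration only: publish a memory-usage metric to CloudWatch
# with boto3 + psutil. The real work described above installs the
# CloudWatch agent; the namespace and dimensions here are hypothetical.
import boto3
import psutil


def publish_memory_metric(instance_id: str) -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Openverse/API",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "MemoryUsedPercent",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": psutil.virtual_memory().percent,
                "Unit": "Percent",
            }
        ],
    )
```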
Locally I can sometimes get the memory to increase steadily, in roughly half-megabyte increments for each search, but the growth fluctuates between 0.1 and 0.5 MB per search and sometimes doesn't happen at all. The increase sometimes occurs at the actual search execution itself (the call to the Elasticsearch API) and sometimes elsewhere. When I profile the entire search route, there are no discernible places where the memory usage increases.
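A minimal sketch of the kind of local measurement described above, comparing tracemalloc snapshots around a single search to see which lines allocate the most; the `run_search` callable is a stand-in for a request through the search route, not the actual API code.

```python
# Minimal sketch: compare tracemalloc snapshots around one search call to
# see which lines allocate the most. `run_search` is a stand-in for a
# request through the search route, not the actual Openverse API code.
import tracemalloc


def measure_search(run_search, query="cat", top=10):
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    run_search(query)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Show the lines whose allocations grew the most during the search.
    for stat in after.compare_to(before, "lineno")[:top]:
        print(stat)
```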
I'm not able to spend the time required each day to make much progress on this issue. I've got a draft PR in the private infrastructure repository that sets up CloudWatch on the API EC2 boxes, but it won't help us find the issue itself, just identify whether it really is a memory leak or something else. I would appreciate it if someone else from @WordPress/openverse could take on finishing that PR this week to help diagnose this issue.
This issue turned out to be caused by a (potentially) malicious actor making some unusual requests. I still think we have an issue, because the boxes shouldn't crash just from some big requests being sent, but @zackkrida was able to mitigate the problem in other ways, so I'm downgrading this to medium. We will still add the CloudWatch agent to the boxes so we get memory monitoring. The work to add it to the API EC2 boxes can be reused for our other EC2 boxes while we still have them.
We are continuing to see these SystemExit errors. Sentry suspects they are the result of d78abb9, but the transactions show that the issue is not exclusive to the thumbnails endpoint. In the most recent alert, this has happened 540+ times since May 18th, 2023. Sentry link: https://openverse.sentry.io/share/issue/6f89f6b794154e0c8f135bb55e681d19/
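One hedged note on why these reach Sentry as unhandled: SystemExit subclasses BaseException rather than Exception, so ordinary `except Exception` handlers in view or controller code won't catch it when a worker is told to exit mid-request. A tiny self-contained demonstration:

```python
# SystemExit subclasses BaseException, not Exception, so a typical
# `except Exception` block does not catch it; it propagates out of the
# request handler instead (roughly how it ends up surfacing in Sentry).
import sys


def handle_request(do_work):
    try:
        return do_work()
    except Exception:
        return "handled"  # SystemExit skips this branch


try:
    handle_request(lambda: sys.exit(1))
except SystemExit:
    print("SystemExit escaped the `except Exception` handler")
```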
We're working on https://github.com/WordPress/openverse-infrastructure/pull/545 to help debug whether this is a thumbnails issue.
We confirmed this is a thumbnails issue and are working on the Django ASGI milestone, with the #2843 project thread tracking this work. @AetherUnbound, do you think we can close this issue now that the investigation is over?
Great call @sarayourfriend, I believe so! |
Sentry links
https://sentry.io/share/issue/877df096da704c489499fec10762d95b/
https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/
https://sentry.io/share/issue/b02d92b6b1f64b7987a1d817ab298ce9/
The second link is associated with an already-resolved Sentry issue that has come back.
Description
The errors started, per the Sentry issue, a couple of days after the latest API deployment, which was this version: https://github.com/WordPress/openverse-api/releases/tag/v2.7.2
However, note that 2.7.1 was skipped for deployment, so this deployment also includes these changes: https://github.com/WordPress/openverse-api/releases/tag/v2.7.1
The errors started a few days after this was deployed, however, so I'm not sure whether they're actually related. If they are, it might be a dependency change, as the errors are being thrown in seemingly random places in the search_controller, not in any specific area related to the code changes we made in those releases.
To confirm: neither CPU usage nor traffic to the production API has changed. We do not currently have visibility into memory usage on the EC2 boxes (I may try to rectify this in the coming couple of weeks).
If the issue is related to the changes linked above, we could try a revert. This would be slightly more complicated than usual, as it would require reverting back to serving thumbnails via our local thumbnail service rather than Photon, which in turn requires a change to the infrastructure repository (a selective revert of the changes that removed the thumbnail service).
Setting as critical because there are a lot of 500s for this.
Reproduction