SystemExit #658
Comments
I haven't set up a revert yet because I'm the only one around. The API is still serving the vast majority of requests successfully. The error numbers are high compared to our typical service health, but not quite disastrous yet.
I haven't been able to reproduce a memory leak locally, but we're working with a much smaller data set, and it is hard to automatically test the local instance with the load testing tools we have. I'm going to work on getting the CloudWatch agent installed on the API boxes so we can get memory monitoring on the production API and confirm whether memory really is part of the problem here.
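For illustration only (not the actual infrastructure change, which installs the CloudWatch agent): one hedged way to get a memory metric into CloudWatch from an EC2 box is to publish it with boto3 and psutil. The namespace and dimension names below are hypothetical.

```python
# Hedged illustration only: publish a memory-usage metric to CloudWatch
# with boto3 + psutil. The real work described above installs the
# CloudWatch agent; the namespace and dimensions here are hypothetical.
import boto3
import psutil


def publish_memory_metric(instance_id: str) -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Openverse/API",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "MemoryUsedPercent",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": psutil.virtual_memory().percent,
                "Unit": "Percent",
            }
        ],
    )
```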
Locally I can sometimes get the memory to increase steadily, in roughly half-megabyte increments for each search, but the growth fluctuates between 0.1 and 0.5 MB per search and sometimes doesn't happen at all. The increase sometimes occurs at the actual search execution itself (the call to the Elasticsearch API) and sometimes elsewhere. When I profile the entire search route, there are no discernible places where the memory usage increases.
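A minimal sketch of the kind of local measurement described above, comparing tracemalloc snapshots around a single search to see which lines allocate the most; the `run_search` callable is a stand-in for a request through the search route, not the actual API code.

```python
# Minimal sketch: compare tracemalloc snapshots around one search call to
# see which lines allocate the most. `run_search` is a stand-in for a
# request through the search route, not the actual Openverse API code.
import tracemalloc


def measure_search(run_search, query="cat", top=10):
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    run_search(query)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Show the lines whose allocations grew the most during the search.
    for stat in after.compare_to(before, "lineno")[:top]:
        print(stat)
```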
I'm not able to spend the time required each day to make much progress on this issue. I've got a draft PR in the private infrastructure repository that sets up CloudWatch on the API EC2 boxes, but it won't help us find the issue itself, just identify whether it really is a memory leak or something else. I would appreciate it if someone else from @WordPress/openverse could take on finishing that PR this week to help diagnose this issue.
This issue turned out to be caused by a (potentially) malicious actor making some unusual requests. I still think we have an issue, because the boxes shouldn't crash just from some big requests being sent, but @zackkrida was able to mitigate the problem in other ways, so I'm downgrading this to medium. We will still add the CloudWatch agent to the boxes so we get memory monitoring. The work to add it to the API EC2 boxes can be reused for our other EC2 boxes while we still have them.
We are continuing to see these SystemExit errors. Sentry suspects they are the result of d78abb9, but the transactions show that the issue is not exclusive to the thumbnails endpoint. In the most recent alert, this has happened 540+ times since May 18th, 2023. Sentry link: https://openverse.sentry.io/share/issue/6f89f6b794154e0c8f135bb55e681d19/
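One hedged note on why these reach Sentry as unhandled: SystemExit subclasses BaseException rather than Exception, so ordinary `except Exception` handlers in view or controller code won't catch it when a worker is told to exit mid-request. A tiny self-contained demonstration:

```python
# SystemExit subclasses BaseException, not Exception, so a typical
# `except Exception` block does not catch it; it propagates out of the
# request handler instead (roughly how it ends up surfacing in Sentry).
import sys


def handle_request(do_work):
    try:
        return do_work()
    except Exception:
        return "handled"  # SystemExit skips this branch


try:
    handle_request(lambda: sys.exit(1))
except SystemExit:
    print("SystemExit escaped the `except Exception` handler")
```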
We're working on https://github.com/WordPress/openverse-infrastructure/pull/545 to help debug whether this is a thumbnails issue.
We confirmed this is a thumbnails issue and are working on the Django ASGI milestone, with the #2843 project thread tracking this work. @AetherUnbound, do you think we can close this issue now that the investigation is over?
Great call @sarayourfriend, I believe so! |
Sentry links
https://sentry.io/share/issue/877df096da704c489499fec10762d95b/
https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/
https://sentry.io/share/issue/b02d92b6b1f64b7987a1d817ab298ce9/
The second link is associated with an already-resolved Sentry issue that has come back.
Description
The errors started, per the Sentry issue, a couple of days after the latest API deployment, which was this version: https://github.com/WordPress/openverse-api/releases/tag/v2.7.2
However, note that 2.7.1 was skipped for deployment, so this deployment also includes these changes: https://github.com/WordPress/openverse-api/releases/tag/v2.7.1
The errors started a few days after this was deployed, however, so I'm not sure whether they're actually related. If they are, it might be a dependency change, as the errors are being thrown in seemingly random places in the search_controller, not in any specific area related to the code changes we made in those releases.
To confirm: neither CPU usage nor traffic to the production API has changed. We do not currently have visibility into memory usage on the EC2 boxes (I may try to rectify this in the coming couple of weeks).
If the issue is related to the changes linked above, we could try a revert. This would be slightly more complicated than usual, as it would require reverting back to serving thumbnails via our local thumbnail service rather than Photon, which in turn requires a change to the infrastructure repository (a selective revert of the changes that removed the thumbnail service).
Setting as critical because there are a lot of 500s for this.
Reproduction