Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache repeated thumbnail failures within configured TTL #4249

Merged
merged 7 commits into from
May 7, 2024

Conversation

sarayourfriend
Copy link
Collaborator

Fixes

Fixes https://github.com/WordPress/openverse-infrastructure/issues/550 by @sarayourfriend
Related to #3798 (doesn't close that issue because there are still places we'll want to clean up direct pook context management where it can be avoided, and .matches assertions as well)

Description

I've added a function to wrap the image_proxy.get function, because the process for caching was complex enough that trying to put it into image_proxy.get itself would get very messy. The new function decorates image_proxy.get and short-cuts requests with a cached failure response when the upstream request has failed enough times in a row within a window of time.

The approach I've chosen works under the assumption that thumbnails are more likely to be flaky, than permanently broken, because dead link filtering should prevent completely unavailable images from appearing in search, which is the main place thumbnail requests would come from or become accessible from (whether on the frontend or through referencing the URLs returned by an API search request).

If we decided not to work with that assumption, then we could consider doing away with a TTL on the cached failures, instead just permanently caching those thumbnails as a failure. I can see an argument for doing that, but it's a bit more extreme, and this at least gets us to a place where we're more intelligently anticipating potential failures, without making it hard for us to experiment with other approaches in the future. In fact, if we wanted to just test a hard-failure, without a TTL, we could just set a very large TTL, and see whether whatever we're hoping happens in that case starts happening.

Because I've gone forward with the assumption that thumbnail requests are more likely to be flaky rather than outright persistent failures in most cases, I've also chosen to decrement the failure count whenever a successful response happens. This is easier than the approach suggested in the issue, which was to create separate started and completed tallies. The approach suggested in the issue was trying to solve a problem where a thumbnail request could fail in such a way that our application never even got to return a failed response (I think because of an OOM or worker timeout, something like that). This was a bigger issue pre-ASGI. However, trying to handle that is far more complex because we'd need to take into consideration that multiple requests could happen concurrently (across different or even the same workers) for the same thumbnail. Imagine we set the "uncompleted request" tolerance to 5. If a thumbnail request came in, uncached, from 6 different users within a short period of time (100s of ms from each other), then the 6th request would see 5 "uncompleted" requests in the tallies, for requests that could succeed (they haven't necessarily failed). There's no way to disambiguate that situation from 5 "actually" uncompleted requests. In that case, the 6th requester would get a "cached" failure, even though the thumbnail would have succeeded. An easy suggestion is to increase the tolerance. However, that's not workable, because there isn't a tolerance that makes sense for the normal case of an unpopular but always failing thumbnail and also prevents popular but successful thumbnails from looking like they've got a bunch of failures to concurrent requests.

As far as I know, the problem where our workers are crashing is not something we're dealing with now, so I've chosen to go with a simpler approach that handles a more common case, where a thumbnail fails in a way we can reliably track, and if it does that too much too quickly, well then assume it's going to keep failing, at least for however long we configure the TTL ("cache window") to be.

Aside from all that, I simplified the test_image_proxy module in the following ways:

I also moved the pytest.ini configuration into pyproject.toml to consolidate configuration there.

Testing Instructions

Check out the code and the unit tests. This is hard to actually test locally unless you modify a work to point to a known bad upstream URL, in which case you can reproduce it by requesting the failing thumbnail more times than the configured tolerance (refer to the new setting) and seeing that you eventually get the bypass error response, rather than the upstream error response.

Otherwise, I think I've covered the new wrapper in the unit tests, but please give feedback if there's a way to improve the clarity of the test or if I missed something.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested a review from a team as a code owner May 1, 2024 07:56
@sarayourfriend sarayourfriend requested review from obulat and stacimc May 1, 2024 07:56
@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels May 1, 2024
Do this by storing a count for each media identifier of requested thumbnails.
When the upstream request succeeds, increment the key. When it fails, decrement it.

If the count goes below -2, then the thumbnail has failed twice within cache frame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the count goes below -2, then the thumbnail has failed twice within cache frame.
If the count goes below -2, then the thumbnail has failed twice within the cache frame.

super nit

)

# The number of times to try a thumbnail request before caching a failure
THUMBNAIL_FAILURE_CACHE_TOLERANCE = config(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: THUMBNAIL_FAILURE_CACHE_TOLERANCE is probably clear enough but THUMBNAIL_FAILURE_CACHE_TOLERANCE_TRIES would be even more explicit.

Comment on lines +75 to +78
# Ignore warnings related to unverified HTTPS requests.
# Reason: This warning is suppressed to avoid raising warnings when making HTTP requests
# to servers with invalid or self-signed SSL certificates. It allows the tests to proceed
# without being interrupted by these warnings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great docs 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I copied these from the pytest.ini, didn't write them myself!

Copy link
Member

@zackkrida zackkrida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome tests. Looks great. I left some nitpicks that I am calling "super nits" as they're beyond minuscule. Take 'em or leave 'em, nice work!

Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Well documented, tests are super clear, and the rationale makes sense to me. Approving to unblock, although I do think we should fix my one comment about the docstring to avoid confusion.

and avoid re-requesting images likely to fail.

Do this by storing a count for each media identifier of requested thumbnails.
When the upstream request succeeds, increment the key. When it fails, decrement it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, isn't this the opposite of what we do? We track the failure_count, and increment when it fails/decrement when it succeeds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes! I originally implemented a slightly different approach and forgot to update this docstring. I'll push a commit to address this and the edits @zackkrida proposed.

@sarayourfriend sarayourfriend merged commit 3c1ec34 into main May 7, 2024
45 checks passed
@sarayourfriend sarayourfriend deleted the add/cache-thumbnail-failures-api-side branch May 7, 2024 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants