Cache repeated thumbnail failures within configured TTL #4249

sarayourfriend · 2024-05-01T07:56:48Z

Fixes

Fixes https://github.com/WordPress/openverse-infrastructure/issues/550 by @sarayourfriend
Related to #3798 (doesn't close that issue because there are still places we'll want to clean up direct pook context management where it can be avoided, and .matches assertions as well)

Description

I've added a function to wrap the image_proxy.get function, because the process for caching was complex enough that trying to put it into image_proxy.get itself would get very messy. The new function decorates image_proxy.get and short-cuts requests with a cached failure response when the upstream request has failed enough times in a row within a window of time.
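To make the shape of that wrapper concrete, here is a minimal sketch. It is not the code in this PR: `FailureCache`, `cache_repeated_failures`, `UpstreamFailure`, and `CachedFailure` are hypothetical names, the cache is an in-memory stand-in for whatever store the API uses, and the sketch is synchronous for brevity. Whether the threshold check is `>=` or `>`, and when the window starts, are assumptions too.

```python
import time
from typing import Callable


class UpstreamFailure(Exception):
    """Hypothetical stand-in for an error returned by the upstream request."""


class CachedFailure(Exception):
    """Hypothetical stand-in for the short-circuit ("bypass") error."""


class FailureCache:
    """In-memory failure counter with per-key expiry (illustrative only)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data: dict[str, tuple[int, float]] = {}

    def get(self, key: str) -> int:
        count, expires_at = self._data.get(key, (0, 0.0))
        if self.clock() >= expires_at:
            # Missing or expired: forget the key and start from zero.
            self._data.pop(key, None)
            return 0
        return count

    def add(self, key: str, delta: int) -> int:
        # Never let the count go negative.
        count = max(self.get(key) + delta, 0)
        # Assumption: the window starts when the key is first written.
        _, expires_at = self._data.get(key, (0, self.clock() + self.ttl))
        self._data[key] = (count, expires_at)
        return count


def cache_repeated_failures(fetch, cache: FailureCache, tolerance: int):
    """Wrap `fetch` so repeated failures inside the window are short-circuited."""

    def wrapper(identifier: str):
        if cache.get(identifier) >= tolerance:
            # Too many recent failures: skip the upstream request entirely.
            raise CachedFailure(identifier)
        try:
            response = fetch(identifier)
        except UpstreamFailure:
            cache.add(identifier, 1)  # count the failure
            raise
        cache.add(identifier, -1)  # a success offsets an earlier failure
        return response

    return wrapper
```

Once the window expires the count resets, so a flaky thumbnail gets another chance at the upstream request.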

The approach I've chosen works under the assumption that thumbnails are more likely to be flaky than permanently broken, because dead link filtering should prevent completely unavailable images from appearing in search, which is the main place thumbnail requests would come from or become accessible from (whether on the frontend or through referencing the URLs returned by an API search request).

If we decided not to work with that assumption, then we could consider doing away with a TTL on the cached failures, instead just permanently caching those thumbnails as a failure. I can see an argument for doing that, but it's a bit more extreme, and this at least gets us to a place where we're more intelligently anticipating potential failures, without making it hard for us to experiment with other approaches in the future. In fact, if we wanted to just test a hard-failure, without a TTL, we could just set a very large TTL, and see whether whatever we're hoping happens in that case starts happening.

Because I've gone forward with the assumption that thumbnail requests are more likely to be flaky than outright persistent failures in most cases, I've also chosen to decrement the failure count whenever a successful response happens. This is easier than the approach suggested in the issue, which was to create separate started and completed tallies. The approach suggested in the issue was trying to solve a problem where a thumbnail request could fail in such a way that our application never even got to return a failed response (I think because of an OOM or worker timeout, something like that). This was a bigger issue pre-ASGI.

However, trying to handle that is far more complex because we'd need to take into consideration that multiple requests could happen concurrently (across different or even the same workers) for the same thumbnail. Imagine we set the "uncompleted request" tolerance to 5. If a thumbnail request came in, uncached, from 6 different users within a short period of time (100s of ms from each other), then the 6th request would see 5 "uncompleted" requests in the tallies, for requests that could still succeed (they haven't necessarily failed). There's no way to disambiguate that situation from 5 "actually" uncompleted requests. In that case, the 6th requester would get a "cached" failure, even though the thumbnail would have succeeded. An easy suggestion is to increase the tolerance. However, that's not workable, because there isn't a tolerance that makes sense for the normal case of an unpopular but always failing thumbnail and also prevents popular but successful thumbnails from looking like they've got a bunch of failures due to concurrent requests.
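The ambiguity with started/completed tallies can be shown with a small simulation. This is only a sketch of the rejected idea, not code from the issue or this PR: the `decisions` helper and the event names are hypothetical.

```python
def decisions(events: list[str], tolerance: int) -> list[str]:
    """Replay 'start'/'complete' events and record what each 'start' would do.

    A request is bypassed when the number of started-but-not-completed
    requests has reached the tolerance.
    """
    started = completed = 0
    results = []
    for event in events:
        if event == "start":
            if started - completed >= tolerance:
                results.append("bypass")
            else:
                results.append("fetch")
                started += 1
        elif event == "complete":
            completed += 1
    return results


TOLERANCE = 5

# Five requests crashed their workers and never completed; a sixth arrives.
crashing = ["start"] * 6

# Six users request a healthy thumbnail at nearly the same moment; the first
# five only complete after the sixth has already started.
popular = ["start"] * 6 + ["complete"] * 5

# The same healthy thumbnail requested six times, one after another.
sequential = ["start", "complete"] * 6

print(decisions(crashing, TOLERANCE)[-1])
print(decisions(popular, TOLERANCE)[-1])
print(decisions(sequential, TOLERANCE)[-1])
```

The sixth request is bypassed in both the crashing and the popular case, because the tallies look identical at the moment it arrives. Only the sequential case gets through.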

As far as I know, the problem where our workers are crashing is not something we're dealing with now, so I've chosen to go with a simpler approach that handles a more common case, where a thumbnail fails in a way we can reliably track. If it does that too much too quickly, we assume it's going to keep failing, at least for however long we configure the TTL ("cache window") to be.

Aside from all that, I simplified the test_image_proxy module in the following ways:

Introduce pytest-pook (Use pytest-pook plugin to clean up pook fixtures and usage in tests #3798) and remove the various mock.matches assertions, which the pytest plugin already checks
Remove the custom async wrapper photon_get fixture, and just use async_to_sync instead

I also moved the pytest.ini configuration into pyproject.toml to consolidate configuration there.

Testing Instructions

Check out the code and the unit tests. This is hard to actually test locally unless you modify a work to point to a known bad upstream URL, in which case you can reproduce it by requesting the failing thumbnail more times than the configured tolerance (refer to the new setting) and seeing that you eventually get the bypass error response, rather than the upstream error response.

Otherwise, I think I've covered the new wrapper in the unit tests, but please give feedback if there's a way to improve the clarity of the test or if I missed something.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
[N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog
PRs) or the media properties generator (just catalog/generate-docs media-props
for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

zackkrida · 2024-05-03T18:55:02Z

api/api/utils/image_proxy/__init__.py

+    Do this by storing a count for each media identifier of requested thumbnails.
+    When the upstream request succeeds, increment the key. When it fails, decrement it.
+
+    If the count goes below -2, then the thumbnail has failed twice within cache frame.


Suggested change

-    If the count goes below -2, then the thumbnail has failed twice within cache frame.
+    If the count goes below -2, then the thumbnail has failed twice within the cache frame.

super nit

zackkrida · 2024-05-03T18:58:09Z

api/conf/settings/thumbnails.py

+)
+
+# The number of times to try a thumbnail request before caching a failure
+THUMBNAIL_FAILURE_CACHE_TOLERANCE = config(


super nit: THUMBNAIL_FAILURE_CACHE_TOLERANCE is probably clear enough but THUMBNAIL_FAILURE_CACHE_TOLERANCE_TRIES would be even more explicit.

zackkrida · 2024-05-03T18:58:29Z

api/pyproject.toml

+  # Ignore warnings related to unverified HTTPS requests.
+  # Reason: This warning is suppressed to avoid raising warnings when making HTTP requests
+  # to servers with invalid or self-signed SSL certificates. It allows the tests to proceed
+  # without being interrupted by these warnings.


Great docs 👍

sarayourfriend · 2024-05-07

To clarify, I copied these from the pytest.ini, didn't write them myself!

api/test/unit/utils/test_image_proxy.py

zackkrida

Awesome tests. Looks great. I left some nitpicks that I am calling "super nits" as they're beyond minuscule. Take 'em or leave 'em, nice work!

stacimc

Excellent! Well documented, tests are super clear, and the rationale makes sense to me. Approving to unblock, although I do think we should fix my one comment about the docstring to avoid confusion.

stacimc · 2024-05-03T20:43:42Z

api/api/utils/image_proxy/__init__.py

+    and avoid re-requesting images likely to fail.
+
+    Do this by storing a count for each media identifier of requested thumbnails.
+    When the upstream request succeeds, increment the key. When it fails, decrement it.


As I understand it, isn't this the opposite of what we do? We track the failure_count, and increment when it fails/decrement when it succeeds.

sarayourfriend · 2024-05-03

Oops, yes! I originally implemented a slightly different approach and forgot to update this docstring. I'll push a commit to address this and the edits @zackkrida proposed.

sarayourfriend added 4 commits May 1, 2024 17:24

Add pytest-pook and move pytest config into pyproject.toml

71112dc

Remove unnecessary custom async test handling for image proxy

f8ec8d5

Remove unnecessary pook mock match assertions

2cc6577

pytest-pook handles this check automatically

Cache repeated thumbnail failures within configured TTL

42fe2fb

sarayourfriend requested a review from a team as a code owner May 1, 2024 07:56

sarayourfriend requested review from obulat and stacimc May 1, 2024 07:56

sarayourfriend added the labels 🟧 priority: high, 💻 aspect: code, 🧰 goal: internal improvement, and 🧱 stack: api on May 1, 2024

Fix media info interface to match actual usage

8fba9e1

zackkrida reviewed May 3, 2024

api/api/utils/image_proxy/__init__.py (resolved)

zackkrida reviewed May 3, 2024

api/test/unit/utils/test_image_proxy.py (outdated, resolved)

zackkrida approved these changes May 3, 2024

stacimc approved these changes May 3, 2024

sarayourfriend and others added 2 commits May 7, 2024 10:13

Fix documentation string

d09f179

Fix typo

10fe35a

Co-authored-by: zack <[email protected]>

sarayourfriend merged commit 3c1ec34 into main May 7, 2024
45 checks passed

sarayourfriend deleted the add/cache-thumbnail-failures-api-side branch May 7, 2024 03:44

sarayourfriend mentioned this pull request Jul 29, 2024

Reevaluate thumbnail fallback behavior #510

Open
