Cache repeated thumbnail failures within configured TTL #4249
Conversation
> Do this by storing a count for each media identifier of requested thumbnails.
> When the upstream request succeeds, increment the key. When it fails, decrement it.
>
> If the count goes below -2, then the thumbnail has failed twice within cache frame.
Suggested change:
- If the count goes below -2, then the thumbnail has failed twice within cache frame.
+ If the count goes below -2, then the thumbnail has failed twice within the cache frame.
super nit
> # The number of times to try a thumbnail request before caching a failure
> THUMBNAIL_FAILURE_CACHE_TOLERANCE = config(
super nit: `THUMBNAIL_FAILURE_CACHE_TOLERANCE` is probably clear enough, but `THUMBNAIL_FAILURE_CACHE_TOLERANCE_TRIES` would be even more explicit.
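For illustration, a minimal sketch of the setting under discussion. The real project reads this through its `config(...)` helper; reading from `os.environ` and the default value of 2 here are assumptions made purely for the example.

```python
import os

# The number of times to try a thumbnail request before caching a failure.
# The reviewer's suggested, more explicit name would be
# THUMBNAIL_FAILURE_CACHE_TOLERANCE_TRIES.
# NOTE: the os.environ lookup and the default of 2 are illustrative
# assumptions; the actual setting uses the project's config() helper.
THUMBNAIL_FAILURE_CACHE_TOLERANCE = int(
    os.environ.get("THUMBNAIL_FAILURE_CACHE_TOLERANCE", "2")
)
```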
> # Ignore warnings related to unverified HTTPS requests.
> # Reason: This warning is suppressed to avoid raising warnings when making HTTP requests
> # to servers with invalid or self-signed SSL certificates. It allows the tests to proceed
> # without being interrupted by these warnings.
Great docs 👍
To clarify, I copied these from the pytest.ini, didn't write them myself!
Awesome tests. Looks great. I left some nitpicks that I am calling "super nits" as they're beyond minuscule. Take 'em or leave 'em, nice work!
Excellent! Well documented, tests are super clear, and the rationale makes sense to me. Approving to unblock, although I do think we should fix my one comment about the docstring to avoid confusion.
> and avoid re-requesting images likely to fail.
>
> Do this by storing a count for each media identifier of requested thumbnails.
> When the upstream request succeeds, increment the key. When it fails, decrement it.
As I understand it, isn't this the opposite of what we do? We track the failure_count, and increment when it fails/decrement when it succeeds.
Oops, yes! I originally implemented a slightly different approach and forgot to update this docstring. I'll push a commit to address this and the edits @zackkrida proposed.
Co-authored-by: zack <[email protected]>
Fixes
Fixes https://github.com/WordPress/openverse-infrastructure/issues/550 by @sarayourfriend
Related to #3798 (doesn't close that issue because there are still places we'll want to clean up direct pook context management where it can be avoided, and `.matches` assertions as well)

Description
I've added a function to wrap the `image_proxy.get` function, because the process for caching was complex enough that trying to put it into `image_proxy.get` itself would get very messy. The new function decorates `image_proxy.get` and short-cuts requests with a cached failure response when the upstream request has failed enough times in a row within a window of time.

The approach I've chosen works under the assumption that thumbnails are more likely to be flaky than permanently broken, because dead link filtering should prevent completely unavailable images from appearing in search, which is the main place thumbnail requests would come from or become accessible from (whether on the frontend or through referencing the URLs returned by an API search request).
If we decided not to work with that assumption, then we could consider doing away with a TTL on the cached failures and instead permanently caching those thumbnails as failures. I can see an argument for doing that, but it's a bit more extreme, and this at least gets us to a place where we're more intelligently anticipating potential failures, without making it hard for us to experiment with other approaches in the future. In fact, if we wanted to test a hard failure without a TTL, we could just set a very large TTL and see whether whatever we're hoping happens in that case starts happening.
Because I've gone forward with the assumption that thumbnail requests are more likely to be flaky rather than outright persistent failures in most cases, I've also chosen to decrement the failure count whenever a successful response happens. This is easier than the approach suggested in the issue, which was to create separate started and completed tallies.

The approach suggested in the issue was trying to solve a problem where a thumbnail request could fail in such a way that our application never even got to return a failed response (I think because of an OOM or worker timeout, something like that). This was a bigger issue pre-ASGI. However, trying to handle that is far more complex, because we'd need to take into consideration that multiple requests could happen concurrently (across different or even the same workers) for the same thumbnail.

Imagine we set the "uncompleted request" tolerance to 5. If a thumbnail request came in, uncached, from 6 different users within a short period of time (100s of ms from each other), then the 6th request would see 5 "uncompleted" requests in the tallies, for requests that could succeed (they haven't necessarily failed). There's no way to disambiguate that situation from 5 "actually" uncompleted requests. In that case, the 6th requester would get a "cached" failure, even though the thumbnail would have succeeded. An easy suggestion is to increase the tolerance. However, that's not workable, because there isn't a tolerance that makes sense for the normal case of an unpopular but always-failing thumbnail that also prevents popular but successful thumbnails from looking like they've got a bunch of failures to concurrent requests.
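To make that ambiguity concrete, here is a toy illustration of the started/completed tally idea with a tolerance of 5; none of these names or structures come from the actual code, they only model the reasoning above.

```python
from collections import Counter

started = Counter()
completed = Counter()
TOLERANCE = 5  # the hypothetical "uncompleted request" tolerance


def begin_request(media_id):
    started[media_id] += 1


def finish_request(media_id):
    completed[media_id] += 1


def looks_failed(media_id):
    # "Uncompleted" requests are those started but not completed. This count
    # cannot distinguish genuinely crashed requests from ones still in flight.
    return started[media_id] - completed[media_id] >= TOLERANCE


# Five users request the same uncached thumbnail within a few hundred
# milliseconds of each other; none of the requests has finished yet.
for _ in range(5):
    begin_request("abc-123")

# A sixth requester now sees 5 "uncompleted" requests in the tallies and
# would get a cached failure, even though all five may yet succeed.
print(looks_failed("abc-123"))  # → True
```

Once those in-flight requests complete, the tallies recover, but by then the sixth user has already been served a spurious failure.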
As far as I know, the problem where our workers are crashing is not something we're dealing with now, so I've chosen to go with a simpler approach that handles a more common case, where a thumbnail fails in a way we can reliably track, and if it does that too much too quickly, well then assume it's going to keep failing, at least for however long we configure the TTL ("cache window") to be.
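The mechanism described above can be sketched as a decorator. This is an illustrative, synchronous simplification with a plain dict standing in for Redis; the function names, key format, and constants are assumptions, not the PR's actual implementation.

```python
import functools

FAILURE_CACHE_TTL = 60 * 60  # the configurable "cache window", in seconds
FAILURE_TOLERANCE = 2  # failures in a row before short-circuiting

_failure_counts = {}  # stands in for Redis keys with per-key expiry


class CachedFailure(Exception):
    """Raised instead of re-requesting a thumbnail known to be failing."""


def cache_repeated_failures(get_thumbnail):
    @functools.wraps(get_thumbnail)
    def wrapper(media_identifier, *args, **kwargs):
        key = f"thumbnail_failure:{media_identifier}"
        if _failure_counts.get(key, 0) >= FAILURE_TOLERANCE:
            # Short-cut: this thumbnail failed too many times recently.
            raise CachedFailure(media_identifier)
        try:
            result = get_thumbnail(media_identifier, *args, **kwargs)
        except Exception:
            # In Redis this would be an INCR plus an EXPIRE with the TTL.
            _failure_counts[key] = _failure_counts.get(key, 0) + 1
            raise
        # A success decrements the count, so flaky thumbnails recover.
        _failure_counts[key] = max(_failure_counts.get(key, 0) - 1, 0)
        return result

    return wrapper
```

In the real service the count would presumably live in Redis with a TTL on the key, so entries expire on their own after the configured cache window; the dict here never expires and is only for illustration.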
Aside from all that, I simplified the `test_image_proxy` module in the following ways:

- Use the `pytest-pook` plugin (Use `pytest-pook` plugin to clean up pook fixtures and usage in tests #3798) and remove the various `mock.matches` assertions, which the pytest plugin already checks
- Remove the `photon_get` fixture, and just use `async_to_sync` instead

I also moved the pytest.ini configuration into pyproject.toml to consolidate configuration there.
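The `async_to_sync` helper referred to above comes from `asgiref`. As a dependency-free sketch of the testing pattern, a minimal stand-in shows how a synchronous test can call an async function directly instead of going through a dedicated fixture; the stand-in and the fake function names are illustrative only.

```python
import asyncio


def async_to_sync_sketch(async_fn):
    # Minimal stand-in for asgiref.sync.async_to_sync, sufficient for tests
    # that are not already running inside an event loop.
    def wrapper(*args, **kwargs):
        return asyncio.run(async_fn(*args, **kwargs))

    return wrapper


async def fake_image_proxy_get(media_identifier):
    # Placeholder for the real async image_proxy.get under test.
    return f"thumbnail-for-{media_identifier}"


def test_thumbnail_get():
    # Instead of a custom photon_get fixture, wrap the coroutine function
    # and call it like a normal synchronous function.
    result = async_to_sync_sketch(fake_image_proxy_get)("abc-123")
    assert result == "thumbnail-for-abc-123"
```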
Testing Instructions
Check out the code and the unit tests. This is hard to actually test locally unless you modify a work to point to a known bad upstream URL, in which case you can reproduce it by requesting the failing thumbnail more times than the configured tolerance (refer to the new setting) and seeing that you eventually get the bypass error response, rather than the upstream error response.
Otherwise, I think I've covered the new wrapper in the unit tests, but please give feedback if there's a way to improve the clarity of the test or if I missed something.
Checklist

- My branch is up to date with `main` (or a parent feature branch).
- I ran the documentation generator (`just catalog/generate-docs` for catalog PRs) or the media properties generator (`just catalog/generate-docs media-props` for the catalog or `just api/generate-docs` for the API) where applicable.

Developer Certificate of Origin