-
Notifications
You must be signed in to change notification settings - Fork 50
Only backfill dead links if at least one on first page was not dead #865
Conversation
c08ae86
to
7b09e21
Compare
CACHE_PREFIX = "valid:" | ||
|
||
|
||
def _get_cached_statuses(redis, image_urls): | ||
cached_statuses = redis.mget([CACHE_PREFIX + url for url in image_urls]) | ||
return [int(b.decode("utf-8")) if b is not None else None for b in cached_statuses] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is to facilitate mocking the cached statuses in the test. Mostly this is to avoid the status cache being prepopulated based on other tests request/response cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also welcome modularization!
7b09e21
to
67df008
Compare
API Developer Docs Preview: Ready https://wordpress.github.io/openverse-api/_preview/865 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. |
@krysal do you mind reviewing this in my stead while I run MSR this week? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look good to me 👍 We could also try with some of the problematic providers (e.g. EOL).
CACHE_PREFIX = "valid:" | ||
|
||
|
||
def _get_cached_statuses(redis, image_urls): | ||
cached_statuses = redis.mget([CACHE_PREFIX + url for url in image_urls]) | ||
return [int(b.decode("utf-8")) if b is not None else None for b in cached_statuses] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also welcome modularization!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I'm seeing the correct behavior! I ran this query:
http http://localhost:8000/v1/images/\?source\=flickr\&page\=5
before and after applying the patch to validate_images.py
, and the later resulted in a single ES query followed by many Deleting broken image from results id=... status=404
calls.
Fixes
Fixes #855 by @sarayourfriend
Description
If after validating dead links, there are zero results, assume any subsequent deeper paginations of the same query will not yield better returns. The lines changed is a lie: it's like 97% changes in the unit tests. The actual change required to make this work is two SLOC with some explanatory comments.
Testing Instructions
Check out the unit tests. To test this locally the best thing to do would be to use OpenSnitch or some other firewall to block outbound requests to Flickr. Then observe the logs and ensure that only a single query is sent to Elasticsearch. Note: this will be hard to do if on Linux with Docker as root (you'll need to mess with a bunch of iptables configurations). Instead, you can just modify the code to set the status code on the responses manually using this patch:
Then you can make any request and all links will get cached as dead. You should get responses back that have empty results now from queries that previously gave back results with the local data (like
source=flickr
).Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin