This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Only backfill dead links if at least one on first page was not dead #865

Merged
merged 3 commits into from
Aug 10, 2022

Conversation

sarayourfriend

@sarayourfriend sarayourfriend commented Aug 9, 2022

Fixes

Fixes #855 by @sarayourfriend

Description

If, after validating dead links, there are zero results, assume that deeper pagination of the same query will not yield better results. The lines-changed count is misleading: roughly 97% of the changes are in the unit tests. The actual change required to make this work is two SLOC plus some explanatory comments.
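The early exit described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `backfill_live_results`, `fetch_slice`, `is_live`, and `max_depth` are hypothetical names, and the real implementation queries Elasticsearch and the Redis status cache rather than plain callables.

```python
def backfill_live_results(fetch_slice, is_live, page_size, max_depth=5):
    """Collect up to page_size live results, paginating deeper only when useful.

    fetch_slice(depth) and is_live(item) are hypothetical stand-ins for the
    search query and the dead link check.
    """
    live = []
    for depth in range(max_depth):
        candidates = fetch_slice(depth)
        if not candidates:
            break
        live_here = [item for item in candidates if is_live(item)]
        if depth == 0 and not live_here:
            # Every link in the first slice is dead: assume deeper slices
            # of the same query will not do better and stop querying.
            return []
        live.extend(live_here)
        if len(live) >= page_size:
            break
    return live[:page_size]


# Record which slices get requested so the early exit is observable.
calls = []


def fetch_slice(depth):
    calls.append(depth)
    return [f"img-{depth}-{i}" for i in range(3)]


all_dead = backfill_live_results(fetch_slice, lambda item: False, page_size=3)
```

With every link reported dead, only the first slice is requested and the result is empty, which corresponds to the single Elasticsearch query expected in the testing instructions.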

Testing Instructions

Check out the unit tests. To test this locally, the best approach is to use OpenSnitch or another firewall to block outbound requests to Flickr, then watch the logs and confirm that only a single query is sent to Elasticsearch. Note: this is hard to do on Linux with Docker running as root (you would need to adjust a number of iptables rules). Instead, you can modify the code to set the status code on the responses manually using this patch:

diff --git a/api/catalog/api/utils/validate_images.py b/api/catalog/api/utils/validate_images.py
index 6ad23f6f..c4dcbb42 100644
--- a/api/catalog/api/utils/validate_images.py
+++ b/api/catalog/api/utils/validate_images.py
@@ -58,7 +58,7 @@ def validate_images(query_hash, start_slice, results, image_urls):
         # Response didn't arrive in time. Try again later.
         else:
             status = -1
-        to_cache[cache_key] = status
+        to_cache[cache_key] = 404
 
     thirty_minutes = 60 * 30
     twenty_four_hours_seconds = 60 * 60 * 24
@@ -83,7 +83,7 @@ def validate_images(query_hash, start_slice, results, image_urls):
     for idx, url in enumerate(to_verify):
         cache_idx = to_verify[url]
         if verified[idx] is not None:
-            cached_statuses[cache_idx] = verified[idx].status_code
+            cached_statuses[cache_idx] = 404
         else:
             cached_statuses[cache_idx] = -1

Then you can make any request and all links will be cached as dead. Queries that previously returned results with the local data (like source=flickr) should now return empty results.
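To see why forcing every status to 404 empties the response, here is a small sketch of filtering results by their validation status. `drop_dead_links` is a hypothetical helper, and treating only 200 as live is an assumption made for illustration; the project's actual handling of individual status codes may differ.

```python
def drop_dead_links(results, statuses):
    """Keep only results whose validation status is 200 (assumed to mean live)."""
    return [result for result, status in zip(results, statuses) if status == 200]


results = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

# Normal case: one dead link is removed from the page.
normal = drop_dead_links(results, [200, 404, 200])

# With the patch applied, every status is 404, so nothing survives.
patched = drop_dead_links(results, [404, 404, 404])
```

Here `normal` keeps the first and third results, while `patched` is an empty list.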

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟥 priority: critical Must be addressed ASAP labels Aug 9, 2022
@sarayourfriend sarayourfriend force-pushed the add/limit-dead-link-backfill branch from c08ae86 to 7b09e21 Compare August 9, 2022 11:55
Comment on lines +13 to +18
CACHE_PREFIX = "valid:"


def _get_cached_statuses(redis, image_urls):
    cached_statuses = redis.mget([CACHE_PREFIX + url for url in image_urls])
    return [int(b.decode("utf-8")) if b is not None else None for b in cached_statuses]

This change makes it easier to mock the cached statuses in the tests. Mostly this avoids the status cache being prepopulated by other tests' request/response cycles.

It's also welcome modularization!

@sarayourfriend sarayourfriend force-pushed the add/limit-dead-link-backfill branch from 7b09e21 to 67df008 Compare August 9, 2022 12:26
@github-actions
github-actions bot commented Aug 9, 2022

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/865

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@sarayourfriend sarayourfriend marked this pull request as ready for review August 9, 2022 12:59
@sarayourfriend sarayourfriend requested a review from a team as a code owner August 9, 2022 12:59
@AetherUnbound AetherUnbound removed their request for review August 9, 2022 19:02
@AetherUnbound

@krysal do you mind reviewing this in my stead while I run MSR this week?

@krysal krysal left a comment

Tests look good to me 👍 We could also try with some of the problematic providers (e.g. EOL).

@zackkrida zackkrida left a comment

I believe I'm seeing the correct behavior! I ran this query:

http http://localhost:8000/v1/images/\?source\=flickr\&page\=5

before and after applying the patch to validate_images.py, and the latter resulted in a single ES query followed by many Deleting broken image from results id=... status=404 calls.

Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟥 priority: critical Must be addressed ASAP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only fill in pages if at least one link from first page is not dead
5 participants