Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Remove thumbnails from images #526

Merged
merged 10 commits into from
Jun 2, 2022
Merged

Remove thumbnails from images #526

merged 10 commits into from
Jun 2, 2022

Conversation

obulat
Copy link
Contributor

@obulat obulat commented May 20, 2022

Fixes

Fixes WordPress/openverse#1561 by @obulat

Description

This PR removes thumbnail_url from image provider scripts, and hard-codes thumbnail_url to None in the imageStore class.

Testing Instructions

Run the tests - they should pass. Try running the image DAGs and see that they save image thumbnails as NULL.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat requested a review from a team as a code owner May 20, 2022 15:57
@obulat obulat requested review from dhruvkb and stacimc May 20, 2022 15:57
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels May 20, 2022
Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited for all this code cleanup! Just a few comments. The only other thumbnail references I'm seeing are in the archive/commoncrawl, which is not being used at the moment and seems like a separate can of worms.

Could we add a test that the thumbnail gets hard-coded to None, maybe in the image store tests?

@zackkrida zackkrida requested a review from stacimc May 24, 2022 10:39
Signed-off-by: Olga Bulat <[email protected]>
@obulat
Copy link
Contributor Author

obulat commented May 25, 2022

Could we add a test that the thumbnail gets hard-coded to None, maybe in the image store tests?

I tried adding a specific test for None in the thumbnail, but decided to simply modify existing tests:

  • the public method ImageStore.add_image that is used by the provider scripts does not accept the thumbnail_url parameter. If you pass it, it throws an error.
  • the inner method ImageStore._get_image is used by the add_image, and by the tests as a short-cut. It was possible to pass a non-null thumbnail_url there (even though only the tests would do it), so I added a line to _get_image that sets thumbnail_url to None anyways. That is tested in one of the tests in test_image.

I wonder if it was better to simply use add_image in tests, without ever using _get_image, which would make the tests more like the usage in the real code. Or maybe it doesn't make sense to call _get_image with arguments that add_image can pass it...

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I love seeing change counts like this 🤩
Screenshot_2022-05-25_07-46-22

This prevents us from acquiring new thumbnail IDs - it would be awesome if we could remove thumbnail_url altogether from the catalog and the API. Will there be a follow up PR to remove that column from the table? I think it would be a good idea to do that shortly after this 😄

@@ -94,7 +94,8 @@ def _get_items():
while should_continue:
query_params = _get_query_params(offset=offset)
batch_data = _get_batch_json(query_params=query_params)
if isinstance(batch_data, list) and len(batch_data) > 0:
logger.info(f"batch_data: {batch_data}, query_params: {query_params}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_data is going to be a pretty large list in most cases, do you think this could be a logger.debug line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was probably left their by mistake :)

@obulat
Copy link
Contributor Author

obulat commented May 26, 2022

This prevents us from acquiring new thumbnail IDs - it would be awesome if we could remove thumbnail_url altogether from the catalog and the API. Will there be a follow-up PR to remove that column from the table? I think it would be a good idea to do that shortly after this 😄

I noted it elsewhere, but will also write it here:
We do use thumbnail_url for other media types, and currently, it is in the list of common columns for all media types. We talked about video and 3D model maybe using thumbnails, too. It might be easier to simply leave it as a common column for all media types, but leave it as NULL for images. What do you think?

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally makes sense! 🚀 Do you think it might be worth running an update to set the field to NULL for all existing images?

@obulat
Copy link
Contributor Author

obulat commented May 28, 2022

Totally makes sense! 🚀 Do you think it might be worth running an update to set the field to NULL for all existing images?

Yes, absolutely!

@obulat obulat requested a review from stacimc June 2, 2022 11:00
@obulat obulat self-assigned this Jun 2, 2022
Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 🎉 Look at that beautiful stat for lines changed 😄

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove thumbnail field for images from the catalog
4 participants