Add DAG to remove Flickr thumbnails #2302

Merged
merged 6 commits into from
Jun 6, 2023

Conversation

krysal
Member

@krysal krysal commented Jun 2, 2023

Fixes

Fixes #1816 by @krysal

Description

Flickr is the last, and the main, provider retaining thumbnails that do not fit our requirements for display in the Openverse UI (mostly on desktop), so here is a DAG to remove them progressively in batches. Working in batches should let other tasks run at the same time while the removal advances steadily. It uses the new TaskFlow API, which comes in really handy for a DAG like this.

After the DAG has run successfully, we will be able to revert #1812 on the API side.

As minor related changes, I also exposed the port of the upstream_db so that GUI tools like DataGrip or TablePlus can connect to it, and fixed a name shadowing issue in the Flickr DAG.

Testing Instructions

  1. Run the normal Flickr DAG to ingest some rows.
  2. Modify those rows to have some fake thumbnail URLs:
UPDATE image SET thumbnail='https://flickr.com/fake_thumb.JPG' WHERE identifier IN (
	SELECT identifier FROM image WHERE provider='flickr' AND thumbnail IS NULL
	FETCH FIRST 50000 ROWS ONLY
);
  3. Run the new flickr_thumbnails_removal DAG.
  4. Check the Flickr rows in the DB:
-- This query should return 0
SELECT count(*) FROM image WHERE provider='flickr' AND thumbnail IS NOT NULL;
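
The batched removal and the final check can be tried without Airflow or PostgreSQL. The following is a minimal sketch, not the DAG from this PR: it assumes an in-memory SQLite database as a stand-in for the catalog, a hypothetical three-column `image` table, and `LIMIT` in place of `FETCH FIRST ... ROWS ONLY` (which SQLite does not support). The function and variable names are illustrative.

```python
import sqlite3

BATCH_SIZE = 3  # tiny for the demo; the query in this PR uses 10000 rows per batch


def remove_thumbnails_in_batches(conn, provider, batch_size=BATCH_SIZE):
    """Null out thumbnails batch by batch; return how many batches updated rows."""
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE image SET thumbnail = NULL WHERE identifier IN (
                SELECT identifier FROM image
                WHERE provider = ? AND thumbnail IS NOT NULL
                LIMIT ?
            )
            """,
            (provider, batch_size),
        )
        conn.commit()  # commit per batch so a run can be stopped and resumed
        if cur.rowcount == 0:
            return batches  # nothing left to update
        batches += 1


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
    )
    # Ten Flickr rows with fake thumbnails, plus one row from another provider.
    rows = [(f"id-{i}", "flickr", "https://flickr.com/fake_thumb.JPG") for i in range(10)]
    rows.append(("id-other", "other_provider", "https://example.com/thumb.jpg"))
    conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)

    batches = remove_thumbnails_in_batches(conn, "flickr")
    remaining = conn.execute(
        "SELECT count(*) FROM image WHERE provider = 'flickr' AND thumbnail IS NOT NULL"
    ).fetchone()[0]
    untouched = conn.execute(
        "SELECT count(*) FROM image WHERE provider = 'other_provider' AND thumbnail IS NOT NULL"
    ).fetchone()[0]
    return batches, remaining, untouched
```

Because each batch is committed on its own, interrupting the loop leaves the already-cleared rows cleared, and a later run only picks up rows that still have a thumbnail.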

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Jun 2, 2023
@krysal krysal requested review from a team as code owners June 2, 2023 16:51
@krysal krysal requested review from dhruvkb and stacimc June 2, 2023 16:51
@krysal krysal force-pushed the rm_flickr_thumbs branch 2 times, most recently from 28dd2f4 to 8b58802 Compare June 2, 2023 19:45
@stacimc
Contributor

stacimc commented Jun 3, 2023

I mentioned this on the popularity refresh project thread but should have also done so on #1816 -- I'm currently working on a reusable batched_update DAG that will be used by the popularity refresh DAGs, but can also be run manually for these sorts of backfills without having to spin up temporary DAGs. I plan on having that up by early next week.

The DAG I'm working on uses a slightly different approach for the batched update, which I think is slightly more optimized (about 1.3x as fast during some tests I did on a DB snapshot of production data, although I was only able to run a few tests). I suspect this update is going to be quite slow either way 😞 so any performance improvement might really add up. A full update of Flickr needs almost 50k batches, although I actually don't know how many have null thumbnails. Do you have a sense of how many need to be updated?

@krysal
Member Author

krysal commented Jun 5, 2023

@stacimc On March 11, I commented with the following ratio of thumbnail availability for Flickr. Today there could be fewer, since that was before my attempt to run the UPDATE manually.

+---------------+------------------+----------------------+
| provider      | thumbs_available | thumbs_not_available |
|---------------+------------------+----------------------|
| flickr        | 497009314        | 911578               |
+---------------+------------------+----------------------+

I'm eager to see the optimizations for this task. However, since it's expected to take some time either way, I would like to start as soon as possible. The DAG I created here can be started and stopped at any time without harm. We can move forward while you prepare the other DAG and it is reviewed without pressure. What do you think?
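
As a rough cross-check of the "almost 50k batches" estimate, the batch count can be computed from the table above. This sketch assumes that `thumbs_available` counts the rows that still hold a thumbnail, and that each batch covers 10000 rows as in the reviewed query; both are readings of the thread, not confirmed figures.

```python
THUMBS_AVAILABLE = 497_009_314  # "thumbs_available" for flickr in the table above
BATCH_SIZE = 10_000             # assumed from FETCH FIRST 10000 ROWS ONLY


def batches_needed(rows: int, batch_size: int) -> int:
    """Ceiling division: the last batch may be only partially full."""
    return -(-rows // batch_size)


FLICKR_BATCHES = batches_needed(THUMBS_AVAILABLE, BATCH_SIZE)  # 49701
```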

Member

@dhruvkb dhruvkb left a comment

My limited understanding of the catalog notwithstanding, this LGTM!

query = dedent(
f"""
UPDATE image SET thumbnail = NULL WHERE identifier IN
(SELECT identifier {select_conditions} FETCH FIRST 10000 ROWS ONLY)
Member

This was new to me! I only knew LIMIT till now.

- for license in LICENSE_INFO.keys():
+ for license_ in LICENSE_INFO.keys():
Member

Just for my own edification, what is the reason behind this rename?

Member Author

PyCharm was complaining that it was shadowing the built-in name.
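
For context, `license` is a name that Python's `site` module normally adds to the built-ins, so using it as a loop variable hides the built-in within that scope. The sketch below illustrates the rename with a hypothetical stand-in for `LICENSE_INFO` (the values are illustrative). Behaviour is identical in both versions; only the linter warning goes away.

```python
# Hypothetical stand-in for the Flickr DAG's LICENSE_INFO mapping (illustrative values).
LICENSE_INFO = {"1": ("by-nc-sa", "2.0"), "2": ("by-nc", "2.0")}


def keys_with_shadowing():
    collected = []
    for license in LICENSE_INFO.keys():  # rebinds "license", hiding the built-in in this scope
        collected.append(license)
    return collected


def keys_with_rename():
    collected = []
    for license_ in LICENSE_INFO.keys():  # trailing underscore avoids the clash (PEP 8 convention)
        collected.append(license_)
    return collected
```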

@AetherUnbound
Collaborator

@krysal IMO, waiting for the more generic version of the DAG that @stacimc is working on would be preferable. There are a few reasons for this:

  • We want to establish a process for doing these kinds of updates in the future and we're almost there with the new DAG
  • We have the perfect test case for using this new process against production data with the Flickr thumbnails. Besides the popularity data work, I don't know when we'll have another opportunity to run that DAG and see how it works/behaves/formalize its process.
  • It's potentially more optimized, which means the generic DAG should complete sooner.
  • If we wait and use the generic DAG, we don't have to worry about retiring or removing this temporary DAG down the line.

@stacimc - do you feel confident that the case we have here will be possible with the generic version you're working on? I also recognize that you'd like to get this completed ASAP, Krystle. Staci, do you feel like you'd be able to prioritize it so we can kick off this Flickr update?

@krysal
Member Author

krysal commented Jun 5, 2023

@AetherUnbound I don't think any of those reasons are strong enough to block this particular task.

We have the perfect test case for using this new process against production data with the Flickr thumbnails.

Isn't the popularity calculations backfill the main case for @stacimc's DAG?

I don't know when we'll have another opportunity to run that DAG and see how it works/behaves/formalize its process.

We need to clean the tags, so that is another opportunity to try the DAG, despite it being a bit more complicated due to the type of data.

It's potentially more optimized, which means the generic DAG should complete sooner.

As I said before, since it's expected to take some time either way, it would be better to start sooner and make progress on what we can achieve with this DAG.

If we wait and use the generic DAG, we don't have to worry about retiring or removing this temporary DAG down the line.

The code change is so small and easy that retiring it later shouldn't outweigh the benefit of gaining time. Tasks related to thumbnails have been delayed for too long.

However, I'll agree to wait if you both want to use Flickr's thumbnails as a test case.

Contributor

@stacimc stacimc left a comment

Isn't the popularity calculations backfill the main case for @stacimc's DAG?

I'm writing a batched_update DAG that will be used by the popularity refresh DAGs, but which can also be run manually to do arbitrary SQL updates. The goal was for this to remove the need for one-off temporary DAGs for backfills. Its implementation was unfortunately delayed because of the catalog performance investigations.

The DAG is working (and definitely works for this use case), I am just writing tests. It is my priority and should be up by the end of the day tomorrow.

As I said before, since it's expected to take some time either way, it would be better to start sooner and make progress on what we can achieve with this DAG.

I definitely see your point -- at least what I was getting at was that if this DAG is going to be very long running, which may very well be the case, then a delay of a day or two on the PR might actually still be faster. There is certainly no harm in starting this DAG, though.

I was hoping to use the thumbnail update to test out the new DAG, if urgency allows. The v1 implementation is not especially complex to review because of some limitations on dynamic task mapping; the primary difference is in the use of indexed temp tables for updating (which speeds up the inner SELECT per batch), and the configurability of the DAG itself. Of course, I'm not sure how long the review will take.
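
To make the temp-table idea concrete, here is a minimal sketch of the general technique. It is not the implementation from the batched_update DAG, which is not shown in this thread. It assumes an in-memory SQLite database, a hypothetical `rows_to_update` temp table, and illustrative names throughout. The matching rows are selected once and numbered; each batch then looks up a contiguous `row_id` range instead of re-scanning `image` for rows that still match.

```python
import sqlite3


def batched_update_with_temp_table(conn, batch_size):
    """Select the target rows once, then update them in row_id ranges."""
    # INTEGER PRIMARY KEY gives sequential row_ids and doubles as the index here;
    # in PostgreSQL one would create an explicit index on the temp table instead.
    conn.execute(
        "CREATE TEMP TABLE rows_to_update (row_id INTEGER PRIMARY KEY, identifier TEXT)"
    )
    conn.execute(
        """
        INSERT INTO rows_to_update (identifier)
        SELECT identifier FROM image
        WHERE provider = 'flickr' AND thumbnail IS NOT NULL
        """
    )
    total = conn.execute("SELECT count(*) FROM rows_to_update").fetchone()[0]

    updated = 0
    for start in range(1, total + 1, batch_size):
        cur = conn.execute(
            """
            UPDATE image SET thumbnail = NULL WHERE identifier IN (
                SELECT identifier FROM rows_to_update
                WHERE row_id >= ? AND row_id < ?
            )
            """,
            (start, start + batch_size),
        )
        conn.commit()  # one transaction per batch
        updated += cur.rowcount
    conn.execute("DROP TABLE rows_to_update")
    return updated


def demo_temp_table():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
    )
    rows = [(f"id-{i}", "flickr", "https://flickr.com/fake_thumb.JPG") for i in range(10)]
    rows.append(("id-other", "other_provider", "https://example.com/thumb.jpg"))
    conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)

    updated = batched_update_with_temp_table(conn, batch_size=4)
    remaining = conn.execute(
        "SELECT count(*) FROM image WHERE provider = 'flickr' AND thumbnail IS NOT NULL"
    ).fetchone()[0]
    untouched = conn.execute(
        "SELECT count(*) FROM image WHERE thumbnail IS NOT NULL"
    ).fetchone()[0]
    return updated, remaining, untouched
```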

All that said, if you feel strongly that this should be started today we can go ahead. My only blocking request is the addition of SKIP LOCKED.
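
For readers unfamiliar with the clause: in PostgreSQL, `FOR UPDATE SKIP LOCKED` on the inner SELECT makes a batch skip rows that another transaction currently holds locked instead of waiting for them, and a later batch can pick those rows up. The sketch below shows one way the reviewed query could carry the clause. The WHERE condition is an assumption standing in for `{select_conditions}`, and `build_batch_query` is a hypothetical helper, not code from this PR.

```python
from textwrap import dedent


def build_batch_query(batch_size: int = 10000) -> str:
    """Return a PostgreSQL UPDATE that nulls one batch of Flickr thumbnails."""
    return dedent(
        f"""
        UPDATE image SET thumbnail = NULL WHERE identifier IN (
            SELECT identifier FROM image
            WHERE provider = 'flickr' AND thumbnail IS NOT NULL
            FETCH FIRST {int(batch_size)} ROWS ONLY
            FOR UPDATE SKIP LOCKED
        )
        """
    ).strip()
```

In PostgreSQL's SELECT grammar the locking clause comes after `FETCH FIRST`, which is why it is placed last in the subquery.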

catalog/dags/flickr_thumbs_removal.py (2 outdated review threads, resolved)
@krysal
Member Author

krysal commented Jun 5, 2023

Another good reason to try this DAG is to have hard data for effective comparison. So far we have been talking about hypothetical efficiency, but no numbers have been shared. I can't tell where that is coming from until the other DAG is up for review and tried.

Contributor

@stacimc stacimc left a comment

Another good reason to try this DAG is to have hard data for effective comparison. So far we have been talking about hypothetical efficiency, but no numbers have been shared. I can't tell where that is coming from until the other DAG is up for review and tried.

Approving, but for what it's worth we will not be able to compare this in production once the update is finished because this exact update can't be reasonably performed twice. Based on what we have seen in other tests, it will not be especially meaningful to compare it to different updates -- there are too many confounding factors.

The performance test I mentioned earlier was on production data on a test DB instance restored from a production snapshot. The number I gave was 1.3x as fast. I was comparing the performance of updating a single batch. If you are curious it took a little more than an hour (1hr 2min 22sec) with this approach, and 45min 12 sec with the approach in #2331. I gave the relative performance rather than the exact times because in our testing we've seen that production is generally faster than these test instances, so the absolute run time in the tests isn't necessarily predictive, just the relative performance. I am very hopeful that the batches will be faster on prod but we'll have to see; this is one of the reasons I'm very eager to test #2331 soon.
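
The relative figure can be reproduced from the two single-batch timings quoted above. This is plain arithmetic on those numbers and says nothing about production performance.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    return hours * 3600 + minutes * 60 + seconds


THIS_PR_BATCH = to_seconds(hours=1, minutes=2, seconds=22)  # 3742 s with this PR's approach
TEMP_TABLE_BATCH = to_seconds(minutes=45, seconds=12)       # 2712 s with the approach in #2331

SPEEDUP = THIS_PR_BATCH / TEMP_TABLE_BATCH  # roughly 1.38, in line with the "1.3x" quoted
```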

@krysal krysal merged commit 7804de6 into main Jun 6, 2023
@krysal krysal deleted the rm_flickr_thumbs branch June 6, 2023 13:10
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Delete unacceptable thumbnails from catalog DB after the image data refresh is finished
4 participants