
Add popularity refresh DAGs #2592

Merged
stacimc merged 10 commits into main from add/pop-refresh-dag-factory on Jul 25, 2023

Conversation

@stacimc (Collaborator) commented Jul 7, 2023

Fixes

Fixes #2089 by @stacimc

Description

Adds a popularity refresh DAG factory that generates audio_popularity_refresh and image_popularity_refresh DAGs (a rough sketch follows the list below). Each DAG:

  • refreshes popularity metrics and constants (recalculating the popularity constants)
  • then, triggers a batched_update DagRun for each of the providers of this media type that support popularity data
  • waits for the batched updates to complete and notifies Slack
(Screenshot: Screen Shot 2023-07-19 at 5 13 21 PM)
  • is currently on a None schedule (so it must be triggered manually) and has very generous timeouts; these will be revisited in follow-up PRs after testing in production
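
For a sense of the shape, here is a minimal sketch of the factory pattern, assuming Airflow's TaskFlow API; the names and placeholder tasks below are illustrative, not this PR's actual code:

# Illustrative sketch only, not the PR's actual implementation.
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator

MEDIA_TYPES = ["audio", "image"]


def create_popularity_refresh_dag(media_type: str):
    @dag(
        dag_id=f"{media_type}_popularity_refresh",
        schedule=None,  # triggered manually for now
        catchup=False,
        max_active_runs=1,
        tags=["popularity_refresh"],
    )
    def popularity_refresh():
        # Placeholders standing in for the real steps: refresh the popularity
        # metrics and constants, trigger a batched_update per provider, and
        # notify Slack once the updates complete.
        refresh_metrics = EmptyOperator(task_id="refresh_popularity_metrics")
        notify_slack = EmptyOperator(task_id="notify_slack")
        refresh_metrics >> notify_slack

    return popularity_refresh()


for media_type in MEDIA_TYPES:
    globals()[f"{media_type}_popularity_refresh"] = create_popularity_refresh_dag(media_type)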

Testing Instructions

The TriggerDagRunOperator we're using to trigger the batched_updates is put in deferrable mode, in order to free the worker slot while waiting for the batched updates to complete. This requires a Triggerer to be running. Locally, I originally did this by running just catalog/shell and then airflow triggerer; the PR has since been updated so that running just down -v && just up starts the Triggerer automatically.

You'll also want to make sure you have DATA_REFRESH_POKE_INTERVAL=5 in your catalog/.env so that you don't have to wait 30 minutes for the TriggerDagRunOperator to re-run.
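
For example, assuming a plain dotenv format for catalog/.env:

# catalog/.env
# Shorten the poke interval so the deferred operator re-checks every
# 5 seconds instead of waiting 30 minutes.
DATA_REFRESH_POKE_INTERVAL=5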

Then run just init to get sample data in your local environment. The sample data all have null popularity scores, which you can verify by running just catalog/pgcli to open a pgcli session and then running:

SELECT COUNT(*) FROM image WHERE standardized_popularity IS NULL;

SELECT COUNT(*) FROM audio WHERE standardized_popularity IS NULL;

In both cases the result should be 5000.

Now, simply run the audio_popularity_refresh and image_popularity_refresh DAGs locally. Both should pass with no task failures. Go to http://localhost:9090/dags/batched_update/grid and verify that you see a separate DagRun for each provider that supports popularity data. Inspect the logs of the notify_updated_count task for each run; you should see:

  • INFO - Updated 3992 records for update: wikimedia_audio_popularity_refresh_20230707
  • INFO - Updated 828 records for update: freesound_popularity_refresh_20230707
  • INFO - Updated 180 records for update: jamendo_popularity_refresh_20230707
  • INFO - Updated 0 records for update: nappy_popularity_refresh_20230707
  • INFO - Updated 0 records for update: rawpixel_popularity_refresh_20230707
  • INFO - Updated 0 records for update: wikimedia_popularity_refresh_20230707
  • INFO - Updated 2500 records for update: flickr_popularity_refresh_20230707
  • INFO - Updated 2500 records for update: stocksnap_popularity_refresh_20230707

Where 0 records were updated, this is because there are no records for that provider in our sample data. Your query_ids will differ in the date suffix.


Because this PR also moves some popularity task factories around, run both data refreshes as well to ensure they pass.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc requested a review from a team as a code owner July 7, 2023 22:02
@stacimc stacimc self-assigned this Jul 7, 2023
@stacimc stacimc requested review from krysal and AetherUnbound July 7, 2023 22:02
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Jul 7, 2023
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Jul 7, 2023
Comment on lines +480 to +482
if media_type == AUDIO:
    table_name = TABLE_NAMES[AUDIO]
    standardized_popularity_func = STANDARDIZED_AUDIO_POPULARITY_FUNCTION
Member

Outside of the scope of this PR, maybe we could have some kind of MEDIA_TYPE_CONFIG dictionary in the future? Then we could only pass the media type to format_update_standardized_popularity_query and retrieve all the db columns, table names and so on from the config.
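
Purely to illustrate the idea (MEDIA_TYPE_CONFIG is hypothetical, the image-side constant is assumed by analogy with the audio one, and AUDIO, IMAGE, and TABLE_NAMES are taken to come from the existing module):

# Hypothetical sketch of the suggested config, not code from this PR.
MEDIA_TYPE_CONFIG = {
    AUDIO: {
        "table_name": TABLE_NAMES[AUDIO],
        "standardized_popularity_func": STANDARDIZED_AUDIO_POPULARITY_FUNCTION,
    },
    IMAGE: {
        "table_name": TABLE_NAMES[IMAGE],
        # Assumed to exist by analogy with the audio constant above.
        "standardized_popularity_func": STANDARDIZED_IMAGE_POPULARITY_FUNCTION,
    },
}


def format_update_standardized_popularity_query(media_type, **kwargs):
    # Only the media type is passed in; everything else comes from the config.
    config = MEDIA_TYPE_CONFIG[media_type]
    table_name = config["table_name"]
    standardized_popularity_func = config["standardized_popularity_func"]
    ...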

Collaborator

Oh, jinx with my own comments 😄

Collaborator Author

Re this and also this comment: 100% yes :) In fact I started doing that in this PR, but removed it because it created a much larger changeset. I kept to the current convention for this work, and I'll create an issue for refactoring this separately (unless you think it should be updated in this PR, @AetherUnbound?)

Collaborator

Ah right, because it's in sql.py there could be a lot more changes present. Yes, I think a separate issue is good!

@zackkrida (Member) left a comment

THIS IS SO COOL. I had the pleasure of forgetting to enable the batched_update DAG in the Airflow UI and got to see all of the refresh_popularity mapped instances be marked as deferred until I enabled it.

Works as described, and I don't see any suggestions for the code itself, so LGTM! Excited to use this and tune the timing intervals after we observe it in production.

@AetherUnbound (Collaborator)

    The TriggerDagRunOperator we're using to trigger the batched_updates is put in deferrable mode, in order to free the worker slot while waiting for the batched updates to complete. This requires a Triggerer to be running. Locally, I did this by running just catalog/shell and then airflow triggerer to start the Triggerer.

Woah interesting! Will we need to change the deployment steps in production to start a triggerer there as well?

@AetherUnbound (Collaborator) left a comment

This is fantastic! I'm so happy to see all of the pieces coming together here 😄 I was able to run the testing instructions locally and everything ran as expected. I have a few questions and notes, and want to mention the airflow triggerer again, as we'll likely need to make an adjustment for deployments prior to kicking this off.

Additionally, would you be willing to add some tests for the get_providers_update_confs function?
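
For instance, a test along these lines; the call signature and the AUDIO_POPULARITY_REFRESH config name are guesses based on the snippets in this PR:

# Hypothetical test sketch, not the tests actually added in this PR.
def test_get_providers_update_confs_builds_one_conf_per_provider():
    confs = get_providers_update_confs(POSTGRES_CONN_ID, AUDIO_POPULARITY_REFRESH)
    query_ids = [conf["query_id"] for conf in confs]
    # One batched_update conf per provider that supports popularity data,
    # each with a date-suffixed query_id.
    assert any(q.startswith("freesound_popularity_refresh_") for q in query_ids)
    assert any(q.startswith("jamendo_popularity_refresh_") for q in query_ids)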

catalog/dags/common/popularity/sql.py (resolved thread)

{
    # Uniquely identify the query
    "query_id": (
        f"{provider}_popularity_refresh_{last_updated_time.strftime('%Y%m%d')}"
Collaborator

Do you think this is fine enough resolution for this DAG? There wouldn't be a case where we'd run it twice in one day potentially?

Collaborator Author

The query_id is used for building the temp table in the triggered batched_update runs, which is dropped when the update is successful. If it's not successful, we should be managing that by clearing tasks or by manually triggering a new one with the resume_update param to use the existing temp table.

This DAG has max_active_runs=1, so I think the only way to get a collision would be to start a popularity refresh, fail it and at least one of the triggered batched_updates, and then retry the popularity_refresh DAG on the same day. In that case a failure seems fine, since that's not the intended way to handle errors for batched updates anyway.
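
For illustration, manually re-triggering batched_update against the existing temp table might use a conf like the following; only query_id and resume_update are named in this thread, so treat the exact shape as an assumption:

# Hypothetical DagRun conf for resuming a failed batched update.
conf = {
    "query_id": "freesound_popularity_refresh_20230707",
    "resume_update": True,
}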

    max_active_runs=1,
    catchup=False,
    doc_md=__doc__,
    tags=["popularity_refresh"],
Collaborator

Do you think it would make sense for this to also have the "data_refresh" tag? Or should we leave it off because it's independent?

Collaborator Author

When the project is complete, the data and popularity refreshes should be totally decoupled. There's definitely a case to be made that they're related but I think it might be more confusing to use the data_refresh tag here, in case it implies a relationship that no longer exists. But that's not a strongly held opinion :)

catalog/dags/popularity/dag_factory.py (outdated, resolved thread)
Comment on lines 148 to 157
refresh_popularity_scores = TriggerDagRunOperator.partial(
    task_id="refresh_popularity",
    trigger_dag_id=BATCHED_UPDATE_DAG_ID,
    # Wait for all the dagruns to finish
    wait_for_completion=True,
    # Release the worker slot while waiting
    deferrable=True,
    poke_interval=poke_interval,
    retries=0,
).expand(
    # Build the conf for each provider
    conf=get_providers_update_confs(POSTGRES_CONN_ID, popularity_refresh)
)
Collaborator

So cool that we're using dynamic tasks again!!

catalog/dags/popularity/popularity_refresh_types.py (outdated, resolved threads)
@@ -22,7 +23,9 @@
 UPDATE_MEDIA_POPULARITY_CONSTANTS_TASK_ID = "update_media_popularity_constants_view"


-def create_refresh_popularity_metrics_task_group(data_refresh: DataRefresh):
+def create_refresh_popularity_metrics_task_group(
+    refresh_config: DataRefresh | PopularityRefresh,
Collaborator

Very cool!!
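
For context, the broadened signature lets both DAG factories call the same task group builder; these call sites are illustrative, not the PR's exact code:

# Illustrative call sites only.
# From the data refresh DAG factory:
task_group = create_refresh_popularity_metrics_task_group(audio_data_refresh)
# From the new popularity refresh DAG factory:
task_group = create_refresh_popularity_metrics_task_group(audio_popularity_refresh)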

@openverse-bot (Collaborator)

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 6 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)².

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@krysal (Member) commented Jul 18, 2023

I'll put this on draft while Staci catches up.

@krysal krysal marked this pull request as draft July 18, 2023 17:28
@stacimc (Collaborator Author) commented Jul 20, 2023

I think that's all feedback addressed. I also modified the DAG slightly to pull get_last_updated_time out into its own task. This (1) ensures that the cutoff updated_on time is taken after completing the popularity constant refresh and (2) makes testing get_providers_update_confs much easier :)
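
A sketch of what that standalone task could look like; the TaskFlow form and the utcnow cutoff are assumptions, not the PR's exact code:

from datetime import datetime

from airflow.decorators import task


@task
def get_last_updated_time() -> datetime:
    # Captured only after the popularity constants refresh completes, so the
    # cutoff reflects the newly recalculated constants.
    return datetime.utcnow()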

@stacimc stacimc marked this pull request as ready for review July 20, 2023 00:16
@stacimc stacimc requested a review from AetherUnbound July 20, 2023 00:16
@stacimc stacimc marked this pull request as draft July 20, 2023 17:26
@stacimc (Collaborator Author) commented Jul 20, 2023

Actually, drafting again for a moment while I look into changes necessary to start the triggerer in production.

@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Jul 21, 2023
@stacimc stacimc marked this pull request as ready for review July 21, 2023 21:11
@stacimc stacimc requested a review from a team as a code owner July 21, 2023 21:11
@stacimc (Collaborator Author) commented Jul 21, 2023

I think we'll need to update the catalog's docker-compose in the infrastructure repo as well, in order to start the triggerer when deploying to production, although I'm not 100% sure what changes we'll need or how to test them. This PR has been updated so you can now run just down -v && just up and the triggerer will start automatically in local development.

@krysal (Member) left a comment

Awesome! Worked as indicated. It seems that we will need a new service instance for the catalog. Looks in line with the project recommendations.

@AetherUnbound (Collaborator)

I will be taking another look at this today!

@AetherUnbound (Collaborator) left a comment

LGTM! Thanks for looking into the triggerer stuff.

@stacimc stacimc merged commit 3d83f48 into main Jul 25, 2023
@stacimc stacimc deleted the add/pop-refresh-dag-factory branch July 25, 2023 20:19
@krysal krysal removed 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Aug 30, 2023
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🌟 goal: addition (Addition of new feature)
  • 🟨 priority: medium (Not blocking but should be addressed soon)
  • 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create popularity_refresh DAG factory
5 participants