
Add popularity refresh DAGs #2592

Merged
stacimc merged 10 commits into main from add/pop-refresh-dag-factory on Jul 25, 2023

Conversation

@stacimc (Collaborator) commented Jul 7, 2023

Fixes

Fixes #2089 by @stacimc

Description

Adds a popularity refresh DAG factory that generates audio_popularity_refresh and image_popularity_refresh DAGs (a rough sketch follows the list below). Each DAG:

  • refreshes popularity metrics and constants (recalculating the popularity constants)
  • then, triggers a batched_update DagRun for each of the providers of this media type that support popularity data
  • waits for the batched updates to complete and notifies Slack
(Screenshot: Screen Shot 2023-07-19 at 5 13 21 PM)
  • is currently on a None schedule (so it must be triggered manually) and has very generous timeouts; these will be revisited in follow-up PRs after testing in production
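
For a sense of the shape, here is a minimal sketch of the factory pattern, assuming Airflow's TaskFlow API; the names and placeholder tasks below are illustrative, not this PR's actual code:

# Illustrative sketch only, not the PR's actual implementation.
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator

MEDIA_TYPES = ["audio", "image"]


def create_popularity_refresh_dag(media_type: str):
    @dag(
        dag_id=f"{media_type}_popularity_refresh",
        schedule=None,  # triggered manually for now
        catchup=False,
        max_active_runs=1,
        tags=["popularity_refresh"],
    )
    def popularity_refresh():
        # Placeholders standing in for the real steps: refresh the popularity
        # metrics and constants, trigger a batched_update per provider, and
        # notify Slack once the updates complete.
        refresh_metrics = EmptyOperator(task_id="refresh_popularity_metrics")
        notify_slack = EmptyOperator(task_id="notify_slack")
        refresh_metrics >> notify_slack

    return popularity_refresh()


for media_type in MEDIA_TYPES:
    globals()[f"{media_type}_popularity_refresh"] = create_popularity_refresh_dag(media_type)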

Testing Instructions

The TriggerDagRunOperator we're using to trigger the batched_updates is put in deferrable mode, in order to free the worker slot while waiting for the batched updates to complete. This requires a Triggerer to be running. Locally, I originally did this by running just catalog/shell and then airflow triggerer; the PR has since been updated so that running just down -v && just up starts the Triggerer automatically.

You'll also want to make sure you have DATA_REFRESH_POKE_INTERVAL=5 in your catalog/.env so that you don't have to wait 30 minutes for the TriggerDagRunOperator to re-run.
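
For example, assuming a plain dotenv format for catalog/.env:

# catalog/.env
# Shorten the poke interval so the deferred operator re-checks every
# 5 seconds instead of waiting 30 minutes.
DATA_REFRESH_POKE_INTERVAL=5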

Then run just init to get sample data in your local environment. The sample data all have null popularity scores, which you can verify by running just catalog/pgcli to open a pgcli session and then running:

SELECT COUNT(*) FROM image WHERE standardized_popularity IS NULL;

SELECT COUNT(*) FROM audio WHERE standardized_popularity IS NULL;

In both cases the result should be 5000.

Now, simply run the audio_popularity_refresh and image_popularity_refresh DAGs locally. Both should pass with no task failures. Go to http://localhost:9090/dags/batched_update/grid and verify that you see a separate DagRun for each provider that supports popularity data. Inspect the logs of the notify_updated_count task for each run; you should see:

  • INFO - Updated 3992 records for update: wikimedia_audio_popularity_refresh_20230707
  • INFO - Updated 828 records for update: freesound_popularity_refresh_20230707
  • INFO - Updated 180 records for update: jamendo_popularity_refresh_20230707
  • INFO - Updated 0 records for update: nappy_popularity_refresh_20230707
  • INFO - Updated 0 records for update: rawpixel_popularity_refresh_20230707
  • INFO - Updated 0 records for update: wikimedia_popularity_refresh_20230707
  • INFO - Updated 2500 records for update: flickr_popularity_refresh_20230707
  • INFO - Updated 2500 records for update: stocksnap_popularity_refresh_20230707

Where 0 records were updated, this is because there are no records for that provider in our sample data. Your query_ids will differ in the date suffix.


Because this PR also moves some popularity task factories around, run both data refreshes as well to ensure they pass.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc requested a review from a team as a code owner July 7, 2023 22:02
@stacimc stacimc self-assigned this Jul 7, 2023
@stacimc stacimc requested review from krysal and AetherUnbound July 7, 2023 22:02
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Jul 7, 2023
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Jul 7, 2023
Comment on lines +480 to +482
if media_type == AUDIO:
    table_name = TABLE_NAMES[AUDIO]
    standardized_popularity_func = STANDARDIZED_AUDIO_POPULARITY_FUNCTION
Member

Outside of the scope of this PR, maybe we could have some kind of MEDIA_TYPE_CONFIG dictionary in the future? Then we could only pass the media type to format_update_standardized_popularity_query and retrieve all the db columns, table names and so on from the config.
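
Purely to illustrate the idea (MEDIA_TYPE_CONFIG is hypothetical, the image-side constant is assumed by analogy with the audio one, and AUDIO, IMAGE, and TABLE_NAMES are taken to come from the existing module):

# Hypothetical sketch of the suggested config, not code from this PR.
MEDIA_TYPE_CONFIG = {
    AUDIO: {
        "table_name": TABLE_NAMES[AUDIO],
        "standardized_popularity_func": STANDARDIZED_AUDIO_POPULARITY_FUNCTION,
    },
    IMAGE: {
        "table_name": TABLE_NAMES[IMAGE],
        # Assumed to exist by analogy with the audio constant above.
        "standardized_popularity_func": STANDARDIZED_IMAGE_POPULARITY_FUNCTION,
    },
}


def format_update_standardized_popularity_query(media_type, **kwargs):
    # Only the media type is passed in; everything else comes from the config.
    config = MEDIA_TYPE_CONFIG[media_type]
    table_name = config["table_name"]
    standardized_popularity_func = config["standardized_popularity_func"]
    ...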

Collaborator

Oh, jinx with my own comments 😄

Collaborator Author

Re this and also this comment: 100% yes :) In fact I started doing that in this PR, but removed it because it created a much larger changeset. I kept to the current convention for this work, and I'll create an issue for refactoring this separately (unless you think it should be updated in this PR, @AetherUnbound?)

Collaborator

Ah right, because it's in sql.py there could be a lot more changes present. Yes, I think a separate issue is good!

@zackkrida (Member) left a comment

THIS IS SO COOL. I had the pleasure of forgetting to enable the batched_update DAG in the Airflow UI and got to see all of the refresh_popularity mapped instances be marked as deferred until I enabled it.

Works as described, and I don't see any suggestions for the code itself, so LGTM! Excited to use this and tune the timing intervals after we observe it in production.

@AetherUnbound (Collaborator)

    The TriggerDagRunOperator we're using to trigger the batched_updates is put in deferrable mode, in order to free the worker slot while waiting for the batched updates to complete. This requires a Triggerer to be running. Locally, I did this by running just catalog/shell and then airflow triggerer to start the Triggerer.

Woah interesting! Will we need to change the deployment steps in production to start a triggerer there as well?

@AetherUnbound (Collaborator) left a comment

This is fantastic! I'm so happy to see all of the pieces coming together here 😄 I was able to run the testing instructions locally and everything ran as expected. I have a few questions and notes, and want to mention the airflow triggerer again, as we'll likely need to make an adjustment for deployments prior to kicking this off.

Additionally, would you be willing to add some tests for the get_providers_update_confs function?
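
For instance, a test along these lines; the call signature and the AUDIO_POPULARITY_REFRESH config name are guesses based on the snippets in this PR:

# Hypothetical test sketch, not the tests actually added in this PR.
def test_get_providers_update_confs_builds_one_conf_per_provider():
    confs = get_providers_update_confs(POSTGRES_CONN_ID, AUDIO_POPULARITY_REFRESH)
    query_ids = [conf["query_id"] for conf in confs]
    # One batched_update conf per provider that supports popularity data,
    # each with a date-suffixed query_id.
    assert any(q.startswith("freesound_popularity_refresh_") for q in query_ids)
    assert any(q.startswith("jamendo_popularity_refresh_") for q in query_ids)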

catalog/dags/common/popularity/sql.py (resolved thread)

{
    # Uniquely identify the query
    "query_id": (
        f"{provider}_popularity_refresh_{last_updated_time.strftime('%Y%m%d')}"
Collaborator

Do you think this is fine enough resolution for this DAG? There wouldn't be a case where we'd run it twice in one day potentially?

Collaborator Author

The query_id is used for building the temp table in the triggered batched_update runs, which is dropped when the update is successful. If it's not successful, we should be managing that by clearing tasks or by manually triggering a new one with the resume_update param to use the existing temp table.

This DAG has max_active_runs=1, so I think the only way to get a collision would be to start a popularity refresh, fail it and at least one of the triggered batched_updates, and then retry the popularity_refresh DAG on the same day. In that case a failure seems fine, since that's not the intended way to handle errors for batched updates anyway.
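
For illustration, manually re-triggering batched_update against the existing temp table might use a conf like the following; only query_id and resume_update are named in this thread, so treat the exact shape as an assumption:

# Hypothetical DagRun conf for resuming a failed batched update.
conf = {
    "query_id": "freesound_popularity_refresh_20230707",
    "resume_update": True,
}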

    max_active_runs=1,
    catchup=False,
    doc_md=__doc__,
    tags=["popularity_refresh"],
Collaborator

Do you think it would make sense for this to also have the "data_refresh" tag? Or should we leave it off because it's independent?

Collaborator Author

When the project is complete, the data and popularity refreshes should be totally decoupled. There's definitely a case to be made that they're related but I think it might be more confusing to use the data_refresh tag here, in case it implies a relationship that no longer exists. But that's not a strongly held opinion :)

catalog/dags/popularity/dag_factory.py (outdated, resolved thread)
Comment on lines 148 to 157
refresh_popularity_scores = TriggerDagRunOperator.partial(
    task_id="refresh_popularity",
    trigger_dag_id=BATCHED_UPDATE_DAG_ID,
    # Wait for all the dagruns to finish
    wait_for_completion=True,
    # Release the worker slot while waiting
    deferrable=True,
    poke_interval=poke_interval,
    retries=0,
).expand(
    # Build the conf for each provider
    conf=get_providers_update_confs(POSTGRES_CONN_ID, popularity_refresh)
)
Collaborator

So cool that we're using dynamic tasks again!!

catalog/dags/popularity/popularity_refresh_types.py (outdated, resolved threads)
@@ -22,7 +23,9 @@
 UPDATE_MEDIA_POPULARITY_CONSTANTS_TASK_ID = "update_media_popularity_constants_view"


-def create_refresh_popularity_metrics_task_group(data_refresh: DataRefresh):
+def create_refresh_popularity_metrics_task_group(
+    refresh_config: DataRefresh | PopularityRefresh,
Collaborator

Very cool!!
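
For context, the broadened signature lets both DAG factories call the same task group builder; these call sites are illustrative, not the PR's exact code:

# Illustrative call sites only.
# From the data refresh DAG factory:
task_group = create_refresh_popularity_metrics_task_group(audio_data_refresh)
# From the new popularity refresh DAG factory:
task_group = create_refresh_popularity_metrics_task_group(audio_popularity_refresh)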

@openverse-bot (Collaborator)

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 6 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)².

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@krysal (Member) commented Jul 18, 2023

I'll put this on draft while Staci catches up.

@krysal krysal marked this pull request as draft July 18, 2023 17:28
@stacimc (Collaborator Author) commented Jul 20, 2023

I think that's all feedback addressed. I also modified the DAG slightly to pull get_last_updated_time out into its own task. This (1) ensures that the cutoff updated_on time is taken after completing the popularity constant refresh and (2) makes testing get_providers_update_confs much easier :)
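
A sketch of what that standalone task could look like; the TaskFlow form and the utcnow cutoff are assumptions, not the PR's exact code:

from datetime import datetime

from airflow.decorators import task


@task
def get_last_updated_time() -> datetime:
    # Captured only after the popularity constants refresh completes, so the
    # cutoff reflects the newly recalculated constants.
    return datetime.utcnow()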

@stacimc stacimc marked this pull request as ready for review July 20, 2023 00:16
@stacimc stacimc requested a review from AetherUnbound July 20, 2023 00:16
@stacimc stacimc marked this pull request as draft July 20, 2023 17:26
@stacimc (Collaborator Author) commented Jul 20, 2023

Actually, drafting again for a moment while I look into changes necessary to start the triggerer in production.

@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Jul 21, 2023
@stacimc stacimc marked this pull request as ready for review July 21, 2023 21:11
@stacimc stacimc requested a review from a team as a code owner July 21, 2023 21:11
@stacimc (Collaborator Author) commented Jul 21, 2023

I think we'll need to update the catalog's docker-compose in the infrastructure repo as well, in order to start the triggerer when deploying to production, although I'm not 100% sure what changes we'll need or how to test them. This PR has been updated so you can now run just down -v && just up and the triggerer will start automatically in local development.

@krysal (Member) left a comment

Awesome! Worked as indicated. It seems that we will need a new service instance for the catalog. Looks in line with the project recommendations.

@AetherUnbound (Collaborator)

I will be taking another look at this today!

@AetherUnbound (Collaborator) left a comment

LGTM! Thanks for looking into the triggerer stuff.

@stacimc stacimc merged commit 3d83f48 into main Jul 25, 2023
@stacimc stacimc deleted the add/pop-refresh-dag-factory branch July 25, 2023 20:19
@krysal krysal removed 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Aug 30, 2023
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🌟 goal: addition (Addition of new feature)
  • 🟨 priority: medium (Not blocking but should be addressed soon)
  • 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create popularity_refresh DAG factory
5 participants