Create popularity_refresh DAG factory #2089

Closed
stacimc opened this issue May 12, 2023 · 0 comments · Fixed by #2592
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

Collaborator

stacimc commented May 12, 2023

Description

Create a popularity_refresh_dag_factory similar to the DAG factories for provider and data refresh DAGs. For each media_type, it should generate a <media_type>_popularity_refresh DAG which does the following:

  1. Update the <media>_popularity_metrics table to include any newly added
    metrics.
  2. Refresh the <media>_popularity_constants view to recalculate the popularity
    constants.
    1. This is done CONCURRENTLY so that provider DAGs can continue reading
      from the view while it updates.
  3. For each unique provider in the <media>_popularity_constants view,
    generate a refresh_<provider>_scores task. The task will run an UPDATE of
    the standardized_popularity on all records matching that provider which
    were last updated before the task began.
    1. We may consider running the refresh_<provider>_scores tasks in parallel
      to speed up the update.
    2. Optionally, we can hard-code a SKIPLIST of providers that are present in
      the <media>_popularity_constants view but for which we do not want to
      create a refresh task. Some providers (Nappy, Rawpixel, Stocksnap)
      support popularity data but are not dated, meaning scores for all of
      their records will be updated the next time the provider DAG runs.
      Note, however, that some of these provider DAGs are on a @monthly
      schedule, so skipping them in this DAG could delay recalculation of
      their scores.
  4. Report to Slack when the scores have finished updating.
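
To make the steps above concrete, here is a rough sketch of the task plan the factory might produce for one media type. It is plain Python with no Airflow imports, so it only builds task ids and SQL strings; a real factory would wrap each entry in an Airflow task. Everything except the `<media_type>_popularity_refresh` and `refresh_<provider>_scores` names is an assumption for illustration: the helper names, the other task ids, the lowercase provider identifiers, the media table being named after the media type, the `updated_on` column, and the `standardized_<media>_popularity` SQL function. It also assumes the constants view is a PostgreSQL materialized view, since that is what `CONCURRENTLY` applies to.

```python
from datetime import datetime, timezone

# Assumption: providers to skip, keyed by media type. The issue names Nappy,
# Rawpixel and Stocksnap; the lowercase identifiers are illustrative.
SKIPLIST = {"image": {"nappy", "rawpixel", "stocksnap"}}


def refresh_constants_sql(media_type: str) -> str:
    """Step 2: refresh the constants view without blocking readers."""
    # Assumes a PostgreSQL materialized view.
    return (
        "REFRESH MATERIALIZED VIEW CONCURRENTLY "
        f"{media_type}_popularity_constants;"
    )


def refresh_scores_sql(media_type: str) -> str:
    """Step 3: parameterized UPDATE for a single provider's records."""
    # Hypothetical table, column and function names.
    return (
        f"UPDATE {media_type} "
        "SET standardized_popularity = "
        f"standardized_{media_type}_popularity(provider, meta_data) "
        "WHERE provider = %(provider)s AND updated_on < %(started_at)s;"
    )


def plan_popularity_refresh(media_type, providers, started_at=None):
    """Return the DAG id and ordered task list for one media type."""
    started_at = started_at or datetime.now(timezone.utc)
    skipped = SKIPLIST.get(media_type, set())
    tasks = [
        # Step 1: SQL omitted, since the metrics table layout is not given.
        {"task_id": f"update_{media_type}_popularity_metrics"},
        {
            "task_id": f"refresh_{media_type}_popularity_constants",
            "sql": refresh_constants_sql(media_type),
        },
    ]
    # One refresh task per unique provider that is not on the skiplist.
    for provider in sorted(set(providers) - skipped):
        tasks.append(
            {
                "task_id": f"refresh_{provider}_scores",
                "sql": refresh_scores_sql(media_type),
                "params": {"provider": provider, "started_at": started_at},
            }
        )
    # Step 4: notification once all scores have been updated.
    tasks.append({"task_id": "report_to_slack"})
    return {"dag_id": f"{media_type}_popularity_refresh", "tasks": tasks}


if __name__ == "__main__":
    # Example provider list; in the real DAG this would come from the view.
    plan = plan_popularity_refresh("image", ["flickr", "nappy", "wikimedia"])
    print(plan["dag_id"], [task["task_id"] for task in plan["tasks"]])
```

Passing `started_at` as a query parameter keeps the "last updated before the task began" cutoff fixed for the whole UPDATE. Whether the per-provider tasks run in sequence or in parallel is left open here, as it is in the description.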

Refer to the relevant section of the implementation plan (IP) to ensure that the refresh tasks avoid deadlocks and timeouts.

Additional context

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels May 12, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog May 12, 2023
@stacimc stacimc self-assigned this May 17, 2023
@stacimc stacimc mentioned this issue Jun 6, 2023
@zackkrida zackkrida moved this from 📋 Backlog to 📅 To do in Openverse Backlog Jul 5, 2023
@stacimc stacimc moved this from 📅 To do to 🏗 In progress in Openverse Backlog Jul 6, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Openverse Backlog Jul 25, 2023