
just recipe to facilitate running ingestion DAGs for testing #2999

Open
sarayourfriend opened this issue Sep 8, 2023 · 2 comments
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

sarayourfriend (Collaborator) commented Sep 8, 2023

Problem

A lot of catalogue PRs require running provider DAGs to ingest a certain amount of testing data. This requires a lot of fiddly clicking around the Airflow UI: attending to the running DAGs, checking whether they've pulled sufficient data (usually a couple of hundred records), manually marking them as successful, and so forth. It's not hard; it's just tedious.

Description

To facilitate this, it would be nice to have a just recipe that automatically handles these things via the Airflow CLI. There could be separate recipes for each media type, with a derived recipe that runs both for you. The recipes would:

  1. Activate the DAG and confirm a new run starts after a second or two.
  2. Monitor the DAG output and then mark the pull_data task as successful after a couple of hundred records are pulled.
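The two steps above could be sketched as shell that a just recipe wraps. This is only a sketch: the default DAG ID, the record threshold, the `run_id` parsing, and especially how pulled records are counted are all assumptions, with the hypothetical `count_records` helper standing in for whatever mechanism (grepping task logs, querying the loading table) we'd settle on:

```shell
#!/usr/bin/env sh
# Sketch backing a hypothetical `just ingest-test` recipe. Assumes the
# Airflow CLI is reachable (e.g. exec'd inside the scheduler container).
dag_id="${1:-jamendo_workflow}"  # placeholder default provider DAG ID
threshold="${2:-250}"            # "a couple of hundred records"

# 1. Activate the DAG; the scheduler should start a run shortly after.
airflow dags unpause "$dag_id"
sleep 5
run_id="$(airflow dags list-runs -d "$dag_id" --state running -o plain \
  | awk 'NR == 1 { print $2 }')"  # column position is a guess

# 2. Poll until enough records are pulled, then mark pull_data successful.
#    `count_records` is hypothetical; how to count is the open question.
while [ "$(count_records "$dag_id" "$run_id")" -lt "$threshold" ]; do
  sleep 10
done
# --mark-success records the task instance as successful without running it
airflow tasks run "$dag_id" pull_data "$run_id" --mark-success
```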

For bonus points, the recipe could also check the catalogue .env file to confirm that you've properly configured the required API keys and give a helpful message when you haven't. It could also take a parameter to override the particular provider you want to run, maybe defaulting to the easiest DAGs to configure for local use. The typical suggested providers are Jamendo for audio and Flickr or SMK for images.
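The bonus-points .env check could be a tiny shell helper. The key names used in the usage example below (`JAMENDO_APP_KEY`, `FLICKR_API_KEY`) are illustrative assumptions, not the catalogue's actual variable names:

```shell
#!/usr/bin/env sh
# Sketch: verify a key is set to a non-empty value in the catalogue .env
# before triggering its provider DAG; print a helpful message otherwise.
check_env_key() {
  env_file="$1"
  key="$2"
  # match "KEY=" followed by at least one character
  if grep -q "^${key}=..*" "$env_file" 2>/dev/null; then
    return 0
  fi
  echo "missing: set ${key} in ${env_file} before running this DAG" >&2
  return 1
}
```

A recipe could call this once per provider, e.g. `check_env_key catalog/.env JAMENDO_APP_KEY` (path and key name assumed), and bail out with the message when the key is absent or empty.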

Additional context

See this PR's testing instructions for a good example of this workflow: #2964

There are other steps that are also quite common for testing changes, like running the data refreshes. Those also require fiddly clicking around to enable related DAGs (though that might soon not be necessary if #2981 is implemented in a way that removes the reliance on the filtered index DAG from the data refresh DAG, i.e., if the data refresh DAG takes on those tasks itself, rather than triggering the related DAG). If this is a good idea (@WordPress/openverse-catalog), then we could create further recipes for running the data refreshes and popularity calculations and so forth as well.

Additional benefits from this could be gained if we also integrated these into our catalogue CI test suite. Considering data refresh is a critical path for the catalogue, it would be great if we could do an integration test for it on every change to the Airflow configuration and DAGs or to the ingestion server API.

@sarayourfriend sarayourfriend added 🟩 priority: low Low priority and doesn't need to be rushed 🌟 goal: addition Addition of new feature 🤖 aspect: dx Concerns developers' experience with the codebase 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 8, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Sep 8, 2023
sarayourfriend (Collaborator, Author) commented:

One side note: it could be that my lack of familiarity with Airflow is leading me down a more fiddly approach to running individual provider workflows. Right now I do this:

  1. Search for the provider workflow in the "search DAGs" box in Airflow
  2. Select the provider workflow name, which takes me to the DAGs overview page
  3. Activate the DAG by clicking the toggle switch
  4. Repeat for other provider workflows if necessary
  5. Go back to the DAGs list and filter for active DAGs
  6. Wait around a bit
  7. Click each DAG's active task, which takes me to the DAG runs list page (rather than the DAG overview)
  8. Click again on the DAG run to get taken to the DAG grid view OR graph
  9. If I clicked the worst option and got taken to the graph, then I have to remember how to find the active task (it's the subtle green border on the box), which opens a popup.
  10. If I clicked the better option and got taken to the grid view, I need to check if I'm on the DAG run overall or on the task (depending on if I decided to click the active DAG run or the active DAG task from the DAGs list, I'll end up in one or the other). If I'm on the overall run, then click on the task.
  11. Finally mark the task as successful, either via the popup from step 9 or on the grid view's individual task view from step 10.
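For reference, steps 1–11 above roughly collapse to three Airflow 2 CLI invocations (the DAG and run IDs below are placeholders):

```shell
# Steps 1–4: activate the provider DAG without hunting through the UI
airflow dags unpause <provider_dag_id>

# Steps 5–8: list active runs instead of clicking into grid/graph views
airflow dags list-runs -d <provider_dag_id> --state running

# Steps 9–11: mark pull_data successful directly (--mark-success records
# the task instance as successful without executing it)
airflow tasks run <provider_dag_id> pull_data <run_id> --mark-success
```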

Anyway, all of that is pretty fiddly, so it would be nice to have a just recipe, from my perspective. However, if the way that I'm doing this is making things harder on me than it needs to be (like if there are easier shortcuts for these things) then let me know. It's less of a lift for me to learn how to use Airflow better than to write new code to facilitate this (even if the code could also have benefits for integration testing).

AetherUnbound (Collaborator) commented:

I love this idea! Step 2 in your description should already be covered by the INGESTION_LIMIT variable, which I'm realizing we don't have a default for in env.template 😮 I have mine set to AIRFLOW_VAR_INGESTION_LIMIT=250 in my .env; we should probably set a default for that. It stops all processing after that many records have been pulled, which would solve the case of not letting the triggered DAGs run too long. We could employ the Airflow CLI for almost all of this, I think!
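The suggested default could be a one-line addition to the template (a sketch; the exact file path and comment wording would follow the repo's conventions):

```shell
# env.template (sketch): cap local ingestion so test DAG runs stop early
AIRFLOW_VAR_INGESTION_LIMIT=250
```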
