just recipe to facilitate running ingestion DAGs for testing #2999
Labels
🤖 aspect: dx
Concerns developers' experience with the codebase
🌟 goal: addition
Addition of new feature
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Problem
A lot of catalogue PRs require running provider DAGs to ingest a certain amount of testing data. This requires a lot of fiddly clicking around the Airflow UI, including attending to the running DAGs, checking whether they've pulled sufficient data (usually a couple of hundred records), and manually marking them as successful, and so forth. It's not hard, it's just tedious.
Description
To facilitate this, it would be nice to have a just recipe that automatically handles these things, via commands to the airflow CLI. There could be separate recipes for each media type, with a derived recipe that runs both for you. The recipes would:
For bonus points, the recipe could also check the catalogue .env file to confirm that you've properly configured the required API keys and give a helpful message when you haven't. It could also take a parameter to override the particular provider you want to run, maybe defaulting to the easiest DAGs to configure for local use. The typical suggested providers are Jamendo for audio and Flickr or SMK for images.
Additional context
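As a rough illustration of the .env check and CLI calls proposed in the description, here is a minimal Python sketch of the logic a recipe could delegate to. The function names, the example key name, and the DAG id are hypothetical and not taken from the catalogue; the only Airflow commands assumed are `airflow dags unpause` and `airflow dags trigger`. The sketch only builds the command lists and does not run them.

```python
from typing import Iterable


def missing_env_keys(env_text: str, required: Iterable[str]) -> list[str]:
    """Return the required keys that are absent or empty in dotenv-style text."""
    values: dict[str, str] = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not values.get(key)]


def airflow_commands(dag_id: str) -> list[list[str]]:
    """Build the CLI calls to unpause and then trigger one DAG.

    The dag_id is supplied by the caller; real provider DAG ids are not
    assumed here.
    """
    return [
        ["airflow", "dags", "unpause", dag_id],
        ["airflow", "dags", "trigger", dag_id],
    ]


def helpful_message(missing: list[str]) -> str:
    """Format a short hint listing the keys that still need values."""
    if not missing:
        return "All required API keys are set."
    return "Missing API keys in .env: " + ", ".join(missing)
```

A recipe could pass the contents of the .env file and a list of required key names (for example a hypothetical `API_KEY_EXAMPLE`) to `missing_env_keys`, print `helpful_message`, and only hand the lists from `airflow_commands` to something like `subprocess.run` inside the Airflow container when nothing is missing.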
See this PR's testing instructions for a good example of this workflow: #2964
There are other steps that are also quite common for testing changes, like running the data refreshes. Those also require fiddly clicking around to enable related DAGs (though that might soon not be necessary if #2981 is implemented in a way that removes the reliance on the filtered index DAG from the data refresh DAG, i.e., if the data refresh DAG takes on those tasks itself, rather than triggering the related DAG). If this is a good idea (@WordPress/openverse-catalog), then we could create further recipes for running the data refreshes and popularity calculations and so forth as well.
Additional benefits from this could be gained if we also integrated these into our catalogue CI test suite. Considering data refresh is a critical path for the catalogue, it would be great if we could do an integration test for it on every change to the Airflow configuration and DAGs or to the ingestion server API.