Add data_refresh pool creation to just init #2434

Closed
sarayourfriend opened this issue Jun 18, 2023 · 2 comments
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🌟 goal: addition Addition of new feature help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🔧 tech: just Involves Just.js

Comments

@sarayourfriend
Collaborator

Problem

When testing the data refresh DAGs locally, you must remember to create the data_refresh pool via the Airflow UI. The DAG does warn you about the missing pool, but the warning is easy to miss, and without the pool the DAG behaves unexpectedly (indefinitely rescheduling a particular task).

Description

We can programmatically create the data refresh pool in a new catalog/init recipe that is called by the root init recipe.

Using the Airflow CLI, the following command will create the required pool: `airflow pools set data_refresh 1 "Data refresh pool"`
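
For reference, a hedged sketch of creating and verifying the pool by hand, outside of just (the compose service name `scheduler` and the `docker compose exec` invocation are assumptions about the local setup, not the project's actual tooling):

```sh
# Hypothetical manual check; the service name "scheduler" is an assumption.
docker compose exec -u airflow scheduler airflow pools set data_refresh 1 "Data refresh pool"

# Verify the pool exists with the expected slot count.
docker compose exec -u airflow scheduler airflow pools get data_refresh
```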

Add a new recipe to catalog/justfile for running airflow CLI commands. A diff along the following lines may work:

diff --git a/catalog/justfile b/catalog/justfile
index 9689c2267..e595691ba 100644
--- a/catalog/justfile
+++ b/catalog/justfile
@@ -72,9 +72,12 @@ recreate:
 # Administration #
 ##################
 
+_exec +cmd:
+    env DC_USER="airflow" just ../exec {{ SERVICE }} {{ cmd }}
+
 # Launch a Bash shell in an existing container under `SERVICE`
 shell:
-    env DC_USER="airflow" just ../exec {{ SERVICE }} /bin/bash
+    just _exec /bin/bash
 
 # Launch an IPython shell in a new container under `SERVICE`
 ipython: up-deps
@@ -88,6 +91,15 @@ ipython: up-deps
 pgcli db_user_pass="deploy" db_name="openledger": up
     just ../_pgcli upstream_db {{ db_user_pass }} {{ db_name }} upstream_db
 
+# Execute an Airflow CLI command: `just airflow-cli help`
+airflow-cli +cmd:
+    just _exec airflow {{ cmd }}
+
+# Initialise catalog resources
+init: up
+    just airflow-cli pools set data_refresh 1 "Data refresh pool"
+
+
 #########
 # Tests #
 #########
diff --git a/justfile b/justfile
index c383721a1..642744945 100644
--- a/justfile
+++ b/justfile
@@ -181,6 +181,7 @@ wait-up: up
 init:
     just api/init
     just frontend/init
+    just catalog/init
 
 # Take all Docker services down, in all profiles
 down *flags:

However, it may be necessary to create a wait-up command for the catalog to ensure that the Airflow CLI is ready to be used.
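
A minimal sketch of what that could look like in catalog/justfile, reusing the _exec helper from the diff above and using `airflow db check` as the readiness probe (the recipe itself, the retry count, and the sleep interval are assumptions, not existing code):

```just
# Hypothetical recipe, not in the existing justfile: wait for the Airflow CLI.
wait-up: up
    # Poll `airflow db check` until the metadata DB is reachable; give up after ~60s.
    for i in $(seq 1 30); do \
        just _exec airflow db check && exit 0 || sleep 2; \
    done; \
    echo "Airflow CLI was not ready in time" && exit 1
```

The init recipe from the diff could then depend on wait-up instead of up.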

@sarayourfriend sarayourfriend added help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🌟 goal: addition Addition of new feature 🤖 aspect: dx Concerns developers' experience with the codebase 🔧 tech: airflow Involves Apache Airflow 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: just Involves Just.js labels Jun 18, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Jun 18, 2023
@sarayourfriend
Collaborator Author

@AetherUnbound @stacimc Do y'all feel that this is necessary? I wasn't aware of #2352, which Staci showed me. If it's not too bad to use the "correct" pool locally, would it be better to more closely mirror the live configuration for the DAG, just in case? I don't know how often it would really come into play locally.

@AetherUnbound
Collaborator

It would be more accurate to production, but I don't think it would be necessary. #2352 should cover the case we had previously where we were having to recreate the pool ourselves manually. If anything it might be more limiting locally if we were trying to test multiple things at once and reduce overall testing time. Since pools only dictate task-level concurrency constraints, #2480 will still function exactly the same locally as it would in production regardless of pools. If we were to go with this approach, it feels like we're more testing the features of Airflow than of our own flow management, which I don't think would be beneficial.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Jun 26, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to 📋 Backlog in Openverse Backlog Jun 26, 2023
@sarayourfriend sarayourfriend closed this as not planned Jun 26, 2023
@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Jun 26, 2023