-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update init scripts to use new data refresh #4962
Conversation
4adb242
to
737099c
Compare
Latest k6 run output1
Footnotes
|
Full-stack documentation: https://docs.openverse.org/_preview/4962 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
676e7f7
to
adc26dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM and I was able to get everything working as expected locally. I am requesting changes for a very small, but I think significant issue with the development experience of this change.
The data refresh DAGs will fail if the sensitive terms Airflow connection is not configured. A default configuration is in catalog/env.template
, but nothing in this change indicates a hard requirement to get new .env values (whether manually or by recreating your .env altogether). When the wait-for-index
check times out in load_sample_data, even if you check Airflow, all you will see is that the connection is missing.
I was able to figure it out and remembered how the sensitive terms connection works in the new data refresh, but that was with my context and experience on the project. Newer contributors will run into this, and will not have any direct indication of how to resolve the problem. In fact, they may not even know to check the DAG status.
I believe there is a relatively simple fix for this: in the load sample data script, grep catalog/.env
for the sensitive terms connection envvar. If it is not present, log a message and either:
- Automatically append the one from env.template to .env. In a local environment, for a script used only in CI (where this check will never be relevant) or locally, I feel this is fine.
- Just log the message but with instructions for adding the new connection manually, and then exit 1.
Both of these could eventually be removed, say in 2 or 3 months. There is no hard "backwards compatibility" requirement of our development environment (it would be untenable), but transitions should ideally to be smooth and transparent at the time.
The diff below solves this problem for me locally. I tested by removing the connection variable from my .env and re-running the init script:
diff --git a/load_sample_data.sh b/load_sample_data.sh
index 46df60c8f..702bba335 100755
--- a/load_sample_data.sh
+++ b/load_sample_data.sh
@@ -6,6 +6,18 @@ CACHE_SERVICE_NAME="${CACHE_SERVICE_NAME:-cache}"
UPSTREAM_DB_SERVICE_NAME="${UPSTREAM_DB_SERVICE_NAME:-upstream_db}"
DB_SERVICE_NAME="${DB_SERVICE_NAME:-db}"
+# `true` resolves to an empty string, and prevents the script from failing
+# due to `set -e` and grep's non-zero status code if the pattern isn't found
+has_sensitive_terms_airflow_conn=$(grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/.env || true)
+
+if [[ ! "$has_sensitive_terms_airflow_conn" ]]; then
+ echo "Adding new Airflow connection environment variable required for sample data loading"
+ grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/env.template >> catalog/.env
+
+ echo "Restarting Airflow to populate the new connection variable"
+ just dc restart webserver scheduler triggerer
+fi
+
while getopts 'c' OPTION; do
case "$OPTION" in
c)
Otherwise, this PR LGTM. I have some questions about the DevEx, but nothing that is blocking or can't be iterated on. Those are:
- Loading sample data takes noticeably longer than it did with the ingestion server. Is the DAG concurrency check necessary for local usage? Could we run audio and image data refresh concurrently when loading sample data? Given the regularity of recreating sample data locally (at least when working on some tickets), if there's anything we can do to cut down the amount of time this takes, it would be a great benefit.
- We trigger the DAGs in order of audio then image. Is it reliable that audio will always happen first? If not, if image could go first somehow, then the load sample data script would sit there waiting for audio much longer. I don't think it would reach the 5-minute time out, but, speaking personally, I would almost certainly kill the process if it was taking longer than 2 minutes.
- If it's even possible for the DAG order to be inconsistent, running the DAGs concurrently obviates any concern about it.
- There are four rather visible warnings emitted by Airflow CLI: "/home/airflow/.local/lib/python3.12/site-packages/airflow/cli/commands/dag_command.py:48 UserWarning: Could not import graphviz. Rendering graph to the graphical format will not be possible.". Can we suppress these? However unlikely it may seem, warnings could be taken as the cause if something else goes wrong in the load sample data script, and could confuse someone unfamiliar with how this all works.
To clarify again, none of those 3 are blockers, but I would create issues for any you agree are valid improvements to make after the fact.
The only blocker is making sure the sensitive terms connection doesn't cause unexpected and niche failures during this transitionary period.
@sarayourfriend Great call about the sensitive terms Airflow connection. I added your suggestion in d9b2a70; I agree, we should handle this proactively since it could break all the init scripts for any dev who is unaware of the need to add the new connection, and this is only a temporary measure. As to running the DAGs concurrently during the sample data script -- I see no reason why not! There were a number of naming collisions (connection ids, FDW, etc) that we didn't need to worry about when the data refreshes were guaranteed to be isolated, but that can be handled. I think it's worth getting this working in this PR because it should help with cutting down that time in the recreate/init scripts and CI. I've added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing!
The TaskGroups that created new indices were returning the name of the index from the taskgroup, in order to allow accessing the index names later in other parts of the DAG. Unfortunately this was breaking the expected dependency list in the DAG; only the returned task was being set upstream to downstream tasks, rather than the last task in the TaskGroup. TL;DR it was possible for, for example, creation of the filtered index to fail, but then table/index promotion would proceed anyway! This commit fixes that by pulling the tasks to generate the index names out of the TaskGroups so they can be used directly at any point in the DAG.
This is necessary for the load sample data script, so we can load sample data by running the staging DAGs. If they were on any kind of automated schedule, we would have to contend with scheduled runs in addition to our manual ones.
This allows the init scripts to run faster as the DAGs can run concurrently when there is no risk of negative impact to ES CPU.
…shes run concurrently
547aa42
to
276bd49
Compare
Fixes
Fixes #4153 by @stacimc
Description
Updates the
load_sample_data
script to load sample data by actually triggering the new data refreshes, removing this dependency on the ingestion server and also helping to test the new DAGs! If AIRFLOW_CONN_SENSITIVE_TERMS is not configured, it will set this up automatically by pulling the value from theenv.template
to prevent setup from breaking for folks who have not yet added this env variable (see discussion).It also fixes a few dependency issues that were noticed as part of this work, explained in commit messages.
Note that this change also updates CI, so the data refreshes are run during the setup for the api tests. This should go a long way to integration testing the data refreshes; if they break, or don't populate the sample data correctly, we should see test failures in CI.
The downside is that the overhead of Airflow, while it should be negligible in an actual production data refresh compared to the amount of time a refresh takes, does slow down the init scripts and CI by extension. I've seen the api tests in CI take 9-10 minutes, up from 5-7. To compensate for this I've taken @sarayourfriend's suggestion to add an option to allow the data refreshes to run concurrently, which is enabled in the init scripts.
Testing Instructions
First, remove
AIRFLOW_CONN_SENSITIVE_TERMS
from your.env
to verify that the init scripts will still work without it.Run
ov j recreate
. Once the catalog stack is up, you can open http://localhost:9090/dags/staging_audio_data_refresh/grid and watch as the DAG is actually triggered locally. Verify that the DAG (and the related image one) both pass, and that therecreate
works. Check out the image data refresh at the same time to see them running concurrently.Then try
ov j api/init
and verify this works successfully as well.CI should also pass, of course.
Checklist
Update index.md
).main
) or a parent feature branch.ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
ov just catalog/generate-docs media-props
for the catalog or
ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin