Apply task instance mutation hook consistently #38440
Merged
We are using Cluster Policies, specifically the "Task Instance Mutation" feature, to route workload to the respective endpoint. "Respective endpoint" means that we use multiple Celery queues and distribute the work across them. The distribution is based on workflow metadata, and we don't want to add the routing complexity into the workflow itself (modelling the workflow statically for all routing combinations), so task instance mutation is the only option.
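A minimal sketch of such a routing policy, as it might appear in airflow_local_settings.py. The hook name task_instance_mutation_hook and the queue and dag_id attributes are Airflow's; the prefix table and queue names are hypothetical and only stand in for routing on workflow metadata.

```python
# Hypothetical routing table: DAG id prefix -> Celery queue name.
QUEUE_BY_DAG_PREFIX = {
    "etl_": "etl_queue",
    "ml_": "gpu_queue",
}


def task_instance_mutation_hook(task_instance):
    """Route a task instance to a Celery queue based on its DAG id.

    Airflow calls this hook with a TaskInstance; only the dag_id and
    queue attributes are used here. The queue is left unchanged when
    no prefix matches.
    """
    for prefix, queue in QUEUE_BY_DAG_PREFIX.items():
        if task_instance.dag_id.startswith(prefix):
            task_instance.queue = queue
            return
```

Keeping the rule in the policy means the DAG code does not need to know about queues at all.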
As discussed in #32471, task instance mutation generally works well for the first execution, but we saw a couple of errors afterwards.
The root cause is that after initial task creation, defaults are reloaded from Python code many times on multiple levels. The culprit seems to be TaskInstance._refresh_from_task(). Fixing these two lines as in this PR removes all of the problems described above. The trade-off is that the policy code is executed a lot more often. But assuming the policy is implemented without significant overhead, this should not have a noticeable performance impact.
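The effect can be illustrated with a toy model. This is not Airflow's implementation: ToyTask, ToyTaskInstance and route_to_special_queue are hypothetical stand-ins that only show why a mutation is lost when defaults are reloaded, and how re-applying the hook after the reload keeps it.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Stand-in for an operator; holds the defaults defined in DAG code."""
    task_id: str
    queue: str = "default"


@dataclass
class ToyTaskInstance:
    """Stand-in for a task instance (not Airflow's TaskInstance)."""
    task_id: str
    queue: str = "default"

    def refresh_from_task(self, task, hook=None):
        # Reloading defaults overwrites any earlier mutation.
        self.queue = task.queue
        # Re-applying the hook after the reload restores the routing.
        if hook is not None:
            hook(self)


def route_to_special_queue(ti):
    """Hypothetical policy: send every task instance to one queue."""
    ti.queue = "special"


task = ToyTask(task_id="t1")
ti = ToyTaskInstance(task_id="t1")
route_to_special_queue(ti)  # mutation applied at creation time

ti.refresh_from_task(task)  # reload without the hook
queue_without_hook = ti.queue

ti.refresh_from_task(task, hook=route_to_special_queue)  # reload with the hook
queue_with_hook = ti.queue
```

In this model the hook runs on every refresh, which mirrors the trade-off mentioned above: the policy code is executed more often.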
How to test:
- Set a queue on some (or all :-D) tasks
- Modify example_params_trigger_ui and introduce some random errors in the code. Example attached below.
- Set QUEUE for the queue worker to print this in the DAG when testing
closes: #32471
FYI @AutomationDev85 @wolfdn @clellmann
Example cluster policy used for testing as airflow_local_settings.py:
Modified DAG for testing - example_params_trigger_ui.py: