-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix backfill occassional deadlocking #26161
Conversation
Comment: I think it would be nice to get some description of state changes and what State.NONE really is. I thin it would be really great to get a visual representation of all state changes for tasks in a few "state diagrams". Or maybe I am missing it and it is somewhere? Maybe we can have a nice set of state diagrams in mermaid since it is now nice integrated in GitHub ? Happy to collaborate on that one. |
We have one at https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#task-instances but that might not be in mermaid -> (https://airflow.apache.org/docs/apache-airflow/stable/_images/task_lifecycle_diagram.png) |
Ah Thanks @kaxil -> I remembered we had one - just could not find it :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. So if I understand now - Scheduler will simply move tasks to "Scheduled" state and further and that should prevent the deadlocks as only scheduler(s) will be doing it ?
During backfilling, we set all task instances of the dagrun being run to scheduled, this causes deadlocking of the dagrun whenever dagrun.update_state is called and there's no running or schedulable task instances. The fix was to remove the batch update of the task instances to scheduled state and have the executor queue the tasks that the dependencies have been met.
acbd8b2
to
dbaa7ab
Compare
I have updated the code. Previously, it would be sent to the executor to be queued but now it follows scheduled -> queued etc. |
@@ -351,18 +351,19 @@ def _task_instances_for_dag_run(self, dag_run, session=None): | |||
dag_run.refresh_from_db() | |||
make_transient(dag_run) | |||
|
|||
dag_run.dag = dag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dag_run
does not generally have a dag
attribute (and I don’t think there’s any code expecting it), why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we call task_instance_scheduling_decision
it tries to do self.get_dag()
and if the dag attribute is not set, it fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test I added failed due to a lack of dag
attribute but that could be because the dagrun
on the test does not have the dag
attribute. However, I decided that it's better to have the dag
attribute set in the code instead of making the test pass by adding the attribute on the dr in the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn these old code are so hard to follow.
cherry-picked change from the community apache/airflow#26161 Internal bug Change-Id: I44c690ed56561adef420b7935947647f417d7f2e GitOrigin-RevId: f846495d50f8de8412cccff345f6f265fd65adf9
cherry-picked change from the community apache/airflow#26161 Internal bug Change-Id: I62478c4c1142a00f1f984e5d14d1af7754946b82 GitOrigin-RevId: c582c826563065ef2c7c37213bc2f7a4fdcb81d8
Fix from OSS: apache#26161
cherry-picked change from the community apache/airflow#26161 Internal bug Change-Id: I44c690ed56561adef420b7935947647f417d7f2e GitOrigin-RevId: f846495d50f8de8412cccff345f6f265fd65adf9
During backfilling, we set all task instances of the dagrun being run to scheduled,
this causes deadlocking of the dagrun whenever dagrun.update_state is called and
there's no running or schedulable task instances.
The fix was to remove the batch update of the task instances to scheduled state
and have the executor queue the tasks that the dependencies have been met.
closes: #25353, closes: #26044