Many fixes for handling tasks in jobs and executors #1271
Conversation
@jlowin this is really great stuff! What I would really like (if you need help, let me know) is to have some tests in place, especially for the ignore_depends_on_past parameter, as we have been bitten by this before, plus the SchedulerJob fix and the deadlock ones. This would increase coverage in this area, which is pretty important.
Force-pushed from 11d75d6 to 534ca3a
Use the same calling format as the other Executors
The current machinery for running BackfillJobs overrides tasks’ start_dates to deal with depends_on_past. This is fragile and, critically, doesn’t always carry through all of the nested Jobs. We replace it with an explicit instruction to ignore_depends_on_past when considering whether a task can be queued. Also, this will be used later to evaluate whether a set of tasks is deadlocked.
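The idea above can be sketched as a standalone dependency check. This is a minimal illustration, not Airflow's actual API: the function name, the one-day schedule interval, and the past_success lookup table are all hypothetical.

```python
from datetime import datetime, timedelta

def deps_met(task_id, execution_date, past_success, depends_on_past,
             ignore_depends_on_past=False):
    """Hypothetical sketch: instead of rewriting the task's start_date,
    the caller passes an explicit ignore_depends_on_past flag that
    skips the depends-on-past check for this one evaluation."""
    if depends_on_past and not ignore_depends_on_past:
        # Assume a daily schedule: the previous day's run must have succeeded.
        prev_date = execution_date - timedelta(days=1)
        if not past_success.get((task_id, prev_date), False):
            return False
    return True

past = {("load", datetime(2016, 1, 1)): False}
# Blocked: yesterday's run did not succeed.
blocked = deps_met("load", datetime(2016, 1, 2), past, depends_on_past=True)
# Runnable: the backfill explicitly ignores depends_on_past.
forced = deps_met("load", datetime(2016, 1, 2), past, depends_on_past=True,
                  ignore_depends_on_past=True)
```

Passing the flag through the call chain, rather than mutating task attributes, is what lets it survive nested Jobs.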
1. Introduce a concept of a deadlocked backfill, meaning no tasks can run. The easiest way to create this is with depends_on_past. Previously, backfill would sit forever. Now it identifies the deadlock and exits, possibly informing the user that a cause of the deadlock is depends_on_past.
2. Previously, BackfillJob would run a task once to put it in a queue, but then ignore the queued task on every subsequent loop, resulting in it never being run. Now it considers queued tasks and runs them.
3. “UP_FOR_RETRY” tasks were not handled properly by the executor (it raised the “the airflow run command failed at reporting an error” message). Now they are.
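The deadlock test in point 1 can be reduced to a simple loop invariant: if a full pass over the remaining tasks runs nothing, nothing can ever become runnable. A toy sketch, with hypothetical names (run_backfill, can_run), not the real BackfillJob code:

```python
def run_backfill(tasks, can_run):
    """Run tasks until done; if a full pass over the remaining tasks
    runs nothing, report the leftovers as deadlocked instead of
    looping forever."""
    remaining = set(tasks)
    finished = set()
    while remaining:
        ran = [t for t in list(remaining) if can_run(t, finished)]
        if not ran:
            return finished, sorted(remaining)  # deadlocked tasks
        for t in ran:
            remaining.discard(t)
            finished.add(t)
    return finished, []

# "c" depends on a task "x" that will never run, so it deadlocks.
deps = {"a": set(), "b": {"a"}, "c": {"x"}}
done, deadlocked = run_backfill(deps, lambda t, f: deps[t] <= f)
```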
SchedulerJob loads EVERY queued task and tries to run it, which creates conflicts with any other Job trying to do the same (BackfillJob from CLI or subdag, or potentially [one day] other schedulers). This creates a new method, process_events, which polls the Scheduler’s own executor for queued tasks and adds them to a set. The scheduler then only considers that set when prioritizing queued tasks.
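A sketch of that ownership scheme follows. The class and function names (StubExecutor, process_events, prioritize_queued) are illustrative stand-ins for the real scheduler/executor machinery:

```python
class StubExecutor:
    """Stand-in for an executor that reports events in a buffer."""
    def __init__(self):
        self.event_buffer = {}

    def queue_task(self, key):
        self.event_buffer[key] = "queued"

    def get_event_buffer(self):
        # Drain and return the buffer, like polling for events.
        buf, self.event_buffer = self.event_buffer, {}
        return buf

def process_events(executor, owned_queued):
    """Record every task key this scheduler's own executor reported."""
    for key in executor.get_event_buffer():
        owned_queued.add(key)
    return owned_queued

def prioritize_queued(all_queued, owned_queued):
    """Only consider queued tasks that this scheduler itself queued,
    so it doesn't steal tasks queued by a concurrent BackfillJob."""
    return [key for key in all_queued if key in owned_queued]

executor = StubExecutor()
executor.queue_task(("dag", "t1", "2016-01-01"))
owned = process_events(executor, set())
# t2 was queued by some other Job, so this scheduler leaves it alone.
result = prioritize_queued(
    [("dag", "t1", "2016-01-01"), ("dag", "t2", "2016-01-01")], owned)
```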
Previously, DagRuns failed if any task failed and succeeded if all tasks succeeded or were skipped. However, because of trigger behaviors, that’s not right: a task can fail and another task can start up with an “on failed” trigger. This changes the logic to consider three termination cases:
1. Failure. If any of the root tasks fail, the dagrun fails, because there is no possibility of any “on failure” trigger coming off a root task.
2. Success. If ALL of the root tasks succeed or skip, the dagrun succeeds. This means there can be upstream failures as long as failure triggers are respected.
3. Deadlock. A dag run is deadlocked when no action is possible. This is determined by the presence of unfinished tasks without met dependencies. However, care must be taken when depends_on_past=True, because individual dag runs can *look* deadlocked when they are actually just waiting for earlier runs to finish. To solve this problem, deadlocks are evaluated in two ways: first, across all dagruns simultaneously (to account for situations with depends_on_past=True); second, in each individual dagrun (but only if there are no depends_on_past relationships).
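The three termination cases reduce to a small state function. A sketch with illustrative names and string states (not Airflow's State constants); the deadlocked flag stands in for the unfinished-tasks-without-met-dependencies check described above:

```python
def dagrun_state(root_states, deadlocked):
    """Evaluate a dag run from the states of its root tasks.
    root_states: e.g. ["success", "skipped", "queued"].
    deadlocked: True when no unfinished task has met dependencies."""
    if any(s == "failed" for s in root_states):
        return "failed"   # case 1: a root task failed
    if all(s in ("success", "skipped") for s in root_states):
        return "success"  # case 2: all roots succeeded or were skipped
    if deadlocked:
        return "failed"   # case 3: no action is possible
    return "running"      # otherwise, keep going
```

Note that an upstream (non-root) failure never appears in root_states, which is exactly why "on failure" triggers can still produce a successful dagrun.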
Force-pushed from 534ca3a to 68b236d
Fix minor issues including:
- clean up State
- fix bug with nonstandard DAGS_FOLDER locations
- remove restriction on dags being outside DAGS_FOLDER, because DagBags are allowed to load dags from anywhere
- miscellaneous Landscape fixes
- use logger instead of print for DAG.clear()
Force-pushed from 68b236d to c1eb83a
The grand sum of fixes from #1225, also closes #1254 and #1255.
This is a big PR and I've left the commits separated for clarity. When merged, I can squash them (though it may make sense to leave them distinct because they address many different issues).
The comments for each commit will give the details, but at a high level this addresses a few things:
- BackfillJob and SchedulerJob no longer conflict over queued tasks: the scheduler acts only on tasks reported through its own executor's events_buffer, so it won't grab tasks queued by a BackfillJob.
- depends_on_past is handled explicitly in BackfillJob, which now considers depends_on_past dependencies when determining if a task can run. Users can tell a BackfillJob that they want to ignore_first_depends_on_past if they desire. The default is not to do this (but the helpful deadlock error message will tell users if that's the reason their tasks aren't running). The flag to ignore depends_on_past is -I or --ignore_depends_on_past (for airflow run) or --ignore_first_depends_on_past (for airflow backfill).
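The distinction between the two flags can be sketched as a predicate: for a backfill, only the first execution date in the range skips the depends_on_past check, while later runs still wait on their predecessors within the backfill itself. The function name and arguments here are illustrative, not Airflow's API:

```python
from datetime import datetime

def skip_depends_on_past(execution_date, backfill_start_date,
                         ignore_first_depends_on_past):
    """Hypothetical sketch of the backfill flag's semantics: only the
    first execution date in the backfill range ignores depends_on_past;
    subsequent runs still depend on the runs the backfill itself creates."""
    return (ignore_first_depends_on_past
            and execution_date == backfill_start_date)

start = datetime(2016, 1, 1)
first = skip_depends_on_past(datetime(2016, 1, 1), start, True)   # skipped
later = skip_depends_on_past(datetime(2016, 1, 2), start, True)   # enforced
```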