Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many fixes for handling tasks in jobs and executors #1271

Merged
merged 8 commits into from
Apr 3, 2016

Conversation

jlowin
Copy link
Member

@jlowin jlowin commented Apr 1, 2016

The grand sum of fixes from #1225, also closes #1254 and #1255.

This is a big PR and I've left the commits separated for clarity. When merged, I can squash them (though it may make sense to leave them distinct because they address many different issues).

The comments for each commit will give the details, but at a high level this address a few things:

  1. Handling queued/retried tasks in BackfillJob
    • queued tasks never ran
    • retried tasks were sometimes viewed as errors
  2. Handling queued tasks in SchedulerJob
    • Scheduler tried to run ALL queued tasks, including those from other jobs, causing collisions and confusion. Now it only schedules queued tasks from its own executor, by examining the executor's events_buffer.
  3. Handling deadlocked tasks in BackfillJob
    • If the backfill gets into a situation where none of the tasks can run, it exits with a helpful message. Previously it looped forever.
  4. Fixed handling of depends_on_past in BackfillJob
    • previously, this was handled by automatically overriding start dates to effectively coerce them into ignoring their depends_on_past parameter. This was fragile and didn't always make it into nested executor instructions. Instead, this PR makes it possible to explicitly ignore ONLY depends_on_past dependencies when determining if a task can run. Users can tell a BackfillJob that they want to ignore_first_depends_on_past if they desire. The default is not to do this (but the helpful deadlock error message will tell users if that's the reason their tasks aren't running).
    • The command line parameter to ignore depends_on_past is -I or --ignore_depends_on_past (for airflow run) or --ignore_first_depends_on_past (for airflow backfill)
  5. Fixed determination of DagRun states
    • previously, DagRuns were failed as soon as a task failed and marked successful if all tasks succeeded/skipped. That's no good, because tasks can fail and other tasks can respond with triggers or retries, but in the current system the dagrun would terminate immediately and those tasks wouldn't run.
    • Now:
    • dagruns FAIL when any of the root tasks FAIL
    • dagruns SUCCEED when all of the root tasks SUCCEED/SKIP
    • dagruns FAIL when all active dagruns are deadlocked (no tasks can run).
  6. Fixed issue where triggers were not respected by BackfillJob
    • once a task failed, the backfill removed all of its downstream tasks from consideration no matter what their triggers were

@jlowin jlowin changed the title Many fixes for handling tasks in jobs Many fixes for handling tasks in jobs and executors Apr 1, 2016
@bolkedebruin
Copy link
Contributor

@jlowin this is a really great stuff! What I would really like (if you need help let me know), is to have some tests in place especially for the ignore_depends_on_past parameter as we have been bitten by this before, the schedulerjob fix and the deadlock ones. This would increase coverage in this area which is pretty important.

@jlowin jlowin force-pushed the subdag_pool_issue_1225 branch 2 times, most recently from 11d75d6 to 534ca3a Compare April 3, 2016 03:14
jlowin added 5 commits April 2, 2016 23:19
Use the same calling format as the other Executors
The current machinery for running BackfillJobs overrides tasks’
start_dates to deal with depends_on_past. This is fragile and,
critically, doesn’t always carry through all of the nested Jobs. We
replace it with an explicit instruction to ignore_depends_on_past when
considering whether a task can be queued.

Also, this will be used later to evaluate whether a set of tasks is
deadlocked.
1. Introduce a concept of a deadlocked backfill, meaning no tasks can
run. The easiest way to create this is with depends_on_past.
Previously, backfill would sit forever. Now it identifies the deadlock
and exist, possibly informing the user that a cause of the deadlock
with depends_on_past

2. Previously, BackfillJob would run a task once to put it in a queue,
but then ignore the queued task on every subsequent loop, resulting in
it never being run. Now it considers queued tasks and runs them.

3. “UP_FOR_RETRY” tasks were not handled properly by the executor (it
raised the “the airflow run command failed at reporting an error”
message). Now they are.
SchedulerJob loads EVERY queued task and tries to run it, which creates
conflicts with any other Job trying to do the same (BackfillJob from
CLI or subdag, or potentially [one day] other schedulers). This creates
a new method, process_events, which polls the Scheduler’s own executor
for queued tasks and adds them to a set. The scheduler then only
considers that set when prioritizing queued tasks.
Previously, DagRuns failed if any task failed and succeeded if all
tasks succeeded or were skipped. However, because of trigger behaviors,
that’s not right — a task can fail and another task can start up with
an “on failed” trigger.

This changes the logic to consider three termination cases:

1. Failure. If any of the root tasks fail, the dagrun fails. This is
because there is no possibility of any “on failure” trigger coming off
a root task.

2. Success. If ALL of the root nodes succeed or skip, the dagrun
succeeds. This means there can be upstream failures as long as failure
triggers are respected.

3. Deadlock — A dag run is deadlocked when no action is possible.
This is determined by the presence of unfinished tasks without met
dependencies. However, care must be taken when depends_on_past=True
because individual dag runs could *look* like they are deadlocked
when they are actually just waiting for earlier runs to finish.

To solve this problem, we evaluate deadlocks in two ways. First,
across all dagruns simultaneously (to account for situations with
depends_on_past=True). Second, in each individual dagrun (but only
if there are no depends_on_past relationship).
@jlowin jlowin force-pushed the subdag_pool_issue_1225 branch from 534ca3a to 68b236d Compare April 3, 2016 03:19
@landscape-bot
Copy link

Code Health
Repository health increased by 0.42% when pulling 68b236d on jlowin:subdag_pool_issue_1225 into 526e564 on airbnb:master.

jlowin added 3 commits April 2, 2016 23:25
Fix minor issues including:
  - clean up State
  - fix bug with nonstandard DAGS_FOLDER locations
  - remove restriction on dags not being outside DAGS_FOLDER
      because DagBags are allowed to load dags from anywhere
  - miscellaneous Landscape fixes
  - use logger instead of print for DAG.clear()
@jlowin jlowin force-pushed the subdag_pool_issue_1225 branch from 68b236d to c1eb83a Compare April 3, 2016 03:25
@landscape-bot
Copy link

Code Health
Repository health increased by 0.52% when pulling c1eb83a on jlowin:subdag_pool_issue_1225 into 526e564 on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.5%) to 68.593% when pulling c1eb83a on jlowin:subdag_pool_issue_1225 into 526e564 on airbnb:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 67.184% when pulling c1eb83a on jlowin:subdag_pool_issue_1225 into 526e564 on airbnb:master.

@mistercrunch
Copy link
Member

img

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants