
[AIRFLOW-56] Airflow's scheduler can "lose" queued tasks #1378

Merged: 2 commits merged into apache:master on May 9, 2016

Conversation

@jlowin (Member) commented Apr 13, 2016

When the Scheduler is run with `--num-runs`, there can be multiple
Schedulers and Executors all trying to run tasks. For queued tasks, the
Scheduler previously tried to run only the tasks it had queued itself,
but that doesn't work if the Scheduler is restarting. This PR reverts
that behavior and adds two kinds of "best effort" protection: before
running a TI, executors check whether it is already running, and before
ending, executors call sync() one last time.

Closes https://issues.apache.org/jira/browse/AIRFLOW-56
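
For orientation, here is a minimal sketch of those two protections, using simplified stand-ins (FakeTI and SketchExecutor are illustrative names, not Airflow's real executor classes):

    # Illustrative sketch only -- simplified stand-ins, not Airflow's actual executor API.
    class FakeTI(object):
        def __init__(self, key, state="queued"):
            self.key = key
            self.state = state

        def refresh_from_db(self):
            pass  # in Airflow this re-reads the current state from the metadata DB

    class SketchExecutor(object):
        def __init__(self):
            self.queued_tasks = {}   # key -> (command, priority, queue, ti)
            self.running = {}        # key -> command

        def heartbeat(self):
            for key, (command, _, _, ti) in list(self.queued_tasks.items()):
                self.queued_tasks.pop(key)
                ti.refresh_from_db()
                if ti.state == "running":     # protection 1: skip TIs that are already running
                    continue
                self.running[key] = command   # otherwise hand the command off to a worker

        def sync(self):
            # stand-in: a real executor polls its backend here and pops finished keys
            self.running.clear()

        def end(self):
            while self.running:               # protection 2: one final sync() before exiting
                self.sync()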

@jlowin (Member, author) commented Apr 13, 2016

@abridgett please give this a shot regarding #1342.
@syvineckruyk it's very important for you to have a look at this -- I had to partially undo one of the fixes that kept SchedulerJob from preying on BackfillJob's queued tasks. That fix made SchedulerJob deal only with its own tasks, but when SchedulerJob runs with num-runs it keeps restarting, and the new scheduler was ignoring tasks queued by the old one.

@landscape-bot

Code Health
Repository health decreased by 0.03% when pulling 8fa7fd4 on jlowin:queued-tasks into 5d15d68 on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.02%) to 67.979% when pulling 8fa7fd487c59673c1163dee1b0631f4d49c6809e on jlowin:queued-tasks into 5d15d68 on airbnb:master.

@abridgett (Contributor) commented:

It seems to have worked today, will keep an eye on it - many many thanks @jlowin

@abridgett (Contributor) commented:

Worked today as well, FYI (last time the update effectively landed in the middle of a run, so this was the first real test).

@jlowin (Member, author) commented Apr 15, 2016

Great. @syvineckruyk any luck?

@syvineckruyk (Contributor) commented:

@jlowin I won't be able to test until tomorrow, sorry about that. I have a bunch of test DAGs ready to go, so testing will be quick once I start.

The sample dags from #1350 and #1225 should be a good start though.

@jlowin (Member, author) commented Apr 15, 2016

No rush, just wanted to make sure you had seen the issue.

@jlowin (Member, author) commented Apr 21, 2016

@syvineckruyk just want to ping you on this; we may need to merge it to master shortly.

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling f592c65 on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls

Coverage Status

Coverage decreased (-0.03%) to 66.996% when pulling f592c65fb6b8d974ddb1ce477cb7aefc5c0fff5b on jlowin:queued-tasks into c1f485f on airbnb:master.

@syvineckruyk (Contributor) commented:

@jlowin gotcha ... starting to test now.

@syvineckruyk (Contributor) commented:

@jlowin so I am running your "queued-tasks" branch and seeing some weird issues around DAG runs. I don't have an exact handle on it yet: failed DAG runs are getting created, and I am now seeing DAG runs created for 5 days prior to the start date.

@jlowin (Member, author) commented Apr 22, 2016

I think the 5-days-before thing is a known issue with the scheduler logic -- basically, if you have a non-scheduled execution, the scheduler picks up 5 days before it. We will have to deal with that separately.

@syvineckruyk (Contributor) commented:

@jlowin cool. I found the cause of the failing DAG runs; it was an unrelated issue. Continuing to run this version today, and so far so good.

@syvineckruyk (Contributor) commented Apr 24, 2016

@jlowin so far on this branch I have found a couple of issues. depends_on_past=True does not appear to be respected, at least on SDOs (I don't know about other tasks). I have SDOs in a failed state, and at the next interval the following task instance of the failed SDO is launched anyway.

This may be helpful: the job was behaving as expected until a few minutes ago, when we crossed over into April 24th UTC; that is when the task instances that should not have been kicked off began running. The schedule of those tasks was not midnight; it was actually 12 hours later, at 12pm. So something about the new UTC day seems to be involved.

I have also observed several instances of subtasks completing successfully while the SDO never notices and stays in a running state indefinitely.

@syvineckruyk (Contributor) commented:

@jlowin still trying to create a generic DAG to expose these bugs, but wanted to let you know that the SDOs whose state never gets updated appear to be the ones in DAG runs where another task has failed:

  1. A task fails
  2. The DAG run fails
  3. The subtasks complete
  4. The status of the SDOs remains "running"

@jlowin (Member, author) commented Apr 24, 2016

Excellent, thanks so much for the feedback @syvineckruyk. I'm trying to figure out what in this code could have led to that behavior. I think it might be the extra sync() call I gave each executor: the idea was to avoid orphaned tasks by waiting for all tasks to complete, but maybe there's a situation where the parent DAG has already quit before the orphaned tasks do. Thinking further.
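
Roughly, the shutdown loop being described looks like this (a hedged sketch with hypothetical names; the concern is that end() keeps waiting on tasks whose parent job has already exited):

    # Hypothetical sketch of the shutdown behavior under discussion, not Airflow's code.
    import time

    def end(executor, poll_interval=1.0):
        # Keep syncing until the executor reports no running tasks. If the parent
        # job (for example a SubDagOperator's backfill) has already quit, its tasks
        # may never be reported as done and this loop can stall indefinitely.
        while executor.running:
            executor.sync()
            time.sleep(poll_interval)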

@syvineckruyk (Contributor) commented:

@jlowin just a question: when I was running the last version you gave me, things were running well. What was the issue that required this area to need more work? For reference, here is the version I had been running:

commit fd9388c0c27c2e469f4eb0362800323a08b76d68
Merge: 58abca2 b2844af
Author: bolkedebruin <[email protected]>
Date:   Tue Apr 5 22:24:15 2016 +0200

    Merge pull request #1290 from jlowin/subdag-backfill-status

    Make sure backfill deadlocks raise errors

commit b2844af020cb5a470bd83ead09ddb121923084ca
Author: jlowin <[email protected]>
Date:   Mon Apr 4 18:59:13 2016 -0400

    Fix infinite retries with pools, with test

    Addresses the issue raised in #1299

@jlowin (Member, author) commented Apr 24, 2016

The issue you were experiencing was because Scheduler and Backfill were both trying to queue your SubDag tasks. The fix I put in place was that Scheduler kept track of tasks it submitted to the executor, and only tried to run those tasks if they became queued. That way, the queued subdag tasks were left alone and Backfill handled them without interference. The problem is that folks often run Scheduler with the --num_runs parameter, which makes Scheduler restart periodically. When Scheduler restarts, it has no way of knowing which tasks it submitted before it quit, so those tasks remain queued forever. The bug could also occur, less commonly, if Scheduler quit for any other reason while tasks were queued.
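
A toy illustration of that failure mode (hypothetical names, not Airflow code): whatever the scheduler remembers in memory about what it queued is gone after a restart, so only a query against the metadata DB still sees the queued tasks:

    # Toy illustration only: why in-memory bookkeeping is lost across restarts.
    queued_in_db = {"dag_a.task_1", "dag_a.task_2"}   # state persisted in the metadata DB

    class SchedulerRunSketch(object):
        def __init__(self):
            self.submitted = set()   # in-memory only; empty again after every restart

        def handle_queued(self):
            own_only = queued_in_db & self.submitted   # old behavior: nothing after a restart
            everything = set(queued_in_db)             # reverted behavior: pick up all queued TIs
            return own_only, everything

    fresh = SchedulerRunSketch()     # what a restarted scheduler effectively starts with
    print(fresh.handle_queued())     # -> (set(), {'dag_a.task_1', 'dag_a.task_2'})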

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling 88a211d on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls commented Apr 26, 2016

Coverage Status

Coverage increased (+1.7%) to 68.704% when pulling 88a211d66894b8d02b8923378a97a3e5cce50373 on jlowin:queued-tasks into c1f485f on airbnb:master.

@syvineckruyk (Contributor) commented:

@jlowin were you able to identify any of the issues I referred to above?

Thanks

@jlowin (Member, author) commented Apr 26, 2016

@syvineckruyk not directly, but I removed some of the code that I suspect is causing it.

@jlowin (Member, author) commented Apr 26, 2016

As a matter of fact, it now looks like Travis is failing with queued tasks, so I think I need to leave that code in. More thinking...

@landscape-bot

Code Health
Repository health decreased by 0.02% when pulling 12b7b6a on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls commented Apr 26, 2016

Coverage Status

Coverage increased (+0.04%) to 67.066% when pulling 12b7b6a116a0eadcf3101b636ffbec6239437870 on jlowin:queued-tasks into c1f485f on airbnb:master.

@bolkedebruin bolkedebruin changed the title Handle queued tasks from multiple jobs/executors [AIRFLOW-56] Airflow's scheduler can "loose" queued tasks May 6, 2016
@bolkedebruin bolkedebruin added the kind:bug and Job/Executor labels and removed the Missing JIRA Issue label May 6, 2016
@jlowin jlowin changed the title [AIRFLOW-56] Airflow's scheduler can "loose" queued tasks [AIRFLOW-56] Airflow's scheduler can "lose" queued tasks May 6, 2016
@@ -1159,7 +1167,7 @@ def run(
         self.pool = pool or task.pool
         self.test_mode = test_mode
         self.force = force
-        self.refresh_from_db()
+        self.refresh_from_db(lock_for_update=True)
Review comment (Contributor):

Please provide the session as well, to make sure the commit happens in the same session

@jlowin (Member, author):

Done.
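
For context, a lock_for_update refresh typically looks something like the following in SQLAlchemy (an illustrative sketch, not the exact Airflow implementation; refresh_task_instance is a made-up helper name):

    # Illustrative SQLAlchemy sketch of a lock_for_update refresh -- not Airflow's exact code.
    def refresh_task_instance(ti, session, lock_for_update=False):
        # 'session' is a SQLAlchemy ORM session; 'ti' is an ORM-mapped task instance row.
        qry = session.query(type(ti)).filter_by(
            dag_id=ti.dag_id,
            task_id=ti.task_id,
            execution_date=ti.execution_date)
        if lock_for_update:
            # SELECT ... FOR UPDATE: the row stays locked until this session commits,
            # which is why the review asks for the commit to happen in the same session.
            qry = qry.with_for_update()
        fresh = qry.first()
        if fresh is not None:
            ti.state = fresh.state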

@bolkedebruin (Contributor) commented May 9, 2016

I think that here: https://github.com/jlowin/airflow/blob/queued-tasks/airflow/jobs.py#L650 we need an extra session.commit() after the delete.

Update: here we can actually lose a commit.
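
Roughly, the suggested pattern (an illustrative sketch; clear_queued_record is a made-up name):

    # Illustrative sketch of the suggestion: commit right after the delete so the
    # change cannot be dropped by a later rollback or an abandoned session.
    def clear_queued_record(session, record):
        session.delete(record)
        session.commit()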

@@ -86,10 +84,23 @@ def heartbeat(self):
             key=lambda x: x[1][1],
             reverse=True)
         for i in range(min((open_slots, len(self.queued_tasks)))):
-            key, (command, priority, queue) = sorted_queue.pop(0)
-            self.running[key] = command
+            key, (command, priority, queue, ti) = sorted_queue.pop(0)
Review comment (Contributor):

_ instead of priority (looks like it's unused)

@bolkedebruin (Contributor) commented:

+1!

jlowin added 2 commits May 9, 2016 17:18
When the Scheduler is run with `--num-runs`, there can be multiple
Schedulers and Executors all trying to run tasks. For queued tasks, the
Scheduler previously tried to run only the tasks it had queued itself,
but that doesn't work if the Scheduler is restarting. This PR reverts
that behavior and adds two kinds of "best effort" protection: before
running a TI, executors check whether it is already running, and before
ending, executors call sync() one last time.
The scheduler can encounter a queued task twice before the
task actually starts to run -- this locks the task and avoids
that condition.
@aoen (Contributor) commented May 9, 2016

LGTM

@asfgit asfgit merged commit c1aa93f into apache:master May 9, 2016
asfgit pushed a commit that referenced this pull request May 9, 2016
@griffinqiu

We have been blocked on this issue in 1.7.0.
Do you have any suggestions for fixing it? A hot patch to 1.7.0, or an upgrade to 1.7.1? Is 1.7.1 stable enough?
1.6.2 seems to have the same issue. ;(

Thank you very much

@bolkedebruin (Contributor) commented:

1.7.1 is in much better condition than 1.7.0. So I would definitely use that one.

@criccomini (Contributor) commented:

To be clear, 1.7.1.2 is the version you should be using. This version is quite stable.

@griffinqiu

I have tested 1.7.1.2.
It seems there are still many issues with SubDags and pools, and SubDags also waste a lot of resources on Celery.

@criccomini (Contributor) commented:

@griffinqiu can you open up JIRAs for the known issues?

Labels: kind:bug
10 participants