Dynamic tasks marked as upstream_failed when none of their upstream tasks are failed or upstream_failed #27449
Comments
Can you share the DAG code? You can redact some values. Without sharing your code it might be difficult to understand what's going on.
Without looking at anything, one UPSTREAM_FAILED case specific to a mapped task is when the upstream pushes a non-mappable value (an int, for example). This feels likely for your case.
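To illustrate the case described above, here is a generic sketch (not the reporter's DAG; the dag_id and task names are invented): the upstream task returns an int, which cannot be mapped over, so expansion fails and the mapped task instances end up `upstream_failed`.

```python
# Hypothetical minimal example of the "non-mappable upstream value" case.
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="unmappable_upstream_value",
    start_date=datetime(2022, 1, 1),
    schedule=None,
    catchup=False,
):

    @task
    def produce():
        # Returning an int: expand() needs a list or dict, so expansion fails
        # and the mapped task instances are marked upstream_failed.
        return 42

    @task
    def consume(value):
        return value

    consume.expand(value=produce())
```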
A few other details I discovered while debugging: [...]

Why would the scheduler be trying to expand tasks during another task's execution? Is it because the task downstream has those mapped tasks as upstream? Even so, it shouldn't mark them as `upstream_failed`.
Unfortunately it's proprietary code from my employer, hence the de-ID. I am attempting to create a new test DAG that I can share to reproduce the error.
I was able to successfully create a test case. I also updated the issue language reflecting more research. At this point I can't imagine why this behavior is desirable, but it's likely I'm missing something.
I was not able to reproduce with your test case. You also have an issue on the test case:

```python
last_task = PythonOperator(
    task_id="last_task",
    python_callable=nop,
)
```

should have

```python
last_task = PythonOperator(
    task_id="last_task",
    python_callable=nop,
    op_args=[1],
)
```

for example.
@eladkal - I corrected my test case with the suggested `op_args`. Do you have [...]?
I used the default for [...]. However, I see that in my case the second task starts before the first task and they finish together.
Having the same issue with Airflow 2.4.1.
Reproduced with LocalExecutor!
It looks like non-expanded mapped task instances don't have any upstream tasks in `upstream_list` at this point:

```diff
diff --git a/airflow/ti_deps/deps/trigger_rule_dep.py b/airflow/ti_deps/deps/trigger_rule_dep.py
index 39722c6df3..f50043c7b6 100644
--- a/airflow/ti_deps/deps/trigger_rule_dep.py
+++ b/airflow/ti_deps/deps/trigger_rule_dep.py
@@ -67,6 +67,9 @@ class TriggerRuleDep(BaseTIDep):
     def _get_dep_statuses(self, ti, session, dep_context: DepContext):
         # Checking that all upstream dependencies have succeeded
         if not ti.task.upstream_list:
+            if ti.task.is_mapped and ti.map_index == -1:
+                yield self._failing_status(reason="Task is a mapped task and has not expanded.")
+                return
             yield self._passing_status(reason="The task instance did not have any upstream tasks.")
             return
```

cc: @uranusjr, would like to hear what you think about this.
Something doesn’t click. Why does the non-expanded mapped task not have a proper `upstream_list`?
It was removed here (lines 2240 to 2244 in 3aadc44), when it is called from airflow/airflow/jobs/local_task_job.py (lines 253 to 258 in 3aadc44).
The removal seems like a bug? Maybe @ashb can chip in.
Okay, so the problem here is, as Ephraim identified, that the mini scheduler operates on a subset of the DAG, not the complete dag. So when [...]

The reason we use a partial subset rather than the whole dag was a performance optimization. Essentially: this task just finished, let's look at the downstream tasks and see if any of them can be scheduled. But in order to check if those can be scheduled, we need the upstream of those tasks (which is how we get to include [...]).

So I think we have two options here: [...]
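For readers following along, a rough sketch of the mini-scheduler step being described; the real code lives in `airflow/jobs/local_task_job.py` and `dagrun.py`, and the exact arguments may differ from what is shown here:

```python
# Simplified sketch only; argument values are illustrative, not necessarily
# the exact ones the real mini scheduler passes.
from sqlalchemy.orm import Session

from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance


def mini_schedule_sketch(ti: TaskInstance, dag_run: DagRun, session: Session):
    # Build a cut-down DAG around the just-finished task and its downstream
    # tasks; relationships to tasks outside that cut are dropped.
    partial_dag = ti.task.dag.partial_subset(
        task_ids_or_regex=[ti.task_id],
        include_downstream=True,
        include_upstream=False,
    )
    # Scheduling decisions (including mapped-task expansion) are then made
    # against this partial DAG rather than the complete one.
    dag_run.dag = partial_dag
    return dag_run.task_instance_scheduling_decisions(session=session)
```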
Oh, and the removal is not a bug (at least not in the `partial_subset` function itself), because [...]
Ok. Thanks Ash, I think option 2 is better, since it will limit the change to mapped tasks. Let me try it.
Wild idea. This is a loose thought I have had for quite some time while reviewing some of the issues connected to the mini scheduler.

I think the mini-scheduler "approach" was a good idea: triggering the direct downstream tasks as soon as possible when a task finishes is generally cool. However, it introduces complexity (similar to the above) because it is, IMHO, "almost, but not quite, entirely unlike tea", to quote Douglas Adams. It looks like the scheduler scheduling tasks, it behaves like that, but there are some small quirks (like the above) that make it susceptible to subtle bugs.

Also, I think the original idea that the mini scheduler would run in a more "distributed" fashion is a bit hampered by the fact that the bulk of what the mini scheduler does goes through the DB anyway, and it synchronizes on the DagRun lock and DB operations, so pretty much all the "distribution" benefits are all but gone. Having multiple schedulers already provides a way to "distribute" scheduling, and distributing it even more does not change much. Yes, it uses the fact that the DAG is already loaded in memory and some of the DB objects are effectively cached by SQLAlchemy, but with 2.4 and the "micro-pipelines" approach, where our users have much better ways to split their DAGs into smaller, independent DAGs, I think this becomes far less of an issue.

I thought that we could implement "mini-scheduling" slightly differently. Rather than actually doing the scheduling, we could add the DagRun of the just-completed task to some kind of table where we keep "priority" DagRuns to be scheduled by a "real" scheduler. It would then be picked up almost immediately by (one of) the schedulers at the next scheduling loop and scheduled. The latency would be slightly higher, but not by much IMHO, and we would lose the "DAG structure in memory" benefit, but I think it would be a bit more robust, especially as we start introducing new concepts.

It's not entirely hashed out; there are probably some edge cases, for example when we have a lot of those "priority" DagRuns to process (what should we do with the non-priority ones?). I am mentioning it here because I think such an approach would pretty much automatically solve the above problem. WDYT?
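To make the shape of that idea concrete, here is a purely hypothetical sketch of such a "priority DagRun" table; nothing like this exists in Airflow, and all names are invented:

```python
# Hypothetical sketch of the proposal above; table and class names are made up.
from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class PriorityDagRunQueue(Base):
    """A row would be written by the task supervisor when a task finishes;
    the scheduler would drain this table at the start of each loop and
    examine these DagRuns before the regular ones."""

    __tablename__ = "priority_dag_run_queue"

    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    created_at = Column(DateTime, nullable=False, server_default=func.now())
```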
Such a change shouldn't be made without measuring the performance before and after. Yes, removing it would make things simpler, but I think a tiny bit of complexity is worth it for the speed boost. (I can't remember the figures off the top of my head, but I know we measured it before adding this, and it did make a difference to the "inter-task lag".)

The place where this change really helps is when you have a large number (500-1000+) of DAGs, as the mini scheduler only locks the current DAG, leaving the scheduler free to work on other DAGs.
Absolutely, I think there might be some effects I have not thought about. I just wanted to bounce the idea and thoughts I had off of you. I do realise there are performance implications if we change it; I just wanted to raise it as an option so that others might also think about it and see it as a possibility :)
Apache Airflow version
2.4.2
What happened
A mapped task is getting marked as `upstream_failed` when none of its upstream tasks are `failed` or `upstream_failed`.

In the above graph view, if `first_task` finishes before `second_task`, `first_task` immediately tries to expand `middle_task`. Note: this is an important step to reproduce; the order in which the tasks finish matters.

Note that the value of the Airflow configuration variable `schedule_after_task_execution` must be `True` (the default) for this to occur.

The expansion occurs when the task supervisor performs the "mini scheduler", in this line in `dagrun.py`, which then marks `middle_task` as `upstream_failed` in this line in `mappedoperator.py`.

I believe this was introduced by the PR "Fail task if mapping upstream fails".
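For reference (and as a possible workaround while debugging), the setting lives in the `[scheduler]` section of `airflow.cfg`; setting it to `False` disables the post-task mini scheduler and so avoids this code path:

```ini
# airflow.cfg -- shown with the non-default value that avoids this code path
[scheduler]
schedule_after_task_execution = False
```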
What you think should happen instead
The dynamic tasks should execute successfully. I don't think the mapped task should be expanded at this point, because its upstream task hasn't completed at the time of expansion. If the upstream task completed earlier, the expansion would succeed.
How to reproduce
Execute this DAG, making sure the Airflow configuration option `schedule_after_task_execution` is set to its default value `True`.
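The reporter's DAG is not reproduced in this thread. As a rough sketch only (task names taken from the description above, dag_id and timing values invented), a structure like the following should exhibit the race when `first_task` finishes while `second_task` is still running:

```python
# Illustrative reproduction sketch, not the reporter's original DAG.
import time
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import PythonOperator


def nop(*args):
    """Placeholder callable for the final task."""


with DAG(
    dag_id="mapped_upstream_failed_repro",
    start_date=datetime(2022, 1, 1),
    schedule=None,
    catchup=False,
):

    @task
    def first_task():
        # Finishes quickly, so its post-task "mini scheduler" pass runs while
        # second_task is still executing.
        return "done"

    @task
    def second_task():
        # Still running when first_task finishes; its return value is what
        # middle_task is mapped over.
        time.sleep(30)
        return [1, 2, 3]

    @task
    def middle_task(value):
        return value

    last_task = PythonOperator(
        task_id="last_task",
        python_callable=nop,
        op_args=[1],
    )

    mapped = middle_task.expand(value=second_task())
    first_task() >> mapped >> last_task
```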
.Operating System
debian buster
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
No response
Anything else
Are you willing to submit PR?
Code of Conduct