Triggerer intermittent failure when running many triggerers #32091

tomrutter · 2023-06-23T11:08:50Z

Apache Airflow version

2.6.2

What happened

We are running a dag with many deferrable tasks using a custom trigger that waits for an Azure Batch task to complete. When many tasks have been deferred, we can an intermittent error in the Triggerer. The logged error message is the following:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 457, in run
    asyncio.run(self.arun())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 470, in arun
    await self.create_triggers()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 492, in create_triggers
    dag_id = task_instance.dag_id
AttributeError: 'NoneType' object has no attribute 'dag_id'

After this error occurs, the trigger still reports as healthy, but no events are triggered. Restarting the triggerer fixes the problem.

What you think should happen instead

The specific error in the trigger should be addressed to prevent the triggerer async thread from crashing.

The triggerer should not perform heartbeat updates when the async triggerer thread has crashed.

How to reproduce

This occurs intermittently, and seems to be the results of running more than one triggerer. Running many deferred tasks eventually ends up with this error occurring.

Operating System

linux (standard airflow slim images extended with custom code running on kubernetes)

Versions of Apache Airflow Providers

postgres,celery,redis,ssh,statsd,papermill,pandas,github_enterprise

Deployment

Official Apache Airflow Helm Chart

Deployment details

Azure Kubernetes and helm chart 1.9.0.
2 replicas of both triggerer and scheduler.

Anything else

It seems that as triggers fire, the link between the trigger row and the associated task_instance for the trigger is removed before the trigger row is removed. This leaves a small amount of time where the trigger exists without an associated task_instance. The database updates are performed in a synchronous loop inside the triggerer, so with one triggerer, this is not a problem. However, it can be a problem with more than one triggerer.

Also, once the triggerer async loop (that handles the trigger code) fails, the triggers no longer fire. However, the heartbeat is handled by the synchronous loop so the job still reports as healthy.

I have included an associated PR to resolve these issues.

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

boring-cyborg · 2023-06-23T11:08:52Z

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

tomrutter · 2023-06-23T11:25:01Z

PR #32092 submitted

hussein-awala · 2023-06-23T19:27:55Z

It seems that as triggers fire, the link between the trigger row and the associated task_instance for the trigger is removed before the trigger row is removed. This leaves a small amount of time where the trigger exists without an associated task_instance. The database updates are performed in a synchronous loop inside the triggerer, so with one triggerer, this is not a problem. However, it can be a problem with more than one triggerer.

It should not happen even if you have multiple triggerers, because each one take a part of the unassigned triggers (base on capacity arguments) and locks the rows in the DB until update them to add its ID.
Also I don't believe that we delete the task_instance before delete the trigger since we load them from a dequeue, and you get this exception when you read from it.

I think your problem is with line:

airflow/airflow/jobs/triggerer_job_runner.py

Line 685 in a1ba155

new_trigger_instance = trigger_class(**new_trigger_orm.kwargs)

where your custom trigger doesn't set the task_instance parameter, and keep it None.

tomrutter · 2023-06-23T19:53:49Z

Will check, but wouldn't all triggers fail if that were the case? There are many (>100) identical triggers working and only one that fails (all using the same custom trigger). I notice that if the trigger heartbeat is delayed it can push the triggers to other triggerers. Could that open the window for the problem above under high load? The custom triggers do one http request using async call and then parse the json response. I don't think they are blocking but sometimes the triggerer complains of blocking when many are running at once.

tomrutter · 2023-06-23T20:06:09Z

I think task instance is set just below on line 691.

hussein-awala · 2023-06-25T08:22:08Z

I notice that if the trigger heartbeat is delayed it can push the triggers to other triggerers. Could that open the window for the problem above under high load?

I will check that and try to reproduce your problem

tomrutter · 2023-06-25T09:14:52Z

Thanks! I appreciate your attention on this!

I notice that submit_event also breaks the link between trigger and task, so this might leave a slightly larger window. The removal of the link could be left to Trigger.clean_unused?

tomrutter added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jun 23, 2023

hussein-awala mentioned this issue Jun 23, 2023

Handle partially updated trigger/task_instance data in triggerr jobs #32092

Merged

hussein-awala mentioned this issue Jun 25, 2023

Multiple Triggerer processes keeps reassigning triggers to each other when job_heartbeat_sec is higher than 30 seconds. #32121

Closed

2 tasks

hussein-awala closed this as completed in #32092 Jun 27, 2023

ephraimbuddy mentioned this issue Jul 7, 2023

Status of testing of Apache Airflow 2.6.3rc1 #32432

Closed

53 tasks

Lioscro mentioned this issue Mar 28, 2024

Possible race condition in triggerer when running multiple instances #38599

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Triggerer intermittent failure when running many triggerers #32091

Triggerer intermittent failure when running many triggerers #32091

tomrutter commented Jun 23, 2023 •

edited by hussein-awala

Loading

boring-cyborg bot commented Jun 23, 2023

tomrutter commented Jun 23, 2023

hussein-awala commented Jun 23, 2023

tomrutter commented Jun 23, 2023

tomrutter commented Jun 23, 2023

hussein-awala commented Jun 25, 2023

tomrutter commented Jun 25, 2023

Triggerer intermittent failure when running many triggerers #32091

Triggerer intermittent failure when running many triggerers #32091

Comments

tomrutter commented Jun 23, 2023 • edited by hussein-awala Loading

Apache Airflow version

What happened

What you think should happen instead

How to reproduce

Operating System

Versions of Apache Airflow Providers

Deployment

Deployment details

Anything else

Are you willing to submit PR?

Code of Conduct

boring-cyborg bot commented Jun 23, 2023

tomrutter commented Jun 23, 2023

hussein-awala commented Jun 23, 2023

tomrutter commented Jun 23, 2023

tomrutter commented Jun 23, 2023

hussein-awala commented Jun 25, 2023

tomrutter commented Jun 25, 2023

tomrutter commented Jun 23, 2023 •

edited by hussein-awala

Loading