Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Triggerer intermittent failure when running many triggerers #32091

Closed
2 tasks done
tomrutter opened this issue Jun 23, 2023 · 7 comments · Fixed by #32092
Closed
2 tasks done

Triggerer intermittent failure when running many triggerers #32091

tomrutter opened this issue Jun 23, 2023 · 7 comments · Fixed by #32092
Labels
area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet

Comments

@tomrutter
Copy link
Contributor

tomrutter commented Jun 23, 2023

Apache Airflow version

2.6.2

What happened

We are running a dag with many deferrable tasks using a custom trigger that waits for an Azure Batch task to complete. When many tasks have been deferred, we can an intermittent error in the Triggerer. The logged error message is the following:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 457, in run
    asyncio.run(self.arun())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 470, in arun
    await self.create_triggers()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 492, in create_triggers
    dag_id = task_instance.dag_id
AttributeError: 'NoneType' object has no attribute 'dag_id'

After this error occurs, the trigger still reports as healthy, but no events are triggered. Restarting the triggerer fixes the problem.

What you think should happen instead

The specific error in the trigger should be addressed to prevent the triggerer async thread from crashing.

The triggerer should not perform heartbeat updates when the async triggerer thread has crashed.

How to reproduce

This occurs intermittently, and seems to be the results of running more than one triggerer. Running many deferred tasks eventually ends up with this error occurring.

Operating System

linux (standard airflow slim images extended with custom code running on kubernetes)

Versions of Apache Airflow Providers

postgres,celery,redis,ssh,statsd,papermill,pandas,github_enterprise

Deployment

Official Apache Airflow Helm Chart

Deployment details

Azure Kubernetes and helm chart 1.9.0.
2 replicas of both triggerer and scheduler.

Anything else

It seems that as triggers fire, the link between the trigger row and the associated task_instance for the trigger is removed before the trigger row is removed. This leaves a small amount of time where the trigger exists without an associated task_instance. The database updates are performed in a synchronous loop inside the triggerer, so with one triggerer, this is not a problem. However, it can be a problem with more than one triggerer.

Also, once the triggerer async loop (that handles the trigger code) fails, the triggers no longer fire. However, the heartbeat is handled by the synchronous loop so the job still reports as healthy.

I have included an associated PR to resolve these issues.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@tomrutter tomrutter added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jun 23, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 23, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@tomrutter
Copy link
Contributor Author

PR #32092 submitted

@hussein-awala
Copy link
Member

It seems that as triggers fire, the link between the trigger row and the associated task_instance for the trigger is removed before the trigger row is removed. This leaves a small amount of time where the trigger exists without an associated task_instance. The database updates are performed in a synchronous loop inside the triggerer, so with one triggerer, this is not a problem. However, it can be a problem with more than one triggerer.

It should not happen even if you have multiple triggerers, because each one take a part of the unassigned triggers (base on capacity arguments) and locks the rows in the DB until update them to add its ID.
Also I don't believe that we delete the task_instance before delete the trigger since we load them from a dequeue, and you get this exception when you read from it.

I think your problem is with line:

new_trigger_instance = trigger_class(**new_trigger_orm.kwargs)

where your custom trigger doesn't set the task_instance parameter, and keep it None.

@tomrutter
Copy link
Contributor Author

Will check, but wouldn't all triggers fail if that were the case? There are many (>100) identical triggers working and only one that fails (all using the same custom trigger). I notice that if the trigger heartbeat is delayed it can push the triggers to other triggerers. Could that open the window for the problem above under high load? The custom triggers do one http request using async call and then parse the json response. I don't think they are blocking but sometimes the triggerer complains of blocking when many are running at once.

@tomrutter
Copy link
Contributor Author

I think task instance is set just below on line 691.

@hussein-awala
Copy link
Member

I notice that if the trigger heartbeat is delayed it can push the triggers to other triggerers. Could that open the window for the problem above under high load?

I will check that and try to reproduce your problem

@tomrutter
Copy link
Contributor Author

Thanks! I appreciate your attention on this!

I notice that submit_event also breaks the link between trigger and task, so this might leave a slightly larger window. The removal of the link could be left to Trigger.clean_unused?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants