Cloudwatch Integration: SIGTERM/SIGKILL Sent Following DAG Completion, Causing Errors in Worker Logs #13824
Comments
One thing to note with this issue: I believe it is also preventing workers from consuming more work, via a 60-second sleep at the end of every task. We noticed a ~30-40 second delay between two workers running tasks, as if all workers were busy doing something, but nothing was actually happening. Work had been queued but was not being performed. After looking at the logs for a given task in CloudWatch (which has the added benefit of including timestamps in the log), there is a 60-second gap between the SIGTERM and when the process is finally killed with SIGKILL.
I imagine it's this configuration parameter: https://github.com/apache/airflow/blob/master/airflow/utils/process_utils.py#L42-L52
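For context, here is a minimal sketch of the escalation pattern being referred to (send SIGTERM, wait out a grace period, then SIGKILL). This is plain Python, not the actual Airflow implementation; the 60-second default here simply mirrors the delay observed in the logs above, and the function name is made up for illustration.

import os
import signal
import time


def reap_process(pid: int, grace_seconds: int = 60) -> None:
    """Ask a child process to exit, then force-kill it if it does not comply."""
    os.kill(pid, signal.SIGTERM)              # polite request to shut down
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        finished, _ = os.waitpid(pid, os.WNOHANG)
        if finished == pid:                   # child exited within the grace period
            return
        time.sleep(1)
    os.kill(pid, signal.SIGKILL)              # child is stuck, e.g. in a blocked log flush
    os.waitpid(pid, 0)

Under such a scheme, shortening the configured grace period would shrink the idle window described above, at the cost of giving tasks less time to clean up.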
It also seems that this issue sets all the JobRuns to have
We're also experiencing this issue with ECSOperator and CloudWatch logging on Airflow 2.0.2. Changing
Hi @potiuk, who can help us with this bug?
Maybe this is something the Amazon team can take a look at, @subashcanapathy? I think that's one of the obvious candidates that someone from AWS could help with, since this is a CloudWatch integration problem.
This might be fixed by #16289 -- one thing to try to see if that is the case is to disable the mini scheduler run by setting
Set it to false and see if that fixes the problem. Sadly we missed that in 2.1.1, so it would have to wait for 2.1.2, which would be at least a few weeks. And if setting that config doesn't help, then there is something else at fault here. See https://airflow.apache.org/docs/apache-airflow/2.0.2/configurations-ref.html#schedule-after-task-execution for reference.
Hi @ashb, I set that. All the tasks are successful on the Airflow side; however, in Flower they all show as failed. My airflow.cfg:

[scheduler]
dag_dir_list_interval = 60
parsing_processes = 4
run_duration = 41460
schedule_after_task_execution = False
statsd_host = airflow-statsd
statsd_on = True
statsd_port = 9125
statsd_prefix = airflow

This is how the Celery logs look for a given worker:

[2021-07-20 09:02:49,057: INFO/ForkPoolWorker-15] Executing command in Celery: ['airflow', 'tasks', 'run', 'XXXXXXXX', 'eks.sensor', '2021-07-20T08:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/opt/airflow/dags/XXXXXXXX.py']
[2021-07-20 09:02:49,086: INFO/ForkPoolWorker-15] Filling up the DagBag from /opt/airflow/dags/XXXXXXXX.py
[2021-07-20 09:02:51,442: INFO/ForkPoolWorker-15] Datasets List: 2
[2021-07-20 09:02:51,442: INFO/ForkPoolWorker-15] Start getting tables list from dataset: XXXXXXXX
[2021-07-20 09:02:51,585: INFO/ForkPoolWorker-15] Start getting tables list from dataset: XXXXXXXX
[2021-07-20 09:02:52,380: WARNING/ForkPoolWorker-15] Running <TaskInstance: XXXXXXXX.eks.sensor 2021-07-20T08:00:00+00:00 [queued]> on host airflow-worker-7c6b4f75f9-x67r9
[2021-07-20 09:03:18,710: ERROR/ForkPoolWorker-15] Failed to execute task Task received SIGTERM signal.
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 117, in _execute_in_fork
args.func(args)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 238, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
_run_task_by_local_task_job(args, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 121, in _run_task_by_local_task_job
run_job.run()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 245, in run
self._execute()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 100, in _execute
self.task_runner.start()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
self.process = self._start_by_fork()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 92, in _start_by_fork
logging.shutdown()
File "/usr/local/lib/python3.8/logging/__init__.py", line 2126, in shutdown
h.flush()
File "/home/airflow/.local/lib/python3.8/site-packages/watchtower/__init__.py", line 297, in flush
q.join()
File "/usr/local/lib/python3.8/queue.py", line 89, in join
self.all_tasks_done.wait()
File "/usr/local/lib/python3.8/threading.py", line 302, in wait
waiter.acquire()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1286, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-07-20 09:04:18,738: ERROR/ForkPoolWorker-15] Failed to execute task [Errno 2] No such file or directory: '/tmp/tmpbwn0h8za'.
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 117, in _execute_in_fork
args.func(args)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 238, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
_run_task_by_local_task_job(args, ti)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 121, in _run_task_by_local_task_job
run_job.run()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 245, in run
self._execute()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 145, in _execute
self.on_kill()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 166, in on_kill
self.task_runner.on_finish()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/base_task_runner.py", line 178, in on_finish
self._error_file.close()
File "/usr/local/lib/python3.8/tempfile.py", line 499, in close
self._closer.close()
File "/usr/local/lib/python3.8/tempfile.py", line 436, in close
unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpbwn0h8za'
[2021-07-20 09:04:18,834: ERROR/ForkPoolWorker-15] Task airflow.executors.celery_executor.execute_command[0a8ca64c-01df-4868-a21d-b369d3f7a6cd] raised unexpected: AirflowException('Celery command failed on host: airflow-worker-7c6b4f75f9-x67r9')
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 88, in execute_command
_execute_in_fork(command_to_exec)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 99, in _execute_in_fork
raise AirflowException('Celery command failed on host: ' + get_hostname())
airflow.exceptions.AirflowException: Celery command failed on host: airflow-worker-7c6b4f75f9-x67r9
@andormarkus can you set this:

If this worked for you, you should remove it when 2.1.3 is out.
I'm still having the same issue. The logs are the same. My airflow.cfg:

[scheduler]
dag_dir_list_interval = 60
orphaned_tasks_check_interval = 84600
parsing_processes = 4
run_duration = 41460
schedule_after_task_execution = False
statsd_host = airflow-statsd
statsd_on = True
statsd_port = 9125
statsd_prefix = airflow
Hi folks, have there been any new insights into this issue?
Has there been any movement on this, or any proposed fixes? I'm still getting the temp file location error. We are on 2.0.1, though I spun up a local container of 2.2.0 and still experienced this error, so that leads me to believe it still exists. Any updates?
We have fixed this in #18269, released in 2.2.0.
I have tested it with Airflow 2.2.2 and the problem still exists. Example logs 1:

[2021-11-18 21:19:01,060: INFO/MainProcess] Task airflow.executors.celery_executor.execute_command[53ca9d3f-b59e-4a3f-9f21-009c32db5473] received
[2021-11-18 21:19:01,082: INFO/ForkPoolWorker-16] Executing command in Celery: ['airflow', 'tasks', 'run', 'simple_dag', 'sleep', 'scheduled__2021-11-18T21:18:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/simple_dag.py']
[2021-11-18 21:19:01,082: INFO/ForkPoolWorker-16] Celery task ID: 53ca9d3f-b59e-4a3f-9f21-009c32db5473
[2021-11-18 21:19:01,123: INFO/ForkPoolWorker-16] Filling up the DagBag from /opt/airflow/dags/repo/dags/simple_dag.py
[2021-11-18 21:19:01,234: WARNING/ForkPoolWorker-16] Running <TaskInstance: simple_dag.sleep scheduled__2021-11-18T21:18:00+00:00 [queued]> on host airflow-worker-5997488b78-t2ftx
[2021-11-18 21:19:17,631: INFO/ForkPoolWorker-15] Task airflow.executors.celery_executor.execute_command[4ccd30db-274f-4f2d-a750-3df8ec6e856c] succeeded in 76.68239335156977s: None
[2021-11-18 21:19:17,852: ERROR/ForkPoolWorker-16] Failed to execute task Task received SIGTERM signal.
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/celery_executor.py", line 121, in _execute_in_fork
args.func(args)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 292, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 105, in _run_task_by_selected_method
_run_task_by_local_task_job(args, ti)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 163, in _run_task_by_local_task_job
run_job.run()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/base_job.py", line 245, in run
self._execute()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/local_task_job.py", line 103, in _execute
self.task_runner.start()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
self.process = self._start_by_fork()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 97, in _start_by_fork
logging.shutdown()
File "/usr/local/lib/python3.9/logging/__init__.py", line 2141, in shutdown
h.flush()
File "/home/airflow/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 297, in flush
q.join()
File "/usr/local/lib/python3.9/queue.py", line 90, in join
self.all_tasks_done.wait()
File "/usr/local/lib/python3.9/threading.py", line 312, in wait
waiter.acquire()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1413, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
Is it possible we're running into a CloudWatch quota? Per the docs below: https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html
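If quota throttling is a suspicion, one way to inspect the account's CloudWatch Logs limits is the AWS Service Quotas API. This is only a rough sketch, assuming boto3 is installed and AWS credentials/region are configured in the environment; the quota names returned can then be compared against the PutLogEvents limits in the documentation linked above.

import boto3

# List the CloudWatch Logs ("logs") quotas applied to this account, e.g. the
# request-rate limits relevant to PutLogEvents throttling.
client = boto3.client("service-quotas")
response = client.list_service_quotas(ServiceCode="logs")
for quota in response["Quotas"]:
    print(f'{quota["QuotaName"]}: {quota["Value"]}')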
Hey folks, I was looking into this today, but I can't seem to reproduce it. @andormarkus I see in your last post you're using 2.2.2 and are running some
Hi @o-nikolas, we are on the helm chart. Please let me know if you need more information. Relevant section from the helm chart configuration:

config:
  logging:
    colored_console_log: "False"
    remote_logging: "True"
    remote_log_conn_id: aws_default
    remote_base_log_folder: "cloudwatch://${log_group_arn}"

The DAG:
"""Sample DAG."""
import time
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "email": ["[email protected]"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def sleep() -> bool:
    """Sleep.

    Returns:
        bool: True
    """
    time.sleep(10)
    return True


with DAG("simple_dag", default_args=default_args, schedule_interval="* * * * *", catchup=False) as dag:
    t1 = PythonOperator(task_id="sleep", python_callable=sleep)
I did a deep dive on this issue and root-caused it. TL;DR: it is related to
The issue is related to a combination of three factors: forking + threading + logging. This combination can lead to a deadlock when logs are being flushed after a task finishes execution. This means that the StandardTaskRunner will be stuck at this line. Now, since the task has actually finished (thus its state is success), but the process didn't exit yet (and never will, since it is in a deadlock state), the
Notice that this could cause further issues:
References
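To make the failure mode described above concrete, here is a minimal, self-contained sketch (plain Python on a POSIX system, not Airflow or watchtower code) of how a forked child can hang forever in Queue.join() when the queue's consumer thread only exists in the parent:

import os
import queue
import signal
import threading
import time

q = queue.Queue()


def consumer():
    # Background thread that drains the queue, like a remote log handler
    # shipping buffered records from a worker thread.
    while True:
        q.get()
        time.sleep(0.5)      # pretend to ship the record somewhere remote
        q.task_done()


threading.Thread(target=consumer, daemon=True).start()
q.put("a buffered log record")   # unfinished_tasks == 1 when we fork below

pid = os.fork()
if pid == 0:
    # Child: fork() copied the queue and its unfinished-task counter, but not
    # the consumer thread, so nothing here will ever call task_done().
    print("child: flushing logs (this will hang)...")
    q.join()                     # blocks forever, mirroring a handler flush() at shutdown
    os._exit(0)
else:
    # Parent: the child never exits on its own, so after a grace period it has
    # to be force-killed, like the SIGKILL seen in the worker logs above.
    time.sleep(3)
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    print("parent: child had to be SIGKILLed")

In the tracebacks above, the same hang appears to happen inside logging.shutdown() -> flush() -> q.join() in the forked task-runner process, which would explain why its supervisor eventually resorts to SIGTERM and then SIGKILL.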
Hi @rafidka, thanks for the update. Here are my current findings.

Used versions:

airflow: 2.2.4
watchtower: 2.0.1

Test setup: I have been running 5 simple DAGs in parallel for the past few days. From the Flower perspective everything looks good. I have checked the worker logs, and a few times per hour I get the following error messages:

[2022-03-01 18:21:24,954: ERROR/ForkPoolWorker-15] Failed to execute task Task received SIGTERM signal.
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/celery_executor.py", line 121, in _execute_in_fork
args.func(args)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 298, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 105, in _run_task_by_selected_method
_run_task_by_local_task_job(args, ti)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 163, in _run_task_by_local_task_job
run_job.run()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/base_job.py", line 246, in run
self._execute()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/local_task_job.py", line 103, in _execute
self.task_runner.start()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 41, in start
self.process = self._start_by_fork()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/task/task_runner/standard_task_runner.py", line 97, in _start_by_fork
logging.shutdown()
File "/usr/local/lib/python3.9/logging/__init__.py", line 2141, in shutdown
h.flush()
File "/home/airflow/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 432, in flush
q.join()
File "/usr/local/lib/python3.9/queue.py", line 90, in join
self.all_tasks_done.wait()
File "/usr/local/lib/python3.9/threading.py", line 312, in wait
waiter.acquire()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1415, in signal_handler Worker logs for a failed task looks like this: ▶ kubectl -n airflow logs airflow-worker worker | grep ForkPoolWorker-16
[2022-03-01 19:08:00,654: INFO/ForkPoolWorker-16] Celery task ID: fee17d13-2423-4ed1-ab2f-3f1a3fd34551
[2022-03-01 19:08:00,709: INFO/ForkPoolWorker-16] Filling up the DagBag from /opt/airflow/dags/repo/dags/simple_dag_2.py
[2022-03-01 19:08:00,842: WARNING/ForkPoolWorker-16] Running <TaskInstance: simple_dag_2.sleep scheduled__2022-03-01T19:07:00+00:00 [queued]> on host airflow-worker-58b8d8789b-w7jwv
[2022-03-01 19:08:24,702: ERROR/ForkPoolWorker-16] Failed to execute task Task received SIGTERM signal.
[2022-03-01 19:08:25,170: INFO/ForkPoolWorker-16] Task airflow.executors.celery_executor.execute_command[fee17d13-2423-4ed1-ab2f-3f1a3fd34551] succeeded in 24.535810169007163s: None

@rafidka Where should I find the
Based on my testing
@andormarkus, hmm, this is interesting. I didn't particularly test Airflow 2.2.4, but I tested 2.0.2 + watchtower 2.0.0 and beyond and I couldn't reproduce the issue, so I assumed that since Airflow 2.2.4 is using watchtower 2.0.1, the issue should be resolved. I will see if I can do some testing and try to reproduce this issue.
This is a bit tricky, unfortunately. The thing is that when you configure Airflow to use CloudWatch logging, but the logging itself has issues, you are likely to miss some logs. On the other hand, when you stop using CloudWatch logging, the issue itself disappears (i.e. this is a Heisenbug situation). In my case, I had to modify the Airflow source code locally and then use file-based logging like this:
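As an aside, one possible way to tee task logs to a local file for debugging, without removing the remote handler, is to attach an extra stdlib FileHandler to the airflow.task logger. This is only an illustration of the idea (the file path is arbitrary), not the source modification described above.

import logging

# Duplicate task log records to a local file so they survive even if the
# remote (CloudWatch) handler misbehaves during shutdown.
local_handler = logging.FileHandler("/tmp/airflow-task-debug.log")
local_handler.setLevel(logging.DEBUG)
local_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s"))

# "airflow.task" is the logger task code writes to; this adds the file handler
# alongside whatever handlers are already configured.
logging.getLogger("airflow.task").addHandler(local_handler)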
Hi @rafidka, I'm so sorry, I was looking for this warning in Kubernetes, not in CloudWatch. Here is the warning in CloudWatch:

Here is an exported log stream. It is very interesting: CloudWatch does not store the log level (info/warning/error), just the log message.
@andormarkus, what you are reporting above is exactly the symptom I've seen when there is a problem with logging. Could you please do a
I just tried Airflow 2.2.4 and it is working fine for me. On Airflow 2.2.4:
On Airflow 2.2.3:
For reference, this is the DAG I am testing with:

from datetime import timedelta
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
import os
from datetime import datetime
NUM_LINES = 10000
DAG_ID = os.path.basename(__file__).replace(".py", "")
@dag(dag_id=DAG_ID, schedule_interval=timedelta(minutes=1), catchup=False, start_date=days_ago(0), tags=['test'])
def print_dag():
    @task()
    def execute_fn():
        for i in range(0, NUM_LINES):
            print(datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])

    execute_fn_t = execute_fn()


test_dag_d = print_dag()
Hi @rafidka, I'm running 5 'simple_dag.py' DAGs in parallel and I get one error every 5-10 minutes. I will deploy your DAG tomorrow. I'm running my code on Airflow 2.2.4 with watchtower 2.0.1.
I just tried a sleep DAG (similar to yours) and it also succeeded on Airflow 2.2.4:
This is my DAG:

import time
from datetime import timedelta
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
import os
DAG_ID = os.path.basename(__file__).replace(".py", "")
@dag(dag_id=DAG_ID, schedule_interval=timedelta(minutes=1), catchup=False, start_date=days_ago(0), tags=['test'])
def sleep_dag():
    @task()
    def execute_fn():
        time.sleep(10)

    execute_fn_t = execute_fn()


test_dag_d = sleep_dag()

I suspect your setup has some issue (perhaps some stale configuration or package). I would start clean or, even better, use Docker if you aren't already.
Hello, we are running our Airflow on AWS EKS with the latest helm chart and the official Docker image, so I don't think it is a setup issue. I was able to reproduce it locally as well with docker compose:
@andormarkus, to avoid making assumptions, could you please share the exact code of
Also, it is worth noting that, based on the setup you mentioned above, the local executor will be used instead of the Celery executor. The testing I did was on the Celery executor.
@rafidka, I think you have misunderstood point 4 in my setup guide. I meant: leave the default environment variables as is, just change
"""Sample DAG."""
import time
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "email": ["[email protected]"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def sleep() -> bool:
    """Sleep.

    Returns:
        bool: True
    """
    time.sleep(20)
    return True


with DAG("simple_dag_1", default_args=default_args, schedule_interval="* * * * *", catchup=False) as dag:
    t1 = PythonOperator(task_id="sleep", python_callable=sleep)
Thanks @andormarkus! Are you able to reproduce it when using the Celery executor?
Thanks to everyone trying to get to the root cause of this issue.
Apache Airflow version:
2.0.0
Environment:
Docker Stack
Celery Executor w/ Redis
3 Workers, Scheduler + Webserver
Cloudwatch remote config turned on
What happened:
Following execution of a DAG when using the CloudWatch integration, the state of the Task Instance is being externally set, causing SIGTERM/SIGKILL signals to be sent. This produces error logs in the Workers, which is a nuisance for alert monitoring.
However following the completion of the DAG, the following is appended to the logs:
This is a problem, because it causes the following to appear in Worker logs:
What you expected to happen:
No errors to appear in Worker logs, if this SIGTERM/SIGKILL is intended
How to reproduce it:
Use Airflow w/ Celery Executor and CloudWatch remote logging
Anything else we need to know:
Occurs every time, every task in DAG