KubernetesPodOperator Fails on empty log line #21605
Comments
I'm not sure of the root cause of the issue.
Could you try updating to the latest version of the provider, https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/ 3.0.2?
3.0.2 also has the same logic for parsing log lines, so it wouldn't be of much help.
With reference to #15638 @dimberman, are you aware of a case where the log lines can be empty?
It has a different kubernetes library. Can you please check it?
Hello, I tried using the latest package version 3.1.1 but face the same issue as above. Any recommendations?
@bhavaniravi I think the root cause of the problem is that you can't read the logs of the pod. This looks like a misconfiguration of your service account. The root cause of the problem is not parsing, nor even an empty line, but the fact that you cannot read the logs:
Maybe you should look closer at your k8s logs and see what the root cause of the problem is. Even if we add an extra step to "react" to empty lines it will not solve the root cause, which is the inability to read the logs. Can any of those who have similar problems take a deeper look at your k8s logs and see if there are any other errors - for example, ones indicating that there are some permission issues or any other anomalies? Looking at it as a "parsing" issue is just masking a real problem you have in your deployment, I am afraid.
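For anyone who wants to rule out the permission angle, here is a minimal sketch (not taken from this thread) that uses the kubernetes Python client to check whether the identity Airflow runs under can read pod logs at all; the namespace and pod name are placeholders you would have to replace:

```python
# Quick check: can this identity read pod logs (the same API call the
# operator relies on)? Namespace and pod name below are placeholders.
from kubernetes import client, config

NAMESPACE = "airflow"            # assumed namespace
POD_NAME = "example-task-pod"    # assumed pod name

try:
    # Use the in-cluster service account when running inside Kubernetes...
    config.load_incluster_config()
except config.ConfigException:
    # ...otherwise fall back to the local kubeconfig.
    config.load_kube_config()

v1 = client.CoreV1Api()
try:
    # timestamps=True mirrors what the pod log follower asks for.
    logs = v1.read_namespaced_pod_log(
        name=POD_NAME, namespace=NAMESPACE, timestamps=True
    )
    print(logs[:500])
except client.exceptions.ApiException as exc:
    # A 403 here usually means the service account lacks RBAC access
    # to the pods/log subresource.
    print(f"Cannot read pod logs: {exc.status} {exc.reason}")
```

If this call fails or returns empty output while kubectl logs works under a different identity, the problem is on the RBAC/deployment side rather than in the provider's parsing.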
@potiuk Just adding a little more context. I am using Airflow 2.x to spin up pods on an Azure Kubernetes cluster using the KubernetesPodOperator. It's interesting that I don't face the above logging issues when running the pod on a provisioned node. However, the moment I try to execute the same on the newer virtual nodes, it starts hitting me with the logging error. Furthermore, I've also noticed that if I disable the do_xcom_push argument, the job then succeeds just fine, though it will still have those warning messages throughout the logs saying "Error parsing timestamp. Will continue execution but won't update timestamp". On the documentation page for AKS virtual nodes, it does mention that init containers are not supported. I believe the way xcom works is by running a sidecar container though, right? Just trying to make sense of any obvious limitations which I might be missing right off the bat :)
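To make the do_xcom_push experiment above concrete, here is an illustrative sketch of a KubernetesPodOperator task with the XCom sidecar disabled; the DAG id, namespace, image and command are placeholders rather than values from this thread:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="kpo_no_xcom_sidecar",   # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    run_job = KubernetesPodOperator(
        task_id="run_job",
        name="run-job",
        namespace="default",                      # placeholder namespace
        image="python:3.10-slim",                 # placeholder image
        cmds=["python", "-c", "print('hello from the pod')"],
        # With do_xcom_push=True the operator adds a sidecar container that
        # reads /airflow/xcom/return.json from the pod; with False, no extra
        # container is needed.
        do_xcom_push=False,
        get_logs=True,
    )
```

That would be consistent with the observation that the failure only shows up when the sidecar comes into play on nodes that restrict extra containers.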
(#22566) It seems that in some circumstances, the K8S client might return empty logs even if the "timestamps" option is specified. That should not happen in general, but apparently it does in some cases and leads to the task being killed. Rather than killing the task we should log it as an error (on top of trying to find out why it happens and preventing it - also to be able to gather more information and diagnostics on when it happens). Related to: #21605
While we do not know the root cause, #22566 should mitigate the crash. It will be released in the next provider version (but the next provider will only be installable on Airflow 2.3.0+).
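The mitigation amounts to tolerating a log line with no parsable timestamp instead of raising. A simplified sketch of that idea (illustrative only, not the provider's actual code) could look like this:

```python
import logging
from typing import Optional, Tuple

import pendulum

log = logging.getLogger(__name__)


def parse_log_line(line: str) -> Tuple[Optional[pendulum.DateTime], str]:
    """Split a '{timestamp} {message}' line, tolerating malformed input."""
    timestamp_str, sep, message = line.partition(" ")
    if not sep or not timestamp_str:
        # Empty or timestamp-less line: log an error and keep going instead
        # of failing the whole task.
        log.error(
            "Error parsing timestamp (no timestamp in message %r). "
            "Will continue execution but won't update timestamp",
            line,
        )
        return None, line
    try:
        return pendulum.parse(timestamp_str), message
    except Exception:
        log.error(
            "Error parsing timestamp in %r. "
            "Will continue execution but won't update timestamp",
            line,
        )
        return None, line
```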
Thanks! However, what fix/workaround can we use in the interim to prevent the errors and the crashing until the above is released (as the 2.3.0 release seems far away)? Is there a possibility of using a custom XCom backend (S3) with the pod operator instead of the default writing to the sidecar container?
> Thanks! However, what fix/workaround can we use in the interim to prevent the errors and the crashing until the above is released (as the 2.3.0 release seems far away)?

2.3.0 is out. I do not think there were any workarounds.
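For what it's worth, a custom XCom backend along the lines asked about above is possible, though it only changes where XCom values are stored, not how the operator extracts them from the pod, so it may not remove the sidecar from the picture. A hypothetical sketch (the bucket name, key layout and use of S3Hook are all assumptions, not advice from this thread):

```python
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

S3_BUCKET = "my-airflow-xcom-bucket"   # placeholder bucket
PREFIX = "s3-xcom://"


class S3XComBackend(BaseXCom):
    """Store XCom payloads in S3, keeping only a reference in the metadata DB."""

    @staticmethod
    def serialize_value(value, **kwargs):
        hook = S3Hook()
        key = f"xcom/{uuid.uuid4()}.json"
        hook.load_string(json.dumps(value), key=key, bucket_name=S3_BUCKET)
        # Only the S3 reference ends up in the Airflow metadata database.
        return BaseXCom.serialize_value(PREFIX + key)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(PREFIX):
            hook = S3Hook()
            data = hook.read_key(value[len(PREFIX):], bucket_name=S3_BUCKET)
            return json.loads(data)
        return value
```

Such a class would be enabled through the xcom_backend option in the [core] section of airflow.cfg, pointing at the module path of the class.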
Hi Team, even with Airflow MWAA 2.4.3 and a KubernetesPodOperator task, we are seeing the below issue in the Airflow logs (Kubernetes version 1.24): {{pod_manager.py:410}} ERROR - Error parsing timestamp (no timestamp in message ''). Will continue execution but won't update timestamp. I don't see any issue with the pod logs when I check manually using the kubectl client; each line contains a timestamp. But the above warning message still appears in the Airflow logs.
MWAA 2.4.3 uses by default
I am hitting this on the very latest Cloud Composer which uses
I'm also experiencing this problem while using Airflow v2.5.0
In this case, we're at a step in the execution process that could take a long time, and you can see that the error occurs exactly 5 minutes after the previous line got executed. There isn't an empty log line in the log that follows. My pipeline was working before when I was dealing with a smaller dataset, because the build time of this step was small enough, but I think there might be some kind of KubernetesPodOperator-related timeout at play here. @potiuk / anyone else: any ideas on Airflow variables I could play around with to test this hypothesis?
Maybe @dstandish knows, and maybe try upgrading to the latest Airflow? There were some changes in how logs are pulled from K8S pods.
Ah right. Now I remember.
Anyone who is experiencing failures with provider version >= 4.0, please share the traceback.
We've started seeing
Seems like we faced a similar issue on a later version. Versions of Apache Airflow Providers: apache-airflow-providers-cncf-kubernetes==6.1.0. Apache Airflow version: 2.6.1. Python version: 3.10. Operating System: Debian VERSION="11 (bullseye)". Seems like after the empty string timestamp error (which appears during a long query), log parsing stops and fails with a 404.
Hi there!
I got the same issue when I updated to 2.6.2.
I solved the issue by adding persistence for logs
I found this issue using GCP Composer v2 with KubernetesPodOperator.
And how did you do that, @y0zg?
Hi, why is the issue closed? I am on 2.6.1 with airflow-cncf-kubernetes 7.5.1 (latest) and still getting this error.
Because we think the original issue, reported against a version released 2 years ago, has been fixed. You @romanzdk @ricoms (also the suggestion from @y0zg and @tanthml) might have a SIMILAR issue which might be completely different. And the best way you can help someone to diagnose and solve your issue is to open a new one where you describe what happens in your case and provide evidence from your version (ideally after upgrading to the latest version of Airflow, because what you see could have been fixed since). This is an open-source project where you get software for free, and the people who solve other people's problems most often do that in their free time - weekends and nights. And the best way that you can get help from those people is to make it easy for them to diagnose your issue. I know it is super easy to write "I have the same issue". You do not lose any time on gathering evidence and writing the issue. But your issue (especially since we are talking about something raised for version 2.2, while we are 2 years later on 2.7.1 and the k8s code has been rewritten 3 times since then) might be completely different. It might also depend on multiple factors like the kubernetes version, the provider version, the type of kubernetes cluster you have, etc. Your comment did not bring anyone any closer to knowing all those details. So, if you can open a new issue @romanzdk and provide good evidence, you are increasing the chances (but only that - chances) that someone will spend their free afternoon or weekend looking at your issue and maybe even fix it. If all that you have is the question of why the issue is closed, your chances of getting it solved do not increase even by a fraction of a percent. So - if you really care about your problem being solved - I suggest you help those who try to help you and provide a good, reproducible issue with solid evidence of what happens, ideally looking at your system and correlating what happens there (maybe some pods were failing? maybe you can see some unusual behaviour or output of your pods? etc.). That will be a great help for those who spend their nights and weekends trying to help people who use the software completely for free.
Same here |
Apache Airflow Provider(s)
cncf-kubernetes
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==1!2.1.0
Apache Airflow version
2.2.3 (latest released)
Operating System
Debian GNU/Linux 10 (buster)
Deployment
Astronomer
Deployment details
No response
What happened
Some KubernetesPodOperator tasks fail with an empty log line. From the following logs, you can see the monitor_pod function failing:
Error parsing timestamp. Will continue execution but won't update timestamp
unable to retrieve container logs for docker://
Exception: Log not in "{timestamp} {log}" format. Got:
What you expected to happen
In case of an empty log line, we should gracefully handle the error instead of failing the task itself.
How to reproduce
Not sure what really causes this issue, but this StackOverflow question may be related: "Docker cleaning up the logs?"
Anything else
Complete log stacktrace
Are you willing to submit PR?
Code of Conduct