Monitoring addon / OMS agent - memory consumption, oom restarts #2457
Hi kaplik, AKS bot here 👋 I might be just a bot, but I'm told my suggestions are normally quite good, as such:
Support request ID: 2107260050000609
Triage required from @Azure/aks-pm
Hi @kaplik, I am investigating this from the Monitoring addon team and will provide an update and a workaround for this issue soon.
@vishiy, @saaror would you be able to assist?
/assign @ganga1980
Hi @kaplik, I investigated but couldn't find the exact root cause, and I have discussed with the support team handling the support ticket to schedule a meeting at your earliest convenience so that we can investigate and understand what is causing this. Based on our telemetry, it looks like you upgraded the cluster at 2021-07-14T20:14:00Z and our updated agent ciprod06112021 was deployed on 2021-07-15, but the OOMing only started around 2021-07-22. If my analysis of the timing is correct, I suspect your application container logs or some timing issue is causing the agent to get into this state. We need to understand your application container workload to see what is getting it into this state; for example, the following calls are not supposed to be made:
2021/07/26 06:10:49 oms.go:1362: RequestId c58f47a0-bee5-4c2b-8887-4772a2a1128d Status 403 Forbidden Status Code 403
Hi @ganga1980, we have 3 AKS clusters affected by this issue, and all of them were upgraded by the same procedure. Here is the timeline of the upgrade. All clusters are in the same subscription (151c879a-7d27-4fc1-9662-46db11216bb1).

12.7.2021 around 22:00 CEST - [XXX]-test-aks upgraded to 1.20.7

It's true that I noticed the OMS agent restarts after 22.7., so it is definitely possible that the problem started after the node pool reimage to the latest version (on 21.7. and 22.7.), and that everything was OK after the AKS upgrades (12.7. and 14.7.). The node pool upgrade was performed with this command on all clusters:
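(The exact command isn't reproduced above; a node-image-only node pool upgrade of this kind is typically run roughly as follows, with placeholder resource names.)

```bash
# Sketch of a node-image-only node pool upgrade; the resource group, cluster,
# and node pool names below are placeholders, not the actual ones used.
az aks nodepool upgrade \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --name nodepool1 \
  --node-image-only
```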
From Container Insights I was able to see that the OMS agent was probably upgraded automatically on 16.7. on the [XXX]-prod-aks cluster (times in UTC):
Workload in all clusters:
Thanks, @kaplik, for sharing the logs and also confirming that workaround #1 works without any issues. Here are the details of this issue:

If a namespace is excluded in both the stdout and stderr streams, there is no issue, since we don't tail the logs of the containers in that namespace at all. But if the namespace is excluded for only one of stdout or stderr, we drop the messages of the excluded stream after receiving the log lines. If a batch received from the fluent bit tail plugin contains only log lines of the excluded stream, then, because of a bug in ciprod06112021, instead of dropping them the agent tries to ingest them against an incorrect path, which fails with 403 (which is expected), and those log lines keep being retried. This is not an issue when the number of dropped messages is small, but if the number of dropped messages is very high, the number of log lines being retried becomes very high, which impacts the memory usage of td-agent-bit (aka Fluent Bit), and the agent eventually OOMs as memory usage keeps growing because of the retries. In your case, the namespace "baapi2-test" is excluded for stdout but not stderr, and the containers in this namespace generate a very high number of stdout logs, which keep retrying the incorrect path instead of being dropped.

Possible workarounds with the released Monitoring addon agent (ciprod06112021):
1. Exclude or include the namespace for both stdout & stderr streams in the container-azm-ms-agentconfig configmap (see the sketch below), or
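For reference, a minimal sketch of workaround 1, assuming the standard container-azm-ms-agentconfig schema; the excluded namespace list shown here is illustrative:

```bash
# Workaround 1 (sketch): exclude the namespace from BOTH stdout and stderr streams,
# so the agent does not tail those container logs at all. Assumes the standard
# container-azm-ms-agentconfig schema; the namespace list is illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system", "baapi2-test"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system", "baapi2-test"]
EOF
```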
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Thanks @kaplik for reporting this and also helping to validate the fix. The fix has been rolled out to all regions.
Hi @Azure/aks-leads, can you please help close this issue, as the fix has been rolled out to all AKS regions?
What happened:
I'm facing an issue with the AKS monitoring addon (OMS agent), which is restarting quite often. The probable cause is OOM.
The issue most likely started after an AKS upgrade (from 1.19.x to 1.20.7). The OMS agent was updated automatically during the AKS cluster upgrade to: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod06112021
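(To double-check which agent image is actually running, something like the following should work, assuming the addon's DaemonSet is named omsagent in kube-system, as it was for this agent version.)

```bash
# Show the image used by the monitoring addon DaemonSet
# (DaemonSet name "omsagent" in kube-system is assumed for this agent version).
kubectl -n kube-system get daemonset omsagent \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```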
I did this upgrade on 3 AKS clusters, and all of them are facing the same issue with the OMS agent.
See the number of restarts in the last 3 days:
Kubernetes events for pod:
I have tried to disable and re-enable the monitoring addon, but that didn't help:
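(The output isn't reproduced here; disabling and re-enabling the addon is typically done along these lines. Resource names and the workspace resource ID are placeholders.)

```bash
# Toggle the monitoring addon off and back on (sketch; names are placeholders).
az aks disable-addons --addons monitoring --resource-group my-rg --name my-aks-cluster
az aks enable-addons --addons monitoring --resource-group my-rg --name my-aks-cluster \
  --workspace-resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>"
```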
I have tried to investigate a bit and figured out that "td-agent-bit" is causing the memory consumption:
Memory consumed by this process grows constantly until the pod consumes 100% of its memory limit.
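(A quick way to observe this, as a sketch: check pod memory via metrics and list processes inside the agent container. The pod name is a placeholder, and the container name "omsagent" and availability of `ps` in the image are assumptions.)

```bash
# Sketch: observe td-agent-bit memory growth inside an agent pod.
# Pod name is a placeholder; container name "omsagent" and availability
# of `ps` in the image are assumptions.
kubectl -n kube-system top pod | grep omsagent
kubectl -n kube-system exec omsagent-xxxxx -c omsagent -- ps aux | grep td-agent-bit
```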
Another interesting thing is that fluent-bit has a problem flushing chunks to OMS:
This could be caused by:
Head of fluent-bit-out-oms-runtime.log
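(The log excerpt isn't reproduced here; it can typically be pulled from the agent container like this. The pod name is a placeholder and the log path is an assumption for this agent version.)

```bash
# Read the head of the fluent-bit OMS output plugin runtime log from the agent container.
# Pod name is a placeholder; the log path is an assumption for this agent version.
kubectl -n kube-system exec omsagent-xxxxx -c omsagent -- \
  head -n 50 /var/opt/microsoft/docker-cimprov/log/fluent-bit-out-oms-runtime.log
```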
Target LA workspace settings:
The OMS agent is using (as far as I can tell) the correct workspace ID and key.
What you expected to happen:
0 restarts.
How to reproduce it (as minimally and precisely as possible):
I haven't been able to reproduce this issue on a newly deployed AKS cluster.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
Size of cluster (how many worker nodes are in the cluster?):
3x Standard_D4s_v3
The other clusters with the same issue have a different number of nodes, so the node count seems to be irrelevant.