This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

container-monitoring addon doesn't work with 1.16 #2066

Closed
jackfrancis opened this issue Sep 30, 2019 · 10 comments
Labels
bug Something isn't working stale


@jackfrancis
Member

Here's some debug data from our E2E tests:

2019/09/30 15:43:50 $ k logs omsagent-f4gcd -c omsagent -n kube-system
 2019/09/30 15:43:51 #### $ k logs omsagent-f4gcd -c omsagent -n kube-system completed in 998.026281ms
 2019/09/30 15:43:51 
 getting gid for docker.sock
 creating a local docker group
 adding omsagent user to local docker group
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  2303    0  2303    0     0  34868      0 --:--:-- --:--:-- --:--:-- 35430
 k8s-agentpool-27751012-vmss000002
 not setting customResourceId
 -e error    Error resolving host during the onboarding request. Check the internet connectivity and/or network policy on the cluster
 ****************Start Config Processing********************
 Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log' 
 ****************End Config Processing********************
 Workspace 9257fb9c-24d2-433a-a57a-fed7ff3eb584 already onboarded and agent is running.
 Symbolic links have not been created; re-onboarding to create them
 info	Generating certificate ...
 -e info	Agent GUID is <agent-guid>
 -e info	Onboarding success
 Configure syslog...
 Configuring rsyslog for OMS logging
 Restarting service: rsyslog
 invoke-rc.d: could not determine current runlevel
  * Stopping enhanced syslogd rsyslogd
    ...done.
  * Starting enhanced syslogd rsyslogd
    ...done.
 Configure heartbeat monitoring agent...
 Configure log rotate for workspace <workspace_id>...
 INFO:  Configuring OMS agent service <workspace_id> ...
 invoke-rc.d: could not determine current runlevel
  * Starting Operations Management Suite agent (<workspace_id>): 
    ...done.
 -e error	MetaConfig generation script not available at /opt/microsoft/omsconfig/Scripts/OMS_MetaConfigHelper.py
  * Starting periodic command scheduler cron
    ...done.
 Primary Workspace: <workspace_id>    Status: Onboarded(OMSAgent Running)
 omsagent 1.10.0.1
 docker-cimprov 6.0.0.0
 nodename: k8s-agentpool-27751012-vmss000002
 replacing nodename in telegraf config
 File Doesnt Exist. Creating file...
 Fluent-Bit v0.14.4
 Copyright (C) Treasure Data
 
 ****************Start Prometheus Config Processing********************
 config::No configmap mounted for prometheus custom config, using defaults
 ****************End Prometheus Config Processing********************
 2019-09-30T15:19:57Z I! Starting Telegraf 
 Telegraf unknown (git: fork 50cd124)
 td-agent-bit 0.14.4
 2019-09-30T15:20:00Z E! [inputs.prometheus]: Error in plugin: error making HTTP request to http://10.240.0.66:10255/metrics: Get http://10.240.0.66:10255/metrics: dial tcp 10.240.0.66:10255: getsockopt: connection refused
 [... the same "connection refused" error repeated every minute through 2019-09-30T15:43:00Z ...]
 
 2019/09/30 15:43:51 $ k describe pod omsagent-f4gcd -n kube-system
 2019/09/30 15:43:51 Error trying to run 'kubectl exec':command terminated with exit code 1
 
 2019/09/30 15:43:51 Command:kubectl exec omsagent-f4gcd -n kube-system [grep -i cAdvisorPerfEmitStreamSuccess /var/opt/microsoft/omsagent/log/omsagent.log] 
 2019/09/30 15:43:52 #### $ k describe pod omsagent-f4gcd -n kube-system completed in 877.960811ms
 2019/09/30 15:43:52 
 Name:                 omsagent-f4gcd
 Namespace:            kube-system
 Priority:             2000001000
 Priority Class Name:  system-node-critical
 Node:                 k8s-agentpool-27751012-vmss000002/10.240.0.66
 Start Time:           Mon, 30 Sep 2019 15:18:52 +0000
 Labels:               component=oms-agent
                       controller-revision-hash=646d95b4c5
                       pod-template-generation=1
                       tier=node
 Annotations:          agentVersion: 1.10.0.1
                       dockerProviderVersion: 6.0.0-0
                       kubernetes.io/psp: privileged
                       schema-versions: v1
 Status:               Running
 IP:                   10.240.0.73
 IPs:
   IP:           10.240.0.73
 Controlled By:  DaemonSet/omsagent
 Containers:
   omsagent:
     Container ID:   docker://eb423256bc19f7c292a3ece25aa200677cbdcbdff4e27fd0273607047aa3f437
     Image:          mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019
     Image ID:       docker-pullable://mcr.microsoft.com/azuremonitor/containerinsights/ciprod@sha256:0f798cb7d56931b231f71e38e7fa5bf898b69e611247a566701f70a5f29a9799
     Ports:          25225/TCP, 25224/UDP
     Host Ports:     0/TCP, 0/UDP
     State:          Running
       Started:      Mon, 30 Sep 2019 15:19:39 +0000
     Ready:          True
     Restart Count:  0
     Limits:
       cpu:     150m
       memory:  600Mi
     Requests:
       cpu:     75m
       memory:  225Mi
     Liveness:  exec [/bin/bash -c /opt/livenessprobe.sh] delay=60s timeout=1s period=60s #success=1 #failure=3
     Environment:
       NODE_IP:             (v1:status.hostIP)
       ACS_RESOURCE_NAME:  kubernetes-eastus-68361
       CONTROLLER_TYPE:    DaemonSet
       ISTEST:             true
     Mounts:
       /etc/config/settings from settings-vol-config (rw)
       /etc/kubernetes/host from azure-json-path (rw)
       /etc/omsagent-secret from omsagent-secret (ro)
       /hostfs from host-root (rw)
       /var/lib/docker/containers from containerlog-path (rw)
       /var/log from host-log (rw)
       /var/run/host from docker-sock (rw)
       /var/run/secrets/kubernetes.io/serviceaccount from omsagent-token-6z8h8 (ro)
 Conditions:
   Type              Status
   Initialized       True 
   Ready             True 
   ContainersReady   True 
   PodScheduled      True 
 Volumes:
   host-root:
     Type:          HostPath (bare host directory volume)
     Path:          /
     HostPathType:  
   docker-sock:
     Type:          HostPath (bare host directory volume)
     Path:          /var/run
     HostPathType:  
   container-hostname:
     Type:          HostPath (bare host directory volume)
     Path:          /etc/hostname
     HostPathType:  
   host-log:
     Type:          HostPath (bare host directory volume)
     Path:          /var/log
     HostPathType:  
   containerlog-path:
     Type:          HostPath (bare host directory volume)
     Path:          /var/lib/docker/containers
     HostPathType:  
   azure-json-path:
     Type:          HostPath (bare host directory volume)
     Path:          /etc/kubernetes
     HostPathType:  
   omsagent-secret:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  omsagent-secret
     Optional:    false
   settings-vol-config:
     Type:      ConfigMap (a volume populated by a ConfigMap)
     Name:      container-azm-ms-agentconfig
     Optional:  true
   omsagent-token-6z8h8:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  omsagent-token-6z8h8
     Optional:    false
 QoS Class:       Burstable
 Node-Selectors:  beta.kubernetes.io/os=linux
 Tolerations:     node-role.kubernetes.io/master=true:NoSchedule
                  node.kubernetes.io/disk-pressure:NoSchedule
                  node.kubernetes.io/memory-pressure:NoSchedule
                  node.kubernetes.io/not-ready:NoExecute
                  node.kubernetes.io/pid-pressure:NoSchedule
                  node.kubernetes.io/unreachable:NoExecute
                  node.kubernetes.io/unschedulable:NoSchedule
 Events:
   Type    Reason     Age        From                                        Message
   ----    ------     ----       ----                                        -------
   Normal  Scheduled  <unknown>  default-scheduler                           Successfully assigned kube-system/omsagent-f4gcd to k8s-agentpool-27751012-vmss000002
   Normal  Pulling    24m        kubelet, k8s-agentpool-27751012-vmss000002  Pulling image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019"
   Normal  Pulled     24m        kubelet, k8s-agentpool-27751012-vmss000002  Successfully pulled image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019"
   Normal  Created    24m        kubelet, k8s-agentpool-27751012-vmss000002  Created container omsagent
   Normal  Started    24m        kubelet, k8s-agentpool-27751012-vmss000002  Started container omsagent
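The repeated `connection refused` errors against port 10255 suggest the kubelet read-only port is not listening, which newer cluster configurations disable by default. A hedged diagnostic sketch (the helper function is hypothetical, not part of the addon) for probing that port from a node or debug pod:

```shell
#!/bin/sh
# Hypothetical diagnostic helper: probe whether the kubelet read-only port
# that the addon's Telegraf input scrapes is actually listening.
check_readonly_port() {
  host="$1"
  port="${2:-10255}"   # kubelet read-only port; often disabled on newer clusters
  if curl -sf --max-time 3 "http://${host}:${port}/metrics" >/dev/null 2>&1; then
    echo "kubelet read-only port ${port} is open on ${host}"
  else
    echo "kubelet read-only port ${port} is closed on ${host}"
    return 1
  fi
}

# Example usage (node IP taken from the pod description above):
# check_readonly_port 10.240.0.66
```

If the port is closed, the addon would need to scrape the secured kubelet endpoint instead of 10255.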
@jackfrancis jackfrancis added the bug Something isn't working label Sep 30, 2019
@jackfrancis
Member Author

@rashmichandrashekar FYI

Also, are the UUIDs that I replaced with <agent-guid> and <workspace_id> above private info? If so, they should not be output in plain text to the pod logs.

@rashmichandrashekar
Contributor

@jackfrancis: we have yet to test support for 1.16. We are planning to work on it in the next couple of weeks.
The WSID dependency will be removed. We made changes to use the default workspace, but we ran into some issues in the new environment. @ganga1980 - could you please comment on this?

@ptylenda

Hey, is there any progress on this one? I see similar behaviour on k8s 1.15 as well.

@ganga1980
Contributor

Hi @ptylenda, we are working on adding container-monitoring addon support for 1.16, and it should already work on 1.15. What is the error you are hitting on k8s 1.15? Can you please share more details about it?

@ptylenda

Hey @ganga1980, briefly, it looks as follows:

  1. AKS Engine cluster definition (following https://github.com/Azure/aks-engine/blob/master/docs/tutorials/containermonitoringaddon.md#1-using-default-log-analytics-workspace):

```json
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.15",
      "kubernetesConfig": {
        "addons": [{
            "name": "container-monitoring",
            "enabled": true
          }
        ]
      }
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "...",
      "vmSize": "Standard_D2_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "windowspool2",
        "count": 2,
        "vmSize": "Standard_D2_v3",
        "availabilityProfile": "AvailabilitySet",
        "osType": "Windows",
        "osDiskSizeGB": 128,
        "extensions": [
            {
                "name": "winrm"
            }
        ]
      }
    ],
    "windowsProfile": {
      "adminUsername": "...",
      "adminPassword": "...",
      "sshEnabled": true
    },
    "linuxProfile": {
      "adminUsername": "...",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "..."
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "...",
      "secret": "..."
    },
    "extensionProfiles": [
      {
        "name": "winrm",
        "version": "v1"
      }
    ]
  }
}
```
  2. OMS agent logs:
> kubectl logs omsagent-r44g6 -n kube-system
getting gid for docker.sock
creating a local docker group
adding omsagent user to local docker group
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2293    0  2293    0     0  23370      0 --:--:-- --:--:-- --:--:-- 23397
k8s-master-70017404-0
not setting customResourceId
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: .oms.opinsights.azure.com
-e error    Error resolving host during the onboarding request. Workspace might be deleted.
****************Start Config Processing********************
Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log'
****************End Config Processing********************
-e error        Missing Workspace ID or Shared Key information for onboarding
 * Starting periodic command scheduler cron
   ...done.
No Workspace
omsagent 1.10.0.1
docker-cimprov 6.0.0.0
nodename: k8s-master-70017404-0
replacing nodename in telegraf config
File Doesnt Exist. Creating file...
Fluent-Bit v0.14.4
Copyright (C) Treasure Data

****************Start Prometheus Config Processing********************
config::No configmap mounted for prometheus custom config, using defaults
****************End Prometheus Config Processing********************
2019-10-31T19:19:15Z I! Starting Telegraf
Telegraf unknown (git: fork 50cd124)
td-agent-bit 0.14.4

So it looks like the issue is a bit different: it complains about the Log Analytics workspace not being present.

@ganga1980
Contributor

Hi @ptylenda, as described in https://github.com/Azure/aks-engine/blob/master/docs/tutorials/containermonitoringaddon.md#1-using-default-log-analytics-workspace, at this point in time this option works only if you create the cluster with the aks-engine deploy command. Did you create the cluster using "aks-engine deploy" with the API definition, or with aks-engine generate followed by a template deployment? If it was the latter, onboarding via this option doesn't work, but the legacy option should. We plan to enable this for aks-engine generate as well, but right now we don't support it.
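The distinction between the two workflows can be sketched as follows (flag names assumed from the aks-engine CLI; the commands are printed rather than executed so the sketch is safe to run anywhere):

```shell
#!/bin/sh
# Sketch of the two aks-engine workflows discussed above.
print_workflows() {
  cat <<'EOF'
# Option 1: default-workspace onboarding of the container-monitoring addon
# only works with the all-in-one deploy command:
aks-engine deploy --api-model kubernetes.json \
  --location eastus --subscription-id <subscription-id>

# Option 2: generate ARM templates for a manual deployment; default-workspace
# onboarding is NOT wired up here, so an explicit workspace must be configured:
aks-engine generate --api-model kubernetes.json
EOF
}

# print_workflows
```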

@ptylenda

ptylenda commented Nov 1, 2019

@ganga1980, thanks for the guidance - my bad, I missed the part about aks-engine deploy support. I have recreated my cluster using the aks-engine deploy command and on k8s 1.15 everything works perfectly! Are there any plans for supporting Windows nodes out of the box?

@ganga1980
Contributor

> @ganga1980, thanks for the guidance - my bad, I missed the part about aks-engine deploy support. I have recreated my cluster using aks-engine deploy command and for k8s 1.15 everything works perfect! Are there any plans for supporting Windows nodes out-of-the-box?

@ptylenda, we do support collecting inventory and perf data for Windows nodes and containers, but not logs. Collecting and ingesting stdout/stderr logs for Windows containers is in the backlog. I can reach out to you as soon as we have a working version.

@stale

stale bot commented Jan 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 1, 2020
@stale stale bot closed this as completed Jan 8, 2020
@bobsyourmom

The problem could be that the service_control script tries invoke-rc.d (which fails inside a Docker container because there are no runlevels) before falling back to the service command (which does work).
So either change the if conditions in the script, or, the lazy way, make the first test look for a non-existent path so the service command is used instead, e.g.:

```shell
# Pointing the first check at a bogus path (/xxxusr) forces the
# service(8) fallback branch:
if [ -x /xxxusr/sbin/invoke-rc.d ]; then
    /usr/sbin/invoke-rc.d $OMSAGENT_WS start
elif [ -x /sbin/service ]; then
    /sbin/service $OMSAGENT_WS start
fi
```
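A minimal runnable sketch of that workaround's control flow (the service name is hypothetical, and the launch commands are echoed rather than executed):

```shell
#!/bin/sh
# Sketch of the suggested workaround: the bogus /xxxusr path never matches,
# so the service(8) branch is taken instead of invoke-rc.d.
OMSAGENT_WS="omsagent-example"   # hypothetical workspace service name

start_omsagent() {
  if [ -x /xxxusr/sbin/invoke-rc.d ]; then
    echo "would run: invoke-rc.d $OMSAGENT_WS start"
  elif [ -x /sbin/service ]; then
    echo "would run: service $OMSAGENT_WS start"
  else
    echo "no service launcher found"
  fi
}

# start_omsagent
```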
