This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

test: disable container-monitoring tests #2098

Closed

Conversation

jackfrancis
Member

Reason for Change:

Disable container-monitoring E2E tests while the test signal is consistently failing.

Issue Fixed:

Requirements:

Notes:

@acs-bot acs-bot added the size/XS label Oct 3, 2019
@acs-bot

acs-bot commented Oct 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Member Author

@rashmichandrashekar this is the consistent failure we're seeing across all Kubernetes versions:

2019/10/03 12:10:42 #### $ k logs omsagent-rs-7fb45f86b-8ql7l -c omsagent -n kube-system completed in 459.103635ms
 2019/10/03 12:10:42 
 getting gid for docker.sock
 creating a local docker group
 adding omsagent user to local docker group
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  2305    0  2305    0     0   249k      0 --:--:-- --:--:-- --:--:--  281k
 k8s-agentpool-14094549-vmss000000
 not setting customResourceId
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 ****************Start Config Processing********************
 Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log' 
 ****************End Config Processing********************
 No arg for -d option
 
 Maintenance tool for OMS:
 Onboarding:
 omsadmin.sh -w <workspace id> -s <shared key> [-d <top level domain>]
 
 List Workspaces:
 omsadmin.sh -l
 
 Remove Workspace:
 omsadmin.sh -x <workspace id>
 
 Remove All Workspaces:
 omsadmin.sh -X
 
 Update workspace configuration and folder structure to multi-homing schema:
 omsadmin.sh -U
 
 Onboard the workspace with a multi-homing marker. The workspace will be regarded as secondary.
 omsadmin.sh -m <multi-homing marker>
 
 Define proxy settings ('-u' will prompt for password):
 omsadmin.sh [-u user] -p host[:port]
 
 Azure resource ID:
 omsadmin.sh -a <Azure resource ID>
 
 Detect if omiserver is listening to SCOM port:
 omsadmin.sh -o
  * Starting periodic command scheduler cron
    ...done.
 No Workspace
 omsagent 1.10.0.1
 docker-cimprov 6.0.0.0
 nodename: k8s-agentpool-14094549-vmss000000
 replacing nodename in telegraf config
 File Doesnt Exist. Creating file...
 Fluent-Bit v0.14.4
 Copyright (C) Treasure Data
 
 ****************Start Prometheus Config Processing********************
 config::No configmap mounted for prometheus custom config, using defaults
 ****************End Prometheus Config Processing********************
 Telegraf unknown (git: fork 50cd124)
 2019-10-03T12:10:25Z I! Starting Telegraf 
 td-agent-bit 0.14.4
 
 2019/10/03 12:10:42 $ k describe pod omsagent-rs-7fb45f86b-8ql7l -n kube-system
 2019/10/03 12:10:42 Error trying to run 'kubectl exec':grep: /var/opt/microsoft/omsagent/log/omsagent.log: No such file or directory
 command terminated with exit code 2
 
 2019/10/03 12:10:42 Command:kubectl exec omsagent-rs-7fb45f86b-8ql7l -n kube-system [grep -i kubePodInventoryEmitStreamSuccess /var/opt/microsoft/omsagent/log/omsagent.log] 
 2019/10/03 12:10:42 #### $ k describe pod omsagent-rs-7fb45f86b-8ql7l -n kube-system completed in 512.680172ms
 2019/10/03 12:10:42 
 Name:           omsagent-rs-7fb45f86b-8ql7l
 Namespace:      kube-system
 Priority:       0
 Node:           k8s-agentpool-14094549-vmss000000/10.240.0.4
 Start Time:     Thu, 03 Oct 2019 11:46:26 +0000
 Labels:         pod-template-hash=7fb45f86b
                 rsName=omsagent-rs
 Annotations:    agentVersion: 1.10.0.1
                 dockerProviderVersion: 6.0.0-0
                 kubernetes.io/psp: privileged
                 schema-versions: v1
 Status:         Running
 IP:             10.240.0.26
 Controlled By:  ReplicaSet/omsagent-rs-7fb45f86b
 Containers:
   omsagent:
     Container ID:  docker://1ef97927ec55d86abd976b0dc45a3c179b706635a2505f2f8d08a19c8b8c5020
     Image:         mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019
     Image ID:      docker-pullable://mcr.microsoft.com/azuremonitor/containerinsights/ciprod@sha256:0f798cb7d56931b231f71e38e7fa5bf898b69e611247a566701f70a5f29a9799
     Ports:         25225/TCP, 25224/UDP
     Host Ports:    0/TCP, 0/UDP
     State:         Running
       Started:     Thu, 03 Oct 2019 12:10:21 +0000
     Last State:    Terminated
       Reason:      Error
       Message:     agent or fluentbit not running
 
       Exit Code:    143
       Started:      Thu, 03 Oct 2019 12:06:21 +0000
       Finished:     Thu, 03 Oct 2019 12:10:19 +0000
     Ready:          True
     Restart Count:  6
     Limits:
       cpu:     150m
       memory:  600Mi
     Requests:
       cpu:     75m
       memory:  225Mi
     Liveness:  exec [/bin/bash -c /opt/livenessprobe.sh] delay=60s timeout=1s period=60s #success=1 #failure=3
     Environment:
       NODE_IP:             (v1:status.hostIP)
       ACS_RESOURCE_NAME:  kubernetes-eastus-90843
       CONTROLLER_TYPE:    ReplicaSet
       ISTEST:             true
     Mounts:
       /etc/config from omsagent-rs-config (rw)
       /etc/config/settings from settings-vol-config (rw)
       /etc/kubernetes/host from azure-json-path (rw)
       /etc/omsagent-secret from omsagent-secret (ro)
       /var/lib/docker/containers from containerlog-path (rw)
       /var/log from host-log (rw)
       /var/run/host from docker-sock (rw)
       /var/run/secrets/kubernetes.io/serviceaccount from omsagent-token-w2wgq (ro)
 Conditions:
   Type              Status
   Initialized       True 
   Ready             True 
   ContainersReady   True 
   PodScheduled      True 
 Volumes:
   docker-sock:
     Type:          HostPath (bare host directory volume)
     Path:          /var/run
     HostPathType:  
   container-hostname:
     Type:          HostPath (bare host directory volume)
     Path:          /etc/hostname
     HostPathType:  
   host-log:
     Type:          HostPath (bare host directory volume)
     Path:          /var/log
     HostPathType:  
   containerlog-path:
     Type:          HostPath (bare host directory volume)
     Path:          /var/lib/docker/containers
     HostPathType:  
   azure-json-path:
     Type:          HostPath (bare host directory volume)
     Path:          /etc/kubernetes
     HostPathType:  
   omsagent-secret:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  omsagent-secret
     Optional:    false
   omsagent-rs-config:
     Type:      ConfigMap (a volume populated by a ConfigMap)
     Name:      omsagent-rs-config
     Optional:  false
   settings-vol-config:
     Type:      ConfigMap (a volume populated by a ConfigMap)
     Name:      container-azm-ms-agentconfig
     Optional:  true
   omsagent-token-w2wgq:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  omsagent-token-w2wgq
     Optional:    false
 QoS Class:       Burstable
 Node-Selectors:  beta.kubernetes.io/os=linux
                  kubernetes.io/role=agent
 Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
 Events:
   Type     Reason     Age                   From                                        Message
   ----     ------     ----                  ----                                        -------
   Normal   Scheduled  24m                   default-scheduler                           Successfully assigned kube-system/omsagent-rs-7fb45f86b-8ql7l to k8s-agentpool-14094549-vmss000000
   Normal   Pulling    24m                   kubelet, k8s-agentpool-14094549-vmss000000  Pulling image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019"
   Normal   Pulled     24m                   kubelet, k8s-agentpool-14094549-vmss000000  Successfully pulled image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019"
   Normal   Created    12m (x4 over 23m)     kubelet, k8s-agentpool-14094549-vmss000000  Created container omsagent
   Normal   Started    12m (x4 over 23m)     kubelet, k8s-agentpool-14094549-vmss000000  Started container omsagent
   Normal   Pulled     12m (x3 over 20m)     kubelet, k8s-agentpool-14094549-vmss000000  Container image "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod07092019" already present on machine
   Normal   Killing    8m23s (x4 over 20m)   kubelet, k8s-agentpool-14094549-vmss000000  Container omsagent failed liveness probe, will be restarted
   Warning  Unhealthy  2m23s (x16 over 22m)  kubelet, k8s-agentpool-14094549-vmss000000  Liveness probe failed:

@rashmichandrashekar
Contributor

@jackfrancis : looks like the workspace is removed from the addonconfig.
We cannot get rid of this right now because of the limitations we have in the test framework.
@ganga1980: could you please explain?

@codecov

codecov bot commented Oct 3, 2019

Codecov Report

Merging #2098 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #2098   +/-   ##
======================================
  Coverage    76.6%   76.6%           
======================================
  Files         135     135           
  Lines       20606   20606           
======================================
  Hits        15786   15786           
  Misses       3897    3897           
  Partials      923     923

@ganga1980
Contributor

@jackfrancis : looks like the workspace is removed from the addonconfig.
We cannot get rid of this right now because of the limitations we have in the test framework.
@ganga1980: could you please explain?

@jackfrancis, in the current implementation, creating and fetching the workspace ID and keys is done as part of the aks-engine deploy command, since we need to make network calls. I have spent a good amount of time in the aks-engine code trying to figure out whether we can make network calls during the generate command. From reviewing the code, I didn't see any existing network calls made during the generate command. If you think this is possible, we can update the code to support the generate code path as well; otherwise, the container-monitoring tests need to be updated to use the deploy command path.

@jackfrancis
Member Author

@ganga1980 so the changes introduced in #2031 mean that the container-monitoring addon is no longer able to work out of the box unless you create your cluster via aks-engine deploy?

@ganga1980
Contributor

@ganga1980 so the changes introduced in #2031 mean that the container-monitoring addon is no longer able to work out of the box unless you create your cluster via aks-engine deploy?

@jackfrancis, no, it works via the aks-engine generate command too, but the workspace ID and key have to be specified in the container-monitoring add-on profile in the definition file. For the aks-engine deploy command, it works without specifying the workspace ID and key, since these will be auto-populated by making network call(s).
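
For illustration, here is a minimal sketch of a definition file that pre-populates those values for the generate path. Only the relevant addon section is shown; the config key names (workspaceGuid and workspaceKey) and the base64 encoding are assumptions based on the addon's documented configuration, and the values are placeholders.

{
    "apiVersion": "vlabs",
    "properties": {
        "orchestratorProfile": {
            "orchestratorType": "Kubernetes",
            "kubernetesConfig": {
                "addons": [
                    {
                        "name": "container-monitoring",
                        "enabled": true,
                        "config": {
                            "workspaceGuid": "<base64-encoded workspace GUID>",
                            "workspaceKey": "<base64-encoded workspace shared key>"
                        }
                    }
                ]
            }
        }
    }
}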

@jackfrancis
Member Author

@ganga1980 this is the cluster config we maintain to test this addon:

https://github.com/Azure/aks-engine/blob/master/test/e2e/test_cluster_configs/container_monitoring.json

Could you suggest how we can evolve that cluster config to accommodate generate?

Thanks!

@ganga1980
Contributor

@ganga1980 this is the cluster config we maintain to test this addon:

https://github.com/Azure/aks-engine/blob/master/test/e2e/test_cluster_configs/container_monitoring.json

Could you suggest how we can evolve that cluster config to accommodate generate?

Thanks!

@jackfrancis, sure, let me investigate this further and see whether the generate command (same as deploy) can be simplified to fetch the workspace ID and keys automatically for the container-monitoring add-on. @rashmichandrashekar and I are investigating the current test failures and will get the fix out soon if the issue turns out to be related to my recent change.

@devigned
Member

devigned commented Oct 3, 2019

@ganga1980 why not just add the workspace creation into the deployment template? By doing that, one could ensure the workspace exists and is consistent between deploy and generate?

@jackfrancis
Member Author

@ganga1980 I agree with your original assessment that fetching stuff over the wire should not happen via generate. So I think we just need to pre-populate the test api model w/ the correct data so we can test this scenario.

@jackfrancis
Member Author

Also what @devigned said makes sense. Is there a reason we can't do the workspace retrieval at runtime as the k8s layer bootstraps?

@ganga1980
Contributor

@ganga1980 why not just add the workspace creation into the deployment template? By doing that, one could ensure the workspace exists and is consistent between deploy and generate?
@devigned, yes, workspace creation can be achieved with the template, but there is no template way to fetch the workspace key of a created or existing workspace.

@devigned
Member

devigned commented Oct 3, 2019

@ganga1980 I'm pretty sure you can call list keys in the template and return the workspace key. I might still have a template laying around that does that. I'll look for it.

@ganga1980
Contributor

@ganga1980 I'm pretty sure you can call list keys in the template and return the workspace key. I might still have a template laying around that does that. I'll look for it.

@devigned, if we can achieve fetching the keys with the template, then I think we have a solution. I'd appreciate it if you can share the template so that I can play with it and verify that it works for all the scenarios, such as a new or an existing workspace.

@devigned
Member

devigned commented Oct 3, 2019

"resources": [
        {
            "apiVersion": "2017-03-15-preview",
            "location": "[parameters('omsWorkspaceRegion')]",
            "name": "[parameters('omsWorkspaceName')]",
            "type": "Microsoft.OperationalInsights/workspaces",
            "comments": "Log Analytics workspace",
            "properties": {
                "sku": {
                        "name": "Standard"
                    }
            }
        }
    ],
    "outputs": {
        "workspaceId": {
            "type": "string",
            "value": "[reference(resourceId('Microsoft.OperationalInsights/workspaces', parameters('omsWorkspaceName'))).customerId]"
          },
        "workspacePrimaryKey": {
            "type": "string",
            "value": "[listKeys(resourceId('Microsoft.OperationalInsights/workspaces', parameters('omsWorkspaceName')), '2017-03-15-preview').primarySharedKey]"
          }
    }

@ganga1980 I believe the above snippet should do what's needed. It'd probably be good to document it somewhere too. It wasn't super easy to figure out.
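
For completeness, here is a minimal sketch of the outer template wrapper the snippet above assumes: the parameter names match those referenced in the snippet, while the defaultValue and metadata descriptions are illustrative placeholders. The snippet's resources and outputs sections would slot into the empty placeholders below.

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "omsWorkspaceName": {
            "type": "string",
            "metadata": { "description": "Name of the Log Analytics workspace to create or reference" }
        },
        "omsWorkspaceRegion": {
            "type": "string",
            "defaultValue": "[resourceGroup().location]",
            "metadata": { "description": "Region for the Log Analytics workspace" }
        }
    },
    "resources": [],
    "outputs": {}
}

Once deployed, the workspaceId and workspacePrimaryKey outputs can be read back from the deployment (for example with the Azure CLI or an SDK) and fed into the container-monitoring addon configuration.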

@ganga1980
Contributor

@devigned, thanks, David, for sharing the template snippet. Yes, I agree we should document it. Let me play with it and circle back on this.

@jackfrancis
Member Author

Fixed in #2109

@jackfrancis jackfrancis closed this Oct 7, 2019
@jackfrancis jackfrancis deleted the e2e-container-monitoring-disable branch October 7, 2019 20:00