
Unable to get pod logs from AKS cluster #97

Closed

nitinkhandelwal26 opened this issue Oct 28, 2020 · 16 comments

Labels
question (Further information is requested)

Comments

nitinkhandelwal26 commented Oct 28, 2020

Hello Team,

I'm trying to get pod logs using:
kubectl logs csi-secrets-store-9wr95 -n cluster-baseline-settings -c secrets-store
and getting the output below:
Error from server: Get https://aks-npuser01-42213062-vmss00000c:10250/containerLogs/cluster-baseline-settings/csi-secrets-store-9wr95/secrets-store: dial tcp 10.10.128.197:10250: i/o timeout

Meanwhile, in the portal it looks like this (screenshots: CSI_POD, get-pods-clusterbaseline).

I don't know why I am unable to fetch logs from any pod. Can you please help me here?
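
For reference, one way to narrow this down is to probe the kubelet port of the unreachable node from inside the cluster. This is only a sketch: the node IP is taken from the error above, the nicolaka/netshoot image is an assumption (it must be pullable through your egress firewall and allowed by any image policy), and nc flags can vary by build.

# Sketch only: test whether TCP 10250 on the node from the error above is reachable
# from a throwaway pod inside the cluster.
kubectl run nsg-probe --rm -it --restart=Never --image=nicolaka/netshoot -- \
  nc -zv -w 5 10.10.128.197 10250

If the probe times out, something in the path (an NSG or a firewall rule) is dropping the traffic, rather than the kubelet itself being down.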


nitinkhandelwal26 commented Oct 28, 2020

Matrix server pod: (screenshot: matrix-server-pods)

Traefik deployment: (screenshot: deploydescribetraffic)

Traefik pods in portal: (screenshot: TraefikPods)

Traefik kubectl logs error: (screenshot: TraefikLogError)

User permissions: (screenshot: AKS permissions)

nitinkhandelwal26 (Author) commented:

This issue seems to be related:
Azure/AKS#1544

nitinkhandelwal26 (Author) commented:

We had created a subnet-level NSG, and when we removed the NSG from the subnet it started working.
Resolved, thanks.

Could you please guide us on which ports/rules should be applied to the subnet-level NSG so that this still works with the NSG in place?
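
Not an authoritative ruleset, but as an illustration of the kind of opening a subnet-level NSG must preserve, here is a sketch of a single inbound rule that keeps intra-VNet kubelet traffic on TCP 10250 allowed. The resource group and NSG names are placeholders, the default VirtualNetwork rules normally cover this unless a custom deny rule overrides them, and the full set of required ports is in the AKS egress documentation.

# Placeholder names (my-rg, nodepools-nsg); one example rule, not a complete ruleset.
az network nsg rule create \
  --resource-group my-rg \
  --nsg-name nodepools-nsg \
  --name AllowVnetKubeletInbound \
  --priority 200 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes VirtualNetwork \
  --source-port-ranges '*' \
  --destination-address-prefixes VirtualNetwork \
  --destination-port-ranges 10250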

neilpeterson (Contributor) commented:

@nitinkhandelwal26 I see that you have closed this issue. Did you find the information that you need?

nitinkhandelwal26 (Author) commented:

Thanks @neilpeterson for the support. No, I still haven't found that information, but my issue got resolved by removing the subnet-level NSG; I still need to figure out the required rules. If you have any information on that, it would be really helpful.
@ckittel is working on a PR related to dependencies on public images; that might help here, since we would not need to reach public repos. But I still don't know which ports and IPs we need to allow in the subnet-level NSG to fetch logs from pods in the AKS cluster.


ckittel commented Oct 29, 2020

We don't document any subnet level NSG specific port requirements that I'm aware of outside of our general egress guidance. Obviously AKS applies NSG rules to the NICs in your cluster, but if you're applying at the subnet level as you said, your responsibility is to ensure they don't interfere with normal healthy traffic.

If you find a ruleset that works for you, do please share. I'm guessing your ruleset will end up looking similar to the documented ruleset in that guidance, plus whatever additional cluster configuration demands even more openings.

Out of curiosity, what specific problem are you looking to solve with added subnet-level NSG rules that the NIC-level NSG rules + egress via FW doesn't already solve for?
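
As a point of comparison when building such a ruleset, you can inspect the NSG that AKS manages for the node NICs in the node resource group. The names below are placeholders; the managed NSG typically follows an aks-agentpool-*-nsg naming pattern.

# Placeholder resource group / NSG names; lists what AKS already allows at the NIC level.
az network nsg list --resource-group MC_my-rg_my-cluster_eastus2 -o table
az network nsg rule list --resource-group MC_my-rg_my-cluster_eastus2 \
  --nsg-name aks-agentpool-12345678-nsg -o table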

ckittel added the "question" label Oct 29, 2020
nitinkhandelwal26 (Author) commented:

@ckittel Our network team, which provides the hub-spoke networking, implemented this.

nitinkhandelwal26 (Author) commented:

I will surely share the ruleset once we fix that; for now we removed the NSGs to make it work. Thank you @ckittel for your support.


Cogax commented Sep 30, 2021

I also have this issue. The kubelet (port 10250) is not reachable from the kube-apiserver. I can get pods, but I can't access logs (timeout error as mentioned above). I added inbound rules for the ports mentioned here https://docs.microsoft.com/en-us/azure/aks/limit-egress-traffic and here https://kubernetes.io/docs/reference/ports-and-protocols/ to the spoke-nodepools NSG. It's still not working. What's wrong with my approach?
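
One way to tell whether it's an NSG or the hub firewall dropping this traffic is to look for denies in the firewall's diagnostic logs. This is a sketch only: it assumes firewall diagnostics are flowing to Log Analytics, the workspace GUID is a placeholder, and the query targets the classic AzureDiagnostics schema.

# Placeholder workspace ID; may require the log-analytics CLI extension.
az monitor log-analytics query \
  --workspace 00000000-0000-0000-0000-000000000000 \
  --analytics-query "AzureDiagnostics | where Category == 'AzureFirewallNetworkRule' | where msg_s has 'Deny' | take 20"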


ckittel commented Sep 30, 2021

@Cogax thanks for reporting.

We have a FAQ entry on this error, but it's not very detailed.

We've seen some inconsistency in the required network rules depending on whether your cluster is running konnectivity or not. It seems to depend on which region you deploy into (as far as we can tell so far).

I'm curious when you run kubectl get deployments -n kube-system does konnectivity-agent show up for you, or do you see tunnelfront instead?

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
azure-policy           1/1     1            1           46h
azure-policy-webhook   1/1     1            1           46h
coredns                2/2     2            2           46h
coredns-autoscaler     1/1     1            1           46h
konnectivity-agent     1/1     1            1           46h
metrics-server         1/1     1            1           46h
omsagent-rs            1/1     1            1           46h

I just deployed this cluster two days ago and I can freely get logs (see examples below) from various pods across the two node pools without hitting that error. But I know we've run into the error you're seeing before as well, so any help in triage will be appreciated. Let's start with konnectivity vs. tunnelfront and see if that narrows this down.

Also, what region did you deploy into?

cc: @abossard

Examples:

❯ kubectl logs aspnetapp-deployment-56b77c4f79-tptcj -n a0008
{"EventId":50,"LogLevel":"Warning","Category":"Microsoft.AspNetCore.DataProtection.Repositories.EphemeralXmlRepository","Message":"Using an in-memory repository. Keys will not be persisted to storage.","State":{"Message":"Using an in-memory repository. Keys will not be persisted to storage.","{OriginalFormat}":"Using an in-memory repository. Keys will not be persisted to storage."}}

❯ kubectl logs aad-pod-identity-nmi-xsbc2 -n cluster-baseline-settings
I0928 14:04:33.062515       1 main.go:111] running NMI in namespaced mode: false
I0928 14:04:33.062526       1 nmi.go:53] initializing in standard mode
I0928 14:04:33.062533       1 probes.go:41] initialized health probe on port 8085
I0928 14:04:33.062539       1 probes.go:44] started health probe

❯ kubectl logs azure-policy-webhook-59b69cfc84-pq42w -n kube-system
{"level":"info","ts":"2021-09-28T13:56:30.436919146Z","msg":"fetching secret","log-id":"0008c6fa-b-1","method":"github.com/Azure/azure-policy-kubernetes/pkg/webhook.(*k8sClient).getSecret"}
{"level":"info","ts":"2021-09-28T13:56:30.455736905Z","msg":"fetching validating webhook configuration","log-id":"0008c6fa-b-2","method":"github.com/Azure/azure-policy-kubernetes/pkg/webhook.(*k8sClient).getValidationWebhookConfig"}

ckittel reopened this Sep 30, 2021

Cogax commented Sep 30, 2021

@ckittel

Thanks for your quick response. I found your FAQ article while researching this issue, but it didn't help me. I opened all inbound traffic (any source, destination, port, etc.) on all NSGs, but it had no effect.

I removed the whole setup, so I can't give you exact answers. I will recreate it later and check whether the issue still exists. Some information I have at the moment:

  • region was westeurope for everything
  • the Azure subscription is a test subscription with limited quotas
  • because of quota limitations, I used only 1 node (min=1, max=1) for the system node pool as well as the user node pool:
"agentPoolProfiles": [
{
  "name": "npsystem",
  "count": 1,
  "vmSize": "Standard_DS2_v2",
  "osDiskSizeGB": 80,
  "osDiskType": "Ephemeral",
  "osType": "Linux",
  "minCount": 1,
  "maxCount": 1,
  "vnetSubnetID": "[variables('vnetNodePoolSubnetResourceId')]",
  "enableAutoScaling": true,
  "type": "VirtualMachineScaleSets",
  "mode": "System",
  "scaleSetPriority": "Regular",
  "scaleSetEvictionPolicy": "Delete",
  "orchestratorVersion": "[parameters('kubernetesVersion')]",
  "enableNodePublicIP": false,
  "maxPods": 30,
  "availabilityZones": ["1", "2", "3"],
  "upgradeSettings": {
    "maxSurge": "33%"
  },
  "nodeTaints": ["CriticalAddonsOnly=true:NoSchedule"]
},
{
  "name": "npuser01",
  "count": 1,
  "vmSize": "Standard_DS3_v2",
  "osDiskSizeGB": 120,
  "osDiskType": "Ephemeral",
  "osType": "Linux",
  "minCount": 1,
  "maxCount": 1,
  "vnetSubnetID": "[variables('vnetNodePoolSubnetResourceId')]",
  "enableAutoScaling": true,
  "type": "VirtualMachineScaleSets",
  "mode": "User",
  "scaleSetPriority": "Regular",
  "scaleSetEvictionPolicy": "Delete",
  "orchestratorVersion": "[parameters('kubernetesVersion')]",
  "enableNodePublicIP": false,
  "maxPods": 30,
  "availabilityZones": ["1", "2", "3"],
  "upgradeSettings": {
    "maxSurge": "33%"
  }
}
]
  • I did not apply Flux. Instead, I applied these YAML files manually:
kubectl apply -f kubernetes/new-aks/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/cluster-baseline-settings/kured.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/cluster-baseline-settings/aad-pod-identity.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/a0008/ingress-network-policy.yaml
  • In my console I had a list of all pods. The issue appeared when I wanted to deploy the Traefik components. They were stuck in ContainerCreating and Pending (see the describe sketch after this list). That was the reason I wanted to check the logs of the aad-pod-identity-nmi pod. This is the list:
NAMESPACE                   NAME                                                   READY   STATUS              RESTARTS   AGE
a0008                       traefik-ingress-controller-54ff76688d-c4n2t            0/1     ContainerCreating   0          13h
a0008                       traefik-ingress-controller-54ff76688d-nm5k9            0/1     Pending             0          15h
cluster-baseline-settings   aad-pod-identity-mic-59545c8bc7-75d66                  1/1     Running             0          13h
cluster-baseline-settings   aad-pod-identity-mic-59545c8bc7-7mbx2                  1/1     Running             0          13h
cluster-baseline-settings   aad-pod-identity-nmi-mzvz4                             1/1     Running             0          13h
cluster-baseline-settings   kured-swvrh                                            1/1     Running             0          13h
cluster-baseline-settings   kured-xghl8                                            1/1     Running             0          13h
default                     node-debugger-aks-npuser01-58613137-vmss000001-q8fq9   1/1     Running             0          29m
gatekeeper-system           gatekeeper-audit-6856c7d886-clp5d                      1/1     Running             0          13h
gatekeeper-system           gatekeeper-controller-7bff99d7dc-2dn28                 1/1     Running             0          13h
gatekeeper-system           gatekeeper-controller-7bff99d7dc-hr4xd                 1/1     Running             0          13h
kube-system                 aks-link-79f56b9565-5n2v8                              1/1     Running             0          149m
kube-system                 aks-link-79f56b9565-dsgh6                              1/1     Running             0          149m
kube-system                 aks-secrets-store-csi-driver-9lmpv                     3/3     Running             2          13h
kube-system                 aks-secrets-store-csi-driver-vwzp9                     3/3     Running             2          13h
kube-system                 aks-secrets-store-provider-azure-vswz4                 1/1     Running             0          13h
kube-system                 aks-secrets-store-provider-azure-zq49c                 1/1     Running             0          13h
kube-system                 azure-cni-networkmonitor-6gnqm                         1/1     Running             0          13h
kube-system                 azure-cni-networkmonitor-z9gw6                         1/1     Running             0          13h
kube-system                 azure-ip-masq-agent-c2rtt                              1/1     Running             0          13h
kube-system                 azure-ip-masq-agent-lh9vp                              1/1     Running             0          13h
kube-system                 azure-npm-4qrs2                                        1/1     Running             0          13h
kube-system                 azure-npm-btqv7                                        1/1     Running             0          13h
kube-system                 azure-policy-6f77469b44-6pn2w                          1/1     Running             0          13h
kube-system                 azure-policy-webhook-59b69cfc84-2gl7l                  1/1     Running             0          13h
kube-system                 coredns-86846667d7-lqbsv                               1/1     Running             0          13h
kube-system                 coredns-86846667d7-r6wtz                               1/1     Running             0          13h
kube-system                 coredns-autoscaler-5f85dc856b-xzfkt                    1/1     Running             0          13h
kube-system                 csi-azuredisk-node-l4rqh                               3/3     Running             0          13h
kube-system                 csi-azuredisk-node-zd676                               3/3     Running             0          13h
kube-system                 csi-azurefile-node-h5shc                               3/3     Running             0          13h
kube-system                 csi-azurefile-node-wxh45                               3/3     Running             0          13h
kube-system                 kube-proxy-cv2wr                                       1/1     Running             0          13h
kube-system                 kube-proxy-qgpbn                                       1/1     Running             0          13h
kube-system                 metrics-server-6bc97b47f7-g5hwr                        0/1     Running             249        13h
kube-system                 omsagent-4sp2t                                         1/1     Running             0          13h
kube-system                 omsagent-f8sxs                                         1/1     Running             0          13h
kube-system                 omsagent-rs-7c5979787c-849m4                           1/1     Running             0          13h
  • I don't know if there was a konnectivity-agent running
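
For the pods stuck in Pending/ContainerCreating, the scheduler and kubelet events usually name the blocker (quota, taints, image pulls through the firewall, CSI mounts). A quick look, using the pod names from the list above:

# Pod and namespace names taken from the list above.
kubectl describe pod traefik-ingress-controller-54ff76688d-nm5k9 -n a0008
kubectl get events -n a0008 --sort-by=.lastTimestamp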

Hope that helps. I will recreate the whole setup I did and update this issue later.


ckittel commented Sep 30, 2021

I think we're onto something. westeurope has been the problem every time so far. You've got aks-link in the output above, which means your cluster is indeed not running konnectivity.

We made a change to this repo to migrate to konnectivity's network rulesets (basically everything over 443) instead of what tunnelfront/aks-link required (see #199, specifically the removal of the rules in hub-regionA.json). If I had to guess, that's probably what's causing this. Usually it manifests as other errors as well, not just log fetching, which is an interesting wrinkle here.

See the related conversation at #223, where @brk3 had a similar observation (also in westeurope) and found a workaround by adding back the rules we removed when AKS moved to konnectivity (comment: #223 (comment)).


ccyflai commented Nov 8, 2021

I encountered the same problem getting pod logs until I allowed the node IPs to reach port 9000 of the API server via a network rule in the hub firewall. This is documented below; I would suggest amending the ARM template hub-regionA.json accordingly.

https://docs.microsoft.com/en-us/azure/aks/limit-egress-traffic#azure-global-required-network-rules
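
For anyone else hitting this before konnectivity reaches their region, here is a sketch of the kind of network rule being described. The firewall name, resource group, collection name, region, and source address range are placeholders, the azure-firewall CLI extension is assumed, and per the same doc UDP 1194 may also be needed for the tunnel.

# Placeholder names/addresses; re-adds the tunnel rule for non-konnectivity clusters.
az network firewall network-rule create \
  --resource-group my-hub-rg \
  --firewall-name my-hub-firewall \
  --collection-name aks-tunnel \
  --priority 200 \
  --action Allow \
  --name allow-tunnel-tcp-9000 \
  --protocols TCP \
  --source-addresses 10.240.0.0/22 \
  --destination-addresses AzureCloud.EastUS2 \
  --destination-ports 9000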


ckittel commented Nov 8, 2021

@ccyflai -- can I ask what region you were deploying to? Just want to see if the pattern continues to emerge here.

Glad you added that extra firewall rule to proceed. Don't forget to remove it once konnectivity is used within your cluster, as it won't be necessary anymore.
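
A quick way to notice when that switchover happens, so the temporary rule can be removed, is to check which tunnel component the cluster is running after each upgrade:

# Shows konnectivity-agent once the cluster has been migrated off tunnelfront/aks-link.
kubectl get deployments -n kube-system | grep -E 'konnectivity|tunnelfront|aks-link'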


ccyflai commented Nov 8, 2021

I deployed in southeastasia.


ckittel commented Nov 30, 2021

It looks like konnectivity is rolling out more broadly now. Since the egress allowances for aks-link have been replaced in this reference implementation with the simplified egress rules for konnectivity, I'm going to close this issue. If your region doesn't use konnectivity yet, the conversation above should help; it's just a matter of timing between the two, unfortunately.
