Can't attach to pods in step 09-workload #223

Closed
brk3 opened this issue Sep 1, 2021 · 8 comments

brk3 commented Sep 1, 2021

Hi, thanks for putting this reference guide together, I've found it really useful.

I'm having an issue connecting to pods on the worker nodes as shown in step 9.4:

$ kubectl run curl -n a0008 -i --tty --rm --image=mcr.microsoft.com/azure-cli --limits='cpu=200m,memory=128Mi'
Flag --limits has been deprecated, has no effect and will be removed in the future.
If you don't see a command prompt, try pressing enter.
Error attaching, falling back to logs: error dialing backend: dial tcp 10.240.0.70:10250: i/o timeout
pod "curl" deleted
Error from server: Get "https://aks-npuser01-12568910-vmss000001:10250/containerLogs/a0008/curl/curl": dial tcp 10.240.0.70:10250: i/o timeout

The result is the same with kubectl logs:

$ kubectl logs busybox11
Error from server: Get "https://aks-npuser01-12568910-vmss000001:10250/containerLogs/default/busybox11/busybox11": dial tcp 10.240.0.70:10250: i/o timeout
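
Both errors point at the same hop: a connection to port 10250 on the node (the kubelet port mentioned later in this report) that times out. As a rough sketch, a small filter can pull the address and port out of such messages so repeated failures are easy to compare. The function name `dial_timeout_targets` is hypothetical and not part of kubectl or the reference implementation.

```shell
# Hypothetical helper: reads kubectl error text on stdin and prints each
# distinct "address port" pair that failed with a dial i/o timeout.
dial_timeout_targets() {
  sed -n 's/.*dial tcp \([0-9.]*\):\([0-9]*\): i\/o timeout.*/\1 \2/p' | sort -u
}

# Usage against a live cluster (stderr carries the error text):
#   kubectl logs busybox11 2>&1 | dial_timeout_targets
```

If every failing command reports the same address and port, the problem is more likely in the path to that node's kubelet than in any individual pod.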

Looking at Microsoft's troubleshooting page I see this exact error, but it's unclear to me as of yet which NSG I may need to modify, or if this is indeed the actual problem. I'm assuming it would be the subnet 'snet-clusternodes' but the NSG attached to this is wide open...

Update: I managed to get this working by adding a network rule to the firewall opening all ports in both directions, effectively disabling the firewall. Narrowing it down to the kubelet port (10250) doesn't work, so I still have more questions than answers. Are certain internal cluster comms traversing the firewall? If so, why? I'm also curious as to why other people are not seeing this issue when using these templates.

Member

ckittel commented Sep 10, 2021

Interesting observation. Indeed, we haven't experienced that in any of our recent deployments. The only traffic that should be going through the firewall is anything that is influenced by the UDR (next hop traffic leaving the subnet). So traffic to the managed AKS master nodes, for example. But not traffic between nodes.

We've seen this step fail occasionally and are thinking of just removing it. kubectl run is not really a "feature" that we'd want to encourage users to do, and this step is a bit "confusing" anyway (as it's showing you're supposed to be getting a failure). It's showing the right thing, but still a bit obtuse.

As for your experience with changing the firewall rules for this to work, that's what's got my head scratching. You mentioned changing INBOUND rules as well as part of the solution. Can you say more about that? The FW is not responsible for gating inbound traffic.

@ckittel ckittel self-assigned this Sep 10, 2021
@ckittel ckittel added the bug Something isn't working label Sep 10, 2021
Author

brk3 commented Sep 13, 2021

Hi @ckittel, I've since managed to narrow this down to port 9000. The AKS egress rules docs list this as a requirement "For tunneled secure communication between the nodes and the control plane. This is not required for private clusters."

So the working firewall rule looks like this:

Rule Collection Group: DefaultNetworkRuleCollectionGroup
Rule collection name: AKS-Global-Requirements
Source: ipg-westeurope-AksNodepools
Port: 9000
Protocol: TCP
Destination: AzureCloud.WestEurope (my hub is in west europe)
Action: Allow

My latest understanding is that by default the managed AKS control plane runs separately to the node pools; we don't have access to it. So it would make sense to me that we need this rule to allow them to traverse the subnet and out of the firewall to wherever in Azure the control plane is running. Again, this still doesn't explain why it works for everyone else though :)

> We've seen this step fail occasionally and are thinking of just removing it. kubectl run is not really a "feature" that we'd want to encourage users to do, and this step is a bit "confusing" anyway (as it's showing you're supposed to be getting a failure). It's showing the right thing, but still a bit obtuse.

I understand that kubectl run is not something you generally want to advocate. However, it's still core k8s functionality that I would expect to work in this setup (same with kubectl logs). In my case it's not showing the right thing: I can't get the 403 or any other response, as the terminal can't even attach without the above rule in place.

Appreciate if you have any other thoughts on this, particularly around the port 9000 requirement. Happy to provide any other info as needed.

Member

ckittel commented Sep 13, 2021

Your understanding is correct on the separation of the node pools and the managed portion of AKS. But this deployment already deploys the necessary firewall rules. Did all of the other firewall rules get deployed when you did the deployment? Are you by chance deploying an older version of AKS? Port 9000 was used with tunnelfront, but your cluster should be using konnectivity at this point (which is all 443). One other question: is the cluster being deployed to the same region as the firewall?

It's strange that the first failure shows up this far into the instructions. I can't think of a reason you'd have gotten past all of the prior validation steps only to hit an error at this one. It's weird that this step, rather than any of the other kubectl commands you've had to execute to get to this point, would be the one that trips it up. Something seems "off" here for sure.

Please validate that the firewall's rules all got deployed properly and that the cluster is otherwise healthy before this validation step (this includes that flux was installed and is syncing, etc).

The comment I had about kubectl run is because that feature is getting more and more deprecated with every release. As a matter of fact, for the next k8s version we'll need to change this line to look like something else anyhow, since --limits will be officially broken at that point.

Author

brk3 commented Sep 17, 2021

> Did all of the other firewall rules get deployed when you did the deployment?

If I'm reading correctly I can see two firewall policies created within hub-regionA.json:

  • fw-policies-base
    • DefaultNetworkRuleCollectionGroup
      • org-wide-allowed
  • fw-policies-westeurope
    • DefaultDnatRuleCollectionGroup
    • DefaultApplicationRuleCollectionGroup
      • AKS-Global-Requirements
      • Flux-Requirements
    • DefaultNetworkRuleCollectionGroup
      • AKS-Global-Requirements

They all look to have been successfully deployed in my environment.

> Are you by chance deploying on older version of AKS? Port 9000 was used with tunnelfront, but your cluster should be using konnectivity at this point (which is all 443).

This is interesting, and quite possibly the issue. I haven't specified the cluster version anywhere so I'm just using whatever I get with these templates, which currently seems to be 1.21.2:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"0b17c6315e806a66d507e77760a5d60ab5cccfd8", GitTreeState:"clean", BuildDate:"2021-08-30T01:42:22Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

From what I can see konnectivity was introduced in k8s 1.18, though I don't see any reference to it in the AKS docs. It appears my cluster is using tunnelfront...

$ kubectl get pods -n kube-system | grep -e tunnelfront -e aks-link -e konnectivity
tunnelfront-7b6f685cb5-w4s95             1/1     Running            0          3h14m
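
The check above can be wrapped in a small helper that also reports which egress rule this thread associates with each tunnel component. This is only a sketch. The function name `tunnel_egress_hint` is hypothetical, and the port mapping comes from the comments in this issue (tunnelfront with TCP 9000, aks-link with UDP 1194, konnectivity with port 443), not from an authoritative AKS reference.

```shell
# Hypothetical helper: reads `kubectl get pods -n kube-system` output on stdin
# and prints the tunnel component it finds, plus the egress rule that the
# comments in this issue associate with that component.
tunnel_egress_hint() {
  pods=$(cat)
  case "$pods" in
    *konnectivity*) echo "konnectivity: port 443" ;;
    *aks-link*)     echo "aks-link: UDP 1194" ;;
    *tunnelfront*)  echo "tunnelfront: TCP 9000" ;;
    *)              echo "no tunnel component found" ;;
  esac
}

# Usage against a live cluster:
#   kubectl get pods -n kube-system | tunnel_egress_hint
```

Run against the pod listing shown here, the helper would report tunnelfront, which matches the port 9000 rule that made the attach work.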

> One other question, is the cluster being deployed to the same region as the firewall?

Yes, everything is being deployed in the West Europe region.

Member

ckittel commented Sep 17, 2021

Well that certainly explains what's happening. We removed the port 9000 requirement back in #199 as that was no longer required with new deployments of AKS. That's fascinating that you're getting tunnelfront on your 1.21.2 cluster.

Azure/AKS#2452 was raised back in June by another customer, who also happens to be in West Europe and noticed the same thing. I wonder if, oddly, that rollout never fully completed? Any chance you can "me too" on that linked issue and see if there are any known updates on the konnectivity rollout? Maybe the rollout has only been done in certain regions (and the default regions of this repo happen to have that change, but other regions do not yet).

I feel like we're getting somewhere on this, but we might need AKS product team's input.

Contributor

scaswell-hirez commented Nov 8, 2021

I am experiencing the same issue as documented by brk3, with the following difference: there is no instance of tunnelfront to explain what is going on, and opening port 9000 does not resolve the issue. That is expected in this case, since it's not tunnelfront. In all other respects my results are exactly the same as brk3's.

$ kubectl get pods -n kube-system | grep -e tunnelfront -e aks-link -e konnectivity
aks-link-5bf7d5847b-p72zv                1/1     Running            0          40h
aks-link-5bf7d5847b-vlxh5                1/1     Running            0          40h

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"881d4a5a3c0f4036c714cfb601b377c4c72de543", GitTreeState:"clean", BuildDate:"2021-10-21T05:13:01Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

@scaswell-hirez
Contributor

Following along with https://docs.microsoft.com/en-us/azure/firewall/protect-azure-kubernetes-service as brk3 did, I've tried opening different ports and port combinations. Allowing UDP traffic on port 1194 restored my ability to attach to the pod and execute the curl query.

It appears to me that 1194 is still required. My hub is in East US 2 if it matters.

Member

ckittel commented Nov 30, 2021

It looks like konnectivity is rolling out more broadly now. Since the egress affordances for aks-link have been replaced with the simplified egress rules found in this reference implementation for konnectivity, I'm going to close this issue. But if your region doesn't use konnectivity, then the conversation above will help. It's just a matter of timing between the two, unfortunately.

@ckittel ckittel closed this as completed Nov 30, 2021
ckittel added a commit that referenced this issue Apr 28, 2022
* Allow communication with API server via udp/1194.

References:
#223
https://docs.microsoft.com/en-us/azure/firewall/protect-azure-kubernetes-service

* Return IP address instead of res. ID (acc  to doc)

* Minimal user feedback: echo variables to console.

* ifconfig.io to return IPv4 addr for access policy

* Notes for macOS users, having BSD sed.

* Improvement to comment.

Co-authored-by: Chad Kittel <[email protected]>

* Comment out firewall rule, but add hints.

* Enable FW rule in bicep; remove warning.

Co-authored-by: Chad Kittel <[email protected]>
ckittel added a commit that referenced this issue May 2, 2022
* Update references to 'aks-baseline'.

* Get current branch name and pass as parameter.

* Pass domain name as parameter to curl container.

* Optimize docs for pre-existing AAD group.

- Add bash snippet to set pre-existing group.
- Add hints to skip user creation / member adding group has members.

* Hint for single-tenant deployment.

* Make namespace reader group optional.

* Fix: Print correct variable name.

* Only stage intentionally changed file for commit.

* FIx deployment failures on role lookup

* Add some clarification to docs.

* Make saveenv.sh independent of current directory.

* Append suffix to GITOPS variables...

...making sure they are also written to aks_baseline.env by saveenv.sh.

* export GITOPS variables.

* Revert "FIx deployment failures on role lookup"

This reverts commit 9234b57.

* Revert "Only stage intentionally changed file for commit."

This reverts commit fba516b.

* GITOPS variables are just 'local'.

* Update 01-prerequisites.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 11-validation.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* Update 03-aad.md

Co-authored-by: Chad Kittel <[email protected]>

* GITOPS variables are just 'local'.

Co-authored-by: Chad Kittel <[email protected]>
ckittel added a commit that referenced this issue May 9, 2022
* Replace WAF configuration with WAF policy.

Co-authored-by: Chad Kittel <[email protected]>
ulkeba added a commit to ulkeba/aks-baseline_fork that referenced this issue May 10, 2022
ckittel added a commit that referenced this issue May 10, 2022
* Fix: Peering name length for long region names.

* Update networking/spoke-BU0001A0008.bicep

Co-authored-by: Chad Kittel <[email protected]>

* Change: Replace AppGW WAF config with WAF policy resource. (#316)

Co-authored-by: Chad Kittel <[email protected]>