Konnectivity implementation timeline #2452

Closed
MattJeanes opened this issue Jul 20, 2021 · 29 comments

Labels: Feedback General feedback

Comments

@MattJeanes

MattJeanes commented Jul 20, 2021

What happened:

Release 2021-06-17 states that, starting with the following week's release, Konnectivity will replace aks-link and tunnelfront for connecting the control plane and nodes; however, this does not seem to have happened.

What you expected to happen:

I would have expected this rollout to have completed by now as per the timeline given on the release.

How to reproduce it (as minimally and precisely as possible):

Create an AKS cluster in the West Europe or East US region, with Uptime SLA on or off. There should be konnectivity pods, but no such pods exist; instead, aks-link / tunnelfront pods exist in kube-system.
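
For anyone else checking, a quick way to see which tunnel component a cluster is currently running (assuming the default component names in kube-system):

# only one of these tunnel families should be present in a given cluster
kubectl get pods -n kube-system -o wide | grep -E 'konnectivity-agent|aks-link|tunnelfront'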

Anything else we need to know?:
I have a ticket open with Azure support, who tell me that the Konnectivity implementation should resolve issues we are seeing with the aks-link / tunnelfront tunnels occasionally dropping, which results in AggregatedAPIDown/AggregatedAPIErrors alerts firing from our Prometheus instances.

I am told this deployment has been delayed, so I am raising this issue for myself and others to publicly track the implementation progress. For example, I will be reducing our alerting thresholds to compensate for these errors, but I would like to increase them back up once this deployment is complete.

I would also like to know whether this upgrade requires upgrading the Kubernetes cluster or will be done automatically in the background, and, if it does require an upgrade, which versions include the new Konnectivity components.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"5a7170d3bbf1731483f4844c2222c70501717341", GitTreeState:"clean", BuildDate:"2021-05-25T17:36:45Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?): 6 nodes, 2 system, 3 linux, 1 windows
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): .NET Core microservices
  • Others: Prometheus, Grafana, Falco, KEDA, Kured, NGINX, Cert Manager, VPA, Kubernetes Dashboard
@ghost ghost added the triage label Jul 20, 2021
@ghost

ghost commented Jul 20, 2021

Hi MattJeanes, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@justindavies
Contributor

Hi there @MattJeanes, we are slightly delayed in getting this out. We will update this issue when it is on its way.

@justindavies justindavies added the Feedback General feedback label Jul 20, 2021
@ghost ghost removed the triage label Jul 20, 2021
@MattJeanes
Author

That's perfect, thank you very much. Do you have an estimated timeline for this?

@justindavies justindavies self-assigned this Jul 20, 2021
@justindavies
Contributor

I wish I could be more precise on this, but at the moment the best I can say is that this will be rolled out in the short to medium term. We'll update this issue once we know it's on the release track.

@ghost

ghost commented Sep 18, 2021

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@brk3

brk3 commented Sep 20, 2021

Hi, just posting an update on this in case it times out. I'm seeing tunnelfront being used in clusters I'm creating in the West Europe region, v1.21.2.

This is causing issues as some reference architectures are now assuming that all clusters have moved to konnectivity.

Is there anywhere I can read the official stance on tunnelfront vs aks-link vs konnectivity from the AKS team?

@ghost ghost removed the stale Stale issue label Sep 20, 2021
@sylus

sylus commented Nov 10, 2021

Err, Konnectivity just got rolled out to us in canadacentral after simply running az aks update --attach-acr to attach an ACR to our cluster, and now this has completely broken our connectivity to our clusters in both dev and prod.

I don't even know what to say about this feature being rolled out to us unexpectedly: I don't see canadacentral in the list, and it has also broken our production cluster.

konnectivity-agent-6944bbbc-khp7l    1/1     Running            0          9m58s
konnectivity-agent-6944bbbc-q5lxk    1/1     Running            0          9m58s

❯ klo konnectivity-agent-6944bbbc-khp7l

Error from server: Get "https://10.129.0.65:10250/containerLogs/kube-system/konnectivity-agent-6944bbbc-khp7l/konnectivity-agent?follow=true": write unix @->/tunnel-uds/socket: write: broken pipe
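
(klo above is a shell alias; assuming it expands to kubectl logs with follow, the plain equivalent would be roughly:)

# hypothetical expansion of the klo alias used above
kubectl logs -f -n kube-system konnectivity-agent-6944bbbc-khp7l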

@shihying66

+1. I just enabled --uptime-sla for one of our prod clusters in northeurope, and it looks like konnectivity-agent is now used instead of aks-link. Can we have documentation on what the impact is?
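
For reference, this was the kind of update that preceded the switch (a sketch with placeholder resource group / cluster names):

# enable the Uptime SLA tier on an existing cluster
az aks update --resource-group myResourceGroup --name myAKSCluster --uptime-sla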

@sylus

sylus commented Nov 10, 2021

Still no fix for this on our end, even though we filed a Sev A and it affects 300 production users.

This is even with Premier support and the backend team saying they would try to revert back to aks-link. So far, since yesterday, nothing has been actioned, and Microsoft even downgraded our severity without our consent.

The issue is with all connectivity initiated by the API server back to the cluster, which includes connections to the kubelets on each node (for logs and exec calls) and to admission webhooks (such as Gatekeeper).

Edit ->

They have rolled back to aks-link on our cluster and almost instantly everything started to work again.

Additionally, we can now see the old konnectivity logs, and they say:

❯ klo konnectivity-agent-5cb4c5cb69-dsfn4 (XXXXX/kube-system)
I1110 11:47:11.482754 1 options.go:83] AgentCert set to "".
I1110 11:47:11.482816 1 options.go:84] AgentKey set to "".
I1110 11:47:11.482822 1 options.go:85] CACert set to "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt".
I1110 11:47:11.482829 1 options.go:86] ProxyServerHost set to "XXXXX.hcp.canadacentral.azmk8s.io".
I1110 11:47:11.482834 1 options.go:87] ProxyServerPort set to 443.
I1110 11:47:11.482838 1 options.go:88] ALPNProtos set to [konnectivity].
I1110 11:47:11.482859 1 options.go:89] HealthServerPort set to 8082.
I1110 11:47:11.482863 1 options.go:90] AdminServerPort set to 8094.
I1110 11:47:11.482867 1 options.go:91] EnableProfiling set to false.
I1110 11:47:11.482872 1 options.go:92] EnableContentionProfiling set to false.
I1110 11:47:11.482876 1 options.go:93] AgentID set to XXXXX.
I1110 11:47:11.482881 1 options.go:94] SyncInterval set to 1s.
I1110 11:47:11.482892 1 options.go:95] ProbeInterval set to 1s.
I1110 11:47:11.482898 1 options.go:96] SyncIntervalCap set to 10s.
I1110 11:47:11.482903 1 options.go:97] ServiceAccountTokenPath set to "/tokens/konnectivity-token".
I1110 11:47:11.482913 1 options.go:98] AgentIdentifiers set to .
E1110 11:47:11.512659 1 clientset.go:162] "cannot sync once" err="expected one server ID in the context, got []"

@nodunnock

nodunnock commented Nov 12, 2021

I just got hit by this today. Konnectivity was rolled out in one of our clusters. So far, the effect I can see is that webhook calls from the API server to worker nodes fail occasionally.

error: failed to patch image update to pod template: Internal error occurred: failed calling webhook "XXX.test.svc": Post "https://XXX.test.svc:443/?timeout=10s": read unix @->/tunnel-uds/socket: read: connection reset by peer
error: ingresses.extensions "XXX" could not be patched: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://XXX-ingress-nginx-controller-admission.test.svc:443/networking/v1/ingresses?timeout=10s": read unix @->/tunnel-uds/socket: read: connection reset by peer

Scaling up the number of pods behind the webhook service seems to mitigate the issue a little bit.
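
For example (a sketch only; the deployment name here is a placeholder for whatever backs the webhook service):

# add replicas behind the webhook service so a single dropped tunnel connection is less disruptive
kubectl scale deployment/my-webhook-backend -n test --replicas=3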

The konnectivity logs are full of:

E1112 12:28:32.670717       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:47702->10.181.40.126:10249: use of closed network connection"
E1112 12:28:45.090972       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57820->10.250.193.243:443: use of closed network connection" connectionID=241
E1112 12:28:45.090994       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57820->10.250.193.243:443: use of closed network connection"
E1112 12:29:45.655267       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37994->10.181.40.65:10249: use of closed network connection"
E1112 12:29:56.450407       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37108->10.181.40.4:10256: use of closed network connection"
E1112 12:30:00.982382       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:45290->10.181.40.134:9153: use of closed network connection"
E1112 12:30:14.130177       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:60206->10.181.40.4:10249: use of closed network connection"
E1112 12:30:14.409013       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:35502->10.250.196.71:443: use of closed network connection"
E1112 12:30:14.409090       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:35502->10.250.196.71:443: use of closed network connection" connectionID=243
E1112 12:31:15.518840       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:51222->10.181.40.4:10091: use of closed network connection"
E1112 12:32:30.065767       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:59522->10.250.193.243:443: use of closed network connection" connectionID=157
E1112 12:32:30.065767       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:59522->10.250.193.243:443: use of closed network connection"
E1112 12:32:30.158352       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37736->10.250.196.71:443: use of closed network connection"
E1112 12:32:30.158443       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:37736->10.250.196.71:443: use of closed network connection" connectionID=247
E1112 12:32:44.428038       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57044->10.250.168.211:443: use of closed network connection"
E1112 12:32:44.428050       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57044->10.250.168.211:443: use of closed network connection" connectionID=152
E1112 12:32:50.516049       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57514->10.250.168.211:443: use of closed network connection" connectionID=242
E1112 12:32:50.516049       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57514->10.250.168.211:443: use of closed network connection"
E1112 12:33:32.676282       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:51112->10.181.40.126:10249: use of closed network connection"
I1112 12:33:54.730060       1 client.go:482] connection EOF

konnectivity version:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: "2021-11-12T09:02:13Z"
  name: konnectivity-agent
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: konnectivity-agent
        image: mcr.microsoft.com/oss/kubernetes/apiserver-network-proxy/agent:v0.0.24

@ghost ghost added the stale Stale issue label Jan 11, 2022
@ghost

ghost commented Jan 11, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Jan 19, 2022
@ghost

ghost commented Jan 19, 2022

This issue will now be closed because it hasn't had any activity for 7 days after going stale. MattJeanes, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.

@MattJeanes
Author

I'd like to reopen this, as I'm still unsure of the state of the rollout, especially since it has been causing the issues described above.

@justindavies is there any update on the rollout?

@ghost ghost removed the stale Stale issue label Jan 19, 2022
@dennis-benzinger-hybris

The rollout seems to be continuing in February:
https://github.com/Azure/AKS/releases/tag/2022-01-06

@ghost ghost locked as resolved and limited conversation to collaborators Feb 18, 2022
@alvinli222
Contributor

Hi all, we stopped the rollout after discovering some memory leaks in the upstream project. We've worked closely with the upstream team to fix these leaks and are ready to roll out again. You may have already seen a message in your Portal regarding this news. Re-opening this thread to track any issues that arise.

@alvinli222 alvinli222 reopened this Apr 29, 2022
@alvinli222 alvinli222 self-assigned this Apr 29, 2022
@Azure Azure unlocked this conversation Apr 29, 2022
@MattJeanes
Author

Thank you for the update, really appreciated! 🙂

@alvinli222
Contributor

alvinli222 commented May 2, 2022

Here was the communication that was sent to all clusters:

In Azure Kubernetes Service (AKS), the tunnel is a secure communication channel that allows the managed control plane to communicate with individual nodes. AKS will start rolling out a new version of our tunnel, Konnectivity, to all clusters over the next 3 weeks and clusters will automatically switch to using Konnectivity for tunnel communication.
Recommended action

Check and confirm that your networking rules allow the required ports, FQDNs, and IPs and that your networking firewall doesn’t block or rewrite the Application-Layer Protocol Negotiation (ALPN) TLS extension. If your cluster doesn’t meet these networking rules, then it’s in an unsupported state and won’t be able to receive this update. If your cluster is in an unsupported state, AKS will roll the tunnel back to the previous version to prevent disruptions of service. Please correct the networking rules as soon as possible to bring your cluster to a supported state.
Learn more about the Konnectivity service in Kubernetes and check out the AKS FAQ.
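
(Not an official procedure, but as a rough sanity check that the egress path is not stripping the ALPN extension, something like the following can be run from inside the cluster's network; the FQDN is a placeholder for your cluster's API server address. The handshake may not complete without client certificates, but an "ALPN protocol" line in the output suggests the extension made it through, whereas "No ALPN negotiated" or a rewritten value points at interference.)

# offer the konnectivity ALPN protocol and inspect what, if anything, gets negotiated
openssl s_client -connect <your-cluster>.hcp.<region>.azmk8s.io:443 -alpn konnectivity </dev/null 2>/dev/null | grep -i alpn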

@jackmcc08

jackmcc08 commented Jun 9, 2022

Hi @alvinli222, do you know if the Konnectivity Service was rolled out to UK West?

We deployed an AKS cluster in UK South, and it automatically set up konnectivity-agent pods. But when we deployed the same service to UK West, it did not deploy any konnectivity-agents and instead deployed a tunnelfront pod.

Many thanks for any information you can provide.

Edit: We have also noted that the deployments we now make to UK South are not setting up konnectivity-agent pods either, so we are not too sure what is going on here. There does not seem to be an obvious Terraform or Azure option to change which one is used.

@arsnyder16

@jackmcc08 @MattJeanes Any luck with Konnectivity? I have various clusters with different workloads in various data centers, and I have yet to see Konnectivity in any of them, despite the AKS team reporting that the rollout is complete.

@alvinli222 I haven't modified anything network-wise on these clusters; they use kubenet and have NGINX ingress deployed in them. Is there any guidance from Microsoft on why this might not be rolling out to clusters?

@miwithro
Contributor

@jackmcc08 @arsnyder16 @MattJeanes Konnectivity is now available in ALL Azure Public Regions, so simply performing a cluster upgrade or creating a new cluster will instantiate Konnectivity.
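
For reference, a minimal sketch of the upgrade route with the az CLI (placeholder names; pick a target version from the get-upgrades output):

# list the Kubernetes versions this cluster can upgrade to
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
# upgrading rolls the cluster forward and, per the comment above, instantiates Konnectivity
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version <target-version>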

@arsnyder16

@miwithro Thanks! I am seeing it now. I don't think I have seen it mentioned anywhere that we need to roll the cluster like this.

@miwithro
Contributor

@arsnyder16 yes we will update the Release Notes to reflect this.

@alvinli222
Contributor

Hi folks, additionally, you can submit an az aks update --cluster-name myAKSCluster --resource-group myResourceGroup command with no flags to receive this update as well.

@jackmcc08

@arsnyder16 Sorry for not responding sooner, got sidetracked :) Yes, we are now seeing konnectivity agents in our clusters. I assume, as per @miwithro's comments, this has now been implemented.

@bbrandt

bbrandt commented Sep 7, 2022

@alvinli222 If I run that command in Azure Cloud Shell I get this error:

the following arguments are required: --name/-n

Examples from AI knowledge base:
az aks update --resource-group MyResourceGroup --name MyManagedCluster --load-balancer-managed-outbound-ip-count 2
Update a kubernetes cluster with standard SKU load balancer to use two AKS created IPs for the load balancer outbound connection usage.

az aks update --resource-group MyResourceGroup --name MyManagedCluster --api-server-authorized-ip-ranges 0.0.0.0/32
Restrict apiserver traffic in a kubernetes cluster to agentpool nodes.

https://docs.microsoft.com/en-US/cli/azure/aks#az_aks_update
Read more about the command in reference docs

If I change --cluster-name to --name I get this error:

Please specify one or more of "--enable-cluster-autoscaler" or "--disable-cluster-autoscaler" or "--update-cluster-autoscaler" or "--cluster-autoscaler-profile" or "--load-balancer-managed-outbound-ip-count" or "--load-balancer-outbound-ips" or "--load-balancer-outbound-ip-prefixes" or "--load-balancer-outbound-ports" or "--load-balancer-idle-timeout" or "--nat-gateway-managed-outbound-ip-count" or "--nat-gateway-idle-timeout" or "--auto-upgrade-channel" or "--attach-acr" or "--detach-acr" or "--uptime-sla" or "--no-uptime-sla" or "--api-server-authorized-ip-ranges" or "--enable-aad" or "--aad-tenant-id" or "--aad-admin-group-object-ids" or "--enable-ahub" or "--disable-ahub" or "--windows-admin-password" or "--enable-managed-identity" or "--assign-identity" or "--enable-azure-rbac" or "--disable-azure-rbac" or "--enable-public-fqdn" or "--disable-public-fqdn" or "--tags" or "--nodepool-labels" or "--enble-windows-gmsa".

Are you testing with an unreleased version of az or something?

I was thinking you meant that running any az aks update command would update Konnectivity, but the output of this command did not say anything about Konnectivity:
az aks update --tags "aks" --name myAKSCluster --resource-group myResourceGroup

@BrianGuo987

Hello guys,
Thanks for the new release.
We just upgraded from 1.15 to 1.22, and then to AKS 1.23.12.
Now we have both konnectivity and tunnelfront.
Both are running. But the problem is that we don't need tunnelfront anymore; it is still running and trying to SSH-connect to the API server every 5 seconds, and the tunnelfront pod is restarting every 10 minutes.

Please, can I get any advice on the restarting tunnelfront pods?
Thanks a lot!
Brian

@Icybiubiubiu


Hi @BrianGuo987,

Upgrading AKS from 1.15 to 1.22 is not supported; users should not upgrade AKS across two or more minor versions at once. 1.15 is too old and too different compared with 1.22.
We suggest creating a new cluster and redeploying the service.

@ghost ghost added the stale Stale issue label Dec 30, 2022
@ghost

ghost commented Dec 30, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@MattJeanes
Author

I think we can close this now that Konnectivity is widely available. Thank you for keeping us posted on the rollout; it has been very much appreciated 🙂

@ghost ghost removed the stale Stale issue label Dec 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 29, 2023
@aritraghosh aritraghosh moved this to Archive (GA older than 1 month) in Azure Kubernetes Service Roadmap (Public) Jul 10, 2024