Konnectivity implementation timeline #2452

Closed
MattJeanes opened this issue Jul 20, 2021 · 29 comments

Labels: Feedback General feedback

Comments

@MattJeanes

MattJeanes commented Jul 20, 2021

What happened:

Release 2021-06-17 states that, starting with the following week's release, Konnectivity will replace aks-link and tunnelfront for connecting the control plane and nodes; however, this does not seem to have happened.

What you expected to happen:

I would have expected this rollout to have completed by now as per the timeline given on the release.

How to reproduce it (as minimally and precisely as possible):

Create an AKS cluster in the West Europe or East US region, with Uptime SLA on or off. There should be konnectivity pods, but no such pods exist; instead, aks-link / tunnelfront pods exist in kube-system.
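
For anyone else checking, a quick way to see which tunnel component a cluster is currently running (assuming the default component names in kube-system):

# only one of these tunnel families should be present in a given cluster
kubectl get pods -n kube-system -o wide | grep -E 'konnectivity-agent|aks-link|tunnelfront'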

Anything else we need to know?:
I have a ticket open with Azure support, who tell me that the Konnectivity implementation should resolve issues we are seeing with the aks-link / tunnelfront tunnels occasionally dropping, which results in AggregatedAPIDown/AggregatedAPIErrors alerts firing from our Prometheus instances.

I am told this deployment has been delayed, so I am raising this issue for myself and others to publicly track the implementation progress. For example, I will be reducing our alerting thresholds to compensate for these errors, but I would like to increase them back up once this deployment is complete.

I would also like to know whether this upgrade requires upgrading the Kubernetes cluster or will be done automatically in the background, and, if it does require an upgrade, which versions include the new Konnectivity components.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"5a7170d3bbf1731483f4844c2222c70501717341", GitTreeState:"clean", BuildDate:"2021-05-25T17:36:45Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?): 6 nodes, 2 system, 3 linux, 1 windows
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): .NET Core microservices
  • Others: Prometheus, Grafana, Falco, KEDA, Kured, NGINX, Cert Manager, VPA, Kubernetes Dashboard
@ghost ghost added the triage label Jul 20, 2021
@ghost

ghost commented Jul 20, 2021

Hi MattJeanes, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@justindavies
Contributor

Hi there @MattJeanes, we are slightly delayed in getting this out. We will update this issue when it is on its way.

@justindavies justindavies added the Feedback General feedback label Jul 20, 2021
@ghost ghost removed the triage label Jul 20, 2021
@MattJeanes
Author

That's perfect, thank you very much. Do you have an estimated timeline for this?

@justindavies justindavies self-assigned this Jul 20, 2021
@justindavies
Contributor

I wish I could be more precise on this, but at the moment the best I can say is that this will be rolled out in the short to medium term. We'll update this issue once we know it's on the release track.

@ghost

ghost commented Sep 18, 2021

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@brk3

brk3 commented Sep 20, 2021

Hi, just posting an update on this in case it times out. I'm seeing tunnelfront being used in clusters I'm creating in the West Europe region, v1.21.2.

This is causing issues as some reference architectures are now assuming that all clusters have moved to konnectivity.

Is there anywhere I can read the official stance on tunnelfront vs aks-link vs konnectivity from the AKS team?

@ghost ghost removed the stale Stale issue label Sep 20, 2021
@sylus

sylus commented Nov 10, 2021

Err, Konnectivity just got rolled out to us in canadacentral after simply running az aks update --attach-acr to attach an ACR to our cluster, and now this has completely broken our connectivity to our clusters in both dev and prod.

I don't even know what to say about this feature being rolled out to us unexpectedly: I don't see canadacentral in the list, and it has also broken our production cluster.

konnectivity-agent-6944bbbc-khp7l    1/1     Running            0          9m58s
konnectivity-agent-6944bbbc-q5lxk    1/1     Running            0          9m58s

❯ klo konnectivity-agent-6944bbbc-khp7l

Error from server: Get "https://10.129.0.65:10250/containerLogs/kube-system/konnectivity-agent-6944bbbc-khp7l/konnectivity-agent?follow=true": write unix @->/tunnel-uds/socket: write: broken pipe
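
(klo above is a shell alias; assuming it expands to kubectl logs with follow, the plain equivalent would be roughly:)

# hypothetical expansion of the klo alias used above
kubectl logs -f -n kube-system konnectivity-agent-6944bbbc-khp7l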

@shihying66

+1. I just enabled --uptime-sla for one of our prod clusters in northeurope, and it looks like konnectivity-agent is now used instead of aks-link. Can we have documentation on what the impact is?
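
For reference, this was the kind of update that preceded the switch (a sketch with placeholder resource group / cluster names):

# enable the Uptime SLA tier on an existing cluster
az aks update --resource-group myResourceGroup --name myAKSCluster --uptime-sla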

@sylus

sylus commented Nov 10, 2021

Still no fix for this on our end, even though we filed a Sev A and it affects 300 production users.

This is even with Premier support and the backend team saying they would try to revert back to aks-link. So far, since yesterday, nothing has been actioned, and Microsoft even downgraded our severity without our consent.

The issue is with all connectivity initiated by the API server back to the cluster, which includes connections to the kubelets on each node (for logs and exec calls) and to admission webhooks (such as Gatekeeper).

Edit ->

They have rolled back to aks-link on our cluster and almost instantly everything started to work again.

Additionally, we can now see the old konnectivity logs, and they say:

❯ klo konnectivity-agent-5cb4c5cb69-dsfn4 (XXXXX/kube-system)
I1110 11:47:11.482754 1 options.go:83] AgentCert set to "".
I1110 11:47:11.482816 1 options.go:84] AgentKey set to "".
I1110 11:47:11.482822 1 options.go:85] CACert set to "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt".
I1110 11:47:11.482829 1 options.go:86] ProxyServerHost set to "XXXXX.hcp.canadacentral.azmk8s.io".
I1110 11:47:11.482834 1 options.go:87] ProxyServerPort set to 443.
I1110 11:47:11.482838 1 options.go:88] ALPNProtos set to [konnectivity].
I1110 11:47:11.482859 1 options.go:89] HealthServerPort set to 8082.
I1110 11:47:11.482863 1 options.go:90] AdminServerPort set to 8094.
I1110 11:47:11.482867 1 options.go:91] EnableProfiling set to false.
I1110 11:47:11.482872 1 options.go:92] EnableContentionProfiling set to false.
I1110 11:47:11.482876 1 options.go:93] AgentID set to XXXXX.
I1110 11:47:11.482881 1 options.go:94] SyncInterval set to 1s.
I1110 11:47:11.482892 1 options.go:95] ProbeInterval set to 1s.
I1110 11:47:11.482898 1 options.go:96] SyncIntervalCap set to 10s.
I1110 11:47:11.482903 1 options.go:97] ServiceAccountTokenPath set to "/tokens/konnectivity-token".
I1110 11:47:11.482913 1 options.go:98] AgentIdentifiers set to .
E1110 11:47:11.512659 1 clientset.go:162] "cannot sync once" err="expected one server ID in the context, got []"

@nodunnock

nodunnock commented Nov 12, 2021

I just got hit by this today. Konnectivity was rolled out in one of our clusters. So far, the effect I can see is that webhook calls from the API server to worker nodes fail occasionally.

error: failed to patch image update to pod template: Internal error occurred: failed calling webhook "XXX.test.svc": Post "https://XXX.test.svc:443/?timeout=10s": read unix @->/tunnel-uds/socket: read: connection reset by peer
error: ingresses.extensions "XXX" could not be patched: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://XXX-ingress-nginx-controller-admission.test.svc:443/networking/v1/ingresses?timeout=10s": read unix @->/tunnel-uds/socket: read: connection reset by peer

Scaling up the number of pods behind the webhook service seems to mitigate the issue a little bit.
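
For example (a sketch only; the deployment name here is a placeholder for whatever backs the webhook service):

# add replicas behind the webhook service so a single dropped tunnel connection is less disruptive
kubectl scale deployment/my-webhook-backend -n test --replicas=3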

The konnectivity logs are full of:

E1112 12:28:32.670717       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:47702->10.181.40.126:10249: use of closed network connection"
E1112 12:28:45.090972       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57820->10.250.193.243:443: use of closed network connection" connectionID=241
E1112 12:28:45.090994       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57820->10.250.193.243:443: use of closed network connection"
E1112 12:29:45.655267       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37994->10.181.40.65:10249: use of closed network connection"
E1112 12:29:56.450407       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37108->10.181.40.4:10256: use of closed network connection"
E1112 12:30:00.982382       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:45290->10.181.40.134:9153: use of closed network connection"
E1112 12:30:14.130177       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:60206->10.181.40.4:10249: use of closed network connection"
E1112 12:30:14.409013       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:35502->10.250.196.71:443: use of closed network connection"
E1112 12:30:14.409090       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:35502->10.250.196.71:443: use of closed network connection" connectionID=243
E1112 12:31:15.518840       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:51222->10.181.40.4:10091: use of closed network connection"
E1112 12:32:30.065767       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:59522->10.250.193.243:443: use of closed network connection" connectionID=157
E1112 12:32:30.065767       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:59522->10.250.193.243:443: use of closed network connection"
E1112 12:32:30.158352       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:37736->10.250.196.71:443: use of closed network connection"
E1112 12:32:30.158443       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:37736->10.250.196.71:443: use of closed network connection" connectionID=247
E1112 12:32:44.428038       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57044->10.250.168.211:443: use of closed network connection"
E1112 12:32:44.428050       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57044->10.250.168.211:443: use of closed network connection" connectionID=152
E1112 12:32:50.516049       1 client.go:515] "conn write failure" err="write tcp 10.181.40.126:57514->10.250.168.211:443: use of closed network connection" connectionID=242
E1112 12:32:50.516049       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:57514->10.250.168.211:443: use of closed network connection"
E1112 12:33:32.676282       1 client.go:486] "connection read failure" err="read tcp 10.181.40.126:51112->10.181.40.126:10249: use of closed network connection"
I1112 12:33:54.730060       1 client.go:482] connection EOF

konnectivity version:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: "2021-11-12T09:02:13Z"
  name: konnectivity-agent
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: konnectivity-agent
        image: mcr.microsoft.com/oss/kubernetes/apiserver-network-proxy/agent:v0.0.24

@ghost ghost added the stale Stale issue label Jan 11, 2022
@ghost

ghost commented Jan 11, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Jan 19, 2022
@ghost

ghost commented Jan 19, 2022

This issue will now be closed because it hasn't had any activity for 7 days after going stale. MattJeanes, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.

@MattJeanes
Author

I'd like to reopen this, as I'm still unsure of the state of the rollout, especially since it has been causing the issues described above.

@justindavies is there any update on the rollout?

@ghost ghost removed the stale Stale issue label Jan 19, 2022
@dennis-benzinger-hybris

The rollout seems to be continuing in February:
https://github.com/Azure/AKS/releases/tag/2022-01-06

@ghost ghost locked as resolved and limited conversation to collaborators Feb 18, 2022
@alvinli222
Contributor

Hi all, we stopped the rollout after discovering some memory leaks in the upstream project. We've worked closely with the upstream team to fix these leaks and are ready to roll out again. You may have already seen a message in your Portal regarding this news. Re-opening this thread to track any issues that arise.

@alvinli222 alvinli222 reopened this Apr 29, 2022
@alvinli222 alvinli222 self-assigned this Apr 29, 2022
@Azure Azure unlocked this conversation Apr 29, 2022
@MattJeanes
Author

Thank you for the update, really appreciated! 🙂

@alvinli222
Contributor

alvinli222 commented May 2, 2022

Here was the communication that was sent to all clusters:

In Azure Kubernetes Service (AKS), the tunnel is a secure communication channel that allows the managed control plane to communicate with individual nodes. AKS will start rolling out a new version of our tunnel, Konnectivity, to all clusters over the next 3 weeks and clusters will automatically switch to using Konnectivity for tunnel communication.
Recommended action

Check and confirm that your networking rules allow the required ports, FQDNs, and IPs and that your networking firewall doesn’t block or rewrite the Application-Layer Protocol Negotiation (ALPN) TLS extension. If your cluster doesn’t meet these networking rules, then it’s in an unsupported state and won’t be able to receive this update. If your cluster is in an unsupported state, AKS will roll the tunnel back to the previous version to prevent disruptions of service. Please correct the networking rules as soon as possible to bring your cluster to a supported state.
Learn more about the Konnectivity service in Kubernetes and check out the AKS FAQ.
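
(Not an official procedure, but as a rough sanity check that the egress path is not stripping the ALPN extension, something like the following can be run from inside the cluster's network; the FQDN is a placeholder for your cluster's API server address. The handshake may not complete without client certificates, but an "ALPN protocol" line in the output suggests the extension made it through, whereas "No ALPN negotiated" or a rewritten value points at interference.)

# offer the konnectivity ALPN protocol and inspect what, if anything, gets negotiated
openssl s_client -connect <your-cluster>.hcp.<region>.azmk8s.io:443 -alpn konnectivity </dev/null 2>/dev/null | grep -i alpn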

@jackmcc08

jackmcc08 commented Jun 9, 2022

Hi @alvinli222, do you know if the Konnectivity Service was rolled out to UK West?

We deployed an AKS cluster in UK South, and it automatically set up konnectivity-agent pods. But when we deployed the same service to UK West, it did not deploy any konnectivity-agents and instead deployed a tunnelfront pod.

Many thanks for any information you can provide.

Edit: We have also noted that the deployments we now make to UK South are not setting up konnectivity-agent pods either, so we are not too sure what is going on here. There does not seem to be an obvious Terraform or Azure option to change which one is used.

@arsnyder16

@jackmcc08 @MattJeanes Any luck with Konnectivity? I have various clusters with different workloads in various data centers, and I have yet to see Konnectivity in any of them, despite the AKS team reporting that the rollout is complete.

@alvinli222 I haven't modified anything network-wise on these clusters; they use kubenet and have NGINX ingress deployed in them. Is there any guidance from Microsoft on why this might not be rolling out to clusters?

@miwithro
Contributor

@jackmcc08 @arsnyder16 @MattJeanes Konnectivity is now available in ALL Azure Public Regions, so simply performing a cluster upgrade or creating a new cluster will instantiate Konnectivity.
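
For reference, a minimal sketch of the upgrade route with the az CLI (placeholder names; pick a target version from the get-upgrades output):

# list the Kubernetes versions this cluster can upgrade to
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
# upgrading rolls the cluster forward and, per the comment above, instantiates Konnectivity
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version <target-version>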

@arsnyder16

@miwithro Thanks! I am seeing it now. I don't think I have seen it mentioned anywhere that we need to roll the cluster like this.

@miwithro
Contributor

@arsnyder16 yes we will update the Release Notes to reflect this.

@alvinli222
Contributor

Hi folks, additionally, you can submit an az aks update --cluster-name myAKSCluster --resource-group myResourceGroup command with no flags to receive this update as well.

@jackmcc08

@arsnyder16 Sorry for not responding sooner, got sidetracked :) Yes, we are now seeing konnectivity agents in our clusters. I assume, as per @miwithro's comments, this has now been implemented.

@bbrandt

bbrandt commented Sep 7, 2022

@alvinli222 If I run that command in Azure Cloud Shell I get this error:

the following arguments are required: --name/-n

Examples from AI knowledge base:
az aks update --resource-group MyResourceGroup --name MyManagedCluster --load-balancer-managed-outbound-ip-count 2
Update a kubernetes cluster with standard SKU load balancer to use two AKS created IPs for the load balancer outbound connection usage.

az aks update --resource-group MyResourceGroup --name MyManagedCluster --api-server-authorized-ip-ranges 0.0.0.0/32
Restrict apiserver traffic in a kubernetes cluster to agentpool nodes.

https://docs.microsoft.com/en-US/cli/azure/aks#az_aks_update
Read more about the command in reference docs

If I change --cluster-name to --name I get this error:

Please specify one or more of "--enable-cluster-autoscaler" or "--disable-cluster-autoscaler" or "--update-cluster-autoscaler" or "--cluster-autoscaler-profile" or "--load-balancer-managed-outbound-ip-count" or "--load-balancer-outbound-ips" or "--load-balancer-outbound-ip-prefixes" or "--load-balancer-outbound-ports" or "--load-balancer-idle-timeout" or "--nat-gateway-managed-outbound-ip-count" or "--nat-gateway-idle-timeout" or "--auto-upgrade-channel" or "--attach-acr" or "--detach-acr" or "--uptime-sla" or "--no-uptime-sla" or "--api-server-authorized-ip-ranges" or "--enable-aad" or "--aad-tenant-id" or "--aad-admin-group-object-ids" or "--enable-ahub" or "--disable-ahub" or "--windows-admin-password" or "--enable-managed-identity" or "--assign-identity" or "--enable-azure-rbac" or "--disable-azure-rbac" or "--enable-public-fqdn" or "--disable-public-fqdn" or "--tags" or "--nodepool-labels" or "--enble-windows-gmsa".

Are you testing with an unreleased version of az or something?

I was thinking you meant that running any az aks update command would update Konnectivity, but the output of this command did not say anything about Konnectivity:
az aks update --tags "aks" --name myAKSCluster --resource-group myResourceGroup

@BrianGuo987

Hello guys,
Thanks for the new release.
We just upgraded from 1.15 to 1.22, and then to AKS 1.23.12.
Now we have both konnectivity and tunnelfront.
Both are running. But the problem is that we don't need tunnelfront anymore; it is still running and trying to SSH-connect to the API server every 5 seconds, and the tunnelfront pod is restarting every 10 minutes.

Please, can I get any advice on the restarting tunnelfront pods?
Thanks a lot!
Brian

@Icybiubiubiu


Hi @BrianGuo987,

Upgrading AKS from 1.15 to 1.22 is not supported; users should not upgrade AKS across two or more minor versions at once. 1.15 is too old and too different compared with 1.22.
We suggest creating a new cluster and redeploying the service.

@ghost ghost added the stale Stale issue label Dec 30, 2022
@ghost

ghost commented Dec 30, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@MattJeanes
Author

I think we can close this now that Konnectivity is widely available. Thank you for keeping us posted on the rollout; it has been very much appreciated 🙂

@ghost ghost removed the stale Stale issue label Dec 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 29, 2023
@aritraghosh aritraghosh moved this to Archive (GA older than 1 month) in Azure Kubernetes Service Roadmap (Public) Jul 10, 2024