Sonobuoy I/O timeout issues when using ipv6 address in an air-gapped environment #1136
Comments
Hi, I have an IPv6 cluster and had the same problem. TL;DR: the Sonobuoy CLI seems to mess something up when the Docker images are already present on the nodes; otherwise it works.
Hey @tomikonio! Thanks for the response. We tried that suggestion of yours and it worked, but when we ran it a second time the i/o timeout issue reappeared. I wonder where in Sonobuoy the issue is occurring. Do you have any idea where? We don't want to constantly remove and reload the images whenever we use Sonobuoy. @zubron, do you have inputs regarding this? I would love to hear from you.
This is very strange behaviour indeed. The timeout error is occurring when the Sonobuoy pod is attempting to create a Kubernetes client (using kubernetes/client-go). This is required because the client is then used to communicate with the API server to perform actions on the cluster such as starting the plugin pods, updating annotations on the main Sonobuoy pod, etc. The error seems to happen at the point when it first tries to connect to the API server to get the API group resources. Given that this error is occurring from a pod already running on the cluster when the image has already been pulled, I'm struggling to understand how it could be related to the image being pulled in the first place. The Sonobuoy CLI generates a Kubernetes manifest which it deploys on the cluster; you can inspect that manifest by looking at the output from "sonobuoy gen". Would either of you mind sharing a bit more about your environments? I haven't done any testing on IPv6 and I would like to set up an environment to investigate this issue more, so it would help to replicate it as closely as possible. @tomikonio - Is your environment also an air-gapped environment?
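For reference, a minimal sketch of that code path (this is not Sonobuoy's exact source, just the equivalent client-go calls that the "could not get api group resources" error points at):

```go
package main

import (
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

func main() {
	// Reads the service account token and the KUBERNETES_SERVICE_HOST/PORT
	// variables that are injected into every pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("could not load in-cluster config: %v", err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatalf("could not create discovery client: %v", err)
	}

	// This is the first real round trip to the API server; if traffic from
	// the pod is being dropped, it surfaces here as
	// "dial tcp [...]:443: i/o timeout".
	groups, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		log.Fatalf("could not get api group resources: %v", err)
	}
	log.Printf("discovered %d API groups", len(groups))
}
```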
Thanks for the reply @zubron! My current environment is a simple air-gapped three-node cluster (1 master, 2 workers) with a private registry virtual machine in which the images are stored. The images are pulled in an environment with internet access, saved, and transferred to the private registry. The setup is IPv6. Here is the log from when we followed @tomikonio's suggestion to delete the existing images and re-upload them to the private registry.
Here is the sample e2e test result.
Here are the sonobuoy pod logs:
@johnray21216 Thanks for the logs. How did you provision your cluster? I notice you're using Calico, and one of the pods has a high number of restarts which seems like it could be affecting the network. What versions of each of these components are you using?
I am using the following components: kubernetes: 1.16.3
Did you set up the cluster manually or are you using a managed offering? If managed, which provider?
I set it up manually.
I checked the logs of the pod with the unusually high number of restarts and then grepped them for WARNING messages.
I'm not familiar with Calico and its related projects. That warning is coming from Felix. I don't know how policy management works with Calico, but it looks like there might be something in your configuration which is causing the traffic for the sonobuoy-serviceaccount to be dropped. I would recommend looking at your Calico configuration to see if there is anything in place that would cause this, as this is not controlled by Sonobuoy.
Hello @zubron. Apologies for the late response. According to one of my colleagues, the high number of restarts of the Calico node pod was caused by a trial-and-error setup of Calico for IPv6. But we will still try to look into this suggestion of yours. May I ask if there are other factors that could result in this unusual behavior? Please let me know if you need additional information from us.
Ah, I see. I think it's still suspicious that you have those warnings regarding the sonobuoy service account. Reading the code from Felix where that warning is being generated, it looks like a deny-all rule is being applied for inbound and outbound traffic on the profile for the sonobuoy service account, which would prevent it from talking to the API server. Given that it was able to run once successfully, I'm led to think that it's not an issue with Sonobuoy itself (my first thought was that it was an IPv6-specific issue). Sonobuoy is capable of running on the cluster, but it looks like there is something in the networking configuration on the cluster which is preventing it from communicating with the API server on subsequent runs. I would revisit the Calico settings to see if there is anything in place where it would default to denying traffic. I have a small sample which isolates the section of code that is failing in Sonobuoy which you can try out. It might be easier for debugging the networking in your cluster: https://github.com/zubron/k8s-in-cluster-config
Hi, I wonder what configuration in calico may cause this issue.
Yields:
Hmm, it seems that sonobuoy is being blocked here. But we are not using any calico network policies or profiles.
Is there anything relevant in the logs besides those warnings? Also, the warnings for other profiles that aren't recognized seem to be coming from the e2e tests themselves. The format for the profile names seems to be "ksa.." and all the namespaces match the format used by those tests. The workloads that run as part of the e2e tests aren't managed by Sonobuoy; Sonobuoy initiates the run of the test framework, but the workloads for the tests run independently. So it seems that it is affecting other workloads too. When you were able to run the tests, did they run successfully?
These are the logs when sonobuoy is started (from all 3 nodes):
These are the logs when sonobuoy is deleted:
Hmm, it seems the warning is issued at termination. And yes, on the first run of sonobuoy the tests completed successfully.
Hello @zubron! I deployed your k8s-in-cluster-config sample and this was the log message:
I've pulled and pushed the zubron/k8s-in-cluster-config image into my private repository. I've created a new environment and re-ran sonobuoy. It still has the i/o timeout issue and the calico node logs still show the following:
I will try your suggestion regarding checking the calico manifest. I will provide feedback tomorrow.
My suspicion is that it will fail with the same timeout error on any subsequent runs (deleting the resources and then retrying), just as sonobuoy fails on subsequent runs after working correctly the first time.
@tomikonio Thanks for providing the logs. I wonder if there is some issue in Calico where the profile generated for the service account is set to deny all traffic upon deletion but isn't cleaned up from the data store, and the same profile is reused on subsequent runs (since its name is formed from the same namespace and service account name). That profile's rules deny traffic, and so subsequent runs of Sonobuoy fail. I wonder whether it would work if you used a different namespace for Sonobuoy on the subsequent runs (using the namespace flag with "sonobuoy run"). I still don't understand why deleting the images has any impact though. I wonder if re-pulling the images causes some extra delay in the process, or some other network activity which allows the state to repair itself. Either way, this is not something controlled by Sonobuoy, and it seems to be an issue in Calico. Recreating a namespace and service account is a valid thing to do.
Hello @zubron! I have finished checking on Calico and we found no issues with it. We were able to confirm that we are using the default "allow all" policy in Calico. We are still doing checks on our cluster environment. The unusual thing is that sonobuoy works IF we pull the image from the repository, but not when the image already exists on the node where the pod is scheduled. We are still investigating this and would love to hear your inputs on this as well.
That is strange behaviour, but something that is completely outside of Sonobuoy's control. The image already being present on the node should have absolutely no impact on the behaviour of the container using that image. You could try setting the imagePullPolicy to Always (there is a flag for this). Also, as we discussed yesterday, you are seeing the same behaviour with the sample program, so I don't think it's anything specific to the sonobuoy image.
Hello @zubron, do we add that flag on sonobuoy? I've tried that and it still failed. I'll follow up on this if we learn more about the issue.
Yes, it's a flag that you use with "sonobuoy run". It sets an imagePullPolicy of Always on all the sonobuoy pods, which forces the kubelet to pull the images each time. It's the only way that sonobuoy can provide any influence over whether the image is pulled before starting the pods.
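For illustration only (a hand-written sketch, not the manifest Sonobuoy actually generates, and the image reference below is a made-up placeholder for a private-registry image), the effect of that flag at the pod-spec level looks roughly like this:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Sketch of a container spec with the pull policy forced to Always,
	// so the kubelet contacts the registry on every pod start instead of
	// reusing a locally cached image.
	c := corev1.Container{
		Name:            "kube-sonobuoy",
		Image:           "registry.example.com/sonobuoy/sonobuoy:v0.18.0", // hypothetical image reference
		ImagePullPolicy: corev1.PullAlways,
	}
	fmt.Printf("container %q uses imagePullPolicy=%s\n", c.Name, c.ImagePullPolicy)
}
```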
Hello @zubron! Posting this here for my colleagues to see. We have tried the following scenarios already.
In the first run, BOTH endpoints worked and we were able to proceed with the test. But when we ran sonobuoy consecutively (with removing images, just the sonobuoy resources), the issue STILL persists. We have already checked our kubernetes setup and our calico configuration. I may need to open an issue with calico on this though. Is there a possibility that sonobuoy isn't able to communicate with the cluster? We are using versions 0.17.2 and 0.18.0 on our end.
Hi @johnray21216. Given that Sonobuoy is able to connect to the API server and run the first time, nothing fundamentally changes with a second run that would prevent it from communicating. As I have stated in our conversations on Slack, I strongly suspect that it is an issue with Calico, due to the logs where it states that it will drop all traffic for the service account. Yes, you may have configured Calico correctly; however, there could still be a bug there. If you haven't already, I would recommend either reaching out to them on their Slack or Discourse, or opening an issue in their project. (Links here: https://www.projectcalico.org/community/)
There has not been much activity here. We'll be closing this issue if there are no follow-ups within 15 days.
What steps did you take and what happened:
I have a remote air-gapped Kubernetes setup that is using IPv6. The architecture is as follows:
The images to be used are in a private registry virtual machine. The Kubernetes setup is a simple single-master cluster with 2 worker nodes, and it has been configured to use IPv6. When trying to run sonobuoy from the private registry, the application runs successfully but returns an i/o timeout error in the logs.
What did you expect to happen:
Sonobuoy will pull the images from the private registry and run successfully.
Anything else you would like to add:
[2020-06-29 12:15:53.498] [root@opm bin]# sonobuoy logs --kubeconfig $HOME/bin/a-ipv6-k8s_config
[2020-06-29 12:15:56.163] namespace="sonobuoy" pod="sonobuoy" container="kube-sonobuoy"
[2020-06-29 12:15:56.163] time="2020-06-29T04:14:37Z" level=info msg="Scanning plugins in ./plugins.d (pwd: /)"
[2020-06-29 12:15:56.163] time="2020-06-29T04:14:37Z" level=info msg="Scanning plugins in /etc/sonobuoy/plugins.d (pwd: /)"
[2020-06-29 12:15:56.163] time="2020-06-29T04:14:37Z" level=info msg="Directory (/etc/sonobuoy/plugins.d) does not exist"
[2020-06-29 12:15:56.163] time="2020-06-29T04:14:37Z" level=info msg="Scanning plugins in ~/sonobuoy/plugins.d (pwd: /)"
[2020-06-29 12:15:56.163] time="2020-06-29T04:14:37Z" level=info msg="Directory (~/sonobuoy/plugins.d) does not exist"
[2020-06-29 12:15:56.163] time="2020-06-29T04:15:07Z" level=error msg="could not get api group resources: Get https://[<ipv6_address>]:443/api?timeout=32s: dial tcp [<ipv6_address>]:443: i/o timeout"
[2020-06-29 12:15:56.163] time="2020-06-29T04:15:07Z" level=info msg="no-exit was specified, sonobuoy is now blocking"
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:23:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:13:49Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
We've already created an issue before which can be found here. Can someone help me on this one? Thank you.