Failures for CertManager deployment through kfctl when webhook is not ready #543
Comments
/assign @krishnadurai |
Found the root cause of this problem. The order of deployment for cert-manager with the self-signed or lets-encrypt overlays is:
Since
/cc @jbrette @yanniszark |
I found this commit in kustomize: kubernetes-sigs/kustomize@a09b42b. It would fix our issue since the validating webhook is applied last; however, that would mean that it's not validating the created objects. I think the behavior we want is for kustomize to just honor the order declared in the @jbrette @kkasravi can we make kustomize preserve the original order? |
Yes, we need to change the reorder to none. This can be done in kfctl, in kustomize.go, where we are setting it to legacy. |
Let me quickly try removing: And deploy KfDef and see if the deployment is stable. |
@krishnadurai yes that should work since we're not doing any ordering |
@kkasravi @yanniszark Here is the order after I remove those lines:
And this is 'as-is' as defined in kustomization.yaml, just that the overlay files are being given precedence over base resources. Since CRDs are created, the ClusterIssuer resource is accepted by the API server. Just let me know if overlays being applied first should be the correct order. |
@krishnadurai @kkasravi the reason that the overlay files come first is because we still use the deprecated bases field. In order to honor the ordering, we should edit the kustomization merge code to use resources for everything. |
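To make the ordering discussion concrete, here is a minimal sketch of the difference between honoring the declared order and reordering by kind. It is only an illustration: `APPLY_FIRST`, `APPLY_LAST`, `reorder_by_kind`, and the example resource names are hypothetical, and the priority lists are not the table that kustomize actually uses.

```python
from typing import Dict, List

# Hypothetical, simplified priority lists for illustration only; this is
# not the ordering table that kustomize actually uses.
APPLY_FIRST = ["Namespace", "CustomResourceDefinition"]
APPLY_LAST = ["MutatingWebhookConfiguration", "ValidatingWebhookConfiguration"]

# Hypothetical resources in the order a kustomization might declare them.
EXAMPLE_DECLARED = [
    {"kind": "ValidatingWebhookConfiguration", "name": "example-webhook"},
    {"kind": "ClusterIssuer", "name": "example-issuer"},
    {"kind": "CustomResourceDefinition", "name": "example-crd"},
    {"kind": "Deployment", "name": "example-deployment"},
]


def reorder_by_kind(resources: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Move selected kinds to the front or back; keep declared order otherwise."""

    def rank(resource: Dict[str, str]) -> int:
        kind = resource["kind"]
        if kind in APPLY_FIRST:
            return 0
        if kind in APPLY_LAST:
            return 2
        return 1

    # sorted() is stable, so resources with equal rank keep their declared order.
    return sorted(resources, key=rank)


if __name__ == "__main__":
    for resource in reorder_by_kind(EXAMPLE_DECLARED):
        print(resource["kind"], resource["name"])
```

With a kind-based reorder the webhook configuration ends up after the ClusterIssuer, so the ClusterIssuer is not validated by it; with no reordering, whatever order the kustomization declares is what gets applied.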
@yanniszark @krishnadurai @kkasravi @jlewi Have a look at this PR. I do believe this will solve most of your issues because it gives you control over the order and allows filtering directly in kfctl. I left the "legacy" order for now. Moreover, that code is compatible, so it could potentially be loaded as a kustomize external transformer (but I believe the version of the yaml being different would cause the .so file not to load). I won't have time to work on it, so just feel free to take over the PR if you feel this is the right solution. |
thanks @jbrette, i can pick up the PR. @yanniszark we can't switch out bases for resources w/o making a few other changes otherwise we'll get errors from kustomize. Hoping to get our kustomize targets fully compatible with kustomize-v3 for our point release |
This issue has been resolved by kfctl apply retries in PR kubeflow/kubeflow#4360. The ordering wasn't the root cause of this issue as mutating webhook wasn't ready until cluster issuer is applied. Ordering cluster issuer before mutating webhook wouldn't have been a clean solution. As for the ordering issue, I have created a tracker issue: kubeflow/kfctl#65 /close |
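As a rough illustration of the retry approach, here is a minimal sketch. It is not the code from the PR: `apply_with_retries`, `make_flaky_apply`, the attempt count, and the backoff values are all hypothetical, and the flaky apply only simulates a webhook that becomes ready after a few failed calls.

```python
import time
from typing import Callable


def apply_with_retries(
    apply: Callable[[], None],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call apply() until it succeeds; return the number of attempts used.

    Hypothetical helper: waits initial_delay seconds after the first
    failure and doubles the wait after each further failure.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            apply()
            return attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2
    raise ValueError("max_attempts must be at least 1")


def make_flaky_apply(failures_before_ready: int) -> Callable[[], None]:
    """Simulate an apply that fails until the webhook is 'ready'."""
    state = {"calls": 0}

    def apply() -> None:
        state["calls"] += 1
        if state["calls"] <= failures_before_ready:
            raise RuntimeError("failed calling webhook")

    return apply
```

For example, `apply_with_retries(make_flaky_apply(2))` succeeds on the third attempt. If the webhook never becomes reachable, as in some of the reports further down, retrying only delays the same error.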
@krishnadurai: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Which version is working correctly? Any solution? |
@patsadow2 cert-manager v0.11.0 is working correctly if you apply it through KfDef (kfctl) since it retries on failures as its webhook becomes ready. |
Same situation here, what should we do if it fails repeatedly? The webhook seems ready but it goes into the same error every time (failed calling webhook). The retries always fail. |
@camille-rodriguez I'm seeing the same behavior as you.
|
@WillBeebe I eventually fixed that issue by making sure that the kubernetes-master pod was able to access the cert-manager-webhook pod on port 6443. It was a firewall issue |
@camille-rodriguez could you share your fix, please? I'm getting the same msg as @WillBeebe and you saw. I'm working with minikube and kfctl_k8s_istio.v1.0.0.yaml thanks in advance! |
@Softsapiens So in my case it is a deployment on multiple servers with charmed-kubernetes, and I had to request a firewall port to be opened for communications between the kubernetes-master nodes and the cert-manager-webhook (since they were in different subnets). I'm unsure how it works in minikube. When I do testing with kubeflow, I typically deploy it with microk8s on an ubuntu vm |
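For anyone who suspects a similar network problem, a plain TCP check can help confirm it. The sketch below assumes you can run Python on the node hosting the API server and that you substitute the webhook pod's address yourself; `can_connect` is a hypothetical helper, and port 6443 is simply the port mentioned above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the cert-manager-webhook pod IP.
    webhook_host = "127.0.0.1"
    print(can_connect(webhook_host, 6443))
```

A `False` result from the control-plane node while the pod itself reports ready points to a firewall or routing issue rather than to cert-manager.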
I ran into this today on minikube with v1.0.0-istio and kfctl 1.0-rc4 |
Oh to be clear it did retry but not enough. I manually re-ran it and then it continued the deployment. |
/reopen @holdenk was this because the cert-manager pods were not ready to validate or mutate the ClusterIssuer setup by Kubeflow? Could you please let us know? |
@krishnadurai: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I'm running into this one following the standard aws documentation (https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/, without cognito).
kubectl -n kubeflow get all -A : https://gist.github.com/andaag/4ddb060a2deb58ee149c5779b5a6ced2 //edit : this problem is likely unrelated, it was related to not enough nodes in the nodegroup to start all the services. |
See: #806 It looks like this is a known issue with cert-manager taking a long time to start see: In kfctl 1.0-rc.3 we introduced an increased timeout to try to fix this If you encounter this
|
I am using v1.0-rc.4. Retrying kfctl apply didn't work. |
I just hit this issue again with 2 subsequent deployment attempts. (In case anyone else hits the same problem, I did the re-apply like this:
) |
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml"
kfctl version
Maybe we should create LB first of all? |
@jlewi do we have the frequency of this failure using Spartakus data? |
I'm wondering whether the GKE 'click to deploy' webapp would also be prone to hitting the same issue once it supports KF1, since a re-apply would not be so straightforward in that context. |
I encountered the exact same issue. While using an EKS environment on AWS, I get the same Code 500 that others have posted above; it goes away after a few retries and everything works. However, on an air-gapped on-premise system (meaning I had to go through the tedious process of changing the various YAML config files under the kustomize folder to point to a private repo), it seems the kfctl installer is, for whatever reason, unable to create the certificates to populate the webhook-certs secret under the knative-serving namespace. I'm still trying to track down the cause of the issue and how to fix it, which is what led me to this GitHub issue. kfctl v1.0-rc.1.0-g963c787 |
Tried v1.0 release, using https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml on existing GKE cluster. Still failed. |
@duoduo-ai did you have an older version of cert-manager installed in your cluster? |
Joining the bandwagon here, I can't get Kubeflow up and running because of the same error.
Using Is webhook.cert-manager.io an external address? Cause then it is stopped by our proxy :/ Also, everything is running smoothly in |
@Kyrremann |
@krishnadurai okay. It's up and running
But it logs a bit strange:
How is |
|
Okay, I've been trying to dig some more, and I can't quite find out what is wrong. Any tips for where I can debug some more? I'm looking through the log for the api-server, but haven't found anything interesting yet. We do have a proxy, maybe it's trying to communicate with something outside the cluster, and that is failing? I'm guessing the cert-manager is talking with letsencrypt? At least I saw that mention in the Kustomize config. |
Adding some more debugging Which means that cert-manager is running, but the apiserver can't contact it. |
For anyone here having the same issue with AWS EKS: it was solved after I started a cluster with a t.large (16gb memory) instance. It turns out the resources weren't enough; I was testing with a medium one. |
I'm closing this issue because it's no longer clear whether one issue or multiple issues are being discussed. Let's try to open specific, focused issues for any problems that are still present. Here are some of the existing relevant issues.
|
I have exactly the same issue randomly popping up when I install Kubeflow on my EKS cluster =/ . Any workaround for when I install Kubeflow with this configuration? Thanks |
Issue Label Bot is not confident enough to auto-label this issue. |
/kind bug
/priority p1
Cert Manager v0.11.0 base files are deployed with ValidatingWebhookConfiguration on. This was chosen due to Simplified Webhook Bootstrapping (scroll down the page for the section).
Whenever cert-manager isn't ready with the webhook, kfctl apply fails with the error:
A second apply of the KfDef through kfctl always works.
A suggestion is to stick to the default cert-manager configuration by using an overlay for turning the validation configuration off.
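Another way to picture the problem is as a missing readiness gate: resources that depend on the webhook are applied before it can answer. The sketch below shows a generic poll-with-timeout helper. It is an assumption for illustration, not something kfctl is confirmed to do; `wait_until_ready`, its defaults, and the `is_ready` callback (which a caller would have to implement against the cluster) are all hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # gave up: the caller decides whether to fail or retry
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting; in normal use the defaults apply.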
/cc @yanniszark