Failures for CertManager deployment through kfctl when webhook is not ready #543

Closed
krishnadurai opened this issue Oct 17, 2019 · 43 comments

Comments

@krishnadurai
Contributor

krishnadurai commented Oct 17, 2019

/kind bug
/priority p1

Cert Manager v0.11.0 base files are deployed with ValidatingWebhookConfiguration on. This was chosen due to Simplified Webhook Bootstrapping (scroll down the page for the section).

Whenever cert-manager's webhook isn't ready, kfctl apply fails with the error:

Error from server (InternalError): error when creating "/tmp/kout": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

A second kfctl apply of the KfDef always works.

A suggestion is to stick to the default cert-manager configuration by using an overlay for turning the validation configuration off.

/cc @yanniszark

@krishnadurai
Contributor Author

/assign @krishnadurai

@krishnadurai krishnadurai changed the title from "Intermittent failures for CertManager deployment through kfctl when webhook is not ready" to "Failures for CertManager deployment through kfctl when webhook is not ready" Oct 18, 2019
@krishnadurai
Contributor Author

Found the root cause of this problem

The order of deployment for cert-manager with overlays self-signed or lets-encrypt is:

namespace
mutatingwebhookconfiguration
serviceaccount
clusterrole
clusterrolebinding
configmap
service
deployment
apiservice
clusterissuer
validatingwebhookconfiguration

Since clusterissuer requires validatingwebhookconfiguration to be ready for validation, the config apply for it fails with the error:

failed calling webhook "webhook.cert-manager.io"
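
A listing like the one above can be reproduced from a rendered manifest. The following is a minimal sketch, assuming the manifest arrives as a multi-document YAML stream on stdin; `list_kinds` is a hypothetical helper name, and the overlay path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the top-level `kind` of each YAML document read
# from stdin, in the order the documents appear.
# Only unindented `kind:` lines are matched, so nested fields such as
# roleRef.kind are ignored.
list_kinds() {
  awk '/^kind:/ { print tolower($2) }'
}

# Usage (placeholder path, adjust to your own overlay directory):
#   kustomize build path/to/cert-manager/overlay | list_kinds
```

If clusterissuer is printed while the webhook deployment is still starting, the apply is likely to hit the error above.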

/cc @jbrette @yanniszark

@yanniszark
Contributor

I found this commit in kustomize: kubernetes-sigs/kustomize@a09b42b

It would fix our issue, since the validating webhook is applied last; however, that would mean it does not validate the objects created before it.

I think the behavior we want is for kustomize to just honor the order declared in the kustomization.yaml. This way, we can be explicit about the ordering and fix issues as they arise.
Because kustomize decides the ordering, we are unable to fix this in a straightforward way.

@jbrette @kkasravi can we make kustomize preserve the original order?

@kkasravi
Contributor

Yes, we need to change the reorder option to none. This can be done in kfctl's kustomize.go, where we currently set it to legacy.

@krishnadurai
Contributor Author

Let me quickly try removing:

https://github.com/kubeflow/kubeflow/blob/8a2c452d8576449e8459648ca24c4e2780d30f52/bootstrap/pkg/kfapp/kustomize/kustomize.go#L944-L947

And deploy KfDef and see if the deployment is stable.

@kkasravi
Contributor

@krishnadurai yes that should work since we're not doing any ordering

@krishnadurai
Contributor Author

krishnadurai commented Oct 18, 2019

@kkasravi @yanniszark Here is the order after I remove those lines:

customresourcedefinition #From CRDs
application #From the application overlay
clusterissuer #From the self-signed/lets-encrypt overlay
namespace
apiservice
clusterrolebinding
clusterrole
deployment
mutatingwebhookconfiguration
serviceaccount
service
validatingwebhookconfiguration
configmap

And this is 'as-is' as defined in kustomization.yaml, just that the overlay files are being given precedence over base resources. Since CRDs are created, the ClusterIssuer resource is accepted by the API server.

Just let me know if overlays being applied first should be the correct order.

@yanniszark
Contributor

@krishnadurai @kkasravi the reason that the overlay files come first is because we still use the deprecated bases field.
Some things are declared as bases, others as resources and resources come before bases.

In order to honor the ordering, we should edit the kustomization merge code to use resources for everything.
@kkasravi can you take a look at the merged kustomization.yaml and confirm?

@jbrette
Contributor

jbrette commented Oct 18, 2019

@yanniszark @krishnadurai @kkasravi @jlewi Have a look at this PR

I do believe this will solve most of your issues because it gives you control over the order and allows filtering directly in kfctl. I have left the "legacy" order in place for now.

Moreover, that code is compatible, so it could potentially be loaded as a kustomize external transformer (but I believe that the version of the yaml being different would cause the .so file not to load).

I won't have time to work on it, so feel free to take over the PR if you feel this is the right solution.

@kkasravi
Contributor

thanks @jbrette, i can pick up the PR.

@yanniszark we can't switch out bases for resources w/o making a few other changes otherwise we'll get errors from kustomize. Hoping to get our kustomize targets fully compatible with kustomize-v3 for our point release

@krishnadurai
Contributor Author

This issue has been resolved by kfctl apply retries in PR kubeflow/kubeflow#4360. The ordering wasn't the root cause of this issue as mutating webhook wasn't ready until cluster issuer is applied. Ordering cluster issuer before mutating webhook wouldn't have been a clean solution.
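
For anyone who wants the same retry behaviour around a manual apply, here is a minimal sketch of a retry loop with a doubling delay. It is an illustration of the idea only, not the code from the PR; `retry_with_backoff` is a hypothetical helper, and the attempt count and initial delay in the usage comment are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not kfctl's implementation): run a command until it
# succeeds, doubling the delay between attempts.
#   $1 = maximum number of attempts
#   $2 = initial delay in seconds
#   remaining arguments = the command to run
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (assumes CONFIG_FILE points at your KfDef file):
#   retry_with_backoff 5 10 kfctl apply -V -f "${CONFIG_FILE}"
```

Retrying only helps when the webhook eventually becomes reachable; it does not make the first apply succeed.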

As for the ordering issue, I have created a tracker issue: kubeflow/kfctl#65
This still needs to be addressed, as the application order of resources has to be predictable and modifiable for the first apply to succeed.

/close

@k8s-ci-robot
Contributor

@krishnadurai: Closing this issue.


@HelloMyDevWorld

HelloMyDevWorld commented Oct 27, 2019

Which version is working correctly? Any solution?

@krishnadurai
Contributor Author

@patsadow2 cert-manager v0.11.0 is working correctly if you apply it through KfDef (kfctl) since it retries on failures as its webhook becomes ready.

@camille-rodriguez

Same situation here, what should we do if it fails repeatedly? The webhook seems ready but it goes into the same error every time (failed calling webhook). The retries always fail.

@WillBeebe
Contributor

@camille-rodriguez I'm seeing the same behavior as you.

validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
WARN[0118] Encountered error during apply:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout696595989": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:193"
WARN[0118] Will retry in 11 seconds.                     filename="kustomize/kustomize.go:194"
namespace/cert-manager unchanged

@camille-rodriguez

@WillBeebe I eventually fixed that issue by making sure that the kubernetes-master pod was able to access the cert-manager-webhook pod on port 6443. It was a firewall issue

@Softsapiens

@camille-rodriguez could you share your fix, please? I'm getting the same msg as @WillBeebe and you saw. I'm working with minikube and kfctl_k8s_istio.v1.0.0.yaml

thanks in advance!

@camille-rodriguez

@Softsapiens So in my case it is a deployment on multiple servers with charmed-kubernetes, and I had to request a firewall port to be opened for communications between the kubernetes-master nodes and the cert-manager-webhook (since they were in different subnets). I'm unsure how it works in minikube. When I do testing with kubeflow, I typically deploy it with microk8s on an ubuntu vm

@holdenk
Contributor

holdenk commented Feb 17, 2020

I ran into this today on minikube with v1.0.0-istio and kfctl 1.0-rc4

@holdenk
Contributor

holdenk commented Feb 17, 2020

Oh to be clear it did retry but not enough. I manually re-ran it and then it continued the deployment.

@krishnadurai
Contributor Author

/reopen

@holdenk was this because the cert-manager pods were not ready to validate or mutate the ClusterIssuer setup by Kubeflow?
Another possible cause is that the pods were reported as ready while the validation server inside them was not yet ready, so the validation did not happen.

Could you please let us know?

@k8s-ci-robot k8s-ci-robot reopened this Feb 17, 2020
@k8s-ci-robot
Contributor

@krishnadurai: Reopened this issue.


@andaag

andaag commented Feb 19, 2020

I'm running into this one following the standard aws documentation (https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/, without cognito).

apiservice.apiregistration.k8s.io/v1beta1.webhook.cert-manager.io unchanged
application.app.k8s.io/cert-manager unchanged
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
WARN[0633] Encountered error applying application cert-manager:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout257957979": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:202"
WARN[0633] Will retry in 24 seconds.                     filename="kustomize/kustomize.go:203"
namespace/cert-manager unchanged
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
serviceaccount/cert-manager unchanged
serviceaccount/cert-manager-cainjector unchanged
serviceaccount/cert-manager-webhook unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-edit unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-view unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-webhook:webhook-requester unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-cainjector unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-certificates unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-challenges unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-issuers unchanged
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-orders unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-certificates unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-challenges unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-issuers unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-orders unchanged
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-webhook:auth-delegator configured
configmap/cert-manager-parameters unchanged
service/cert-manager unchanged
service/cert-manager-webhook unchanged
deployment.apps/cert-manager unchanged
deployment.apps/cert-manager-cainjector configured
deployment.apps/cert-manager-webhook configured
apiservice.apiregistration.k8s.io/v1beta1.webhook.cert-manager.io unchanged
application.app.k8s.io/cert-manager unchanged
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
ERRO[0659] Permanently failed applying application cert-manager; error:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout357160928": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:206"
Error: failed to apply:  (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout357160928": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
Usage:
  kfctl apply -f ${CONFIG} [flags]

Flags:
  -f, --file string   Static config file to use. Can be either a local path:
                      		export CONFIG=./kfctl_gcp_iap.yaml
                      	or a URL:
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_gcp_iap.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_existing_arrikto.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.0.yaml
                      	kfctl apply -V --file=${CONFIG}
  -h, --help          help for apply
  -V, --verbose       verbose output default is false

failed to apply:  (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout357160928": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

kubectl -n kubeflow get all -A : https://gist.github.com/andaag/4ddb060a2deb58ee149c5779b5a6ced2

//edit: this problem is likely unrelated; it was caused by not having enough nodes in the nodegroup to start all the services.

@jlewi
Contributor

jlewi commented Feb 21, 2020

See: #806

It looks like this is a known issue with cert-manager taking a long time to start; see:
cert-manager/cert-manager#2537

In kfctl 1.0-rc.3 we introduced an increased timeout to try to fix this
https://github.com/kubeflow/kfctl/releases/tag/v1.0-rc.3

If you encounter this:

  1. Make sure you are using a kfctl newer than 1.0-rc.3 to pick up the increased timeout.
  2. You should be able to rerun kfctl apply after waiting for cert-manager to become available to continue the deployment.
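
The second step can be sketched roughly as follows, assuming the webhook runs as the Deployment cert-manager-webhook in the cert-manager namespace (the names that appear in the apply output earlier in this thread). `wait_then_apply` is a hypothetical helper and the 300s default timeout is an arbitrary assumption, not a recommended value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for the cert-manager webhook Deployment to report
# the Available condition, then re-run kfctl apply.
#   $1 = path to the KfDef config file
#   $2 = timeout passed to kubectl wait (assumed default: 300s)
wait_then_apply() {
  local config_file=$1
  local timeout=${2:-300s}
  # kubectl wait blocks until the condition is met or the timeout expires.
  kubectl wait --for=condition=Available --timeout="$timeout" \
    -n cert-manager deployment/cert-manager-webhook || return 1
  # Only re-apply once the wait has succeeded.
  kfctl apply -V -f "$config_file"
}

# Usage (assumes CONFIG_FILE points at your KfDef file):
#   wait_then_apply "${CONFIG_FILE}"
```

An Available Deployment does not guarantee that the API server can reach the webhook, so the apply can still fail for the network reasons described elsewhere in this thread.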

@xiaohanhuang

(In reply to @jlewi's comment above.)

I am using v1.0-rc.4. Retrying kfctl apply didn't work.

@amygdala

amygdala commented Feb 26, 2020

I just hit this issue again with 2 subsequent deployment attempts.
I'm using CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml & the kfctl for https://github.com/kubeflow/kfctl/releases/tag/v1.0-rc.4 (which is kfctl v1.0-rc.3-1-g24b60e8)
I tried a second 'kfctl apply' after the failure, which in my case fixed it.

(In case anyone else hits the same problem, I did the re-apply like this:

export CONFIG_FILE=<path_to_deployment_dir>/kfctl_gcp_iap.v1.0.0.yaml
kfctl apply -V -f ${CONFIG_FILE}

)

@vovkanaz

vovkanaz commented Feb 26, 2020

export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml"
Existing cluster on Hetzner cloud without Loadbalancer...
WARN[0286] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout088206345": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:202"

kfctl version
kfctl v1.0-rc.3-1-g24b60e8

Maybe we should create LB first of all?

@krishnadurai
Contributor Author

@jlewi do we have the frequency of this failure using Spartakus data?

@amygdala

I'm wondering whether the GKE 'click to deploy' webapp would also be prone to hitting the same issue once it supports KF1, since a re-apply would not be so straightforward in that context.

@garrettkyle

garrettkyle commented Feb 26, 2020

I encountered the exact same issue. While using an EKS environment on AWS I get that same Code 500 as others have posted above and it goes away after a few retries and everything works.

However, on an on-premise system which is air-gapped (meaning I had to go through the tedious process of changing the various YAML config files under the kustomize folder to point to a private repo), it seems the kfctl installer for whatever reason is unable to create the certificates to populate the webhook-certs secret under the knative-serving namespace. I'm currently still trying to track down the cause of the issue and how to fix it, which is what led me to this GitHub issue.

kfctl v1.0-rc.1.0-g963c787

@duoduo-ai

Tried v1.0 release, using https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml on existing GKE cluster. Still failed.

@krishnadurai
Contributor Author

@duoduo-ai did you have an older version of cert-manager installed in your cluster?

@Kyrremann

Kyrremann commented Mar 6, 2020

Joining the bandwagon here, I can't get Kubeflow up and running because of the same error.

WARN[0011] Encountered error applying application cert-manager:  (kubeflow.error): Code 500 with message: Apply.Run  Error error when creating "/tmp/kout198143198": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:202"

Using https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.0.yaml.

Is webhook.cert-manager.io an external address? Cause then it is stopped by our proxy :/

Also, everything is running smoothly in cert-manager namespace.

@krishnadurai
Contributor Author

@Kyrremann webhook.cert-manager.io is an internal address. This error occurs when the cert-manager webhook service is not ready to validate or mutate the Kubeflow ClusterIssuer.

@Kyrremann

@krishnadurai okay. It's up and running

$ k get pods
NAME                                      READY   STATUS    RESTARTS   AGE
cert-manager-cainjector-c578b68fc-6zxlb   1/1     Running   0          30m
cert-manager-fcc6cd946-9fxfs              1/1     Running   1          76m
cert-manager-webhook-657b94c676-mvbxt     1/1     Running   0          2m18s

But it logs a bit strange:

$ k logs cert-manager-webhook-657b94c676-mvbxt
flag provided but not defined: -v
Usage of tls:
  -tls-cert-file string

I0306 15:11:46.988922       1 secure_serving.go:123] Serving securely on [::]:6443

How is webhook.cert-manager.io defined internally?

@krishnadurai
Contributor Author

krishnadurai commented Mar 6, 2020

webhook.cert-manager.io is the name of the mutating-webhook:

@Kyrremann

Okay, I've been trying to dig some more, and I can't quite find out what is wrong. cert-manager-webhook is running, and it seems like it may be responding.

Any tips for where I can debug some more? I'm looking through the log for the api-server, but haven't found anything interesting yet.

We do have a proxy, maybe it's trying to communicate with something outside the cluster, and that is failing? I'm guessing the cert-manager is talking with letsencrypt? At least I saw that mention in the Kustomize config.

@Kyrremann

Adding some more debugging:
$ k describe apiservice v1beta1.webhook.cert-manager.io
gives me this error:
failing or missing response from https://10.254.132.114:443/apis/webhook.cert-manager.io/v1beta1: bad status from https://10.254.132.114:443/apis/webhook.cert-manager.io/v1beta1: 403

Which means that cert-manager is running, but the apiserver can't contact it.
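
The same check can be scripted. This is a minimal sketch that reads the Available condition of the APIService with kubectl's jsonpath output; `apiservice_status` is a hypothetical helper name, and the default APIService name is the one from the describe command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether an APIService is Available and, if not,
# print the condition message (for example the "bad status ... 403" text).
#   $1 = APIService name (defaults to the cert-manager webhook APIService)
apiservice_status() {
  local name=${1:-v1beta1.webhook.cert-manager.io}
  local status message
  status=$(kubectl get apiservice "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
  message=$(kubectl get apiservice "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].message}')
  echo "Available=${status}"
  if [ "$status" = "True" ]; then
    return 0
  fi
  echo "reason: ${message}" >&2
  return 1
}

# Usage:
#   apiservice_status || echo "API server cannot reach the webhook yet"
```

A non-True status with running pods points at connectivity between the API server and the webhook service (firewall, proxy, or network policy) rather than at cert-manager itself.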

@Can-Sahin

For anyone here having the same issue with AWS EKS: it was solved after I started a cluster with a t.large (16gb memory) instance. It turns out the resources weren't enough; I was testing with a medium one.

@jlewi
Contributor

jlewi commented Apr 20, 2020

I'm closing this issue because it's no longer clear whether one issue or multiple issues are being discussed. Let's try to open specific, focused issues for any problems that are still present.

Here are some of the existing relevant issues.

@jlewi jlewi closed this as completed Apr 20, 2020
@EKami

EKami commented Oct 25, 2020

I have exactly the same issue randomly popping up when I install Kubeflow on my EKS cluster =/ . Any workaround for when I install Kubeflow with this configuration? Thanks

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.
