
"openshift-install destroy cluster" leaves auth dir, breaking next install #522

Closed
sttts opened this issue Oct 23, 2018 · 17 comments

@sttts
Contributor

sttts commented Oct 23, 2018

After installing, the auth/kubeconfig file is left in the current checkout. When the cluster is destroyed, this file is not deleted. A subsequent cluster install loads this file, but it does not match the new certificates. Hence, the cluster does not work as expected for the user.

@sttts
Contributor Author

sttts commented Oct 23, 2018

Couldn't double-check this due to the current AWS quota issues. Please confirm whether this is an issue.

@wking
Member

wking commented Oct 24, 2018

A subsequent cluster install loads this file...

More on this over here and later. For now, the easiest approach is probably to use --dir=whatever and then, after you destroy the cluster, rm -rf whatever to give yourself a clean slate for the next run.
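
For example, a minimal sketch of that flow (whatever is just a placeholder directory name, and the exact subcommand spellings may vary by installer version):

  • bin/openshift-install cluster --dir=whatever            # generated assets, including auth/, land in whatever/
  • bin/openshift-install destroy cluster --dir=whatever
  • rm -rf whatever                                          # clean slate for the next run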

@crawford
Contributor

I don't follow how the presence of auth/kubeconfig caused the certificate mismatch. If the installer's state file was still present, it would have reused all of the previously generated TLS assets, resulting in valid certificates. Did you remove the installer's state file but leave the kubeconfig?

@sttts
Contributor Author

sttts commented Oct 25, 2018

Did you remove the installer's state file but leave the kubeconfig?

Yes, I did.

@stlaz

stlaz commented Oct 25, 2018

Encountered the very same problem - several times. Removing the auth/ directory always fixes the cert mismatch.

@crawford
Contributor

Yes, that makes sense. You are effectively telling the installer to ignore whatever kubeconfig it has generated and to instead use the one you have provided (which will almost always be incorrect).

We should revisit this UX issue soon.

@mazzystr

mazzystr commented Oct 31, 2018

I have been doing the following...

  • bin/openshift-install destroy-cluster
  • rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json
  • bin/openshift-install cluster
  • mv auth/kubeconfig ~/.kube/ccallegar
  • export KUBECONFIG=$HOME/.kube/ccallegar

The result is a failed cluster. All instances behind the internal and external AWS ELBs are OutOfService. Kubelet services are failed on the instances ... essentially there is no OpenShift cluster.

You will see the following output (and systemd journal logs) when starting the kubelet ....

[root@ip-10-0-7-17 ~]# /usr/bin/hyperkube kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --rotate-certificates --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --allow-privileged --node-labels=node-role.kubernetes.io/master --minimum-container-ttl-duration=6m0s --client-ca-file=/etc/kubernetes/ca.crt --cloud-provider=aws --anonymous-auth=false --register-with-taints=node-role.kubernetes.io/master=:NoSchedule
Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I1030 18:27:51.651642 4254 server.go:418] Version: v1.11.0+d4cacc0
I1030 18:27:51.652086 4254 aws.go:1032] Building AWS cloudprovider
I1030 18:27:51.652953 4254 aws.go:994] Zone not specified in configuration file; querying AWS metadata service
I1030 18:27:51.836182 4254 tags.go:76] AWS cloud filtering on ClusterID: ccallegar
F1030 18:27:51.842709 4254 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://ccallegar-api.ccallegar.sysdeseng.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: EOF

Executing the following fixes everything!

  • bin/openshift-install destroy-cluster
  • rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json auth
  • bin/openshift-install cluster
  • mv auth/kubeconfig ~/.kube/ccallegar
  • export KUBECONFIG=$HOME/.kube/ccallegar

@wking
Member

wking commented Oct 31, 2018

rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json auth

Docs for the easier --dir approach are in flight with #532.

@sttts
Contributor Author

sttts commented Nov 1, 2018

I still think the UX is broken. There should be a default for --dir and a warning or even error if that dir already exists, with an override if it is really intended to re-use a dir.

@wking
Member

wking commented Nov 1, 2018

There should be a default for --dir...

This would make life easier for devs, where cluster installs happen many times a day. But I think it would make life slightly more complicated for external users, who are more likely to successfully launch a single cluster and then walk away from the installer for weeks+. I'm fine biasing the UX in favor of external users.

However, folks who preserve their state file (or, with #556, their state directory) should be fine reusing an earlier run's assets except for the 30-minute-validity kubelet-client cert and its descendants. I think we could guard against that with asset-load-time validators.

The only remaining issue then would be folks who removed parents but not frozen child assets between runs. #556 does a better job tracking frozen/modified assets, with this warning (which we can strengthen as discussed there) in the dangerous case.
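
As a rough illustration of the kind of expiry check such a validator would do (not the installer's actual validator; the path assumes the auth/kubeconfig layout discussed in this thread, and a GNU base64), one can inspect the client certificate embedded in a leftover kubeconfig:

  • grep client-certificate-data auth/kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate    # prints notAfter=<expiry>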

@jeremyeder

I wanted to post an error message from the state my system is in now, which I think is related to this issue. openshift-install cluster delete fails with the following message:

DependencyViolation: resource sg-0d577e26d576aabc9 has a dependent object

@wking
Member

wking commented Dec 7, 2018

DependencyViolation: resource sg-0d577e26d576aabc9 has a dependent object

This is a separate issue. This issue is about leftovers on the installer host. Your issue is about leftovers in the AWS account. Can you file a new issue with the full teardown logs? Check in ${INSTALL_DIR}/.openshift_install.log, and scrub whatever you post for any information you consider sensitive.

@jeremyeder

Ack - #836

@wgordon17
Contributor

@wking

But I think it would make life slightly more complicated for external users, who are more likely to successfully launch a single cluster and then walk away from the installer for weeks+.

Why would having a default --dir value be slightly more complicated for external users? Having the default --dir automatically set to the cluster name would make it nice and obvious, I would think.

What are the potential issues here? I anticipate that once this goes GA, it would be writing to the system default kubeconfig (~/.kube/config), right? So I guess I'm curious what the complications would be.

@rajatchopra
Contributor

Fixed with this commit: b686588

Closing the issue. Please re-open if it still exists. I have tested it against a libvirt cluster and auth gets removed upon a destroy.

@rajatchopra
Contributor

Why would having a default --dir value be slightly more complicated for external users? Having the default --dir automatically set to the cluster name would make it nice and obvious, I would think.

Setting it to the cluster name may be okay, but only when we are generating everything from scratch. We have to support the case where an install-config is imported.

What are the potential issues here? I anticipate that once this goes GA, it would be writing to the system default kubeconfig (~/.kube/config), right? So I guess I'm curious what the complications would be.

~/.kube/config may not be ideal (though useful in many cases) because multiple clusters may (and likely will) need to be launched from a single local machine. So isolating a cluster to its target 'dir' is strongly desirable.
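
For illustration, a sketch of that per-directory isolation with two clusters launched from one machine (directory names are placeholders; subcommand spellings may vary by installer version):

  • bin/openshift-install create cluster --dir=cluster-a
  • bin/openshift-install create cluster --dir=cluster-b
  • export KUBECONFIG=$PWD/cluster-a/auth/kubeconfig        # talk to cluster A
  • export KUBECONFIG=$PWD/cluster-b/auth/kubeconfig        # switch to cluster B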

@tzvatot

tzvatot commented May 5, 2019

This issue still happens with

./openshift-install v4.1.0-201904211700-dirty
built from commit f3b726cc151f5a3d66bc7e23e81b3013f1347a7e
release image quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867

Not sure why this version is marked "dirty" - I downloaded it from the official link.

wking added a commit to wking/openshift-docs that referenced this issue May 22, 2019
Users occasionally have trouble with installations where they recycled
an asset directory from a previous cluster, and so pick up state like
expired X.509 certificates [1] or unexpected release images [2].
While current installers attempt to remove most assets upon successful
cluster deletion, there are still some outstanding issues with that
[3].  It's safer to just use a fresh directory, and this commit tries
to get wording to that effect into each flow that passes through
'openshift-install create ...'.  The analogous upstream docs are in
[4].

I'm not adjusting installation-generate-ignition-configs.adoc, because
it is only consumed by the metal and vSphere flows, and they both go
through modules/installation-initializing-manual first.
installation-initializing-manual.adoc suggests a mkdir, which will
fail if the directory already exists, and these folks are already
thinking about the installer loading information from their asset
directory, so it didn't seem like they needed the same warning.

[1]: openshift/installer#522
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1713016#c4
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1673251
[4]: https://github.com/openshift/installer/blame/8811e63e3f70196f088d6bbf3993ca9043ac3909/README.md#L53-L55
kalexand-rh pushed a commit to kalexand-rh/openshift-docs that referenced this issue May 24, 2019