
"openshift-install destroy cluster" leaves auth dir, breaking next install #522

Closed
sttts opened this issue Oct 23, 2018 · 17 comments

@sttts
Contributor

sttts commented Oct 23, 2018

After installing, the auth/kubeconfig file is left in the current checkout. When the cluster is destroyed, this file is not deleted. A subsequent cluster install loads this file, but it does not match the new certificates. Hence, the cluster does not work as expected for the user.

@sttts
Contributor Author

sttts commented Oct 23, 2018

Couldn't double-check this due to the current AWS quota issues. Please confirm whether this is an issue.

@wking
Member

wking commented Oct 24, 2018

A subsequent cluster install loads this file...

More on this over here and later. For now, the easiest approach is probably to use --dir=whatever and then, after you destroy the cluster, rm -rf whatever to give yourself a clean slate for the next run.
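
For example, a minimal sketch of that flow (whatever is just a placeholder directory name, and the exact subcommand spellings may vary by installer version):

  • bin/openshift-install cluster --dir=whatever            # generated assets, including auth/, land in whatever/
  • bin/openshift-install destroy cluster --dir=whatever
  • rm -rf whatever                                          # clean slate for the next run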

@crawford
Contributor

I don't follow how the presence of auth/kubeconfig caused the certificate mismatch. If the installer's state file was still present, it would have reused all of the previously generated TLS assets, resulting in valid certificates. Did you remove the installer's state file but leave the kubeconfig?

@sttts
Contributor Author

sttts commented Oct 25, 2018

Did you remove the installer's state file but leave the kubeconfig?

Yes, I did.

@stlaz

stlaz commented Oct 25, 2018

Encountered the very same problem - several times. Removing the auth/ directory always fixes the cert mismatch.

@crawford
Contributor

Yes, that makes sense. You are effectively telling the installer to ignore whatever kubeconfig it has generated and to instead use the one you have provided (which will almost always be incorrect).

We should revisit this UX issue soon.

@mazzystr

mazzystr commented Oct 31, 2018

I have been doing the following...

  • bin/openshift-install destroy-cluster
  • rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json
  • bin/openshift-install cluster
  • mv auth/kubeconfig ~/.kube/ccallegar
  • export KUBECONFIG=$HOME/.kube/ccallegar

The result is a failed cluster. All instances behind the internal and external AWS ELBs are OutOfService. Kubelet services are failed on the instances ... essentially there is no OpenShift cluster.

You will see the following output (and systemd journal logs) when starting the kubelet ....

[root@ip-10-0-7-17 ~]# /usr/bin/hyperkube kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --rotate-certificates --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --allow-privileged --node-labels=node-role.kubernetes.io/master --minimum-container-ttl-duration=6m0s --client-ca-file=/etc/kubernetes/ca.crt --cloud-provider=aws --anonymous-auth=false --register-with-taints=node-role.kubernetes.io/master=:NoSchedule
Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I1030 18:27:51.651642 4254 server.go:418] Version: v1.11.0+d4cacc0
I1030 18:27:51.652086 4254 aws.go:1032] Building AWS cloudprovider
I1030 18:27:51.652953 4254 aws.go:994] Zone not specified in configuration file; querying AWS metadata service
I1030 18:27:51.836182 4254 tags.go:76] AWS cloud filtering on ClusterID: ccallegar
F1030 18:27:51.842709 4254 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://ccallegar-api.ccallegar.sysdeseng.com:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: EOF

Executing the following fixes everything!

  • bin/openshift-install destroy-cluster
  • rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json auth
  • bin/openshift-install cluster
  • mv auth/kubeconfig ~/.kube/ccallegar
  • export KUBECONFIG=$HOME/.kube/ccallegar

@wking
Member

wking commented Oct 31, 2018

rm -rf terraform.state terraform.tfvars terraform.tfstate .openshift_install_state.json metadata.json auth

Docs for the easier --dir approach are in flight with #532.

@sttts
Contributor Author

sttts commented Nov 1, 2018

I still think the UX is broken. There should be a default for --dir and a warning or even error if that dir already exists, with an override if it is really intended to re-use a dir.

@wking
Member

wking commented Nov 1, 2018

There should be a default for --dir...

This would make life easier for devs, where cluster installs happen many times a day. But I think it would make life slightly more complicated for external users, who are more likely to successfully launch a single cluster and then walk away from the installer for weeks+. I'm fine biasing the UX in favor of external users.

However, folks who preserve their state file (or, with #556, their state directory) should be fine reusing an earlier run's assets except for the 30-minute-validity kubelet-client cert and its descendants. I think we could guard against that with asset-load-time validators.

The only remaining issue then would be folks who removed parents but not frozen child assets between runs. #556 does a better job tracking frozen/modified assets, with this warning (which we can strengthen as discussed there) in the dangerous case.
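
As a rough illustration of the kind of expiry check such a validator would do (not the installer's actual validator; the path assumes the auth/kubeconfig layout discussed in this thread, and a GNU base64), one can inspect the client certificate embedded in a leftover kubeconfig:

  • grep client-certificate-data auth/kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate    # prints notAfter=<expiry>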

@jeremyeder

I wanted to post an error message from the state my system is in now, which I think is related to this issue. openshift-install cluster delete fails with the following message:

DependencyViolation: resource sg-0d577e26d576aabc9 has a dependent object

@wking
Member

wking commented Dec 7, 2018

DependencyViolation: resource sg-0d577e26d576aabc9 has a dependent object

This is a separate issue. This issue is about leftovers on the installer host. Your issue is about leftovers in the AWS account. Can you file a new issue with the full teardown logs? Check in ${INSTALL_DIR}/.openshift_install.log, and scrub whatever you post for any information you consider sensitive.

@jeremyeder

Ack - #836

@wgordon17
Contributor

@wking

But I think it would make life slightly more complicated for external users, who are more likely to successfully launch a single cluster and then walk away from the installer for weeks+.

Why would having a default --dir value be slightly more complicated for external users? Having the default --dir automatically set to the cluster name would make it nice and obvious, I would think.

What are the potential issues here? I anticipate that once this goes GA, it would be writing to the system default kubeconfig (~/.kube/config), right? So I guess I'm curious what the complications would be.

@rajatchopra
Contributor

Fixed with this commit: b686588

Closing the issue. Please re-open if it still exists. I have tested it against a libvirt cluster and auth gets removed upon a destroy.

@rajatchopra
Contributor

Why would having a default --dir value be slightly more complicated for external users? Having the default --dir automatically set to the cluster name would make it nice and obvious, I would think.

Setting it to the cluster name may be okay, but only when we are generating everything from scratch. We have to support the case where an install-config is imported.

What are the potential issues here? I anticipate that once this goes GA, it would be writing to the system default kubeconfig (~/.kube/config), right? So I guess I'm curious what the complications would be.

~/.kube/config may not be ideal (though useful in many cases) because multiple clusters may (and likely will) need to be launched from a single local machine. So isolating a cluster to its target 'dir' is strongly desirable.
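
For illustration, a sketch of that per-directory isolation with two clusters launched from one machine (directory names are placeholders; subcommand spellings may vary by installer version):

  • bin/openshift-install create cluster --dir=cluster-a
  • bin/openshift-install create cluster --dir=cluster-b
  • export KUBECONFIG=$PWD/cluster-a/auth/kubeconfig        # talk to cluster A
  • export KUBECONFIG=$PWD/cluster-b/auth/kubeconfig        # switch to cluster B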

@tzvatot

tzvatot commented May 5, 2019

This issue still happens with

./openshift-install v4.1.0-201904211700-dirty
built from commit f3b726cc151f5a3d66bc7e23e81b3013f1347a7e
release image quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867

Not sure why this version is marked "dirty" - I downloaded it from the official link.

wking added a commit to wking/openshift-docs that referenced this issue May 22, 2019
Users occasionally have trouble with installations where they recycled
an asset directory from a previous cluster, and so pick up state like
expired X.509 certificates [1] or unexpected release images [2].
While current installers attempt to remove most assets upon successful
cluster deletion, there are still some outstanding issues with that
[3].  It's safer to just use a fresh directory, and this commit tries
to get wording to that effect into each flow that passes through
'openshift-install create ...'.  The analogous upstream docs are in
[4].

I'm not adjusting installation-generate-ignition-configs.adoc, because
it is only consumed by the metal and vSphere flows, and they both go
through modules/installation-initializing-manual first.
installation-initializing-manual.adoc suggests a mkdir, which will
fail if the directory already exists, and these folks are already
thinking about the installer loading information from their asset
directory, so it didn't seem like they needed the same warning.

[1]: openshift/installer#522
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1713016#c4
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1673251
[4]: https://github.com/openshift/installer/blame/8811e63e3f70196f088d6bbf3993ca9043ac3909/README.md#L53-L55
kalexand-rh pushed a commit to kalexand-rh/openshift-docs that referenced this issue May 24, 2019