
[BUG] crc start fails on Waiting until the user's pull secret is written to the instance disk... step #4110

Closed
lavocatt opened this issue Apr 8, 2024 · 19 comments
Labels: kind/bug, status/need triage

Comments


lavocatt commented Apr 8, 2024

General information

  • OS: Linux
  • Hypervisor: KVM
  • Did you run crc setup before starting it (Yes/No)? yes
  • Running CRC on: Laptop

CRC version

CRC version: 2.34.1+b470b5
OpenShift version: 4.15.3
Podman version: 4.4.4

CRC status

crc status --log-level debug
DEBU CRC version: 2.34.1+b470b5                   
DEBU OpenShift version: 4.15.3                    
DEBU Podman version: 4.4.4                        
DEBU Running 'crc status'                         
CRC VM:          Running
OpenShift:       Unreachable (v4.15.3)
RAM Usage:       1.932GB of 10.96GB
Disk Usage:      19.05GB of 32.68GB (Inside the CRC VM)
Cache Usage:     26.27GB
Cache Directory: /home/tlavocat/.crc/cache

CRC config

crc config view
- consent-telemetry                     : yes

Host Operating System

NAME="Fedora Linux"
VERSION="39 (Workstation Edition)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="Workstation Edition"
VARIANT_ID=workstation

Steps to reproduce

  1. crc setup
  2. crc start -p ~/Documents/pull-secret

Expected

I expect a normal startup.

Actual

 Error waiting for apiserver: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 1
 (x222)

Logs

Before gathering the logs, try the following and see if it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

Please consider posting the output of crc start --log-level debug on http://gist.github.com/ and post the link in the issue.

See: https://gist.github.com/lavocatt/3f5d8fca0ac3ad52ddae982b906c3080

lavocatt added the kind/bug and status/need triage labels on Apr 8, 2024

lavocatt commented Apr 8, 2024

Update: re-running the command got past the Waiting for kube-apiserver availability step, but it then failed on Waiting until the user's pull secret is written to the instance disk...
Could it be linked to: #4036?


lavocatt commented Apr 8, 2024

DEBU Running SSH command: timeout 5s oc get secret --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: NAME                       TYPE                                  DATA   AGE
builder-dockercfg-z4hls    kubernetes.io/dockercfg               1      13d
builder-token-58b2c        kubernetes.io/service-account-token   4      14d
default-dockercfg-mwkzf    kubernetes.io/dockercfg               1      13d
default-token-snsbf        kubernetes.io/service-account-token   4      14d
deployer-dockercfg-qr92t   kubernetes.io/dockercfg               1      13d
deployer-token-2wc2f       kubernetes.io/service-account-token   4      14d 
DEBU NAME                       TYPE                                  DATA   AGE
builder-dockercfg-z4hls    kubernetes.io/dockercfg               1      13d
builder-token-58b2c        kubernetes.io/service-account-token   4      14d
default-dockercfg-mwkzf    kubernetes.io/dockercfg               1      13d
default-token-snsbf        kubernetes.io/service-account-token   4      14d
deployer-dockercfg-qr92t   kubernetes.io/dockercfg               1      13d
deployer-token-2wc2f       kubernetes.io/service-account-token   4      14d 
DEBU Running SSH command: <hidden>                
DEBU SSH command succeeded                        
DEBU Waiting for availability of resource type 'machineconfigs' 
DEBU retry loop: attempt 0                        
DEBU Running SSH command: timeout 5s oc get machineconfigs --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
00-worker                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-ssh                                                                                 3.2.0             14d
99-node-sizing-for-crc                                                                        3.2.0             14d
99-openshift-machineconfig-master-dummy-networks                                              3.2.0             14d
99-worker-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-worker-ssh                                                                                 3.2.0             14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d 
DEBU NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
00-worker                                          b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-master-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-container-runtime                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
01-worker-kubelet                                  b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
97-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-master-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
98-worker-generated-kubelet                        b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-master-ssh                                                                                 3.2.0             14d
99-node-sizing-for-crc                                                                        3.2.0             14d
99-openshift-machineconfig-master-dummy-networks                                              3.2.0             14d
99-worker-generated-registries                     b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             14d
99-worker-ssh                                                                                 3.2.0             14d
rendered-master-01227552b0ddd4e8983d4ee6c0c8dcef   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d
rendered-worker-ebd0dd73a3bb2c3974af877858d1f047   b693d2ffbac7ddab712901662fabffe2fda81562   3.4.0             13d 
DEBU Running SSH command: timeout 30s oc get machineconfigs 99-master-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys[0]}' --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: <nil>, output: ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAG+WEPCXfkqRWah/UPVqZx9DMMVGc9k99WO+hZ8lKQXUeKZ7DxES0PJs6SZquOGUBlUkvJ7o5H9Jn1PpuNJy6QMawAboAY6p+tq9nNwjGiVThXhqKxV+fwxCfh6LBPcdjtvuwbbrij6MtOoi0qGAUCIbRVexzD+UuhJEk+CrPEbi5XlbA== 
INFO Waiting until the user's pull secret is written to the instance disk... 
DEBU retry loop: attempt 0                        
DEBU Running SSH command: <hidden>                
DEBU SSH command succeeded                        
DEBU error: Temporary error: pull secret not updated to disk - sleeping 2s 
DEBU retry loop: attempt 1                        
DEBU Running SSH command: <hidden>                

lavocatt changed the title from "[BUG] crc start fails on Waiting for kube-apiserver availability... step" to "[BUG] crc start fails on Waiting until the user's pull secret is written to the instance disk... step" on Apr 8, 2024

lavocatt commented Apr 9, 2024

After discussing with @praveenkumar and @cfergeau, we found out that the issue was linked to DNS resolution. This is the working configuration:

crc config view
- consent-telemetry                     : no
- network-mode                          : user
- pull-secret-file                      : /home/tlavocat/Documents/pull-secret
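
For completeness, that configuration can be set up front with crc config set. A rough sketch (my own summary, assuming the usual delete/cleanup/setup cycle, since a network-mode change only applies to a freshly created instance):

$ crc config set network-mode user
$ crc config set pull-secret-file ~/Documents/pull-secret
$ crc delete -f && crc cleanup && crc setup && crc start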

Thanks for your help!

lavocatt closed this as completed on Apr 9, 2024

cfergeau commented Apr 9, 2024

It would still be nice to fix the DNS resolution problem you were having :)


lavocatt commented Apr 9, 2024

It would still be nice to fix the DNS resolution problem you were having :)

I can reopen the issue if you want


jmazzitelli commented Apr 25, 2024

I'm getting this same error. I can't tell - is there a solution?

I do have these set, which I thought would do it, but it does not:

crc config set network-mode user
crc config set host-network-access true

This is with v2.34.1


gbraad commented Apr 26, 2024

It would still be nice to fix the DNS resolution problem you were having :)

DNS resolution is a network setup issue. What was the DNS server set to? Any specifics would help to determine what causes the lookup failure.

network-mode user will force the user-mode network stack and resolve DNS on the host, but this does not guarantee success.

More info about the DNS used is necessary to determine the problem.
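
For example, something like the following on the host would already tell us a lot (assuming NetworkManager and/or systemd-resolved are in use):

$ cat /etc/resolv.conf
$ resolvectl status
$ nmcli dev show | grep -i dns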

@jmazzitelli

Well, that's interesting. I turned off my corporate VPN and it started fine.

What would that indicate? Is there something I can do for this to work if I am on the VPN? I would like to remain on VPN if possible.

Note: when I just got this to work, I did have network-mode user and host-network-access true set explicitly (they must have been configured previously; I did not mean to set them for this run, but I did have them set for a prior run). This is what crc config view shows:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- host-network-access                   : true
- kubeadmin-password                    : kiali
- memory                                : 32768
- network-mode                          : user

I can try it without those settings just to see what happens. But at minimum, the above works for me when I am off the VPN.


jmazzitelli commented Apr 26, 2024

OK, this is interesting. It is hung again. This is what I did:

  1. I completely removed my $HOME/.crc directory.
  2. I then started crc v2.34.1 (passing in my pull secret via the cmdline option).
  3. I am NOT connected to the VPN.

My config is:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- kubeadmin-password                    : kiali
- memory                                : 32768
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt

So it seems like those network-mode/host-network-access settings had some positive effect when I am not on the VPN.

I am going to try this test again. I will completely remove $HOME/.crc again, only this time I will explicitly set those two network config options while remaining off the VPN. I'll see if that works.
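
Roughly, that re-test amounts to the following (a sketch; the pull-secret path is the one from my config above):

$ rm -rf ~/.crc
$ crc config set network-mode user
$ crc config set host-network-access true
$ crc setup
$ crc start -p /home/jmazzite/bin/downloads/openshift-pull-secret.txt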

(and I would like to repeat - none of these problems occur with v2.32.0 - so something changed between that version of CRC and v2.34.1 that is causing this problem)


jmazzitelli commented Apr 26, 2024

It failed when I turned off those network settings, and now, even worse, after Ctrl-C'ing the hung crc process, I can't restart anything. I continue to get the following:

Is 'crc daemon' running? Cannot reach daemon API: Get "http://unix/api/version": dial unix /home/jmazzite/.crc/crc-http.sock: connect: no such file or directory

I see in my crc output logs:

DEBU Running 'systemctl --user status crc-daemon.service' 
DEBU Command failed: exit status 3                
DEBU stdout: * crc-daemon.service - CRC daemon
     Loaded: loaded (/home/jmazzite/.config/systemd/user/crc-daemon.service; static)
    Drop-In: /usr/lib/systemd/user/service.d
             `-10-timeout-abort.conf
     Active: inactive (dead)
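
One thing that may be worth trying at this point (just a guess on my part, not an official fix) is starting the user unit by hand and retrying:

$ systemctl --user start crc-daemon.service
$ crc start --log-level debug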

I ran crc cleanup and I think it might be getting further along, but I can't tell. With debug on, I see this continually scrolling:

DEBU Running SSH command: timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig 
DEBU SSH command results: err: Process exited with status 1, output:  
DEBU E0426 16:51:24.979560    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.980018    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981353    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.981675    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
E0426 16:51:24.982956    8719 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host
Unable to connect to the server: dial tcp: lookup api.crc.testing on 192.168.130.1:53: no such host 
DEBU error: Temporary error: ssh command error:
command : timeout 5s oc get nodes --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 1
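
The repeated lookup failures against 192.168.130.1 (the gateway of the libvirt crc network) suggest the host-side resolver is not answering for *.crc.testing. A quick check from the host, my own sketch rather than an official diagnostic, would be:

$ dig +short api.crc.testing @192.168.130.1
$ resolvectl query api.crc.testing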

@jmazzitelli

I'm able to start again (but this time I went back to 2.32.0). I'm staying on that version for now :) I am on Fedora 39 (Red Hat CSB machine), and I know one other person who sees the same problem as I do on 2.34.1 on the same operating system. I can likely help debug this with someone on Monday, but for now I'm going to keep using 2.32.0 and hold off on upgrading.

@jmazzitelli

This is the log with the "hang" problem as it tries to store the pull secret.

crc-problem.txt

If you want to see the full series of calls I made to set up crc, I wrap it in this hack script.

I was NOT on the VPN, and I did NOT have the network configs set.

$ crc config view

- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- kubeadmin-password                    : kiali
- memory                                : 32768
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt

@jmazzitelli

And here is the CRC logs when I can start it successfully:

crc-success.txt

$ crc version
CRC version: 2.34.1+b470b5
OpenShift version: 4.15.3
Podman version: 4.4.4

$ oc get nodes
NAME   STATUS   ROLES                         AGE   VERSION
crc    Ready    control-plane,master,worker   32d   v1.28.7+6e2789b

$ oc whoami --show-server
https://api.crc.testing:6443

I was NOT on the VPN but I DID have the network configs set:

$ crc config view
- consent-telemetry                     : no
- cpus                                  : 6
- disable-update-check                  : true
- disk-size                             : 48
- host-network-access                   : true
- kubeadmin-password                    : kiali
- memory                                : 32768
- network-mode                          : user
- pull-secret-file                      : /home/jmazzite/bin/downloads/openshift-pull-secret.txt

(There is one other less important thing that is interesting. Trying to connect via ssh to the CRC cluster doesn't work like it does when I can start it on 2.32. This command:

ssh -y -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@$(crc ip) ls

would normally just work and not require me to enter a password, but now it is asking me for a password. The -i identity should let me in, though, so I don't know what is different. Maybe something changed in 2.34.1 with respect to ssh access?)
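
A couple of things that might narrow the ssh issue down (just guesses on my part): list which key files are actually present under ~/.crc/machines/crc/, and run ssh with -v to see which identities it offers:

$ ls -l ~/.crc/machines/crc/
$ ssh -v -i ~/.crc/machines/crc/id_ecdsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@$(crc ip) true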

@praveenkumar

After investigating with Fedora-CSB laptops, it looks like the systemd-resolved service is running but is not consumed by NetworkManager, so crc's preflight checks add the dispatcher file, which doesn't work since /etc/resolv.conf is not managed by systemd-resolved; in that case, DNS resolution inside the VM fails. We have two different workarounds:

Workaround-1: Disable/mask the systemd-resolved service and follow these steps:

$ sudo systemctl stop systemd-resolved.service
$ sudo systemctl mask systemd-resolved.service
$ crc delete && crc cleanup && crc setup && crc start

Workaround-2: Make sure /etc/resolv.conf is managed by systemd-resolved:

$ sudo rm -fr /etc/resolv.conf
$ sudo ln -s ../run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
$ sudo systemctl restart NetworkManager
$ crc delete && crc cleanup && crc setup && crc start
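
For Workaround-2, a quick sanity check that /etc/resolv.conf really ends up managed by systemd-resolved afterwards (my own check, not part of the workaround) is:

$ ls -l /etc/resolv.conf
$ resolvectl status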

jmazzitelli added a commit to jmazzitelli/kiali that referenced this issue Apr 30, 2024
@jmazzitelli

Workaround-1: Disable/mask the systemd-resolved service and follow these steps:

$ sudo systemctl stop systemd-resolved.service
$ sudo systemctl mask systemd-resolved.service
$ crc delete && crc cleanup && crc setup && crc start

I can confirm the above workaround works for me.

@jmazzitelli

@praveenkumar just so I'm clear - the workarounds you mentioned, are those going to be the final solution for those with this problem? Or is the plan for CRC to change in a future version so the workarounds will not be needed?

@phantomjinx

The workaround does not apply to my setup, as I routinely disable systemd-resolved and run my own DNS server through Docker.

[phantomjinx@microraptor:~/.crc/cache] 4 $ service systemd-resolved status
Redirecting to /bin/systemctl status systemd-resolved.service
○ systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; disabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: inactive (dead)
       Docs: man:systemd-resolved.service(8)
             man:org.freedesktop.resolve1(5)
             https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
             https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients

See #4144 for details of my configuration.

jshaughn pushed a commit to kiali/kiali that referenced this issue May 4, 2024
* [hack] bump crc to 2.34.1 / openshift 4.15

see crc-org/crc#4110 (comment)

* add cleanup. now delete won't purge the downloaded bundles so it is faster to restart everything. but if you want to clean things up, the new cleanup command is what you want

ferhoyos commented May 7, 2024

By applying the workaround, I can execute OCP 4.15 on my Fedora 39, but as a side effect, I am unable to run a Minikube cluster:

😄 minikube v1.32.0 on Fedora 39
✨ Using the kvm2 driver based on existing profile
👍 Starting control plane node minikube in cluster minikube
🔄 Restarting existing kvm2 VM for "minikube" ...
🤦 StartHost failed, but will try again: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
🔄 Restarting existing kvm2 VM for "minikube" ...
😿 Failed to start kvm2 VM. Running "minikube delete" may fix it: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')
❌ Exiting due to GUEST_PROVISION: error provisioning guest: Failed to start host: driver start: ensuring active networks: starting network default: virError(Code=1, Domain=0, Message='internal error: Network is already in use by interface virbr0')

If this happens, you have to unmask and then enable the systemd-resolved service again to fix the minikube issue:

sudo systemctl unmask systemd-resolved
sudo systemctl enable systemd-resolved
sudo systemctl start systemd-resolved
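
Afterwards, a rough way to verify that systemd-resolved is active again and that the libvirt default network starts cleanly (my own check) is:

$ systemctl is-active systemd-resolved
$ sudo virsh net-list --all
$ minikube start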

@jmazzitelli

Just want to close the loop on this: all my problems seem to have gone away with the new CRC v2.36.0. I can run CRC without messing with the systemd-resolved service, and I am still able to start minikube successfully (I no longer get the "Network is already in use by interface virbr0" error).
