Podman site crash leaves setup in confusing state #1298

Closed
bryonbaker opened this issue Nov 30, 2023 · 1 comment · Fixed by #1302

bryonbaker commented Nov 30, 2023

Description

There are possibly a few bugs and enhancements in this issue...
There is a way to crash the router when initialising a Podman site that leaves the environment in a state where it is difficult for a user to understand what went wrong and how to get back to a clean state from which another skupper init can be performed.

In this case the crash is caused by a new RHEL9 requirement on the CPU architecture. When running in a virtual machine, the default CPU type may not pass the host's actual architecture through to the guest, causing the router to crash after the skupper CLI believes it has started successfully. The router crashes with: Fatal glibc error: CPU does not support x86-64-v2. Configuring the hypervisor to pass the actual CPU architecture through to the guest stops the crash.
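
For reference, a quick way to check whether the guest CPU meets the x86-64-v2 baseline is to ask the glibc dynamic loader (glibc 2.33 or later; the loader path and output wording may vary by distro):

    $ /lib64/ld-linux-x86-64.so.2 --help | grep supported
    # lists the microarchitecture levels (x86-64-v2, x86-64-v3, ...) that glibc
    # considers usable on this CPU; under the kvm64 type, x86-64-v2 does not appear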

However, this crash has uncovered an issue related to the state that the host is left in when the router crashes. Other crash types will likely have the same issue.

At the heart of this issue is a lack of guidance to help a user get out of the mess.

What did I do?

  1. Built a virtual machine on Proxmox that has a CPU architecture of kvm64 (default)
  2. Start RHSI:
    $ skupper init --site-name MASTER --platform podman --ingress-host 10.10.10.6
    It is recommended to enable lingering for bryon, otherwise Skupper may not start on boot.
    Skupper is now installed for user 'bryon'.  Use 'skupper status' to get more information.
    
  3. After discovering that the router was not actually running, I tried again:
    $ skupper init --site-name MASTER --platform podman --ingress-host 10.10.10.6
    Skupper has already been initialized for user 'bryon'.
    
  4. With no understanding of what had gone wrong, I cleaned up the deployment and tried again:
    $ skupper delete
    Skupper is now removed for user 'bryon'.
    
    $ skupper init --site-name MASTER --platform podman --ingress-host 10.10.10.15
    Error: Error initializing Skupper - ingress port already bound :55671

What happened

The router started and then crashed. No feedback was provided to the user.
The journalctl logs (attached) contained two significant entries (see below).
The router port remained locked until I rebooted the machine.
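
For reference, the bound port can be traced to the leftover process without rebooting; the commands below are a sketch using the port number from the error above:

    $ ss -tlnp | grep 55671          # shows the PID/name of whatever still listens on 55671
    $ lsof -iTCP:55671 -sTCP:LISTEN  # alternative view of the same listener

In this failure mode the listener appears to be a rootless Podman port-forwarding helper rather than the router itself (see the comment below).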

Errors in system logs

1. CPU Architecture
Nov 29 07:27:28 rhsi-2 skupper-controller-podman[1956]: Fatal glibc error: CPU does not support x86-64-v2

2. Core Dump

Nov 29 07:27:26 rhsi-2 systemd-coredump[1903]: Process 1879 (podman) of user 1000 dumped core.
                                               
                                               Module libbz2.so.1 from rpm bzip2-1.0.8-13.fc38.x86_64
                                               Module libsepol.so.2 from rpm libsepol-3.5-1.fc38.x86_64
                                               Module libpcre2-8.so.0 from rpm pcre2-10.42-1.fc38.1.x86_64
                                               Module libattr.so.1 from rpm attr-2.5.1-6.fc38.x86_64
                                               Module libacl.so.1 from rpm acl-2.3.1-6.fc38.x86_64
                                               Module libcrypt.so.2 from rpm libxcrypt-4.4.36-1.fc38.x86_64
                                               Module libeconf.so.0 from rpm libeconf-0.5.2-1.fc38.x86_64
                                               Module libsemanage.so.2 from rpm libsemanage-3.5-2.fc38.x86_64
                                               Module libselinux.so.1 from rpm libselinux-3.5-1.fc38.x86_64
                                               Module libaudit.so.1 from rpm audit-3.1.2-5.fc38.x86_64
                                               Module libseccomp.so.2 from rpm libseccomp-2.5.3-4.fc38.x86_64
                                               Module podman from rpm podman-4.7.2-1.fc38.x86_64
                                               Stack trace of thread 1881:
                                               #0  0x0000555c0ed3dde1 runtime.raise.abi0 (podman + 0x47dde1)
                                               #1  0x0000555c0ed1d54e runtime.sigfwdgo (podman + 0x45d54e)
                                               #2  0x0000555c0ed1bb47 runtime.sigtrampgo (podman + 0x45bb47)
                                               #3  0x0000555c0ed3e0e9 runtime.sigtramp.abi0 (podman + 0x47e0e9)
                                               #4  0x00007f6119f88bb0 __restore_rt (libc.so.6 + 0x3dbb0)
                                               #5  0x0000555c0ed3dde1 runtime.raise.abi0 (podman + 0x47dde1)
                                               #6  0x0000555c0ed05bed runtime.fatalpanic (podman + 0x445bed)
                                               #7  0x0000555c0ed052db runtime.gopanic (podman + 0x4452db)
                                               #8  0x0000555c0ed1cddd runtime.sigpanic (podman + 0x45cddd)
                                               #9  0x0000555c0f5e2ddb github.com/containers/podman/v4/pkg/errorhandling.CloseQuiet (podman + 0xd22ddb)
                                               #10 0x0000555c0fbfd906 github.com/containers/podman/v4/libpod.(*Runtime).setupRootlessPortMappingViaRLK.func1 (podman + 0x133d906)
                                               #11 0x0000555c0fbfd862 github.com/containers/podman/v4/libpod.(*Runtime).setupRootlessPortMappingViaRLK (podman + 0x133d862)
                                               #12 0x0000555c0fbcfd0f github.com/containers/podman/v4/libpod.(*Container).setupRootlessNetwork (podman + 0x130fd0f)
                                               #13 0x0000555c0fb9fa25 github.com/containers/podman/v4/libpod.(*Container).handleRestartPolicy (podman + 0x12dfa25)
                                               #14 0x0000555c0fb84f18 github.com/containers/podman/v4/libpod.(*Container).Cleanup (podman + 0x12c4f18)
                                               #15 0x0000555c0fce44de github.com/containers/podman/v4/pkg/domain/infra/abi.(*ContainerEngine).ContainerCleanup (podman + 0x14244de)
                                               #16 0x0000555c0fe494ed github.com/containers/podman/v4/cmd/podman/containers.cleanup (podman + 0x15894ed)
                                               #17 0x0000555c0f292722 github.com/spf13/cobra.(*Command).execute (podman + 0x9d2722)
                                               #18 0x0000555c0f292f9d github.com/spf13/cobra.(*Command).ExecuteC (podman + 0x9d2f9d)
                                               #19 0x0000555c0ff308ac main.Execute (podman + 0x16708ac)
                                               #20 0x0000555c0ff2ff1f main.main (podman + 0x166ff1f)
                                               #21 0x0000555c0ed08052 runtime.main (podman + 0x448052)
                                               #22 0x0000555c0ed3c461 runtime.goexit.abi0 (podman + 0x47c461)
                                               ELF object binary architecture: AMD x86-64

What did I expect to happen

For the crash, I would have expected the skupper CLI not to return success until the router had properly stabilised.
On the second skupper init that failed, I would have expected some guidance on how to clean up the environment.
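
A possible stop-gap check after init, until the CLI blocks on router health, is to re-check the router container a few seconds later instead of trusting the init output (the container name filter here is an assumption and may differ):

    $ sleep 10
    $ podman ps --filter name=skupper-router --format '{{.Names}} {{.Status}}'
    $ skupper status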

Environment Details

  1. Use a hypervisor (Proxmox) and set the CPU architecture to kvm64 (see the sketch after this list for switching the VM to the host CPU type).
  2. Fedora 38 Workstation
  3. Skupper release: 1.5
  4. Skupper router version: quay.io/skupper/skupper-router 2.5.0 sha256:55f014d0fcf4b612eccf0f74cfb3cc298a52d529c5187b8a53ffe18b2f6a4a70
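
As noted above, passing the real CPU through to the guest avoids the crash. On Proxmox that can be done via the web UI (VM -> Hardware -> Processors -> Type: host) or from the host CLI; the commands below are a sketch, where 100 is a placeholder VM ID:

    # run on the Proxmox host; 100 is a placeholder VM ID
    $ qm set 100 --cpu host
    $ qm stop 100 && qm start 100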

Attachments

Journalctl output
joutnalctl.log

fgiorgetti self-assigned this Nov 30, 2023
fgiorgetti (Member) commented:

When the CPU architecture requirement is not met, the container restarts in a loop until the Podman network runs out of IP addresses (a Podman fix for this is included in the 4.8 release).

Once this situation happens, rootlessport processes are left behind that keep the host ports used by Skupper on Podman bound, preventing new attempts to create sites from working.
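
Until that fix is available, the stale helpers can be cleaned up manually; this is a rough sketch, not an official procedure:

    $ pgrep -af rootlessport    # list the leftover port-forwarding helpers
    $ pkill -f rootlessport     # stop them so ports such as 55671 are released
    $ ss -tlnp | grep 55671     # confirm the port is free again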

I am going to push a preventive fix that will make sure the containers can run
successfully on the respective host, before creating the site and binding ports
on the host machine.
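
A rough illustration of that kind of pre-flight check (not necessarily how the actual fix does it; the image tag comes from the environment details above):

    # try to run the router image once, detached from any site setup, before
    # the site is created and host ports are bound
    $ podman run --rm --entrypoint /bin/sh quay.io/skupper/skupper-router:2.5.0 -c true
    # a non-zero exit (e.g. the glibc x86-64-v2 abort) means this host cannot
    # run the router and site creation should stop here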
