Temporary failure in name resolution for service name when using docker compose with podman #24566

Closed
taladar opened this issue Nov 14, 2024 · 6 comments · Fixed by #24578
Labels
5.3 kind/bug Categorizes issue or PR as related to a bug. network Networking related issue or feature regression

Comments


taladar commented Nov 14, 2024

Issue Description

A docker-compose file that worked fine on podman 5.2.5 (and works fine again after downgrading to 5.2.5) is broken on 5.3.0: the service gets a reproducible "Temporary failure in name resolution" error when trying to look up the database container name.

Steps to reproduce the issue


  1. Create docker-compose.yml
name: repro

services:
    database:
      image: debian:bookworm
      init: true
      stop_grace_period: 0s
      command: ["sleep", "60"]
    dnstest:
      image: debian:bookworm
      init: true
      stop_grace_period: 0s
      command: ["getent", "hosts", "database"]
  2. Run docker compose up --abort-on-container-exit
  3. Observe that getent exits with status 0 on podman 5.2.5 but with status 2 (one or more supplied keys could not be found in the database) on podman 5.3.0. Oddly enough, an IP/host combination is printed in both cases, but with the usual DNS APIs this difference results in "Temporary failure in name resolution".
  4. Clean up with docker compose down

Describe the results you received

With DNS lookups in 5.3.0 I get an error when looking up one of the names of other services in the docker-compose.yml

Describe the results you expected

With DNS lookups in 5.2.5 I do get the hostname resolved to the IP of the container as I would expect.

podman info output

For 5.3.0

host:
  arch: amd64
  buildahVersion: 1.38.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: app-containers/conmon-2.1.11
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.11, commit: unknown'
  cpuUtilization:
    idlePercent: 93.8
    systemPercent: 1.51
    userPercent: 4.7
  cpus: 24
  databaseBackend: boltdb
  distribution:
    distribution: gentoo
    version: "2.17"
  eventLogger: journald
  freeLocks: 2039
  hostname: taladardesktop
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.11.7
  linkmode: dynamic
  logDriver: journald
  memFree: 10312306688
  memTotal: 67315314688
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: app-containers/aardvark-dns-1.12.2-r1
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.2
    package: app-containers/netavark-1.12.2-r1
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: app-containers/crun-1.17
    path: /usr/bin/crun
    version: |-
      crun version 1.17
      commit: 000fa0d4eeed8938301f3bcf8206405315bc1017
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: net-misc/passt-2024.09.06
    version: |
      pasta 2024.09.06
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: app-containers/slirp4netns-1.2.0
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 71h 5m 6.00s (Approximately 2.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/taladar/.config/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 4
    stopped: 3
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/taladar/.local/share/containers/storage
  graphRootAllocated: 214681255936
  graphRootUsed: 169024409600
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 531
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/taladar/.local/share/containers/storage/volumes
version:
  APIVersion: 5.3.0
  Built: 1731596777
  BuiltTime: Thu Nov 14 16:06:17 2024
  GitCommit: ""
  GoVersion: go1.23.2
  Os: linux
  OsArch: linux/amd64
  Version: 5.3.0

For the 5.2.5 where the problem does not occur (after downgrade)

host:
  arch: amd64
  buildahVersion: 1.37.5
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: app-containers/conmon-2.1.11
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.11, commit: unknown'
  cpuUtilization:
    idlePercent: 93.79
    systemPercent: 1.51
    userPercent: 4.7
  cpus: 24
  databaseBackend: boltdb
  distribution:
    distribution: gentoo
    version: "2.17"
  eventLogger: journald
  freeLocks: 2039
  hostname: taladardesktop
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.11.7
  linkmode: dynamic
  logDriver: journald
  memFree: 10394812416
  memTotal: 67315314688
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: app-containers/aardvark-dns-1.12.2-r1
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.2
    package: app-containers/netavark-1.12.2-r1
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: app-containers/crun-1.17
    path: /usr/bin/crun
    version: |-
      crun version 1.17
      commit: 000fa0d4eeed8938301f3bcf8206405315bc1017
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: net-misc/passt-2024.09.06
    version: |
      pasta 2024.09.06
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: app-containers/slirp4netns-1.2.0
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 71h 8m 55.00s (Approximately 2.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/taladar/.config/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 4
    stopped: 3
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/taladar/.local/share/containers/storage
  graphRootAllocated: 214681255936
  graphRootUsed: 169024729088
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 531
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/taladar/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.5
  Built: 1731597565
  BuiltTime: Thu Nov 14 16:19:25 2024
  GitCommit: ""
  GoVersion: go1.23.2
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.5

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

@taladar taladar added the kind/bug Categorizes issue or PR as related to a bug. label Nov 14, 2024
Member

Luap99 commented Nov 14, 2024

Can you provide the full error output? It is not clear to me what the expected error situation is.

This seems to work for me using podman from main

$ ~/go/bin/docker-compose up --abort-on-container-exit
[+] Building 0.0s (0/0)                                                                                                                        docker-container:default
[+] Running 2/0
 ✔ Container repro-dnstest-1   Created                                                                                                                             0.0s 
 ✔ Container repro-database-1  Created                                                                                                                             0.0s 
Attaching to repro-database-1, repro-dnstest-1
repro-dnstest-1   | 10.89.0.8       database.dns.podman
repro-dnstest-1 exited with code 0
Aborting on container exit...
[+] Stopping 2/2
 ✔ Container repro-dnstest-1   Stopped                                                                                                                             0.0s 
 ✔ Container repro-database-1  Stopped    

Member

Luap99 commented Nov 14, 2024

One thing to note here: AFAICT both containers are created and started in parallel by compose, so there is no dependency between them. If the dnstest container is created and started before the database container is running, it would not resolve the name.
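
If start order is the concern, one way to make the lookup tolerant of it is to retry until the name resolves. The following is a minimal sketch, assuming a POSIX shell is available in the container; `retry` is a hypothetical helper, not something podman or compose provides.

```shell
# Hypothetical helper (not part of podman or compose): run a command until it
# succeeds, giving up after a fixed number of attempts.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep if another attempt follows.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (commented out): wait up to about 10 seconds for "database" to resolve.
# retry 10 1 getent hosts database
```

A retry like this only papers over a start-order race; it would not help if the lookup fails for another reason.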

Author

taladar commented Nov 14, 2024

Oops, I guess I left out the depends_on part when trying to minimize this, but the actual use case has that bit.

@Luap99 Luap99 added network Networking related issue or feature 5.3 labels Nov 15, 2024
Member

Luap99 commented Nov 15, 2024

Ok I managed to reproduce it. The key seems to be to have another container with dns running before, i.e.

$ podman network create n1
$ podman run -d --network n1 docker.io/library/debian:bookworm sleep inf

Then your reproducer seems to work as a reproducer. I don't really know of any 5.3 network changes that would affect this, but I guess I will try to bisect it.

Member

Luap99 commented Nov 15, 2024

I think this is an upgrade issue. The new version hits a nil deref because it expects data that is written during rootless-netns setup, and that data was not written by 5.2.
Can you update to 5.3 again, then stop all your containers and make sure /run/user/$UID/containers/networks/rootless-netns/ is empty or does not exist? Then start the containers again and it should just work. Alternatively, just reboot, which also clears out the old state since it lives on a tmpfs.
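
The check described above could be scripted roughly as follows. This is a sketch under assumptions: `dir_missing_or_empty` is a hypothetical helper, `id -u` stands in for $UID, and the podman commands are left commented out because they stop every running container.

```shell
# Hypothetical helper: succeeds when DIR is missing or contains no entries.
dir_missing_or_empty() {
    dir="$1"
    [ ! -e "$dir" ] && return 0
    [ -d "$dir" ] && [ -z "$(ls -A "$dir")" ]
}

# Path taken from the comment above; `id -u` is used in place of $UID.
netns_dir="/run/user/$(id -u)/containers/networks/rootless-netns"

# Intended use after upgrading to 5.3 (commented out on purpose):
# podman stop --all
# if dir_missing_or_empty "$netns_dir"; then
#     echo "old rootless-netns state is gone; containers can be restarted"
# else
#     echo "state still present in $netns_dir" >&2
# fi
```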

I will submit a fix for the nil deref but it is best to restart in order to get the new functionality of host.containers.internal (https://blog.podman.io/2024/10/podman-5-3-changes-for-improved-networking-experience-with-pasta/)

Simple reproducer using 5.2.5 as podman and 5.3.0 as bin/podman:

$ podman run -d --network n1 docker.io/library/debian:bookworm sleep inf
55696891183d28905d47e64f66fcb34b734b67edc835b778c7f1e5f73aa96054
$ bin/podman run  --network n1 docker.io/library/debian:bookworm true
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x14890b6]

goroutine 1 [running]:
github.com/containers/podman/v5/libpod.(*Container).addHosts(0xc00015c820)
        /home/pholzing/go/src/github.com/containers/podman/libpod/container_internal_common.go:2326 +0x1d6
github.com/containers/podman/v5/libpod.(*Container).completeNetworkSetup(0xc00015c820)
        /home/pholzing/go/src/github.com/containers/podman/libpod/container_internal.go:1027 +0x85
github.com/containers/podman/v5/libpod.(*Container).init(0xc00015c820, {0x203eaf8, 0xc0005a2420}, 0x0)
        /home/pholzing/go/src/github.com/containers/podman/libpod/container_internal.go:1146 +0x8b1
github.com/containers/podman/v5/libpod.(*Container).prepareToStart(0xc00015c820, {0x203eaf8, 0xc0005a2420}, 0xf0?)
        /home/pholzing/go/src/github.com/containers/podman/libpod/container_internal.go:861 +0x3bc
github.com/containers/podman/v5/libpod.(*Container).Attach(0xc00015c820, {0x203eaf8, 0xc0005a2420}, 0xc00037c510, {0x1ceb888, 0xd}, 0xc000052720, 0x1)
        /home/pholzing/go/src/github.com/containers/podman/libpod/container_api.go:169 +0x211
github.com/containers/podman/v5/pkg/domain/infra/abi/terminal.StartAttachCtr({0x203eaf8, 0xc0005a2420}, 0xc00015c820, 0xc00007e028, 0xc00007e030, 0x0, {0x1ceb888, 0xd}, 0x1, 0x1)
        /home/pholzing/go/src/github.com/containers/podman/pkg/domain/infra/abi/terminal/terminal_common.go:92 +0x494
github.com/containers/podman/v5/pkg/domain/infra/abi.(*ContainerEngine).ContainerRun(0xc00007e4a0, {0x203eaf8, 0xc0005a2420}, {{0x0, 0x0}, 0x0, {0x1ceb888, 0xd}, 0xc00007e030, 0x0, ...})
        /home/pholzing/go/src/github.com/containers/podman/pkg/domain/infra/abi/containers.go:1187 +0x76d
github.com/containers/podman/v5/cmd/podman/containers.run(0x2d56520, {0xc0001c8c40, 0x2, 0x4})
        /home/pholzing/go/src/github.com/containers/podman/cmd/podman/containers/run.go:224 +0xa83
github.com/spf13/cobra.(*Command).execute(0x2d56520, {0xc000040260, 0x4, 0x4})
        /home/pholzing/go/src/github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:985 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0x2d34000)
        /home/pholzing/go/src/github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /home/pholzing/go/src/github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        /home/pholzing/go/src/github.com/containers/podman/vendor/github.com/spf13/cobra/command.go:1034
main.Execute()
        /home/pholzing/go/src/github.com/containers/podman/cmd/podman/root.go:116 +0xb4
main.main()
        /home/pholzing/go/src/github.com/containers/podman/cmd/podman/main.go:61 +0x4b2

Author

taladar commented Nov 15, 2024

I can confirm that an upgrade to 5.3.0 followed by a reboot does not exhibit the problem anymore.

Thank you for the quick help.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Nov 18, 2024
In theory RootlessNetnsInfo() should never return nil here. However that
was actually only true when the rootless netns was set up before and
wrote the right cache file with the ip addresses.

Given this cache file is a new feature just added in 5.3, if you updated
from 5.2 or earlier the file will not exist, thus causing failures for all
containers started afterwards.
The fix for this is to stop all containers and make sure the
rootless-netns was removed so the next start creates it new with the
proper 5.3 cache file. However, as there is no way to rely on users doing
that, and it is also not a requirement, simply handle the nil deref here.

The only way to test this would be to run the old version then the new
version which we cannot really do in CI. We do have upgrade test for
that but they are root only and likely need a lot more work to get them
going rootless but certainly worth to explore to prevent such problems
in the future.

Fixes: a1e6603 ("libpod: make use of new pasta option from c/common")
Fixes: containers#24566

Signed-off-by: Paul Holzinger <[email protected]>