
Failed to create bus connection: No such file or directory - when calling /api/v1/health endpoint #1732

Closed
adnoh opened this issue Mar 15, 2018 · 13 comments


@adnoh commented Mar 15, 2018

Description

I've just deployed opensuse/portus:2.3.0 (with Puma) connected to a postgres:9.6 pod in Kubernetes.
Everything works fine, except that Portus logs "Failed to create bus connection: No such file or directory" every time the Kubernetes liveness probe calls the /api/v1/health check. The database is accessible, the software works as expected, and calling the URL in the browser reports success everywhere (database, clair, registry), but Portus always logs this error. The /api/v1/_ping endpoint works as expected (no error in the log).

Steps to reproduce

  1. Watch the pod output of Portus.
  2. Call the /api/v1/health endpoint.
  • Expected behavior: no error message in the log.
  • Actual behavior: an error message in the log, although everything works.
@mssola (Collaborator) commented Apr 5, 2018

Weird, I cannot seem to reproduce this issue 🤔 Could you set the log level to debug as explained on this page and then paste the logs?

@Vad1mo (Contributor) commented Apr 5, 2018

What I see when calling /api/v1/health is that it takes quite some time to respond, e.g. up to 30 seconds.
Another thing is that some OS processes are started, which may cause problems on K8s. I am not sure whether that is related, but it might be something to look at.

@adnoh (Author) commented Apr 11, 2018

Failed to create bus connection: No such file or directory
Failed to create bus connection: No such file or directory
Database ready
Failed to create bus connection: No such file or directory
[44] Puma starting in cluster mode...
[44] * Version 3.10.0 (ruby 2.5.0-p0), codename: Russell's Teapot
[44] * Min threads: 1, max threads: 4
[44] * Environment: production
[44] * Process workers: 2
[44] * Preloading application
[Mailer config] Host: portus.foo.net
[Mailer config] Protocol: https://
[44] * Listening on ssl://0.0.0.0:3000?cert=/certificates/portus/tls.crt&key=/certificates/portus/tls.key&verify_mode=none
[44] Use Ctrl-C to stop
[44] - Worker 0 (pid: 47) booted, phase: 0
[44] - Worker 1 (pid: 51) booted, phase: 0
Started GET "/api/v1/health" for 192.168.1.1 at 2018-04-11 15:01:45 +0000
Failed to create bus connection: No such file or directory
(7.4ms) SELECT COUNT(*) FROM "registries"
Registry Load (0.9ms) SELECT "registries".* FROM "registries" ORDER BY "registries"."id" ASC LIMIT 1


sh-4.3# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 79 0.0 0.0 20000 3108 pts/0 Ss 15:04 0:00 sh
root 85 0.0 0.0 43612 3396 pts/0 R+ 15:04 0:00 _ ps auxf
root 1 0.0 0.0 20000 2940 ? Ss 15:00 0:00 /bin/bash /init
root 39 0.0 0.0 19996 2748 ? S 15:00 0:00 /bin/bash /usr/sbin/portusctl exec pumactl -F /srv/Portus/config/puma.rb start
root 41 0.1 0.0 64900 18588 ? Sl 15:00 0:00 _ packaging/suse/portusctl/bin/portusctl exec pumactl -F /srv/Portus/config/puma.rb start
root 44 1.4 0.1 268204 99292 ? Sl 15:00 0:03 _ puma 3.10.0 (ssl://0.0.0.0:3000?cert=/certificates/portus/tls.crt&key=/certificates/portus/tls.k
root 47 0.0 0.1 811012 100880 ? Sl 15:00 0:00 _ puma: cluster worker 0: 44 [Portus]
root 51 0.1 0.1 878604 108440 ? Sl 15:00 0:00 _ puma: cluster worker 1: 44 [Portus]

@mssola (Collaborator) commented Apr 30, 2018

That's frustrating, I've tried to reproduce it in multiple ways with no luck... Would you have a configuration that I could use to reproduce it?

@ghost commented Jun 4, 2018

I'm also seeing this. Fresh deployment of Portus to Kubernetes.

@rendhalver
I am seeing the same errors here.
We have Portus 2.3 deployed into Kubernetes with the chart in this PR: helm/charts#2766.

@linkvt commented Jun 21, 2018

I found out that something inside the Portus container expects the DBus socket to be available at /var/run/dbus/system_bus_socket. When mounting the socket from the host machine (located at /run/dbus/system_bus_socket) the error disappeared, but I didn't notice a difference in the application.

@mssola we run

  • Ubuntu 16.04 machines
  • with Kubernetes 1.10.1
  • set up with rancher/rke

The same error appeared when executing portusctl, e.g. with portusctl exec rake portus:info.
After mounting the dbus socket, which is located at /run/dbus/system_bus_socket (at least on Ubuntu), the error disappeared.
The same error also appeared when running timedatectl (see the related issue on StackOverflow).
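
For reference, a rough sketch of what that mount could look like in a Kubernetes pod spec. This only illustrates the workaround described above, not an official configuration; the container and volume names are placeholders, and the host path may differ on other distributions.

# Sketch only: bind-mount the host's DBus system bus socket into the
# container at the path Portus expects.
spec:
  containers:
    - name: portus                          # placeholder container name
      volumeMounts:
        - name: dbus-socket
          mountPath: /var/run/dbus/system_bus_socket
  volumes:
    - name: dbus-socket
      hostPath:
        path: /run/dbus/system_bus_socket   # location on Ubuntu 16.04 hosts
        type: Socket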

@flavio (Member) commented Jul 10, 2018

I ran into the very same issue while trying to run opensuse/portus:2.3 on top of Kubernetes, and I was able to find the root cause.

I'll just write down the whole process; it might be useful for @mssola or others ;)

Debugging

First of all I changed the command of the pod to run ["sh", "-c", "sleep 24h"]. By doing that I had plenty of time for debugging.

Then I entered the pod (just use docker exec or kubectl exec) and performed the following steps:

  • zypper in strace
  • export PORTUS_INIT_COMMAND="rails r /srv/Portus/bin/check_db.rb"
  • strace -e trace=open -f -o strace.log /init

Then I looked at strace.log. The file is big; just search for exited with 1 and look at the lines above it. They will look similar to:

407   open("/srv/Portus/vendor/bundle/ruby/2.5.0/gems/thor-0.19.4/lib/thor/util.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/vendor/bundle/ruby/2.5.0/gems/thor-0.19.4/lib/thor/util.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/packaging/suse/portusctl/lib/constants.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/packaging/suse/portusctl/lib/constants.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/proc/1/cgroup", O_RDONLY|O_CLOEXEC) = 7
409   open("/proc/self/status", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 9
409   open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
409   open("/usr/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
409   open("/proc/filesystems", O_RDONLY) = 3
409   open("/proc/self/stat", O_RDONLY|O_CLOEXEC) = 3
409   open("/sys/fs/kdbus/0-system/bus", O_RDWR|O_NOCTTY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
409   +++ exited with 1 +++

Basically Rails is going through all the gems required by Portus. These are not something you should worry about (if the problem were in one of these gems, others would have complained about it).

Then it reads something Portus-specific: /srv/Portus/packaging/suse/portusctl/lib/constants.rb, and then it performs a bunch of other open requests.

Some of them caught my interest:

407   open("/proc/1/cgroup", O_RDONLY|O_CLOEXEC) = 7
409   open("/proc/self/status", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 9
[...]
409   open("/proc/filesystems", O_RDONLY) = 3
409   open("/proc/self/stat", O_RDONLY|O_CLOEXEC) = 3
409   open("/sys/fs/kdbus/0-system/bus", O_RDWR|O_NOCTTY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

The most significant ones are:

  • /proc/1/cgroup: reading this file is a common way to figure out whether you are running inside a container.
  • /sys/fs/kdbus/0-system/bus: this entry exists neither inside the container nor on my host, because I have kdbus disabled.

Then I looked at the Ruby code that was loaded right before this series of open(2) calls:

def dockerized?
  File.read("/proc/1/cgroup").include?("docker")
end

# This one is set by the bash wrapper we deliver with our RPM
# See packaging/suse/bin/portusctl
BUNDLER_BIN = ENV["BUNDLER_BIN"]
HOSTNAME = (dockerized? || ENV["TRAVIS"] ? `hostname -f` : `hostnamectl --static status`).chomp
PORTUS_ROOT = ENV["PORTUS_ROOT"] ? ENV["PORTUS_ROOT"] : "/srv/Portus"

As you can see, the HOSTNAME variable is computed by invoking the dockerized? function: when it returns true (or TRAVIS is set), hostname -f is used, otherwise hostnamectl --static status is used.

The dockerized? function searches for the docker string inside the contents of /proc/1/cgroup.
This returns false when the container is run by Kubernetes, because PID 1 doesn't belong to a docker cgroup; it belongs to the kubepods cgroup instead (see this document).

As a proof:

bash-4.3# cat /proc/1/cgroup 
12:hugetlb:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
11:cpuset:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
10:perf_event:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
9:pids:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
8:devices:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
7:memory:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
6:rdma:/
5:net_cls,net_prio:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
4:freezer:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
3:cpu,cpuacct:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
2:blkio:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
1:name=systemd:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
0::/podruntime.slice/containerd.service

Given that dockerized? returned false (and TRAVIS is not set), the code invokes hostnamectl --static status instead of hostname -f. hostnamectl tries to access /sys/fs/kdbus/0-system/bus, which can be confirmed by running:

$: hostnamectl --static status
Failed to create bus connection: No such file or directory
$: echo $?
1

The failure of hostnamectl causes a Ruby exception to be raised, which causes /init to abort.

That's the origin of our bug.

Fix

I'm about to open a PR that fixes the issue by changing how the dockerized? function behaves.
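
For illustration, here is a minimal sketch of the kind of check that would also cover Kubernetes. This is only a sketch of the idea, not the actual patch: the containerized? name and the kubepods match are my own assumptions.

# Hypothetical sketch: treat both plain Docker and Kubernetes-managed
# containers as "containerized", so the code never falls back to
# hostnamectl (which needs a working DBus connection).
def containerized?
  cgroup = File.read("/proc/1/cgroup")
  cgroup.include?("docker") || cgroup.include?("kubepods")
rescue Errno::ENOENT
  false
end

HOSTNAME = (containerized? || ENV["TRAVIS"] ? `hostname -f` : `hostnamectl --static status`).chomp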

flavio added a commit that referenced this issue Jul 10, 2018
The code determining if Portus is running containerized is broken on
Kubernetes because in this case PID 1 doesn't belong to the `docker`
cgroup slice.

That causes the code to invoke `hostnamectl`, which tries to access
kdbus from within a container, making Rails fail immediately.

Signed-off-by: Flavio Castelli <[email protected]>
@flavio (Member) commented Jul 10, 2018

BTW, given he commented on this issue... I just wanted to send ❤️ to @rendhalver for his work on Portus' helm chart!

mssola added a commit that referenced this issue Jul 12, 2018
Improve detection of containerized environments - fix issue #1732
@mssola (Collaborator) commented Jul 12, 2018

Fixed by #1879.

mssola closed this as completed Jul 12, 2018
@rendhalver
I think you got the PR reference wrong. #1732 is this issue.

@mssola (Collaborator) commented Jul 12, 2018

@rendhalver fixed 😅

@ghostsquad
Just curious: if this was erroring, what was it trying to do that it couldn't do because of the error?

ldnunes pushed a commit to OSSystems/Portus that referenced this issue Dec 5, 2018
* upstream/v2.3:
  Bump version
  Remind users to login again after password update
  Restrict deletes into the repository
  Bump version
  update sprockets to fix cve-2018-3760
  Bump version
  oauth/gitlab: Be sure to load all groups
  Allow to use private gitlab server
  Fixed regression on registries not being created
  Make it possible to connect to a database socket
  Allow Puma to bind to unix socket also in production
  user: skip validations when creating portus user
  passwords: don't allow the portus user to reset
  user: do not allow the update of the portus user
  Improve detection of containerized environments - fix issue SUSE#1732