
Failed to create bus connection: No such file or directory - when calling /api/v1/health endpoint #1732

Closed
adnoh opened this issue Mar 15, 2018 · 13 comments


@adnoh commented Mar 15, 2018

Description

I've just deployed opensuse/portus:2.3.0 (with Puma) connected to a postgres:9.6 pod in Kubernetes.
Everything works fine, except that Portus logs "Failed to create bus connection: No such file or directory" every time the Kubernetes liveness probe calls the /api/v1/health check. The database is accessible, the software works as expected, and calling the URL in the browser reports success everywhere (database, clair, registry), but Portus always logs this error. The /api/v1/_ping endpoint works as expected (no error in the log).

Steps to reproduce

  1. Watch the pod output of Portus.
  2. Call the /api/v1/health endpoint.
  • Expected behavior: no error message in the log.
  • Actual behavior: an error message in the log, although everything works.
@mssola (Collaborator) commented Apr 5, 2018

Weird, I cannot seem to reproduce this issue 🤔 Could you set the log level to debug as explained on this page and then paste the logs?

@Vad1mo (Contributor) commented Apr 5, 2018

What I see when calling /api/v1/health is that it takes quite some time to respond, e.g. up to 30 seconds.
Another thing is that some OS processes are started, which may cause problems on K8s. I am not sure whether that is related, but it might be something to look at.

@adnoh (Author) commented Apr 11, 2018

Failed to create bus connection: No such file or directory
Failed to create bus connection: No such file or directory
Database ready
Failed to create bus connection: No such file or directory
[44] Puma starting in cluster mode...
[44] * Version 3.10.0 (ruby 2.5.0-p0), codename: Russell's Teapot
[44] * Min threads: 1, max threads: 4
[44] * Environment: production
[44] * Process workers: 2
[44] * Preloading application
[Mailer config] Host: portus.foo.net
[Mailer config] Protocol: https://
[44] * Listening on ssl://0.0.0.0:3000?cert=/certificates/portus/tls.crt&key=/certificates/portus/tls.key&verify_mode=none
[44] Use Ctrl-C to stop
[44] - Worker 0 (pid: 47) booted, phase: 0
[44] - Worker 1 (pid: 51) booted, phase: 0
Started GET "/api/v1/health" for 192.168.1.1 at 2018-04-11 15:01:45 +0000
Failed to create bus connection: No such file or directory
(7.4ms) SELECT COUNT(*) FROM "registries"
Registry Load (0.9ms) SELECT "registries".* FROM "registries" ORDER BY "registries"."id" ASC LIMIT 1


sh-4.3# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 79 0.0 0.0 20000 3108 pts/0 Ss 15:04 0:00 sh
root 85 0.0 0.0 43612 3396 pts/0 R+ 15:04 0:00 _ ps auxf
root 1 0.0 0.0 20000 2940 ? Ss 15:00 0:00 /bin/bash /init
root 39 0.0 0.0 19996 2748 ? S 15:00 0:00 /bin/bash /usr/sbin/portusctl exec pumactl -F /srv/Portus/config/puma.rb start
root 41 0.1 0.0 64900 18588 ? Sl 15:00 0:00 _ packaging/suse/portusctl/bin/portusctl exec pumactl -F /srv/Portus/config/puma.rb start
root 44 1.4 0.1 268204 99292 ? Sl 15:00 0:03 _ puma 3.10.0 (ssl://0.0.0.0:3000?cert=/certificates/portus/tls.crt&key=/certificates/portus/tls.k
root 47 0.0 0.1 811012 100880 ? Sl 15:00 0:00 _ puma: cluster worker 0: 44 [Portus]
root 51 0.1 0.1 878604 108440 ? Sl 15:00 0:00 _ puma: cluster worker 1: 44 [Portus]

@mssola (Collaborator) commented Apr 30, 2018

That's frustrating, I've tried to reproduce it in multiple ways with no luck... Would you have a configuration that I could use to reproduce it?

@ghost commented Jun 4, 2018

I'm also seeing this. Fresh deployment of Portus to Kubernetes.

@rendhalver
I am seeing the same errors here.
We have Portus 2.3 deployed into Kubernetes with the chart in this PR: helm/charts#2766.

@linkvt commented Jun 21, 2018

I found out that something inside the Portus container expects the DBus socket to be available at /var/run/dbus/system_bus_socket. When mounting the socket from the host machine (located at /run/dbus/system_bus_socket) the error disappeared, but I didn't notice a difference in the application.

@mssola we run

  • Ubuntu 16.04 machines
  • with Kubernetes 1.10.1
  • set up with rancher/rke

The same error appeared when executing portusctl, e.g. with portusctl exec rake portus:info.
After mounting the dbus socket, which is located at /run/dbus/system_bus_socket (at least on Ubuntu), the error disappeared.
The same error also appeared when running timedatectl (see the related issue on StackOverflow).
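
For reference, a rough sketch of what that mount could look like in a Kubernetes pod spec. This only illustrates the workaround described above, not an official configuration; the container and volume names are placeholders, and the host path may differ on other distributions.

# Sketch only: bind-mount the host's DBus system bus socket into the
# container at the path Portus expects.
spec:
  containers:
    - name: portus                          # placeholder container name
      volumeMounts:
        - name: dbus-socket
          mountPath: /var/run/dbus/system_bus_socket
  volumes:
    - name: dbus-socket
      hostPath:
        path: /run/dbus/system_bus_socket   # location on Ubuntu 16.04 hosts
        type: Socket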

@flavio (Member) commented Jul 10, 2018

I ran into the very same issue while trying to run opensuse/portus:2.3 on top of Kubernetes, and I was able to find the root cause.

I'll just write down the whole process; it might be useful for @mssola or others ;)

Debugging

First of all I changed the command of the pod to run ["sh", "-c", "sleep 24h"]. By doing that I had plenty of time for debugging.

Then I entered the pod (just use docker exec or kubectl exec) and performed the following steps:

  • zypper in strace
  • export PORTUS_INIT_COMMAND="rails r /srv/Portus/bin/check_db.rb"
  • strace -e trace=open -f -o strace.log /init

Then I looked at strace.log. The file is big; just search for exited with 1 and look at the lines above it. They will look similar to:

407   open("/srv/Portus/vendor/bundle/ruby/2.5.0/gems/thor-0.19.4/lib/thor/util.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/vendor/bundle/ruby/2.5.0/gems/thor-0.19.4/lib/thor/util.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/packaging/suse/portusctl/lib/constants.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/srv/Portus/packaging/suse/portusctl/lib/constants.rb", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 7
407   open("/proc/1/cgroup", O_RDONLY|O_CLOEXEC) = 7
409   open("/proc/self/status", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 9
409   open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/librt.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
409   open("/usr/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
409   open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
409   open("/proc/filesystems", O_RDONLY) = 3
409   open("/proc/self/stat", O_RDONLY|O_CLOEXEC) = 3
409   open("/sys/fs/kdbus/0-system/bus", O_RDWR|O_NOCTTY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
409   +++ exited with 1 +++

Basically Rails is going through all the gems required by Portus. These are not something you should worry about (if the problem were in one of these gems, others would have complained about it).

Then it reads something Portus-specific: /srv/Portus/packaging/suse/portusctl/lib/constants.rb, and then it performs a bunch of other open requests.

Some of them caught my interest:

407   open("/proc/1/cgroup", O_RDONLY|O_CLOEXEC) = 7
409   open("/proc/self/status", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 9
[...]
409   open("/proc/filesystems", O_RDONLY) = 3
409   open("/proc/self/stat", O_RDONLY|O_CLOEXEC) = 3
409   open("/sys/fs/kdbus/0-system/bus", O_RDWR|O_NOCTTY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

The most significant ones are:

  • /proc/1/cgroup: reading this file is a common way to figure out whether you are running inside a container.
  • /sys/fs/kdbus/0-system/bus: this entry exists neither inside the container nor on my host, because I have kdbus disabled.

Then I looked at the Ruby code that was loaded right before this series of open(2) calls:

def dockerized?
  File.read("/proc/1/cgroup").include?("docker")
end

# This one is set by the bash wrapper we deliver with our RPM
# See packaging/suse/bin/portusctl
BUNDLER_BIN = ENV["BUNDLER_BIN"]
HOSTNAME = (dockerized? || ENV["TRAVIS"] ? `hostname -f` : `hostnamectl --static status`).chomp
PORTUS_ROOT = ENV["PORTUS_ROOT"] ? ENV["PORTUS_ROOT"] : "/srv/Portus"

As you can see, the HOSTNAME variable is computed by invoking the dockerized? function: when it returns true (or TRAVIS is set), hostname -f is used, otherwise hostnamectl --static status is used.

The dockerized? function searches for the docker string inside the contents of /proc/1/cgroup.
This returns false when the container is run by Kubernetes, because PID 1 doesn't belong to a docker cgroup; it belongs to the kubepods cgroup instead (see this document).

As a proof:

bash-4.3# cat /proc/1/cgroup 
12:hugetlb:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
11:cpuset:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
10:perf_event:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
9:pids:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
8:devices:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
7:memory:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
6:rdma:/
5:net_cls,net_prio:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
4:freezer:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
3:cpu,cpuacct:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
2:blkio:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
1:name=systemd:/kubepods/besteffort/pod6a7fa3f0d8a58d80e638403440e5b1fd/85ee832d266322364048e5a6de6cb2a42cdee0d0580c1cbd5fb461fa57536d62
0::/podruntime.slice/containerd.service

Given that dockerized? returned false (and TRAVIS is not set), the code invokes hostnamectl --static status instead of hostname -f. hostnamectl tries to access /sys/fs/kdbus/0-system/bus, which can be confirmed by running:

$: hostnamectl --static status
Failed to create bus connection: No such file or directory
$: echo $?
1

The failure of hostnamectl causes a Ruby exception to be raised, which causes /init to abort.

That's the origin of our bug.

Fix

I'm about to open a PR that fixes the issue by changing how the dockerized? function behaves.
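
For illustration, here is a minimal sketch of the kind of check that would also cover Kubernetes. This is only a sketch of the idea, not the actual patch: the containerized? name and the kubepods match are my own assumptions.

# Hypothetical sketch: treat both plain Docker and Kubernetes-managed
# containers as "containerized", so the code never falls back to
# hostnamectl (which needs a working DBus connection).
def containerized?
  cgroup = File.read("/proc/1/cgroup")
  cgroup.include?("docker") || cgroup.include?("kubepods")
rescue Errno::ENOENT
  false
end

HOSTNAME = (containerized? || ENV["TRAVIS"] ? `hostname -f` : `hostnamectl --static status`).chomp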

flavio added a commit that referenced this issue Jul 10, 2018
The code determining if Portus is running containerized is broken on
Kubernetes because in this case PID 1 doesn't belong to the `docker`
cgroup slice.

That causes the code to invoke `hostnamectl`, which tries to access
kdbus from within a container, making Rails fail immediately.

Signed-off-by: Flavio Castelli <[email protected]>
@flavio (Member) commented Jul 10, 2018

BTW, given he commented on this issue... I just wanted to send ❤️ to @rendhalver for his work on Portus' helm chart!

mssola added a commit that referenced this issue Jul 12, 2018
Improve detection of containerized environments - fix issue #1732
@mssola (Collaborator) commented Jul 12, 2018

Fixed by #1879.

mssola closed this as completed Jul 12, 2018
@rendhalver
I think you got the PR reference wrong. #1732 is this issue.

@mssola (Collaborator) commented Jul 12, 2018

@rendhalver fixed 😅

@ghostsquad
Just curious: if this was erroring, what was it trying to do that it couldn't do because of the error?

ldnunes pushed a commit to OSSystems/Portus that referenced this issue Dec 5, 2018
* upstream/v2.3:
  Bump version
  Remind users to login again after password update
  Restrict deletes into the repository
  Bump version
  update sprockets to fix cve-2018-3760
  Bump version
  oauth/gitlab: Be sure to load all groups
  Allow to use private gitlab server
  Fixed regression on registries not being created
  Make it possible to connect to a database socket
  Allow Puma to bind to unix socket also in production
  user: skip validations when creating portus user
  passwords: don't allow the portus user to reset
  user: do not allow the update of the portus user
  Improve detection of containerized environments - fix issue SUSE#1732