Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd seccomp filters cause systemd-timesyncd core dumps on Kernel 6.11 when an app uses eBPF modules #1205

Closed
XZVB12 opened this issue Oct 4, 2024 · 20 comments

Comments

@XZVB12
Copy link

XZVB12 commented Oct 4, 2024

Include the following information:

  • OpenSnitch version: 1.6.6
  • OS: Manjaro linux
  • Version: 24.1
  • Window Manager: KDE / Kwin
  • Kernel version: 6.11.0 - 6.11.1

To Reproduce
Describe in detail as much as you can what happened.

Steps to reproduce the behavior:

  1. build eBPF modules,for example http://aur.archlinux.org/packages/opensnitch-ebpf-module
  2. reboot
  3. systemctl status systemd-timesyncd
  4. See error

This issue appears on 6.11 kernel and does not affect older version.
A more detailed discussion: systemd/systemd#34615

Removing eBPF modules solves the problem.

@gustavo-iniguez-goya
Copy link
Collaborator

Hi @XZVB12 ,

I haven't analyzed nor try to reproduce this issue, but per similar cases we've had in the past, this is usually a bug in the other program (systemd-timesyncd in this case). We used to crash Transmission for example, or even the kernel, because there were some conditions not handled correctly (delayed connections, connections unexpectedly blocked/killed, ...)

I'll try to reproduce it, but before that some questions:

  • is this a "regular" kernel or a tuned kernel (liquorix, etc)?
  • do you have a rule to allow systemd-timesyncd?
  • could you change the LogLevel to DEBUG, and make systemd-timesyncd crash again? then copy /var/log/opensnitchd.log to /tmp , get the full-time of the coredump (ls -l --full-time /var/lib/coredump/...) and try to correlate the date of the core with the opensnitch logs, to see if we can obtain some hints of what was doing the daemon.

On the other hand, @Flat commented on systemd/systemd#34615 (comment):

I also notice it was giving me an issue with python running with systemd's SyscallFilters enabled and that is how I figured out what was causing it due to strace showing me uretprobe() = -1 ENOSYS (Function not implemented)

Could you post the output of ~ $ sudo opensnitchd -check-requirements? in order to see if the uprobes have been removed from the kernel 6.11 configuration.

Thank you!

@XZVB12
Copy link
Author

XZVB12 commented Oct 8, 2024

  • is this a "regular" kernel or a tuned kernel (liquorix, etc)?

Hi.Yes, this is "regular" Manjaro Linux kernel

  • do you have a rule to allow systemd-timesyncd?

Yes.
изображение

  • could you change the LogLevel to DEBUG, and make systemd-timesyncd crash again? then copy /var/log/opensnitchd.log to /tmp , get the full-time of the coredump (ls -l --full-time /var/lib/coredump/...) and try to correlate the date of the core with the opensnitch logs, to see if we can obtain some hints of what was doing the daemon.

I’ll try it later

Could you post the output of ~ $ sudo opensnitchd -check-requirements? in order to see if the uprobes have been removed from the kernel 6.11 configuration.

Checking system requirements for kernel version 6.11.2-2-MANJARO
------------------------------------------------------------------------------

Checking => CONFIG_KPROBES=y
Checking => CONFIG_KPROBES_ON_FTRACE=y
Checking => CONFIG_HAVE_KPROBES=y
Checking => CONFIG_HAVE_KPROBES_ON_FTRACE=y
Checking => CONFIG_KPROBE_EVENTS=y

* kprobes        ✔

Checking => CONFIG_UPROBES=y
Checking => CONFIG_UPROBE_EVENTS=y

* uprobes        ✔

Checking => CONFIG_FTRACE=y

* ftrace         ✔

Checking => CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
Checking => CONFIG_FTRACE_SYSCALLS=y

* syscalls       ✔

Checking => CONFIG_NETFILTER_NETLINK_QUEUE=[my]
Checking => CONFIG_NFT_QUEUE=[my]
Checking => CONFIG_NETFILTER_XT_TARGET_NFQUEUE=[my]

* nfqueue        ✔

Checking => CONFIG_NETFILTER_NETLINK=[my]
Checking => CONFIG_NETFILTER_NETLINK_QUEUE=[my]
Checking => CONFIG_NETFILTER_NETLINK_ACCT=[my]

* netlink        ✔

Checking => CONFIG_INET_DIAG=[my]
Checking => CONFIG_INET_TCP_DIAG=[my]
Checking => CONFIG_INET_UDP_DIAG=[my]
Checking => CONFIG_INET_DIAG_DESTROY=[my]

* net diagnostics        ✔

@XZVB12
Copy link
Author

XZVB12 commented Oct 8, 2024

  • could you change the LogLevel to DEBUG, and make systemd-timesyncd crash again?

According to systemd-coredump event occurred at 13:25:27

sudo coredumpctl info 3603
PID: 3603 (systemd-timesyn)
UID: 977 (systemd-timesync)
GID: 977 (systemd-timesync)
Signal: 11 (SEGV)
Timestamp: Tue 2024-10-08 13:25:27

For this time in opensnitch log there are the following entries:


�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event] -> 3587
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event inCache] -> 3587
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event] -> 3590
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event inCache] -> 3590
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exec event] ppid: 1, pid: 3603, /usr/bin/opensnitchd -> [/usr/lib/systemd/systemd-executor --deserialize 50 --log-level info --log-target journal-or-kmsg]
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF event inCache] -> 3603
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m new connection udp => 41641:192.168.0.2 -> 1.1.1.1 (0.arch.pool.ntp.org):53 uid: 977, mark: 0
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [ebpf conn] not in cache, but in execEvents: udp41641192.168.0.21.1.1.153, 3603 -> /usr/lib/systemd/systemd-timesyncd
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [ebpf conn] adding item to cache: udp41641192.168.0.21.1.1.153
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m �[1m�[32m��[0m�[0m �[1m/usr/lib/systemd/systemd-timesyncd�[0m -> 41641:�[1m192.168.0.2�[0m => �[1m0.arch.pool.ntp.org (1.1.1.1)�[0m:53, mark: 0 (�[32mallow-always-simple-usr-lib-systemd-systemd-timesyncd�[0m)
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m new connection udp => 41641:192.168.0.2 -> 1.1.1.1 (0.arch.pool.ntp.org):53 uid: 977, mark: 0
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m error deleting ebpf entry: unable to delete element: no such file or directory
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [ebpf conn] in cache: udp41641192.168.0.21.1.1.153, 3603 -> /usr/lib/systemd/systemd-timesyncd
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m �[1m�[32m��[0m�[0m �[1m/usr/lib/systemd/systemd-timesyncd�[0m -> 41641:�[1m192.168.0.2�[0m => �[1m0.arch.pool.ntp.org (1.1.1.1)�[0m:53, mark: 0 (�[32mallow-always-simple-usr-lib-systemd-systemd-timesyncd�[0m)
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m New DNS record: 89.221.207.113 -> 0.arch.pool.ntp.org
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m New DNS record: 82.142.168.18 -> 0.arch.pool.ntp.org
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m New DNS record: 91.206.16.3 -> 0.arch.pool.ntp.org
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m New DNS record: 162.159.200.123 -> 0.arch.pool.ntp.org
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exec event] ppid: 1, pid: 3606, /usr/bin/opensnitchd -> [/usr/lib/systemd/systemd-executor --deserialize 113 --log-level info --log-target journal-or-kmsg]
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF event inCache] -> 3606
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event] -> 3603
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event inCache] -> 3603
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event] -> 3606
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m [eBPF exit event inCache] -> 3606
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m client.disconnect()
�[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m UI auth: simple

The error I see here: �[2m[2024-10-08 13:25:27]�[0m �[2m�[30m�[100m DBG �[0m error deleting ebpf entry: unable to delete element: no such file or directory

I attached a log file with information where systemd-timesync starts, so you should easily find . Maybe it will help more
opensnitchd cut.log

@gustavo-iniguez-goya
Copy link
Collaborator

thank you @XZVB12 !

I've managed to reproduce it on manjaro with systemd-timesyncd 256.6 and kernel 6.11.0.6-MANJARO.

The problematic ebpf module is opensnitch-dns.o , if you rename it, systemd-timesyncd will stop crashing:
~ $ sudo mv /usr/lib/opensnitchd/ebpf/opensnitch-dns.o /usr/lib/opensnitchd/ebpf/opensnitch-dns.o.bak

I'll try to understand why systemd-timesyncd is crashing and if we can do anything about it.

@gustavo-iniguez-goya
Copy link
Collaborator

we hook 2 libc functions: gethostbyname and getaddrinfo.

systemd-timesyncd crashes when a ebpf hook is attached to uretprobe getaddrinfo (just hooking it, even without reading any variable or register). The hooks for gethostbyname or uprobe getaddrinfo work just fine.

Here's a bpftrace script that causes systemd-timesyncd to crash, without opensnitch running:

uprobe-getaddrinfo.bt.txt

(rename it to .bt, and execute it with bpftrace uprobe-getaddrinfo.bt)

so in my opinion, even if we're indirectly causing it, the bug is not in our code. This segfault will occur with any software that hooks libc's getaddrinfo uretprobe.

@Flat
Copy link

Flat commented Oct 8, 2024

I'd just like to add that I've also had the same issue happen with python using the archlinux reflector script, https://wiki.archlinux.org/title/Reflector so it does not seem to be solely systemd-timesyncd

@XZVB12
Copy link
Author

XZVB12 commented Oct 8, 2024

so in my opinion, even if we're indirectly causing it, the bug is not in our code. This segfault will occur with any software that hooks libc's getaddrinfo uretprobe.

Any idea what might be the reason? At this point, the kernel is 6.11.2 , and the problem still persists and apparently not noticed. Although I initially noticed this bug in the -RC versions of the kernel, I just didn’t report it before the final release

@gustavo-iniguez-goya
Copy link
Collaborator

no idea @XZVB12 , a change in the 6.11.x kernels maybe.

I've just tested it on kernel 6.10.11-2-MANJARO and systemd-timesyncd doesn't crash, so maybe some of the changes of uretprobes might be the reason of the crashes:

91bd008d4e2b4962ecb9a10e40c2fb666b0aeb92
Merge tag 'probes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

https://lore.kernel.org/lkml/[email protected]/T/

https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.11

@Flat
Copy link

Flat commented Oct 8, 2024

Sounds like it could be related to this

  • Add uretprobe syscall which speeds up the uretprobe 10-30% faster. This
    syscall is automatically used from user-space trampolines which are
    generated by the uretprobe. If this syscall is used by normal
    user program, it will cause SIGILL. Note that this is currently only
    implemented on x86_64.

@XZVB12
Copy link
Author

XZVB12 commented Oct 8, 2024

Sounds like it could be related to this

So it’s not a bug now, but a new way of "speeds up" ?

@gustavo-iniguez-goya
Copy link
Collaborator

gustavo-iniguez-goya commented Oct 8, 2024

it could be a bug in the 6.11 kernel (or systemd maybe).

let's summarize what's happening:

Workaround 1: rename opensnitchd-dns.o and restart opensnitch

~ $ sudo mv /usr/lib/opensnitchd/ebpf/opensnitch-dns.o /usr/lib/opensnitchd/ebpf/opensnitch-dns.o.bak
~ $ sudo systemctl restart opensnitchd

Workaround 2: #1205 (comment)

I am pretty sure it is due to the SystemCallFilter option set in the .service file used to launch via systemd. If you remove that line it will not crash.

@gustavo-iniguez-goya
Copy link
Collaborator

@XZVB12 , even if this is a kernel bug, the process which is crashing is systemd-timesyncd, so there seems to be a condition that they're not controlling, when systemd-timesyncd is launched by systemd.

So maybe they should take a look at this problem. They can use the steps above to reproduce it.

@Flat
Copy link

Flat commented Oct 9, 2024

@XZVB12 , even if this is a kernel bug, the process which is crashing is systemd-timesyncd, so there seems to be a condition that they're not controlling, when systemd-timesyncd is launched by systemd.

So maybe they should take a look at this problem. They can use the steps above to reproduce it.

I am pretty sure it is due to the SystemCallFilter option set in the .service file used to launch via systemd. If you remove that line it will not crash. The odd part is it should not be a SIGSEV it should be EPERM or EACCES. Adding uretprobe syscall to the filters does not help either.

@zquestz
Copy link

zquestz commented Oct 9, 2024

Just a heads up, this is crashing stubby on Arch as well.

@zquestz
Copy link

zquestz commented Oct 9, 2024

Just did a test, and removing the SystemCallFilter fixes both the issues on my machine. Oddly enough, you can also get them running again by reinstalling the glibc package... For some reason the SystemCallFilter doesn't trip if you reinstall glibc then run the systemd services...

@XZVB12
Copy link
Author

XZVB12 commented Oct 10, 2024

Just a heads up, this is crashing stubby on Arch as well.

It seems that any distribution with the 6.11 kernel will be affected by this problem

So maybe they should take a look at this problem. They can use the steps above to reproduce it.

I reopened the discussion and gave links to new information . Perhaps you could also join the discussion, maybe it would be easier to communicate directly.

@gustavo-iniguez-goya gustavo-iniguez-goya changed the title opensnitch eBPF modules causes systemd-timesyncd core dump on Kernel 6.11 systemd seccomp filters cause systemd-timesyncd core dumps on Kernel 6.11 when an app uses eBPF modules Oct 10, 2024
@gustavo-iniguez-goya
Copy link
Collaborator

Thank you @XZVB12 and @Flat for handling and analyzing this issue.

Hopefully it'll be resolved in the next systemd release.

@UnconnectedBedna
Copy link

UnconnectedBedna commented Nov 15, 2024

This is back again, but this time the boot freezes for 60s when trying to sync, and then systemd-timesyncd fails with core dump.

Same thing happens on 6.6 LTS, (freezes for one minute) BUT the service does NOT fail, only gets stuck for 60s.

After boot:

sudo /usr/lib/systemd/systemd-timesyncd

Contacted time server 192.121.108.99:123 (0.arch.pool.ntp.org).
Initial clock synchronization to Fri 2024-11-15 23:06:08.458151 CET.

Works just fine, so this is something at boot time.
I do have opensnitch-ebpf-module installed.

If you want me to make a new post I can do that, but since this seems to be some kind of regression, I started by posting here.

@Flat
Copy link

Flat commented Nov 15, 2024

The original issue still hasn't been resolved. libseccomp will need a new release with the syscalls added. seccomp/libseccomp#439 once this has been merged and libseccomp updated it should resolve this issue.

@UnconnectedBedna
Copy link

UnconnectedBedna commented Nov 15, 2024

Ah, that is what is happening.
Strange that it started working on my system, and then regressed again.

I DID try to check seccomp first before posting here with the links referenced (seccomp/libseccomp#438 > seccomp/libseccomp#435) and they looked merged, but clearly there is more judging by the link you provided. Ie, my "research" was sloppy.

Thank you for the clarification!
I'll just hold on for the devs to "catch up", I can wait one minute for the boot to complete.. xD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants