Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support user mode stacktraces in events #2175

Merged
merged 4 commits into from
Apr 8, 2024

Conversation

anfedotoff
Copy link
Contributor

@anfedotoff anfedotoff commented Mar 1, 2024

Hi 👋!
I was inspired by #1429 and decided to add the ability for collecting user mode stacktraces. User-mode stacktraces allow you to enrich information about events and understand why this event occurred. For example, this information will help you understand why the seccomp policy is triggered.
Please, have a look, I'm ready to make changes if needed. Thanks!

Support user mode stacktraces in events. To enable this feature, set userStackTrace: true in the policy Post action.
To distinguish different stacktraces, kernel stacktraces are now enabled with kernelStackTrace policy field (renamed from stackTrace), and posted in kernel_stack_trace event field (renamed from stack_trace).

@anfedotoff anfedotoff requested review from a team and mtardy as code owners March 1, 2024 16:25
Copy link

netlify bot commented Mar 1, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 2de28b1
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/660d9268bb999300085ef424
😎 Deploy Preview https://deploy-preview-2175--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mtardy mtardy added the release-note/minor This PR introduces a minor user-visible change label Mar 1, 2024
@mtardy
Copy link
Member

mtardy commented Mar 1, 2024

Welcome! Thanks for taking the time to add this, I started the CI, and some checks might be worth a look to see what's wrong. I'll try to review your PR quickly!

The only thing that immediately came to my mind is that we would break compatibility if we changed the name of the previous stack trace thing to kernel stack trace, but that might make sense to do so. Let me take a look and we'll see.

@anfedotoff
Copy link
Contributor Author

I'm trying to fix vmtests. But I'm lack of logs. I think, the problem is that there is no gcc on VM image and I couldn't compile test binary. @mtardy, do you have any ideas how to handle this in the right way?

@mtardy
Copy link
Member

mtardy commented Mar 4, 2024

I think, the problem is that there is no gcc on VM image and I couldn't compile test binary. @mtardy, do you have any ideas how to handle this in the right way?

Yes, so usually, the C programs we need for testing are located in contrib/tester-progs, you just need to add them in the Makefile PROGS variables and also add the output binary name in the contrib/tester-progs/.gitignore and they should be there for you when testing. Those are built with make tester-progs from the top level Makefile. Sometimes you can also reuse an existing prog also!

You should find full logs there https://github.com/cilium/tetragon/actions/runs/8132145926?pr=2175 but I don't remember if there's more than what's displayed.

@mtardy
Copy link
Member

mtardy commented Mar 5, 2024

Thanks a lot, tests look good now so it's mostly on me! Again I'll do a proper review soon :)

Copy link
Member

@mtardy mtardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for doing this. I tried it and took a look, there is just a small issue with the fact that the BPF helper might return a lot of errors and I just need to check why and what we should do with that. Also, another thank you for adding proper testing :)!

Most questions are about retro-compatibility and deps addition choices. I think it's a nice opportunity that tonight we have the first Tetragon community call so I would love to bring this to the discussion. If you can join that would be lovely, otherwise I'll write the conclusion of the discussion here anyway. See https://isogo.to/tetragon-meeting-notes.

Again thank you, this does look very cool on dynamically linked binaries, especially when symbols are not stripped, it gives awesome observability!

bpf/process/types/basic.h Show resolved Hide resolved
contrib/tester-progs/go/getcpu/getcpu.go Outdated Show resolved Hide resolved
docs/content/en/docs/concepts/tracing-policy/selectors.md Outdated Show resolved Hide resolved
go.mod Show resolved Hide resolved
pkg/encoder/encoder.go Outdated Show resolved Hide resolved
pkg/option/flags.go Outdated Show resolved Hide resolved
pkg/encoder/encoder.go Outdated Show resolved Hide resolved
pkg/procsyms/procsyms.go Show resolved Hide resolved
@mtardy mtardy added the kind/enhancement This improves or streamlines existing functionality label Mar 11, 2024
@anfedotoff
Copy link
Contributor Author

anfedotoff commented Mar 11, 2024

Thanks a lot for doing this. I tried it and took a look, there is just a small issue with the fact that the BPF helper might return a lot of errors and I just need to check why and what we should do with that. Also, another thank you for adding proper testing :)!

Most questions are about retro-compatibility and deps addition choices. I think it's a nice opportunity that tonight we have the first Tetragon community call so I would love to bring this to the discussion. If you can join that would be lovely, otherwise I'll write the conclusion of the discussion here anyway. See https://isogo.to/tetragon-meeting-notes.

Again thank you, this does look very cool on dynamically linked binaries, especially when symbols are not stripped, it gives awesome observability!

Thank you for review and invitation! I'll try to take a part in the meeting where we can discuss the PR details! I will answer on comments soon).

@mtardy
Copy link
Member

mtardy commented Mar 13, 2024

Could you git checkout main && git fetch && git pull origin main && git checkout user-stacktrace && git rebase main from your branch, run make protogen and then push the rebased version with git push origin user-stacktrace --force-with-lease. Otherwise, with conflicts we can't run the CI.

@anfedotoff anfedotoff force-pushed the user-stacktrace branch 2 times, most recently from 288c439 to 6e0d5e8 Compare March 14, 2024 10:01
@anfedotoff anfedotoff force-pushed the user-stacktrace branch 2 times, most recently from 6061323 to 313c0d4 Compare March 26, 2024 10:42
@anfedotoff
Copy link
Contributor Author

Hi, @mtardy!
Could you tell me, please, is there anything to do here on my side for now?

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

Hi, @mtardy! Could you tell me, please, is there anything to do here on my side for now?

Hey, sorry last week was KubeCon week in Paris and I was busy there. Not really, I've been chatting about the EFAULT error and there's not much we can do, from time to time, we'll indeed get a page fault when retrieving the trace from the kernel side and we can't avoid it. The best thing we could do is to warn for this in the doc (with some {{< caution >}} for example) on why it can happen and then this is not to be relied upon for very sensitive operations!

I think your PR is ready for me to take a last look and merge, even if some stuff is missing I'll take the time to add them after (like the doc thing I mentioned), I don't want you to wait longer again.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

One thing we could do is to keep the --expose-address flag (in addition to the new one) but as deprecated, you can maybe see there's a way to put that thing hidden so that people's config keeps working. Then we can remove it in some minor versions.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

Those two comments are mostly nits, if your PR passes the tests, we can consider merging and fixing the flag and the docs after.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

the check is broken currently on forks but could you bump the version here

const CustomResourceDefinitionSchemaVersion = "1.1.8"
.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

it seems that you might need to generate properly some files with make help can help: https://github.com/cilium/tetragon/actions/runs/8434642759/job/23106133639?pr=2175

@anfedotoff
Copy link
Contributor Author

One thing we could do is to keep the --expose-address flag (in addition to the new one) but as deprecated, you can maybe see there's a way to put that thing hidden so that people's config keeps working. Then we can remove it in some minor versions.

You mean to keep expose-kernel-addresses as alias to expose-stack-addresses? expose-kernel-addresses is the original name used before.

@anfedotoff
Copy link
Contributor Author

Those two comments are mostly nits, if your PR passes the tests, we can consider merging and fixing the flag and the docs after.

Yes, let's do this way!

@anfedotoff
Copy link
Contributor Author

The best thing we could do is to warn for this in the doc (with some {{< caution >}} for example) on why it can happen and then this is not to be relied upon for very sensitive operations!

It's a good point, but I don't understand why EFAULT is happened sometimes. I only have some ideas, but they might be wrong. Anyway some {{< caution >}} message would be helpful for users.

@anfedotoff
Copy link
Contributor Author

I think, it is good to squash all commits in one and write some commit description. I can do when PR will be camera ready or it could be done while merging.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

I think, it is good to squash all commits in one and write some commit description. I can do when PR will be camera ready or it could be done while merging.

I think you could squash some things but please keep the separation for, for example:

  • code
  • addition of test
  • documentation
  • make generate etc... + bump of crd.

@mtardy
Copy link
Member

mtardy commented Mar 26, 2024

One thing we could do is to keep the --expose-address flag (in addition to the new one) but as deprecated, you can maybe see there's a way to put that thing hidden so that people's config keeps working. Then we can remove it in some minor versions.

You mean to keep expose-kernel-addresses as alias to expose-stack-addresses? expose-kernel-addresses is the original name used before.

ah indeed. Maybe yes that would be a good idea to not break compatibility.

@anfedotoff anfedotoff requested a review from willfindlay as a code owner March 27, 2024 12:42
@anfedotoff
Copy link
Contributor Author

@mtardy It seems to me, that ARM tests doesn't fit the limit...

yes it reminds me of this #2210, what's weird is that usually if we rerun we should have a successful run.

Do you have any ideas how to debug this? Maybe let's raise the limit and see are they successful at all?

You can try to push a patch by bumping the two timeouts (one in the go tests and one in the gha workflow), as of now there's no way to tell unfortunately if:

  • this is flaky anyway
  • your PR makes this thing fail constantly
  • your PR made it worse

I can try to run locally as I have an arm64 machine with your PR and see if this appears or not!

I increased test duration for 2 times (80m), it doesn't help... If you have time to debug this problem on ARM machine, it will be great. I have no ARM machine, unfortunaty, but I'll try to find an ARM VM somewhere).

@anfedotoff
Copy link
Contributor Author

anfedotoff commented Mar 28, 2024

I've tested the test that I've added on ARM VM it works:

sudo go test -run TestKprobeUser
time="2024-03-28T16:13:29Z" level=info msg="BTF discovery: default kernel btf file found" btf-file=/sys/kernel/btf/vmlinux
time="2024-03-28T16:13:34Z" level=info msg="Exit probe on acct_process"
PASS
time="2024-03-28T16:13:53Z" level=info msg="map dir `/sys/fs/bpf/testSensorTracing` still exists after test. Removing it."
ok      github.com/cilium/tetragon/pkg/sensors/tracing  25.110s

I use qemu without kvm. Maybe the problem is because I added one more test?

I ran make test it fails on:

logcapture.go:25: time="2024-03-28T21:52:10Z" level=info msg="Loaded BPF map
s and events for sensor successfully" sensor=gtp-sensor-153
    selectors_char_buf_test.go:141: loading sensors took: 4.368µs
    selectors_char_buf_test.go:166: 
                Error Trace:    /home/user/tetragon/pkg/sensors/tracing/selector
s_char_buf_test.go:166
                Error:          Not equal: 
                                expected: 1
                                actual  : 0
                Test:           TestCharBufTracepoint
                Messages:       expected events with 'pizzaisthebest'

But it doesn't hangs... So I think, the hang problem is more related to ARM CI, not to ARM tests themself.

Also, I tested on Arm VM ./pkg/sensors/tracing only and it works fine:

go test -exec "sudo" -p 1 -parallel 1  -gcflags= -timeout 600m -failfast -cover ./pkg/sensors/tracing... 
ok      github.com/cilium/tetragon/pkg/sensors/tracing  3113.221s       coverage: 68.1% of statements
...skipping...
make -C "contrib/tester-progs"
make[1]: Entering directory '/home/user/tetragon/contrib/tester-progs'
go build -o lseek-pipe ./go/lseek-pipe
go build -o getcpu ./go/getcpu
go build -o user-stacktrace ./go/user-stacktrace
make[1]: Leaving directory '/home/user/tetragon/contrib/tester-progs'
docker rm tetragon-clang || true
docker run -v /home/user/tetragon:/tetragon:Z -u $(id -u) -e BPF_TARGET_ARCH=armake -C /tetragon/bpf -j4 1a80ffef8120b2ac99e802bbd31dee10f5f15a48566832ae0866f m
make: Entering directory '/tetragon/bpf'
make: Nothing to be done for 'all'.
make: Leaving directory '/tetragon/bpf'
docker rm tetragon-clang
tetragon-clang
go test -exec "sudo" -p 1 -parallel 1  -gcflags= -timeout 600m -failfast -cover ./pkg/sensors/tracing... 
ok      github.com/cilium/tetragon/pkg/sensors/tracing  3113.221s       coverage: 68.1% of statements

@anfedotoff
Copy link
Contributor Author

@mtardy, maybe let's test only ./pkg/sensors/tracing... for debug, what do you think?

@anfedotoff anfedotoff force-pushed the user-stacktrace branch 3 times, most recently from 2004ecb to 64aa981 Compare April 1, 2024 15:43
@mtardy mtardy marked this pull request as draft April 2, 2024 09:50
@mtardy
Copy link
Member

mtardy commented Apr 2, 2024

I converted to draft since it seems your added test is the issue on arm64, it seemed you could not reproduce this on arm64 machine, I'll try as well on my side and see if I can get any idea.

It was failing on actuated machines:

Linux c99d86049feb9df7b83f0f00632bb69a9d2e8edf 5.10.201 #1 SMP Thu Feb 15 11:00:52 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

We can ssh into them if really needed if we don't find any way to reproduce this since it seems to reliably fail there.

@anfedotoff
Copy link
Contributor Author

I converted to draft since it seems your added test is the issue on arm64, it seemed you could not reproduce this on arm64 machine, I'll try as well on my side and see if I can get any idea.

It was failing on actuated machines:

Linux c99d86049feb9df7b83f0f00632bb69a9d2e8edf 5.10.201 #1 SMP Thu Feb 15 11:00:52 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

We can ssh into them if really needed if we don't find any way to reproduce this since it seems to reliably fail there.

My ARM machine:

Linux aarch64-test 5.15.0-101-generic #111-Ubuntu SMP Wed Mar 6 18:01:01 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

@anfedotoff anfedotoff force-pushed the user-stacktrace branch 2 times, most recently from 3ae6621 to 68857a5 Compare April 2, 2024 13:29
@anfedotoff anfedotoff force-pushed the user-stacktrace branch 3 times, most recently from 314852f to d48f6e8 Compare April 3, 2024 13:34
@anfedotoff anfedotoff marked this pull request as ready for review April 3, 2024 17:32
return fmt.Sprintf("%s (%s+0x%x)", fsym.Name, fsym.Module, fsym.Offset)
}

// GetFnSymbol -- returns the FnSym for a given address and PID
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be documented that this function is limited to native code only (C, C++, Go and Rust). The given approach will not work for interpreted languages (Java, Python, Perl, etc), as it can not resolve the internal state of the interpreter.

Copy link
Member

@mtardy mtardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your perseverance! That's a cool addition to the stack trace feature!

@mtardy mtardy merged commit de827f6 into cilium:main Apr 8, 2024
40 of 41 checks passed
@lambdanis lambdanis removed the release-note/minor This PR introduces a minor user-visible change label Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement This improves or streamlines existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants