Surface unwind failure reasons in status page #2649
Each time we return from the eBPF program without a stack, bump a counter describing the reason for doing so. In user space, every time through the profiling loop, slurp these counters and surface them in the status page. My hope is that this will be a first step towards helping us debug why some processes are not being unwound in some user environments.
pkg/profiler/cpu/bpf/maps/maps.go
Outdated
		return nil, err
	}
	var reasons profiler.UnwindFailedReasons
	if err := binary.Read(bytes.NewBuffer(val), m.byteOrder, &reasons.PcNotCovered); err != nil {
I'd just create one reader and use it for all calls to binary.Read. Also I'd probably use reflection in a loop to set the values instead of hand writing each one.
I didn't know Go supported reflection. That is definitely way nicer than what I have here (and also than the string generation stuff in main.go). Will restructure.
@@ -35,6 +35,45 @@ type StackID struct {
	TID PID
}

// TODO[btv]
I would think Go should be the source of truth and go generate should be used to generate a header. But others have probably trodden this path, I wonder what cilium/ebpf does.
u32 chunk_not_found;
u32 null_unwind_table;
u32 table_not_found;
u32 rbp_failed;
This name probably shouldn't be architecture specific.
I wanted to keep things as close as possible to the log messages / comments in the unwinder. We can clean up all of them at the same time in a separate PR, I think.
Loooooooove what this will enable!
@@ -195,6 +196,12 @@ func (p *CPU) ProcessLastErrors() map[int]error {
	return p.processLastErrors
}

func (p *CPU) FailedReasons() map[int]profiler.UnwindFailedReasons {
maybe just a quick comment that the key is the PID?
cmd/parca-agent/main.go
Outdated
@@ -1032,6 +1036,25 @@ func run(logger log.Logger, reg *prometheus.Registry, flags flags, numCPU int) e
	default:
		profilingStatus = profilerStatusInactive
	}
	failedReasons := unwindFailedReasons[prflr.Name()][pid]
	failedReasonsStrs := make([]string, 0)
	v := reflect.ValueOf(failedReasons)
I think the idiomatic thing in Go here is to list all fields individually. I realize what you're trying to do here, but I think most people reading this would stumble over it.
To clarify, are you suggesting I do it like I had in a previous commit (see here)? Or are you suggesting something else?
I'm a Go novice, so I'm happy to do whatever you think is the most idiomatic.
Yeah I think that would be the typical way to write this in Go.
pkg/profiler/cpu/bpf/maps/maps.go
Outdated
v := reflect.ValueOf(reasons)
for i := 0; i < v.Elem().NumField(); i++ {
	fv := v.Elem().Field(i)
	if err := binary.Read(buf, m.byteOrder, fv.Addr().Interface()); err != nil {
You can just call a single binary.Read on the whole struct; no need for reflection here.
Each time we return from the eBPF program without a stack, bump a counter describing the reason for doing so. In user space, every time through the profiling loop, slurp these counters and surface them in the status page.
My hope is that this will be a first step towards helping us debug why some processes are not being unwound in some user environments.
Why?
We are repeatedly hearing from users that some processes are not being profiled in various situations and they don't know why. It's a bit hard to figure this out from just the /metrics output, because that is not broken down per process.
Test Plan
Tested by looking at the status page locally