Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pprofui: add support for collecting goroutines with labels #105916

Merged
merged 1 commit into from
Jul 7, 2023

Conversation

adityamaru
Copy link
Contributor

This change refactors some of the logic in the pprofui/. Additionally, we add support for collecting cluster-wide goroutine profiles.

These goroutine profiles can be collected with or without labels. When collected without labels we request pprof profiles with debug=0, which generates Profile protobufs on every node and pprof.Merges them before redirecting the client to a flamegraph.

When collected with labels we request pprof profiles with debug=1, which generates a legacy text format with comments translating addresses to function names and line numbers, so that a programmer can read the profile without tools. These profiles cannot be pprof.Merged and so we manually stitch them together on a per-node basis before downloading them for the client as txt file.

This change also adds support for filtering the aforementioned goroutines by pprof label. This will be used by the Jobs Profiler to collect cluster-wide stacks relevant to the running job.

Informs: #105440
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru force-pushed the collect-job-stacks branch 2 times, most recently from d3b6b78 to c618d44 Compare July 1, 2023 00:01
@adityamaru
Copy link
Contributor Author

Tested on backup/2TB/n10cpu4 and collected a cluster-wide stack:

http://adityamaru-1688173202-01-n10cpu4-0001.roachprod.crdb.io:26258/debug/pprof/ui/goroutine/?node=all&labelfilter=IMPORT

goroutine_6.txt

@adityamaru adityamaru requested review from dt and tbg July 1, 2023 01:25
@adityamaru adityamaru marked this pull request as ready for review July 1, 2023 01:25
@adityamaru adityamaru requested review from a team as code owners July 1, 2023 01:25
@adityamaru adityamaru requested a review from a team July 1, 2023 01:25
@adityamaru adityamaru requested a review from a team as a code owner July 1, 2023 01:25
@adityamaru adityamaru requested review from Santamaura and removed request for a team July 1, 2023 01:25
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I went a little nitty with the review just because I'm noticing that some practices we have "dialed in" on the replication team could be useful here too and I wanted to proselytise a little. Concretely, this could've been split up into a few commits where the logic changes are isolated and much easier to review. (As is, the review is a tad shallow, though I did make an effort at "really" reviewing the code). Please don't change it now, that's just busy work. I see this as an investment in the future - small commit style can be annoying at first, but in my opinion it's really worth it in that reviewers can much more easily engage with changes at a deeper level.


This is a great improvement, thanks for putting in the work! The main reason I'm requesting changes is that I feel strongly that non-fatal errors (missing nodes, etc) should be propagated back with the JSONResponse. I often look at clusters where we have partial data only (recently had one where it took >20 seconds just to render a 404...), and not knowing when you do can be difficult to deal with.

Other than that, mostly nits. If you haven't already, please give all of the profiles a smoke test. With changes like these, I can never quite be convinced that they work from code review alone, no matter how hard I look.

pkg/cli/zip_per_node.go Outdated Show resolved Hide resolved
pkg/server/debug/pprofui/server.go Outdated Show resolved Hide resolved
pkg/server/debug/pprofui/server.go Outdated Show resolved Hide resolved
pkg/server/debug/pprofui/server.go Outdated Show resolved Hide resolved
pkg/server/debug/pprofui/server.go Outdated Show resolved Hide resolved
pkg/server/status_local_file_retrieval.go Outdated Show resolved Hide resolved
pkg/server/status_local_file_retrieval.go Outdated Show resolved Hide resolved
var debug int
buf := bytes.NewBuffer(nil)
if req.WithLabels {
debug = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why the local debug and not just req.WithLabels everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow this comment. The debug int is used in the WriteTo method call below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was to s/debug/req.WithLabels/g. There's just one use of the debug local. Easier to read, for me at least, if the indirection is avoided. This is minor, so your call!

pkg/server/status_local_file_retrieval.go Outdated Show resolved Hide resolved
pkg/server/status_local_file_retrieval.go Outdated Show resolved Hide resolved
@adityamaru adityamaru force-pushed the collect-job-stacks branch from c618d44 to b25619f Compare July 5, 2023 02:58
@adityamaru adityamaru requested a review from tbg July 5, 2023 03:17
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 looking good, thanks again! Looking at the label parsing again, it seems perhaps a little brittle, would appreciate if you could look that over once more and maybe add a more adversarial test using funky label values.

CI complains about ./dev gen.

var debug int
buf := bytes.NewBuffer(nil)
if req.WithLabels {
debug = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was to s/debug/req.WithLabels/g. There's just one use of the debug local. Easier to read, for me at least, if the indirection is avoided. This is minor, so your call!

pkg/server/debug/pprofui/server.go Show resolved Hide resolved
pkg/server/debug/pprofui/server.go Show resolved Hide resolved
@adityamaru adityamaru force-pushed the collect-job-stacks branch from b25619f to fd32a93 Compare July 6, 2023 13:48
This change refactors some of the logic in the pprofui/.
Additionally, we add support for collecting cluster-wide
goroutine profiles.

These goroutine profiles can be collected with or without labels.
When collected without labels we request pprof profiles
with debug=0, which generates Profile protobufs on every
node and `pprof.Merge`s them before redirecting the client
to a flamegraph.

When collected with labels we request pprof profiles with
debug=1, which generates a legacy text format with comments
translating addresses to function names and line numbers,
so that a programmer can read the profile without tools.
These profiles cannot be `pprof.Merged` and so we manually
stitch them together on a per-node basis before downloading
them for the client as txt file.

This change also adds support for filtering the aforementioned
goroutines by pprof label. This will be used by the Jobs Profiler
to collect cluster-wide stacks relevant to the running job.

Informs: cockroachdb#105440
Release note: None
@adityamaru adityamaru force-pushed the collect-job-stacks branch from fd32a93 to b38951a Compare July 6, 2023 14:45
@adityamaru
Copy link
Contributor Author

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Jul 7, 2023

Build succeeded:

@craig craig bot merged commit 5aeb360 into cockroachdb:master Jul 7, 2023
@adityamaru adityamaru deleted the collect-job-stacks branch July 10, 2023 12:43
adityamaru added a commit to adityamaru/cockroach that referenced this pull request Jul 12, 2023
This change collect cluster-wide goroutines that have
a pprof label tying it to the particular job's execution,
whose job execution details have been requested. This relies
on the support added to the pprofui server to collect cluster-wide,
labelled goroutines in cockroachdb#105916.

Informs: cockroachdb#105076
Release note: None
craig bot pushed a commit that referenced this pull request Jul 12, 2023
106654: sql,server: support collecting labelled goroutines in the job profiler r=dt a=adityamaru

This change collect cluster-wide goroutines that have a pprof label tying it to the particular job's execution, whose job execution details have been requested. This relies on the support added to the pprofui server to collect cluster-wide, labelled goroutines in #105916.

Informs: #105076
Release note: None

Co-authored-by: adityamaru <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants