obs: integrate on-demand CPU profiling / Go execution tracing with CRDB tracing #97215
I've been thinking some more about your proposal and I'm bullish that we have almost all the parts to put something together. If we emitted the runtime traces as StructuredEvents for each of the spans subscribed to the profiler, we could have a client-side aggregator (example: #84043) bucket these by node and periodically flush them to disk. For queries, this could be somewhere on local storage, while jobs can insert a row into the new
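A minimal sketch of that bucketing-and-flushing idea, in case it helps make the proposal concrete. `NodeID`, `TracePayload`, and the aggregator itself are illustrative stand-ins rather than CRDB types, and the flush cadence and file naming are arbitrary:

```go
package traceagg

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

type NodeID int32

// TracePayload is a stand-in for a StructuredEvent carrying runtime-trace data.
type TracePayload struct {
	NodeID NodeID
	Data   []byte
}

// Aggregator buckets payloads by node and periodically flushes them to disk.
type Aggregator struct {
	mu      sync.Mutex
	buckets map[NodeID][][]byte
	dir     string
}

func New(dir string, flushEvery time.Duration) *Aggregator {
	a := &Aggregator{buckets: make(map[NodeID][][]byte), dir: dir}
	go func() {
		for range time.Tick(flushEvery) {
			if err := a.Flush(); err != nil {
				fmt.Fprintln(os.Stderr, "flush failed:", err)
			}
		}
	}()
	return a
}

// Add buckets a payload under the node it came from.
func (a *Aggregator) Add(p TracePayload) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.buckets[p.NodeID] = append(a.buckets[p.NodeID], p.Data)
}

// Flush writes each node's bucket to a per-node file and clears the bucket.
func (a *Aggregator) Flush() error {
	a.mu.Lock()
	defer a.mu.Unlock()
	for node, chunks := range a.buckets {
		path := filepath.Join(a.dir, fmt.Sprintf("n%d.%d.trace", node, time.Now().UnixNano()))
		f, err := os.Create(path)
		if err != nil {
			return err
		}
		for _, c := range chunks {
			if _, err := f.Write(c); err != nil {
				f.Close()
				return err
			}
		}
		if err := f.Close(); err != nil {
			return err
		}
		delete(a.buckets, node)
	}
	return nil
}
```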
If we can bound the size of profiles acquired by a local root span (i.e. transitively with its children) then that seems like a great first step. It would be preferable to not buffer these in memory/send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first as long as we avoid unbounded memory consumption (I think a few MB of budget would be fine, dropping profiles after that and putting information about that fact in the span. Not needed for POC obvs).
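For illustration, a tiny sketch of the bounded in-memory buffering suggested above; the budget value and types are assumptions rather than CRDB code. Profile chunks are kept up to a few MB and dropped afterwards, with the drops counted so the span can report them:

```go
package profilebuf

// budgetBytes caps in-memory profile data per local root span ("a few MB").
const budgetBytes = 4 << 20

type ProfileBuffer struct {
	used    int
	chunks  [][]byte
	dropped int // chunks dropped after the budget was exhausted
}

// Add buffers a profile chunk if it fits in the budget; otherwise the chunk
// is dropped and the drop is counted so the span can note that fact.
func (b *ProfileBuffer) Add(chunk []byte) (kept bool) {
	if b.used+len(chunk) > budgetBytes {
		b.dropped++
		return false
	}
	b.chunks = append(b.chunks, chunk)
	b.used += len(chunk)
	return true
}
```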
❤️
A second-long runtime trace of a single-node testserver is ~200KiB. That already blows past the size we allow a structured event to grow to, so emitting these traces as StructuredEvent payloads on the spans that subscribe to such runtime traces might be a non-starter. This leaves us with the model of flushing the traces to disk in a directory of our choosing and then sticking the directory into the span so it can make its way back to the node that is coordinating the query/job. I'm not sure if the job profile or statement bundle machinery should then reach out to all these directories and try stitching the traces together as a whole, or just leave it up to the user to open a trace visualizer for each outputted file.
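A rough sketch of that flush-to-disk model, with a hypothetical `SpanRecorder` standing in for however the span would carry the on-disk location back to the coordinator (this is not the CRDB tracing API):

```go
package traceflush

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// SpanRecorder is a stand-in for whatever the tracing span exposes to attach
// a small payload (here: the on-disk location of the runtime trace).
type SpanRecorder interface {
	RecordPath(path string)
}

// FlushTrace persists raw execution-trace bytes under dir and records the
// resulting path in the span instead of shipping the bytes over gRPC.
func FlushTrace(dir string, traceBytes []byte, sp SpanRecorder) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	path := filepath.Join(dir, fmt.Sprintf("runtimetrace.%s", time.Now().Format("2006-01-02T15_04_05")))
	if err := os.WriteFile(path, traceBytes, 0o644); err != nil {
		return err
	}
	sp.RecordPath(path)
	return nil
}
```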
I'm going to tentatively assign this to myself and put it on the DR board for planning purposes. I may or may not get this all the way to completion, but I'm keen to have something we can use to collect query and job runtime profiles + traces.
cc @cockroachdb/disaster-recovery
This change introduces a BackgroundProfiler service that is started during server startup on each node in the cluster. The BackgroundProfiler is responsible for collecting on-demand CPU profiles and runtime traces for a particular operation. The profiler can be subscribed to by an in-process listener. The first Subscriber initializes the collection of the CPU and execution-trace profiles. While the profiles are being collected, only Subscribers carrying the same `profileID` are allowed to join the running profiles. The profiles are stopped and persisted to local storage when the last Subscriber unsubscribes. The `profileID` is a unique identifier of the operation being traced. Since only one profile can be running in a process at a time, any Subscriber with a `profileID` different from the current one will be rejected.

The in-process listeners described above will be CRDB's internal tracing spans. This change introduces a `WithBackgroundProfiling` option that can be used to instruct a tracing span to subscribe to the BackgroundProfiler. This option is propagated to all local and remote child spans created as part of the trace. Only local root spans that have background profiling enabled will Subscribe to the profiler on creation. As mentioned above, only one operation can be profiled at a time. We use the first root span's `TraceID` as the BackgroundProfiler's `profileID`. All subsequent root spans that are part of the same trace will be able to join the running profile. Tracing spans unsubscribe from the profile on Finish().

Every Subscriber is returned a wrapped ctx with pprof labels that tie its execution to the profile being collected by the BackgroundProfiler. These labels are used to post-process the collected CPU profile and keep only the samples that correspond to our subscribers. The end result is a filtered CPU profile prefixed with `cpuprofiler.` and a process-wide execution trace prefixed with `runtimetrace.`, both persisted to local storage.

This change only introduces the infrastructure to enable on-demand profiling. The test in `profiler_test.go` results in a CPU profile with information about each labelled root operation collected on-demand:

    ❯ go tool pprof cpuprofiler.2023-03-08T14_51_52.402
    Type: cpu
    Time: Mar 8, 2023 at 9:51am (EST)
    Duration: 10.11s, Total samples = 8.57s (84.77%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) tags
     9171346462634118014: Total 8.6s
                          906.0ms (10.57%): op2
                          902.0ms (10.53%): op1
                          890.0ms (10.39%): op0
                          886.0ms (10.34%): op7
                          866.0ms (10.11%): op4
                          866.0ms (10.11%): op5
                          854.0ms ( 9.96%): op3
                          806.0ms ( 9.40%): op8
                          804.0ms ( 9.38%): op6
                          790.0ms ( 9.22%): op9

Execution traces do not surface pprof labels in Go yet, but a future patch could consider cherry-picking https://go-review.googlesource.com/c/go/+/446975, which allows the user to focus on goroutines run with the specified pprof labels.

With this framework in place one could envision the following use cases:

- stmt diagnostics requests get a new option to request profiling. When requested, any local root trace span (i.e. while any part of the trace is active on a given node) subscribes to profiles, and references to the profiles collected are stored as payloads in the span. They're then included in the stmt bundle.
- even outside of diagnostics, we could mark traces as wanting to capture debug info for "slow spans". Such spans could, on creation, set a timer that, once it fires, subscribes to (short) execution traces periodically as a way to snapshot the goroutine's actions. These could be referenced in the span for later retrieval.

Informs: cockroachdb#97215
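To make the subscription flow above easier to follow, here is a hedged sketch of how a BackgroundProfiler-like service could gate on a single `profileID` and hand out pprof-labelled contexts. The `BackgroundProfiler` type and its methods are illustrative; only the `runtime/pprof` and `runtime/trace` calls are standard library APIs, and the actual CRDB implementation may differ:

```go
package bgprofiler

import (
	"context"
	"fmt"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"sync"
)

type BackgroundProfiler struct {
	mu          sync.Mutex
	profileID   uint64 // TraceID of the first subscribing root span
	subscribers int
	cpuOut      *os.File
	traceOut    *os.File
}

// Subscribe starts the CPU profile and execution trace on the first call and
// returns a ctx labelled with the caller's operation name. Callers with a
// different profileID are rejected, since only one profile can run per process.
func (p *BackgroundProfiler) Subscribe(ctx context.Context, profileID uint64, op string) (context.Context, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.subscribers > 0 && p.profileID != profileID {
		return ctx, fmt.Errorf("profile %d already in progress", p.profileID)
	}
	if p.subscribers == 0 {
		var err error
		if p.cpuOut, err = os.CreateTemp("", "cpuprofiler.*"); err != nil {
			return ctx, err
		}
		if p.traceOut, err = os.CreateTemp("", "runtimetrace.*"); err != nil {
			return ctx, err
		}
		if err := pprof.StartCPUProfile(p.cpuOut); err != nil {
			return ctx, err
		}
		if err := trace.Start(p.traceOut); err != nil {
			pprof.StopCPUProfile()
			return ctx, err
		}
		p.profileID = profileID
	}
	p.subscribers++
	// Label the caller's goroutines so its samples can be filtered out of the
	// process-wide CPU profile later (as in the `(pprof) tags` output above).
	ctx = pprof.WithLabels(ctx, pprof.Labels(fmt.Sprint(profileID), op))
	pprof.SetGoroutineLabels(ctx)
	return ctx, nil
}

// Unsubscribe stops and persists the profiles once the last subscriber leaves.
func (p *BackgroundProfiler) Unsubscribe() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subscribers--
	if p.subscribers == 0 {
		trace.Stop()
		pprof.StopCPUProfile()
		p.traceOut.Close()
		p.cpuOut.Close()
	}
}
```

Filtering the resulting profile then amounts to selecting samples by the `profileID` tag, which is what the `(pprof) tags` output above is showing.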
Some more thoughts from a recent escalation: we could also more tightly integrate statement bundle requests with execution traces:
In another recent (and currently ongoing) escalation, we are seeing a demonstration of the power of execution traces. Starting from a trace that showed a slow span in DistSender, we were able to use execution traces to narrow this down to a slow (700ms) syscall reading off the TCP connection (with a similar analysis in progress on the server side, outcome yet unknown). This level of visibility is quite powerful and allows us to discriminate between issues in "our code" vs the OS/networking layer. This is more evidence that making it easier for customers to collect matching trace/exec-trace data is worth our while.
Is your feature request related to a problem? Please describe.
Traces can answer some questions today, but not others. In particular, unexpected delays between trace events (Go scheduler latency, lock contention, high CPU usage in a subsystem such as pebble, etc.) lead to time-consuming and often fruitless investigations. A "big hammer" that would allow us to get to the bottom of most of them would be invaluable.
Describe the solution you'd like
A big hammer whose centerpiece is automatic CPU and/or Go execution profiles (we suggest implementing both, since they're complementary and behave the same from an API point of view).
Only a single CPU profile/execution trace (henceforth called "profile" interchangeably for simplicity) can be active per node at any given point in time. We add a service on each node that multiplexes profile requests onto slots of duration <=1s: profiles are taken in 1s-or-shorter chunks, and multiple (local, in-process) listeners can subscribe to profiles for however long they wish, which means they'll be notified of any 1s chunks overlapping their tenure. No profiles are taken in the absence of listeners. Profiles get stored on node-local storage and have an identifier with which they can be retrieved at a later time. (There are retention policies, tenant access rules, and everything you need to make this work for UA, etc.; details TBD.)
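A sketch of what such a slot-multiplexing service could look like, under the assumption that `takeChunk` is whatever actually captures a <=1s CPU profile or execution trace chunk and persists it; none of these names are from CRDB:

```go
package profmux

import (
	"sync"
	"time"
)

const slotDuration = time.Second

// chunkRef points at a persisted profile chunk, e.g. a path on node-local storage.
type chunkRef string

type listener struct {
	ch chan chunkRef
}

// Muxer multiplexes profile requests onto <=1s slots: chunks are only taken
// while at least one listener is subscribed, and every listener whose
// subscription overlaps a chunk receives a reference to it.
type Muxer struct {
	mu        sync.Mutex
	listeners map[*listener]struct{}
	running   bool
	takeChunk func(time.Duration) chunkRef // captures one profile chunk of the given length
}

func New(takeChunk func(time.Duration) chunkRef) *Muxer {
	return &Muxer{listeners: make(map[*listener]struct{}), takeChunk: takeChunk}
}

// Subscribe registers a listener; the first listener starts the chunk loop.
// The returned cancel func ends the listener's tenure.
func (m *Muxer) Subscribe() (*listener, func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	l := &listener{ch: make(chan chunkRef, 16)}
	m.listeners[l] = struct{}{}
	if !m.running {
		m.running = true
		go m.loop()
	}
	cancel := func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.listeners, l)
	}
	return l, cancel
}

// loop takes chunks back to back while listeners exist and fans out references.
func (m *Muxer) loop() {
	for {
		m.mu.Lock()
		if len(m.listeners) == 0 {
			m.running = false
			m.mu.Unlock()
			return
		}
		m.mu.Unlock()

		ref := m.takeChunk(slotDuration) // blocks for up to one slot

		m.mu.Lock()
		for l := range m.listeners {
			select {
			case l.ch <- ref: // this listener's tenure overlapped the chunk
			default: // slow listener: drop rather than block the profiler
			}
		}
		m.mu.Unlock()
	}
}
```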
We can use this to enable the following use cases:
For a practical example where we'd have loved to have CPU profiles with traces, see https://github.com/cockroachlabs/support/issues/2033.
As for Go exec traces, they would be invaluable for the NUMA issues we've seen, and for many latency investigations that come with unexplainable gaps in the trace in non-CPU-intensive parts. There are likely scheduler latencies at play, and an execution trace would prove it. But we only have pprof-loop, which is hard to use and isn't included in the initial outreach, so we spend lots of time asking for (and often not getting) this additional info. Customers spend lots of time putting it together too.
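For reference, collecting the kind of Go execution trace discussed here requires nothing more than the standard `runtime/trace` package; a minimal, CRDB-independent example:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Capture a short window; in a real investigation this would bracket the
	// slow operation instead of a sleep.
	time.Sleep(2 * time.Second)
	trace.Stop()
}
```

Opening the result with `go tool trace exec.trace` surfaces scheduler latency, syscall blocking, and GC pauses directly.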
Describe alternatives you've considered
Can't really think of much
Additional context
Jira issue: CRDB-24848
Epic: CRDB-32402
Footnotes
Note that if the diagnostics request has a "min latency" bound, this might in effect turn on profiling continuously on some or all nodes in the cluster, so we need to validate that this is OK, perhaps by using suitable RBAC to gate it to admins only, or by only allowing CPU profiles with a lower profiling rate, etc. ↩
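For the "lower profiling rate" idea, the knob is `runtime.SetCPUProfileRate` called before starting the profile. A minimal sketch (the rate and duration are arbitrary; in my experience `pprof.StartCPUProfile` logs a warning because it also tries to set the default 100 Hz rate, while the lower rate set first remains in effect):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.SetCPUProfileRate(10) // 10 Hz instead of the default 100 Hz

	f, err := os.Create("lowrate.cpuprofile")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	time.Sleep(5 * time.Second) // profile window
}
```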