
obs: integrate on-demand CPU profiling / Go execution tracing with CRDB tracing #97215

Open
tbg opened this issue Feb 15, 2023 · 7 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-disaster-recovery

Comments

@tbg
Member

tbg commented Feb 15, 2023

Is your feature request related to a problem? Please describe.

Traces can answer some questions today, but not others. In particular, unexpected delays between trace events (Go scheduler latency, lock contention, high CPU usage in a subsystem such as Pebble, and so on) lead to time-consuming and often fruitless investigations. A "big hammer" that would allow us to get to the bottom of most of them would be invaluable.

Describe the solution you'd like

A big hammer whose centerpiece is automatic CPU profiles and/or Go execution traces (we suggest implementing both, since they're complementary and behave the same from an API point of view).

Only a single CPU profile/execution trace (henceforth both called "profile" for simplicity) can be active per node at any given point in time. We add a service on each node that multiplexes profile requests onto slots of duration <=1s, i.e. profiles are taken in 1s-or-shorter chunks, and multiple (local, in-process) listeners can subscribe to profiles for however long they wish, which means they get notified of any 1s chunk overlapping their tenure. No profiles are taken in the absence of listeners. Profiles are stored on node-local storage and get an identifier with which they can be retrieved at a later time. (There are retention policies, tenant access rules, and everything else needed to make this work for UA, etc.; details TBD.)
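To make the shape of this concrete, here is a minimal sketch in Go, using hypothetical names (`Profiler`, `Subscribe`, `Chunk`) rather than any existing CRDB API, of a per-node service that only profiles while subscribers exist and notifies every subscriber of each 1s chunk overlapping its tenure:

```go
// Hypothetical sketch of the per-node profile multiplexer described above.
package profiler

import (
	"bytes"
	"os"
	"path/filepath"
	"runtime/pprof"
	"sync"
	"time"
)

// Chunk identifies one <=1s CPU profile persisted to node-local storage.
type Chunk struct {
	Start time.Time
	Path  string
}

// Profiler multiplexes subscribers onto 1s profiling slots.
type Profiler struct {
	mu      sync.Mutex
	subs    map[int]chan Chunk
	nextID  int
	running bool
	dir     string
}

func New(dir string) *Profiler {
	return &Profiler{subs: map[int]chan Chunk{}, dir: dir}
}

// Subscribe registers a listener and returns the channel on which it receives
// chunk references, plus a func to unsubscribe.
func (p *Profiler) Subscribe() (<-chan Chunk, func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	id := p.nextID
	p.nextID++
	ch := make(chan Chunk, 16)
	p.subs[id] = ch
	if !p.running {
		p.running = true
		go p.loop() // start taking 1s chunks on the first subscriber
	}
	return ch, func() {
		p.mu.Lock()
		defer p.mu.Unlock()
		delete(p.subs, id)
		close(ch)
	}
}

// loop takes 1s CPU profile chunks for as long as there are subscribers.
func (p *Profiler) loop() {
	for {
		p.mu.Lock()
		if len(p.subs) == 0 {
			p.running = false
			p.mu.Unlock()
			return
		}
		p.mu.Unlock()

		var buf bytes.Buffer
		start := time.Now()
		if err := pprof.StartCPUProfile(&buf); err != nil {
			// Some other CPU profile is already running in this process; a
			// real implementation would surface this to subscribers.
			time.Sleep(time.Second)
			continue
		}
		time.Sleep(time.Second)
		pprof.StopCPUProfile()

		c := Chunk{Start: start}
		c.Path = filepath.Join(p.dir, "cpuprof."+start.Format("2006-01-02T15_04_05.000"))
		_ = os.WriteFile(c.Path, buf.Bytes(), 0o644)

		p.mu.Lock()
		for _, ch := range p.subs {
			select { // never block the profiling loop on a slow subscriber
			case ch <- c:
			default:
			}
		}
		p.mu.Unlock()
	}
}
```

A real implementation would also need retention/GC of the chunk files, the tenant-access rules mentioned above, and the execution-trace counterpart of the CPU-profile loop.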

We can use this to enable the following use cases:

  • stmt diagnostics requests get a new option to request profiling. When requested, any local root trace span (i.e. while any part of the trace is active on a given node) subscribes to profiles, and references to the profiles collected are stored as payloads in the span. They're then included in the stmt bundle.1
  • even outside of diagnostics, we could mark traces as wanting to capture debug info for "slow spans". Such spans on creation could set a timer that, once it fires, subscribes to (short) execution traces periodically as a way to snapshot the goroutine's actions; see the sketch after this list. These could be referenced in the span for later retrieval, dumped into logs (similar to heap_profiler), whatever feels right. (cc @dt, @adityamaru)
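As a rough illustration of the second bullet, a slow-span watcher could look like the following sketch (the `Watcher` type and its integration point are hypothetical, not an existing CRDB API):

```go
// Hypothetical sketch of the "slow span" idea: once a span has been open
// longer than a threshold, periodically capture short execution traces and
// remember where they were written so the span can reference them later.
package slowspan

import (
	"bytes"
	"os"
	"runtime/trace"
	"sync"
	"time"
)

// Watcher would be attached to a local root span at creation time.
type Watcher struct {
	mu    sync.Mutex
	paths []string // exec-trace files to be referenced by the span
	stop  chan struct{}
	timer *time.Timer
}

// Watch arms a timer; nothing happens unless the span outlives the threshold.
func Watch(threshold time.Duration) *Watcher {
	w := &Watcher{stop: make(chan struct{})}
	w.timer = time.AfterFunc(threshold, w.snapshotLoop)
	return w
}

// snapshotLoop grabs a short (100ms) execution trace every few seconds until
// Finish is called.
func (w *Watcher) snapshotLoop() {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		var buf bytes.Buffer
		// trace.Start fails if an execution trace is already running
		// process-wide; in that case we simply skip this snapshot.
		if err := trace.Start(&buf); err == nil {
			time.Sleep(100 * time.Millisecond)
			trace.Stop()
			if f, err := os.CreateTemp("", "slowspan-exectrace-*.out"); err == nil {
				_, _ = f.Write(buf.Bytes())
				_ = f.Close()
				w.mu.Lock()
				w.paths = append(w.paths, f.Name())
				w.mu.Unlock()
			}
		}
		select {
		case <-w.stop:
			return
		case <-ticker.C:
		}
	}
}

// Finish stops the watcher (call it exactly once, from the span's Finish) and
// returns the collected trace file paths to record as payloads on the span.
func (w *Watcher) Finish() []string {
	w.timer.Stop()
	close(w.stop)
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]string(nil), w.paths...)
}
```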

For a practical example where we'd have loved to have CPU profiles alongside traces, see https://github.com/cockroachlabs/support/issues/2033.
Go execution traces would be invaluable for the NUMA issues we've seen, and for the many latency investigations that come with unexplainable gaps in the trace in non-CPU-intensive parts; there are likely scheduler latencies at play, and an execution trace would prove it. But today we only have pprof-loop, which is hard to use and isn't included in the initial outreach, so we spend lots of time asking for (and often not getting) this additional info. Customers spend lots of time putting it together, too.

Describe alternatives you've considered

Can't really think of much

Additional context

Jira issue: CRDB-24848

Epic CRDB-32402

Footnotes

  1. note that if the diagnostics request has a "min latency" bound, this might in effect turn on profiling continuously on some or all nodes in the cluster, so we need to validate that this is OK, perhaps by using suitable RBAC to gate it to admins only, or by only allowing CPU profiles at a lower profiling rate, etc.

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Feb 15, 2023
@adityamaru
Contributor

adityamaru commented Feb 26, 2023

I've been thinking some more about your proposal and I'm bullish that we have almost all the parts to put something together. If we emitted the runtime traces as StructuredEvents for each of the spans subscribed to the profiler, we could have a client-side aggregator (example: #84043) bucket these by node, and periodically flush them to disk. For queries, this could be somewhere on local storage, while jobs can insert a row into the new system.job_info table. These rows can then be retrieved+processed by whatever tool we decide to build for jobs observability. This is in line with how we plan to aggregate trace-driven statistics for jobs (example: #96968). I'll try to get a prototype for this going so that we can discover the pitfalls.

@tbg
Member Author

tbg commented Feb 27, 2023

If we emitted the runtime traces as StructuredEvents

If we can bound the size of profiles acquired by a local root span (i.e. transitively with its children), then that seems like a great first step. It would be preferable to not buffer these in memory or send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first, as long as we avoid unbounded memory consumption (I think a budget of a few MB would be fine, dropping profiles after that and noting that fact in the span; not needed for the POC, obviously).
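For illustration, the budget could be as simple as the following sketch (hypothetical names; the actual span/payload plumbing is not shown):

```go
// Hypothetical per-span byte budget for profile payloads.
package spanbudget

import "sync"

// ProfileBuffer accumulates profile payloads for a local root span up to a
// fixed budget (e.g. a few MB); anything past the budget is dropped and the
// drop is recorded so the span can report it.
type ProfileBuffer struct {
	mu       sync.Mutex
	budget   int
	payloads [][]byte
	dropped  int
}

func NewProfileBuffer(budgetBytes int) *ProfileBuffer {
	return &ProfileBuffer{budget: budgetBytes}
}

// Add either retains the payload or counts it as dropped once the budget is
// exhausted. It reports whether the payload was kept.
func (b *ProfileBuffer) Add(p []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(p) > b.budget {
		b.dropped++
		return false
	}
	b.budget -= len(p)
	b.payloads = append(b.payloads, p)
	return true
}

// Dropped returns how many profiles were discarded; the span can surface this
// count in place of the missing payloads.
func (b *ProfileBuffer) Dropped() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.dropped
}
```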

I'll try to get a prototype for this going so that we can discover the pitfalls.

❤️

@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Feb 27, 2023
@adityamaru
Contributor

It would be preferable to not buffer these in memory/send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first as long as we avoid unbounded memory consumption

A second-long runtime trace of a single-node test server is ~200KiB. That already blows past the size we allow a structured event to grow to, so emitting these traces as StructuredEvent payloads on the spans that subscribe to such runtime traces might be a non-starter.

That leaves us with the model of flushing the traces to disk in a directory of our choosing and then sticking the directory path into the span so it makes its way back to the node coordinating the query/job. I'm not sure whether the job profile or statement bundle machinery should then reach out to all these directories and try stitching the traces together, or just leave it to the user to open a trace visualizer for each output file.
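A rough sketch of that model, assuming the span only carries a small reference (node, path) rather than the trace bytes themselves (all names here are hypothetical):

```go
// Hypothetical flush-to-disk helper: each node writes its runtime trace
// chunks under a well-known directory and only a small reference travels
// back in the span.
package traceref

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// Ref is the small payload stored on the span instead of the trace bytes.
type Ref struct {
	NodeID     int
	Path       string
	CapturedAt time.Time
}

// Flush writes one runtime trace chunk for the given trace to local storage
// and returns the reference to record on the span.
func Flush(dir string, nodeID int, traceID uint64, chunk []byte) (Ref, error) {
	ts := time.Now()
	name := fmt.Sprintf("runtimetrace.n%d.t%d.%s.out", nodeID, traceID, ts.Format("2006-01-02T15_04_05"))
	path := filepath.Join(dir, name)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return Ref{}, err
	}
	if err := os.WriteFile(path, chunk, 0o644); err != nil {
		return Ref{}, err
	}
	return Ref{NodeID: nodeID, Path: path, CapturedAt: ts}, nil
}
```

The coordinator could then either fetch each referenced file when assembling the bundle, or simply list the references and let the user open each file in `go tool trace`.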

@adityamaru
Contributor

I'm going to tentatively assign this to myself and put it on the DR board for planning purposes. I may or may not get this all the way to completion but I'm keen to have something we can use to collect query and job runtime profiles + traces.

@blathers-crl

blathers-crl bot commented Mar 7, 2023

cc @cockroachdb/disaster-recovery

adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 8, 2023
This change introduces a BackgroundProfiler service that is started
during server startup on each node in the cluster. The BackgroundProfiler
is responsible for collecting on-demand CPU profiles and runtime traces
for a particular operation. The profiler can be subscribed to by an in-process
listener. The first Subscriber initializes the collection of the CPU and execution
trace profiles. While the profiles are being collected, only Subscribers carrying
the same `profileID` are allowed to join the running profiles. The profiles
are stopped and persisted to local storage when the last Subscriber
unsubscribes. The `profileID` is a unique identifier of the operation that is
being traced. Since only one profile can be running in a process at a time,
any Subscribers with different `profileID`s than the current one will be rejected.

The in-process listeners described above will be CRDB's internal tracing
spans. This change introduces a `WithBackgroundProfiling` option that can be used
to instruct a tracing span to subscribe to the BackgroundProfiler. This option
is propagated to all local and remote child spans created as part of the trace.
Only local, root spans that have background profiling enabled will Subscribe to
the profiler on creation. As mentioned above, only one operation can be profiled
at a time. We use the first root span's `TraceID` as the BackgroundProfiler's `profileID`.
All subsequent root spans that are part of the same trace will be able to join
the running profile. Tracing spans unsubscribe from the profile on Finish().

Every Subscriber is returned a wrapped ctx with pprof labels that tie its execution
to the profile being collected by the BackgroundProfiler. These labels are used
to post-process the collected CPU profile and keep only the samples that correspond
to our subscribers. The end result is a filtered CPU profile prefixed `cpuprofiler.`
and a process-wide execution trace prefixed `runtimetrace.`, both persisted to local storage.

This change only introduces the infrastructure to enable on-demand profiling.
The test in `profiler_test.go` results in a CPU profile with information about
each labelled root operation collected on-demand:

❯ go tool pprof cpuprofiler.2023-03-08T14_51_52.402
Type: cpu
Time: Mar 8, 2023 at 9:51am (EST)
Duration: 10.11s, Total samples = 8.57s (84.77%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) tags
 9171346462634118014: Total 8.6s
                      906.0ms (10.57%): op2
                      902.0ms (10.53%): op1
                      890.0ms (10.39%): op0
                      886.0ms (10.34%): op7
                      866.0ms (10.11%): op4
                      866.0ms (10.11%): op5
                      854.0ms ( 9.96%): op3
                      806.0ms ( 9.40%): op8
                      804.0ms ( 9.38%): op6
                      790.0ms ( 9.22%): op9

Execution traces do not surface pprof labels in Go yet, but a future
patch could consider cherry-picking https://go-review.googlesource.com/c/go/+/446975,
which allows the user to focus on the goroutines run with the specified pprof labels.

With this framework in place one could envision the following use cases:

- stmt diagnostics requests get a new option to request profiling.
When requested, any local root trace span (i.e. while any part of the trace is
active on a given node) subscribes to profiles, and references to the profiles collected
are stored as payloads in the span. They're then included in the stmt bundle.

- even outside of diagnostics, we could mark traces as wanting to capture debug info
for "slow spans". Such spans on creation could set a timer that, once it fires,
subscribes to (short) execution traces periodically as a way to snapshot the goroutine's
actions. These could be referenced in the span for later retrieval.

Informs: cockroachdb#97215
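For reference, the pprof-label mechanism the commit message relies on boils down to something like the following sketch (the label keys and operation names are illustrative, not the actual CRDB plumbing):

```go
// Minimal demonstration of attributing a subscriber's work to a profile via
// pprof labels, so the node-wide CPU profile can later be filtered down to
// just those samples.
package main

import (
	"context"
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpuprofile.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// Each subscriber would run its work under labels carrying the profileID
	// (here a placeholder value) and an operation name.
	labels := pprof.Labels("profile_id", "12345", "op", "op1")
	pprof.Do(context.Background(), labels, func(ctx context.Context) {
		burnCPU(500 * time.Millisecond)
	})
}

// burnCPU spins so the profile has attributable samples.
func burnCPU(d time.Duration) {
	deadline := time.Now().Add(d)
	x := 0
	for time.Now().Before(deadline) {
		x++
	}
	fmt.Println(x)
}
```

`go tool pprof`'s `tags` command (as in the output above) or its `-tagfocus` flag can then restrict the profile to those labels, which is essentially what the post-processing step does.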
@tbg
Member Author

tbg commented Jun 5, 2023

Some more thoughts from a recent escalation:

we could also more tightly integrate statement bundle requests with execution traces:

  • the user gets two additional checkboxes in the stmt bundle request modal (I don't think both can be active at the same time, but an exec trace contains a CPU profile afaik, so we can offer the checkboxes separately)
    • ☒ collect cluster-wide execution trace
    • ☒ collect cluster-wide CPU profile
    • appropriate warnings etc. about perf impact.
  • we enable X (where X = CPU profiling or execution tracing) whenever a span for the trace is active on a node, i.e. we take the profile/trace for the duration of the span.
  • the stmt bundle contains all of these captures, named with time bounds and node of origin.
  • ideally we're also using execution tracer "regions"1 to highlight the goroutines belonging to the spans (a minimal sketch follows below the footnote).

Footnotes

  1. https://pkg.go.dev/runtime/trace#Region
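A minimal sketch of the regions idea (the task/region names are illustrative): wrap the work done under a tracing span in a runtime/trace task and region, so that `go tool trace` can highlight exactly the goroutines belonging to that span.

```go
// Small demonstration of runtime/trace tasks and regions.
package main

import (
	"context"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("exectrace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// One task per (hypothetical) tracing span; regions mark the interesting
	// phases inside it. Both show up in the trace viewer's task/region views.
	ctx, task := trace.NewTask(context.Background(), "distsender.send")
	defer task.End()

	trace.WithRegion(ctx, "read-tcp-conn", func() {
		time.Sleep(50 * time.Millisecond) // stand-in for the actual work
	})
}
```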

@tbg
Member Author

tbg commented Jun 7, 2023

In another recent (and currently ongoing) escalation12, we are seeing a demonstration of the power of execution traces. Starting from a trace that showed a slow span in DistSender, we were able to use execution traces to narrow this down to a slow (700ms) syscall reading off the TCP connection (with a similar analysis in progress on the server side; outcome yet unknown). This level of visibility is quite powerful and allows us to discriminate between issues in "our code" and the OS/networking layer. It is more evidence that making it easier for customers to collect matching trace/exec-trace data is worth our while.

Footnotes

  1. https://github.com/cockroachlabs/support/issues/2351

  2. https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1685818830723099
