obs: integrate on-demand CPU profiling / Go execution tracing with CRDB tracing #97215
I've been thinking some more about your proposal and I'm bullish that we have almost all the parts to put something together. If we emitted the runtime traces as StructuredEvents for each of the spans subscribed to the profiler, we could have a client-side aggregator (example: #84043) bucket these by node and periodically flush them to disk. For queries, this could be somewhere on local storage, while jobs can insert a row into the new
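A minimal sketch of that bucketing-and-flushing idea, in case it helps make the proposal concrete. `NodeID`, `TracePayload`, and the aggregator itself are illustrative stand-ins rather than CRDB types, and the flush cadence and file naming are arbitrary:

```go
package traceagg

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

type NodeID int32

// TracePayload is a stand-in for a StructuredEvent carrying runtime-trace data.
type TracePayload struct {
	NodeID NodeID
	Data   []byte
}

// Aggregator buckets payloads by node and periodically flushes them to disk.
type Aggregator struct {
	mu      sync.Mutex
	buckets map[NodeID][][]byte
	dir     string
}

func New(dir string, flushEvery time.Duration) *Aggregator {
	a := &Aggregator{buckets: make(map[NodeID][][]byte), dir: dir}
	go func() {
		for range time.Tick(flushEvery) {
			if err := a.Flush(); err != nil {
				fmt.Fprintln(os.Stderr, "flush failed:", err)
			}
		}
	}()
	return a
}

// Add buckets a payload under the node it came from.
func (a *Aggregator) Add(p TracePayload) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.buckets[p.NodeID] = append(a.buckets[p.NodeID], p.Data)
}

// Flush writes each node's bucket to a per-node file and clears the bucket.
func (a *Aggregator) Flush() error {
	a.mu.Lock()
	defer a.mu.Unlock()
	for node, chunks := range a.buckets {
		path := filepath.Join(a.dir, fmt.Sprintf("n%d.%d.trace", node, time.Now().UnixNano()))
		f, err := os.Create(path)
		if err != nil {
			return err
		}
		for _, c := range chunks {
			if _, err := f.Write(c); err != nil {
				f.Close()
				return err
			}
		}
		if err := f.Close(); err != nil {
			return err
		}
		delete(a.buckets, node)
	}
	return nil
}
```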
If we can bound the size of profiles acquired by a local root span (i.e. transitively with its children) then that seems like a great first step. It would be preferable to not buffer these in memory/send them around via gRPC trace responses at all, but this is a limitation we should be able to live with at first as long as we avoid unbounded memory consumption (I think a few MB of budget would be fine, dropping profiles after that and putting information about that fact in the span. Not needed for POC obvs).
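For illustration, a tiny sketch of the bounded in-memory buffering suggested above; the budget value and types are assumptions rather than CRDB code. Profile chunks are kept up to a few MB and dropped afterwards, with the drops counted so the span can report them:

```go
package profilebuf

// budgetBytes caps in-memory profile data per local root span ("a few MB").
const budgetBytes = 4 << 20

type ProfileBuffer struct {
	used    int
	chunks  [][]byte
	dropped int // chunks dropped after the budget was exhausted
}

// Add buffers a profile chunk if it fits in the budget; otherwise the chunk
// is dropped and the drop is counted so the span can note that fact.
func (b *ProfileBuffer) Add(chunk []byte) (kept bool) {
	if b.used+len(chunk) > budgetBytes {
		b.dropped++
		return false
	}
	b.chunks = append(b.chunks, chunk)
	b.used += len(chunk)
	return true
}
```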
❤️
A second-long runtime trace of a single-node testserver is ~200KiB. That already blows past the size we allow a structured event to grow to, so emitting these traces as StructuredEvent payloads on the spans that subscribe to such runtime traces might be a non-starter. This leaves us with the model of flushing the traces to disk in a directory of our choosing and then sticking the directory into the span so it can make its way back to the node that is coordinating the query/job. I'm not sure if the job profile or statement bundle machinery should then reach out to all these directories and try stitching the traces together as a whole, or just leave it up to the user to open a trace visualizer for each outputted file.
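A rough sketch of that flush-to-disk model, with a hypothetical `SpanRecorder` standing in for however the span would carry the on-disk location back to the coordinator (this is not the CRDB tracing API):

```go
package traceflush

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// SpanRecorder is a stand-in for whatever the tracing span exposes to attach
// a small payload (here: the on-disk location of the runtime trace).
type SpanRecorder interface {
	RecordPath(path string)
}

// FlushTrace persists raw execution-trace bytes under dir and records the
// resulting path in the span instead of shipping the bytes over gRPC.
func FlushTrace(dir string, traceBytes []byte, sp SpanRecorder) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	path := filepath.Join(dir, fmt.Sprintf("runtimetrace.%s", time.Now().Format("2006-01-02T15_04_05")))
	if err := os.WriteFile(path, traceBytes, 0o644); err != nil {
		return err
	}
	sp.RecordPath(path)
	return nil
}
```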
I'm going to tentatively assign this to myself and put it on the DR board for planning purposes. I may or may not get this all the way to completion, but I'm keen to have something we can use to collect query and job runtime profiles + traces.
cc @cockroachdb/disaster-recovery
This change introduces a BackgroundProfiler service that is started during server startup on each node in the cluster. The BackgroundProfiler is responsible for collecting on-demand CPU profiles and runtime traces for a particular operation. The profiler can be subscribed to by an in-process listener. The first Subscriber initializes the collection of the CPU and execution-trace profiles. While the profiles are being collected, only Subscribers carrying the same `profileID` are allowed to join the running profiles. The profiles are stopped and persisted to local storage when the last Subscriber unsubscribes. The `profileID` is a unique identifier of the operation being traced. Since only one profile can be running in a process at a time, any Subscriber with a `profileID` different from the current one will be rejected.

The in-process listeners described above will be CRDB's internal tracing spans. This change introduces a `WithBackgroundProfiling` option that can be used to instruct a tracing span to subscribe to the BackgroundProfiler. This option is propagated to all local and remote child spans created as part of the trace. Only local root spans that have background profiling enabled will Subscribe to the profiler on creation. As mentioned above, only one operation can be profiled at a time. We use the first root span's `TraceID` as the BackgroundProfiler's `profileID`. All subsequent root spans that are part of the same trace will be able to join the running profile. Tracing spans unsubscribe from the profile on Finish().

Every Subscriber is returned a wrapped ctx with pprof labels that tie its execution to the profile being collected by the BackgroundProfiler. These labels are used to post-process the collected CPU profile and keep only the samples that correspond to our subscribers. The end result is a filtered CPU profile prefixed with `cpuprofiler.` and a process-wide execution trace prefixed with `runtimetrace.`, both persisted to local storage.

This change only introduces the infrastructure to enable on-demand profiling. The test in `profiler_test.go` results in a CPU profile with information about each labelled root operation collected on-demand:

    ❯ go tool pprof cpuprofiler.2023-03-08T14_51_52.402
    Type: cpu
    Time: Mar 8, 2023 at 9:51am (EST)
    Duration: 10.11s, Total samples = 8.57s (84.77%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) tags
     9171346462634118014: Total 8.6s
                          906.0ms (10.57%): op2
                          902.0ms (10.53%): op1
                          890.0ms (10.39%): op0
                          886.0ms (10.34%): op7
                          866.0ms (10.11%): op4
                          866.0ms (10.11%): op5
                          854.0ms ( 9.96%): op3
                          806.0ms ( 9.40%): op8
                          804.0ms ( 9.38%): op6
                          790.0ms ( 9.22%): op9

Execution traces do not surface pprof labels in Go yet, but a future patch could consider cherry-picking https://go-review.googlesource.com/c/go/+/446975, which allows the user to focus on goroutines run with the specified pprof labels.

With this framework in place one could envision the following use cases:

- stmt diagnostics requests get a new option to request profiling. When requested, any local root trace span (i.e. while any part of the trace is active on a given node) subscribes to profiles, and references to the profiles collected are stored as payloads in the span. They're then included in the stmt bundle.
- even outside of diagnostics, we could mark traces as wanting to capture debug info for "slow spans". Such spans could, on creation, set a timer that, once it fires, subscribes to (short) execution traces periodically as a way to snapshot the goroutine's actions. These could be referenced in the span for later retrieval.

Informs: cockroachdb#97215
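To make the subscription flow above easier to follow, here is a hedged sketch of how a BackgroundProfiler-like service could gate on a single `profileID` and hand out pprof-labelled contexts. The `BackgroundProfiler` type and its methods are illustrative; only the `runtime/pprof` and `runtime/trace` calls are standard library APIs, and the actual CRDB implementation may differ:

```go
package bgprofiler

import (
	"context"
	"fmt"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"sync"
)

type BackgroundProfiler struct {
	mu          sync.Mutex
	profileID   uint64 // TraceID of the first subscribing root span
	subscribers int
	cpuOut      *os.File
	traceOut    *os.File
}

// Subscribe starts the CPU profile and execution trace on the first call and
// returns a ctx labelled with the caller's operation name. Callers with a
// different profileID are rejected, since only one profile can run per process.
func (p *BackgroundProfiler) Subscribe(ctx context.Context, profileID uint64, op string) (context.Context, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.subscribers > 0 && p.profileID != profileID {
		return ctx, fmt.Errorf("profile %d already in progress", p.profileID)
	}
	if p.subscribers == 0 {
		var err error
		if p.cpuOut, err = os.CreateTemp("", "cpuprofiler.*"); err != nil {
			return ctx, err
		}
		if p.traceOut, err = os.CreateTemp("", "runtimetrace.*"); err != nil {
			return ctx, err
		}
		if err := pprof.StartCPUProfile(p.cpuOut); err != nil {
			return ctx, err
		}
		if err := trace.Start(p.traceOut); err != nil {
			pprof.StopCPUProfile()
			return ctx, err
		}
		p.profileID = profileID
	}
	p.subscribers++
	// Label the caller's goroutines so its samples can be filtered out of the
	// process-wide CPU profile later (as in the `(pprof) tags` output above).
	ctx = pprof.WithLabels(ctx, pprof.Labels(fmt.Sprint(profileID), op))
	pprof.SetGoroutineLabels(ctx)
	return ctx, nil
}

// Unsubscribe stops and persists the profiles once the last subscriber leaves.
func (p *BackgroundProfiler) Unsubscribe() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subscribers--
	if p.subscribers == 0 {
		trace.Stop()
		pprof.StopCPUProfile()
		p.traceOut.Close()
		p.cpuOut.Close()
	}
}
```

Filtering the resulting profile then amounts to selecting samples by the `profileID` tag, which is what the `(pprof) tags` output above is showing.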
Some more thoughts from a recent escalation: we could also more tightly integrate statement bundle requests with execution traces:
In another recent (and currently ongoing) escalation, we are seeing a demonstration of the power of execution traces. Starting from a trace that showed a slow span in DistSender, we were able to use execution traces to narrow this down to a slow (700ms) syscall reading off the TCP connection (with a similar analysis in progress on the server side, outcome yet unknown). This level of visibility is quite powerful and allows us to discriminate between issues in "our code" vs the OS/networking layer. This is more evidence that making it easier for customers to collect matching trace/exec-trace data is worth our while.
Is your feature request related to a problem? Please describe.
Traces can answer some questions today, but not others. In particular, unexpected delays between trace events (Go scheduler latency, lock contention, high CPU usage in a subsystem such as pebble, etc.) lead to time-consuming and often fruitless investigations. A "big hammer" that would allow us to get to the bottom of most of them would be invaluable.
Describe the solution you'd like
A big hammer whose centerpiece is automatic CPU and/or Go execution profiles (we suggest implementing both, since they're complementary and behave the same from an API point of view).
Only a single CPU profile/execution trace (henceforth called "profile" interchangeably for simplicity) can be active per node at any given point in time. We add a service on each node that multiplexes profile requests onto slots of duration <=1s: profiles are taken in 1s-or-shorter chunks, and multiple (local, in-process) listeners can subscribe to profiles for however long they wish, which means they'll be notified of any 1s chunks overlapping their tenure. No profiles are taken in the absence of listeners. Profiles get stored on node-local storage and have an identifier with which they can be retrieved at a later time. (There are retention policies, tenant access rules, and everything you need to make this work for UA, etc.; details TBD.)
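A sketch of what such a slot-multiplexing service could look like, under the assumption that `takeChunk` is whatever actually captures a <=1s CPU profile or execution trace chunk and persists it; none of these names are from CRDB:

```go
package profmux

import (
	"sync"
	"time"
)

const slotDuration = time.Second

// chunkRef points at a persisted profile chunk, e.g. a path on node-local storage.
type chunkRef string

type listener struct {
	ch chan chunkRef
}

// Muxer multiplexes profile requests onto <=1s slots: chunks are only taken
// while at least one listener is subscribed, and every listener whose
// subscription overlaps a chunk receives a reference to it.
type Muxer struct {
	mu        sync.Mutex
	listeners map[*listener]struct{}
	running   bool
	takeChunk func(time.Duration) chunkRef // captures one profile chunk of the given length
}

func New(takeChunk func(time.Duration) chunkRef) *Muxer {
	return &Muxer{listeners: make(map[*listener]struct{}), takeChunk: takeChunk}
}

// Subscribe registers a listener; the first listener starts the chunk loop.
// The returned cancel func ends the listener's tenure.
func (m *Muxer) Subscribe() (*listener, func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	l := &listener{ch: make(chan chunkRef, 16)}
	m.listeners[l] = struct{}{}
	if !m.running {
		m.running = true
		go m.loop()
	}
	cancel := func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.listeners, l)
	}
	return l, cancel
}

// loop takes chunks back to back while listeners exist and fans out references.
func (m *Muxer) loop() {
	for {
		m.mu.Lock()
		if len(m.listeners) == 0 {
			m.running = false
			m.mu.Unlock()
			return
		}
		m.mu.Unlock()

		ref := m.takeChunk(slotDuration) // blocks for up to one slot

		m.mu.Lock()
		for l := range m.listeners {
			select {
			case l.ch <- ref: // this listener's tenure overlapped the chunk
			default: // slow listener: drop rather than block the profiler
			}
		}
		m.mu.Unlock()
	}
}
```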
We can use this to enable the following use cases:
For a practical example where we'd have loved to have CPU profiles with traces, see https://github.com/cockroachlabs/support/issues/2033.
As for Go exec traces, they would be invaluable for the NUMA issues we've seen, and for many latency investigations that come with unexplainable gaps in the trace in non-CPU-intensive parts. There are likely scheduler latencies at play, and an execution trace would prove it. But we only have pprof-loop, which is hard to use and isn't included in the initial outreach, so we spend lots of time asking for (and often not getting) this additional info. Customers spend lots of time putting it together too.
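For reference, collecting the kind of Go execution trace discussed here requires nothing more than the standard `runtime/trace` package; a minimal, CRDB-independent example:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Capture a short window; in a real investigation this would bracket the
	// slow operation instead of a sleep.
	time.Sleep(2 * time.Second)
	trace.Stop()
}
```

Opening the result with `go tool trace exec.trace` surfaces scheduler latency, syscall blocking, and GC pauses directly.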
Describe alternatives you've considered
Can't really think of much
Additional context
Jira issue: CRDB-24848
Epic: CRDB-32402
Footnotes
Note that if the diagnostics request has a "min latency" bound, this might in effect turn on profiling continuously on some or all nodes in the cluster, so we need to validate that this is OK, perhaps by using suitable RBAC to gate it to admins only, or by only allowing CPU profiles with a lower profiling rate, etc. ↩
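For the "lower profiling rate" idea, the knob is `runtime.SetCPUProfileRate` called before starting the profile. A minimal sketch (the rate and duration are arbitrary; in my experience `pprof.StartCPUProfile` logs a warning because it also tries to set the default 100 Hz rate, while the lower rate set first remains in effect):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.SetCPUProfileRate(10) // 10 Hz instead of the default 100 Hz

	f, err := os.Create("lowrate.cpuprofile")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	time.Sleep(5 * time.Second) // profile window
}
```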