runtime: diagnostics improvements tracking issue #57175

mknyszek · 2022-12-08T17:51:15Z

mknyszek · 2022-12-08T18:17:46Z

2022-12-07 Sync

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh

Notes:

pprof labels in execution traces
- Michael K: I need to follow up on the CL and issue.
- Michael P: Have you considered runtime/trace regions?
- Nick: Yes, but it doesn't quite hit our use-cases.
- Rhys: Could use logs instead. Is the point to follow goroutines?
- Felix: Yeah.
- Rhys: Started parsing the trace format for some internal tooling.
- Felix: Prefer labels because inheritance. Mostly like it for profile tools on top, but maybe doesn't matter for tracing tools. The parent goroutine gets recorded so things like request IDs can be tracked in post-processing.
  - Also useful for attributing CPU time.
- Michael P: Tasks are inherited, but inheritance is processed downstream.
  - Would be nice to have pprof labels just so you don't have to think about which one to use.
- Michael K: Useful even just to bridge pprof and runtime/trace.
- Austin: Agreed. We can one day get to the point of deprecating the old APIs as well.
- Rhys: RE: Attributing CPU time, can see at better than 10 ms of granularity already (even if Ps aren't CPU time, it's still time the rest of the app couldn't run).
- Michael P: There's an issue about measure CPU time on-line. (TODO: Find it.)
Trace parsing API
- Michael K: How important is this? Priority?
- Felix: Important for the community, but we can't make use of it in our case. Will the trace format change in between releases?
- Michael K: I think we can always guarantee a trace format for a release.
- Michael P: How high level should this API be?
  - cmd/trace has two levels:
    - Low-level that understands format and events.
    - Higher-level that understands relationships between goroutines, etc.
- Michael K: Page trace splits this into "parser" and "simulator." The latter is more stateful.
- Felix: Intuitive feeling toward lower level API.
- Rhys: +1 to low level.
- Austin: Scalability of processing traces.
  - Currently not in a good state in low or high level format (currently requires the whole trace).
  - Can fix trace wire format for low-level parsing scalability issues, but it's much less clear how to do this for the high-level format.
- Austin: Flight recorder idea.
  - Interacts interestingly with simulation. Current trace snapshots everything.
  - Solved this in debuglog; reads its own tail and keeps local state updated.
  - Complicated trade-offs in this space.
- Felix: We use a lot of JFR, one thing that's nice is it's broken down into self-contained chunks.
Michael K sent out a very half-baked trace format revamp. (Thanks for the comments! Far from ready to share more widely.)
- The next step is to measure the actual current overhead.
  - Maybe add a mode to Sweet?
  - Rhys: Have been collecting CPU profiles and execution traces. 20% of CPU time during execution trace is for execution trace itself. 95% of overhead is collecting stack traces.
    - Collect 1 second every 5000 seconds and no one complains. People do complain about goroutine profiles every 2 minutes.
  - Michael K: Shooting for KUTrace overhead, so making stack traces optional/faster is just step 1.
  - Felix: Trace effect on tail latency.
    - Rhys: Traces are my view of tail latency.
  - Felix: Benchmark for pathological cases and worst case.
  - Austin: Linked to trace overhead issue, Dmitry proposed switching to frame pointer unwinding.
  - Felix: At some point implemented frame pointer unwinding in userland and it was 50x faster (link).
  - Rhys: Not sure what kind of tool you could build without stack traces in an app that doesn't set pprof labels, tasks, regions, trace logs, etc.
  - Michael K: Integration with structured logging?
  - Michael P: It does add yet another way to add things to runtime/trace.
  - Rhys: The standard library (e.g. database/sql) doesn't currently use runtime/trace at all, maybe it should.
  - Michael K: This connects to deciding what goes into a trace. I think this is a very good idea.
  - Felix: +1. Java world and JFR does this.
Deciding what goes into a trace
- Disabling stack tracing / reduce stack trace depth
- Filtering by pprof labels
- Specific event classes
- Standard library events
- Rhys: I've made this decision for my organization. Expected that you do profiling for a long running service. No opportunity for app owners to express opinions. People who complained forked the package, turned it off, and now coming back. I kind of want everything.
- Felix: I would love to be in a place where we can do that, but we get pushback from users when the overhead is too high.
- Rhys: The question is how close we get within 1% overhead. My choice was to get everything, but less often.
- Felix: Desire to get more of everything is in conflict with adding more kinds of things in the trace.
- Michael P: Agreed. Ideally we have tracing that's sufficiently fast that we have on all the time, but if libraries are allowed to add new traces then it could be a problem. It would be nice to turn that off without forking a library.
Before next sync:
- Michael K: Unblock pprof labels patch and benchmarking trace overhead.
- Felix: I can contribute a worst case benchmark.
  - Currently blocked on pprof labels in trace.
- Felix: Started to work on gentraceback. Might work on it over the holidays.
  - Trying for bug-for-bug compatibility.
- Michael P: Austin has been working on this too.

felixge · 2022-12-09T09:46:41Z

I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into #57159 . Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB (15s of prod activity) yesterday, and it was a real eye opener to the way the current UI struggles.

tbg · 2022-12-09T09:58:40Z

I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer where it would freeze on me all the time, esp. in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N. (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace so it's not as great, but I was glad to have something that worked.

thediveo · 2022-12-09T16:11:06Z

would the pprof labels also show up in goroutine traces?

qmuntal · 2022-12-09T20:58:25Z

I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?

mknyszek · 2022-12-09T21:37:04Z

@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.

@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer that's certainly more in the same spirit. Do you have an issue for tracking that already?

qmuntal · 2022-12-09T22:05:40Z

Do you have an issue for tracking that already?

I still have to prepare the proposal, I plan to submit it next week.

unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).

Not for now, but once I finish this I want to investigate how feasible is too unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.

qmuntal · 2022-12-14T08:44:12Z

Do you have an issue for tracking that already?

I do now #57302 😄

gopherbot · 2022-12-21T22:46:06Z

Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead

mknyszek · 2022-12-22T16:43:57Z

2022-12-22 Sync

Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh

Organizational stuff
- OK to record meetings?
- Meeting recorded with transcript this week (please ask if you would like to see it).
Trace overhead benchmarks
- https://go.dev/cl/459095

name                                old time/op            new time/op            delta
BiogoIgor                                      17.7s ± 3%             17.5s ± 4%     ~     (p=0.190 n=10+10)
BiogoKrishna                                   15.1s ± 4%             15.1s ± 4%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             5.78s ± 7%             5.76s ±11%     ~     (p=0.853 n=10+10)
BleveQuery                                     2.37s ± 0%             2.37s ± 0%   -0.26%  (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat                 16.9s ± 9%             16.9s ± 7%     ~     (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1             36.7s ± 1%             44.4s ± 2%  +21.01%  (p=0.000 n=10+10)
GoBuildKubelet                                 47.0s ± 2%             48.8s ± 3%   +3.72%  (p=0.000 n=10+10)
GoBuildKubeletLink                             8.89s ± 2%             8.88s ± 4%     ~     (p=0.720 n=10+9)
GoBuildIstioctl                                45.9s ± 1%             47.8s ± 2%   +4.09%  (p=0.000 n=10+10)
GoBuildIstioctlLink                            9.07s ± 2%             8.99s ± 2%     ~     (p=0.095 n=10+9)
GoBuildFrontend                                15.7s ± 4%             16.1s ± 2%   +2.45%  (p=0.043 n=10+10)
GoBuildFrontendLink                            1.38s ± 2%             1.37s ± 3%     ~     (p=0.529 n=10+10)
GopherLuaKNucleotide                           27.9s ± 0%             27.9s ± 1%     ~     (p=0.853 n=10+10)
MarkdownRenderXHTML                            256ms ± 2%             256ms ± 2%     ~     (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest                 618µs ± 7%             657µs ±10%   +6.30%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             722µs ± 6%             773µs ± 4%   +6.96%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  508µs ± 3%             532µs ± 3%   +4.73%  (p=0.000 n=10+10)

name                                old average-RSS-bytes  new average-RSS-bytes  delta
BiogoIgor                                     68.8MB ± 2%            71.8MB ± 4%   +4.40%  (p=0.000 n=10+10)
BiogoKrishna                                  4.42GB ± 0%            4.42GB ± 0%     ~     (p=0.739 n=10+10)
BleveIndexBatch100                             194MB ± 2%             198MB ± 3%   +1.91%  (p=0.008 n=9+10)
BleveQuery                                     536MB ± 0%             537MB ± 1%     ~     (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat                 444MB ± 1%             446MB ± 0%   +0.41%  (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1             132MB ± 1%             142MB ± 4%   +7.61%  (p=0.000 n=10+10)
GoBuildKubelet                                1.75GB ± 1%            1.85GB ± 1%   +5.51%  (p=0.000 n=10+10)
GoBuildIstioctl                               1.35GB ± 1%            1.42GB ± 1%   +5.49%  (p=0.000 n=10+9)
GoBuildFrontend                                511MB ± 2%             543MB ± 1%   +6.31%  (p=0.000 n=10+9)
GopherLuaKNucleotide                          37.0MB ± 1%            40.4MB ± 2%   +9.24%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           21.8MB ± 3%            24.0MB ± 3%  +10.14%  (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest                5.40GB ± 1%            5.38GB ± 1%     ~     (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest            5.72GB ± 1%            5.71GB ± 1%     ~     (p=0.971 n=10+10)
Tile38KNearestLimit100Request                 7.26GB ± 0%            7.25GB ± 0%     ~     (p=0.739 n=10+10)

name                                old peak-RSS-bytes     new peak-RSS-bytes     delta
BiogoIgor                                     95.9MB ± 4%            98.5MB ± 3%   +2.70%  (p=0.030 n=10+10)
BiogoKrishna                                  4.49GB ± 0%            4.49GB ± 0%     ~     (p=0.356 n=9+10)
BleveIndexBatch100                             282MB ± 3%             284MB ± 4%     ~     (p=0.436 n=10+10)
BleveQuery                                     537MB ± 0%             538MB ± 1%     ~     (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat                 485MB ± 1%             483MB ± 0%     ~     (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1             180MB ± 2%             193MB ± 3%   +7.19%  (p=0.000 n=10+10)
GopherLuaKNucleotide                          39.8MB ± 3%            46.0MB ±20%  +15.56%  (p=0.000 n=9+10)
MarkdownRenderXHTML                           22.1MB ± 3%            25.5MB ± 7%  +15.45%  (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest                5.70GB ± 1%            5.68GB ± 1%   -0.45%  (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest            5.93GB ± 1%            5.91GB ± 2%     ~     (p=0.631 n=10+10)
Tile38KNearestLimit100Request                 7.47GB ± 1%            7.46GB ± 0%     ~     (p=0.579 n=10+10)

name                                old peak-VM-bytes      new peak-VM-bytes      delta
BiogoIgor                                      802MB ± 0%             803MB ± 0%   +0.11%  (p=0.000 n=10+10)
BiogoKrishna                                  5.24GB ± 0%            5.24GB ± 0%   +0.01%  (p=0.001 n=10+10)
BleveIndexBatch100                            1.79GB ± 0%            1.79GB ± 0%   +0.05%  (p=0.000 n=8+8)
BleveQuery                                    3.53GB ± 1%            3.53GB ± 1%     ~     (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat                1.21GB ± 0%            1.16GB ± 4%     ~     (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1             875MB ± 0%             884MB ± 0%   +1.02%  (p=0.000 n=10+10)
GopherLuaKNucleotide                           733MB ± 0%             734MB ± 0%   +0.11%  (p=0.000 n=9+10)
MarkdownRenderXHTML                            733MB ± 0%             734MB ± 0%   +0.10%  (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest                6.42GB ± 0%            6.39GB ± 1%     ~     (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest            6.62GB ± 1%            6.61GB ± 2%     ~     (p=0.927 n=10+10)
Tile38KNearestLimit100Request                 8.16GB ± 1%            8.18GB ± 0%     ~     (p=0.649 n=10+8)

name                                old p50-latency-ns     new p50-latency-ns     delta
Tile38WithinCircle100kmRequest                  144k ± 3%              159k ± 3%  +10.56%  (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest              215k ± 1%              232k ± 2%   +7.91%  (p=0.000 n=9+10)
Tile38KNearestLimit100Request                   347k ± 2%              373k ± 1%   +7.21%  (p=0.000 n=10+10)

name                                old p90-latency-ns     new p90-latency-ns     delta
Tile38WithinCircle100kmRequest                  908k ± 6%              956k ± 9%   +5.22%  (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest             1.07M ± 4%             1.11M ± 5%   +4.33%  (p=0.001 n=10+10)
Tile38KNearestLimit100Request                  1.03M ± 3%             1.05M ± 4%   +2.64%  (p=0.011 n=10+10)

name                                old p99-latency-ns     new p99-latency-ns     delta
Tile38WithinCircle100kmRequest                 7.55M ± 9%             7.93M ±13%     ~     (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest             7.81M ± 8%             8.39M ± 2%   +7.36%  (p=0.000 n=10+8)
Tile38KNearestLimit100Request                  2.03M ± 4%             2.08M ± 5%   +2.52%  (p=0.019 n=10+10)

name                                old ops/s              new ops/s              delta
Tile38WithinCircle100kmRequest                 9.73k ± 7%             9.16k ±11%   -5.83%  (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest             8.31k ± 6%             7.77k ± 4%   -6.55%  (p=0.000 n=10+9)
Tile38KNearestLimit100Request                  11.8k ± 3%             11.3k ± 3%   -4.51%  (p=0.000 n=10+10)

Introduction: Bryan Boreham, Grafana Labs
- Questions within the team about whether useful information has been derived from Go execution traces.
- Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
- Michael K: Interesting data point about usability.
- Michael P: Hard to link application behavior to trace.
- Bryan: Example: channels. Still don't really know where to find that data.
- Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
- Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
  - Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
  - Tooling on the way to get open sourced.
Heap analysis plan (proposal: x/debug: make the core and gocore packages public #57447)
- Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
- Austin: Does Perfetto do better with instantaneous events?
  - Michael P: Yes, there's a 20px wide arrow but we have so many.
  - Rhys: Hold shift, draw a box. If you aim well, you get what you want.
- Rhys: Why is there only one timestamp on some events?
  - Austin: We can add another timestamp.
  - Michael P: Syscall fast path does a lot less.
pprof labels in traces
- Michael K: I think I've unblocked Nick. Michael and I are reviewing.
runtime.gentraceback cleanup
- Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
Action items:
- We're slowing down for the holidays, so no strong expectations
- Michael K:
  - Try to land execution trace benchmarking.
  - Might look into heap analysis stuff.
  - After break, might want to start working on trace format more seriously.
Happy holidays!

mknyszek · 2023-01-13T18:33:46Z

2023-01-05 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole

Organizational stuff
- @mknyszek is out today, @aclements running the meeting.
- @dashpole from Google OSS Telemetry joining us.
Benchmarks: Can we add a goroutine ping pong example? (Felix)
- Tracer benchmarks all show relatively low overhead. Can we add a benchmark that demonstrates the worst case?
- Austin: Sweet probably isn’t the right place because that’s application-level. Maybe add to Bent?
- Felix: Next step on these benchmarks? Land MK’s trace benchmark support?
- Austin: It’s certainly fine to land. We don’t have a good way to integrate these “extra dimensions” into our automated benchmarking.
- AI(austin): Bring up monitoring extra benchmarking dimensions.
- Austin: “Unit benchmarks” would be the perfect place for a ping pong benchmark (we already have one in the main repo), but we never quite got to integrating these into automated monitoring.
Are only GC STW recorded? Would it make sense to record other STW events (read metrics, goroutine profile, heap dump)? (Felix)
- Rhys: You get ProcStop events
- Austin: Yeah, you’re right that we trace high level GC STW events.
- Rhys: Currently the GC traces the “best case” STW, which can be really misleading.
- Austin: We could definitely have a “stopping the world” and a “world stopped”. Maybe don’t need that for start.
- Felix: That would be great. We’re investigating rare long STWs right now.
- Rhys: Starting the world can take a while. Problems with heap lock contention. I would love to have more visibility into the runtime locks.
- Austin: Runtime locks are a bit of a mess. I also wonder if they should be “scalable”.
- Rhys: I’d love to discuss that. C&R office hours?
- Austin: Perfect.
- Conclusion: Let’s add events for all STWs and also separate “stopping” from “stopped”.
Updates on Perfetto UI (Felix and Nick)
- Add to UI CL: https://go.dev/cl/457716
- Felix: The JSON currently produced by the trace tool is basically compatible with Perfetto. Doesn’t let us open really large traces without splitting, which was one of the hopes. And it takes a while to load. I was able to use the command line tools to load a 280MB trace into a 9.8GB JSON trace and load that in Perfetto, but it took 20 minutes. Nick has been working on outputting proto directly, which will hopefully produce less data than JSON.
- Rhys: When I tried this a while ago, the connection of data flow wasn’t quite right.
- Felix: This CL doesn’t fix that. I’m hoping it’s an upstream issue, which they’re pretty responsive to. I’m hoping protobuf will just make it go away, since that’s their canonical input.
- Nick: Stack traces seem to be missing from protobuf, which we definitely want. We might need upstream changes to support that.
- Felix: I suspect there may be some long tail of issues. But the initial plan would be to keep both viewers until we feel this is solid.
- Austin: How does the streaming work?
- Felix: They have an in-memory column store with a SQL interface on top of it. Large traces would still be a problem because they’d need to be loaded fully into memory.
- Austin: In principle we could swap out that column store for our own streaming thing, but that sounds like a significant amount of work.
- Felix: On Go Time someone said they only use runtime trace when they’re really desperate and then they can’t figure it out anyway. Most people don’t think about their program from the perspective of the scheduler. I’d like to have different pivoting, like one timeline per G (or M). We sort of have that in the goroutine analysis, but that only shows on-CPU time. Dominick did that in gotraceui.
Updates on pprof labels (Nick)
- Nick: In MK’s recent comments on pprof labels CL, he wondered about a size limit on labels being recorded in the trace. Thinking about trace overhead. Users can also add arbitrary logs (limited by trace buffer size). My thought is that users are deciding to make these as big or as small as they want.
- Austin: My reaction is “do what the user said”
- Rhys: It seems like we already don’t have a limit on the pprof labels (number/length/etc) and maybe it would have been good to have a limit, but we already don’t.
- Bryan: For me it’s more important to be able to find out how much damage you’re doing with this data. Inevitably people want one more byte than the limit and will be frustrated.
- Felix: Two sides to this problem: how to get the data in the trace while keeping overhead low, and the other is keeping the memory usage low for keeping all these labels. For trace overhead, I’m thinking we want two or three levels of filtering: filter what events, filter events by properties (e.g., duration). JFR supports both of these. And potentially a way to modify events (maybe too far), like truncation. At some point you can almost guarantee fixed-cost tracing. E.g., turn off everything except profile events; now you have timestamps on profile events without all the other overhead.
- Austin: MK and I have definitely been thinking in that direction. The current trace viewer is almost purpose-built for analyzing the scheduler and needs to understand how a lot of events relate. But if we open up reading traces, the trace viewer becomes just another tool and maybe it’s fine for it to say “I need these events” (kind of like “perf sched” or similar).
- Felix: I can ask my Java colleagues about how this works in JFR.
- Rhys: Curious how you’re thinking about filtering.
- Felix: Simpler is better. You could imagine a callback, but that’s not simple. Probably something like runtime/metrics where you can discover the events and select.
- Rhys: Definitely need a header saying which events are included.
- Felix: Agreed. Also nice for viewers so they don’t have to hard-code all of the events.

mknyszek · 2023-01-19T16:53:57Z

2023-01-19 Sync

Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole

Felix: gentraceback iterator refactoring
- Felix: What's the progress?
- Austin: Made progress. Running into issues with write barriers and trying to knock down all the write barriers one by one. Big open question of testing; so many horrible corner cases. No good answers.
- Felix: Tried to do it incrementally instead of all at once; also painful. RE: testing, would it be useful to have the ability to instrument a PC and do a traceback from there?
- Austin: That would help. The worst parts are weird though, like signals. If we had a good way to inject a symbol, like a breakpoint, that would help a lot.
  - Idea: could use hardware breakpoints via perf-event-open (Linux only, but at least architecture-independent) which could get enough coverage for Austin to be happy.
  - Could potentially synthesize other signal tests from a single signal.
- Felix: I'll give it a shot.
- Michael K: What work could we do in parallel?
- Felix: Could write a frame pointer unwinder separately for tracing just to get an idea of the overhead.
  - Austin: +1. Tricky things include logic in gentraceback for filtering out frames. Maybe it doesn't matter for the trace viewer (i.e. don't filter). Also inline unwinding. Trying to totally separate inline unwinding in gentraceback. Once its its own separate thing, it'd be straightforward to plumb that into a frame pointer unwinder.
  - Michael K: Could we skip inline unwinding for the experiment?
  - Austin: Yeah.
  - Michael P: +1 to separating out inline unwinding. Already "runtime_expandFinalInlineFrame" in the runtime which is a good reference point for this.
  - Felix: Also all the complexity with cgo traceback, but we should just ignore that for the experiment.
  - Michael K: The cgo traceback tests are also really flaky, and if we could have better testing around that that would be great.
Felix: Perfetto UI blues … (timeline bug, link bug, stack traces, large traces, small screens, protocol buffer format) … gotraceui w/ wasm? Having an online tool with independent release cycle is tempting?
- CL out that makes Perfetto work. Limitations:
  - Limited for very large traces as-is.
  - Doesn't seem easy to make it work as well as go tool trace (bugs). e.g. timelines not named correctly. Events not connected correctly.
    - Harder: getting stack traces to show up. Nick has tried to make it work. Protobuf format doesn't have an obvious stack trace format?
    - Nick: Not a one-to-one mapping between Catapult format and Perfetto. Can stick a single location in the Perfetto format, but not a full stack trace. Little things in the protobuf format that aren't well-documented. e.g. string interning only works if you include a number in the header.
    - Michael K: MP and I looked into this. Perfetto knows how to do this for some traces, but it’s built into a C++ library, so we’d have to rewrite that in Go or call into it from Go. I’m not sure it even has strong backwards compatibility.
    - Michael P: There is the Perfetto tool that runs the RPC server. (trace_processor.) That loads into a SQLite in-memory DB, but does do better than the fully in-browser implementation. It can do bigger traces, though is still limited. That seems like enough of an improvement to me.
    - Felix: I have a 280MB trace that gets split into 90 parts for 15 seconds on a busy server. Maybe we should start with deciding what size trace we want to have a good experience for.
    - Michael K: I think 280MB is a big trace, though it’s only 15 seconds. I think we should be targeting bigger than that. It’s easy to get a 1GB trace. But we can start with Perfetto as long as it’s better and work toward that.
    - Austin: Is that better with Perfetto?
    - Felix: I think it would be better. Maybe 5x better, so a second at a time (don’t quote me on that).
    - Michael P: The trace_processsor is better, but still limited by the in-memory SQLite DB. Presumably that could be on disk. I don’t know if the trace loading is also linear in the trace size.
    - Rhys: What do you even do with an execution trace that large? How do you get value out of that?
    - Felix: This trace was from a colleague from an instance that was struggling with pauses. It looked like a straggling procstop. It was debugging the behavior of a whole application that was behaving poorly.
    - Rhys: So you were looking for behavior that was pretty zoomed-out.
    - Felix: Yeah.
    - Michael K: Part of the problem with existing traces is the usability of this. I think it’s a valid question about whether big traces are all that useful. Sometimes you’re not even really sure what you’re looking for. Say I wanted to run a full trace on every invocation of the compiler. You don’t necessarily know what you’re looking for to improve compiler speed.
    - Austin: I bet if you were to profile the space of large trace file, the vast majority of that would not be useful to you looking at it at a high level. Suggests a solution here for filtering is to just reduce what goes into the trace.
    - 280MB Trace Size Breakdown
    - Michael K: Maybe just proc start/proc stop for what Felix was describing.
    - Rhys: But once you find the problem, you want more detail. It's hard to catch the end of a garbage collection cycle because of the rules of starting a trace during a GC cycle.
    - Michael K: Fixing the mark phase issue should be easier than before.
    - Austin: Awesome breakdown!
  - User group said "please don't do this" because Perfetto isn't nice to small screens.
  - Felix: gotraceui
    - Viewing timelines for goroutines is great.
    - Would like Dominik to talk about gotraceui some more.
    - I want to be intentional about choosing Perfetto.
    - Michael K: I think the dependency on gio was a concern.
    - Dominik: Gio (the UI library I use) supports wasm, so it should be fairly
      straightforward to have gotraceui run in the browser if we want to go
      down that road.
    - Dominik: I still rely on loading entire traces into memory (but using significantly less memory than current go tool trace), but with the upcoming format changes, streaming data might be possible. We currently load everything into memory because when the user zooms out far enough, we need all events to compute what we display. But we could probably precompute these zoom levels, similar to mipmaps.
    - Dominik: For the current trace format, gotraceui needs roughly 30x the size of the trace in memory. so a 300 MB trace needs 9 GB.
    - Michael K: I have been thinking about an HTML UI that does something like Google Maps tiles to scale. We could skip a lot of work if we could take gotraceui as the UI, but port it into something more portable than Gio. OTOH, it’s even more work to build something from scratch.
    - Dominik: WRT gotraceui's use of Gio, there'll be pretty rich UI, and I don't fancy writing UIs in HTML/JS. But all of the processing of trace data could live externally
    - Michael P: It’s not necessarily a hard requirement that the Go project itself ship a trace viewer. We have to now because there’s no API. But if we shipped an API, it wouldn’t be a hard requirement. Much like we don’t ship a debugger.
    - Michael K: One option is that we ignore the UI situation entirely and build something that you can parse separately and ship something really bare later. In the meantime, point at a little tool that will shove it into trace_processor and point people at Perfetto. For a brief time, stop shipping our own. It’s very convenient that you only need a Go installation to view these traces, but I think you’re right that we could stop shipping a UI. We could also keep the existing UI working/limping while we do other things in parallel.
    - Felix: Is Dominik looking for contributors? (That comes with its own overheads)
    - Dominik: I'm usually not big on contributions in the form of code; but ideas and feedback are hugely appreciated
    - Michael K: We don’t have to make a decision on using Perfetto now. Maybe we should plug along for two more weeks (with Perfetto) and figure out if we can fix the issues without too much effort, and then make a hard decision on what to do at the next meeting.
    - 👍
Felix: traceutils anonymize & breakdown and ideas: (flamescope, graphviz, tracer overhead)
- Implemented anonymization of traces. Breakdowns, too.
- Tracer overhead tool that uses profile samples in the trace to identify overheads.
Felix: Format: Consistent message framing, remove varint padding for stacks
- 4 different cases for how an event can be laid out.
- Maybe a way to skip messages and layouts it doesn't understand.
- Austin: self-descriptive header giving lengths for each opcode
- Michael K: Any state in the trace makes things hard to push it up into OTel, since that’s completely stateless.
- Felix: We’re actually trying to do two things in OTel. Including binary data blobs, like pprof and JFRs. And something to send stateful things like stack traces, etc, where you can refer back to them efficiently.
- David: For trace I wouldn’t expect a stateful protocol to be introduced any time soon. But for profiling it may be a possibility.

mknyszek · 2023-02-02T16:53:11Z

2023-02-02 Sync

Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic

Felix: Discuss results from frame pointer unwinding experiments (blog, sweet results) and next steps
- Targeted ping-pong example, worst case. Worth noting that the stack depth in that benchmark is 2. Went from +773% -> +30%, apparently doing 50% more work too!
- Sweet: 10% -> 2% overhead!
- Michael K: Michael P mentioned missed cases.
- Michael P: Inlined frames are one example. Maybe we just accept slightly less accurate traces in the tracer.
- Austin: +1 to missing inlined frames, but we can also expand that after the fact.
- Michael K: Do you need the binary for that?
- Austin: Today, yes.
- Felix: The tracer already de-duplicates stack traces. If we do inline expansion at the end, there's probably not that much work to do.
- Michael P: Other avenue, do we need stack traces on every event? Maybe remove stack traces for some events?
- Michael K: Where does the rest of the time go?
- Felix: In the blog post. Frame pointer unwinding is only 9% of the trace overhead. 28% is cputicks. 21% is stack put.
- Austin: Shocked that cputicks is 28%. It's one instruction. I guess that's a good sign?
- Austin: (FP unwinding is also relevant for #53286. In that case it’s the kernel’s FP unwinder, but it means our FP data is going to have to be high quality for both.)
- Thepudds: Or maybe an option for sampling of stack traces?
- Michael K: I think it depends. As traces are used today, you probably want 100% sampling. For larger scale aggregation, I think it's a solid option.
- Michael K: Dream of nanotime to line up clocks.
- Austin: It might not be that bad. RDTSC is serializing so the extra math in nanotime might not make much of a difference in overhead.
- Michael K: We should definitely pursue this, at least for tracing.
- Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.
- Austin: Status of FPs is better than it once was. Saving grace of the tracer is you often don't have an assembly frame on the stack. Talked about making vet complain if you clobber the frame pointer in assembly code. Would be surprised if there are problems in the compiler generated code; worry much more about assembly.
- Felix: Worried about stack shrinking / relocation. Growing shouldn't happen while in unwinding, but not sure about shrinking.
- Austin: I think you always see a fully formed stack.
- Felix: There's no chance of seeing the stack mid-move?
- Austin: The goroutine that's getting moved has to be stopped.
- Nick: If unwinding happens asynchronously then it's a problem, like CPU profiling. We could use gentraceback in the difficult cases.
- Felix: Plan on working on better unwind testing. That machinery could be used to harden frame pointer unwinding as well.
- Michael K and Austin: Not a blocker to have the testing.
- Austin: FP on x86 is specified as part of the Go internal ABI. If the compiler is messing that up that's a violation of the ABI and definitely a bug. Doesn't apply to hand-written assembly.
- thepudds: One of the older CLs mentioned its approach depended on the stack not being copied while walking the frames, along with the comment “currently ok, but won't be if we preempt at loop backedges”... but maybe that old concern is not a current concern....
- Michael K: I think loop backedges aren't a concern, and async preemption as it exists shouldn't be an issue.
- Michael P: Traceback itself would just disable preemption just for consistency, but just because it's in the runtime package, we won't ever async preempt.
- Austin: I'm not sure why loop backedges would be a concern.
- Michael K: I don't think we should block on inline expansion, but maybe cgo tracebacks.
- Austin: As an intermediate step, use gentraceback for if there’s a cgoTracebacker and cgo on the stack. Will work for 99% of our users.
- Felix: Like the idea of making it the new default, but with the ability to switch back.
- Michael K: We could add a GODEBUG flag
Felix: Flight recorder / ring buffer mode
- Felix: We’d like to capture traces of slow spans. Wait for a p99 response and then get the last N MB of trace. I’m currently working on an experiment to see if this can be done in user space.
- Michael K: I think finding the oldest batch is an O(N) operation. Ordering the batches is difficult because we assume everything will show up eventually.
- Austin: The tracer is really stateful, so it's really difficult to actually manage a ring buffer. debuglog is a ring buffer, and what it does is consume its own format in order to manage a snapshot of the state.
- Felix: I’d be okay with getting a non-perfect trace at the end. At least understand what the goroutines are doing. Maybe we could every once in a while emit a “synchronization” event. If a complete redesign of the format is required, [flight recorded mode] is something we’d be interested in.
- Michael K: I’d like to find out what the overhead of writing the trace is. Say you have no stack traces, where is the rest of the time going? That’s important information for redesigning the trace format. I’ve already been thinking about redesigning the format. At the cost of using more space, it has to end up less stateful. Regularly synchronizing is one way to do that. That’s kind of where I was going: a “trace” is really a collection of self-contained traces. With the tooling able to be more resilient at the edges. Synchronation wouldn’t necessarily be STW, but you have a ragged barrier across the Ps that sync them all to the next trace chunk. That gets complicated in a ring buffer. I was thinking of gathering the requirements for a new trace format. Because there’s so much state, it’s hard to make it completely stateless without ballooning the trace.
- Felix: JFR does that ... splitting the stream up into self-contained chunks.
- Michael K: We’re definitely on the same page [wrt flight recorder]. The Go team arrived at this, too. We’re also trying to make ELF core dumps the source of truth for heap analysis. Ideally we’d be able to pull the ring buffer out of a core dump so you can see exactly what was happening before crashing.

qmuntal · 2023-02-02T17:00:34Z

Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and dragons in the compiler where the FP isn't on the stack when it should be. Previous implementation hit this and I suspect I hit this as well.

FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whichever is the fix for that, would be good to take SEH also into account.

This change does a lot at once, but it's mostly refactoring. First, it moves most of the profile abstraction out of benchmarks/internal/driver and into a new shared package called diagnostics. It also renames profiles to diagnostics to better capture the breadth of what this mechanism collects. Then, it adds support for turning on diagnostics from configuration files. Next, it adds support for generating additional configurations to capture the overhead of collecting diagnostics, starting with CPU profiling. Lastly, it adds support for the new Trace diagnostic. (This change also fixes a bug in go-build where Linux perf flags weren't being propagated.) In the future, core dumps could easily be folded into this new diagnostics abstraction. For golang/go#57175. Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754 Reviewed-on: https://go-review.googlesource.com/c/benchmarks/+/459095 Run-TryBot: Michael Knyszek <[email protected]> Reviewed-by: Michael Pratt <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

mknyszek · 2023-02-16T16:59:26Z

2023-02-16 Sync

Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh

468301: runtime: delete gentraceback
- Austin: Needs more testing.
- Austin: Nice things to do as a result, listed in the issue. e.g.
  - Simpler defer processing
  - CPU profiles have a low limit on frames it'll capture.
    - Iterator makes this much more tenable to fix.
  - Years-old squirrely bug in the race detector.
- Felix: I’m happy to look into testing using perf, but I’m not sure when I can get to it.
- Rhys: If there are more frames than you want to record, could you add some context by including N outermost frames and M innermost frames. Maybe a “runtime._Elided” frame in the middle.
- Michael P: We’ve thought about doing that for panic tracebacks.
463835: runtime: frame pointer unwinding for tracer Felix Geisendörfer: wip, but almost ready for review
- Are slight stack trace differences acceptable?
  - Michael K: I think that’s fine. As we move toward letting people parse the format, I think lining up traces with stacks from other sources could become more of a problem.
  - Felix: The current patch passes most of the tests of tracebacks in traces.
- Should it use an unwinder interface similar to austin’s patches?
- Could systemstack be changed to push frame pointers? Otherwise the caller frame is lost. Naive attempts to make this change caused crashes.
  - Austin: Yes please.
- Weird issue with syscalls on BSDs losing a frame.
  - Austin: That’s probably lazy assembly.
  - Felix: Another option is to only enable FP unwinding on Linux for now.
  - Austin: As long as it works on Linux, Windows, and Darwin I’m happy.
- Cgo unwinders
  - Austin: It’s fine to take the slow path if the current goroutine has cgo frames and there’s a cgo unwinder.
- Felix: I got inlining to work (when traces are finalized). Benchmark numbers are still holding.
Michael K: Once all of the backtrace stuff is settled, I want to try using the monotonic clock (nanotime) rather than CPU ticks.
- Nick: Could you record nanotime at the beginning of a batch and then CPU ticks after that.
- Michael P: To do that safely, you’d need to know when you migrate CPUs. Linux’s restartable sequences can get you that.
- Michael K: There might not be a performance gap between nanotime and cputicks.
- Austin: If there’s even a performance gap, you could push more of the nanotime computation into the trace reader.

$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
    │  CPUTicks   │              Nanotime               │
    │   sec/op    │   sec/op     vs base                │
*-8   10.75n ± 0%   16.11n ± 0%  +49.88% (p=0.000 n=20)

runtime: copystack doesn't adjust frame pointers on arm64 · Issue #58432 Felix Geisendörfer
- It was relatively easy to fix once I understood what was going on, but there appear to be dragons there.
- Boolean in the runtime does a double check of FPs on stack copies.
- Would like to treat arm64 as a separate issue, but I plan to get to it.
460541: runtime: reduce sysmon goroutine preemption (Felix Geisendörfer)
- Michael P: There are likely issues here with special cases in the scheduler. Not sure they're easy to fix.
cmd/pprof: macOS 12.6.1 (M1) profile overcounts system calls (again) #57722 (Felix Geisendörfer)
- Michael P: C reproducer and handing off to Apple (if it works) seems like a reasonable next step. No guarantee we'll get a fix though.
proposal: runtime: add per-goroutine CPU stats · Issue #41554 (Felix Geisendörfer)
- Felix: Initial justification was along the lines of billing, which seems better served by pprof. Then it shifted to fast control loops to throttle users. It seems better to have scheduling priorities, but barring that it seems good to let user space do priorities.
- Michael P: We’ve been discussing having tracing that’s cheap enough to have on all the time, and a parsing library. Maybe a user could do this by enabling tracing and parsing their own trace. Is this generally the right approach to user throttling at all?
- Rhys: I think a cheap trace that can be parsed in the app is good and flexible. I’m not sure per-goroutine stats is the right approach. E.g., if I use the net/http client, there are a bunch of goroutines involved that I don’t control but I want to understand the latency of.
- Felix: One trade-off of the trace route is the latency of reading your own trace.
- Rhys: It would be useful if the app could say, “I need a ragged barrier ASAP and I’m willing to take some performance hit.”
- Michael K: The other complication is how fast we can make the parsing. That might add unacceptable latency.
- Felix: I think the “explain analyze” case is not the best example. The most difficult is trying to throttle a user of the database that’s doing something you don’t want. In that case you don’t know ahead of time, so you’d be doing the ragged barrier all the time.
- Michael P: I think that’s a good argument for actual priorities in the scheduler. If you have some background goroutine watching for bad behavior, that might not get scheduled if there’s bad behavior.
- Austin: Swirling around problems that people have been thinking about for decades. Would love to see a summary of the current state-of-the-art is here.
- Michael K: Probably only OS APIs.
- Austin: That's not a bad thing. If it's a good API, we can consider replicating it.
AIs
- Michael K: Writing down trace requirements in earnest
- Michael K: Testing for x/debug
- Michael P: Need to review Austin's CL stack.
- Michael P: debug/gosym proposal.
- Felix: Clean up the tracer FP unwind patch (for amd64) to get it ready for review.
- Austin: Try to keep moving along gentraceback stack. Think about test-hook-coverage aspect.

mknyszek · 2023-03-03T16:41:35Z

2023-03-02 Sync

Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham

Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
- Most of the time is being spent detangling the existing tracer, documenting it, and using that to justify next decisions. Hopefully I'll have a draft to share before next time.
- [Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
  - Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
  - There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
  - Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
    - This choice forces us into avoiding a stop-the-world.
  - Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
    - Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings.
    - Trace events are sequences of 4-byte words whose internal structure respects byte boundaries and field alignment, to allow encoding/decoding events to just be memcpys and state management.
    - Using Felix's 280 MiB trace breakdown as a motivating example. By my calculations the current design woukld use around 10% more. Personally that seems acceptable for the other gains.
    - Every G event has an explicit G ID, but it's derived from a "G context" event. G IDs are also compressed.
- Michael K: We could make the stack table faster by only checking the hash instead of an exact match. Small chance of error.
- Rhys: Let's be cautious about making sure that traces actually work.
- Michael K: That's a good point. We should put an explicit bound on the likelihood of error. If it's astronomically small, is that fine?
- Rhys: Astronomical is good.
- Rhys: Would the new trace format still enumerate every goroutine? Currently can get stuck in many-millisecond STW waiting for tracing to enumerate all goroutines.
- Michael K: My plan was no. Goroutine profiles if you want that?
- Rhys: That's good. Yeah, you should be able to correlate a goroutine profile with a corresponding STW event in a trace. Happy about no STW in general for traces too.
- Rhys: RE: correlating things with traces, do we want to keep things orthogonal in general? Thinking about CPU profile events in traces.
- Michael P: I see where you're coming from in that you might want just the CPU profile events from a trace (with timestamps) and it's weird to get the whole trace and throw most of it away. We discussed having an API for configuring the trace and which events get emitted, so that might be a good place for that.
- Austin: There's a relevant issue about making CPU profiles more configurable as well, so maybe that's a good place for it too?
- Michael P: I think there are a lot of API questions here. Do you configure CPU profile in tracing at the CPU profile side or at the tracing side? The most natural way sounds like the tracing side because that's your actual output format, but I'm not sure. And then it gets complicated if you turn on CPU profiling in the tracing API and then you separately turn on CPU profiling, is that allowed? Right now you can't turn on profiling twice. And that's even more complicated, if we let you figure the sampling rate and they're not the same.
- Rhys: One of the difficulties that I've had in using execution traces and CPU profiles at the same time is that even though the CPU profile doesn't exactly stream its output while it's going. It's tricky to juggle two different output formats. At the same time that I'm trying to put into a single zip file to upload to blob storage. A single buffer would be handy.
- Michael P: A single buffer is ideal, but we don't have a converter that could pull a CPU profile out of a trace. We're missing information.
- Rhys: For one, we're definitely missing goroutine labels, though there's a patch out for that. We're also missing /proc/<pid>/maps for binary/symbol information.
- Austin: It occurs to me that Linux perf basically starts with /proc/<pid>/maps.
- Michael P: Perhaps we should also dump build information. We've been brainstorming about including this information for PGO.
- Michael K: There's room for as much as we want at the beginning of the trace, basically, so I'm all for adding more there.
Michael K: I have also have a rougher draft of a trace parser API, with input from Michael Pratt.
- Felix: Would the old trace format fit in the new parser?
- Michael K: That was my goal. We'd have to do some retrofitting, but the old parser already exists. Caveat: parsing old traces would still have the same overall properties as the trace parser currently does.
Felix: Frame pointer unwinding patch for tracer is ready to review. It’s only amd64 for now and a bit rough around the edges. We should discuss what needs to be done before landing. Cgo is still missing, but I’m working on adding that.

dominikh · 2023-03-04T01:39:59Z

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

Michael K: I have also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.

mknyszek · 2023-03-06T16:56:46Z

Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings

Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.

It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has it's buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)

Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, moving partitions forward doesn't have to even involve preemption.

Michael K: I have also have a rougher draft of a trace parser API, with input from Michael Pratt.

I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).

dominikh · 2023-03-06T17:48:09Z

In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.

That has two implications, however:

goroutines that don't unblock during the trace will be unaccounted for
the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…

That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

mknyszek · 2023-03-06T18:34:49Z

That has two implications, however:

goroutines that don't unblock during the trace will be unaccounted for

the states of all goroutines can't be determined without looking at the entire trace

I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?

Both of those things are good points.

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps I'm missing some use-case that's totally missed. FTR, I fully recognize that we're losing something here in the trace, but I argue the net benefit is worth that cost.

Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)

That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P

My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).

dominikh · 2023-03-06T18:58:36Z

Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).

I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?

At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.

One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.

More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.

You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?

In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.

At least this would give people the choice if they want to tolerate STW for more detailed traces.

However that wouldn't really fit into an M-centric format…

I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?

Probably, yeah.

dominikh · 2024-01-18T18:17:49Z

If anyone has time to spare or thoughts on how we can make this easier and cleaner, please let me know!

I don't have ideas regarding overall design yet. However, since there'll be no way around reading old traces entirely into memory to sort them, I'd like to suggest using the trace parser of Gotraceui. It started out as a copy of go tool trace's parser, but has been optimized to use less memory, put no pressure on the GC, and it uses a smarter way of merging batches that lowers CPU time. It's how Gotraceui is able to load traces in a fraction of the time and memory needed by the old go tool trace.

It's at https://github.com/dominikh/gotraceui/tree/8fbc7cfaeb3cebed8890efbc030a62a7f1ff3f81/trace — since it started as a direct copy of Go's parser, the git log for the folder should show all relevant changes, and includes some benchmarks.

There are some changes that might have to be reverted if we want to support very old traces. My parser dropped support for Go 1.10 and older and it doesn't handle EvFutileWakeup, which haven't been a thing since Nov 2015.

The only change that might not be great is 2c5675443eebc969ee07cd4f2063c2d7476f7b9b which removes support for trace formats older than Go 1.11. That change could be reverted if need be, but traces that old wouldn't benefit from my improvements to batch merging.

Edit: I've nearly completed an implementation of the conversion.

mknyszek · 2024-03-09T00:32:12Z

2024-02-01 Sync

Attendees: @prattmic @felixge @dominikh @nsrip-dd @rhysh @bboreham @mknyszek

Felix: RE: Parsing old traces in the new API. I see that Dominik has implemented it. We've tried it and it seems to work really well.
- Michael K: On my TODO list to review, I will get to it soon. And thank you again to Dominik for working on that!
Felix: RE: Panics in tests cause broken execution traces.
- Got reviews. Thanks!
Felix: RE: Tried PGO, found increase in stack memory usage.
- Have a profiler tool that takes a goroutine profile and a binary that produces a plausible stack frame size profile.
- Worth a proposal as a profile type?
- Michael P: Could be another sample type in the goroutine profile, possibly. Your proposed way of estimating the stack size should be pretty close. It probably omits some wrapper functions that add a few bytes. If we do make this a real profile type, we can put in the real stack size. The runtime has to look at the goroutine anyway.
- Rhys: How common is it for goroutines to idle with a small number of stack frames, but grow their stack pretty large transiently. That could hide stack space with this way of estimating the stack size. RE: expanding the goroutine profile, could we add extra little debug things to execution traces to get this data? For example, when stack growths happen.
- Felix: Lots of good ideas. RE: the concern about goroutines that grow and then shrink again, are you worried about capturing the HWM of stack size?
- Rhys:
- Felix: Wouldn't this be the same issue with a live heap profile?
- Rhys: I think there's a bound on the live heap.
- Michael P: I think there are two types of growing being discussed here. One is how much of the stack is being used, and the other is how much stack space is allocated. Rhys's case suggests that stacks shouldn't shrink too often, since we need a full GC to shrink. I think the case where the runtime directly records the allocated size handles that.
- Michael K: I think Felix's approach would still work well if you take many samples. Eventually you'd catch the spikes. You're going into the problem already knowing your stack memory usage is high.
- Felix: Another reason this would work well in practice is that you only care about it when you have a lot of goroutines, because that's when you have a lot of stack space.
Michael K: Flush generation on panic and accept broken trace tails
- Michael P: In theory if you have a core dump, it shouldn't be too difficult to simulate what traceAdvance does, and finish up the trace. That way you can fully extract the trace out of the core dump.
- Rhys: Does this only work well if the trace is being directed to an os.File, or does it work well in the general case? What happens when the io.Writer goroutine needs to execute?
- Michael P: When a panic starts we freeze the world. RE: what Michael was talking about, we were thinking more about
- Michael K: I was thinking more about crashes as somewhat isolated, so that most of the application could still continue running. Maybe that's wrong.
- Michael P: We could investigate that, but we'd have to not freeze the world.
Michael P: Just merged SetCrashOutput, it lets you specify an additional FD to write out panic string in addition to stderr.
- You could imagine the flight recorder has a similar API that writes out to an os.File, which will be a lot more robust than a general io.Writer.
Michael K: Still need to look into the slight regressions we've seen due to recent diagnostics work. It's only been microbenchmarks, so we expect this to not actually impact real applications. Rolling out at a bigger scale hasn't shown any significant regressions. See https://perf.golang.org/dashboard/?benchmark=regressions&unit=&repository=go&branch=master&days=180&end=2024-02-01T16%3A37 for the benchmark list we have.
Rhys: Did a bit of investigation as well: runtime: ~5% regression in ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8 sec/op at 450ecbe #64455.

mknyszek · 2024-03-09T00:36:00Z

2024-02-15 Sync

Attendees: @nsrip-dd @rhysh @mknyszek @bboreham @felixge @prattmic @dashpole

Sysmon Preemption
- Michael P: Felix’ patch came up when somebody reported a go1.22 regression that might have been caused by preemption. https://go.dev/issue/65647
- Felix: Came up on my end this week as well. I found the preemptions useful to debug OS latency, seeing 500ms goroutine time slices in the execution trace gave me confidence that OS scheduling latency was high.
- Schedviz tool: MP started prototype to combine kernel scheduler traces and Go execution traces, effectively add a CPU dimension to the execution trace.
Michael K: Default metrics set for OpenTelemetry
- Following up on a previous discussion.
- Michael K: Exporting all metrics might be too much. I wrote up a proposal for an API to give a default set of metrics that are widely useful. Michael P convinced me that this should maybe start as documentation rather than an API.
- Michael K: API would take a go language version and give back a set of metrics. If you ask for go1.17, you get the defaults for that version.
  - func Default(toolchain string) ([]Description, error)
- Bryan: There is a prometheus client golang issue (link?). Maybe we (grafana) could dig up information on which go metrics are collected by our customers in the cloud via prometheus. We can scan dashboards and queries for 30d and report on this in aggregate.
- Michael K: That would be useful to have. I came up with a default set in my doc, but I’m not super happy with it. I could share the doc, but feels like it might get deprecated right away.
- Michael K: The default set would have strict compatibility guarantees. Anything in the set would be supported effectively forever. You could unconditionally rely on those metrics to still exist (but they might become 0).
- Rhys: Questions about RSS, VSS, why heap profile is ½ of RSS come up often. Maybe the GC guide could address some of this. I don’t understand why the API to get good metrics needs to be in the core rather than a 3rd party thing.
- Michael K: Yeah, could also go into /x/ as a middle ground.
- Rhys: I don’t know what it means for a metric to be a default. Different apps have different needs. This is also coming up b/c metrics are expensive. They are small relative to heap profiles which many folks collect.
- Michael K: I had the same thoughts – why is this a problem at all.
- Bryan: We charge $8 / month for a thousand time series. I think runtime/metrics has ~100 metrics, that would be 80 cents per month. This is multiplied by the number of processes you run.
- Felix: Usually what's expensive is metrics with tags with high cardinality. runtime/metrics are typically exceptions to this because they're process-global. I think it's cheap-ish, but different people have different opinions.
- Rhys: I agree that putting them as time series in practice gets expensive, but there's a lot of data we can get out of the runtime that we wouldn't turn into a time series. Like heap profiles and execution traces. In practice I'd like to have all the data sitting around.
- David: Our motivation is primarily around stability. Unless it's very very expensive, and it sounds like it's not, that's probably not much of a concern for us. We would like to be able to say "here's 1.0, we won't break you."
- Bryan: In Prometheus there will be 3 options: the 2017 set, all of them, recommended set. Aren't heap profiles just as large? (e.g. you have 100s of stacks and frames, if you have a bunch of heap profiles, isn't that the same)
- Rhys: I'm not sure we're looking at this the same way. For stability, could we have a string translation API?
- David: That seems OK. The library would hide this translation from users. The harder question is: is there a set of metrics that we think will continue to be useful in Go 1.30?
- Rhys: I would expect most runtime/metrics to not change. Recently, the STW metric changed.
- Michael P: I see where you're coming from David. I think to Rhys' point that most metrics don't change. But there are also details about, for example, memory classes. New classes can get added, and some might not make sense as the GC changes.
- Felix: I think the argument of stability resonates with me. I recently built a dashboard that breaks down memory and it could break. I also think I got what Bryan was saying earlier. I think you can see the heap profiles similarly to metrics, but the blob profile has different performance characteristics. Also, they tend to get taken less frequently.
- David: It's also good to have some space for metrics to be introduced that are in development.
- Michael K: Shall we move forward with documenting metrics? Would this work for you David?
- David: What this would look like is we would have a documented set that depends on Go's documented set. This would be fine to unblock us. Going forward, it would be good to have a way to make this more maintainable and detect breakage from the Go side.
- Felix: Last I looked it seemed like the Go metrics support in OTel was incomplete. What's the plan?
- David: We plan to throw away what we have and start from scratch. We have two approaches in progress: one would be to have a stable set, and the other would be to have some configuration so that users that want different metrics or more metrics don't have to wait for us.
- Felix: What's the best way to follow along this stream of work?
- David: There is an issue on the semantic conventions repo: Semantic conventions for go runtime metrics open-telemetry/semantic-conventions#535
- David: There are also WIP discussions/proposals in the Go sig: runtime instrumentation replacement proposal
OpenTelemetry OTEP 239 (pprofextended format proposed as new profiling signal format) - Felix Geisendörfer
- I wanted to call some attention to this and it's in the last call stage. It will get merged soon.
- At this point the spec could still be changed, but the big picture will probably stay the same.
- pprofextended: open-telemetry/opentelemetry-proto-profile@2cf711b...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2
- Michael K: I'll bring it up to the team.
- Michael P: Alexey has been following this, which gives me confidence that there will be some alignment.
- Felix: Yeah, he basically co-designed this with Dmitri from Grafana. But AFAIK he was not the one that could make the call on what could ultimately happen with the pprof project. It's more about whether OTel determines pprof's future.
- Michael P: Yeah, it seems like the question is how they're going to be kept in sync (or not).
- Felix: Divergence would be highly undesirable because great pains were taken to make pprofextended to maintain compatibility.
Felix: Deterministic Testing for Go
- Felix: A colleague approached me with questions on how to make the Go runtime more deterministic for testing.
- Michael K: We have considered adding scheduler hooks or more deterministic scheduling.
- Michael P: runtime: scheduler testing controls #54475
- Rhys: There are some places where the Go project intentionally randomizes.
- Felix: I think my colleague was actually trying to randomize but with a fixed seed.
- Michael K: Race mode already randomizes a whole bunch of things. You could imagine setting the seed for that.
- Felix: A seed that is used for everything would be interesting.
- Michael P: Some of the randomness isn't explicit, like two threads racing for the same work. It needs explicit support to become deterministic.
- Rhys: You could imagine a record-replay type system.
- Michael K: It occurs to me that an execution trace might have the information you need for that already. It's not designed for it, but it might work.
- Michael P: That's interesting, it might work. There will still be some weird things you need to do, for example to make an async preemption land at exactly some instruction.

This change does a lot at once, but it's mostly refactoring. First, it moves most of the profile abstraction out of benchmarks/internal/driver and into a new shared package called diagnostics. It also renames profiles to diagnostics to better capture the breadth of what this mechanism collects. Then, it adds support for turning on diagnostics from configuration files. Next, it adds support for generating additional configurations to capture the overhead of collecting diagnostics, starting with CPU profiling. Lastly, it adds support for the new Trace diagnostic. (This change also fixes a bug in go-build where Linux perf flags weren't being propagated.) In the future, core dumps could easily be folded into this new diagnostics abstraction. For golang/go#57175. Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754

This change does a lot at once, but it's mostly refactoring. First, it moves most of the profile abstraction out of benchmarks/internal/driver and into a new shared package called diagnostics. It also renames profiles to diagnostics to better capture the breadth of what this mechanism collects. Then, it adds support for turning on diagnostics from configuration files. Next, it adds support for generating additional configurations to capture the overhead of collecting diagnostics, starting with CPU profiling. Lastly, it adds support for the new Trace diagnostic. In the future, core dumps could easily be folded into this new diagnostics abstraction. For golang/go#57175. Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754

For golang/go#57175. Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754

mknyszek · 2024-06-20T18:13:19Z

2024-02-29 Sync

Attendees: @mknyszek @prattmic @bboreham @rhysh @nsrip-dd @dashpole Arun (from DataDog)

Deterministic simulations in Go – Arun (from Datadog)
- Arun: Testing distributed systems. We have used formal methods like TLA+ and lightweight simulation like SimPy, but they have problems. The goal is to inject failures and run the runtime in a deterministic way. FoundationDB is an example of applying the determinism concept.
- Arun: The Go runtime doesn't provide determinism. Goroutine scheduling, map iteration order, channel selects, etc.
- Arun: Exploring trying to side-step making Go deterministic. Using Hermit from Facebook. Containerizes a program and intercepts syscalls.
- Michael P: Interesting idea, we (the runtime team) have discussed testing runtime internals with more deterministic scheduling. For you, this is more about in-user applications which seems a bit easier to solve. The Go runtime could run everything on a single thread and pick a specific goroutine execution order. It wouldn't be a small feature in the runtime, but it seems feasible.
- Arun: Could we have a seed that captures the order in which the scheduler decides to do things?
- Michael P: It comes down to the implementation not taking determinism into account. There's a lot of intentional racing and time measurement that is inherent to the scheduler's implementation. At a baseline we'd need only cooperative preemption, and we'd need to enforce that we always preempt at the same time. Limiting to a single thread helps.
- Michael P: Question about Hermit: sounds promising, but how does it deal with the same sort of scheduling issues? Like Linux scheduling determinism? Does it deal with that at all?
- Arun: Not sure, but in testing spinning up multiple goroutines and running them in the same order, it seems like Hermit does actually seem to run them in the same order.
- Michael P: While I say that it feels implementable in Go, a third party project is probably better to invest in. For example, the advantage of also running your C code (in cgo cases) deterministically.
- Arun: cgo doesn't seem to work. Hermit seems to crash on a vfork call in the runtime. Can the Go runtime do anything about C code?
- Michael P: Not really.
- Bryan: I saw the blog post from antithesis, but both that time and today I'm missing something. I get that determinism is useful in trying to track down a problem, but I don't get how taking away that problem is enough to root cause analysis.
- Arun: Determinism solves part of the issue, but test cases are needed as well, for testing distributed systems.
- Michael K: Is the missing link record-replay?
- Bryan: That would do it.
- Michael P: You could fuzz the scheduler order, and use that as a regression test. Have these projects done anything to solve perturbation of deterministic orders due to source changes? That seems like a substantial problem. Does the determinism need to be defined at a higher level?
- Arun: Not aware of anyone who's solved that problem. You can definitely at least capture higher-level operations, like RPCs, and record and replay those.
- Rhys: You mentioned you haven't been able to check the model against real code execution, but I've heard of groups that run a simplified version of their model as assertions. Agree with Michael P that making little changes and determinism is set at too low of a level, that'll throw off regression tests.
- Arun: There's a Rust library that lets you run model checking against your implementation, but I'm not sure if we'll still run into challenges with determinism in Go.
- Michael K: Bringing back up idea from two weeks ago about traces for this.
- Michael P: The main missing feature though is still that you don't know how long to run before yielding, and that's not in the trace.
- Michael K: True, you'd need a function call trace or something. Didn't Jane Street have a thing for function call tracing?
- Michael P: Yeah, it uses Intel PT.
- Michael K: With the disclaimer that I haven't read your doc yet, I think we'd be open to a prototype. We would like to be able to use this ourselves anyway.
- Michael P: Logistically, it would be interesting with respect to the faketime implementation, in terms of making the Go playground more deterministic.
- Michael P: https://go.dev/issue/54475 scheduler testing issue
Bryan B: Topic: profiling maps
- Idea comes from watching Matt Kulukundis' talk on Swiss Tables
- Some portion of every map is randomly chosen for sampling
- Fetch the size, capacity, few other things.
- thepudds: There was a recent issue about map and slice presizing with PGO.
- Nick: The heap profiler today does capture map growth, but somebody made the decision to hide runtime functions at the tail end of the call stack.
- Bryan: It's hard to figure out if the problem is lots of small maps, or a few big maps. Something explicit for maps would take away a lot of the guesswork.
- Felix: It would be interesting to profile channels as well. We've gained significant capabilities to figure out when channels are contended with traces, but we don't really know why we block on the channel, or whether the channel's capacity is an issue. "How full are my channels" would be another interesting profile.
- Michael K: You could imagine that in the execution trace we just dump more information when blocking on channels.
- Felix: It's really useful to know when you're not blocked yet. Or never will be (e.g. when using select to send).
- Rhys: Knowing when you're not blocked in the execution trace would be really interesting. Sometimes you don't get to see that two goroutines don't have an interaction at all.
- Michael K: You could imagine extra trace events that log every channel operation.
thepudds: Escape analysis could be exploited further to drop old slice backing stores.
- Michael K: Definitely something we're thinking about, just reusing memory in general. TODO: Add in the rest.
- thepudds: Getting the runtime and compiler to collaborate on this seems like it would be the most beneficial.
- Michael K: +1.
Michael K: Have not gotten around to creating a proposal for a documented set of recommended metrics, I'll try to get on that.
- David: Reached out to someone on the dashboards team internally to try and figure out what metrics from Prometheus people are actually using. Reached out to other people in the OTel group to get similar information from other companies.
- Felix: We can see what we can get. Customers tend to just use MemStats, so the analysis will be harder for us. We are internally using runtime/metrics though. Did an analysis of MemStats and I can at least tell you what's definitely not being used there.
Michael K: Also an update on more goroutines in traces: I sent a CL out. There's one case left, but it's unfortunate we're playing whack-a-mole. We should perhaps do something more principled.

mknyszek · 2024-06-20T18:14:29Z

2024-03-14 Sync

Attendees: @mknyszek @prattmic @felixge @dashpole @nsrip-dd @rhysh

Nick: Checking out viewcore lately. Wondering if anybody has any experience with something like viewcore but with a UI that we can learn from.
- Felix: Maybe JMC (Java Machine Control) has something?
- Nick: Yeah, it's called "visual VM" which has a ton of stuff but I haven't tried it myself yet.
- Nick: I was able to recreate what my colleague did with root tracking with tracking a heap profile from a core dump.
- Michael P: One thing we wanted to do is to make the internals of viewcore a nice public API. We haven't gotten to it yet. We have to decide what layer that API operates at. RE: extracting a heap profile, is that something the API should be able to express, or .
- Michael K: I think the API might need to know about that. I was thinking that you could also grab a trace out.
- Michael P: You would effectively do traceAdvance from outside of the running program. I don't think it would be too difficult to do.
- Rhys: Seems useful to think of it in terms of "implement an API similar to what the runtime does, for accessing the runtime's internal datastructures".
- Michael K: I'm pretty sure viewcore is still broken for Go 1.22 (sorry…).
- Nick: Yeah, I was using Go 1.20 and it seems to work. I can try to fix it, but no promises.
- Michael K: Let me know if you need any help. I'm pretty sure Michael P tried and the main issue he ran into was that it tries to collect all GC metadata up-front. It would be simpler to type the heap up-front.
- Michael P: It's also really crufty. It would definitely be less crufty if we could delete old codepaths.
- Michael K: Been a bit of a struggle to prioritize viewcore.
- Nick: Maybe what it needs is a strong use-case?
- Felix: Would there be reviewer bandwidth for viewcore?
- Michael K: Yeah definitely. I really should just fix Go 1.22. But the main work here is that someone needs to sit down and think about the API surface.
- Michael P: We were also thinking of landing it in x/exp so it's not final the moment it lands. That way we can gather feedback on the API.
Michael K: Ran into some issues with the tracer trying it out on some big services. The issues seem old.
- Rhys: I've run the execution tracer many many times for 1 second. It usually works. There's still a low rate of not being able to create a good trace (1.20 era). I haven't seen anything that makes me think O(n^2) behavior with the old tracer.
- Felix: We have used the tracer similar to Rhys, 5 MiB or 60 seconds, whichever comes first. We also haven't tested with Go 1.22 much yet.
- Michael K: Also dumping stacks was non-preemptible, that might be new. I have a fix out for that.
- Rhys: Is there an interaction also with CPU profiling?
- Nick: Asynchronous preemption can introduce more unique stacks, too.
- Michael P: One more unique stack case is that Google does a lot of protobufs, so you end up with lots of slight variations.
Michael K: FYI, blog post going out probably today (https://go.dev/cl/570603).
Michael K: Haven't made progress on default metrics, but I should just do that. Does it need a proposal?
- Michael P: Technically not, but we are kinda making stronger guarantees.
- Rhys: Would a test be sufficient?
- Michael K: I think so?
- Rhys: You could imagine viewcore is also able to extract metrics and validate them. That could be the test. You could also look at all the Ms and see if you're in the runtime on any of them, and refuse to proceed if they are.
- Michael P: I was thinking the same thing, specifically with holding locks.
Michael K: So… what's next for diagnostics? Just brainstorming. (We already talked about viewcore, which is one thing.)
- Felix: Known papercuts with stack trace size. Relatively low stack depth specifically. In terms of novel things I still intend to create a proposal for a stack profiler. viewcore and anything to help with memory leaks and OOM kills is going to be very interesting going forward.
- Rhys: I'll agree that OOM kills are an underserved need. I have unfinished business with the execution tracer, but TBA. Perhaps with post-hoc analysis.
- Michael K: On my end, I intend to add new event types to the tracer for runtime work. Another new use-case that came up recently was possibly adding timer-related events.
- Michael P: Perfetto uses SQL queries to do post-hoc analysis.
- Felix: At some point we discussed the fact that different profile types are overlapping with each other. We discussed having an API for choosing what goes into the trace.
- Michael K: Definitely on my mind.
- Rhys:
- Michael P: Post-hoc analyses seem potentially fragile since they depend on implementation details (for example, some net functionality has to actually enter a syscall to see a stack trace in the trace itself).
- Felix: Would people here be interested in trying out something SQL based for execution traces?
- Michael P: It seems like in theory you should be able to do it already by importing JSON traces into Perfetto and running SQL queries, though you lose some information. Rhys: You said it's difficult to decide when a connection was created.
- Rhys: There can be in-bound keep-alive connections and there isn't

mknyszek · 2024-06-20T18:15:57Z

2024-03-28 Sync

Attendees: @felixge @mknyszek @prattmic @nsrip-dd @rhysh @thepudds

Michael K: Default metrics proposal.
- Felix: Being able to discover metrics programmatically is nice, but I'm concerned about guaranteeing metrics exist and return zero.
- Michael K: I see what you're saying. Returning zero is supposed to be a last resort.
- Felix: I think
- thepudds: Zero is well-defined / equivalent to “no data” be for some metrics like throughput or count, but it's annoying downstream to have zeroes show up meaning “no data” for other types of metrics (e.g., an average latency)
- Michael K: That's fair.
- Felix: Something I would like to ensure is that the totals always correctly represent the total of whatever "subdirectory" they represent.
- Rhys: I still don't totally understand this need. Having a single field that's just a boolean seems too simplistic. What about tags? Also concerned about the battles waged over which metrics should be default or not. Things that the Go team says matters a lot in strange ways.
- Michael K: The Default boolean felt incremental enough that it would be OK.
- Bryan: I find the heap goal metric really useful but I noticed it was cut.
- Michael K: Yeah, I might go back on that.
- thepudds: One thing that would go a long way would be to take the godebug list out of the main list, since it tends to clutter it. Being able to just get one metric is useful too, and it's fairly annoying today. Perhaps an example modification could help here. Also, I noticed blog posts are still using runtime.ReadMemStats.
- Michael K: runtime.ReadMemStats still has the advantage of being able to be used for measuring individual allocations, because the runtime metrics don’t flush the needed stats.
- thepudds: maybe a new runtime/metrics example calling runtime.GC and measuring individual allocations? (Calling runtime.GC might be what the testing package does as part of reporting allocations; not sure).
- Michael P: On the topic of godebug metrics and the proposal, I'm not sure
- Michael K: I'd understood the godebug metrics live indefinitely.
- thepudds, Michael P: Nope, 2 years for many GODEBUGs. (Some GODEBUGs might be supported longer or even indefinitely).
- thepudds: I don’t know if there’s been an explicit statement about what happens to a ‘/godebug/…’ metric if the associated GODEBUG ages out.
- Michael K: Summed up the changes. Are we OK with programmatic access?
- Michael P: I think programmatic access is OK, but I worry that the "Default" field is a bit too binary and restrictive.
- Rhys: The MemStats struct kinda was the default, and was simultaneously too big and too small for many years.
- Felix: Another thought on the Default boolean flag: what are the chances of new metrics becoming default? I suspect it would be low, and the main utility of this flag for me would be discovering new metrics.
- Michael K: All good points.
Felix G:
- Increase stack depth for mem/block/mutex to 128 (issue, CL)
  - Michael K: Clever. I'm supportive.
  - Michael P: Also supportive. However, I know that changing the [32]uintptr to a slice does theoretically break compatibility, but maybe nobody is using it.
  - Felix: My concern is using the record as a map key. I went out of my way to make sure that works.
  - Michael P: We can check the open source ecosystem to see if anybody's doing that.
  - Rhys: I'm paranoid about the hidden slice containing different information from the PC array if someone modifies it.
  - Felix: Currently that might get confusing, yeah. An alternative could be to have a "stack remainder" instead of having the hidden slice contain the full stack, since that would solve that problem. Also, the Stack method always returns a copy.
  - Michael K: It's not clear to me that changing the API would help that much, and this wouldn't have to go through the proposal process.
  - Michael P: I don't have a concrete alternative, but the API is also very weird. Maybe not that weird, since the array is already called Stack0.
  - Felix: If we do change our mind if we want to change the array to a slice in the future, we could still do that.
  - Rhys: This might affect CPU profiles, but I see this isn't for CPU profiles.
  - Felix: Yeah, it's not. I think it's OK to generally increase it because the memory use increase only affects those with truncated stacks, who I suspect are willing to pay the cost given how annoying it is to deal with.
  - Rhys: How far are we from being able to set this with GODEBUG?
  - Felix: I haven't tried yet, but I think this is pretty easy to wire through a GODEBUG.
  - Rhys: It couldn't be a constant if it's a GODEBUG.
  - Felix: We can hash it out in the CL.
- Stack memory profiler proposal (issue, CL)
- Reduce pprof label overhead (CL)
  - Felix: This is fun. The idea is to represent the labels as a LabelSet directly internally and to merge LabelSet lists with a merge sort. It turns out to be a lot faster. There are some applications that set this a lot and it costs a lot of CPU.
  - Rhys: I'm a fan of setting them eagerly and often for requests.
  - Felix: The only remaining overhead in terms of allocations is it's not possible to set goroutine labels without creating a new context, which forces an allocation. We'd need a different API. I would first try this CL.
Michael K: Annotating heap profiles further: GC cost proxy (scannable bytes * GCs survived as a counter per stack) and precise type information (as a tag, will split stack samples in the format, probably not a lot unless there's a lot of very dynamic reflection).
- thepudds: Do we have precise type information for all size classes?
- Michael K: Yeah, at the point of allocation. We can write that down.
- Felix: heapage

mknyszek · 2024-06-20T18:17:29Z

2024-04-11 Sync

Attendees: @nsrip-dd @rhysh @bboreham @cagedmantis @dominikh @felixge @mknyszek

Michael K: Updated recommended metrics proposal.
- Please take a look. Might be two proposals in one, and might have to break it out. If anyone has ideas on other possible future tags we might want to add to metrics, please comment on the doc! Thanks.
Rhys: Open-sourced the tracing and profiling tools I've been using for ~6 years.
- autoprof: produces, tags, and zips up profiling data for collecting at scale. Nice for teams that don't have existing continuous profiling infrastructure. Goal is to make profiling less "one-off."
- go-tracing-toolbox: a collection of tools for poking at execution traces. etgrep, which allows for grepping trace data in all sorts of fun ways. Other tools like using CPU profile samples that happen in between two events and build up an idea of what is happening between those two events.
Felix: CL for getting deeper stack traces.
- Backstory: Investigating service using GOMEMLIMIT and we noticed that heap profiles were getting truncated.
- From last time: dealing with public API being limited to 32 frames. Went with Rhys' suggestion to hide the extra frames inside the struct.
- Tried to make the max depth configurable at runtime, had all of it working, but it's probably overkill. What we could do instead is just make it configurable at startup via GODEBUG.
- Dominik: On the note of stacks, my biggest problem with the truncation in the context of tracing is not knowing that truncation happened, or how many levels got truncated.
- Michael K: Yeah, we can always do the full unwinding and add a dummy frame indicating how many stacks were lost.
- Michael K: Might be worth pooling the stack frame buffers.
- Rhys: How to decide when to attach to the M or the P.
- Michael K: Generally we try to prefer Ps because there's a fixed number and usually fewer in number than Ms. But diagnostics stuff is moving to the Ms because we want visibility into what the runtime is doing. Ps are generally about user code executing.
- Felix: Adding a new buffer to PCs to the M. I noticed there was already one there and considered just using it.
- Rhys: Yes, adding a new stack to the mprof hash map involves taking a lock, so I think it's hazardous to use that buffer.
Michael K: Core dumps!
- Would love to get this done but need help. I'll try to get Go 1.22 and Go 1.23 working after the freeze (6 weeks from now).
- Byran: I can help. I've been trying to get an answer to questions I have about a service and what memory is being kept live. Uses a very specific data structure that is staying live long, and my hypothesis is that it's referenced from a slice in a pool. Manually cleaned out elements in pools and that still didn't fix it.
- Rhys: Does viewcore integrate with heap profile data? Such that if you get lucky, it'll tell you the callstack that led to an allocation.
- Nick: Yup, I did this. I can send a patch, but the first order problem is unbreaking viewcore.
- Felix: The service I mentioned earlier, it was really about tracking down live memory.
- Bryan: viewcore makes assumptions that are only correct for x86 (didn't work on macOS). (Frame pointers.)
- Nick: I know that it only has logic for ELF core dumps, so it might be that macOS core dumps just aren't supported.
- Bryan: That seemed to work.
- Nick: I'm happy to help too!

mknyszek · 2024-06-20T18:18:34Z

2024-04-25 Sync

Attendees: @mknyszek @felixge @dashpole @rhysh @prattmic @nsrip-dd

Rhys: Mutex Contention Patches
- Freeze is in 4 weeks.
- Working on runtime mutex contention patches. Struggling with testing. Don’t want the tests to be flaky.
- Michael K: Making the mutex bigger is probably fine, especially if it doesn't bump a size class.
- Rhys: That might make it simpler. Trade-off in code complexity vs. performance costs?
- Michael P: It's subjective.
Felix:
- OpenTelemetry for Go Profiling / Metrics / Traces … ? (gh issue) (quote: “Could we integrate into the new “hot thing”, OpenTelemetry?”)
  - The idea would be something like having the runtime (or runtime/metrics), via an environment variable, export data via OTLP or some vendor-neutral protocol to an endpoint.
  - Michael P: I'm not sure how this very different from a package you import.
  - Felix: A big barrier with adopting observability, specifically adopting across a large number of repositories, users don't want to go and make a change to, for example, a package.
  - Rhys: Could we make importing packages easier as an alternative?
  - Felix: We're looking into that, but lots of organizations are dysfunctional in that way where adding an environment variable is much easier than modifying code across the org. It's a second-best option.
  - Michael P: We can't pull gRPC into the runtime, for example, but marshalling a few protos manually is probably OK.
  - Rhys: Is there something simpler and file-based we could do?
  - Felix: I think that could be a middle-ground, since something else on the host could ship them around.
  - Rhys: Go's profiles are really good for understanding Go programs, and execution traces are unusual, but good for Go programs. OpenTelemetry is applicable widely, but not designed for Go or Go's strengths. That makes the idea of having something pick up the Go-specific files and do the conversion or pass them on seem better.
  - Michael P: Worried a bit about folks being upset about binary size if every Go binary includes net/http. There may be workarounds.
  - Felix: That makes sense. Any standard interface would be nice. Integrating all the things (e.g. runtime/metrics in expvar) and issues with manually passing distributed tracing information through contexts.
  - Michael P: Worked on gVisor. We'd export diagnostic information over a Unix domain socket. Doesn't Java have something similar?
  - Michael K: Java has everything. Maybe you're thinking of JVMTI?
  - Michael P: If your binary is not a server, maybe we don't care? For example, automatically included if you're using net/http.
  - Felix: Sounds like it's not a definite no? If we were to take the ideas and writing up a proposal would it be worth it?
  - Michael K: I'm in support of anything that brings all the diagnostics together and exported in an easy way. Maybe this will go the direction of memory profiles.
  - Michael P: If I recall correctly, memory profiles are disabled in the linker if memory profiles aren't ever referenced.
  - Rhys: What if the API is "do a core dump" or "attach a debugger"? Is there anything there?
  - Michael P: That sounds really fun. The API for "attach a debugger" is "ptrace attach." The debugger can do a core dump. The runtime could do a core dump, and we've discussed it as making it the new heap dump format.
  - Felix: I really want to see more stuff being done with memory dumps, but it might be too high overhead for the use-cases I have in mind.
  - Rhys: The out-of-the-box experience could just be attaching to the process for debugging, but just doing a very simple thing.
  - Michael P: It would be very nice if we had a plan9 style interface where a process can export a new filesystem. Like /proc//go.
  - Rhys: My read on why keeping memory profiling always on is OK is that it's because the memory profile is just metadata on heap memory. But it means that other profiles just aren't used because they're disabled by default because the overhead is unclear and nobody knows what to set the profiling rates to.
  - Felix: I like the idea of turning it on by default. It's definitely hard to decide what a good value is for block and mutex profiles.
  - Michael K: I saw Nick had CLs out to make these profiles cheaper. Maybe cheap enough to be on by default?
- Workaround PGO inline issue by dropping samples? (pr)
  - The workaround is to hack up the pprof profiles passed to PGO to prevent profiling of the problematic frames in the meantime.
  - Michael P commented on the PR.
- Ready for high-level review: Configurable stack depth CLs for heap/block/mutex/goroutine profile.
  - Michael K: I'll take a look.

mknyszek · 2024-06-20T18:19:34Z

2024-05-09 Sync

Attendees: @prattmic @rhysh Jon B (from DataDog) @nsrip-dd @bboreham @cagedmantis

Freeze soon: code reviews?
- Carlos: removing cmd/trace/v2, internal/trace/v2 v2 components. CLs almost ready.
- Nick: FP for block/mutex, ready but needs submit.
- Felix: Deeper stacks CL requests a proposal from Cherry. Follow-up CL sent yesterday may avoid the need.
- Rhys: Fixing runtime mutex contention profile semantics. Is this too subtle for this point in the cycle?
  - MP: I’m not too concerned, but haven’t seen the CL yet.
Jon: Looking to add profiling/tracing without changing the source code.
- See notes from last week.
- Jon: Been playing with Go plugins. If the plugin is available, the net/http, etc may log via the plugin.
- Jon: Trying to bridge the gap with other language platforms.
- MP: Feels like we want a .so provided via environment variable, but Go doesn’t support that well.
- Bryan: https://github.com/grafana/beyla is a eBPF kernel-based approach. Maybe Go should make it easier for these kinds of tools to find the data they need?
- Rhys: In the past, we’ve collected periodic profiles/execution traces. Then do post-hoc analysis of the trace to determine the higher-level events. Perhaps net/http, etc, could add to execution traces?
- Jon: We’d like to combine this with distributed tracing, so we’d like request ID, etc metadata.
- MP: We’ve discussed adding high-level events to trace before. Adding context is a bit tricky, as we need some sort of standard of what to include.
- Rhys: It may be a bit much to automatically include certain headers, etc to traces automatically.
- Jon: OpenTelemetry has standardized HTTP headers for tracing, so that could serve as a baseline of context to include.
- MP: If we want this one automatically (via environment variable), there could also be an option for extra http tracing opt-in.
- Rhys: There is a debug call API.
- Jon: ptrace is slow, is that problematic?
- MP: Debug call is used by Delve. It doesn’t technically need ptrace, perhaps an LD_PRELOAD shared object could call it? I’m not sure what the ABI is?
  - https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=1843;drc=a8ba163695f605c72156c90de97a06b3e40566ff has the ABI/requirements. I don’t think this would be usable from a shared library as it more-or-less depends on goroutines being frozen.
- Rhys: Perhaps a debug call into a common package to register a socket/listener/etc to receive events? Maybe this is no better than an environment variable?

mknyszek · 2024-06-20T18:21:07Z

2024-05-23 Sync

Attendees: @jba Jon B (from DataDog) @prattmic @rhysh @felixge @nsrip-dd @mknyszek Milind (from Uber)

Rhys: Got in an update to runtime-internal lock contention before the freeze, but it's a bit slow, as it turns out.
- Felix: Does that collect stack traces? I mostly see the "missing" indicator from the last release.
- Rhys: It currently can capture stack traces behind a GODEBUG, but the stack trace doesn't really match the mutex profile exactly (kind of halfway between block and mutex profile). The change that I have in is to plumb through the amount of delay the current lock holder is causing is all attributed to the right call stack. That is, making it one kind of profile, which means we can get rid the GODEBUG. However it looks like it made the lock critical section ~100 ns slower and it's causing regressions.
- Michael K: The regression is minimal at 16 cores, but significant at 88 cores.
- Michael P: Did you confirm it was actually locks failing to acquire when they should be acquire because of some bug in the lock vs. legitimately failing to acquire because of a longer critical section.
- Rhys: What I've seen so far is consistent with the latter. I haven't seen data that makes me worried that it'll get slower and slower with bigger machines. What I think is going is that apps have some lock that is often the most popular lock. My idea is that it reaches some threshold and everything falls apart at that point.
Automatic Instrumentation for Go (Jon B)
- JVMs allow users to attach tracing at runtime. OpenTelemetry currently uses eBPF tricks (in a way they're not really meant to be used) to achieve something similar for some programs.
- Built a prototype where net/http will load a Go plugin and start generating trace IDs if an environment variable is set. It just writes to a log. Go plugins probably won't actually work because it's not a stable ABI, but if there's any way we can set a flag and let the Go runtime start writing out data to a UDS, that would be awesome. Don't want to be too prescriptive, but this is the problem we want to solve.
- Jonathan A: Only interested in tracing, not metrics?
- Jon B: There's ways to get metric information out already, but there isn't a great way to get distributed
- Jonathan A: You'd get runtime metrics, but not other kinds of metrics. Like from net/http.
- Jon B: Want to get the trace ID from a header and then propagate it and send it back to whatever is collecting spans.
- Jonathan A: Ah, so it's about leaf services (final span).
- Jon B: Or if it's part of a chain of requests. Each server would have this kind of tracing support in it and each server would report back and the collector would correlate them.
- Jonathan A: What would the server write to the UDS? What is the format?
- Jon B: For now it was just a little custom binary blob.
- Jonathan A: Would it be OK if it was JSON?
- Jon B: It's not a lot of data for each span, so probably. There was some discussion of including OTel in the runtime, but it's probably too big.
- Jonathan A: We wouldn't vendor the OTel library, but we could just follow the spec for serialization.
- Felix: I want to correct something from earlier, we're interested in not only tracing, but also metrics and profiling. We want to get all kinds of stuff out.
- Jonathan A: I was thinking of something similar to expvar but the output format would be OTel JSON or something.
- Rhys: It sounds like what you're describing is a read-only API where the process dumps information out. But you've also been talking about chains of services that need to know who called them. It sounds like it's not entirely read-only. It sounds like there's a need to read a header and apply it to the outbound calls.
- Jonathan A: That's actually really easy. We just put it in a context and make sure outbound requests put it in the headers. We should probably do it.
- Rhys: You're saying that's something we'd be willing to have in the standard library for all requests?
- Jonathan A: Seems like a reasonable thing to me, but I don't speak for the whole team.
- Rhys: I haven't found that there's a clear answer that we should definitely trust the headers we're getting. But I'm also hearing that there's a desire not to have the application owner have any dependencies on packages that would make these explicit choices. Am I understanding that right?
- Jon B: Yes, with an asterisk. If the trace generates a header that we're generating trace information
- Rhys: It seems like there are many components that need to line up for this to work, and it's not clear to me that it can be done without the application owner having a say in it. If the application owner is going to have to import a package anyway, why not make that the entire design?
- Jonathan A: I don't see a need to require an import except if the user wanted to add more.
- Rhys: There are security implications here.
- Jonathan A: The good news is that the user is going to opt in, and every server in the chain has to opt-in.
- Felix: In terms of the opt-in, the ideal state is that it could be set at runtime (e.g. via an env var). With net/http/pprof you have to make a small code change in main().
- Michael K:
- Felix: +1.
- Michael P: Reiterate and +1 Rhys' concern about some of the automatic trace ID propagation through net/http. That may be the most worrisome point. That one does feel potentially problematic from a security perspective. I know we said you have to opt-in, but we're not so much thinking about individual applications opting, but a cloud provider opting everyone in. I don't know if it's really a problem, but it's something we definitely have to think about more.
- Michael K: We have a bunch of different ways of getting data out of Go processes, and there are a bunch of different formats for that and a bit of a discoverability problem. [more.] Maybe it’s time for an issue, for active discussion and planning?
- Jon B: I can do that.
- Michael K: Or a draft design. This is genuinely exciting!
Work on metric for number of “runnable” goroutines (proposal). (Felix)
- Some renewed interest to get a runnable goroutine metric with the idea to do load shedding based on it. It's a little bit more direct than scheduling latencies. Just wanted to call dibs on working on this early in Go 1.24.
- Rhys: I think the idea of this metric for the purpose of load shedding has been on the issue tracker for a long time, and it existed as a patch, but it didn't really work for Google. Totally normal programs can spin up a zillion goroutines that instantly finish.
- Felix: There are internal users who insist that runnable goroutines are the answer. My idea is to just give it to them and let them experiment.
- Rhys: +1 that the scheduling latency histogram is useful.
- Felix: Want to check the temperature on implementing the proposal.
- Michael K: I think it's fine to have the metrics. Go for it!
Milind: Talking of runnable goroutine metric: we have the need to get a metric of blocked goroutine with blocked wall-clock time in pprof. This helps detect deadlocked goroutines.
- Michael K: This has also been on the issue tracker for a while but we haven't come back to it.
- Rhys: +1 for collecting goroutine profiles every few minutes. Being able to connect them back to the process that created them is incredibly useful.
- Milind: So we may not need to put block time in the profiles, and there's a different way to get the same information?
- Rhys: Yeah, that'll show up in goroutine profiles.
- Milind: Those that leak slowly are harder.
- Michael K: In theory if you made goroutine profiles inconsistent, you could drop the STW.
- Rhys: In the design I found that a consistent snapshot is important.
- Nick: There is a way to get a goroutine profile with waiting time. It is one of debug mode levels. Is that good enough?
- Milind: I'll take a look.
Milind: Is this issue I created some time back in the scope of this group's area of work? proposal: cmd/compile,runtime: -race should report the allocation site of the object involved in the data race #67221
- Felix: Yeah, I think so.
- Michael K: +1. I think we could do it optimistically based on whether heap profiling is on. Collecting stacks on each allocation is quite expensive.
- Milind: Yeah, so if someone wanted to catch it with high certainty, they can turn up the sample rate.
- AI(mknyszek): Comment on the issue.

mknyszek · 2024-06-20T18:22:19Z

2024-06-06 Sync

Attendees: @rhysh @felixge @nsrip-dd @bboreham @cagedmantis Milind (from DataDog) @dashpole @aclements

Felix: Any thoughts on shadow stacks? Obviously not the hacky thing I did, but about a future where operating systems provide us access to hardware shadow stacks?
- Felix: Inspired by Bytehound’s profiler’s cache. Caches the walked stack and leaves a trampoline on a return address in the stack to invalidate that cache. Tried it out in Go with frame pointer unwinding and in the best case it’s 8x faster! Even if the stack is constantly changing, it helps for really deep stacks. I don’t intend to really propose this because it’s scary and FP unwinding is already really fast.
- Felix: The hardware is already capable of maintaining shadow stacks. If in the future the OS provided them (there are Linux patches in flight), would we be interested in using them? Maybe FPs would no longer be needed.
- Austin: I think it all depends on cost. In particular, if it makes it expensive to switch goroutines, then I can’t see it being an overall win. But if not, then sure I think we’d be open to it.
Nick: coroutines in execution traces
- Nick: One of the biggest changes for execution traces is that we now have coroutines. We’re looking at how to represent these in the trace UI. It looks like it’s implemented as a lightweight goroutine switch. In our current UI, we have lanes for each G. With coroutines, I think this will look like a rapid ping-pong between goroutines. Has anyone thought about whether that’s good or if there are better ways to represent this?
- Felix: I think I brought this up in the previous meeting. My concern was more on the data volume size. If this becomes popular, it could bloat traces. In terms of visualizing, if we chose not to record this data for size reasons, there’s nothing to visualize.
- Austin: We’re hoping that this doesn’t get used all over the place.
- Felix: Are we leaning toward not using iterator patterns all over std?
- Nick: I think the idea is to use the push pattern, which doesn’t involve coroutines.
- Austin: Right. We do plan to add iterators to many places in std, but those just use yield functions, not coroutines.
- Felix: I think it’s probably okay to just represent these as goroutine switches.
- Nick: If the expectation is that these won’t be common, then maybe we just leave it alone unless people complain.
- Austin: I will note that these coroutines are first-class, so most of the time it’ll be a ping-pong between two goroutines, but you can pass them to other goroutines and then things get really weird.
Bryan: pprof labels for memory profiles? go/issues/23458
- Bryan: A colleague asked about this. Is there any recent thinking? If you did have “tell me all the memory allocations in the last 30 seconds”, you could do something like CPU profiles.
- Rhys: I would like to see memory profile sample events in the execution trace. You can kind of see which goroutines are allocating memory from the “heap in use” change events. If we had a trace event for when an allocation is profiled, that would be really useful.
- Austin: Ah, then you could construct a labeled memory profile from a trace.
- Bryan: I guess I am proposing memory delta profiles.
- Felix: I would love to have that option in the execution trace, as long as its an option. I’m worried about data volume. At some point we need to think about an API for controlling what goes into a trace.
- Nick: I think we can trade one for the other. I think we might get more events from the current “heap in use” events than you would from memory profile events.
- Rhys: Heap in use also goes down, so I don’t know if we could do a direct trade. At the current default memory profile rate, you wouldn’t get many events, but if you set that to 1 you’ll be sad.
- Rhys: A few years ago Austin talked about log-based profiling with offline aggregation. Is the execution tracer the way to do that, with some API to filter?
- Austin: I think that would be a good way to do that. We’ve already gone down that path a bit! I don’t think we can change the pprof format, though.
- Rhys: A notion of time in profiles is amazing. See the Firefox profiler.
- Felix: OpenTelemetry has had a lot of discussions about timestamps in their profile format.
- Austin: Also it’s not that hard to write a perf.data file!
- Austin: Back to memory profile labels. I think we could have labels on the “inuse” profile, but not the “alloc” profile.
- Bryan: We’re attaching trace IDs to allocations. I think they’d have to age out.
- Austin: The other thing is memory delta profiles. I think the execution trace approach is a good way to do that. Or we could add an API specifically for a delta profile. “Allocations between point A and point B.”
- Felix: To me goroutines can be grouped together by what initial function they execute in. You might be able to infer heap labels from labels in a CPU.
- Bryan: The pathological case is you have something allocating a 1 MiB buffer. It won’t spend much time doing that, versus something allocating lots of 8 byte objects.
- Felix: What use case are you trying to solve? OOMs?
- Bryan: I’m interested in the things allocating and freeing a lot of memory, which is driving GC.
- Rhys: I think execution traces are really good. If you have a profile you’re trying to solve, you’re going to want other data from the execution trace to pin that down once you know the general area that’s in. Also, if tracing is too high overhead, but this is happening a lot, you can sample when you’re tracing.
- Bryan: We have a distributed trace with head sampling and we can attribute CPU profiles to the trace, but not memory profiles.
- Austin: And you’re using labels on the CPU profile to correlate it with the distributed trace spans?
- Bryan: Yes.
- Austin: And an “inuse” profile doesn’t make much sense for attributing to a distributed trace. You really want a memory delta profile with labels so you can say “this span allocated stuff.”
- Bryan: Yes.
- Rhys: Maybe if we can really trim down the events recorded by the execution tracer, the overhead would be low enough that you could use it for this.
- Felix: +1 to being able to collect execution traces with just CPU samples. You get timestamps and you can afford to send the data somewhere. I would love to just do that.
- Rhys: And then you also don’t have the mysterious 200ms in stopping the CPU profile before you can start a new one. runtime/pprof: collecting a CPU profile takes ~200ms of wall time #63043
- Bryan: So memory allocation events would have to be added to the execution trace, and there would have to be a filtering API. It seems like nothing else we talked about really solved the problem.
- Austin: We want that filtering API anyway. That’s the hard part but very desirable.
- Felix: Another complication is that we currently don’t include labels in the execution trace.
- Felix: For the next release, if there’s interest in working on a filtering API, we’d be interested in proposing something.
- Austin: +1.
Rhys: clocks .. why do some platforms have cputicks==nanotime, is there a reason we don’t (or should start to) use cntvct_el0 on arm64 (relates to runtime-internal mutex contention profile, and perhaps also the execution tracer)
- Rhys: I assume cputicks is supposed to be really fast. Is there a reason we haven’t used the instruction on ARM?
- Austin: How tightly are the clocks synchronized between CPUs? And are they constant rate or do they fluctuate with the clock rate?
Milind: is it possible to get the full ancestry of goroutines in data race reports?
- Austin: Those stacks are tracked and printed by TSAN itself, so I think this would require suppose on the TSAN side. It unfortunately doesn’t just call back into the runtime for these.
- Rhys: If you set gotracebackancestors to a large value, you might be able to correlate the output of the race detector with the Go traceback.

mknyszek · 2024-06-20T18:23:17Z

2024-06-20 Sync

Attendees: @rhysh @bboreham @cagedmantis @dominikh @mknyszek @prattmic

Rhys: is a mutex contention profile really the best way to find problematic contention (especially for runtime-internal locks), or is there a way we could measure the amount of time that a particular lock is held, to provide an earlier warning? (relates to scalability of apps where a single runtime-internal lock is the bottleneck, either sched.lock or in a channel)
- Runtime-internal lock contention is not really a problem until it's a terrible problem. We had a problem with sync.Mutex in Go 1.9 (queuing slow path) and Go 1.13 (priority boost when unlocking).
- Looking for prior art on identifying locks with heavy load before contention.
- Michael K: Worried about whether such a metric would be ignored by default.
- Rhys: But for runtime-internal contention, this could be really useful for maintainers.
- Michael K: Interesting. Is this apparent in CPU profiles?
- Bryan: Had fun with a read-write lock. One pending writer holds up all future pending readers and people don't know this. There are a few PRs in Prometheus related to this. It would be great in these situations to know better when these things are happening.
- Michael K: Maybe this is extractable from a trace? Maybe we already have this (or a weaker version of it) in the trace viewer?
- Michael P: You could imagine a mutex viewer in the trace tool that indexes by mutex.
- Rhys: We don't have the address of the sync.Mutex in the execution trace, and we also don't see mutex operations that didn't cause blocking/unblocking.
- Rhys: There's an interesting instrumentation technique from a paper Milind shared that could be used to annotate CPU profile samples. Could be an easy way to get data on what's going on in critical sections for runtime-internal locks.
- Bryan: Observation: traces are not a tool we're reaching for, and historically it was expensive and not that great. A different but related thing is a tool that would warn you that you used an RWMutex when you should have used a Mutex.
- Michael K: RE: RWMutex, maybe we should put a warning in the docs. This is definitely not the first time someone was burned by it.
- Rhys: What information do you want in a bug report?
- Michael K: It really depends. Maybe we need a flow chart?
- Michael P: RE: CPU profiling to estimate time in critical sections. Could the lock code itself sample the duration of critical sections even in uncontended sections? The downside is having to make the uncontended fast path slower. It would give us numbers that we could be more confident in.
- Rhys: Should we put mp.locks in the CPU profile sample? And also, could we get a measure of lock load before they get contended. And yeah, we also have to be careful about adding anything to the critical section.
- Rhys: What other tools would be helpful? For example, would better bpf support (tell a large swathe of users to run a bpf program to get some data) help in making decisions? This is in the same vein as a Go build cache experiment to understand how people were actually interacting with the cache. Not perfect, but it's still useful data.
- Michael K: Runtime diagnostics face a bigger hurdle than build diagnostics because production environments tend to be hard to get data out for.
- Rhys: Continuous profiling/tracing helps. For example, getting a page trace out is hard but getting a trace with certain fields set is much easier because of policy.
- Michael K: The page trace is now replaced with some "experimental events" that you can parse out of a trace: https://pkg.go.dev/golang.org/x/exp/trace#ExperimentalBatch.
- Michael P: https://go.dev/cl/588696
- Dominik: Do the alloc/free experimental events have stack traces?
- Michael K: Not at the moment because I was worried about overhead, but you could imagine stealing the heap profiling stack when we're taking it anyway as a low-cost way.
- Dominik: One user just said they'd be happy with opt-in stacks that have a 90% slowdown, for the sake of good data. It's useful in dev environments, at least.
Michael K: Related to adding mp.locks to CPU profile samples, I personally would really like to see # of GCs survived on heap profile samples.
- Rhys: Are you thinking something like being able to see a histogram for a heap profile sample of how long it survived?
- Michael K: Yeah.

mknyszek · 2024-08-05T20:36:01Z

2024-07-18 Sync

Attendees: @mknyszek @felixge @rhysh @nsrip-dd @bboreham @cagedmantis

Felix: New Trace Viewer
- Working on a nicer trace viewer with minimal JS dependencies to maybe hopefully make it upstream. Anything you all would want out of this?
- Michael K: Awesome! No concurrent plans on our side.
- Rhys: Grouping related goroutines or a subset of them. Also composability of features and deep-linking of UI elements back to trace data.
- Michael K: Would like the trace viewer to be more all-encompassing (so features are more discoverable) and for the trace viewer to scale to arbitrarily sized traces with help from the server.
- Rhys: Could the experience be good even if the execution trace is not on the same machine?
- Felix: Do you mean live streaming of the trace?
- Rhys: I just mean the file being remotely accessed and relatively quick to load. So that you could build a web service to view execution traces. Also to clarify viewing related goroutines, I mean interactions with other tools (so, controlling what's viewed on screen via external tooling).
- Felix: Were you imagining a view that's goroutine-centric as opposed to the default proc-centric view?
- Rhys: Ideally I'd have a view that's centered on the right goroutines. Probably can't be done in the general sense, but fine for investigating specific issues.
- Michael K: I wonder if there's a good high-level data model for visualizing trace data (something higher level than what the parser spits out).
- Felix: Perfetto uses SQL, and I played around with that a bit, but it was pretty hard to make work nicely.
- Rhys: I recently built a tool that extracts state machines from individual goroutines, so if you have an HTTP/1.1 client then the execution trace data + this tool can produce that into a graph SVG with a call stack at each node. A step toward a call graph that spans goroutines describing what it takes to serve an RPC. 
- Bryan: The arrows just represent the next thing that happened regardless of how long it took.
- Rhys: Right, there's no concept of time. The thickness of the line describes frequency of the event, however.
- Michael K: Really cool. Could put a distribution of time spent in each state on each box or something.
- Felix: +1, really cool. This is always the same goroutine right?
- Rhys: This is a couple thousand of the same goroutine. But it's also hard to say "same goroutine" because it's hard to name a goroutine. Sometimes it's easy (where did it start) but with things like errgroup and goroutine pools, it's much harder.
- Felix: I did something similar trying to map communication across goroutines.
- Nick: I was confused by some of these state transitions. If a goroutine is runnable, but shouldn't it always be running after?
- Rhys: The transition from runnable to waiting was running and the execution trace tells us how long, but we don't get any call stacks, so we don't know exactly what it's doing. Also in the tool if you mouse over any arrow it'll give you an example goroutine ID and a timestamp. I also did a whole bunch of filtering out.
- Bryan: RE: finding/naming goroutines, there are goroutine (pprof) labels, and then there was proposal: runtime: permit setting goroutine name to appear in stack dumps #35178. JetBrains also came up with a workaround.
- Felix: Why is Running sometimes there?
- Rhys: Unblock events.
- Michael K: You could put in dummy events to retain the original state machine.
- Rhys: Yeah, we could do that, but it was hard to name them for dot. RE: naming goroutines, names can be situational
Michael K: How did GopherCon EU go re: execution tracing?
- Felix: I seemed to have gotten more people excited about execution traces, so success!
- Felix: Talked with @jba about runtime observability out of the box.

mknyszek · 2024-08-05T20:37:50Z

2024-08-01 Sync

Attendees: @mknyszek @prattmic @bboreham @felixge @nsrip-dd @rhysh Milind (from Uber)

Felix: Trace Viewer Demo.
- Super fast, renders a ton of goroutines quickly.
- Immediately get data when you open the trace – streaming.
- Felix: It would be nice to have a public API to skip over the trace quickly.
- Michael K: Agreed. You can do it, it's not that much code, but it is annoying to have to write it out and depend on the wire format.
- Michael P: +1, nice to have the public API for skipping batches for things like trimming (simple, high-level operations).
- Felix: Working on streaming a trace. Basically, you connect the tool to the online trace
- Michael K: Did you run into any issues with CPU usage in the x/exp/trace parser? I saw 25 MB/s, which isn’t as fast as I’d like.
- Felix: Right now it seems good. 10 MB trace in 300 ms in the Go program, plus 300 ms more in the browser.
- Michael K: There are probably low-hanging fruit. Lots of places where maps were convenient, but maybe we could do better with specific datastructures.
- Felix: As long as the new tool is 10x faster than what we have today, it’s a win.
- Rhys: Really cool! When can we use it? :) Going back to two weeks ago, can we configure the goroutine sorting functionality you depicted? For example, custom clustering.
- Felix: Short answer yes, long answer is that I tried to come up with a query language but quickly realized that was a rabbit hole. But what would be really easy is to just let someone hook into the JavaScript objects containing all the goroutine data before presenting it.
- Michael P: Perfetto exposes a SQL query box on their page which is horrible but that's how they do it at least. A more serious suggestion that might work well with Go users: the tool is importable (app.Main(...)) where you can configure with custom sorting on the server side in Go.
- Felix: Not opposed, but I prefer to do a lot of the heavy lifting on the browser side because serialization and sending data around is expensive. I was thinking about dynamically loading parts of the trace under memory constraints, but maybe it won't be necessary – I think I can get the JavaScript object encoding down even smaller.
- Michael P: I was suggesting a JS text box, not actually SQL
- Rhys: Could just the sorting be pluggable? And the grouping.
- Felix: Already have some ideas for grouping. Open to contributions.
- Felix: What dependencies might be allowed if this is to become part of the Go project? It’s lightweight today, a couple MIT/Apache2.0 libraries.
- Michael K: There’s potential. We have several other JS dependencies. It would need to go through proposal review.
- Felix: Currently I don't really need a framework (just use canvas) but for more traditional UI elements and tracking their state, a framework would streamline things a lot.
- Michael K: May want to reconsider the rest of the UI as well.
- Felix: Yeah, I'd want to also drop users directly into something useful instead of asking them to make a choice up-front. We'd probably want to keep some things, like the MMU views. We'd also want some new stuff, like a flame chart view on a specific goroutine. Or like Rhys' graphviz visualization of a goroutine's "state machine".
- Rhys: Some stuff was already broken in cmd/trace like the pprof. How do we know if it's being used?
- Michael K: Telemetry (the new opt-in kind)? We already have it for invocation counts.
- Felix: When tools are broken, it's hard for inexperienced users to know that there's something wrong, rather than they just did something wrong. Absence of bug reports != absence of usage. These tools also tend to be considered lower priority, and so get fewer bug reports and also less attention.
- Michael K: RE: trust in tools, a good testing strategy would go a long way. cmd/trace never had a great testing story.
- Michael P: You can check telemetry.go.dev for what's configured. Non-compiler non-go-command tools aren't currently collected in the default configuration. All the tools are instrumented, though.
  - https://telemetry.go.dev/config
- Felix: RE: testing, I've thought about it while writing this, but my initial approach has been not to test anything yet. UI testing is hard. Testing might freeze the implementation and inhibit exploration. We may even need to burn it to the ground and build it back up. Already done so 3 times following some dead ends.
- Michael K: +1, it's more of a "this would be a big win" when it comes to the proposal to replace the existing cmd/trace. Definitely a long-term concern.
Michael P: Public gocore / gobinary packages
- https://go.dev/issue/57447 Proposal from a while ago
- Mostly a heads-up.
- There are two parts to it: analyzing binaries, and analyzing core files (with the binary). We'll probably start on analyzing binaries since that's making things harder for other people. For example, govulncheck manages a fork of debug/gosym. Delve has a fork of govulncheck's fork.
- Michael K: Taking a different approach, will probably try to come up with a new API. Getting rid of the old cruft may also make work happen faster.
- Michael P: We also plan to do things in pieces, and not necessarily try to design the entire API at once.

mknyszek · 2024-08-15T17:49:47Z

2024-08-15 Sync

Attendees: @mknyszek @rhysh @felixge @nsrip-dd @cagedmantis

Rhys: I’ve been thinking about runtime-internal locks, and wondering if we should consider having a different implementation we use for important singletons like “sched.lock” or “mheap_.lock” than the one we use for, say, the zillions of chan values an app might have. I expect we’ll want to collect data to help make that decision. The question for this group is: Where/how should the instrumentation to collect that sort of data exist? I’m thinking of something that measures the length of critical sections and reports a sort of histogram of that via the mutex profile, or the high-water mark of the number of locks held by the critical section (once m.locks returns to 0). Leave it in a CL that gets updates for a couple releases and then goes to rot? Or something like a GOEXPERIMENT or build tag that’s usually compiled out completely, but is more ready to use? Or, send code first and we’ll discuss it case by case :)
- Michael K: I think this could work as any of GODEBUG, GOEXPERIMENT, or a build tag. Feel free to just send the code and we can hash it out. Sounds useful, I've definitely written one-off things in the past to dump critical section time histograms.
Felix: Small updates on trace viewer and chats with Lalit Maganti (Google) about Perfetto.
- Made backwards progress; taking everything apart and putting it back together. Data transfer between Go and the browser was slower than it should be by an order of magnitude. I was batching JSON-marshalled messages, but now I'm using a streamed varint-encoded (but higher level) trace format. Zooming out for big traces also has framerate issues, so I'm also working on making summaries.
- Got an email from Lalit Maganti who is interested in trying to spend some time to drive adoption of Perfetto in open source projects, and is interested in Go. I asked if there is a way to stream the data in and it turns out no. There's also no stack trace support and Perfetto has no concept of associating a time slice with a stack trace. You can kind of align it with a separate track, but it's not first-class.
- Michael K: Thanks!
- Rhys: I had some practical experience with what I feel is a real need for rendering a remote trace. I had a 600 MiB trace and cmd/trace needed 51 GiB of memory to work. Launched the server on a separate machine but viewed the trace remotely.
- Felix: I intend to support this use-case. But is cmd/trace's approach good enough, or are you looking for something different?
- Rhys: It worked, so I think so! I was trying to find GCs and it was really annoying to try to find them.
- Michael K: It might be straightforward now, thanks to Felix's refactoring of the trace viewer API, it's much easier to add new views. We can have a simple GC-only view in the existing go tool trace.
- Felix: Here's a sketch of how this would look like for this use-case. My vision for the trace viewer is that you should be able to look at a high level summary and smoothly look across the trace.
- Rhys: I have a tool kind of like go tool trace -d but allows filtering. Would this be useful to send upstream?
- Michael K: We'd be happy to take that contribution. Just file an issue to get others aware and to state intent. I don't think it needs a proposal, the trace CLI is (IIRC) not covered by the compatibility guarantee.
- Felix: +1 for Rhys' tool. I usually use awk but it's cumbersome.
Felix: Nick wrote a blog post: https://nsrip.com/posts/oneinstruction.html
Nick: Clarifying comments on the flight recorder issue (proposal: runtime/trace: flight recording #63185 (comment))
- Michael K: See proposal: runtime/trace: flight recording #63185 (I described this in the meeting).
- Rhys: Maybe call it a Limit and don't specify upper or lower? :P I think this could benefit from worked-out examples. Russ also said that
- Michael K: This is making me think the config should just always be global. It would be way too complex to have different subscribers have different configurations. I will sketch out a different API.
- Felix: +1 to both worked out examples and a global configuration. I'll note that this is consistent with how profiles work, too.

mknyszek · 2024-09-26T17:06:21Z

2024-08-29 Sync

Attendees: @prattmic @bboreham @felixge @nsrip-dd

Michael K: FYI I have a fix for gocore for Go 1.22+. It was easier than expected.
MP: Runtime lock contention on channels
- Fastly issue discussed at GopherconUK: https://exactly-right-airedale.edgecompute.app/Go%20Channels%20Slow%20Down.pdf
Felix: trace viewer updates
- Demo!
- Go program parsing is the bottleneck (1.3 GiB trace: 42s to load the first time, 20s second time).
- Generations could be parsed concurrently?
  - Parsing is papering over some differences. e.g., generations parsed independently start goroutines in undetermined state. The parser fixes this up.
Felix: Frame pointer cleanups
- Lots of different representations of stack traces (logical, physical), expanded or not. Felix would like to clean this up for 1.24.
Nick: Looking at getting rid of the 100ms sleep in the profile reader.
- Mostly working for standard reader. Trace reader still has difficulties with doing the wakeups at the right point.

mknyszek · 2024-09-26T17:08:36Z

2024-09-12 Sync

Attendees: @mknyszek @prattmic @felixge @nsrip-dd @chabbimilind

Michael K: WDYT about slightly expanding the scope of this meeting to include all runtime topics?
- Most other runtime topics are closely related to diagnostics, for example performance.
- The regular attendees seem like a good group for such discussions.
- Everyone in attendance seems supportive, we'll discuss next time and try to get more feedback on the topic from more attendees.
Michael K: See my latest comment on the flight recording proposal.
- TL;DR: I think we should stick with the original proposal, but decouple the internal rate of traceAdvance calls from the amount of trace data kept around for flight recording.
- If we want to expose the internal rate of traceAdvance calls, let's do so in a separate proposal.
- Felix: WriteTo calls traceAdvance.
- Michael K: That's true, Cherry and I were thinking
- Felix: What is the use case for instantiating multiple flight recordings?
- Michael K: Mostly about consistency across different diagnostic APIs. e.g., CPU profiling
- Michael P: How does the global and flight recording
- Felix: I think the rates make more sense to configure globally. We probably don't want to have different instances. Same for CPU profilers. Consuming the data should allow multiple subscribers, but if resources can be shared (flight recording buffers) that would be ideal. It would be neat if we could have packages realize there is a clash in settings, but I don't know how to do that.
- Milind: RE: concurrent entities trying to profile, we have services for continuous profiling (e.g. pyroscope) but we also have a use-case where users of the services want to take a one-shot profile. When the continuous profiler is running, these other tools don't get anything. But these one-shot profiles are higher-priority. Could it be a queue of subscribers? Having multiple subscribers probably solves that problem.
- Felix: That's exactly what I had in mind to support.
- Michael P: This feels like an important and common case. For CPU profiling all you can control is the rate, and AFAIK nobody modifies the rate. It gets complicated when there are other settings.
- Milind: With concurrent profile collectors, each profiler would probably want its own buffer to avoid seeing old samples.
- Michael P: We would have to direct samples to multiple buffers. It should be possible, but there may be some scalability concerns. 1000 profiler consumers would be problematic. 2 is fine.
- Milind: Is stack merging happening in the signal handling or is it done in a separate goroutine?
- Several people: Discussing some implementation details on how this would work. Basically, the profiler goroutine should split out and duplicate the samples, not the signal handler.
- Felix: RE: the central/global configuration issue, the semantics we want are kind of like RWMutex. Many readers, but only one "writer" that controls the settings. Only one owner can get control of the tracer/profiler/etc, but anyone can subscribe.
- Michael K: This can be done with the tracer as well.
- Michael P: It depends a bit on whether the consumer cares what the settings are.
- Michael K: Maybe the "writer" queues up, as Milind suggested earlier?
- Michael P: In perf, with specific events, you can say sampled is OK, and you might get a warning. If you want it to be enabled all the time, you'll actually get an error if perf cannot arrange that.
- Milind: In Intel PMU events you can say "no skid events" and it works similarly.
- Michael P: Most recent proposal: https://go.dev/issue/53286. It paused partially because figuring out new C-created threads is complicated.
- Milind: This is one I filed earlier on: https://go.dev/issue/36821.
- Michael P: perf just scrapes /proc/* racily.
- Milind: I think there's a way to do that not racily with a two-pass algorithm.
Nick: Mutex/block profiler frame pointer troubles. Recent issues for example: runtime/pprof: block and mutex profile stacks sometimes have "gowrap" root frames #69294 https://github.com/golang/go/issues/69335
- Trying to introduce frame pointer unwinding for mutex/block profiling caused a bunch of bugs, mainly that it doesn't quite work exactly like the old profiling. Adding lots of kludges and special cases to make it work.
- Michael K: I think we should keep it, if you've got time to keep fixing bugs. Diagnostics are hard – we're not going to get as good coverage around them as we do for other things. We (unfortunately) have to ship it to find out.
- Michael P: From a maintainer's perspective, reverting changes also comes with a risk.

mknyszek · 2024-09-26T17:09:50Z

2024-09-26 Sync

Attendees: @mknyszek @prattmic @cagedmantis @nsrip-dd @rhysh @thepudds @bboreham @chabbimilind

Michael P: Heap profiler memory usage increase in Go 1.23 (https://go.dev/issue/69590)
- Thinking about potential mitigations and couldn't decide why this wasn't already an issue. It would already be an issue if you had 32 random stack frames and generated all permutations of those stack frames. What seems to explain why this is not a problem now is that most of the time the bottom frames are identical.
- Nick: Agreed. In theory this is a problem today even before the Go 1.23 changes.
- Michael P: This makes it seem less concerning to me then.
- Michael K: Should we do more about deduplicating? We only deduplicate whole stacks.
- Michael P: Thinking about bounding the maximum size by dropping samples. That means the profiling rate would basically change.
- Milind: Could use reservoir sampling.
- Michael P: Right, that's what I was thinking of.
- Rhys: One thing that makes this tricky is that it's not even the heap profile that's growing necessarily but the off-heap memory the runtime uses.
- Nick: The way we found out about this is someone reported a memory leak. We're working on getting memory stats in front of people for exactly this reason. We also report delta profiles, and we wouldn't notice in our own metrics because the delta is basically zero most of the time. If we had a built-in delta that would help us.
- Rhys: Could we have a dummy frame for off-heap memory in the heap profile?
- Michael K: We could actually take stack traces from sysAlloc, etc.
- Rhys: Should we have log-based profiles internally and then turn that into whatever we want? (RE: delta heap profiles.)
- Michael P: So it's not really a public API, but the way CPU profiles work is effectively that. They're events that get processed and grouped into buckets.
- Rhys: Block and mutex profiles also accumulate forever, but you have to ask for them. Teams haven't turned them on and off. But, execution traces let us get similar information and that is possible to enable for short bursts.
- Michael K: The memory profiling code is very old and if there's an opportunity to just eliminate the problem that started this discussion, that would be ideal.
- Michael P: What kind of deduplication do you have in mind? The random stacks case is highly incompressible.
- Michael K: Nothing too specific, possibly deduplicating smaller sections of stacks, but you have to be careful about where you set boundaries. I wonder where the rest of the world has gone on this in the meantime.
- Rhys: Many of the records are not "in use," as in not part of the in-use profile. While they correspond to an “in use” memory allocation, they represent 512kiB of allocations (on average, by default), so we can afford to use a small fraction of that for this metadata. But when the memory is freed, the metadata in the “allocs” profile no longer is a small fraction of the memory, since the “live” part of the allocation is now 0 bytes.
- Michael P: What if we had a “reset” mode on WriteHeapProfile that reset all allocation counts to zero? Then it could throw away records from anything not in use anymore.
- Rhys: Resetting the global state makes me nervous because it's global state. Changing the heap profile rate once the program has really gotten started is not something I've seen anyone do. We've had similar discussions about the flight recorder and execution traces. It's fine if only one part of the application cares about the data, but without complete coordination, there's always a concern.
- Michael K: It's interesting that this discussion applies to heap profiles too. The current heap profile API also has some really annoying implementation restrictions in malloc. I also found a bug (not that anyone would run into it in practice).
- Rhys: One way to get programs to coordinate could be to set the baseline heap profile sample rate to zero. CPU profiles and traces are somewhat easier to coordinate.
- Michael P: It's early for Friday afternoon bad ideas, but the compiler could rewrite writes to MemProfileRate to function calls.
- Rhys: Is there a way to make sure MemProfileRate is assigned to its own page and forces a fault on access.
Michael K: Continue meeting scope discussion.
- Seems all are in favor.
- Rhys: If we're still not sure, we can still give priority to diagnostics topics.
Rhys: If the above is approved, I’d like to discuss strategy around landing mutex changes
- TLA+ explained the bug I fixed in PS 10 of https://go.dev/cl/601597
- Using Xchg8 requires per-arch support. Or, we could use Xchg (Xchg32) and support all 64-bit architectures right away
  - Michael K: Supporting just what's needed to start is fine.
- Is using futex a good idea? The semaphore-based implementation gives a linked list of waiters, which is very convenient!
  - All: shrug Maybe worth an experiment?
  - Michael P: Might be worth investigating the futex API which lets you pick which thread to wake.

mknyszek · 2025-01-16T15:55:07Z

2024-10-10 Sync

Attendees: Michael K, Tim, Michael P, Bryan, Rhys, thepudds, Nick, Felix, Carlos, Milind

Michael P: Thank you to Nick for digging really deeply into that frame pointer issue!
- Nick: Some custom assembly code clobbered the frame pointer and PGO happened to inline the thing that called it and the clobbered frame pointer lived long enough to have frame pointer unwinding visit the bad pointer.
- Rhys: Followed the issue enough to see that it stemmed from machine-generated assembly, but generated from a tool outside the Go ecosystem. It seemed to have different ideas of which registers were available for general use. Avo seems to not have this issue.
- Nick: Found another unrelated crash on arm64 that doesn't know we save the frame pointer under the frame.
- Michael P: We get lucky with the external assembly generators with ABI0. I like Nick's idea of trying to improve vet to catch this.
- Felix: Nick and I were also discussing the differences in stack layout between amd64 and arm64. We were wondering what the reason for that was. Is it true that we can't change it because it's baked into ABI0?
- Michael K: I think so. I'll ask Cherry.
Felix: Go GC Regions draft design follow-up
- Michael K:
- Felix: As a summary of why we are interested in cross-goroutine regions is that it would fit very well in our model of sending data through a pipeline: the region could track the data passed through the pipeline. In theory we could rearchitect our code to use regions as-is, and they're probably better than arenas, so it still might make sense.
- Rhys: Wanted to confirm my understanding that you get the benefit of regions when you close the region.
- Michael K & others: Discussion about the state of the design draft and how to move forward.
- thepudds: If there was something like 100 performance sensitive applications, how many of them would the technique apply to?
- Michael K:
- Rhys: Re “server that processes 100 or 1000 requests per GC cycle”, do Memory Regions pay off when used with html/template or encoding/json, especially for simple/unoptimized apps? If this is an easier way of getting the low-hanging fruit (applies well for unoptimized apps, doesn’t do much for highly-optimized apps), is that a win?
Felix: Stability of -toolexec for DataDog/orchestrion and similar projects. E.g. alibaba/opentelemetry-go-auto-instrumentation and pijng/prep.
- We are very happy with toolexec for instrumentation, and now it's out there and propagating. At what point does it become a compatibility issue? We wanted to raise the issue early.
- Michael P: One of the problems of toolexec is that the Go module cache doesn't integrate with it, so you often have to tell it to rebuild everything. That feels like something that we should solve that if we want people to actually use toolexec. toolexec was originally intended for developers of Go.
- Felix: On our end, we encountered this problem for sure. I think if we could get buy-in on the problem statement of instrumenting code at compile time, we could create a proposal for something better. And also argue why other approaches to instrumentation are more painful and less useful.
- Michael P: I think describing the problem and starting discussion sounds like the right way to go to me. My gut feeling is that the discussion would veer toward arbitrary code execution at compile time as something we've always avoided, but at least in this case you're opt-ing into it.
- Felix: Where's the best place to start this discussion?
- Michael K: Filing a regular issue is a good start.
- Felix: This seems like it overlaps with linkname in that it's maybe using a tool in a way it wasn't quite intended, and we want to be good citizens and not be in the same position as linkname.
- thepudds: People have been using toolexec for a long time and I think it would make sense to loop in the tools team and maybe bring it up in the bimonthly tools call (https://go.dev/wiki/golang-tools#calls).
  thepudds: Also, here's an older proposal regarding caching issues for toolexec: proposal: cmd/go: allow -toolexec tools to opt in to build caching #41145.

mknyszek · 2025-01-16T15:55:39Z

2024-10-24 Sync

Attendees: Michael P, Rhys, Nick, Milind, thepudds, Michael K

Rhys: Still working on runtime.mutex, trying to address tail latency. Planning to add an additional layer of complexity to make direct handoff to another thread efficient without blocking the unlocker, or blocking other users of the lock.
- We still have some complexity budget remaining 🙂
Michael K: Very brief regions update.
- Plan to post publicly.
Milind: Goroutine deadlock detection via GC
- Milind: Goroutines leak in production. A goroutine is blocked on a channel, but the other end of the channel is no longer reachable. The goroutine (and its stack) can’t be collected. Enough leaks eventually lead to OOM.
- Milind: Liveness of the goroutine can be determined by the garbage collector. i.e., the concurrency primitive must be reachable from another live goroutine.
- Milind: Two options: just report leaks, or actually collect the goroutine.
- Rhys: Two goroutines both selecting on each other; I think GC could detect this case?
- Milind: Yes, cycles are not a problem.
- Rhys: Absolute goroutine counts hasn’t been super helpful, but goroutine profiles to break down what each goroutine is has been more useful.
- Milind: We do take daily goroutine profiles and look for suspicious cases, but there are lots of false positives/negatives. Time series analysis may be interesting, but not a sure solution.
- thepudds: Is the main goal reporting or collection?
- Milind: It’s an open question. Collecting is nice because it doesn’t cause outages, but if that is too complex, we’d get lots of value just from reporting.
- thepudds: Some previous discussions
  - https://github.com/golang/go/issues/19702#issuecomment-290223996
  - https://github.com/golang/go/issues/19702#issuecomment-290433926
  - https://github.com/golang/go/issues/19702#issuecomment-289825220
- Michael K: If we were to collect the goroutines, what about finalizers that only those goroutines reference? It’s not safe to run the finalizers, and it’s not safe to not run the finalizers. There’s a bunch of other details in those issues.
- Milind: 95% recall on microbenchmark tests, many unique bugs found in Uber unit tests.
- Rhys: You could add a request ID as a pprof label, and then when you collect a goroutine profile, note goroutines that have a request ID for a request that has already completed.
- Felix: It would be very helpful for goroutine profiles to include how long goroutines have been blocked. We had to work around by parsing runtime/debug.Stack (which incurred STW, better these days).
- thepudds: Built-in race detector has made writing concurrent code more approachable; a built in approach for this would be similarly nice.
- Rhys: goroutine IDs provide a very rough age, since they are sequential.
- Michael K: The patch is complicated, but resolvable in code review. Putting up a CL could help answer questions, such as cost when disabled.
- Michael K: Second question is about bang-for-your-buck. Goroutine profiles is less precise, but can help catch cases that would be false negatives with the GC approach (such as reachable via a global).
- Michael K: IIRC, it wasn’t very effective in production?
- Milind: Not true anymore. Newer experiments are working well. e.g., changing unbounded growth to flat.
- Michael K: We could reduce complexity by decoupling from the GC and building as a separate pass that is opt-in. Something that is run on-demand (e.g., a profile).
- Milind: I think people would be happy with that.
- Milind: New goroutine states may not be strictly necessary.
- thepudds: do you (Michael K) have an off-the-cuff guesstimate on latency impact of an on-demand stop-the-world based approach? (Sounds like it could be parallel but not concurrent w/ the GC, or similar?)
- Michael K: it would be on the order of milliseconds probably, which may or may not be acceptable
- Michael K: but the main question is whether it's acceptable to just do very occasionally. is such a pause acceptable once per day? I dunno.
- Rhys: This analysis could be done on core dumps, if they were easy to work with (both re tools and compliance).
- Milind: Too much PII in core dumps to touch them.
- Rhys: Maybe analyze the core dump on the machine? If there were a good way in the community to analyze core dumps in situ, that could address both compliance (doesn’t leave the machine/container) and tools (they’d get regular use, bug reports, and fixes) problems.

mknyszek · 2025-01-16T15:56:11Z

2024-11-07 Sync

Attendees: Nick, Michael P, Rhys, Felix, Milind, Bryan, thepudds, Tim

Michael K: Will post publicly about GC regions in a GitHub discussion sometime soon.
Michael K: Flight recording updates (gh issue)
- When there are multiple consumers, there’s tension between what they all want
- Thinking of taking the union of all requested events, the highest requested granularity, etc
- Trying to make this all composable, with multiple consumers not conflicting. Some may get more data than they requested, if a concurrent user asked for more (more events, finer granularity).
- Consumers can filter, the runtime can do the simple work of sending the same data to multiple consumers
- Michael P pointed out that reversing that decision is backwards compatible; we can decide later to make each consumer get exactly what they requested.
- Millind: Maybe there can be a stdlib package that does filtering?
- Michael K: Maybe. The trace parsing API is low level enough that users can build their own filtering [editor: did I get this right?]
- Michael K: CPU profile rate is an exception that needs some extra work in the runtime (no big deal). We’d have to set the internal sample rate to the max that’s requested, and downsample for each consumer.
- Felix: Are we sure it’s just a filtering thing? Picking different sample rates can mean different responses to bias, if we collect data at different times.
- Michael P: We’d thought most about tracing, might need to think more about profile samples.
- Michael K: The pprof protobuf format has a single field of sample rate, so we need to make sure that’s right when taking CPU profiles.
- Rhys: Argued against making SetCPUProfileRate because CPU profiling was still broken on Linux, but that's fixed now, so I retract that position. It's more subtle than just changing the sampling rate high to get a high resolution. If the kernel samples too often, it stops representing actual on-CPU time, and the profile becomes less representative.
- Milind: So you're concerned with scaling the value?
- Rhys: If one user asks for 10,000 Hz and one for 100 Hz, then nobody's going to get what they want, it exceeds the maximum rate that Linux will give you (for setitimer/timer_create-based profiling). But also again, at a rate that's too high, the CPU profile ceases to be useful as it interferes too much with execution.
Felix G: CLs to reduce pprof label overhead. Also helps eBPF profilers want to get trace id / span id from various runtimes (eBPF code).
- We use pprof labels for distributed tracing internally. Very useful to keep track of all the work that's being done on behalf of an endpoint.
- 20-30% of creating a span is going to setting pprof labels.
- 1,000+ CPU cores doing nothing but setting labels.
- CL switches from a map to a slice of key-value pairs.
- Made optimizations based on keeping the labels in sorted order.
- Michael P: Uncomfortably excited to learn eBPF is reading Go maps.
- Michael K: Let's get it reviewed!
Felix G / Nick R: Seeing issue where the execution tracer seems to interfere with scheduling.
- Nick: Goroutines all just run for extended periods of time, latency spike.
- Michael K: Sounds like the process getting descheduled, have seen this with Linux cgroups CPU quotas.
- Nick: Maybe, but we have other evidence. (We'll come back to this.)
- Nick: It corresponds to the execution tracer being active. It doesn’t correspond to an update to any software as far as we can tell.
- Nick: We have CPU profiling enabled at the same time, and when these goroutines appear as running for many tens of milliseconds (500ms?), we don’t see any CPU profile samples.
- Michael K: One of the expensive parts of tracing is traceAdvance. Is that happening? There’s also a forEachP.
- Michael P: Those goroutines look like they’re from the same root function, maybe they’re all independently trying to do something that’s hard to interrupt like some long assembly function. Does the M view of the execution trace show anything different? We’ve seen problems with getting to stop-the-world because of long assembly functions .. on the scale of seconds.
- Michael K: Another time I’ve seen this is from services that run in cgroups without enough memory, and the binary itself ends up thrashing (getting swapped in and out). Might be worth looking at disk I/O metrics.
- Rhys: Try changing the clock source on CPU profiles to be wall-time based. It's not a CPU profile, but might give you samples for sleeping threads, revealing some interesting structure. Also take a very close look at the very end of the slow times: how do those threads become un-stuck, what’s the exact timing, etc?
- Felix: It seemed like a lot of events were blocked on a mutex in the tracer.
- Micahel P: Maybe also look at CPU metrics in cgroups.
- Michael K: If one goroutine gets stuck with a runtime.mutex, the other goroutines that need that runtime.mutex will end up looking like that.
- Rhys: +1, but I haven't seen this sort of long lasting starvation with the current runtime mutex. This is on linux/arm64, which is interesting. I would be surprised if it's starvation. I would write a program that uses the x/exp/trace parser that dumps out the CPU profile samples for the time range during the time range that this is going on. I'd look to see if there are more samples than usual related to locks, or anything else.
- Felix: We'll take some of these ideas and try more things. We'll also try to anonymize some traces and post it to GitHub.
Felix G: Trace timeline viewer.
- Will hopefully make some more progress soon!

mknyszek · 2025-01-16T15:56:49Z

2024-11-21 Sync

Attendees: Carlos, Bryan, Rhys, thepudds, Nick, Michael K, Milind, Michael P

Day of the freeze!
- thepudds: Have a few things I would like to merge. Are tests and documentation OK during the freeze?
- Michael K: Yup!
- thepudds: Some of it is behavioral changes that improve testing. For example, changes that allow for more thorough testing.
- thepudds: A couple CLs for making the GC more stringent about bad pointers.
- Michael P: You're always welcome to send CL that we merge after the freeze lifts.
Rhys: Looked into some performance problems on arm64 with scalability a few months ago. Saw that when an application allocates a large byte slice it's not preemptible. runtime: non-preemptible zeroing in clear, append, bytes.growSlice, etc #69327
- Michael K: This is doable, but requires recovering why it's not OK to have preemption points in the clearing routines in mbarrier.go. For pointer-free memory, I do not see why it would be a problem. For memory with pointers, it might be OK to chunk the bulk write barrier with the clears.
Michael K: Posted publicly about GC regions finally.
- Bryan: What are the next steps?
- Michael K: Prototype and removal of arenas experiment, but the timeline is TBD. Hopefully next year, but probably not early next year.
Bryan: I want to retract my statement that I never got anything useful after execution traces. My position changed in the last two weeks. I tried out go tool trace and gotraceui.
- Bryan: When I added trace regions that showed up clearly in gotraceui.
- Bryan: It would be nice if trace.Region matched up with OTel spans.
- Bryan: It would be nice if everything in slog also appeared in traces.
- Michael K: Spoke about that with Jonathan but we never got to actually implementing it.
- Michael K: There are still barriers to using execution traces, but things have improved a little bit.
- Nick: For DataDog, we do correlate traces with APM spans. We use tasks instead of regions for this. You can also use trace.Log to put the span ID into a trace. We tend not to put logs into traces because logs are huge.
- Michael K: You can put logs into traces only behind a flag. For example, if the injection point is slog.Handler.
- Rhys: You can put different kinds of logs into the execution traces, more focused around operational events, and not including sensitive info.
- Bryan: Ran into some problems with gotraceui and Go 1.22 and Go 1.23. Dominik mentioned that it wasn't updated for Go 1.23.
- Michael K: There was only one notable update in Go 1.23 IIRC and that's for direct coroutine-style switches in iter.Pull, to emit fewer events. Michael P had the idea to use the same event for direct switches done by channels, if we ever do that. This required no changes to the high level API though, so it should be very easy for Dominik to upgrade, hopefully.

mknyszek · 2025-01-16T15:57:13Z

2024-12-05 Sync

Attendees: Michael P, Felix, Rhys, thepudds, Bryan, Michael K, Carlos

Felix: Quick updates on execution traces showing frozen goroutines (theory was execution traces are messing with scheduling) and flight recorder stuff
- Written a tool to turn execution traces into SQL that can be queried. (!)
  - Demo.
  - Michael P: Perfetto makes everything available via SQL
  - thepudds: Java heap analysis, etc is also queryable with SQL
  - Michael K: Plan to open source?
  - Felix: Yeah probably.
  - Michael K: Would love to have this on an "execution trace user manual" page.
  - Rhys: It's so fast! Do you know how big the original execution trace was?
  - Felix: Small execution trace, ~5 MiB, but DuckDB is far from capacity. 10x or 100x should be fine. Parsing the trace takes 1-2 seconds for a trace of this size. DuckDB has an optimized way of inserting bulk data.
  - Rhys: I love that there are more tools to programmatically analyze execution traces, to answer questions like “does this trace file include the rare event I’m looking for” and “if so, what goroutine at what time”. Figuring out deep linking would be super useful.
  - Felix: +1.
  - Michael K: Would like to have a seeking API, the current API forces you to go through the whole trace once.
  - Felix: Seeking would be very useful for my other project when dealing with large traces.
  - Rhys: For a deep link, if you start with a tool that tells you what the interesting thing is, that could go into a secondary tool that takes a minute-long execution trace and trims out just to the time range of interest. Then that may be faster to load.
  - Michael P: The deep link could be a query parameter on whatever URL is given and go tool trace wouldn't try to send all the data to the browser. It may have to parse the trace again, but wouldn't have to keep all the parts in memory.
  - Rhys: Re-parsing but at what level? If you have a 20 second execution trace and you're looking 15 seconds in then the execution trace tool could do a "coarse" pass on the first 15 seconds and a finer-grained pass on the last 5 seconds. We often end up with huge execution traces, hundreds of MBs, and it takes a minute or two to load that (plus a bunch of RAM). Maybe we can’t make that fast to do, but we could instead skip over big chunks of what would be expensive work.
  - Felix: Sounds doable. IIRC it should be easy to skip through the trace quickly by generation and each generation has the absolute timestamp it starts at (?).
  - Michael K: +1. Wall-clock timestamp?
  - Rhys: Protobuf protos have a wall-clock timestamp which is useful. It would be nice to have the timestamp available for traces.
  - Michael K: Could have the binary an optional thing to pass in. Could be useful for disabling on-line symbolization.
  - Rhys: Execution traces benefited a lot from symbolization by default.
  - Michael K: Agreed that symbolization is the right default.
  - Felix: Strongly agree. eBPF profiler has a problem where it gets profiling data but no symbols, and then it's a huge lift for people how to get symbols and upload them.
- Nick and I blocked a good amount of time next week to sit down and discuss it again.
Michael K: Flight recorder proposal looks OK, some discussion of minor details
- Most recent discussion is about e.g. SetMinAge failing if flight recording has already started. The new CPU profiling API returns an error in this case. But there was some discussion over making FlightRecorder immutable, and having to pass that information up-front.
Rhys: I’ve heard that there’s runtime scalability work coming up. Maybe from the anniversary blog post? Can you share the name of who’s working on that, so I can encourage people to tag them in issues I ask them to file when they have interesting-looking problems? Or maybe this is an issue label (like “Performance”, but specifically for runtime scalability).
- Rhys: Klaus Post’s short-lived goroutines with GMP=64 (or 128?)
- Michael P: A GitHub label would be helpful. Most of the people likely to work on it are in the room. I created a scalability label just now.
Michael K: viewcore progress (no promises, but looking better)
- Lots of issues with refactorings of e.g. runtime._type
- Does anyone in this room care about viewcore for Go <1.22.
- Felix: I sometimes look at older Go cores, but if it helps viewcore work it's not a big deal.. I could always use an older version of viewcore.
- Rhys: I wanted to clarify whether viewcore should no longer support versions that the Go team does not support.
- thepudds: Starting with just supporting tip is also great.
- Michael K: We've discussed supporting older versions as long as they didn't get in the way. "Good effort."

mknyszek · 2025-01-16T15:57:39Z

2024-12-19 Sync

Attendees: Michael K, Rhys, Mike D, Nick, Carlos, Felix, Bryan, thepudds

Mike D: Working at Odigos on eBPF. Interested in Swiss maps, profiling work, etc.
Felix: I published an early version of sqlprof. Feedback welcome!
- Felix walked us through the use-cases in the README. Looks great!
- Rhys: What are the output formats? The SQL text or is it possible to do analysis in SQL and a gzipped pprof proto back out?
- Felix: Not implemented but open to it. For now you can do CSV output.
- thepudds: Very cool, and thanks for publishing it! The name made me think it was for profiling SQL execution. I had this idea to work in a lot of contextual knowledge like that showcased in the examples (like categorizing the GC stacks) into the tools that ship with the Go distribution. For example, there could a JSON file that is maintained and shipped by Go that pprof knows how to read to report categories, so pprof does not have a prior knowledge of categorization but easy for Go to teach it.
- Felix: Not married to the name, ideas welcome. RE: categorizing stack traces, it's something I'm super interested in. The general problem is hard, because I'd ideally like to break down the trace into even more categories automatically. Like serialization/deserialization, in addition to low-level hard-coded things. Subtle things like mallocgc/gcAssist make things complicated. I don't think the idea of proposing to the pprof tool of providing a JSON file of categories is a bad one. I'll even propose it myself if you don't get to it.
- thepudds: A category like "other runtime" or "this is in the runtime but we don't know what exactly it is" is useful both to an end-user troubleshooting, but also to help triage issues on Go issue tracker
- Michael K: I like shipping this JSON file with Go. It's testable! Make sure the symbols still exist.
- thepudds: You can write a benchmark too to make a given category pop to double-check results (more end-to-end) to automate the testing of some correctness.
- Felix: Correcting myself from last time: the execution trace parser is definitely not the bottleneck.
Felix: Should the GC guide mention the write barrier? Especially in the identify cost section? Datadog spends > 1k cores on it.
- Michael K: +1, please feel free to send a patch. Small caveat that we should document: the fast path likely does not get attributed to the write barrier.
- Michael K: Maybe we could have the compiler claim that the instructions that write pointers into the buffer slots are part of a special function that was inlined?
- Felix: If there's a lot of subtleness to documenting this properly, I'd prefer if you did it.
- Michael K: OK no problem. :)
- Rhys: RE: write barriers and making profiles easy to understand, I'm curious if the contributors and maintainers think there's a benefit to have a way for Go users with some level of experience to find out what about their programs are unusual. Like, 1M goroutines is within limits, but unusually high. Or a high GOMAXPROCS.
- Michael K: I like the idea and approach, but I'm not sure where to go with it.
- Rhys: Another approach is to put feelers in the community for programs that are unusual in some ways. For example, more than 1M goroutines, >1TiB heap, etc.
- Michael K: Putting out feelers has traditionally been hard.
- Rhys: Wouldn't want to give the impression that "Go is not suitable for X." If you can convince a user to apply some instrumentation and give back some data, that's certainly a good start.
- Felix: +1 that helping people understand what is unusual is an important challenge to overcome. Especially for people in stressful situations who are trying to fix an active problem.
Felix: Has the Go runtime/compiler ever considered providing support for USDTs (implementation details)?
- Linux uprobes: rewrite an instruction to an int3 to trap when entering a function to track execution. Used with eBPF. USDTs are like a safer version of uprobes. Just put a nop instruction somewhere and put data in the binary that describes data associated with that instruction. You still patch the nop instruction.
- thepudds: it’s more stable over time because it’s in the source code, and can be placed in the middle of a a sensitive function, which is harder to do and less stable over time w/ a uprobe.
- Rhys: Years ago we were interested, but didn't know exactly what we'd use them for.
- Mike D: +1 to what Felix was saying. We try to instrument all sorts of things and we patch specific functions, but we also read and write values.
- Michael K: How would this look?
- Felix: A magic function like "Probe(fn)" which normally compiles to a nop but can be patched with a trap that then executes the function.
- Mike D: This is definitely just gaining some traction now.
- Rhys: This is an opportunity to run eBPF code while a thread is paused in the runtime. This could be a way to add programmability to the runtime, if part of the contract for the particular tracepoint is that you write out some value to some stack offset.
- Felix: I do not know the answer for sure but I think you have to basically chase down the pointer if the probe is active or not.
- thepudds: Maybe I didn't understand the question, but if there's a pointer that's available, it's not going to dereference until the probe fires.
- Rhys: It sounded to me like to have a stable API for a named probe is that you take all the information that you guarantee all the information that's available to the problem is still available across releases.
- thepudds: In dtrace, which this is based on to some degree, you describe what you want to happen, and it only happens if the probe is executed. You can have a little C-ish expression that walks through pointers and that only happens if probe is enabled. (Caveat -- I’m more familiar with dtrace mechanics than the low-level eBPF mechanics).
- Rhys: Depends on the language you have to describe the probe, and I don't know that at all.
- Felix: I brought it up for the same use-case that Mike D has. It would be great if Go binaries could be as easily observed as other languages.
- thepudds: In the world of eBPF there was a more fundamental problem with Go: return probes due to stack moves. There are a few issues there. Commercial eBPF vendors I think work around it, but I think it’s a problem for other developers just trying to do basic things on their own with eBPF and Go. Has that changed?
- Mike D: I don't think we've hit too much of a problem with our workarounds. If we could solve the return probe problems that would help.
Rhys: Are we meeting Jan 2?
- Nope!

mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 8, 2022

mknyszek added this to the Go1.21 milestone Dec 8, 2022

mknyszek added this to Go Compiler / Runtime Dec 8, 2022

prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2022

prattmic moved this to In Progress in Go Compiler / Runtime Dec 14, 2022

felixge mentioned this issue Dec 14, 2022

cmd/trace: support for perfetto #57315

Open

irfansharif mentioned this issue Dec 15, 2022

proposal: runtime: add per-goroutine CPU stats #41554

Open

tbg mentioned this issue Feb 15, 2023

obs: integrate on-demand CPU profiling / Go execution tracing with CRDB tracing cockroachdb/cockroach#97215

Open

felixge mentioned this issue Mar 27, 2024

proposal: runtime/pprof: add goroutine stack memory usage profile #66566

Open

DarrylWong pushed a commit to DarrylWong/go-benchmarks that referenced this issue Apr 22, 2024

sweet: add support for execution traces and measuring trace overhead

715fba1

For golang/go#57175. Change-Id: I999773e8be28c46fb5d4f6a79a94d542491e3754

cforce mentioned this issue Sep 17, 2024

hostmetrics featuring ebpf - resource efficient scraping open-telemetry/opentelemetry-collector-contrib#32446

Open

prattmic mentioned this issue Sep 27, 2024

runtime: significant heap profiler memory usage increase in Go 1.23 #69590

Open

runtime: diagnostics improvements tracking issue #57175

runtime: diagnostics improvements tracking issue #57175

Comments

mknyszek commented Dec 8, 2022 • edited Loading

Runtime tracing

Heap analysis (see #57447)

mknyszek commented Dec 8, 2022

2022-12-07 Sync

felixge commented Dec 9, 2022

tbg commented Dec 9, 2022

thediveo commented Dec 9, 2022

qmuntal commented Dec 9, 2022

mknyszek commented Dec 9, 2022

qmuntal commented Dec 9, 2022

qmuntal commented Dec 14, 2022

gopherbot commented Dec 21, 2022

mknyszek commented Dec 22, 2022 • edited Loading

2022-12-22 Sync

mknyszek commented Jan 13, 2023 • edited Loading

2023-01-05 Sync

mknyszek commented Jan 19, 2023 • edited Loading

2023-01-19 Sync

mknyszek commented Feb 2, 2023

2023-02-02 Sync

qmuntal commented Feb 2, 2023 • edited Loading

mknyszek commented Feb 16, 2023 • edited Loading

2023-02-16 Sync

mknyszek commented Mar 3, 2023 • edited Loading

2023-03-02 Sync

dominikh commented Mar 4, 2023

mknyszek commented Mar 6, 2023 • edited Loading

dominikh commented Mar 6, 2023

mknyszek commented Mar 6, 2023

dominikh commented Mar 6, 2023

dominikh commented Jan 18, 2024 • edited Loading

mknyszek commented Mar 9, 2024

2024-02-01 Sync

mknyszek commented Mar 9, 2024 • edited Loading

2024-02-15 Sync

mknyszek commented Jun 20, 2024

2024-02-29 Sync

mknyszek commented Jun 20, 2024

2024-03-14 Sync

mknyszek commented Jun 20, 2024

2024-03-28 Sync

mknyszek commented Jun 20, 2024

2024-04-11 Sync

mknyszek commented Jun 20, 2024

2024-04-25 Sync

mknyszek commented Jun 20, 2024 • edited Loading

2024-05-09 Sync

mknyszek commented Jun 20, 2024

2024-05-23 Sync

mknyszek commented Jun 20, 2024

2024-06-06 Sync

mknyszek commented Jun 20, 2024

2024-06-20 Sync

mknyszek commented Aug 5, 2024

2024-07-18 Sync

mknyszek commented Aug 5, 2024

2024-08-01 Sync

mknyszek commented Aug 15, 2024

2024-08-15 Sync

mknyszek commented Sep 26, 2024 • edited Loading

2024-08-29 Sync

mknyszek commented Sep 26, 2024

2024-09-12 Sync

mknyszek commented Sep 26, 2024

2024-09-26 Sync

mknyszek commented Jan 16, 2025

2024-10-10 Sync

mknyszek commented Jan 16, 2025

2024-10-24 Sync

mknyszek commented Jan 16, 2025

2024-11-07 Sync

mknyszek commented Jan 16, 2025

2024-11-21 Sync

mknyszek commented Jan 16, 2025

2024-12-05 Sync

mknyszek commented Jan 16, 2025

mknyszek commented Dec 8, 2022 •

edited

Loading

mknyszek commented Dec 22, 2022 •

edited

Loading

mknyszek commented Jan 13, 2023 •

edited

Loading

mknyszek commented Jan 19, 2023 •

edited

Loading

qmuntal commented Feb 2, 2023 •

edited

Loading

mknyszek commented Feb 16, 2023 •

edited

Loading

mknyszek commented Mar 3, 2023 •

edited

Loading

mknyszek commented Mar 6, 2023 •

edited

Loading

dominikh commented Jan 18, 2024 •

edited

Loading

mknyszek commented Mar 9, 2024 •

edited

Loading

mknyszek commented Jun 20, 2024 •

edited

Loading

mknyszek commented Sep 26, 2024 •

edited

Loading