proposal: runtime/trace: flight recording #63185
Comments
I agree with all of the advantages of having flight recording in the runtime. As for
I think that program logic will be the far more useful trigger for the majority of use cases, and certainly easier to make use of. On top of that, this flight recorder will be much more straightforward to integrate with existing telemetry solutions, compared to external solutions, which will either not be designed for that (Gotraceui) or will have to be written first.
Just to say the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?
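(For illustration: a GOMEMLIMIT-based trigger can live entirely in user code by polling the runtime's own memory accounting and snapshotting when it gets close to the limit. A minimal sketch, assuming a polling loop is acceptable; the 90% threshold and the snapshot callback are arbitrary placeholders, not part of any proposed API.)

```go
import (
	"runtime/debug"
	"runtime/metrics"
	"time"
)

// watchMemoryLimit polls the runtime's memory accounting and calls snapshot
// once total runtime-managed memory crosses 90% of GOMEMLIMIT.
func watchMemoryLimit(snapshot func()) {
	limit := debug.SetMemoryLimit(-1) // a negative input reads the current limit without changing it
	samples := []metrics.Sample{{Name: "/memory/classes/total:bytes"}}
	for range time.Tick(time.Second) {
		metrics.Read(samples)
		used := samples[0].Value.Uint64()
		if limit > 0 && used > uint64(float64(limit)*0.9) {
			snapshot() // e.g. dump the flight recorder's window to a file
			return
		}
	}
}
```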
Nice! How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for The panics in
You've said that the data from
It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately

With that being said, there are still some best-effort things we might be able to do. Programs that don't use

Either way though, it's still best-effort. Unless the snapshotting fully stops the world, the program may continue executing and OOM before the trace actually gets fully dumped. (Even then, it's still possible (but less likely) that it OOMs before the runtime successfully stops every running goroutine.)

Once upon a time there was a patch proposed to Linux to allow for halting a process when it hit container limits, so that another process in a tiny memory reservation could inspect it. One could imagine that if this other process created a core dump of the halted process, we could write some code to extract any active trace buffers from the core dump into a trace.

I'm not sure where this leaves us. Perhaps it suggests that
If you happen to have CPU profiling running, it'll just work, but you make a good point that there's no good way to have it included all the time (i.e. no intention of producing a CPU profile). It seems to me like that should maybe be another option, either on the
Hm, that's a good point. I'll update the API to return an error on
There's currently no list of all the Ms per generation because we don't have M events, but yes, everything within a partition is namespaced by that partition's "generation number." The proposed trace parsing API exposes partition changes as a

FWIW, the trace parsing API already does this exact kind of "stitching." Every partition is an entirely self-contained trace, which means all goroutines (and Ps) and their statuses get named in every partition. The trace parser uses this information to validate the stitching: new partitions' goroutine statuses need to match where that goroutine left off in the previous partition.
This proposal has been added to the active column of the proposals project
I think this is waiting on an implementation and experience from using that implementation.
That's correct. I plan to have one in golang.org/x/exp/trace soon.
I made a mistake in the commit message so gopherbot couldn't connect the CL to this issue, but the experimental API has now landed via https://go.dev/cl/550257. It is available in golang.org/x/exp/trace for Go 1.22 only. There are a few caveats with this implementation that will not be true with a real runtime implementation.
I don't think any of these are significant enough to detract from the usefulness of the experiment, but I wanted to bring it up in case one of these does become an issue. We can also explore ways to improve the experiment to make it more representative, if one of them is indeed a problem. Please give it a try!
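(A minimal sketch of what trying the experiment looks like, based on my reading of the golang.org/x/exp/trace API; error handling is kept deliberately simple.)

```go
package main

import (
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	fr := trace.NewFlightRecorder()
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	// ... run the application; when it notices something interesting
	// (failed health check, slow request, ...), snapshot the window:
	f, err := os.Create("flight.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Print(err)
	}
}
```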
The

One limitation I encountered is that the

As for preserving traces after OOM, one possible approach could be to use a separate agent process with a shared memory buffer to write the traces into. That agent could then detect that the main process crashed and write the contents of the memory buffer into a file or other storage. However, this would require keeping all the flight recorder data in a single pre-allocated buffer.
Thanks for trying it out!
Yeah, that's a limitation of the experiment which lives out-of-tree for convenience, but it technically shouldn't be necessary if it's implemented in the runtime. I'd hoped to allow one of each type of consumer (one active flight recorder, one active
@mknyszek, we've discussed allowing the caller to control the set of events that the trace includes. That would enable runtime maintainers to use the execution trace to grant access to events that are niche and expensive, or in the other direction would enable low-overhead access to a subset of events. Maybe this is how we build/improve delta profiles for #67942, and how we give access to goroutine tags/labels in monotonic profiles (heap, mutex, etc) for #23458. If that's the direction we go, I think we'll want either a/ multiple
Multiple

The only way I can imagine this working efficiently is via something like https://go.dev/cl/594595. TL;DR: We have some light infrastructure for emitting "experimental" events now, and this CL formalizes that a bit better by allowing us to split the event stream. It's at least a step toward that. If we have separate buffer streams, you could imagine a new registration consists of an

It's certainly not off the table, but I think we're a little ways off still.
I'd like to point out that in Linux it's now possible to be actively informed by the kernel through an epoll event if a control group is under memory pressure through the PSI subsystem. https://systemd.io/MEMORY_PRESSURE/ This could be used as a signal to dump a trace and/or to trigger GC proactively.
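(A rough sketch of wiring that up from Go, assuming a kernel with PSI enabled and permission to register a trigger; the thresholds are arbitrary, and inside a container the cgroup's memory.pressure file would be used instead of /proc/pressure/memory.)

```go
import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// watchMemoryPressure registers a PSI trigger and calls snapshot every time
// the kernel reports at least 150ms of memory stall within a 1s window.
func watchMemoryPressure(snapshot func()) error {
	f, err := os.OpenFile("/proc/pressure/memory", os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	// "some 150000 1000000": notify when tasks were stalled on memory for
	// a total of 150ms or more within a 1s window.
	if _, err := f.WriteString("some 150000 1000000"); err != nil {
		return err
	}
	fds := []unix.PollFd{{Fd: int32(f.Fd()), Events: unix.POLLPRI}}
	for {
		if _, err := unix.Poll(fds, -1); err != nil {
			if err == unix.EINTR {
				continue
			}
			return err
		}
		if fds[0].Revents&unix.POLLERR != 0 {
			return fmt.Errorf("PSI trigger no longer supported by the kernel")
		}
		if fds[0].Revents&unix.POLLPRI != 0 {
			snapshot() // e.g. dump the flight recorder's window
		}
	}
}
```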
This has been "waiting for implementation experience". Do we have enough experience to think about whether to add this to Go 1.24?
We've had some experience with flight recording inside Google, and we've also gotten some good feedback from @martin-sucha. I think a few things are clear from our experiences so far:
Much of this can be refined and improved once we have an implementation that lives fully inside the runtime. Lastly, from my point of view, I don't know what other feedback we would be waiting on at this point. CC @cagedmantis
This discussion seems moot given the "only one tracer" limitation. I think we agree that if there are multiple tracers, then SetPeriod/SetSize apply to the one tracer they are called on, and not the others. In discussion I was confused about "period", thinking it had something to do with sampling, which obviously it cannot. SetDuration might be a better name. It should also be made clearer what happens if SetDuration/SetSize conflict. Does the bigger one win? And how does anyone know what kinds of sizes to pass? There is no documentation of the relationship between duration and bytes. Maybe the API should be SetMaxDuration and SetMaxSize? Or SetMinDuration and SetMaxSize...? Or...? Should they be dropped entirely? Users need some way to understand how to use these, and it's unclear how to use them at the moment.
To expand on what this would mean: the buffer would be kept as small as possible to satisfy the min duration, but if that exceeds max size, then max size wins over the duration. That seems consistent with your comment about SetSize "limiting the damage" of SetDuration. If you just want as much trace as can fit in, say, 16 MiB, you'd SetMaxSize to 16 MiB and SetMinDuration to (effectively) infinity.
Should there be some way for the application to detect that the max size limit is capping the duration? A runtime metric tracking the current duration or size of the trace data in the buffers perhaps? Would such a metric be useful at all?
I'll just note that the original docs for
(EDIT: That being said, all for clearer names and stricter semantics, personally.)
I support @aclements' description of the semantics for

One caveat I would like to add is that I would also like to mention in the documentation that there is an internal minimum for the minimum duration, and that the setting might be ignored. If the minimum is too low, a simple implementation (call

Finally, I agree we should add some guidance on how to set max size and min duration to the documentation. For min duration, how about:
For max size:
If this sounds OK overall, I'll workshop it a bit more and put all this back up into the original proposal. CC @cagedmantis
Based on this description of the duration limit and the one in #63185 (comment), it seems like we're describing a maximum duration. I.e. the flight recorder will hold on to no longer than the given duration of recent activity.

FWIW I don't think a minimum duration makes sense for a flight recorder. If we have a minimum duration the buffer should hold, surely holding more than that duration would be acceptable? Meaning we'd have an unlimited flight recording if a size limit is not set? That said, if I replace "minimum duration" with "maximum duration" then the descriptions and guidance in #63185 (comment) seem reasonable to me.

In a continuous profiling context, we would likely use both. We'd have a duration limit of ~1 minute since that's how often we upload data, and it'd be a waste to upload substantially overlapping traces. And we'd have a size limit to prevent OOMing user programs, and keep bandwidth & storage costs under control.

For some prior art here, the Java JDK Flight Recorder (JFR) has
I see your point, but I can still see the argument both ways. The window size is, IMO, a bit more of a target than it is a strict minimum or maximum. It's much fuzzier to me than
I think where I got confused with Austin's comment was the relationship between duration and size in bytes. Rereading it, I think Austin is suggesting that
I had pictured a
The original implementation I had in mind would always keep 2 generations around (the currently-accumulating one and the previous one), but adjust the rate at which generations are produced based on the duration and size parameters. It would track how many bytes live in buffers and force the creation of a new generation in response to

One downside is that this implementation strategy has visible effects on concurrent calls to

Your proposed implementation is also reasonable, and it might be helpful to decouple the internal generation policy from the flight recorder policy. But it also means the duration and size parameters will be less precise, always rounded up to the minimum granularity the runtime exposes. I believe it would be backwards compatible (since the wire format doesn't follow compatibility guarantees) to change the implementation if one or the other is a mistake, so we can experiment and start with whatever's easier.
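(To make the first strategy concrete, a hypothetical sketch of the bookkeeping, not the runtime's actual code; splitting the byte budget in half per generation is an assumption that follows from retaining two generations at a time.)

```go
import "time"

// flightRecorder models the policy only: two generations are retained
// (current + previous), and a new one is cut when the current generation
// reaches half the byte budget or the full duration target.
type flightRecorder struct {
	maxBytes uint64        // size parameter
	minAge   time.Duration // duration parameter

	genStart time.Time // when the current generation began
	genBytes uint64    // trace bytes accumulated in the current generation
}

// maybeRotate is called as trace data accumulates and reports whether a new
// generation should be forced (i.e. where the runtime would call traceAdvance).
func (fr *flightRecorder) maybeRotate(now time.Time, newBytes uint64) bool {
	fr.genBytes += newBytes
	if fr.genBytes >= fr.maxBytes/2 || now.Sub(fr.genStart) >= fr.minAge {
		fr.genStart, fr.genBytes = now, 0
		return true
	}
	return false
}
```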
Does this mean that during the
@martin-sucha No. In my mind,
In the diagnostics meeting today, we discussed
I think the API in this proposal is not quite right. Having different configurations for different subscribers of trace data is going to be very complex to implement in the general case. It's going to be tricky for duration and size, but even worse if we consider things like opting into specific events in the future (filtering the event stream effectively requires parsing it, which is much slower than emitting it). I would like to propose an alternative API that makes these knobs,
This API is substantially different, but gets at the heart of what the size and duration parameters really represent, and makes them global settings, so there's no conflict between subscribers. It also presents a coherent semantics for these durations with respect to regular tracing. It also aligns even more closely with the existing proposal for CPU profiling configuration (#42502). Lastly, it allows for multiple trace subscribers (that is, the equivalent of multiple calls to

My one concern with this API is that the sync period is arguably an implementation detail, but I think we've reached a point where it cannot be. Without the concept of a synchronization point in the trace, many important properties of the new trace format simply cannot be relied upon (like the fact that large traces can be parsed using a reasonable amount of memory). While most of the time users do not need to be aware of this at all, for those working with and integrating with traces deeply, these details matter a lot.

If this API sketch is of interest to those on this issue, I will file a separate proposal to define:
Extending the tracer for multiple subscribers is actually not that hard to implement. The trace reader goroutine just needs to spit trace data out into multiple io.Writers, effectively acting as a MultiWriter itself. The advantage of the subscription mechanism is that different tracing sources can come in and out concurrently at different times. This means that net/http/pprof need not return an error if someone tries to grab a trace manually while, say, a diagnostics suite performs occasional continuous monitoring duties by enabling tracing briefly. But, we can also cut down this API to allow only one subscriber. That would be OK with me too.
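(A rough user-space analogue of that fan-out, using only APIs that exist today; the real change would live in the runtime's trace reader goroutine and would let subscribers attach and detach at generation boundaries rather than sharing one fixed set of writers.)

```go
import (
	"io"
	"runtime/trace"
)

// startTraceFanOut duplicates a single execution trace stream into two sinks,
// e.g. a local file and a network uploader. Both sinks must be known up front
// and share the trace's entire lifetime, unlike the proposed in-runtime support.
func startTraceFanOut(primary, secondary io.Writer) error {
	return trace.Start(io.MultiWriter(primary, secondary))
}
```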
How does net/http/pprof end up sharing the same
I don't have a feel for the cost of EventSync events. What's the difference between syncing every 1000 ms, 100 ms, or 10 ms? If "10 ms" is so great, why don't we do that unconditionally?
This sounds helpful ... we expect to have an OS thread for each GOMAXPROCS, and for each OS thread to have a pretty big system stack. (Is that true for all threads, or only for the first one?) Can we convert that into justification for a general-purpose default for the sync interval (like "if we sync every 100 ms, that's 10% more memory usage than you'd get from the OS threads' stacks themselves, even if the app didn't have its own memory needs")?

Maybe I still don't understand the root we're trying to get to. It seems like every consumer of the execution trace data stream can make its own decisions about when to discard data (because it's more bytes than the consumer has in its budget, or because it covers more seconds than the consumer has been told it needs to record). But the one thing that the consumers need to agree on is how often the runtime will generate the sync events: that's the source of the need for some global config (or mutual exclusion). Is that right?
Yeah, that's a good point; I don't have a good answer. We could have net/http/pprof expose a
You raise a good point. The documentation should state this, but yes, there's a CPU cost to generating synchronization events more frequently. The cost is roughly proportional to the number of Ms and Gs. Off the top of my head, I don't know what the cost is, but it's certainly measurable.
The cost of synchronization events is proportional to not only OS threads, but the number of goroutines (since we record information about even waiting goroutines too). I'm not totally sure I follow your "10% more memory" example though; can you explain more?
Yeah, that's mostly right. I think in a hypothetical future scenario where we want to optionally expose "expensive" events (like, say, an allocation trace, which is currently behind a
How does one create a

If it has a global nature, why not make the methods package-level functions, or make a global variable of

And, how does the unrelated code know if there is already an instance of
+1 to making the global nature of
@rhysh in addition to the CPU costs mentioned in @mknyszek's reply, I think it's also important to consider the size overhead. More

@mknyszek I have a question about
Cherry and I spoke offline and I agree, I think the new API I wrote above is not quite right. Apologies for the confusion. I think where we landed was that we should continue with the API as-proposed in the original post, but that the implementation's decisions on when to cut a new generation should be decoupled from the flight recorder parameters (size and duration). That being said, we can consider having a separate top-level API that controls the size and duration between sync events. Like: (This is definitely incomplete.)
I think the original sin that spawned this whole discussion was conflating the trace implementation's size/duration with the flight recorder's size/duration. It's very convenient from an implementation perspective to couple the flight recorder's size/duration with the trace's underlying size/duration, but I think it ultimately doesn't make much sense from an API perspective.
I think this question really gets at the heart at why this question should be decoupled from flight recording. With flight recording, you care predominantly about how much trace data you're keeping around, and you want to control that. If we want to have a knob for the performance trade-off, we should have a separate knob for that.
Yes. In both the original proposal and the (now defunct) new one, a request to dump flight recorder data will always finish the current generation to ensure that the most recent data ends up in the trace that is produced.
During the diagnostic sync on Thursday I mentioned an API idea for ensuring that two libraries trying to control a global runtime setting don't end up interfering with each other in a problematic way. This might be less of a concern here, but I'm also thinking ahead to future APIs for controlling which trace events to emit or suppress. Below is a potential API pattern that could work:

```go
// SyncRate holds the rate at which the trace implementation resynchronizes the internal tracer state.
// The rate is expressed as a duration and a size in bytes. The tracer is synchronized each time the period
// elapses, or the amount of trace data produced since the last sync exceeds size bytes, whichever comes
// first.
//
// More frequent synchronization results in a bigger overall trace (because of more duplicate trace data),
// higher run-time CPU overheads when tracing is enabled, and lower run-time memory overheads. Less
// frequent synchronization results in the opposite.
type SyncRate struct {
	Period    time.Duration
	SizeBytes int64
}

// ControlSyncRate gives the caller control over the sync rate until release is called. Only one caller can
// control the sync rate at a time. If another caller tries to control the sync rate, an error is returned.
func ControlSyncRate() (control func(SyncRate), release func(), err error)

// ReadSyncRate returns the current sync rate.
func ReadSyncRate() SyncRate
```

In the case that ControlSyncRate returns an error, we could also include the stack trace of the caller that currently has control over the rate, making it very easy to diagnose potential configuration conflicts. I'd also like to include ReadSyncRate for two reasons: a) It's useful for tests that want to set a rate and then restore the previous one. b) It allows libraries to better adjust their behavior in case they are not able to control a setting. Let me know what you think. 🙇
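(Hypothetical usage of the sketch above; SyncRate, ControlSyncRate, and ReadSyncRate do not exist today, and the package placement and values are placeholders.)

```go
// withFrequentSync temporarily raises the sync rate while run executes,
// restoring the previous rate afterwards, and backs off gracefully if some
// other package already owns the knob.
func withFrequentSync(run func()) error {
	control, release, err := trace.ControlSyncRate()
	if err != nil {
		return err // someone else controls the sync rate
	}
	defer release()

	prev := trace.ReadSyncRate()
	defer control(prev) // put the previous rate back when done

	control(trace.SyncRate{Period: 100 * time.Millisecond, SizeBytes: 4 << 20})
	run()
	return nil
}
```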
Change https://go.dev/cl/555716 mentions this issue:
Apologies for the very long delay here. Since a lot of the remaining questions here touch on our overall strategy and treatment of diagnostics, and how they interact with multiple consumers, I brought together a few people internally to discuss. We can use these points as starting points in the next diagnostics weekly meeting. Here's what we're thinking.
Given all of this, I have made some minor updates to the API at the top of this proposal based on the feedback provided in this issue, and I think we should move forward with that.

@felixge With respect to the sync rate knob, I'm currently thinking trace sync rate should probably be a separate discussion and a separate proposal. (That isn't to say we won't follow up, just that it doesn't need to block this proposal.) It's also not totally clear to me that we should have "Period" in there at all. It's about controlling CPU and memory overheads, so it should control them more directly. Given the above, I think it would be good to fold this into the "minimum requirements" idea instead of designating one goroutine and/or package as the knob owner. I am not totally sure what this would look like yet, but I'll give it some thought. We can also discuss this next week.

I'll add this to the diagnostics meeting agenda for next week and I'll report back on what we land on.
@nsrip-dd I thought more about

If we think of

Also, in the context of a user trying to decide what to do, I think

That being said, I recognize these points are somewhat marginal, but we need to make a decision one way or the other, and this is enough to tip the scales for me.
Should it return an error in this case, then? Or are we okay with just a silent no-op?
If we are worried about setters being called after Start, one option would be to define a separate FlightConfig struct that has the fields we need and then have NewFlightRecorder take a *FlightConfig.
Sorry for not posting this sooner, but to follow up on my last message: we reached consensus on the API a couple of diagnostics meetings ago.
That's a good point. The parameter is kind of a hint anyway, so I figured a silent no-op was OK. However...
Yeah, this is something we could certainly do, and just make the parameters immutable. It's a little bit awkward since the zero value becomes meaningful, whereas with the setters we get to say the default is implementation-defined. Unless we say the zero value of any parameter indicates that the caller just wants the implementation default.
It occurs to me that #42502 (comment), which is an accepted proposal, has its setter return an error, as @cherrymui suggested. So maybe we should align with that? There's also no constructor for the type.
We probably should have a constructor function, so it is clear how to create a
I generally dislike having "typestate" in an API, so I agree with the direction of separating configuration and running.
I think this would be fine. So this would look something like:

```go
package trace

type FlightRecorderConfig struct {
	// MinAge is a lower bound on the age of an event in the flight recorder's window.
	//
	// The flight recorder will strive to promptly discard events older than the minimum age,
	// but older events may appear in the window snapshot. The age setting will always be
	// overridden by MaxBytes.
	//
	// If this is 0, the minimum age is implementation defined, but can be assumed to be on the order
	// of seconds.
	MinAge time.Duration

	// MaxBytes is an upper bound on the size of the window in bytes.
	//
	// This setting takes precedence over MinAge.
	// However, it does not make any guarantees on the size of the data WriteTo will write,
	// nor does it guarantee memory overheads will always stay below MaxBytes. Treat it
	// as a hint.
	//
	// If this is 0, the maximum size is implementation defined.
	MaxBytes uint64
}

type FlightRecorder struct {
	...
}

// NewFlightRecorder creates a new flight recorder.
func NewFlightRecorder(config FlightRecorderConfig) *FlightRecorder

func (*FlightRecorder) Start() error
func (*FlightRecorder) Stop() error
func (*FlightRecorder) Enabled() bool
func (*FlightRecorder) WriteTo(w io.Writer) (n int64, err error)
```
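(A usage fragment for this shape of API; the type and method names are the proposed ones above, not a shipped API, and uploadSnapshot stands in for whatever the application does with the trace bytes.)

```go
fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
	MinAge:   10 * time.Second, // keep roughly the last 10 seconds of activity
	MaxBytes: 16 << 20,         // but cap the window near 16 MiB
})
if err := fr.Start(); err != nil {
	log.Fatal(err)
}
defer fr.Stop()

// Later, when the program decides something interesting happened:
var buf bytes.Buffer
if _, err := fr.WriteTo(&buf); err == nil {
	uploadSnapshot(buf.Bytes())
}
```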
@nsrip-dd and I spent some time going over the latest proposal again. It looks good to us, but we'd like a bit more color on the operation of WriteTo():
Yes, it cuts off an on-going generation immediately so you have up-to-date data. I don't think that's necessary to document, really, since I think that's the most intuitive and useful behavior it could have. Though I guess we could say something like "best-effort up-to-date data as of when WriteTo returns" so nobody has to second-guess it.
Calling multiple

Multiple
@mknyszek, would you mind posting the latest version of the full proposed API as a self-contained comment? Thanks!
Proposal: runtime/trace flight recording
Updated: 23 September 2023
Background
"Flight recording" is a technique in which trace data is kept in a conceptual circular buffer, flushed upon request. The purpose of this technique is to capture traces of interesting program behavior, even when one does not know ahead of time when that will happen. For instance, if the web service fails a health check, or the web service takes an abnormally long time to handle a request. Specifically, the web service can identify such conditions when they happen, but the programmer setting up the environment can't predict when exactly they will occur. Starting tracing after something interesting happens also tends not to be useful, because the program has already executed the interesting part.
The Java ecosystem has had this for years through Java's flight recorder. Once the JVM's flight recorder is enabled, the JVM can obtain a trace representing the last few seconds of time. This trace can come from triggers set up in JMX, or by passing a flag to the JVM that dumps a trace on exit.
With the implementation of #60773 now approaching a stable point, hopefully in Go 1.22 we'll have all traces become a series of self-contained partitions. This implementation change presents an opportunity to easily add something similar to the Go execution tracer by always retaining at least one partition that can be snapshotted at any time.
This is also enabled by work in the Go 1.21 release to make traces dramatically cheaper. Because flight recording relies on waiting until something interesting happens, tracing needs to be enabled for a much longer period of time. Enabling flight recording across, for example, a small portion of a production fleet, becomes much more palatable when the tracing itself isn't too expensive.
Design
The core of the design is a new API in the runtime/trace package to enable flight recording. This means that programs can be instrumented with their own triggers.
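(The API listing that belongs here did not survive the copy onto this page. Reconstructed from the names referenced in the Discussion section below — NewFlightRecorder, SetMinAge, SetMaxBytes, WriteTo — it looks approximately like the following; treat the exact signatures as an approximation rather than the proposal's verbatim text.)

```go
package trace

// FlightRecorder represents a flight recording configuration.
type FlightRecorder struct {
	// ...
}

// NewFlightRecorder creates a new flight recording configuration.
func NewFlightRecorder() *FlightRecorder

// SetMinAge sets a lower bound on the age of an event in the flight recorder's window.
func (fr *FlightRecorder) SetMinAge(d time.Duration)

// SetMaxBytes sets an upper bound on the size of the flight recorder's window, in bytes.
func (fr *FlightRecorder) SetMaxBytes(bytes uint64)

// Start begins flight recording.
func (fr *FlightRecorder) Start() error

// Stop ends flight recording.
func (fr *FlightRecorder) Stop() error

// Enabled reports whether the flight recorder is currently active.
func (fr *FlightRecorder) Enabled() bool

// WriteTo snapshots the moving window captured by the flight recorder and writes it to w.
// Concurrent calls return an error (see the Discussion section).
func (fr *FlightRecorder) WriteTo(w io.Writer) (n int64, err error)
```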
Implementation

Most of the implementation work has already been done. I think it would be OK to ship the implementation in golang.org/x/exp/trace, even though it has some inefficiencies (like having to copy buffers outside the runtime). We could make this more efficient by reference-counting the runtime buffers to avoid a copy.
The main thing that definitely needs to change, however, is that the flight recorder needs to be able to run simultaneously with a call to trace.Start, which is currently not possible, since golang.org/x/exp/trace uses trace.Start itself. This can be implemented by having the trace reader goroutine (misnomer) write buffers to multiple writers. All we need to do is call traceAdvance simultaneously with adding a new writer, so that the new writer always starts receiving data on a generation boundary.

Discussion
SetMinAge and SetMaxBytes could give more rigid guarantees, but it's both complex to implement and not terribly useful. The primary use-case for SetMinAge is to allow users to ask for longer traces (for example, if a web service's "long request" means something much longer than a handful of seconds). Meanwhile the primary use-case of SetMaxBytes is to control memory overheads and limit the damage caused by a large SetMinAge.

WriteTo could allow multiple goroutines to call it since it could easily serialize them internally. However, this can create some bad situations. For instance, consider some snapshot trigger condition that causes multiple goroutines to call WriteTo. The call is heavyweight and they'll queue up behind each other; the longest one will likely take quite a while to resolve, and the application will be significantly disrupted. It'll also produce traces that aren't very useful (consisting of short partitions corresponding approximately to the duration of the last WriteTo call) unless we also allow for multiple goroutines to read the same partition's buffers. However, that's going to be fairly complicated to implement, and also doesn't really add much value either, since it's just duplicate data. The current design reduces the risk of run-time panics while also side-stepping these issues by returning an error in this case.

Alternatives considered
External circular buffer
@dominikh has suggested adding a similar feature to gotraceui. Because the partitioning is actually baked into the trace's structure, it's possible for trace consumers to implement something similar themselves. The only thing missing is a standard streaming endpoint for execution traces (see follow-up work).
However, there are a few advantages to adding support directly to the standard library and runtime.
- Programs aren't limited to triggers tied to the runtime/trace package's annotations, but can instead decide when to grab a snapshot directly.
- Control over the partitioning parameters (SetPeriod and SetSize). Any external solution will be at the mercy of the runtime's defaults.

Despite these advantages, it's likely worth pursuing support for such a use-case even if the API described in this proposal is made available. A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic. Handling traces externally also means the ability to perform ad-hoc analyses without the need for additional instrumentation.
Follow-up work
Add support for trace streaming endpoints to net/http/pprof

As stated in the discussion of the "external circular buffer" alternative, we could support that alternative easily and well by just adding a standard debugging endpoint for streaming trace data. It probably makes the most sense to just add new query parameters to the existing trace endpoint; the details of that can be hashed out elsewhere.
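(As a rough illustration only — the handler below streams indefinitely rather than for a caller-specified duration like the existing /debug/pprof/trace endpoint, and the path and behavior are assumptions, not part of this proposal.)

```go
import (
	"net/http"
	"runtime/trace"
)

// streamTrace streams execution trace data to the client until it disconnects.
// Today only one trace consumer can be active at a time, so trace.Start fails
// if tracing is already running.
func streamTrace(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/octet-stream")
	if err := trace.Start(w); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer trace.Stop()
	<-r.Context().Done() // keep streaming until the client goes away
}
```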