-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: add a node-local registry of inflight spans and metadata #55592
Comments
Copying this comment here for posterity: Wanted to share an insight I had that mulling over our tracing discussion last night. As I had mentioned, the recordings API that we currently have is not good. Status quo:
Instead I think with Raphael's idea of centering the storage around the registry, it becomes very appealing to simplify as follows:
I think this is pretty much as idiomatic as can be and also is much closer to how a real trace collector would work. One consideration here is about the recording data lifecycle: we want the data do be dropped from the local registry the moment the span (local root span) is done, which means recordings can't be accessed after Pseudocode: var tr *Tracer // Tracer = Registry in this example at least
func rpcServerMiddleware(req, traceID, parentSpanID) (Response, error) {
// NB: sp is a local root span since ParentSpanID won't be known locally
sp := tr.StartSpan(traceID, Options{Parent: ParentSpanID})
resp, err := do(req) // assume err=nil because pseudocode
// Get recording for all spans derived from sp. Note that by "induction" this also
// has all of the additional remote spans that descended from this local root span,
// see the client middleware below.
recs := tr.GetTransitiveRecording(traceID, sp.SpanID)
sp.Finish()
resp.Recordings = recs
return resp, err
}
func rpcClientMiddleware(req, traceID, spanID) (Response, error) {
req.TraceID, req.ParentSpanID = traceID, spanID
resp, err := sendRPC(req)
tr.AddRecordings(traceID, resp.Recordings)
}
func explainTrace(stmt string) {
sp := tr.StartSpan()
exec(sp, stmt)
recs := tr.GetRecordings(sp.TraceID())
renderRows(recs)
sp.Finish() // this drops the traceID from `tr` completely
} There are some caveats here when child spans outlive their parents, but they boil down to just accepting that such spans will only be collected partially - anything that happens after the local root span finishes is never exposed to users. That seems fine. Opentracing has a "follows from" reference type to make this idiomatic, which we may want to implement as well to be able to pass it to external OpenTracing instances, but it won't make much of a difference in our internal tracing system. |
A local root span is a Span which was created without the `WithParentAndAutoCollection` option, meaning that either it is the root of the trace, or it is a child whose parent will not include the child in its recording. A new method `Tracer.VisitSpans` is added which allows iterating through these Spans. Touches cockroachdb#55592. Release note: None
A local root span is a Span which was created without the `WithParentAndAutoCollection` option, meaning that either it is the root of the trace, or it is a child whose parent will not include the child in its recording. A new method `Tracer.VisitSpans` is added which allows iterating through these Spans. Touches cockroachdb#55592. Release note: None
58490: tracing: add registry of unfinished local root Spans r=irfansharif a=tbg A local root span is a Span which was created without the `WithParentAndAutoCollection` option, meaning that either it is the root of the trace, or it is a child whose parent will not include the child in its recording. A new method `Tracer.VisitSpans` is added which allows iterating through these Spans. Touches #55592. Release note: None Co-authored-by: Tobias Grieger <[email protected]>
Is your feature request related to a problem? Please describe.
In CockroachDB, traces become visible to the user only when their corresponding requests return. This means that without external tracing enabled, we have a hard time investigating requests that don't return in a timely manner.
Describe the solution you'd like
Provide a per-node component that provides programmatic access to the open traces and their contents. Provide a (bare-bones) debug page that prints these, similar to the net/trace http endpoint, unless we get a SQL table that does the same. Alternatively or in addition, we could also expose these in a format suitable for importing into an existing tracing frontend such as Jaeger.
Always-on tracing provides an additional challenge. This mode will be the default and it implies that information is added to the trace much less frequently. In particular, the last emitted metadata will typically not reflect the "hanging" operation.
Two possible solutions present themselves:
a) some tracing support for "in-flight metadata", i.e. a way to expose metadata before it is "finalized". For example, a transaction conflict could be added to the registry as "inflight", would be updated as the conflict is being resolved, and finally be added to the trace when the conflict handling is complete.
b) for spans that are "long-running" (for a suitable definition of that word), keep a ring buffer of verbose trace messages for the span and use that.
Option a) results in a possibly more complex and error-prone API. Option b) is easier but it means that programmatic diagnosis of "stuck" spans is not possible.
The text was updated successfully, but these errors were encountered: