Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dnm] tracing: sketching out tracing api #55859

Closed

Conversation

irfansharif
Copy link
Contributor

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 201022.tracing-api-design branch from ad7bdbc to 3cb67f0 Compare October 22, 2020 15:04
@tbg
Copy link
Member

tbg commented Oct 23, 2020

Wanted to share an insight I had that mulling over our tracing discussion last night. As I had mentioned, the recordings API that we currently have is not good. Status quo:

  • spans give you recordings, but there are spans that give you the local child span recordings as well, and others that explicitly avoid doing that (different use cases in SQL code)
  • you can shove recordings into existing spans - this is actually how we propagate the span trace as a whole to one place: you get a recording from an RPC boundary, shove it into your local span, so it becomes part of the local root span, etc, until everything flows into one span at the root of the trace tree.

Instead I think with Raphael's idea of centering the storage around the registry, it becomes very appealing to simplify as follows:

  • span is associated with a recording for only that span, but the recordings on a node are all stored in a registry.
  • each span for a trace is associated with a trace id (i.e. same trace id for all spans in trace)
  • we can ask registry for recordings for a traceid (getting in effect recordings of all spans for that trace tracked in the registry)
  • when receiving a span at the client end of RPC boundary (i.e. with the RPC response), we put it into the registry (note that the recording gives us the spanID)

I think this is pretty much as idiomatic as can be and also is much closer to how a real trace collector would work.

One consideration here is about the recording data lifecycle: we want the data do be dropped from the local registry the moment the span (local root span) is done, which means recordings can't be accessed after .Finish(). This isn't true today but it seems like a good design as well as Finish suggests that the span is no longer to be accessed.

Pseudocode:

var tr *Tracer // Tracer = Registry in this example at least

func rpcServerMiddleware(req, traceID, parentSpanID) (Response, error) {
	// NB: sp is a local root span since ParentSpanID won't be known locally
	sp := tr.StartSpan(traceID, Options{Parent: ParentSpanID})
	resp, err := do(req) // assume err=nil because pseudocode
	// Get recording for all spans derived from sp. Note that by "induction" this also
        // has all of the additional remote spans that descended from this local root span,
        // see the client middleware below.
	recs := tr.GetTransitiveRecording(traceID, sp.SpanID)
	sp.Finish()
	resp.Recordings = recs
	return resp, err
}

func rpcClientMiddleware(req, traceID, spanID) (Response, error) {
	req.TraceID, req.ParentSpanID = traceID, spanID
	resp, err := sendRPC(req)
	tr.AddRecordings(traceID, resp.Recordings)
}


func explainTrace(stmt string) {
	sp := tr.StartSpan()
	exec(sp, stmt)
	recs := tr.GetRecordings(sp.TraceID())
	renderRows(recs)
        sp.Finish() // this drops the traceID from `tr` completely
}

There are some caveats here when child spans outlive their parents, but they boil down to just accepting that such spans will only be collected partially - anything that happens after the local root span finishes is never exposed to users. That seems fine. Opentracing has a "follows from" reference type to make this idiomatic, which we may want to implement as well to be able to pass it to external OpenTracing instances, but it won't make much of a difference in our internal tracing system.

@tbg
Copy link
Member

tbg commented Oct 23, 2020

I should add that the functionality in the proposed API is the same as current API. Things just seem to have moved into their proper places and it makes intuitive sense now, to me, at least.

@irfansharif irfansharif deleted the 201022.tracing-api-design branch January 4, 2021 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants