
tracing,tracingservice: adds a trace service to pull clusterwide trace spans #65559

Merged: 1 commit merged into cockroachdb:master on Jun 8, 2021

Conversation

adityamaru (Contributor):

Previously, every node in the cluster had a local inflight span
registry that was aware of all the spans that were rooted on that
particular node. Child spans of a given traceID executing on a remote
node would only become visible to the local registry once execution
completes, and the span pushes its recordings over gRPC to the
"client" node.

This change introduces a tracingservice package.
Package tracingservice contains a gRPC service to be used for
remote inflight span access.

It is used for pulling inflight spans from all CockroachDB nodes.
Each node will run a trace service, which serves the inflight spans from the
local span registry on that node. Each node will also have a trace client
dialer, which uses the nodedialer to connect to another node's trace service,
and access its inflight spans. The trace client dialer is backed by a remote
trace client or a local trace client, which serves as the point of entry to this
service. Both clients support the TraceClient interface, which includes the
following functionality:

  • GetSpanRecordings

The spans for a traceID are sorted by StartTime before they are
returned. The per-node trace dialer has yet to be hooked up; where it is
wired in will depend on where we intend to use it.
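
For illustration, the interface described above might look roughly like this (a sketch based on this description; the exact method set and signatures in pkg/util/tracingservice may differ):

```go
package tracingservice

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingpb"
)

// TraceClient is the interface implemented by both the local and the remote
// trace clients, as described above. The signature shown here is an
// assumption for illustration, not the final API.
type TraceClient interface {
	// GetSpanRecordings returns recordings of all inflight spans on the
	// target node that belong to the given trace, sorted by StartTime.
	GetSpanRecordings(ctx context.Context, traceID uint64) ([]tracingpb.RecordedSpan, error)
}
```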

Resolves: #60999
Informs: #64992

Release note: None

@cockroach-teamcity (Member): This change is Reviewable

@adityamaru force-pushed the tracing-prototype branch from a7bd2d0 to d67bb84 on May 26, 2021 09:55
@abarganier (Contributor) left a comment:

Nice work, very straightforward and easy to reason about. Just a few nits but mostly :lgtm:!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @irfansharif)


pkg/util/tracingservice/client.go, line 59 at r1 (raw file):

Quoted 9 lines of code…
	err := c.tracer.VisitSpans(func(span *tracing.Span) error {
		if span.TraceID() != traceID {
			return nil
		}
		for _, rec := range span.GetRecording() {
			resp.SpanRecordings = append(resp.SpanRecordings, rec)
		}
		return nil
	})

nit: any chance we can union this anon. function with what's used in tracingservice/service.go's GetSpanRecordings without too much hassle?
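
For example, the shared helper being suggested could look something like this (a hedged sketch; the helper name is made up, and the eventual refactor resolved the duplication differently — see the author's reply below):

```go
// getSpanRecordingsForTrace is a hypothetical shared helper that both the
// local trace client and the service's GetSpanRecordings could call. It walks
// the tracer's inflight span registry and collects the recordings belonging
// to the requested trace.
func getSpanRecordingsForTrace(
	tracer *tracing.Tracer, traceID uint64,
) ([]tracingpb.RecordedSpan, error) {
	var recordings []tracingpb.RecordedSpan
	err := tracer.VisitSpans(func(span *tracing.Span) error {
		if span.TraceID() != traceID {
			return nil
		}
		recordings = append(recordings, span.GetRecording()...)
		return nil
	})
	return recordings, err
}
```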


pkg/util/tracingservice/service.go, line 43 at r1 (raw file):

blob service

nit: trace service

@irfansharif (Contributor) left a comment:

Sorry for the massive review lag! I've left a few comments, but this is mostly what I expected.

What motivated the separated package structure? Circular dependencies? I'd personally prefer just putting everything into pkg/util/tracing. tracing.Service or tracing/service feel more concise than tracingservice.Service. None of this precludes us from pulling it out into its own thing later down the line.

It would also be worth considering limiting the amount of in-use memory for this tracing service. I'm not sure what that would look like, but the fan-out here plus RPC-ing over blobs of data could mean a lot of trace data held in memory for all the GetSpanRecordingsFromCluster calls happening concurrently. Is it something we can attach a budget to?

(Resolved review threads on pkg/server/server_sql.go, pkg/util/tracingservice/service_test.go, and pkg/util/tracingservice/client_test.go.)
}

func TestTraceClientGetSpanRecordings(t *testing.T) {
localNodeID := roachpb.NodeID(1)
Contributor:

Have you considered using testcluster instead of "raw tracers"? I think that'd be the better thing to do for these fake-distributed tests, because you're interested in making sure that the "wiring" is all correct (i.e. we find the right set of nodes, etc.), and that the right service endpoints are registered with gRPC, etc. Right now it feels like it wants to be an integration test, but is setting up all the integration by hand.

adityamaru (Contributor Author):

yeah good point, changed to using a TC with 2 nodes. Unfortunately, we can't have tracingservice depend on testcluster or serverutils because we get into a dependency cycle:

tracingservice -> testcluster -> server -> tracingservice

So I had to move the tests into their own subpackage.

Contributor:

What's customary in these cases is using package tracingservice_test for the test files, and keeping them in the same package. Is that possible here?
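
A rough sketch of the TestCluster-based shape being discussed, using the tracingservice_test package suggested above (API names from CockroachDB's base/testcluster packages; the actual test differs in its details):

```go
package tracingservice_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestTraceClientGetSpanRecordings(t *testing.T) {
	ctx := context.Background()
	// Two nodes, so that the client exercises both the local and the remote
	// span-collection paths with real gRPC wiring.
	tc := testcluster.StartTestCluster(t, 2 /* nodes */, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// ... start spans against each node's Tracer, then assert that the trace
	// client returns the expected recordings for a given trace ID.
}
```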

return res, err
}
var mu syncutil.Mutex
res = append(res, localSpanRecordings.SpanRecordings...)
Contributor:

Is this a pattern we use elsewhere? Sidestepping an RPC to talk to the local service directly? Given it's just observability, I think it's fine to not distinguish the local node and issue RPCs everywhere. I think gRPC is smart enough to short circuit the local request.

adityamaru (Contributor Author):

Yeah, we use this in the blob service for nodelocal, but I agree probably not worth the additional complexity for observability.

(Two resolved review threads on pkg/util/tracingservice/client.go.)
@adityamaru (Contributor Author) commented May 31, 2021:

What motivated the separated package structure? Circular dependencies? I'd personally prefer just putting everything into pkg/util/tracing. tracing.Service or tracing/service feel more concise than tracingservice.Service. None of this precludes us from pulling it out into its own thing later down the line.

Yeah, I had tried putting all of this in tracing initially but tracing can't depend on rpc, roachpb, migrationcluster. It seems like it can't depend on anything in cockroach except for github.com/cockroachdb/cockroach/pkg/util.

edit: just saw your comment about not depending on migrationcluster, so that isn't a problem but I think the other two still are.

@adityamaru force-pushed the tracing-prototype branch from d67bb84 to d125299 on May 31, 2021 21:21
@blathers-crl blathers-crl bot requested a review from irfansharif May 31, 2021 21:21
@adityamaru (Contributor Author) commented May 31, 2021:

Nice work, very straightforward and easy to reason about. Just a few nits but mostly :lgtm:!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @irfansharif)

pkg/util/tracingservice/client.go, line 59 at r1 (raw file):

Quoted 9 lines of code…
nit: any chance we can union this anon. function with what's used in tracingservice/service.go's GetSpanRecordings without too much hassle?

the refactor seems to have gotten rid of this.

pkg/util/tracingservice/service.go, line 43 at r1 (raw file):

blob service

nit: trace service

done.

@adityamaru (Contributor Author):

@irfansharif I have not addressed the memory monitor concern yet. I'm thinking about it but want to limit the size of this PR so might do it as a follow-up if that's okay?

@adityamaru adityamaru requested a review from abarganier May 31, 2021 21:26
@adityamaru force-pushed the tracing-prototype branch from d125299 to 7d14a8a on May 31, 2021 21:36
@abarganier (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif)


pkg/util/tracingservice/client.go, line 128 at r1 (raw file):

I think it's fine to not distinguish the local node and issue RPCs everywhere. I think gRPC is smart enough to short circuit the local request.

Worth leaving a comment on the corresponding for-loop noting this assumption?
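
Such a loop comment might read roughly as follows (a sketch; the helper name is made up, and the real code may fan out concurrently rather than in a plain serial loop):

```go
// Issue the GetSpanRecordings RPC to every node in the cluster, including
// the local one. We deliberately do not special-case the local node: gRPC
// is expected to short-circuit requests addressed to our own address, and
// the extra hop is acceptable for an observability-only code path.
for _, nodeID := range nodeIDs {
	resp, err := getSpanRecordingsFromNode(ctx, nodeID, traceID) // hypothetical per-node helper
	if err != nil {
		return nil, err
	}
	res = append(res, resp.SpanRecordings...)
}
```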

}
}

func setupTraces(t1, t2 *tracing.Tracer) (uint64, uint64, func()) {
Contributor:

I'd benefit from a top-level comment here, perhaps with a diagram, that explains the trace structure being set up here. It'd then be easy to answer at a quick glance what the trace should look like from each observer (t1, t2, for each traceID/span).

adityamaru (Contributor Author):

Done, hope the diagram is along the lines of what you were thinking.
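
For reference, the kind of top-level comment being asked for might look like the following; the topology shown is hypothetical, reconstructed only from the span names visible elsewhere in this review:

```go
// setupTraces creates two traces whose spans are split across the tracers t1
// and t2, simulating child spans that execute on a remote node, e.g.:
//
//	trace 1:  root (t1) ──── root.remotechild (t2)
//	trace 2:  root2 (t1) ─── child2 (t1) ─┬── root2.child.remotechild (t2)
//	                                      └── root2.child.remotechild2 (t1)
//
// It returns both trace IDs and a cleanup function that finishes every span.
```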


traceNode1 := tracing.NewTracer()
traceNode2 := tracing.NewTracer()
tID1, _, cleanup := setupTraces(traceNode1, traceNode2)
Contributor:

s/tID1/trace1

adityamaru (Contributor Author):

done.

// Start another remote child span on "node 1".
anotherChild2RemoteChild := t1.StartSpan("root2.child.remotechild2", tracing.WithParentAndManualCollection(child2.Meta()))
return root.TraceID(), root2.TraceID(), func() {
root.Finish()
Contributor:

[nit] for _, span := range ... { span.Finish() }

adityamaru (Contributor Author):

done.
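
With that nit applied, the returned cleanup closure would read roughly as follows (a sketch listing only the spans visible in this thread):

```go
return root.TraceID(), root2.TraceID(), func() {
	for _, span := range []*tracing.Span{root, root2, child2, anotherChild2RemoteChild /* ... */} {
		span.Finish()
	}
}
```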

_ context.Context, request *tracingservicepb.SpanRecordingRequest,
) (*tracingservicepb.SpanRecordingResponse, error) {
var resp tracingservicepb.SpanRecordingResponse
err := s.tracer.VisitSpans(func(span *tracing.Span) error {
Contributor:

This is really a comment for TestTracingServiceGetSpanRecordings, but just looking at this code I think the corresponding test for it should be very simple. Set up a tracer and start two or more spans with the same trace ID, and one other with a different trace ID. It's important that these two spans are not derived from one another - we want two or more entries in the tracer's map corresponding to two forked spans from the same trace both ending up on the same node. In the test we'll then ensure that the recordings for spans with the requested trace ID are present, and the span from the other trace isn't.

This RPC is a purely "local" one, it doesn't internally fan out to other nodes, so our tests should be similar. The current test for this guy is doing something funky -- it's setting up a remote tracer and manually importing some remote child span into a local one. I don't think this code deals with any of that, so we're testing more than necessary (and that extra behavior already has coverage in the tests written for GetSpanRecordingsFromCluster).

adityamaru (Contributor Author):

Makes sense, done.
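
A minimal sketch of the simplified test being described (a sketch: tracing option names are from memory of the package at the time, the service.New constructor reflects the rename discussed later in this review, and the request/response field names are assumptions):

```go
func TestTracingServiceGetSpanRecordings(t *testing.T) {
	tracer := tracing.NewTracer()

	// Two spans from the same trace that are not derived from one another in
	// the registry: the forked child registers its own entry because it uses
	// manual collection.
	sp1 := tracer.StartSpan("fork1", tracing.WithForceRealSpan())
	defer sp1.Finish()
	sp2 := tracer.StartSpan("fork2", tracing.WithParentAndManualCollection(sp1.Meta()))
	defer sp2.Finish()
	// A span from an unrelated trace that must not appear in the response.
	other := tracer.StartSpan("other", tracing.WithForceRealSpan())
	defer other.Finish()

	s := service.New(tracer)
	resp, err := s.GetSpanRecordings(context.Background(),
		&tracingservicepb.SpanRecordingRequest{TraceID: sp1.TraceID()})
	require.NoError(t, err)
	require.NotEmpty(t, resp.SpanRecordings)
	for _, rec := range resp.SpanRecordings {
		require.Equal(t, sp1.TraceID(), rec.TraceID)
	}
}
```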

@@ -171,6 +171,8 @@ go_library(
"//pkg/util/timeutil",
"//pkg/util/tracing",
"//pkg/util/tracing/tracingpb",
"//pkg/util/tracingservice",
Contributor:

I think just a slightly better package structure could be defining this entire package under pkg/util/tracing/service -- it's fine for a nested package to depend on a top-level directory, we use this pattern in other places. Importing packages could import it as tracingservice or whatever. I just want to try and avoid two top-level tracing packages, when they could be just the one with multiple sub-packages.

Contributor:

Also see comment elsewhere about the tests subpackage, I don't think we need it.

adityamaru (Contributor Author):

done, thanks.

"github.com/stretchr/testify/require"
)

func TestTracingClientGetSpanRecordings(t *testing.T) {
Contributor:

move setupTraces to this file, use a much simpler (inlined) setup in the other file.

adityamaru (Contributor Author):

done.


@irfansharif (Contributor) left a comment:

LGTM.

I have not addressed the memory monitor concern yet. I'm thinking about it but want to limit the size of this PR so might do it as a follow-up if that's okay?

Yup, looks like this isn't wired up to anything (?), so fine by me.


if err != nil {
return nil, err
}
if !live {
Contributor:

I don't think you care about this for observability. If a node is not live, do we want to error out, or do we want to skip over it (though maybe mention somewhere that we did)? Even then, we'd time out if issuing RPCs to dead nodes. This filter here is basically making sure we don't even try. Perhaps that's what we want, but timing out would give us a convenient place to capture the error and propagate it somehow -- if that's what we want.

Contributor:

I was thinking we'd just use node liveness to retrieve a list of non-decommissioned nodes.

adityamaru (Contributor Author):

good point, changed to only returning non-decommissioned nodes. Also added a comment explaining that we will rely on the RPC timing out so we can surface something along the lines of "trace for node x is missing".

if !live {
return nil, errors.Newf("n%d required, but unavailable", l.NodeID)
}
ns = append(ns, node{id: l.NodeID, epoch: l.Epoch})
Contributor:

I don't think you need the epoch either. Simply returning a list of roachpb.NodeIDs seems sufficient for our purposes -- all we need is a list of nodes to reach out to.

adityamaru (Contributor Author):

Done.
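
The shape of that change might look roughly like this (a sketch; the helper name is made up and the liveness-record fields shown are assumptions about the API):

```go
// nodesFromLiveness returns the IDs of all non-decommissioned nodes in the
// cluster. Dead-but-not-decommissioned nodes are intentionally included: the
// subsequent RPC to such a node simply times out, which gives the caller a
// natural place to report "trace for node X is missing".
func nodesFromLiveness(livenesses []livenesspb.Liveness) []roachpb.NodeID {
	var ns []roachpb.NodeID
	for _, l := range livenesses {
		if l.Membership.Decommissioned() {
			continue
		}
		ns = append(ns, l.NodeID)
	}
	return ns
}
```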

}

// NewTraceClientDialer returns a TraceClientDialer.
func NewTraceClientDialer(
Contributor:

[nit] tracingservice.NewTraceClientDialer doesn't feel like a good/descriptive name. I'm not sure what's better. Something collector maybe? tracing/service.New() could serve local traces through an RPC, tracing/collector.New() could reach out to all the various tracing/services?

adityamaru (Contributor Author):

Changed the struct to TraceCollector, and the constructors to New() now that they are in their own packages.
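
A rough sketch of the resulting collector shape (names from the discussion above; the constructor signature is an assumption, not the merged API):

```go
// Package collector (pkg/util/tracing/collector in the final layout) reaches
// out to every node's tracing service to gather inflight span recordings.

// TraceCollector connects to remote trace services via the nodedialer.
type TraceCollector struct {
	dialer *nodedialer.Dialer
	tracer *tracing.Tracer
}

// New returns a TraceCollector.
func New(dialer *nodedialer.Dialer, tracer *tracing.Tracer) *TraceCollector {
	return &TraceCollector{dialer: dialer, tracer: tracer}
}
```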

var _ tracingservicepb.TracingServer = &Service{}

// NewTracingService instantiates a tracing service server.
func NewTracingService(tracer *tracing.Tracer) *Service {
Contributor:

Usually an anti-pattern to have repetition between package and symbols: https://blog.golang.org/package-names. tracing/service.New seems fine to me.

adityamaru (Contributor Author):

Done.
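
Per that suggestion, the constructor would read roughly as follows (only the rename itself is taken from the review; the struct body is assumed):

```go
// In pkg/util/tracing/service, the package name makes the "Tracing" prefix on
// the constructor redundant.

var _ tracingservicepb.TracingServer = &Service{}

// New instantiates a tracing service server that serves inflight span
// recordings out of the given Tracer's local span registry.
func New(tracer *tracing.Tracer) *Service {
	return &Service{tracer: tracer}
}
```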

@adityamaru force-pushed the tracing-prototype branch 4 times, most recently from 40636f1 to 9aa7cbf, on June 5, 2021 22:54
Previously, every node in the cluster had a local inflight span
registry that was aware of all the spans that were rooted on that
particular node. Child spans of a given traceID executing on a remote
node would only become visible to the local registry once execution
completes, and the span pushes its recordings over gRPC to the
"client" node.

This change introduces a `collector` and `service` package
that together contain a gRPC service to be used for
remote inflight span access.

It is used for pulling inflight spans from all CockroachDB nodes.
Each node will run a trace service, which serves the inflight spans from the
local span registry on that node. Each node will also have a trace collector
which uses the nodedialer to connect to another node's trace service,
and access its inflight spans.

The spans for a traceID are sorted by `StartTime` before they are
returned. The per-node trace collector has yet to be hooked up; where it is
wired in will depend on where we intend to use it.

Release note: None
@adityamaru force-pushed the tracing-prototype branch from 9aa7cbf to 4b4c209 on June 5, 2021 23:59
@adityamaru (Contributor Author):

TFTRs!

bors r=irfansharif,abarganier

@craig (craig bot) commented Jun 8, 2021:

Build succeeded:

@craig craig bot merged commit d1c46b4 into cockroachdb:master Jun 8, 2021
Successfully merging this pull request may close: sql: allow inflight traces to be collected across the cluster