Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: untangle local/remove parent vs recording collection #72898

Merged

Conversation

andreimatei
Copy link
Contributor

Before this patch, two concepts were bundled together: whether or not
the parent of a span is local or remote to the child's node, and whether
or not the parent should include the child in its recording. You could
create a child span with either:

  • the WithParentAndAutoCollection option, which would give you a local
    parent which includes the child in its recording, or
  • the WithParentAndManualCollection option, which would give you a
    remote parent (which naturally would not / couldn't include the child
    in its recording).

The WithParentAndManualCollection is sometimes used even when the parent
is local, which is a) quite confusing, because at a lower level it
produced what was called a "remote parent" and b) sub-optimal, because
the child looked like a "local root", which meant it had to be put into
the active spans registry. Were it not for the bundling of the two
concerns together, such a child of a local parent, but for whom the
parent is not in charge of collecting the recording, would not need to
be put directly into the span registry; such a span (like other child
spans) should be part of the registry indirectly, through its parent -
which is cheaper because it doesn't add contention on the global mutex
lock.

The funky case of a local parent whose recording does not include the
child's is used by DistSQL processors - each processor gets a span whose
recording is collected by the DistSQL infrastructure, independent of the
parent.

This patch untangles the parent-child relationship from the
recording collection concern, and cleans up the API in the process. Now
you have the following options for creating a span:

WithParent(parent) - creates a parent/child relationship. The parent has
a reference to the child. The child is part of the active spans registry
through the parent. Unless WithDetachedRecording() is also specified,
the parent's recording includes the child's.

WithRemoteParent(parentMeta) - creates a parent/child relationship
across process boundaries.

WithDetachedRecording() - asks the parent to not include the new child
in its recording. This is a no-op in case WithRemoteParent() is used.
This option is used by DistSQL.

So, we get a cleaner API, and a more efficient active spans registry
implementation because the DistSQL spans are no longer directly part of
it. Also, we open the door for a new possibility in the future: you can
imagine that, in certain scenarios, it'd be interesting for a span to
collect the recording of all its children, including the ones whose
recording, under normal circumstances, is not the parent's
responsibility to collect. For example, when asking for the recording of
a span in the registry in an out-of-band way (i.e. when debugging), I
want all the info I can get. So, I should be able to point to a DistSQL
flow span and ask it for the recording of all processors on its node.
This is not implemented yet.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the tracing.local-parent-without-collection branch 3 times, most recently from c3ef960 to 8952cc8 Compare November 18, 2021 22:23
@andreimatei andreimatei marked this pull request as ready for review November 18, 2021 22:23
@andreimatei andreimatei requested review from a team as code owners November 18, 2021 22:23
@andreimatei andreimatei requested review from a team, ajwerner and abarganier and removed request for a team and ajwerner November 18, 2021 22:23
@andreimatei
Copy link
Contributor Author

Only the last 3 commits are to be reviewed here. The rest are test fixes sent out separately.

@andreimatei andreimatei force-pushed the tracing.local-parent-without-collection branch from 8952cc8 to 648c169 Compare November 21, 2021 20:00
Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with a nit and minor question. Overall I think this decoupling makes a lot of sense, and makes API much more intuitive. Well done!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @andreimatei)


pkg/util/tracing/crdbspan.go, line 652 at r17 (raw file):

"should have been handled above"

nit: can this be more descriptive? I imagine this message would be cryptic if we somehow hit this condition.


pkg/util/tracing/span_options.go, line 64 at r17 (raw file):

parentDoesNotCollectRecording bool

Why make this field private when the rest are public?

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for all the changes other than the changes to the tracing package in the last commit. For that you can take a review from someone else.

Reviewed 18 of 18 files at r15, 12 of 12 files at r16, 26 of 35 files at r17, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @andreimatei, and @knz)


pkg/kv/kvserver/replica_application_decoder.go, line 156 at r17 (raw file):

					opName,
					// NB: Nobody is collecting the recording of this span; we have no
					// mechanism for it.

Does this need a followup issue?

Creating a child with a different Tracer than the parent is
broken because the child span can leak: if the parent is Finish()ed
first, the child gets inserted into the parent's registry:
https://github.com/cockroachdb/cockroach/blob/d60e17a7047bb16b6aed1585bc5d274bd748d04e/pkg/util/tracing/crdbspan.go#L186

However, when the child gets Finish()ed, we attempt to remove it from
its Tracer's registry.

Mixing Tracers like this seems fundamentally a bad idea, because it is
inherently unclear what the behavior with respect to the registries
should be. This patch adds an assertion at span creation time checking
that the two Tracers involved are one and the same.

Release note: None
@andreimatei andreimatei force-pushed the tracing.local-parent-without-collection branch from 648c169 to b3bcd84 Compare November 30, 2021 15:38
Before this patch, two concepts were bundled together: whether or not
the parent of a span is local or remote to the child's node, and whether
or not the parent should include the child in its recording. You could
create a child span with either:
- the WithParentAndAutoCollection option, which would give you a local
  parent which includes the child in its recording, or
- the WithParentAndManualCollection option, which would give you a
  remote parent (which naturally would not / couldn't include the child
  in its recording).

WithParentAndManualCollection is sometimes used even when the parent
is local, which is a) quite confusing, because at a lower level it
produced what was called a "remote parent" and b) sub-optimal, because
the child looked like a "local root", which meant it had to be put into
the active spans registry. Were it not for the bundling of the two
concerns together, such a child of a local parent, but for whom the
parent is not in charge of collecting the recording, would not need to
be put directly into the span registry; such a span (like other child
spans) should be part of the registry indirectly, through its parent -
which is cheaper because it doesn't add contention on the global mutex
lock.

The funky case of a local parent whose recording does not include the
child's is used by DistSQL processors - each processor gets a span whose
recording is collected by the DistSQL infrastructure, independent of the
parent.

This patch untangles the parent-child relationship from the
recording collection concern, and cleans up the API in the process. Now
you have the following options for creating a span:

WithParent(parent) - creates a parent/child relationship. The parent has
a reference to the child. The child is part of the active spans registry
through the parent. Unless WithDetachedRecording() is also specified,
the parent's recording includes the child's.

WithRemoteParent(parentMeta) - creates a parent/child relationship
across process boundaries.

WithDetachedRecording() - asks the parent to not include the new child
in its recording. This is a no-op in case WithRemoteParent() is used.
This option is used by DistSQL.

So, we get a cleaner API, and a more efficient active spans registry
implementation because the DistSQL spans are no longer directly part of
it. We also get a new feature: given a span id, the trace registry is
now able to collect the recording of all the children of that span,
including the ones created with WithDetachedRecording. This is nice,
since it'll make inspection of DistSQL traces easier (since they'll look
more like regular traces, rather then a collection of unrelated root
spans that all need to be looked at in isolation).

WithDetachedRecording() is used in two cases (nothing really new here):
- for DistSQL processors, like discussed
- for async tasks which (may) outlive the parent. In particular, the
stopper uses it for its FollowsFromSpan task option. This use of
detached recording is debatable. It's not necessary for correctness,
since there's no particular harm in having the async task be a regular
child. It arguably produces better looking recordings, since they don't
include info on the async task. It can also serve as an optimization
(although currently it doesn't) - if we know that the async span is
likely to outlive the parent, then we might be better off inserting it
directly into the registry, as opposed to first inserting it into the
parent, and only moving it to the registry when the parent Finish()es.

Release note: None
@andreimatei andreimatei force-pushed the tracing.local-parent-without-collection branch from b3bcd84 to 37175f7 Compare November 30, 2021 15:54
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @knz)


pkg/kv/kvserver/replica_application_decoder.go, line 156 at r17 (raw file):

Previously, knz (kena) wrote…

Does this need a followup issue?

I don't think so... It's hard to imagine who'd collect recordings for async application and what they'd do with them...


pkg/util/tracing/crdbspan.go, line 652 at r17 (raw file):

Previously, abarganier (Alex Barganier) wrote…
"should have been handled above"

nit: can this be more descriptive? I imagine this message would be cryptic if we somehow hit this condition.

well I'm not sure what to say other than "shouldn't get here". If anyone runs into this, they'll have to read the source for context. More realistically, they'll need to read the source from the time when the assertion was introduced. I think it's a fairly common pattern to have complete switches like this and explicitly mark the impossible branches.


pkg/util/tracing/span_options.go, line 64 at r17 (raw file):

Previously, abarganier (Alex Barganier) wrote…
parentDoesNotCollectRecording bool

Why make this field private when the rest are public?

Done.

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @knz)

@craig
Copy link
Contributor

craig bot commented Nov 30, 2021

Build failed:

@andreimatei
Copy link
Contributor Author

TestComposeGSS flake

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 30, 2021

Build succeeded:

@craig craig bot merged commit 2e038a1 into cockroachdb:master Nov 30, 2021
@andreimatei andreimatei deleted the tracing.local-parent-without-collection branch January 22, 2022 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants