Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: add registry of unfinished local root Spans #58490

Merged
merged 1 commit into from
Jan 7, 2021

Conversation

tbg
Copy link
Member

@tbg tbg commented Jan 6, 2021

A local root span is a Span which was created without the
WithParentAndAutoCollection option, meaning that either
it is the root of the trace, or it is a child whose parent
will not include the child in its recording.

A new method Tracer.VisitSpans is added which allows
iterating through these Spans.

Touches #55592.

Release note: None

@tbg tbg added A-kv-observability A-tracing Relating to tracing in CockroachDB. labels Jan 6, 2021
@tbg tbg requested review from knz and irfansharif January 6, 2021 09:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the tracing-registry branch from 584b1c8 to 634d0fa Compare January 6, 2021 09:44
@tbg tbg marked this pull request as ready for review January 6, 2021 09:44
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection here, but I'd like to pick up the argument I've served you before, that we want completed spans to also visible through a global registry (i.e. the ability to view trace spans for fast queries, when they may have completed already at the instant when we look into the registry).

My original idea for this was to keep all the trace entries in a cyclic buffer (not the spans), so that memory consumption remains bounded regardless of trace complexity. What do you say to that?

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to that idea in general, but I don't think its time is now - we are not too far off the stability period and want to make sure that we deliver always-on tracing and observability into hanging requests. Until these basic commitments are checked off, we need to be mindful of feature creep. And looking at the remaining plans for the release (in particular CC-serverless), I do not think we will have additional bandwidth here, even in the best case.

Note that anyone wanting to track all requests can point CRDB at Jaeger. This was a bad idea so far due to the giant overhead of verbose tracing, but now you can, in principle (with small code changes) send non-verbose traces to Jaeger. (And similary it is straightforward to offer an OpenTracing backend that simply writes all traces to files, for later import into Jaeger). With this option around the corner, there bar for considering doing something fancier here is very high.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/tracing/tracer.go, line 623 at r1 (raw file):

		sl = append(sl, sp)
	}
	t.activeSpans.Unlock()

curious: Is allocating the slice under a lock really that bad? Wouldn't we end up reallocating again anyway if there was a new active span added while allocating outside the lock?

A local root span is a Span which was created without the
`WithParentAndAutoCollection` option, meaning that either
it is the root of the trace, or it is a child whose parent
will not include the child in its recording.

A new method `Tracer.VisitSpans` is added which allows
iterating through these Spans.

Touches cockroachdb#55592.

Release note: None
@tbg tbg force-pushed the tracing-registry branch from 634d0fa to de915fe Compare January 7, 2021 08:37
@tbg tbg requested review from irfansharif and knz January 7, 2021 08:38
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r=irfansharif

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/util/tracing/tracer.go, line 623 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

curious: Is allocating the slice under a lock really that bad? Wouldn't we end up reallocating again anyway if there was a new active span added while allocating outside the lock?

No, you're right, this is premature optimization, and not of the useful kind. It could help - but could not. I guess in a steady state you're about as likely to see more in the map after than before than not, so it helps in 50% of cases roughly. Either way, since it's not clear that we need to do this anyway, I removed it.

@craig
Copy link
Contributor

craig bot commented Jan 7, 2021

Build succeeded:

@craig craig bot merged commit 4e74b55 into cockroachdb:master Jan 7, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 13, 2021
With always-on tracing, we're maintaining in-memory registry of active
spans (cockroachdb#58490). Spans are added and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up.

To that end we add a check in our TestCluster.Stop codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in cockroachdb#58721.

---

This check is currently failing, given we're actually leaking spans in
`txnState.resetForNewSQLTxn`. Fixing that doesn't look simple, it ties
into questions around draining SQL and rolling back open txns.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 19, 2021
With always-on tracing, we're maintaining in-memory registry of active
spans (cockroachdb#58490). Spans are added and removed from this registry when
they're `Start()`-ed and `Finish()`-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up[*].

To that end we add a check in our `TestCluster.Stop` codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in cockroachdb#58721.

---

[*] For known issues like `txnState.resetForNewSQLTxn`, we'll
temporarily introduce a `WithBypassRegistry` option. Fixing that doesn't
look simple, it ties into questions around draining SQL and rolling back
open txns.

Release note: None
craig bot pushed a commit that referenced this pull request Jan 19, 2021
56812: kvserver: replace multiTestContext with TestCluster in client_raft_test r=lunevalex a=lunevalex

Makes progress on #8299
Fixes #40351
Fixes #57560
Fixes #57537

multiTestContext is a legacy construct that is deprecated in favor of running
tests via TestCluster. This is one PR out of many to remove the usage of
multiTestContext in the client_raft test cases. This does not remove all the
uses of mtc, just the simple ones. Leaving the more complex uses cases for a later PR.

With this switch we can also clean up some TestingKnobs and TestServer interfaces.
    - DisablePeriodicGossips flag is removed, it does not work with TestCluster
      and is no longer used
    - DontPreventUseOfOldLeaseOnStart flag is removed, it did not work consistently
      in TestCluster. This flag tries to leave the Lease on the same node after a
      restart, but CRDB makes no such guarantees in the real world and artificially
      testing it does not prove anything. The affected tests were re-worked to
      not rely on this condition and can deal with a lease holder moving on a restart.
    - GetRaftLeader is ported from multiTestContext to TestCluster

Release note: None


58265: sql: fix substring(byte[]) to treat input as raw bytes without escaping r=solongordon a=rafiss

fixes #57367 

Release note (bug fix): The substring function on byte arrays would
treat its input as unicode code points, which would cause the wrong
bytes to be returned. Now it only operates on the raw bytes.

Release note (bug fix): The substring(byte[]) functions were not able to
interpret bytes that had the `\` character since it was treating it as
the beginning of an escape sequence. This is now fixed.

58902: tracing,testutils: detect span leaks r=irfansharif a=irfansharif

With always-on tracing, we're maintaining in-memory registry of active
spans (#58490). Spans are added and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up.

To that end we add a check in our TestCluster.Stop codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in #58721.

---

This check is currently failing, given we're actually leaking spans in
`txnState.resetForNewSQLTxn`. Fixing that doesn't look simple, it ties
into questions around draining SQL and rolling back open txns.

Release note: None

59049: sql: implement alter sequence/view owner r=arulajmani a=RichardJCai

Resolves #57965

Release note (sql change): Add support for ALTER VIEW/SEQUENCE OWNER TO commands.

59055: colexec: fix external aggregator fallback and bool agg functions reset r=yuzefovich a=yuzefovich

Previously, `reset` method of the ordered aggregator would always set
the flag to reset the internal batch to `true`. However, that batch is
only allocated when `Next` is called at least once with a non-zero batch
coming from the input, which is not the case when the fallback to the
disk-backed strategy occurs in the external aggregator (there, we call
`reset` before we use the operator every time). This would lead to
a nil pointer exception, and it is now fixed.

Our unit tests didn't catch it because we forgot to set the forced
number of repartitions which is now also fixed.

This also revealed a bug with resetting of `bool_and` and `bool_or`
aggregates - we forgot to reset whether they have seen a non-null value
or not.

Fixes: #59043.

Release note: None (no stable release with these bugs)

Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-observability A-tracing Relating to tracing in CockroachDB.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants