Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: spans that are never explicitly Finish()-ed #58721

Closed
irfansharif opened this issue Jan 11, 2021 · 5 comments
Closed

*: spans that are never explicitly Finish()-ed #58721

irfansharif opened this issue Jan 11, 2021 · 5 comments
Assignees
Labels
A-kv-observability A-tracing Relating to tracing in CockroachDB.

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Jan 11, 2021

Consider the following two examples (there may be others, will need to audit more thoroughly):

var stmtThresholdSpan *tracing.Span
alreadyRecording := ex.transitionCtx.sessionTracing.Enabled()
stmtTraceThreshold := traceStmtThreshold.Get(&ex.planner.execCfg.Settings.SV)
if !alreadyRecording && stmtTraceThreshold > 0 {
ctx, stmtThresholdSpan = createRootOrChildSpan(ctx, "trace-stmt-threshold", ex.transitionCtx.tracer)
stmtThresholdSpan.SetVerbose(true)
}

ctx, span := tracing.ChildSpanRemote(ctx, "")

The created spans are never explicitly Finish()-ed. This may pose to be a problem with always-on tracing (#55592), where un-Finish()-ed spans take up room in an in-memory registry of active spans (#58490). This is a tracking issue to refer to as we fix these.

@irfansharif irfansharif added A-tracing Relating to tracing in CockroachDB. A-kv-observability labels Jan 11, 2021
@blathers-crl
Copy link

blathers-crl bot commented Jan 11, 2021

Hi @irfansharif, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@tbg
Copy link
Member

tbg commented Jan 11, 2021

I'm not sure they aren't explicitly finished, but it's possible. In some of the SQL code it isn't obvious when the spans are closed as they are tied to the connection. That said - I'd be surprised if there weren't any leaks. We should be able to find them easily after @angelapwen's starter project #55733.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 13, 2021
This is follow-up work from cockroachdb#58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Lets switch
this on by default to shake the implications of doing so. We can
reasonably expect two kinds of fallout:

1. Unexpected blow up in memory usage due to resource leakage (which is
a can be problem now that we're always maintaining open spans in an
internal registry, see cockroachdb#58721)

2. Performance degradataion due to tracing overhead per-request
(something cockroachdb#58712) was spot checking for.

For 1 we'll introduce a future test in a separate PR. For 2, we'll
monitor roachperf over the next few weeks.

---

Also moved some of the documentation for the cluster setting into a
comment form above. Looking at what's rendered in our other cluster
settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pity,
unwrapped description.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 13, 2021
With always-on tracing, we're maintaining in-memory registry of active
spans (cockroachdb#58490). Spans are added and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up.

To that end we add a check in our TestCluster.Stop codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in cockroachdb#58721.

---

This check is currently failing, given we're actually leaking spans in
`txnState.resetForNewSQLTxn`. Fixing that doesn't look simple, it ties
into questions around draining SQL and rolling back open txns.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 18, 2021
This is follow-up work from cockroachdb#58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Lets switch
this on by default to shake the implications of doing so. We can
reasonably expect two kinds of fallout:

(1) Unexpected blow up in memory usage due to resource leakage (which is
a can be problem now that we're always maintaining open spans in an
internal registry, see cockroachdb#58721)

(2) Performance degradataion due to tracing overhead per-request
(something cockroachdb#58712) was spot checking for.

For (1) we'll introduce a future test in a separate PR. For (2), we'll
monitor roachperf over the next few weeks.

---

Also moved some of the documentation for the cluster setting into a
comment form above. Looking at what's rendered in our other cluster
settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pithy,
unwrapped description.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 19, 2021
With always-on tracing, we're maintaining in-memory registry of active
spans (cockroachdb#58490). Spans are added and removed from this registry when
they're `Start()`-ed and `Finish()`-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up[*].

To that end we add a check in our `TestCluster.Stop` codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in cockroachdb#58721.

---

[*] For known issues like `txnState.resetForNewSQLTxn`, we'll
temporarily introduce a `WithBypassRegistry` option. Fixing that doesn't
look simple, it ties into questions around draining SQL and rolling back
open txns.

Release note: None
craig bot pushed a commit that referenced this issue Jan 19, 2021
58897: tracing: enable always-on tracing by default r=irfansharif a=irfansharif

This is follow-up work from #58712, where we measured the overhead for
always-on tracing and found it to be minimal/acceptable. Lets switch
this on by default to shake the implications of doing so. We can
reasonably expect two kinds of fallout:

1. Unexpected blow up in memory usage due to resource leakage (which is
a can be problem now that we're always maintaining open spans in an
internal registry, see #58721)

2. Performance degradataion due to tracing overhead per-request
(something #58712) was spot checking for.

For 1 we'll introduce a future test in a separate PR. For 2, we'll
monitor roachperf over the next few weeks.

---

Also moved some of the documentation for the cluster setting into a
comment form above. Looking at what's rendered in our other cluster
settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pity,
unwrapped description.

Release note: None

58974: opt: suppress logs in benchmarks r=rytaft a=mgartner

As of #57134 passing `-logtostderr=false` as a `TESTFLAG` in benchmarks
errs: `flag provided but not defined: -logtostderr`. The preferred
method for suppressing logs in tests and benchmarks to is add
`defer log.Scope(t).Close(t)` to the top of the test/benchmark
(see #57979).

This commit uses this new method to suppress logs in optimizer
benchmarks.

Release note: None

59009: kv/batcheval: only expose immutable range state to commands r=nvanbenschoten a=nvanbenschoten

The DeclareKeysFunc has always included a full RangeDescriptor, but it
has never been clear which fields in this descriptor are safe to use and
which are not when declaring keys for a request. The concern is that any
property of the RangeDescriptor that is not immutable may change between
the time that a request declares its keys to latch and the time that it
evaluates, so any assumptions based on these mutable fields may not
hold.

The quintessential example of a property of a Range that is not
immutable is its end key. It would be incorrect to declare keys between
a Range's start key and its current end key as a means of latching the
entire range, because a merge of a right-hand neighbor could complete in
between the time that a request declares its keys and the time that it
evaluates. This could lead to a violation of the mutual exclusion that
the command was expecting to have.

This commit makes these kinds of mistakes impossible to make by putting
the RangeDescriptor behind an interface that only exposes the properties
of a Range that cannot change across a Range's lifetime.

59099: kvserver: fix rangelog event logging for non-voter additions r=aayushshah15 a=aayushshah15

Before this patch, we were incorrectly logging non-voter additions as
removals.

Release note: None

59142: sql: fix indentation in information_schema.columns schema r=otan a=arulajmani

Reviewable lied to me.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: arulajmani <[email protected]>
craig bot pushed a commit that referenced this issue Jan 19, 2021
56812: kvserver: replace multiTestContext with TestCluster in client_raft_test r=lunevalex a=lunevalex

Makes progress on #8299
Fixes #40351
Fixes #57560
Fixes #57537

multiTestContext is a legacy construct that is deprecated in favor of running
tests via TestCluster. This is one PR out of many to remove the usage of
multiTestContext in the client_raft test cases. This does not remove all the
uses of mtc, just the simple ones. Leaving the more complex uses cases for a later PR.

With this switch we can also clean up some TestingKnobs and TestServer interfaces.
    - DisablePeriodicGossips flag is removed, it does not work with TestCluster
      and is no longer used
    - DontPreventUseOfOldLeaseOnStart flag is removed, it did not work consistently
      in TestCluster. This flag tries to leave the Lease on the same node after a
      restart, but CRDB makes no such guarantees in the real world and artificially
      testing it does not prove anything. The affected tests were re-worked to
      not rely on this condition and can deal with a lease holder moving on a restart.
    - GetRaftLeader is ported from multiTestContext to TestCluster

Release note: None


58265: sql: fix substring(byte[]) to treat input as raw bytes without escaping r=solongordon a=rafiss

fixes #57367 

Release note (bug fix): The substring function on byte arrays would
treat its input as unicode code points, which would cause the wrong
bytes to be returned. Now it only operates on the raw bytes.

Release note (bug fix): The substring(byte[]) functions were not able to
interpret bytes that had the `\` character since it was treating it as
the beginning of an escape sequence. This is now fixed.

58902: tracing,testutils: detect span leaks r=irfansharif a=irfansharif

With always-on tracing, we're maintaining in-memory registry of active
spans (#58490). Spans are added and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that not explicitly finished
(typically using `defer sp.Finish()`) are now a resource-leak, as they
take up room in the registry. We'll want to find instances of this leak
as soon as they crop up.

To that end we add a check in our TestCluster.Stop codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in #58721.

---

This check is currently failing, given we're actually leaking spans in
`txnState.resetForNewSQLTxn`. Fixing that doesn't look simple, it ties
into questions around draining SQL and rolling back open txns.

Release note: None

59049: sql: implement alter sequence/view owner r=arulajmani a=RichardJCai

Resolves #57965

Release note (sql change): Add support for ALTER VIEW/SEQUENCE OWNER TO commands.

59055: colexec: fix external aggregator fallback and bool agg functions reset r=yuzefovich a=yuzefovich

Previously, `reset` method of the ordered aggregator would always set
the flag to reset the internal batch to `true`. However, that batch is
only allocated when `Next` is called at least once with a non-zero batch
coming from the input, which is not the case when the fallback to the
disk-backed strategy occurs in the external aggregator (there, we call
`reset` before we use the operator every time). This would lead to
a nil pointer exception, and it is now fixed.

Our unit tests didn't catch it because we forgot to set the forced
number of repartitions which is now also fixed.

This also revealed a bug with resetting of `bool_and` and `bool_or`
aggregates - we forgot to reset whether they have seen a non-null value
or not.

Fixes: #59043.

Release note: None (no stable release with these bugs)

Co-authored-by: Alex Lunev <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@tbg
Copy link
Member

tbg commented Jan 20, 2021

#34313 is related. @irfansharif is fixing these (SQL) spans on our plate? I would have assumed SQL execution.

@asubiotto
Copy link
Contributor

Is this different from #59315?

@irfansharif
Copy link
Contributor Author

#59315 should be the last of it I think. The only other issue I'm aware of was #61279, but there's a PR out there already. I'll close it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-observability A-tracing Relating to tracing in CockroachDB.
Projects
None yet
Development

No branches or pull requests

3 participants