Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: possible regression in the amount of things dropped from the trace #87536

Open
yuzefovich opened this issue Sep 7, 2022 · 5 comments · Fixed by #88414
Open

tracing: possible regression in the amount of things dropped from the trace #87536

yuzefovich opened this issue Sep 7, 2022 · 5 comments · Fixed by #88414
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Sep 7, 2022

I have statement bundles for the same statement that ran on 22.1.6 and 22.2.0-alpha.2. The statement reads a lot of data, so it is expected that some things are dropped due to constant limits defined in tracing/tracer.go. However, the bundle from 22.2.0 has too much stuff that is dropped. It appears as if we started dropping more stuff on 22.2 (I did check that things are propagated correctly when nothing is dropped on a smaller dataset). This can significantly complicate debugging some of the queries, so I'm tagging this as a GA-blocker until this issue is better understood.
22.1.6.zip
22.2.0.zip

Jira issue: CRDB-19403

Epic CRDB-20796

@yuzefovich yuzefovich added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Sep 7, 2022
@dhartunian dhartunian assigned andreimatei and unassigned dhartunian Sep 19, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue Sep 21, 2022
Before this patch, when the recording of a child span was being added to
the parent, if the number of spans in the child recording + the number
of spans in the parent's recording were greater than the span limit
(1000), then the child's recording was completely dropped (apart from
the structured events, which were still retained). So, for example, if
the parent had a recording of 1 span, and the child has a recording of
1000 spans, the whole 1000 spans were dropped.
This patch improves things by always combining the parent trace and the
child trace, and then trimming the result according to the following
arbitrary algorithm:
- start at the root of the trace and sort its children by size, desc
- drop the fattest children (including their descendents) until the
  remaining number of spans to drop becomes smaller than the size of the
  fattest non-dropped child
- recurse into that child, with an adjusted number of spans to drop

So, the idea is that, recursively, we drop parts of the largest child -
including dropping the whole child if needed.

Fixes cockroachdb#87536

Release note: None
@andreimatei
Copy link
Contributor

We think what's going on here is a combination of the shape of the trace changing in 22.1 with the introduction of the Streamer, which acts as a parent for a large number of spans, and a change that made trace recordings imported through DistSQL be subject to the 1000 span limit.
In #88414 I'm changing the way the span limit is implemented, which should result in a larger trace being preserved even when some spans need to dropped.

@andreimatei
Copy link
Contributor

I'm removing GA-blocker; I don't think this change in behavior raises to that level.

@andreimatei andreimatei removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker labels Sep 26, 2022
@yuzefovich
Copy link
Member Author

Another data point where we drop more stuff: consider the bundles attached in #88891. On 22.1.7 the distsql diagram has execution stats for all processors
Screen Shot 2022-09-28 at 11 39 29 AM
but on 22.2.0-beta some processors are missing stats
Screen Shot 2022-09-28 at 11 39 52 AM
I think the query in question doesn't use the lookup and index joins, so the streamer cannot explain this. We did add more tracing to the apply join present in the query which could be partially to blame. Still, I think we should make some improvements to 22.2.x even if this issue is not a GA-blocker.

@andreimatei
Copy link
Contributor

Hmmm I think those execution stats come from structured events, which are not supposed to be dropped by the span limits. This is a query we can reproduce, since it operates on pg_catalog data. Let me see if I can figure something out.

@craig craig bot closed this as completed in b03a24d Sep 30, 2022
@andreimatei andreimatei reopened this Sep 30, 2022
@yuzefovich yuzefovich reopened this Sep 30, 2022
craig bot pushed a commit that referenced this issue Oct 19, 2022
89785: util/tracing: bug fixes and improvements r=andreimatei a=andreimatei

Please see individual commits.

Touches #87536
Epic: None

Co-authored-by: Andrei Matei <[email protected]>
@yuzefovich
Copy link
Member Author

Ran into a possibly related problem (probably not a regression though).

When running EXPLAIN ANALYZE (DISTSQL) of TPCH Q20 some of the execution stats are dropped. This is on current master which includes #89785. Note that with #91117 merged we now use "structured" recording type (rather than verbose) for this (when not collecting bundles). Diagram.

@exalate-issue-sync exalate-issue-sync bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker labels Dec 21, 2022
@dhartunian dhartunian removed GA-blocker branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants