util/tracing: trim trace recordings in a smarter way #88414

andreimatei · 2022-09-21T22:28:48Z

Before this patch, when the recording of a child span was added to the
parent, if the number of spans in the child's recording plus the number
of spans in the parent's recording exceeded the span limit (1000), the
child's recording was dropped completely (apart from the structured
events, which were still retained). So, for example, if the parent had a
recording of 1 span and the child had a recording of 1000 spans, all
1000 of the child's spans were dropped.

This patch improves things by always combining the parent trace and the
child trace, and then trimming the result according to the following
arbitrary algorithm:

- start at the root of the trace and sort its children by size, descending
- drop the fattest children (including their descendants) until the
  remaining number of spans to drop becomes smaller than the size of the
  fattest non-dropped child
- recurse into that child, with an adjusted number of spans to drop

So, the idea is that, recursively, we drop parts of the largest child,
including dropping the whole child if needed.
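
For illustration only, the trimming steps above can be sketched roughly as
follows. This is a minimal sketch, not the code in this patch: the `node`
type, `size`, `trim`, and the example tree are all hypothetical names, and
the subtree sizes are recomputed on every call rather than cached.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for one span in a trace tree.
type node struct {
	name     string
	children []*node
}

// size returns the number of spans in the subtree rooted at n, including n.
func (n *node) size() int {
	total := 1
	for _, c := range n.children {
		total += c.size()
	}
	return total
}

// trim drops toDrop spans from the descendants of n. The node n itself is
// never dropped, so callers should pass toDrop < n.size().
func trim(n *node, toDrop int) {
	if toDrop <= 0 || len(n.children) == 0 {
		return
	}
	// Sort the children by subtree size, largest first.
	sort.SliceStable(n.children, func(i, j int) bool {
		return n.children[i].size() > n.children[j].size()
	})
	// Drop whole children, fattest first, until the remaining number of
	// spans to drop is smaller than the fattest surviving child.
	dropped := 0
	for dropped < len(n.children) {
		sz := n.children[dropped].size()
		if toDrop < sz {
			break
		}
		toDrop -= sz
		dropped++
	}
	n.children = n.children[dropped:]
	// Recurse into the fattest surviving child with the adjusted count.
	if toDrop > 0 && len(n.children) > 0 {
		trim(n.children[0], toDrop)
	}
}

// leaves builds k childless nodes.
func leaves(k int) []*node {
	out := make([]*node, k)
	for i := range out {
		out[i] = &node{name: fmt.Sprintf("leaf%d", i)}
	}
	return out
}

// exampleTrace builds a 10-span tree: root, A (+4 leaves), B (+2 leaves), C.
func exampleTrace() *node {
	return &node{name: "root", children: []*node{
		{name: "C"},
		{name: "B", children: leaves(2)},
		{name: "A", children: leaves(4)},
	}}
}

func main() {
	root := exampleTrace()
	limit := 6 // hypothetical span limit for this example
	trim(root, root.size()-limit)
	fmt.Println("spans after trimming:", root.size())
}
```

In this sketch, trimming the 10-span example down to a limit of 6 leaves A in
place but removes its four leaves, because the number of spans to drop (4) is
smaller than A's size (5), so the algorithm recurses into A instead of
dropping it.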

Fixes #87536

Release note: None

andreimatei · 2022-09-21T22:29:37Z

cc @yuzefovich

yuzefovich

Thanks!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @andreimatei)

-- commits line 4 at r1:
I was thinking that we'd backport some improvements to 22.2, but this commit goes against that - what's your take?

BTW we should be getting the dataset from which #87536 was filed some time soon.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @yuzefovich)

-- commits line 4 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I was thinking that we'd backport some improvements to 22.2, but this commit goes against that - what's your take?

BTW we should be getting the dataset from which #87536 was filed some time soon.

If we backport, I'll deal with it in the backport.
But I don't think I'll backport this PR; it's a bit large. Now that we think we understand what's going on, do you think we necessarily need to backport anything at all?
I was thinking that one thing we could backport, if we have to, is an increase in the trace size limit (perhaps from 1k to 5k spans) for traces gathered through statement diagnostics.

andreimatei

friendly ping

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @yuzefovich)

aadityasondhi

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier and @yuzefovich)

Before this patch, trace recordings were represented as []RecordedSpan. This flat slice doesn't make it easy to reconstruct the tree form. This patch switches the tracing library to represent traces as trees. The interface exposed outside of the tracing library keeps the flat format, at least for now. The point of the change is that a further patch will use the tree shape to make smarter decisions about which spans to drop from a trace when the trace is too large. Release note: None
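
To make the flat-versus-tree distinction concrete, here is a rough sketch. It assumes a simplified span that links to its parent by ID; `flatSpan`, `treeNode`, `buildTree`, and `flatten` are hypothetical names and not the tracing library's types. Rebuilding the tree from the flat slice needs an index by span ID, whereas going from the tree back to the flat form is a plain pre-order walk.

```go
package main

import "fmt"

// flatSpan is a hypothetical, simplified span in the flat format.
// ParentID 0 means "no parent" in this sketch.
type flatSpan struct {
	ID       uint64
	ParentID uint64
	Op       string
}

// treeNode is a hypothetical tree representation of a recording.
type treeNode struct {
	span     flatSpan
	children []*treeNode
}

// buildTree reconstructs the tree from a flat slice. The first span whose
// parent is not in the slice becomes the root; later orphans are ignored
// in this sketch.
func buildTree(spans []flatSpan) *treeNode {
	byID := make(map[uint64]*treeNode, len(spans))
	for _, s := range spans {
		byID[s.ID] = &treeNode{span: s}
	}
	var root *treeNode
	for _, s := range spans {
		n := byID[s.ID]
		if parent, ok := byID[s.ParentID]; ok {
			parent.children = append(parent.children, n)
		} else if root == nil {
			root = n
		}
	}
	return root
}

// flatten converts the tree back to the flat format with a pre-order walk.
func flatten(n *treeNode) []flatSpan {
	if n == nil {
		return nil
	}
	out := []flatSpan{n.span}
	for _, c := range n.children {
		out = append(out, flatten(c)...)
	}
	return out
}

func main() {
	spans := []flatSpan{
		{ID: 1, ParentID: 0, Op: "root"},
		{ID: 2, ParentID: 1, Op: "a"},
		{ID: 3, ParentID: 2, Op: "b"},
		{ID: 4, ParentID: 1, Op: "c"},
	}
	tree := buildTree(spans)
	for _, s := range flatten(tree) {
		fmt.Println(s.ID, s.Op)
	}
}
```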

andreimatei

TFTR

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @yuzefovich)

craig · 2022-09-30T17:32:58Z

Build succeeded:

Bazel Essential CI (Cockroach)

andreimatei requested review from abarganier, aadityasondhi and a team September 21, 2022 22:28

andreimatei requested review from a team as code owners September 21, 2022 22:28

andreimatei requested a review from a team September 21, 2022 22:28

andreimatei removed request for a team September 21, 2022 22:29

andreimatei mentioned this pull request Sep 21, 2022

tracing: possible regression in the amount of things dropped from the trace #87536

Open

yuzefovich reviewed Sep 21, 2022

andreimatei changed the title ~~util/tracing: remove old tags field from RecordedSpan~~ util/tracing: trim trace recordings in a smarter way Sep 22, 2022

andreimatei commented Sep 26, 2022

andreimatei commented Sep 28, 2022

aadityasondhi approved these changes Sep 28, 2022

andreimatei added 4 commits September 30, 2022 11:23

util/tracing: remove old tags field from RecordedSpan

3f74af2

This field was maintained for 22.1 compatibility. Release note: None

tracingpb: remove stale reference in comment

9b0a6ab

The referenced field went away a while ago. Release note: None

andreimatei force-pushed the tracing.improve-limit branch from b809388 to b03a24d Compare September 30, 2022 15:23

andreimatei commented Sep 30, 2022

craig bot merged commit aaca5ce into cockroachdb:master Sep 30, 2022

yuzefovich mentioned this pull request Nov 7, 2022

release-22.2: sql: reduce the overhead of EXPLAIN ANALYZE #91208

Closed

pav-kv mentioned this pull request Apr 6, 2023

tracing: sampling in event logs for better coverage #100790

Open

yuzefovich mentioned this pull request Apr 24, 2023

tracing: important bits are dropped from the recorded span with many descendant spans #102102

Open
