Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: number of spans growing unboundedly with blocked txns #59560

Closed
irfansharif opened this issue Jan 28, 2021 · 3 comments
Closed

tracing: number of spans growing unboundedly with blocked txns #59560

irfansharif opened this issue Jan 28, 2021 · 3 comments
Labels
A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@irfansharif
Copy link
Contributor

Describe the problem

I was playing around with #59492. I tried setting up two txns, one that would block on another, and seeing what the traces show (to possibly grab a handle on where the blockage is occurring). Instead I observed that the set of spans captured in our internal registry kept growing unboundedly. The spans were cleared up once the txns were unblocked, but we should figure out why that's happening (this may be the same thing as #59391).

To Reproduce

Terminal 1:

CREATE TABLE kv (k INT PRIMARY KEY, v INT);
INSERT INTO kv (k, v) VALUES (1, 5), (2, 10), (3, 15);
BEGIN;
SELECT * FROM kv WHERE k = 1 FOR UPDATE;

Terminal 2:

BEGIN;
SELECT * FROM kv WHERE k = 1 FOR UPDATE;

Terminal 3:

-- this will keep growing
SELECT COUNT(*) FROM crdb_internal.node_inflight_trace_spans 
WHERE 
trace_id != (
  SELECT trace_id 
  FROM crdb_internal.node_inflight_trace_spans 
  WHERE operation = 'liveness heartbeat loop' -- we don't care about these internal spans
);

Everything is appropriately cleaned up after closing out the T1 txn.

UPDATE kv SET v = v + 5 WHERE k = 1;
COMMIT;
@irfansharif irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-tracing Relating to tracing in CockroachDB. labels Jan 28, 2021
@irfansharif
Copy link
Contributor Author

Again, this is probably not be a "bug" - it's almost certainly existing behaviour, just something I'm not sure how to explain. It could be due to a pattern like #59203 (comment) where we're holding onto an open txn, and thus the associated un-Finish()ed span, so I guess we're also iterating over all short lived children spans. Hm. The fact that we can't distinguish between "finished" spans and in-flight ones could be a bit annoying (we're surfacing start_time and duration, where duration is either how long it took till a particular span was Finish()-ed, or for un-Finish()ed spans, it'd be "current time" - start_time).

@tbg
Copy link
Member

tbg commented Feb 22, 2021

One clear reason why the number of spans is going up is because once our txn starts heartbeating, the heartbeats get traced:

[email protected]:26257/defaultdb  OPEN> select count(operation), operation from crdb_internal.node_inflight_trace_spans group by operation;
  count |                 operation
--------+--------------------------------------------
      1 | materializer
      1 | [async] kv.TxnCoordSender: heartbeat loop
    205 | dist sender send
    205 | /cockroach.roachpb.Internal/Batch
(4 rows)

Time: 4ms total (execution 3ms / network 0ms)

@tbg
Copy link
Member

tbg commented Mar 4, 2021

For Irfan's repro above:

root@:26257/defaultdb> select count(operation) as n, operation from crdb_internal.node_inflight_trace_spans group by operation order by n desc;
   n   |                 operation
-------+--------------------------------------------
  1402 | dist sender send
  1402 | /cockroach.roachpb.Internal/Batch
     8 | txn coordinator send
     3 | flow
     3 | exec portal
     3 | table reader
     2 | sql txn
     2 | prepare stmt
     2 | [async] kv.TxnCoordSender: heartbeat loop
     2 | bind stmt
     2 | materializer
     1 | get-jobs
     1 | AUTO CREATE STATS-638283665577508865
     1 | sortAll
     1 | join reader
     1 | job
     1 | [async] job-638283665577508865
     1 | sampler
     1 | [async] monitoring pusher txn
     1 | sample aggregator
     1 | load-job-query
     1 | sync
(22 rows)

This is consistent with two heartbeat loops sending a batch once per second. I see the count increase at approximately 2/sec.
I think there's not much else to do here other than update #22615, which I just did.

@tbg tbg closed this as completed Mar 4, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 9, 2021
In cockroachdb#59560 we've seen that this can grow unboundedly. Set a limit out of
an abundance of caution. Fixes cockroachdb#59188.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

2 participants