
kv: apparent deadlock waiting for pushes #59377

Closed
nick-jones opened this issue Jan 25, 2021 · 10 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@nick-jones

Describe the problem

We have a recurring issue where simple queries to a particular table start to hang, and never complete.

  • the queries that hang are very basic, e.g. SELECT COUNT(*) FROM table
  • this table has approx 4k rows in it (i.e. it's tiny)
  • restarting the crdb nodes causes the issue to go away, but after a period of time it always returns
  • we only see the issue on one table in the database; oddly, we can select rows but are unable to select a count

When this starts to occur, one of the crdb nodes starts to log the following:

W210125 13:06:51.296256 303430 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹bd2e7a13›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:06:51.303892 303380 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹fdac8db3›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:06:54.332016 304081 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹a3e3190b›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:03.170207 304497 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹85b4ebdb›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:03.289963 304610 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹c4c5c5ea›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:07.967060 304825 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹8d0c357c›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:08.019760 304809 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹3136c454›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:08.041635 304746 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹a214d58e›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:10.423216 305013 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹9a492228›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:10.426590 305018 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹d1beb3b1›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:10.426795 304911 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹9ca8ea7c›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:07:23.895783 305647 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹71e6606e›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:09:43.743896 310257 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹fbfb653e›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:09:43.744051 310429 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹78027c1c›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:09:43.912781 310417 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹ecf20a5c›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:11.988964 310483 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹94d646c1›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:11.990022 311402 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹11d51c53›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:11.991626 311408 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹3687bf7d›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:14.965386 305074 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹73226702›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:14.967038 311361 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹14c88dc4›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:10:15.116247 311524 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹5b6013f5›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:17:24.060675 326144 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹8e07e94d›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:27:24.220959 346312 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹a3fdc14b›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:36:24.119688 346312 kv/kvserver/txnwait/queue.go:524 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹a3fdc14b›: finished waiting after 599.90s for pushee ‹fcf8c62f›
W210125 13:36:24.120472 346312 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] slow range RPC: ‹have been waiting 599.90s (1 attempts) for RPC PushTxn(a3fdc14b->fcf8c62f) [/Table/65/1/"\x13$\x8f\xb3\xf9\xd5B)\xb6\xcc5\xb3E\xadf\x8e"/0,/Min) to r85:/Table/65{-/12} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=40]; resp: (err: context canceled)›
W210125 13:36:24.121533 346312 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] slow RPC response: ‹slow RPC finished after 599.90s (1 attempts)›
W210125 13:36:24.122285 346312 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n3,client=‹10.2.127.235:43192›,hostssl,user=‹planner›,txn=‹a3fdc14b›] slow range RPC: ‹have been waiting 600.00s (1 attempts) for RPC [txn: a3fdc14b], [can-forward-ts], Scan [/Table/65/1,/Table/65/2) to r85:/Table/65{-/12} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=40]; resp: (err: context canceled)›
W210125 13:36:24.122351 346312 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n3,client=‹10.2.127.235:43192›,hostssl,user=‹planner›,txn=‹a3fdc14b›] slow RPC response: ‹slow RPC finished after 600.00s (1 attempts)›
W210125 13:37:24.404409 366438 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹713f29c8›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:46:24.300697 366438 kv/kvserver/txnwait/queue.go:524 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹713f29c8›: finished waiting after 599.90s for pushee ‹fcf8c62f›
W210125 13:46:24.300995 366438 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] slow range RPC: ‹have been waiting 599.90s (1 attempts) for RPC PushTxn(713f29c8->fcf8c62f) [/Table/65/1/"\x13$\x8f\xb3\xf9\xd5B)\xb6\xcc5\xb3E\xadf\x8e"/0,/Min) to r85:/Table/65{-/12} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=40]; resp: (err: context canceled)›
W210125 13:46:24.301063 366438 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] slow RPC response: ‹slow RPC finished after 599.90s (1 attempts)›
W210125 13:46:24.301704 366438 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n3,client=‹10.2.127.235:47822›,hostssl,user=‹planner›,txn=‹713f29c8›] slow range RPC: ‹have been waiting 600.00s (1 attempts) for RPC [txn: 713f29c8], [can-forward-ts], Scan [/Table/65/1,/Table/65/2) to r85:/Table/65{-/12} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=40]; resp: (err: context canceled)›
W210125 13:46:24.301801 366438 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n3,client=‹10.2.127.235:47822›,hostssl,user=‹planner›,txn=‹713f29c8›] slow RPC response: ‹slow RPC finished after 600.00s (1 attempts)›
W210125 13:47:24.582639 386631 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹0458f0ff›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:51:37.858034 394925 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹17597049›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:51:38.219352 395172 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹afa2a136›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:51:41.918264 395457 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹4017788d›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:51:41.921868 395482 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹0a781a0e›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:51:41.922087 395183 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹2efcc1ad›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 13:57:24.731503 407149 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/65{-/12}›] pusher ‹04ebdbc9›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 14:05:12.698886 422924 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹baaa518d›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 14:05:12.699366 422966 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹3a304ba4›: have been waiting 60.00s for pushee ‹fcf8c62f›
W210125 14:05:12.841048 423008 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹91eb8cff›: have been waiting 60.00s for pushee ‹fcf8c62f›
<snip>

This continues until we restart nodes in the cluster.

To Reproduce

Unfortunately we haven't pinpointed what might be triggering this. Typically it takes anywhere from a few hours to a couple of days for the issue to recur. It's possible that some external factor is triggering the issue (i.e. it's not just randomly happening after a period of time); we're looking into what that might be.

Expected behavior

Queries should complete.

Additional data / screenshots

Stack traces:

When the above traces were captured, the pusher ‹*›: have been waiting 60.00s for pushee ‹*› log entries were occurring on cockroachdb-1.

The cluster is reporting healthy when the issue occurs, as far as we can tell: https://user-images.githubusercontent.com/350792/105723011-318e5080-5f1e-11eb-9a03-b7ea60a314c8.png

Environment:

  • CockroachDB version: v20.2.3
  • Server OS: Linux (crdb nodes are running on top of a kubernetes cluster)
  • Client app: cockroach sql, pgx golang driver
root@cockroachdb-proxy:26257/defaultdb> SELECT VERSION();
                                          version
--------------------------------------------------------------------------------------------
  CockroachDB CCL v20.2.3 (x86_64-unknown-linux-gnu, built 2020/12/14 18:33:39, go1.13.14)
(1 row)
Time: 1ms total (execution 0ms / network 1ms)

3 node cluster with the following members by hostname:

  • cockroachdb-0
  • cockroachdb-1
  • cockroachdb-2

Additional context

I described the issue on Slack, and was asked to open an issue here. There is a thread of discussion under https://cockroachdb.slack.com/archives/CP4D9LD5F/p1611572137084100

  • PRIORITY HIGH is not used for any transactions
  • We run crdb clusters with identical configuration in the same kubernetes cluster, and have yet to see this issue anywhere else
@nick-jones nick-jones added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 25, 2021
@blathers-crl

blathers-crl bot commented Jan 25, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jan 25, 2021
@ajwerner ajwerner changed the title Simple queries never completing kv: apparent deadlock waiting for pushes Jan 25, 2021
@nvanbenschoten
Member

Hi @nick-jones, thanks for the report. Do you have any open transactions that have written to this table and are now just blocking in the client? What we see is that all waiting transactions are waiting on a writing transaction with the ID fcf8c62f.

@nvanbenschoten
Member

It may help to look at the Sessions page in the admin UI to determine which open session is holding this transaction open.

@nick-jones
Author

@nvanbenschoten thanks for the reply.

Unfortunately we've restarted the nodes in the cluster as of ~30 minutes ago, as the issue was getting in the way of some of our devs. Next time it occurs, I'll share further details from the Sessions page.

In case it's of similar or equal value, I do still have the output of SHOW QUERIES from when the issue was occurring earlier. Note that we've had it happen twice today; I captured the queries in relation to the first instance, while my initial report above contains details from the second instance.

The log entries looked like this for the first instance:

<snip>
W210125 09:59:32.147055 7537052 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹7d587396›: have been waiting 60.00s for pushee ‹3240288a›
W210125 09:59:32.191225 7536340 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹68725c84›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:00:16.096845 7538028 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹ea64e273›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:06:37.697785 7532990 kv/kvserver/txnwait/queue.go:524 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹e8391ac3›: finished waiting after 599.90s for pushee ‹3240288a›
W210125 10:06:37.698058 7532990 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] slow range RPC: ‹have been waiting 599.90s (1 attempts) for RPC PushTxn(e8391ac3->3240288a) [/Table/65/1/"\x1e\bX\x1aHg@\xbe\x95\xe1\xe7\xf5RCL\xda"/0,/Min) to r85:/Table/6{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=39]; resp: (err: context canceled)›
W210125 10:06:37.698211 7532990 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] slow RPC response: ‹slow RPC finished after 599.90s (1 attempts)›
W210125 10:06:37.698361 7532990 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n1,client=‹10.2.127.235:47084›,hostssl,user=‹planner›,txn=‹e8391ac3›] slow range RPC: ‹have been waiting 600.00s (1 attempts) for RPC [txn: e8391ac3], [can-forward-ts], Scan [/Table/65/1,/Table/65/2) to r85:/Table/6{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=39]; resp: (err: context canceled)›
W210125 10:06:37.698440 7532990 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n1,client=‹10.2.127.235:47084›,hostssl,user=‹planner›,txn=‹e8391ac3›] slow RPC response: ‹slow RPC finished after 600.00s (1 attempts)›
W210125 10:07:38.000158 7546543 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹f8b34bf6›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:17:38.211581 7558605 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹17e63b84›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:23:58.821505 7566143 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹95bbb301›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:27:38.362853 7570677 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹32bab719›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:28:28.584648 7571533 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹eba4a24c›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:29:55.586225 7573414 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹2c313c46›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:30:25.657344 7574038 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹cb0433f9›: have been waiting 60.00s for pushee ‹3240288a›
W210125 10:36:38.261863 7570677 kv/kvserver/txnwait/queue.go:524 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹32bab719›: finished waiting after 599.90s for pushee ‹3240288a›
W210125 10:36:38.262079 7570677 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] slow range RPC: ‹have been waiting 599.90s (1 attempts) for RPC PushTxn(32bab719->3240288a) [/Table/65/1/"\x1e\bX\x1aHg@\xbe\x95\xe1\xe7\xf5RCL\xda"/0,/Min) to r85:/Table/6{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=39]; resp: (err: context canceled)›
W210125 10:36:38.262160 7570677 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] slow RPC response: ‹slow RPC finished after 599.90s (1 attempts)›
W210125 10:36:38.262290 7570677 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n1,client=‹10.2.127.235:41648›,hostssl,user=‹planner›,txn=‹32bab719›] slow range RPC: ‹have been waiting 600.00s (1 attempts) for RPC [txn: 32bab719], [can-forward-ts], Scan [/Table/65/1,/Table/65/2) to r85:/Table/6{5-6} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=39]; resp: (err: context canceled)›
W210125 10:36:38.262344 7570677 kv/kvclient/kvcoord/dist_sender.go:1507 ⋮ [n1,client=‹10.2.127.235:41648›,hostssl,user=‹planner›,txn=‹32bab719›] slow RPC response: ‹slow RPC finished after 600.00s (1 attempts)›
W210125 10:37:38.467382 7582526 kv/kvserver/txnwait/queue.go:517 ⋮ [n1,s1,r85/1:‹/Table/6{5-6}›] pusher ‹ccc48e66›: have been waiting 60.00s for pushee ‹3240288a›
<snip>

Queries: https://gist.github.com/nick-jones/2edbd67a95f30e7dc778eaf5dfbeec2b (note that I've obfuscated some column names; I'm happy to email the original output if requested)

@ajwerner
Contributor

It's probably worth also grabbing the set of open transactions. You can grab that with SELECT * FROM crdb_internal.cluster_transactions.

@nick-jones
Author

nick-jones commented Jan 25, 2021

@nvanbenschoten @ajwerner it started to happen again, so as requested:

logs from cockroachdb-1 (node_id = 3):

<snip>
W210125 16:37:23.871602 113657 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹b21d230c›: have been waiting 60.00s for pushee ‹700fdf6d›
W210125 16:37:36.931660 113886 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹c7cc71ed›: have been waiting 60.00s for pushee ‹700fdf6d›
W210125 16:42:04.603657 120079 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹d9134abf›: have been waiting 60.00s for pushee ‹700fdf6d›
W210125 16:44:04.710395 122816 kv/kvserver/txnwait/queue.go:517 ⋮ [n3,s3,r85/2:‹/Table/6{5-6}›] pusher ‹f0404ddb›: have been waiting 60.00s for pushee ‹700fdf6d›

open transactions:

root@cockroachdb-proxy:26257/defaultdb> SELECT * FROM crdb_internal.cluster_transactions;
                   id                  | node_id |            session_id            |              start               |                                                                                                                           txn_string                                                                                                                           | application_name | num_stmts | num_retries | num_auto_retries
---------------------------------------+---------+----------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------+-------------+-------------------
  964c704e-2b3f-4f5c-b5d6-7610cec77b27 |       2 | 165d87683e6e51540000000000000002 | 2021-01-25 16:45:25.887266+00:00 | "sql txn" meta={id=964c704e key=/Min pri=0.03180769 epo=0 ts=1611593125.887254606,0 min=1611593125.887254606,0 seq=0} lock=false stat=PENDING rts=1611593125.887254606,0 wto=false max=1611593126.387254606,0                                                  | $ cockroach sql  |         1 |           0 |                0
  700fdf6d-8cde-4fdf-bc8a-95dec4669208 |       3 | 165d86e7eccdf7f20000000000000003 | 2021-01-25 16:36:16.882415+00:00 | "sql txn" meta={id=700fdf6d key=/Table/65/1/"\xb9B\xf6c\xb3\bK\xa4\xa7R\xb2z[\xe806"/0 pri=0.02191132 epo=0 ts=1611592576.882388349,0 min=1611592576.882388349,0 seq=5} lock=true stat=PENDING rts=1611592576.882388349,0 wto=false max=1611592577.382388349,0 |                  |         5 |           0 |                0
  c7cc71ed-71e3-45b2-af3d-37933eac3bde |       1 | 165d86ed9a043af30000000000000001 | 2021-01-25 16:36:36.819851+00:00 | "sql txn" meta={id=c7cc71ed key=/Min pri=0.05333629 epo=0 ts=1611592596.819841921,0 min=1611592596.819841921,0 seq=0} lock=false stat=PENDING rts=1611592596.819841921,0 wto=false max=1611592597.319841921,0                                                  |                  |         1 |           0 |                0
  b21d230c-b297-46ab-96c1-bfc81950e06f |       1 | 165d86ea964b9ca10000000000000001 | 2021-01-25 16:36:23.775021+00:00 | "sql txn" meta={id=b21d230c key=/Min pri=0.01905798 epo=0 ts=1611592583.775012879,0 min=1611592583.775012879,0 seq=0} lock=false stat=PENDING rts=1611592583.775012879,0 wto=false max=1611592584.275012879,0                                                  |                  |         1 |           0 |                0
(4 rows)

Time: 16ms total (execution 7ms / network 9ms)

queries:

root@cockroachdb-proxy:26257/defaultdb> SHOW QUERIES;
              query_id             | node_id |            session_id            | user_name |              start               |                                                                                                                                                                   query                                                                                                                                                                   |   client_field_10   | application_name | distributed |   phase
-----------------------------------+---------+----------------------------------+-----------+----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------+-------------+------------
  165d879f90c65f540000000000000003 |       3 | 165d879f02889e430000000000000003 | root      | 2021-01-25 16:49:21.06686+00:00  | SHOW CLUSTER QUERIES                                                                                                                                                                                                                                                                                                                      | 10.2.134.47:43896  | $ cockroach sql  |    false    | executing
  165d87765a7e74a90000000000000003 |       3 | 165d87765a07be510000000000000003 | planner   | 2021-01-25 16:46:24.062539+00:00 | SELECT count(*) FROM prospects                                                                                                                                                                                                                                                                                                            | 10.2.127.235:57872 |                  |    true     | executing
  165d86eda018deb30000000000000001 |       1 | 165d86ed9a043af30000000000000001 | planner   | 2021-01-25 16:36:36.819794+00:00 | SELECT id, field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8, field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16, field_17, field_18, deleted_at, field_19 FROM prospects WHERE (deleted_at IS NULL) AND (field_18 = $1) | 10.2.1.244:50654   |                  |    true     | executing
(3 rows)

Time: 11ms total (execution 9ms / network 2ms)

sessions view: https://user-images.githubusercontent.com/350792/105737300-b16fe700-5f2d-11eb-8982-7eeb8cfff62d.png (let me know if you want a screengrab of something more specific here)

@nvanbenschoten
Member

Thanks @nick-jones. We see the open transaction with ID 700fdf6d on node 3. It has issued 5 statements so far and seems to be waiting in the application. This is what is blocking the other transactions.

A SELECT * FROM crdb_internal.cluster_sessions WHERE session_id = '165d86e7eccdf7f20000000000000003'; should tell us a little more about this session. In particular, it will tell us the last_active_query for this session, which should help you find the issue in your app.
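
For anyone following along later, a minimal sketch of that lookup narrowed to the columns that matter here (column names assume the v20.2 crdb_internal.cluster_sessions schema; the session ID is the one from this thread):

SELECT session_id,
       client_address,
       application_name,
       session_start,
       last_active_query
FROM crdb_internal.cluster_sessions
WHERE session_id = '165d86e7eccdf7f20000000000000003';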

@nick-jones
Author

nick-jones commented Jan 25, 2021

Thanks @nick-jones. We see the open transaction with ID 700fdf6d on node 3. It has issued 5 statements so far and seems to be waiting in the application. This is what is blocking the other transactions.

A SELECT * FROM crdb_internal.cluster_sessions WHERE session_id = '165d86e7eccdf7f20000000000000003'; should tell us a little more about this session. In particular, it will tell us the last_active_query for this session, which should help you find the issue in your app.

Thanks! So it looks like we might have an application that is holding a transaction open, having issued statements. We know which one it likely is; we're just looking to confirm (once we have, I'll close the issue).

Is what we've observed here "expected behaviour" if this sort of thing happens?

@ajwerner
Contributor

Is what we've observed here "expected behaviour" if this sort of thing happens?

Yes, writing transactions put down exclusive locks on the values they write. Concurrent transactions will block on those locks. The tool that users have at their disposal is the idea of PRIORITY. A regular-priority transaction can push a low-priority lock out of its way, and similarly a PRIORITY HIGH transaction can push any regular- or low-priority lock out of the way. A reader pushing a writer should not fundamentally prevent the writer from committing if none of its reads get invalidated; however, a higher-priority writer will abort a lower-priority writer.

See https://www.cockroachlabs.com/docs/v20.2/transactions#transaction-priorities
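
To make this concrete, here is a rough two-session sketch of the shape of the blocking seen in this issue (the UPDATE below is hypothetical, standing in for whatever write the idle application transaction performed; SET TRANSACTION PRIORITY is the documented knob from the link above):

-- session 1: application opens a transaction, writes a row, then goes idle without committing
BEGIN;
UPDATE prospects SET field_1 = 'x' WHERE id = '...';   -- leaves an exclusive intent on the row

-- session 2: a normal-priority read of the same table queues behind the intent
SELECT count(*) FROM prospects;                         -- hangs; logs "pusher ...: have been waiting ... for pushee ..."

-- session 2, alternative: a high-priority read can push the normal-priority writer
-- out of the way instead of waiting on it
BEGIN;
SET TRANSACTION PRIORITY HIGH;
SELECT count(*) FROM prospects;
COMMIT;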

@nick-jones
Author

Is what we've observed here "expected behaviour" if this sort of thing happens?

Yes, writing transactions put down exclusive locks on the values they write. Concurrent transactions will block on those locks. The tool that users have at their disposal is the idea of PRIORITY. A regular-priority transaction can push a low-priority lock out of its way, and similarly a PRIORITY HIGH transaction can push any regular- or low-priority lock out of the way. A reader pushing a writer should not fundamentally prevent the writer from committing if none of its reads get invalidated; however, a higher-priority writer will abort a lower-priority writer.

See https://www.cockroachlabs.com/docs/v20.2/transactions#transaction-priorities

Thanks 👍

We haven't quite got to the bottom of what is happening on our side, though we're fairly sure it is due to one of our applications holding a transaction open. So with that in mind, I'm going to close the issue. Apologies for the fact this was a non-issue in the end, and many thanks for the help!
