On v19 we experienced a VTTablet effectively locking up, with attempts to start a new transaction timing out. The shard did not recover until we performed a failover to another tablet.
We were able to trace the cause to a bug in waitlist.expire.
The function should wake up all waiters whose context has expired, but it only wakes up the first one it finds. This is due to calling wl.list.Remove(e) while traversing the list. This sets e.Next to nil, so the loop terminates after the first removal.
for run in {1..100}; do time mysql -e "begin; select $run, sleep(10) ; rollback" & done
This starts 100 transactions in parallel, with each one sleeping for 10 seconds.
The transaction pool has a default wait timeout of 1 second (source), so if the pool is exhausted we would expect wait times of 1-2 seconds (1s wait timeout + up to 1s for the expire worker to run) before failing with a ResourceExhausted error (returned here, message modified here)
What we see in v19 and on main
We see one transaction erroring out with ResourceExhausted each second, with ever increasing wait times. The last transaction completes after about 40 seconds.
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)
real 0m1.105s
user 0m0.004s
sys 0m0.005s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)
real 0m2.096s
user 0m0.004s
sys 0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)
real 0m3.082s
user 0m0.004s
sys 0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)
real 0m4.075s
user 0m0.004s
sys 0m0.003s
ERROR 1203 (42000) at line 1: target: commerce.0.primary: vttablet: rpc error: code = ResourceExhausted desc = transaction pool connection limit exceeded (CallerID: userData1)
real 0m5.070s
user 0m0.005s
sys 0m0.003s
...
Binary Version
impacts v19+
Operating System and Environment details
n/a
Log Fragments
No response
Overview of the Issue
On v19 we experienced a VTTablet effectively locking up, with attempts to start a new transaction timing out. The shard did not recover until we performed a failover to another tablet.
We were able to trace the cause to a bug in waitlist.expire (vitess/go/pools/smartconnpool/waitlist.go, line 79 in 3e4e1b9).
expire is called once per second by a pool worker to remove any waiters whose context has timed out or been canceled (vitess/go/pools/smartconnpool/pool.go, lines 196 to 201 in 3e4e1b9).
The function should wake up all waiters whose context has expired, but it only wakes up the first one it finds. This is due to calling wl.list.Remove(e) while traversing the list. This sets e.Next to nil, which terminates the loop (vitess/go/pools/smartconnpool/waitlist.go, lines 89 to 95 in 3e4e1b9).
A fix PR will be posted shortly.
Reproduction Steps
1. Start a local cluster
2. Exhaust the transaction pool
for run in {1..100}; do time mysql -e "begin; select $run, sleep(10) ; rollback" & done
This starts 100 transactions in parallel, with each one sleeping for 10 seconds.
The transaction pool has a default wait timeout of 1 second (source), so if the pool is exhausted we would expect wait times of 1-2 seconds (1s wait timeout + up to 1s for the expire worker to run) before failing with a ResourceExhausted error (returned here, message modified here).
What we see in v19 and on main
We see one transaction erroring out with ResourceExhausted each second, with ever increasing wait times. The last transaction completes after about 40 seconds.
Binary Version
impacts v19+
Operating System and Environment details
n/a
Log Fragments
No response