Something similar to the "stuck connections" issue we see in production occurred during the 0.5.0 load test, though due to bb8's different connection handling it was not readily apparent which pod was "stuck".
Connection pools looked like this:
It appears that one pod (...-nr48p) was unable to use all of the idle connections, was very slow to handle requests, and was returning 503s to clients.
Request handling durations:
5xx rate:
After deleting that one pod, performance returned to normal.
To elaborate, we were seeing nodes get into these "stuck states" of either not responding at all or taking very long to respond. As described in #64 (and #61 (comment)), we even saw timeouts on endpoints that did not check out a db connection.
bb8 has potential connection leaks, and worse, its Drop impl was potentially blocking our event loop, which explains timeouts even on endpoints where no db was involved.
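To illustrate why a blocking Drop matters here, below is a minimal, hypothetical sketch (not bb8's actual code, and using tokio 1.x APIs just for demonstration): a pooled-connection guard whose Drop does blocking work, run on a single-threaded runtime. While the drop runs, an unrelated task on the same worker thread cannot make progress, which is the same shape as timeouts on endpoints that never touch the db.

```rust
use std::{thread, time::Duration};

// Hypothetical stand-in for a pooled connection guard whose Drop does
// blocking work (e.g. synchronous cleanup or a lock held across drop).
struct PooledConn;

impl Drop for PooledConn {
    fn drop(&mut self) {
        // Blocks the executor thread; no other task scheduled on this
        // thread can run until it returns.
        thread::sleep(Duration::from_secs(5));
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // An unrelated task that never touches the "pool".
    tokio::spawn(async {
        loop {
            println!("heartbeat");
            tokio::time::sleep(Duration::from_millis(500)).await;
        }
    });

    // Let the heartbeat start ticking.
    tokio::time::sleep(Duration::from_secs(2)).await;

    {
        let _conn = PooledConn;
    } // <- the blocking Drop runs here; the heartbeat stalls for ~5s

    tokio::time::sleep(Duration::from_secs(2)).await;
}
```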
Switching from bb8 to deadpool has fixed the timeouts / "stuck state".
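For contrast, a common way async pools hand connections back without blocking is to have Drop do nothing more than a non-blocking channel send. The sketch below only illustrates that pattern (it is not deadpool's actual implementation):

```rust
use tokio::sync::mpsc;

// Stand-in for a real database connection.
struct Conn;

// Hypothetical pooled handle: Drop just sends the connection back to the
// pool over an unbounded channel, which returns immediately and never
// blocks the executor thread.
struct PooledConn {
    conn: Option<Conn>,
    recycle: mpsc::UnboundedSender<Conn>,
}

impl Drop for PooledConn {
    fn drop(&mut self) {
        if let Some(conn) = self.conn.take() {
            // If the pool is gone the send fails; either way this is
            // instantaneous, so dropping a handle can't stall other tasks.
            let _ = self.recycle.send(conn);
        }
    }
}
```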
This was a significantly different issue from the "stuck state" we're seeing on prod under 0.4.x.