0.5.0 load test anomaly #63

Open
erkolson opened this issue Jul 20, 2020 · 4 comments

Something similar to the "stuck connections" issue we see in production occurred during the 0.5.0 load test, though due to the different connection handling in bb8, it was not readily apparent which pod was "stuck".

Connection pools looked like this:
[screenshot: connection pool metrics]

It appears that one pod (...-nr48p) was unable to use all of the idle connections, was very slow to handle requests, and was returning 503s to clients.

Request handling durations:
[screenshot: request handling durations]

5xx rate:
[screenshot: 5xx rate]

After deleting that one pod, performance returned to normal.

@pjenvey pjenvey self-assigned this Jul 20, 2020
@tublitzed tublitzed added bug Something isn't working p1 labels Jul 21, 2020
@pjenvey pjenvey added 5 Estimate - m - This is a small change, but there's some uncertainty. 8 Estimate - xl - Moderately complex, medium effort, some uncertainty. and removed 5 Estimate - m - This is a small change, but there's some uncertainty. labels Aug 3, 2020

pjenvey commented Aug 17, 2020

Considering #64 a duplicate of this: these 50x spikes on 0.5 are due to the timeout issue described there.


pjenvey commented Aug 24, 2020

mozilla-services/syncstorage-rs#794 seems to have solved this.

@pjenvey pjenvey closed this as completed Aug 24, 2020

pjenvey commented Aug 25, 2020

To elaborate, we were seeing nodes get into these "stuck states" of either not responding at all or taking very long to respond. As described in #64 (and #61 (comment)), we even saw timeouts on endpoints that did not check out a db connection.

bb8 has potential connection leaks, and worse, its Drop impl was potentially blocking our event loop, which explains timeouts even when no db was involved.
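To illustrate the failure mode (a minimal sketch, not bb8's actual code): any synchronous work inside a `Drop` impl runs on whichever executor thread happens to drop the value, so on a single-threaded runtime it stalls every other task, including handlers that never touch the db.

```rust
use std::time::Duration;

// Hypothetical guard type, standing in for a pooled-connection wrapper
// whose Drop impl does blocking work.
struct BlockingGuard;

impl Drop for BlockingGuard {
    fn drop(&mut self) {
        // Synchronous work here runs on the executor thread that drops
        // the value, blocking every other task scheduled on that thread.
        std::thread::sleep(Duration::from_millis(500));
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // A task that never checks out a db connection.
    let unrelated = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("unrelated task finished");
    });

    // Dropping the guard blocks the only runtime thread for 500ms;
    // the unrelated task cannot make progress until the drop returns.
    drop(BlockingGuard);

    unrelated.await.unwrap();
}
```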

Switching from bb8 to deadpool has fixed the timeouts/"stuck state".
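As a rough sketch of the behavior we want from the pool (the `checkout()` below is a hypothetical stand-in for the real pool API, not syncstorage-rs code): a bounded, fully async wait for a connection, so a saturated or stuck pool surfaces as an error the handler can map to a 503 instead of hanging.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical connection type, standing in for a real pooled db connection.
struct DbConn;

// Hypothetical checkout, standing in for the real pool API; here it just
// simulates a pool that is too slow to hand out a connection.
async fn checkout() -> DbConn {
    tokio::time::sleep(Duration::from_secs(5)).await;
    DbConn
}

#[tokio::main]
async fn main() {
    // Bound the wait for a connection so a saturated/stuck pool turns into
    // an error the handler can map to a 503, rather than an indefinite hang.
    match timeout(Duration::from_millis(500), checkout()).await {
        Ok(_conn) => println!("got a connection"),
        Err(_elapsed) => println!("checkout timed out -> respond with 503"),
    }
}
```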

This was a significantly different issue from the "stuck state" we're seeing on prod under 0.4.x.


pjenvey commented Sep 14, 2020

Reopening this: we're seeing similar spikes of 503s due to upstream timeouts on 0.5.8 in production.

@pjenvey pjenvey reopened this Sep 14, 2020