0.5.0 load test anomaly #63

Open
erkolson opened this issue Jul 20, 2020 · 4 comments

Something similar to the "stuck connections" issue we see in production occurred during the 0.5.0 load test, though due to the different connection handling in bb8, it was not readily apparent which pod was "stuck".

Connection pools looked like this:
[screenshot: connection pool metrics]

It appears that one pod (...-nr48p) was unable to use all of the idle connections, was very slow to handle requests, and was returning 503s to clients.

Request handling durations:
[screenshot: request handling durations]

5xx rate:
[screenshot: 5xx rate]

After deleting that one pod, performance returned to normal.

@pjenvey pjenvey self-assigned this Jul 20, 2020
@tublitzed tublitzed added bug Something isn't working p1 labels Jul 21, 2020
@pjenvey pjenvey added 5 Estimate - m - This is a small change, but there's some uncertainty. 8 Estimate - xl - Moderately complex, medium effort, some uncertainty. and removed 5 Estimate - m - This is a small change, but there's some uncertainty. labels Aug 3, 2020

pjenvey commented Aug 17, 2020

Considering #64 a duplicate of this: these 50x spikes on 0.5 are due to the timeout issue described there.


pjenvey commented Aug 24, 2020

mozilla-services/syncstorage-rs#794 seems to have solved this.

@pjenvey pjenvey closed this as completed Aug 24, 2020

pjenvey commented Aug 25, 2020

To elaborate, we were seeing nodes get into these "stuck states" of either not responding at all or taking very long to respond. As described in #64 (and #61 (comment)), we even saw timeouts on endpoints that did not check out a db connection.

bb8 has potential connection leaks, and worse, its Drop impl was potentially blocking our event loop, which explains timeouts even when no db was involved.
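To illustrate the failure mode (a minimal sketch, not bb8's actual code): any synchronous work inside a `Drop` impl runs on whichever executor thread happens to drop the value, so on a single-threaded runtime it stalls every other task, including handlers that never touch the db.

```rust
use std::time::Duration;

// Hypothetical guard type, standing in for a pooled-connection wrapper
// whose Drop impl does blocking work.
struct BlockingGuard;

impl Drop for BlockingGuard {
    fn drop(&mut self) {
        // Synchronous work here runs on the executor thread that drops
        // the value, blocking every other task scheduled on that thread.
        std::thread::sleep(Duration::from_millis(500));
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // A task that never checks out a db connection.
    let unrelated = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("unrelated task finished");
    });

    // Dropping the guard blocks the only runtime thread for 500ms;
    // the unrelated task cannot make progress until the drop returns.
    drop(BlockingGuard);

    unrelated.await.unwrap();
}
```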

Switching from bb8 to deadpool has fixed the timeouts/"stuck state".
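As a rough sketch of the behavior we want from the pool (the `checkout()` below is a hypothetical stand-in for the real pool API, not syncstorage-rs code): a bounded, fully async wait for a connection, so a saturated or stuck pool surfaces as an error the handler can map to a 503 instead of hanging.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical connection type, standing in for a real pooled db connection.
struct DbConn;

// Hypothetical checkout, standing in for the real pool API; here it just
// simulates a pool that is too slow to hand out a connection.
async fn checkout() -> DbConn {
    tokio::time::sleep(Duration::from_secs(5)).await;
    DbConn
}

#[tokio::main]
async fn main() {
    // Bound the wait for a connection so a saturated/stuck pool turns into
    // an error the handler can map to a 503, rather than an indefinite hang.
    match timeout(Duration::from_millis(500), checkout()).await {
        Ok(_conn) => println!("got a connection"),
        Err(_elapsed) => println!("checkout timed out -> respond with 503"),
    }
}
```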

This was a significantly different issue from the "stuck state" we're seeing on prod under 0.4.x.


pjenvey commented Sep 14, 2020

Reopening this: we're seeing similar spikes of 503s due to upstream timeouts on 0.5.8 in production.

@pjenvey pjenvey reopened this Sep 14, 2020