Investigate 0.8.2 node hitting max db connections #945

Closed
pjenvey opened this issue Dec 3, 2020 · 3 comments
Labels: 5 Estimate - l - Moderately complex, will require some effort but clearly defined.


@pjenvey
Member

pjenvey commented Dec 3, 2020

Following the 0.8.2 deploy yesterday a node became "stuck" with the max (30) number of db connections.

[Image: Screen Shot 2020-12-03 at 3 34 30 PM]

The node continued serving requests until ops manually killed it, but its response times were incredibly slow (mostly in the double-digit seconds). Its effect on the overall 95th percentile:

[Image: Screen Shot 2020-12-03 at 3 35 05 PM]

(and it triggered the nginx GET dashboard alarm).

@pjenvey pjenvey self-assigned this Dec 3, 2020
@jrconlin
Member

jrconlin commented Dec 3, 2020

I'm hoping that this is anomalous, since it was only one node and only one incident. Still, there was some discussion off-channel that a timeout of 10m might be too generous, and that we might want to find a shorter Spanner connection timeout that does not trigger the stream of 502s we had before.

@tublitzed tublitzed transferred this issue from mozilla-services/services-engineering Dec 4, 2020
@pjenvey pjenvey added the 3 Estimate - m - This is a small change, but there's some uncertainty. label Dec 7, 2020
@tublitzed tublitzed assigned fzzzy and unassigned pjenvey and fzzzy Jan 11, 2021
@pjenvey
Member Author

pjenvey commented Jan 21, 2021

Documenting more detail about these incidents here

@tublitzed tublitzed added 5 Estimate - l - Moderately complex, will require some effort but clearly defined. and removed 3 Estimate - m - This is a small change, but there's some uncertainty. labels Jan 21, 2021
jrconlin added a commit that referenced this issue Jan 26, 2021
*Ops:*

Adds `SYNC_DATABASE_POOL_CONNECTION_DEADMAN_SWITCH` which is the
number of milliseconds the pool can report being at 0 available
connections before the application triggers a `panic!`.

The current default is `0` meaning the deadman switch is inactive.

Issue #945
@jrconlin jrconlin mentioned this issue Jan 26, 2021
jrconlin added a commit that referenced this issue Jan 27, 2021
*Ops:*

Adds `SYNC_DATABASE_POOL_CONNECTION_DEADMAN_SWITCH` which is the
number of milliseconds the pool can report being at 0 available
connections before the application triggers a `panic!`.

The current default is `0` meaning the deadman switch is inactive.

Issue #945
Closes #984
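
A minimal sketch of the deadman-switch idea described in the commit message above, not the code from the referenced commit. The `Deadman` type, its fields, and the `observe` method are hypothetical names chosen for illustration. It also assumes the switch fires once the pool has been at 0 available connections for at least the configured number of milliseconds, and that a setting of `0` disables it.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for pool exhaustion; not the actual syncstorage-rs type.
#[derive(Debug)]
pub struct Deadman {
    /// How long the pool may stay exhausted; `None` means the switch is inactive.
    max_exhausted: Option<Duration>,
    /// When the pool was first observed with 0 available connections.
    exhausted_since: Option<Instant>,
}

impl Deadman {
    /// Build from a millisecond setting; `0` disables the switch.
    pub fn from_millis(ms: u64) -> Self {
        Deadman {
            max_exhausted: if ms == 0 {
                None
            } else {
                Some(Duration::from_millis(ms))
            },
            exhausted_since: None,
        }
    }

    /// Record one pool observation taken at `now`.
    /// Returns `true` when the pool has been exhausted for at least the limit.
    pub fn observe(&mut self, available: u32, now: Instant) -> bool {
        if available > 0 {
            // The pool recovered, so the clock resets.
            self.exhausted_since = None;
            return false;
        }
        // Remember the first moment of exhaustion and keep it on later calls.
        let since = *self.exhausted_since.get_or_insert(now);
        match self.max_exhausted {
            Some(limit) => now.saturating_duration_since(since) >= limit,
            None => false,
        }
    }
}
```

A caller would check the pool periodically and call `panic!` when `observe` returns `true`, so that the stuck node is restarted instead of continuing to serve slow responses.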
jrconlin added a commit that referenced this issue Jan 28, 2021
This will add several fields to `__lbheartbeat__` if
`database_pool_max_size` is specified. These include:

```json
{
  "active_connections": ... /* Number of active connections */,
  "idle_connections": ... /* Number of idle connections */,
  "duration": ... /* How long no idle connections have been available */
}
```

Note that "duration" will only be present if `idle_connections` has been
zero since the last time a check was performed.

* this also adds `database_pool_max_size` as a config option.

Issue: #945
jrconlin added a commit that referenced this issue Feb 1, 2021
* feat: Add pool connection info to __lbheartbeat__ for ops

This will add several fields to `__lbheartbeat__` if
`database_pool_max_size` is specified. These include:

```json
{
  "active_connections": ... /* Number of active connections */,
  "idle_connections": ... /* Number of idle connections */,
  "duration": ... /* How long no idle connections have been available */
}
```

Note that "duration" will only be present if `idle_connections` has been
zero since the last time a check was performed.

Issue: #945

Co-authored-by: Philip Jenvey <[email protected]>
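
The heartbeat fields described in the commit message above could be produced along these lines. This is a sketch under several assumptions, not the committed implementation: `PoolReport`, `PoolMonitor`, and `check` are hypothetical names, `duration` is rendered in whole seconds (the commit message does not state a unit), and "zero since the last check" is read as "the previous check also saw zero idle connections".

```rust
use std::time::{Duration, Instant};

/// Hypothetical heartbeat payload; field names follow the commit message.
#[derive(Debug, PartialEq)]
pub struct PoolReport {
    pub active_connections: u32,
    pub idle_connections: u32,
    /// Only set while the pool has stayed at zero idle connections.
    pub duration: Option<Duration>,
}

impl PoolReport {
    /// Render as JSON by hand; `duration` in whole seconds is an assumption.
    pub fn to_json(&self) -> String {
        let mut out = format!(
            "{{\"active_connections\":{},\"idle_connections\":{}",
            self.active_connections, self.idle_connections
        );
        if let Some(d) = self.duration {
            out.push_str(&format!(",\"duration\":{}", d.as_secs()));
        }
        out.push('}');
        out
    }
}

/// Remembers when the pool was first seen with no idle connections.
#[derive(Debug, Default)]
pub struct PoolMonitor {
    exhausted_since: Option<Instant>,
}

impl PoolMonitor {
    /// `total` and `idle` are whatever the pool reports at check time.
    pub fn check(&mut self, total: u32, idle: u32, now: Instant) -> PoolReport {
        let duration = if idle == 0 {
            match self.exhausted_since {
                // Still exhausted since an earlier check: report how long.
                Some(since) => Some(now.saturating_duration_since(since)),
                // First check that saw zero idle connections: start the clock.
                None => {
                    self.exhausted_since = Some(now);
                    None
                }
            }
        } else {
            self.exhausted_since = None;
            None
        };
        PoolReport {
            active_connections: total.saturating_sub(idle),
            idle_connections: idle,
            duration,
        }
    }
}
```

With this shape, ops can alert on a growing `duration` from `__lbheartbeat__` instead of inferring a stuck pool from response times.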
@tublitzed
Contributor

We haven't hit this in prod since #985 rolled out - closing.
