Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spanner emits 14-UNAVAILABLE GOAWAY #861

Closed
pjenvey opened this issue Oct 14, 2020 · 5 comments · Fixed by #1006
Closed

Spanner emits 14-UNAVAILABLE GOAWAY #861

pjenvey opened this issue Oct 14, 2020 · 5 comments · Fixed by #1006
Assignees
Labels
8 Estimate - xl - Moderately complex, medium effort, some uncertainty. bug Something isn't working

Comments

@pjenvey
Copy link
Member

pjenvey commented Oct 14, 2020

We began seeing 14-UNAVAILABLE GOAWAY received errors after the 0.5.x release to prod, e.g.:

https://sentry.prod.mozaws.net/operations/syncstorage-stage/issues/9186276/

These result in 500 errors (which in turn are 503s from nginx).

@pjenvey pjenvey added bug Something isn't working 8 Estimate - xl - Moderately complex, medium effort, some uncertainty. labels Oct 14, 2020
@pjenvey
Copy link
Member Author

pjenvey commented Oct 14, 2020

We have an issue open with google, inquiring on the best way to deal with this and if GOAWAYS are some indication of misbehavior on our end.

@pjenvey
Copy link
Member Author

pjenvey commented Oct 16, 2020

Some information on GOAWAYS from gRPC's perspective:

  • Should we retry on the existing connection? Or disconnect and retry on a separate connection (the go client seems to do the latter)?

GOAWAY indicates that the server wants the client to send no new RPCs on this connection. If the client has more RPCs to send, it should open a new connection. It is OK to wait for existing RPCs to finish before closing the connection.

The GOAWAY is server behavior independent of the client, but client behavior such as keeping connections around for longer can trigger it. Many servers send GOAWAY after a fixed duration (say 1 hour) to force clients to reconnect. It is also common for servers to send GOAWAY after a period of inactivity (say 10 minutes) on the connection.

  • What does GOAWAY mean besides "retry"? Does this signal something specific e.g. are we misusing our connections: using too many or holding too many idle connections?

See https://tools.ietf.org/html/rfc7540#section-6.8

@pjenvey
Copy link
Member Author

pjenvey commented Oct 16, 2020

To reduce the number of GOAWAYS we should try:

  • Reducing our total number of connections. It's possible it would help (though it may not do much w/ deadpool's FIFO behavior). It may at least reduce a few occurrences of GOAWAYS due to inactivity, which might cause 503 errors QA has seen on an idle stage cluster (stage 0.6.1 POST batch=true&commit=true producing 500 errors  #846)
  • Implement equivalent to r2d2/bb8's Builder::max_lifetime (unsupported by deadpool)
  • Implement equivalent to r2d2/bb8's Builder::idle_timeout (unsupported by deadpool)

@pjenvey pjenvey self-assigned this Oct 19, 2020
@pjenvey
Copy link
Member Author

pjenvey commented Oct 22, 2020

Reducing the total number of connections has helped the number of GOAWAYS, one extreme example:

300 max conns:
https://sentry.prod.mozaws.net/operations/syncstorage-prod/issues/9482220/

vs 30:
https://sentry.prod.mozaws.net/operations/syncstorage-prod/issues/9820650/

But there's still too many

@fzzzy fzzzy assigned fzzzy and unassigned pjenvey Dec 3, 2020
jrconlin added a commit that referenced this issue Feb 12, 2021
* Two new optional arguments:

 _database_pool_connection_lifespan_ = max connection lifespan (in seconds)
 _database_pool_connection_max_idle_ = max idle time (in seconds)

TODO: Tests

Issue: #861
@jrconlin jrconlin linked a pull request Feb 23, 2021 that will close this issue
jrconlin added a commit that referenced this issue Feb 23, 2021
* feat: kill old or excessivly idled connections

* Two new optional arguments:

 _database_pool_connection_lifespan_ = max connection lifespan (in seconds)
 _database_pool_connection_max_idle_ = max idle time (in seconds)

* add reporting (sorta) of connection lifespans to appropriate errors

Issue: #861
@pjenvey
Copy link
Member Author

pjenvey commented Mar 16, 2021

Initial lifespan/max_idle settings: issue1698211

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8 Estimate - xl - Moderately complex, medium effort, some uncertainty. bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants