fix: recover watch stream on more error types #9995

Merged
merged 3 commits into from
Jan 2, 2020

Conversation

crwilcox
Contributor

Watch Retry is more permissive in Go. This PR replicates that in Python.

Fixes #9890 and b/144734355

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Dec 18, 2019
@crwilcox crwilcox changed the title fix: Recover watch stream on more error types fix: recover watch stream on more error types Dec 18, 2019
Member

@frankyn frankyn left a comment

LGTM. Reviewed error codes map back to the Go list.

@crwilcox crwilcox self-assigned this Dec 18, 2019
Contributor

@tritone tritone left a comment

Note that https://aip.dev/194 lists several of these errors as non-retryable. I don't have enough context specific to Firestore and this client to know whether that is a problem here.

@BenWhitehead
Contributor

@schmidt-sebastian Can you take a look at the list of error codes here and weigh in if they are okay to retry or not?

@crwilcox
Contributor Author

I should note that, in theory, retrying UNAVAILABLE should be sufficient. But Go is working and Python isn't, and the narrower retry list seems a likely suspect. I am currently running a long test on my machine to see the results of this change over a few hours.

@schmidt-sebastian

@crwilcox crwilcox added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Dec 20, 2019
@crwilcox
Contributor Author

@jadekler voiced concerns about many of the retry codes in Go that I copied. I also found that this change didn't fully stop the issue. Another option is to match Node. I have a debug session running and am waiting for a failure so I can dig into what happened.

@crwilcox
Contributor Author

I left this run for a few days. I think if we just retry INTERNAL that will be sufficient. @jadekler @schmidt-sebastian thoughts?

2019-12-23 00:20:52,887 - google.api_core.bidi - DEBUG - waiting for recv.
DEBUG:google.api_core.bidi:Re-opening stream from gRPC callback.
2019-12-23 00:21:05,749 - google.api_core.bidi - DEBUG - Re-opening stream from gRPC callback.
DEBUG:google.auth.transport.requests:Making request: POST https://accounts.google.com/o/oauth2/token
2019-12-23 00:21:05,787 - google.auth.transport.requests - DEBUG - Making request: POST https://accounts.google.com/o/oauth2/token
DEBUG:urllib3.connectionpool:Resetting dropped connection: accounts.google.com
INFO:google.api_core.bidi:Re-established stream
2019-12-23 00:21:05,801 - urllib3.connectionpool - DEBUG - Resetting dropped connection: accounts.google.com
2019-12-23 00:21:05,802 - google.api_core.bidi - INFO - Re-established stream
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 519, in traceback
    raise self
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/google/api_core/bidi.py", line 505, in _recoverable
    return method(*args, **kwargs)
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/google/api_core/bidi.py", line 561, in _recv
    return next(call)
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 392, in __next__
    return self._next()
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 561, in _next
    raise self
DEBUG:google.api_core.bidi:Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>> caused <_Rendezvous of RPC that terminated with:
        status = StatusCode.INTERNAL
        details = "Received RST_STREAM with error code 0"
        debug_error_string = "{"created":"@1577060465.749246373","description":"Error received from peer ipv4:74.125.142.95:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Received RST_STREAM with error code 0","grpc_status":13}"
>.
2019-12-23 00:21:05,810 - google.api_core.bidi - DEBUG - Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>> caused <_Rendezvous of RPC that terminated with:
        status = StatusCode.INTERNAL
        details = "Received RST_STREAM with error code 0"
        debug_error_string = "{"created":"@1577060465.749246373","description":"Error received from peer ipv4:74.125.142.95:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Received RST_STREAM with error code 0","grpc_status":13}"
>.
DEBUG:google.api_core.bidi:Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>>.
2019-12-23 00:21:05,811 - google.api_core.bidi - DEBUG - Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>>.
DEBUG:google.api_core.bidi:Stream was already re-established.
2019-12-23 00:21:05,811 - google.api_core.bidi - DEBUG - Stream was already re-established.
DEBUG:urllib3.connectionpool:https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
2019-12-23 00:21:05,847 - urllib3.connectionpool - DEBUG - https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None
DEBUG:google.api_core.bidi:Re-opening stream from gRPC callback.
2019-12-23 00:21:05,852 - google.api_core.bidi - DEBUG - Re-opening stream from gRPC callback.
INFO:google.api_core.bidi:Re-established stream
2019-12-23 00:21:05,853 - google.api_core.bidi - INFO - Re-established stream
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 519, in traceback
    raise self
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/google/api_core/bidi.py", line 505, in _recoverable
    return method(*args, **kwargs)
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/google/api_core/bidi.py", line 561, in _recv
    return next(call)
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 392, in __next__
    return self._next()
  File "/usr/local/google/home/crwilcox/scratch/firestore_rst_stream_unavailable_retry/venv/lib/python3.7/site-packages/grpc/_channel.py", line 561, in _next
    raise self
DEBUG:google.api_core.bidi:Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>> caused <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Transport closed"
        debug_error_string = "{"created":"@1577060465.850641483","description":"Error received from peer ipv4:74.125.142.95:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Transport closed","grpc_status":14}"
>.
2019-12-23 00:21:05,854 - google.api_core.bidi - DEBUG - Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>> caused <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Transport closed"
        debug_error_string = "{"created":"@1577060465.850641483","description":"Error received from peer ipv4:74.125.142.95:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Transport closed","grpc_status":14}"
>.
DEBUG:google.api_core.bidi:Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>>.
2019-12-23 00:21:05,855 - google.api_core.bidi - DEBUG - Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f8bd4199910>>.
DEBUG:google.api_core.bidi:Stream was already re-established.
2019-12-23 00:21:05,855 - google.api_core.bidi - DEBUG - Stream was already re-established.
DEBUG:google.api_core.bidi:recved response.
2019-12-23 00:21:05,964 - google.api_core.bidi - DEBUG - recved response.
DEBUG:google.cloud.firestore_v1.watch:on_snapshot: target change: 1
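
As a reading aid for the log above: the debug_error_string payload is JSON, so the numeric grpc_status can be pulled out and compared with the reported StatusCode. This is only a minimal sketch. The helper name summarize and the two-entry STATUS_NAMES table are illustrative and not part of any library; the string is copied from the log, and 13 and 14 are the standard gRPC numeric values for INTERNAL and UNAVAILABLE.

```python
import json

# debug_error_string copied verbatim from the log above.
DEBUG_ERROR_STRING = (
    '{"created":"@1577060465.749246373",'
    '"description":"Error received from peer ipv4:74.125.142.95:443",'
    '"file":"src/core/lib/surface/call.cc","file_line":1055,'
    '"grpc_message":"Received RST_STREAM with error code 0","grpc_status":13}'
)

# Only the two status codes that appear in the log (illustrative table).
STATUS_NAMES = {13: "INTERNAL", 14: "UNAVAILABLE"}


def summarize(debug_error_string):
    """Return (status name, gRPC message) from a JSON debug_error_string."""
    payload = json.loads(debug_error_string)
    code = payload["grpc_status"]
    # Fall back to the raw number for codes outside the small table.
    return STATUS_NAMES.get(code, str(code)), payload["grpc_message"]


print(summarize(DEBUG_ERROR_STRING))
# ('INTERNAL', 'Received RST_STREAM with error code 0')
```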

@jeanbza
Member

jeanbza commented Dec 23, 2019

Did you figure out whether RST_STREAM was being returned as an INTERNAL error or not? If so, the change you propose makes sense to me.

I think if we just retry INTERNAL that will be sufficient

INTERNAL in addition to UNAVAILABLE, or just by itself? (UNAVAILABLE should always be retried)
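
For illustration, the narrower option discussed here (recover the stream only on INTERNAL and UNAVAILABLE) could be written as an allow-list predicate. This is a sketch under assumptions, not the code in this PR: StatusCode is a local stand-in for grpc.StatusCode so the example runs without gRPC installed, and should_recover is a hypothetical helper name.

```python
from enum import Enum


class StatusCode(Enum):
    """Local stand-in for grpc.StatusCode (standard numeric values)."""
    PERMISSION_DENIED = 7
    INTERNAL = 13
    UNAVAILABLE = 14


# Allow-list: only these statuses trigger a stream re-open.
RECOVERABLE = frozenset({StatusCode.INTERNAL, StatusCode.UNAVAILABLE})


def should_recover(status):
    """Return True if a stream that ended with `status` should be re-opened."""
    return status in RECOVERABLE
```

With this shape, adding or dropping a code is a one-line change to RECOVERABLE, which keeps the retry policy easy to compare across client libraries.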

@schmidt-sebastian

I would prefer if we used the same retry configuration everywhere, and the Node SDK has the configuration that is most battle-tested. We should try to retry every Watch request unless we know beforehand that a retry will not help (e.g. "PERMISSION_DENIED"). If we don't do this, our users will, and they will do so without backoff.
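
The policy described above (retry everything except codes known not to benefit, and always back off) can be sketched roughly as follows. All names here are hypothetical and this is not the Node SDK or Python client implementation: StreamError, backoff_delays, and run_with_retry are made up for the example, the delay parameters are arbitrary, and the deny-list contains only PERMISSION_DENIED because that is the only code named in this thread.

```python
import random
import time
from enum import Enum


class StatusCode(Enum):
    """Local stand-in for grpc.StatusCode (standard numeric values)."""
    PERMISSION_DENIED = 7
    INTERNAL = 13
    UNAVAILABLE = 14


# Deny-list: retry anything that is NOT listed here (assumed contents).
NON_RETRYABLE = frozenset({StatusCode.PERMISSION_DENIED})


class StreamError(Exception):
    """Hypothetical error carrying the status a stream terminated with."""

    def __init__(self, status):
        super().__init__(status.name)
        self.status = status


def backoff_delays(initial=1.0, multiplier=1.5, maximum=60.0):
    """Yield jittered delays whose upper bound grows exponentially."""
    delay = initial
    while True:
        yield random.uniform(0, delay)
        delay = min(delay * multiplier, maximum)


def run_with_retry(open_stream, max_attempts=5, sleep=time.sleep):
    """Call open_stream(), retrying with backoff unless the error is fatal."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return open_stream()
        except StreamError as exc:
            # Give up on known-fatal codes or when attempts are exhausted.
            if exc.status in NON_RETRYABLE or attempt == max_attempts:
                raise
            sleep(next(delays))
```

Passing sleep as a parameter is a design choice for testability: a caller can substitute a recorder such as a list's append method to check how many backoff waits happened without actually waiting.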

@crwilcox
Contributor Author

@jadekler I haven't gotten an answer on that yet, but discussion is ongoing at b/144734355.

I think for now @schmidt-sebastian has a reasonable point. Customers of Watch are determined to keep it running, likely to the point of retrying any of the codes themselves. If we retry almost all of them, but with sensible timeouts, that is likely better than leaving it to chance.

I will modify this PR to match Node.js.

@crwilcox
Contributor Author

crwilcox commented Jan 2, 2020

Merging this. We have further discussion internally on whether RST_STREAM should occur with the error type we are seeing (INTERNAL), but this ought to resolve the issues for users currently. We can always soften this later if it becomes unnecessary.


Successfully merging this pull request may close these issues.

RST_STREAM error from grpc via Bidi in Firestore Client Library
7 participants