
pubsub: subscriber: keep getting "Stream removed" errors #4826

Closed
ghost opened this issue Feb 2, 2018 · 17 comments
Assignees
Labels
api: pubsub · priority: p2 · status: awaiting information · triaged for GA · type: bug

Comments

@ghost

ghost commented Feb 2, 2018

every few days, sometimes as long as a week, one of our subscribers sitting in a while True loop waiting on future.result(timeout=1) gets an exception whose string is "None Stream removed". this happens maybe once every thousand received messages (I can count more precisely if needed).

originally we were on grpcio-1.8.1, and after searching we found some indication that this error might be solved by an RPC library upgrade. the reference is a totally different language/application, but it uses the same Google RPC library; they say upgrading to grpcio-1.8.4 fixed their issue. a recent comment seems to indicate that the nodejs library would normally retry this RPC but wasn't getting the right return value, which grpcio-1.8.4 fixed for them.

moving to the later grpcio-1.8.4 did not help us with the Python library, however; we still see this error.
we don't think our application should see this at all; the library should be retrying based on the RPC result, just as the referenced cloud client library does.

environment is Ubuntu 16 on a GCE node; relevant libs:

  • grpcio: 1.8.4
  • google-api-core: 0.1.4
  • google-cloud-core: 0.28.0
  • google-cloud-pubsub: 0.30.1
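
for reference, this is roughly what the receive loop looks like (a simplified sketch, not our exact code; the project/subscription names are placeholders and the exact subscribe call differs a bit between library versions):

    from concurrent.futures import TimeoutError

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    # placeholder names, for illustration only
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")

    def callback(message):
        # application-specific handling of the worker's status message
        message.ack()

    future = subscriber.subscribe(subscription_path, callback)

    while True:
        try:
            # raises TimeoutError after ~1s whenever the stream is still healthy
            future.result(timeout=1)
            break  # result() returned, so the stream shut down cleanly
        except TimeoutError:
            continue  # nothing terminal yet, keep waiting
        # any other exception (e.g. "None Stream removed") propagates out of
        # the loop, and the future is unusable afterwards
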
@ghost
Author

ghost commented Feb 2, 2018

it looks like naively retrying in the application code does not work; in our case we had to wait for a whole other process to come along and repeat the job, starting a new subscription and picking up the pending subscriber messages that got cut short the first time because of "Stream removed". notably, future.done() was set on the next loop iteration after getting "Stream removed".

so for a proper retry we would have to go through the whole process again of setting up a subscription and starting a new receive-message loop in order to get the rest of these messages. we think this should be hidden inside the library; please let us know what you think, thanks.

@theacodes added the type: bug, api: pubsub, and priority: p1 labels on Feb 6, 2018
@theacodes
Contributor

@lukesneeringer @dhermes can one of you take a look?

@scottingly can you possibly give us a full stack trace?

@ghost
Author

ghost commented Feb 8, 2018

these are in a multiprocessing pool, and we're on Python 2, so we aren't getting the full traces from the workers, just the summary. it hit again today right on cue, after about 5-6 days, which seems to be the usual interval. I will try to rework the code to get a full trace next time and update the ticket.

@theacodes
Contributor

That could definitely be part of the cause here; grpcio doesn't currently play well with multiprocessing. Can you give me an idea of how you're using multiprocessing and how that interacts with these client libraries?

@danoscarmike added the triaged for GA and priority: p2 labels and removed the priority: p1 label on Feb 8, 2018
@ghost
Author

ghost commented Feb 9, 2018

we're using multiprocessing.pool.ThreadPool, so these are threads, not separate processes. we break up a work list via ThreadPool.map(); each thread makes a pubsub subscription, then forks a subprocess which spawns a GCE node that does its assigned work and sends back messages via the topic. after spawning the worker nodes, the pool threads each sit in a loop checking whether their worker has finished (signalled via a threading.Event() set in the callback) while waiting on future.result(timeout=1) and looping on TimeoutError. if the "done" Event is set, we kill the worker and clean up; otherwise we try again.

normally this works fine, but every few days one of the threads waiting in future.result() as above gets this "Stream removed" exception, and we have no way to handle that without making a completely new subscription; the original "future" is invalid at that point.
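
in pseudocode-ish form, each pool thread does roughly this (a simplified sketch: names are placeholders and the subprocess/GCE spawning is elided):

    import threading
    from concurrent.futures import TimeoutError
    from multiprocessing.pool import ThreadPool

    from google.cloud import pubsub_v1

    def run_one_job(job_id):
        subscriber = pubsub_v1.SubscriberClient()
        # placeholder subscription name; the real code sets up one per worker
        subscription_path = subscriber.subscription_path("my-project", "worker-%s" % job_id)
        done = threading.Event()

        def callback(message):
            if message.data == b"done":  # stand-in for the real status protocol
                done.set()
            message.ack()

        future = subscriber.subscribe(subscription_path, callback)
        # ... fork the subprocess that spawns the GCE worker node here ...

        while not done.is_set():
            try:
                future.result(timeout=1)
            except TimeoutError:
                continue
            # "Stream removed" surfaces here as an uncaught exception, and the
            # original future is invalid from that point on

        # kill the worker node and clean up

    ThreadPool(processes=48).map(run_one_job, range(48))
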

it turns out we were catching Exception so we could keep going with the other workers that were running and do a clean shutdown; that's why we didn't get the full trace. the "Stream removed" happened again just now and I managed to print more by using traceback.format_exc():

Traceback (most recent call last):
  [our code...] in receive_messages
    [...]
    self.future.result(timeout=1)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/pubsub_v1/futures.py", line 103, in result
    raise err
Unknown: None Stream removed

this seems to give the location of the library-internal exception, but I don't know how to get any more information from it. the referenced file in the google-cloud-python library looks like this:

   def result(self, timeout=None):
        # [...]
        # Attempt to get the exception if there is one.
        # If there is not one, then we know everything worked, and we can
        # return an appropriate value.
        err = self.exception(timeout=timeout)
        if err is None:
            return self._result
        raise err

so it seems to be the library raising the error; I'm not sure how to get any more out of it at this point.

@theacodes
Contributor

Thanks for the details, @scottingly. It would be super awesome if you're able to give us an example program that can reproduce this, but it's not required.

For transparency, we're working on burning down critical issues before GA. Since this has a workaround/recovery path, we're going to prioritize it below others, but we do plan to get to it soon.

@ghost
Author

ghost commented Feb 15, 2018

ok sure, but are you still "awaiting information"? I may be able to make a test case, but that's a lot of code to strip down, so it will probably be a while.

our "workaround" is to just repeat it again with a different node which completes processing, which isn't really a workaround per se, we just have to start the whole thing over. you're right that it's not a showstopper for us at this time because we can do node-level retries.

there do seem to be a number of people hitting this "Stream removed" error with pubsub and grpcio across different languages; see the links up-ticket. some of them seem to have fixed it with a grpcio downgrade, while others in the tickets above reported that an upgrade fixed it, which didn't work for us. we had already tried 1.8.1 and 1.8.4; maybe we should try grpcio-1.7.x, hm.

@theacodes
Contributor

I may be able to make a test case, but that's a lot of code to strip down, so it will probably be a while.

If you do have the time, it'll be immensely helpful for us. It might also be useful to try to reproduce this on grpc 1.9.1.

@ghost
Author

ghost commented Feb 15, 2018

one more piece of information which may or may not be relevant: we just got the following traceback in the exact same location in the pubsub library:

self.future.result(timeout=1)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/pubsub_v1/futures.py", line 103, in result
    raise err
InternalServerError: 500 Received RST_STREAM with error code 2

this is a different error, but it seems like retrying at this point should be handled by the library rather than by user code (though I don't know for sure). however, if it really is up to us to do all this retry handling for endpoint codes that break the socket and force us to redo the entire subscription setup every time, I suppose we will have to do that.

this seems related to a similar reported issue involving RST_STREAM + pubsub.
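
if the retry really does belong in user code, I imagine the classification would look something like this (just a sketch; the exception classes are the ones in google.api_core.exceptions, and the exact types surfaced may vary by library version):

    from concurrent.futures import TimeoutError

    from google.api_core import exceptions as api_exceptions

    def wait_on_future(future, done_event):
        """Return True if the worker finished, False if the stream died."""
        while not done_event.is_set():
            try:
                future.result(timeout=1)
            except TimeoutError:
                continue  # stream still healthy, keep polling
            except api_exceptions.GoogleAPICallError:
                # covers Unknown ("Stream removed") as well as
                # InternalServerError (RST_STREAM); the future is dead, so the
                # caller has to tear down and rebuild the subscription stream
                return False
        return True
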

@chemelnucfin self-assigned this on Feb 20, 2018
@ghost
Author

ghost commented Mar 2, 2018

we updated to grpcio 1.9.1 almost 9 days ago and have not seen this happen since. this is an unusually long stretch; it normally happens within a week. not definitive yet, but this may have solved our issue. I will follow up after another week or so; by then I'll be convinced, because the occurrence was quite regular.

@theacodes
Contributor

@scottingly great to hear! Also, upgrading to 1.10.0rc2 should help with any memory issues per #4978 (comment). I'm closing this for now but feel free to comment if this persists/re-appears in 1.10.0rc2.

@ghost
Author

ghost commented Mar 9, 2018

this has reappeared, and this time we can reproduce it more quickly. while we haven't seen it anymore on the nodes that run each hour (48 per day, two per hour; workers use pubsub to communicate status back to a controller), we have just reprocessed a full day and 5 out of the 48 workers failed with "Stream removed." I've run the same backfill several times with the same input data and 48 workers; each run has somewhere between 1 and 5 occurrences of "Stream removed."

we tested on grpcio 1.10.0 and 1.9.1 and also on 1.8.4. at least for the last one (1.8.4) we know the backfill was working completely and reliably before, but now that we have tried 1.8.4 again we're getting "Stream removed" a few times with every run of 48. so this has regressed, even on the previously working version.

1.9.1 had initially seemed to solve the problem, but maybe that really just coincided with a server-side change? it may be that grpcio-1.9.1 works better with less frequent / occasional use of pubsub rather than 48 subscribers all at once, but our message rate is very low, only one message every few seconds from each worker to the controller.

is it a correct assumption that we should only need to open the subscription once for the lifetime of our daemon process? or should we keep looping on messages until the stream gets removed, then resubscribe and repeat until the next "Stream removed", and so on?
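
to be concrete, the second option would look roughly like this (sketch only, placeholder names; whether this is actually required is exactly what I'm asking):

    from concurrent.futures import TimeoutError

    from google.api_core import exceptions as api_exceptions
    from google.cloud import pubsub_v1

    def receive_until_done(subscription_path, callback, done_event):
        subscriber = pubsub_v1.SubscriberClient()
        while not done_event.is_set():
            # open a fresh streaming pull on the same subscription
            future = subscriber.subscribe(subscription_path, callback)
            try:
                while not done_event.is_set():
                    try:
                        future.result(timeout=1)
                    except TimeoutError:
                        continue
            except api_exceptions.GoogleAPICallError:
                # "Stream removed" / RST_STREAM: drop the dead future and loop
                # around to resubscribe
                continue
            finally:
                future.cancel()  # best-effort cleanup of the old future
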

@theacodes
Contributor

Thanks, @scottingly. Instead of re-opening this, I'm going to track this over at #4989.

/cc @kir-titievsky

@ghost
Author

ghost commented Mar 9, 2018

ok. the important thing to note is the regression with the same version, implying an external factor. also note that we aren't using Windows like in the referenced ticket, just Ubuntu 16.

we just did another backfill test with grpcio-1.10.0 and this time 11 out of 48 workers got "Stream removed" disconnects. so the frequency is increasing; over the last few days of testing there were at most 5 out of 48.

@theacodes
Contributor

Our current theory is that this is the serving frontend closing the stream on the server side. We're doing some investigation to make sure we can safely retry this before committing to a fix.

@ghost
Author

ghost commented Mar 14, 2018

so far, I can confirm that 1367d7b seems to fix our issue. we're no longer getting "Stream removed" errors on backfill and have restored our ingest process to normal operation. thanks, looking forward to the release.

@theacodes
Contributor

Great news. :)
