pubsub: subscriber: keep getting "Stream removed" errors #4826
it looks like naively retrying in the application code does not work; in our case we had to wait for a whole other process to come along and repeat the job, starting a new subscription and picking up the pending subscriber messages that got cut short the first time because of the error. so for a proper retry we would have to go through the whole process again of setting up a subscription and starting a new receive-message loop in order to get the rest of these messages (see the sketch below). we think this should be hidden in the library; please let us know what you think, thanks.
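For reference, a minimal sketch of the kind of application-level retry described above, assuming the current google-cloud-pubsub streaming pull API (`SubscriberClient.subscribe` returning a future). The project/subscription names are placeholders, and whether one `SubscriberClient` can be reused across resubscribes is an assumption here, not confirmed library behaviour:

```python
# Hypothetical sketch: re-create the streaming pull after a failure instead
# of relying on the library to retry. Names below are placeholders.
from google.cloud import pubsub_v1

PROJECT = "my-project"        # placeholder
SUBSCRIPTION = "my-sub"       # placeholder

def callback(message):
    # application-specific processing
    message.ack()

def receive_forever():
    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
    while True:
        future = subscriber.subscribe(path, callback=callback)
        try:
            # Block until the stream dies; "Stream removed" surfaces here.
            future.result()
        except Exception as exc:
            print("stream failed, resubscribing: %r" % exc)
            # Best-effort shutdown of the dead stream, then loop around
            # and open a brand-new streaming pull.
            future.cancel()
```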
@lukesneeringer @dhermes can one of you take a look? @scottingly can you possibly give us a full stack trace?
these are in a multiprocessing pool, and we're on python2, so we aren't getting the indirect traces, just the summary. It just hit again today right on cue, after about 5-6 days, which seems to be the usual interval. I will try to rework the code to get a full trace next time and update the ticket.
That could definitely be part of the cause here; grpcio doesn't currently play well with multiprocessing. Can you give me an idea of how you're using multiprocessing and how that interacts with these client libraries?
we're using a multiprocessing pool of worker processes. normally this works fine, but every few days one of the threads waiting on the subscriber future hits this error. it turns out we were catching Exception so we could keep going with the other workers that were running and do a clean shutdown; that's why we didn't get the full trace. The exception string seems to give the location of the library-internal exception, but I don't know how to get any more information from it. the referenced file in the trace points into the pubsub library, so it seems to be the library raising the error; I'm not sure how to get any more out of it at this point.
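For what it's worth, one commonly suggested way to keep grpcio and multiprocessing from interfering is to create the gRPC-backed client only inside each worker process, after the fork, rather than in the parent. This is a generic sketch of that pattern with placeholder names, not the reporter's actual code:

```python
# Hypothetical sketch: create the gRPC-backed client inside each worker
# process (post-fork) rather than in the parent. Names are placeholders.
import multiprocessing

from google.cloud import pubsub_v1

PROJECT = "my-project"        # placeholder
SUBSCRIPTION = "my-sub"       # placeholder

def worker(worker_id):
    # The client (and its gRPC channel) is created inside this process,
    # so no channel state is shared across the fork boundary.
    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
    future = subscriber.subscribe(path, callback=lambda msg: msg.ack())
    future.result()  # block until the stream fails or is cancelled

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    pool.map(worker, range(4))
```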
Thanks for the details, @scottingly. It would be super awesome if you're able to give us an example program that can reproduce this, but it's not required. For transparency, we're working on burning down critical issues before GA. Since this has some workaround or recovery path, we're going to prioritize it below others, but we do plan to get to this soon.
ok sure, but are you still "awaiting information"? I may be able to make a test case, but that's a lot of code to strip down, so it will probably be a while. our "workaround" is just to repeat the job on a different node, which then completes processing; that isn't really a workaround per se, we just have to start the whole thing over. you're right that it's not a showstopper for us at this time because we can do node-level retries. there do seem to be a bunch of people getting this "Stream removed" error as it relates to pubsub and grpcio, in different languages; see the links earlier in this ticket. some folks seem to have fixed it with a grpcio downgrade, but some people in the tickets above said an upgrade fixed it, which didn't work for us. again, we have tried 1.8.1 and 1.8.4; maybe we should try grpcio-1.7.x, hm.
If you do have the time, it'll be immensely helpful for us. It might also be useful to try to reproduce this on grpc 1.9.1.
one more piece of information to add, which may or may not be relevant: we just got the following traceback in the exact same location in the pubsub library:
this is a different error, but it seems like retrying at this point should not be done in the user code; the library should handle it (but I don't know). however, if it really is on us to do all this retry handling for endpoint codes that break the socket, and to redo the entire subscription setup every time, I suppose we will have to do that. this seems to link to a similar issue for RST_STREAM + pubsub
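If the retry really does have to live in application code, one plausible shape (an assumption on my part, not documented library behaviour) is to inspect the error before tearing down and resubscribing, e.g. only for status codes that look like a broken stream:

```python
# Hypothetical helper: decide whether an exception from future.result()
# looks like a broken-stream condition worth resubscribing for.
import grpc

# Codes assumed (not confirmed) to cover "Stream removed" / RST_STREAM cases.
RETRYABLE_CODES = frozenset([
    grpc.StatusCode.UNAVAILABLE,
    grpc.StatusCode.INTERNAL,
])

def should_resubscribe(exc):
    # Raw gRPC errors expose the status code directly.
    if isinstance(exc, grpc.RpcError) and hasattr(exc, "code"):
        return exc.code() in RETRYABLE_CODES
    # Fall back to matching the message the subscriber surfaced to us.
    return "Stream removed" in str(exc)
```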
we updated to grpcio 1.9.1 almost 9 days ago and have not seen this happen since. this is an unusually long period; it normally happens within a week. not definitive yet, but this may have solved our issue. I will follow up after another week or so; by then I'll be convinced, because this occurrence was quite regular.
@scottingly great to hear! Also, upgrading to 1.10.0rc2 should help with any memory issues per #4978 (comment). I'm closing this for now, but feel free to comment if this persists/re-appears.
this has reappeared, and this time we can reproduce it more quickly. while we haven't seen it anymore on the nodes that run for each hour (48 per day, two for each hour; workers use pubsub to communicate status back to a controller), we have just reprocessed a full day, and 5 out of the 48 workers failed with "Stream removed." running the same backfill several times with the same input data and 48 workers, each run has somewhere from 1-5 occurrences of "Stream removed."

we tested on grpcio 1.10.0 and 1.9.1 and also on 1.8.4. at least for the last (1.8.4) we know backfill was previously working completely and reliably, but now we have tried 1.8.4 again and alas we're getting "Stream removed" a few times with every run of 48. so this has regressed, even for the previously working version. 1.9.1 had initially seemed to solve the problem, but maybe that really just coincided with a server-side change? it may be that grpcio-1.9.1 works better with less frequent / occasional use of pubsub rather than 48 workers all at once, but our message rate is very low: only one message every few seconds from each worker to the controller.

is it a correct assumption that we should only need to set up the subscription once for the lifetime of our daemon process? or should we keep looping, receiving messages until the stream gets removed, then resubscribing and repeating until the next stream removal, and so on?
Thanks, @scottingly. Instead of re-opening this, I'm going to track this over at #4989. /cc @kir-titievsky
ok, the important thing to note is the regression with the same version, implying an external factor. also note that we aren't using windows like in the referenced ticket, just ubuntu 16. we just did another backfill test with grpcio-1.10.0, and this time 11 out of 48 workers got "Stream removed" disconnects. so the frequency is increasing; over the last few days of testing there were at most 5/48.
Our current theory is that this is the serving frontend closing the stream on the server side. We're doing some investigation to make sure we can safely retry this before committing to a fix.
so far, I can confirm that 1367d7b seems to fix our issue. we are no longer getting "Stream removed" errors on backfill and have restored our ingest process to normal operation. thanks, looking forward to the release.
Great news. :)
every few days, sometimes a week, one of our subscribers in a `while True` loop waiting on `future.result(timeout=1)` is getting an exception whose string is `None Stream removed`. this might happen maybe every thousand received messages (I can count better if it's needed).

originally we were on `grpcio-1.8.1`, and after searching, we seemed to find some indication that this error might be solved by an RPC library upgrade. the reference is a totally different language/application, but it uses the same google RPC library; they say upgrading to `grpcio-1.8.4` fixed their issue. a recent comment there seems to indicate that the nodejs library would normally retry this RPC but wasn't getting the right return value, which `grpcio-1.8.4` fixed for them.

moving to the later `grpcio-1.8.4` did not, however, help with the Python library; we still see this error. we don't think our application should see this at all; the library should be retrying it based on the RPC result, just like they do in the referenced cloud client library.

environment is ubuntu16 on a gce node, relevant libs:
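For context, a minimal sketch of the receive loop described in this report (a `while True` loop polling `future.result(timeout=1)`), assuming the google-cloud-pubsub streaming pull API. Project and subscription names are placeholders, and the exact timeout exception type is an assumption that can vary between library versions:

```python
# Minimal sketch of the subscriber loop described in this report.
# Project/subscription names are placeholders, not from the original code.
import concurrent.futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path("my-project", "my-sub")

def callback(message):
    # application-specific handling of the status message
    message.ack()

future = subscriber.subscribe(path, callback=callback)

while True:
    try:
        # Poll with a short timeout so the loop stays responsive; the
        # "None Stream removed" error surfaces here as the future's exception.
        future.result(timeout=1)
    except concurrent.futures.TimeoutError:
        continue  # stream still healthy, keep waiting
    except Exception as exc:
        print("subscriber stream failed: %r" % exc)  # "Stream removed" lands here
        break
```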