Problem with RDY locking to zero #6

nickstenning · 2015-06-11T17:28:45Z

Hi Trevor! First of all -- thank you so much for all your hard work on gnsq. It's a lovely little library and for the most part has been absolutely flawless for us.

Unfortunately we're having an issue at the moment with gnsq v0.2.3 (and nsqd v0.3.0) in which it seems like a network blip causes a Reader to lock itself in a ready state of zero and never recover.

We're getting the following error message in the logs:

[172.17.11.122:4150] error requeueing message (NSQSocketError(32, 'Broken pipe'))

This appears to come from line 699 of reader.py, and seems to me to imply that we're having a network blip while attempting a requeue. Thereafter, it seems like RDY never increments again. You can see the issue where we're dealing with this internally at hypothesis/h#2304.

I've tried to recreate the conditions under which this might occur using blockade (see this repo), but as yet haven't succeeded.

I've perused the source code, and I'm guessing this might be the issue you fixed in d193160 (and 6905a14). Does that sound right?

If not, is there any way I can help track this issue down?

wtolson · 2015-06-11T18:57:24Z

Thanks for the report, Nick. This is strange, as it seems gnsq thinks the connection is closed while nsqd sees it as open.

When gnsq sees an exception while requeueing, it should close the connection and then reopen it, resetting the RDY count. Are you seeing anything in your logs to suggest this is the case?
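To make the expected sequence concrete, here is a minimal toy model of it. This is only a sketch and not gnsq's actual code: `ToyReader`, its attributes, and the `transport` callable are all hypothetical names invented for illustration. It assumes a failed requeue surfaces as an `IOError`, and that a fresh connection starts at RDY 0 until the client sends RDY again.

```python
class ToyReader(object):
    """Toy model of the expected recovery path (hypothetical, not gnsq)."""

    def __init__(self, max_in_flight=1):
        self.max_in_flight = max_in_flight
        self.connected = False
        self.ready = 0
        self.sent = []  # record of RDY commands "sent" to the server

    def connect(self):
        self.connected = True
        self.ready = 0  # a new connection starts with RDY 0
        self.send_ready(self.max_in_flight)

    def send_ready(self, count):
        self.sent.append('RDY %d' % count)
        self.ready = count

    def close(self):
        self.connected = False
        self.ready = 0

    def requeue(self, message_id, transport):
        try:
            transport('REQ %s' % message_id)
        except IOError:
            # The socket is unusable: drop it and start over, which
            # must include sending a fresh RDY on the new connection.
            self.close()
            self.connect()


def broken_pipe(command):
    """Stand-in transport that always fails like a dead socket."""
    raise IOError(32, 'Broken pipe')


reader = ToyReader(max_in_flight=5)
reader.connect()
reader.requeue('abc', broken_pipe)
```

In this model the reader ends up connected with a non-zero RDY after the failed requeue. The symptom reported above would correspond to the reconnect happening without the second RDY being sent.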

Thanks for hunting it down this far. I will see if I can reproduce the behaviour on my end.

mreiferson · 2015-06-11T21:36:52Z

FWIW, as of NSQ v0.3.0, client libraries no longer need to repeatedly send RDY. See nsqio/nsq#404.

wtolson · 2015-06-11T22:53:01Z

@mreiferson That's good news, it was always a bit tricky to get that correct. What is recommended for clients in terms of support for older versions of NSQ?

mreiferson · 2015-06-11T23:02:36Z

Well, one approach would be to not support older nsqd in a future release of gnsq (cleanest code). If you want to be backwards compatible, then you can pivot on the version field in the IDENTIFY response and add some logic to the RDY code paths.
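A rough sketch of what pivoting on the version could look like follows. It assumes the IDENTIFY response is the JSON document nsqd returns when feature negotiation is enabled and that it carries a `version` string such as `0.3.0`. The helper names `parse_version` and `needs_rdy_resend` are hypothetical and not part of gnsq.

```python
import json


def parse_version(version):
    """Turn '0.3.0' or '0.3.0-alpha' into a comparable tuple like (0, 3, 0)."""
    core = version.split('-', 1)[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split('.'))


def needs_rdy_resend(identify_response):
    """Return True if the server predates v0.3.0 (hypothetical helper).

    `identify_response` is the JSON body of the IDENTIFY response.
    """
    data = json.loads(identify_response)
    return parse_version(data['version']) < (0, 3, 0)
```

The RDY code paths could then branch on this flag once per connection, keeping the older re-send behaviour only for servers that report a version below 0.3.0.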

wtolson · 2015-06-11T23:28:15Z

Small bit of progress made. I'm able to consistently reproduce the behaviour you're seeing with this snippet:

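import logging
import gnsq


# Debug logging makes gnsq's connection and RDY handling visible.
logging.basicConfig(level=logging.DEBUG)
reader = gnsq.Reader('test', 'test', nsqd_tcp_addresses=['localhost:4150'])


@reader.on_message.connect
def handle_message(reader, message):
    # Send an invalid command on every connection so that nsqd closes the
    # socket from its side.
    for conn in reader.conns:
        conn.stream.socket.send(b'badcmd\n')
    # Raise so that gnsq attempts a requeue over the now-dead socket.
    raise Exception('test')


reader.start()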

It seems there is some unexpected behaviour when the socket is closed on the server's side. gnsq reconnects, but an initial RDY command is never sent.

wtolson · 2015-06-14T21:43:11Z

Seems to have been a bug in how gnsq handles connection failures while starting to back off. I've pushed a new version of gnsq to PyPI (version 0.3.0) with a fix for this included. Let me know if this resolves the issue for you. Thanks again for reporting!

nickstenning · 2015-06-15T09:19:16Z

Thank you so much for this! I won't be able to tell you immediately if this resolves our issue as it manifests itself in production pretty rarely. I will deploy 0.3.0 and see how we get on.

wtolson · 2015-06-15T16:04:46Z

Thanks, I'm going to close the issue. Feel free to reopen if the issue occurs again.

nickstenning · 2015-06-18T16:20:58Z

Just a note to say that we've seen the networking issues in production again, and no evidence that gnsq is getting confused any more. Thank you!

wtolson · 2015-06-18T17:23:26Z

Sweet, thanks for the update.

nickstenning added a commit to hypothesis/h that referenced this issue Jun 15, 2015

Upgrade to gnsq v0.3.0

b550071

This hopefully addresses the issues we've seen with RDY locking to zero (issue #2304). wtolson/gnsq#6

wtolson closed this as completed Jun 15, 2015
