
Keep on getting FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] #90

Closed
pradeepsimha143 opened this issue Aug 25, 2014 · 25 comments


@pradeepsimha143

Hi Team.

I am trying to produce and consume Kafka messages using the Node library (kafka-node) with the HighLevelConsumer API. But I keep getting this exception at random times, and the Node.js server stops.

FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/home/strg/project/kafkaBroker/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /home/strg/project/kafkaBroker/node_modules/kafka-node/lib/highLevelConsumer.js:141:71

I am not sure what the issue is here.

I have set the ZooKeeper timeout to 50000.

This is my high-level consumer code:

// Assumes setup along these lines (not shown in the original post):
// var kafka = require('kafka-node');
// var Consumer = kafka.HighLevelConsumer;
// var client = new kafka.Client('localhost:2181'); // ZooKeeper address is an assumption

consumer = new Consumer(
    client,
    [
        { topic: consumeTopic } // consumeTopic is the topic which the user provided
    ],
    {
        autoCommit: false
    }
);

consumer.on('message', function (message) {
    console.log(message);
});

If I restart the server it works fine, but after a while I keep getting this exception again. Can anyone please guide me on this? I am not able to understand what this exception means. I tried restarting the ZooKeeper server and the Kafka server, but I am still facing this exception. Any help would be very welcome, as I am very new to Kafka.

@pradeepsimha143 pradeepsimha143 changed the title FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] Keep on getting FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] Aug 25, 2014
@jezzalaycock
Contributor

Under what circumstances is the rebalance happening - are you stopping the consumer using a CTRL-C?

@jcastill0

I also get that exception occasionally. Restarting my consumer fixes it, but obviously that is not the way to fix it. I'm using a HighLevelConsumer as well, with only one zookeeper and one broker. I'm using the latest 0.8.2.1 version of Kafka.

2015-03-17T17:20:37.379Z - error: error FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:170:51
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:419:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:240:13
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:237:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:600:34
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:399:29
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:389:41

@jcastill0

Anyone else having this problem? It happens periodically for me when I start the consumer (HighLevel).
As I said above, my kafka setup is very simple, just one broker and one zookeeper. Usually after the second attempt it will not throw that exception. I'm using the latest version (0.2.24).
Any insight would be appreciated.

thanks

** julio

@AhmedSoliman

I'm facing the same problem too. However, I'm generating the clientId randomly to avoid creating two consumers (on the same topic) with the same clientId, using:

clientId = "worker-" + Math.floor(Math.random() * 10000)

@ericdolson

Is there any progress being made on this? I am deploying an application across several nodes elastically and am getting this error about every other time I start up an instance. I tried upping the retry attempts to 30 (in the HighLevelConsumer's rebalance() function) and it would get as high as 24 before finally succeeding. I am nervous about just picking a big number and expecting that to work, though.

@jcastill0 In what way are you able to restart your high-level consumer? I am trying to use consumer.on('error', ...) as a way to catch and restart, but can I reuse my consumer? My client? I would appreciate a pointer :)

@syymza

syymza commented Jun 25, 2015

Same problem here, and @AhmedSoliman's solution does not seem to help... Any news?

@CWSpear

CWSpear commented Jun 29, 2015

Randomizing the group ID worked for my tests. I don't understand enough of Kafka to know whether that will mess up production if production uses a fixed group ID. Or maybe production can use a random client ID and all will be well?

@Ovidiu-S

Is this a Node issue or a Kafka one? I am having the same problem.

@jezzalaycock
Contributor

NODE_EXISTS is normally due to the ZooKeeper timeout: under certain circumstances (a CTRL-C, for instance) the ephemeral nodes don't get removed. If you're not bothered about balancing a number of consumers on a topic, then I suggest you try the plain Consumer and not the HighLevel one.
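
For reference, a minimal sketch of what the plain Consumer with explicitly assigned partitions might look like; the ZooKeeper address, topic name, and partition numbers are assumptions, not values from this thread:

var kafka = require('kafka-node');

// The plain Consumer does not join a consumer group and does not rebalance,
// so it avoids the ephemeral-node registration that triggers NODE_EXISTS.
// The trade-off is that partitions must be assigned explicitly.
var client = new kafka.Client('localhost:2181');
var consumer = new kafka.Consumer(
    client,
    [
        { topic: 'my-topic', partition: 0 },
        { topic: 'my-topic', partition: 1 }
    ],
    { autoCommit: false }
);

consumer.on('message', function (message) {
    console.log(message);
});

consumer.on('error', function (err) {
    console.error(err);
});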

@felipesabino

@CWSpear How are you dealing with offset commits while using a random consumer group ID?

I was sure that was what is used to keep track of the consumed offsets, and we use a fixed group ID in production for that reason...

@CWSpear

CWSpear commented Jul 21, 2015

@felipesabino no idea. I'm actually using a company-specific library wrapped around HighLevelConsumer, and I have dug through the code some, but I haven't been able to get very deep, so many of the inner workings are over my head.

I'm pretty sure the problem is not in my code specifically, but either in the company's lib or in Kafka, and I'm just trying to get to the bottom of it. It's been rather bothersome, and I'm not the only one experiencing issues similar to this, but for now, randomizing the IDs works for tests. It's QA's problem now, right? ;-)

@felipesabino

We managed to easily reproduce this error in our environment by killing the consumer process and starting it again quickly.

We noticed that whenever our server restarted and we tried reconnecting before ZooKeeper had killed the connection (session) on its side, this exception would be thrown.

To confirm that ZooKeeper killed the connection, look for a log message like the following:

INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x14eb7c676540001, timeout of 30000ms exceeded

So far we have managed to avoid any FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] errors on server restarts simply by delaying any reconnection by at least this session timeout. You can find this value for your server in your ZooKeeper config file, in the maxSessionTimeout param - http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html

Also, this behavior is consistent with what @CWSpear reported, as randomizing the clientId will force ZooKeeper to create a new session for every connection attempt, so the exception is not thrown. But that is far from ideal, as the clientId is what is used to keep track of your committed offsets...

We are still watching to see whether it occurs randomly; if it does, maybe a similar approach should be taken in the rebalance logic. Anyway, we will keep you posted.
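
For illustration, a rough sketch of that delay-before-reconnect idea; the 30000 ms value mirrors the session timeout from the log line above, and the connection setup itself is an assumption rather than code from this thread:

var kafka = require('kafka-node');

// Assumed ZooKeeper session timeout in milliseconds; match it to the
// maxSessionTimeout (or the expiry value the ZooKeeper server logs).
var ZK_SESSION_TIMEOUT_MS = 30000;

function startConsumer() {
    var client = new kafka.Client('localhost:2181');
    var consumer = new kafka.HighLevelConsumer(
        client,
        [{ topic: 'my-topic' }],
        { autoCommit: false }
    );
    consumer.on('message', function (message) {
        console.log(message);
    });
    consumer.on('error', function (err) {
        console.error(err);
    });
}

// Wait for the previous session to expire on the ZooKeeper side before
// reconnecting, so the stale ephemeral nodes are gone when the rebalance runs.
setTimeout(startConsumer, ZK_SESSION_TIMEOUT_MS);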

@Ovidiu-S

I managed to get rid of the issue by setting

maxSessionTimeout=5000

and delaying the client connection by 6 seconds. Just use setTimeout() for that, and it will work just fine.

@felipesabino

@Ovidiu-S I could not find any maxSessionTimeout in the docs or code... do you mean sessionTimeout from node-zookeeper-client?

@Ovidiu-S

@felipesabino I am referring to the zookeeper server config, not the client. It is the maxSessionTimeout in the zoo.cfg file

@bendpx

bendpx commented Sep 8, 2015

+1

@barockok

barockok commented Feb 4, 2016

Any update? I'm facing the same problem 😦

@Ovidiu-S

Ovidiu-S commented Feb 4, 2016

@barock19 I recommend switching to Kafka 0.9 and the new (ZooKeeper-free) client when it is released: https://github.com/oleksiyk/kafka

Until then, just bypass the rebalancing issue with the fix above.

@hyperlink
Collaborator

This problem went away when I added a handler for the CTRL+C case. This ensures the consumer/client is cleaned up; otherwise you are at the mercy of whenever the ZooKeeper node times out.

process.on('SIGINT', function () {
    // close(true, ...) commits the current offsets before closing, so the
    // consumer's ephemeral ZooKeeper nodes are removed before the process exits.
    highLevelConsumer.close(true, function () {
        process.exit();
    });
});

@jezzalaycock
Contributor

As suggested by @hyperlink, the problem is down to the fact that the ephemeral nodes are not relinquished in ZooKeeper when issuing a CTRL-C (SIGINT). Under normal failure cases the nodes are released as expected.

Moving to Kafka 0.9 will require wholesale changes to the node client - however, I believe the Kafka guys are creating a Node client themselves - so it might be that we can simply switch to using that when it is available.

@mlcloudsec

I'm having the same issue. Tried changing the zoo.cfg maxSessionTimeout and also closing the high-level consumer on SIGINT. Also tried closing the client itself. Same result.

@rtorrero

rtorrero commented Apr 29, 2016

Using the suggested handler with a small modification fixed the issue for me (a sketch follows below):

  • Add a connection.close() in the callback
  • Put the process.exit() in the connection.close() callback.
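
A rough sketch of that modified handler, assuming the kafka-node Client instance is the "connection" being closed (the highLevelConsumer and client names are illustrative, not taken from the comment above):

process.on('SIGINT', function () {
    // Close the consumer first (true = commit current offsets before closing),
    // then close the underlying client so its ZooKeeper session ends cleanly,
    // and only exit once both callbacks have fired.
    highLevelConsumer.close(true, function () {
        client.close(function () {
            process.exit();
        });
    });
});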

@springuper

springuper commented May 6, 2016

@hyperlink's method works fine for me.

UPDATE: It still happens, and I have found out the real problem; please refer to #369

@pcothenet

Same issue on my side. And I can't catch the SIGINT because AWS Elastic Beanstalk somehow does not send one. I'm pretty sure @springuper's PR might fix this.

@snatesan
Contributor

I have seen this issue occur when the client (ZooKeeper) loses its connection and regains it soon after.
The rebalance logic should account for the ZooKeeper session timeout, as already specified here: #90 (comment)
