Broker failover - consumer groups #556

Closed
joeirimpan opened this issue Sep 11, 2023 · 7 comments · Fixed by #569
Labels
enhancement New feature or request

Comments

@joeirimpan

Hi @twmb. Thank you for this lib.

We have a scenario with N different upstream brokers carrying topics and partitions, all replicated to one aggregated downstream broker node. Our goal is to replicate data from nodeUpstream1 to nodeDownstream, and in the event of nodeUpstream1 going down, we should be able to switch over to nodeUpstream2, nodeUpstream3, and so on in a round-robin fashion. This approach helps us deduplicate data downstream without requiring additional stores.

However, we have encountered a few issues with our current implementation. I would like to discuss and confirm whether we are following the right approach or if there are better alternatives. The issues we are facing are as follows:

  • The OnBrokerDisconnect() hook gets called multiple times when a broker goes down. It also fires when idle connections are reaped internally. Is this behavior expected, or should the hook be called only once per broker disconnect event?

  • Close() cannot be called for the consumer group because it issues a fresh LeaveGroup request, which gets stuck when the broker is already down. Is there a recommended way to gracefully handle this situation and ensure the consumer group is closed properly?

  • If autocommit is enabled, there seems to be no way to clean up the associated goroutines. Is there a recommended approach to clean up these goroutines?

  • PurgeTopicsFromClient seems to be the closest cleanup option we found in the library. However, when we round-robin back to the same node, committing offsets with kadm.CommitOffsets to resume the consumer group from where it left off fails with an error (UNKNOWN_MEMBER_ID: The coordinator is not aware of this member.).

implementation: https://github.com/joeirimpan/kaf-relay/blob/fallbacks/consumer.go
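
To make the setup concrete, here is a rough, minimal sketch of how the disconnect hook is registered (notifyFailover is a hypothetical callback; the real logic lives in the consumer.go linked above):

package main

import (
	"net"

	"github.com/twmb/franz-go/pkg/kgo"
)

// disconnectHook implements kgo.HookBrokerDisconnect so that a dropped
// broker connection can trigger a switch to the next upstream node.
type disconnectHook struct {
	notifyFailover func(meta kgo.BrokerMetadata) // hypothetical callback
}

// OnBrokerDisconnect fires per dropped connection; see the discussion below
// about it being called multiple times for a single broker.
func (h *disconnectHook) OnBrokerDisconnect(meta kgo.BrokerMetadata, _ net.Conn) {
	h.notifyFailover(meta)
}

func newUpstreamClient(seeds []string, group string, topics []string, h *disconnectHook) (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers(seeds...),
		kgo.ConsumerGroup(group),
		kgo.ConsumeTopics(topics...),
		kgo.WithHooks(h),
	)
}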

@twmb
Owner

twmb commented Sep 11, 2023

  • OnBrokerDisconnect is called whenever a connected connection is killed. There may be multiple connections per broker.
  • The consumer group must be left for the client to close properly -- but the controller should transfer to another broker so that leaving the group can succeed. It is an oversight right now that the client does not allow shutting down without leaving the group, nor a context-based shutdown. I can consider adding a ForceClose method in the next release.
  • Autocommit goroutines are only cleaned up when the group is left, i.e. when there is no more committing to do. Do you want to quit autocommitting sooner?
  • I think you can only commit to groups that are empty. Are you committing to an active group?
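
For anyone landing here later, committing to an empty group through kadm might look roughly like this (a sketch only; the topic, partition, and offset values are placeholders):

package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// commitWhileEmpty commits placeholder offsets for a group that currently has
// no live members. If the group still has active members, the coordinator is
// expected to reject the commit (e.g. with UNKNOWN_MEMBER_ID).
func commitWhileEmpty(ctx context.Context, cl *kgo.Client, group string) error {
	adm := kadm.NewClient(cl)

	offsets := make(kadm.Offsets)
	offsets.AddOffset("my-topic", 0, 12345, -1) // -1 leader epoch = unknown

	resp, err := adm.CommitOffsets(ctx, group, offsets)
	if err != nil {
		return err
	}
	return resp.Error() // first per-partition commit error, if any
}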

@joeirimpan
Author

Yes. A ForceClose() method would be useful for graceful shutdowns and early termination of autocommit goroutines.

I was committing offsets to the group after it was started, and the commits were going through for the new groups. Maybe the commits were going through because the consumer groups hadn't joined yet (#167 (comment)). The "UNKNOWN_MEMBER_ID: The coordinator is not aware of this member" error must then be happening when we commit to a Stable consumer group?

Let's say we need to resume an existing consumer group from a different offset: should we first close the group, commit offsets through kadm, and then reinit the consumer group?

@joeirimpan
Author

Let's say we need to resume an existing consumer group from a different offset: should we first close the group, commit offsets through kadm, and then reinit the consumer group?

This worked.
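
For posterity, roughly what the working sequence looks like (all parameters are illustrative, and the kadm client is assumed to be built on a separate connection from the group client):

package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// resumeGroupAt closes the old group client, commits the desired resume
// offsets while the group is empty, and re-initializes the consumer group.
func resumeGroupAt(ctx context.Context, old *kgo.Client, adm *kadm.Client,
	group string, seeds, topics []string, resumeAt kadm.Offsets) (*kgo.Client, error) {

	// 1. Close the old client so the group has no members left.
	old.Close()

	// 2. Commit the resume offsets through kadm while the group is empty.
	if _, err := adm.CommitOffsets(ctx, group, resumeAt); err != nil {
		return nil, err
	}

	// 3. Re-init the consumer group; it resumes from the committed offsets.
	return kgo.NewClient(
		kgo.SeedBrokers(seeds...),
		kgo.ConsumerGroup(group),
		kgo.ConsumeTopics(topics...),
	)
}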

@joeirimpan
Author

Shall I close this issue? Like you said, ForceClose would be a nice addition.

@twmb
Owner

twmb commented Sep 16, 2023

We can leave this open to track ForceClose, although I'm leaning a bit towards CloseForced, so that it sits next to the other Close functions in godoc (an unfortunate limitation of everything being alphabetical). Another option is CloseContext. wdyt?

// CloseForced is the same as CloseAllowingRebalance, but does not wait for the
// group to be left once the context is closed. If you are consuming in a
// group, this function attempts to leave the group but only waits until the
// context is closed.
//
// A nil context is valid and forces the client to quit immediately. This returns any
// context error or leave group error.
func (*Client) CloseForced(context.Context) error

Realistically, I should also add LeaveGroupContext(context.Context) error to allow LeaveGroup to be interrupted and to return any leave group error. The original version of LeaveGroup is just not sufficient and should probably be deprecated, or documented as an inferior function once LeaveGroupContext exists.

twmb added the enhancement (New feature or request) label Sep 16, 2023
twmb added a commit that referenced this issue Sep 21, 2023
This allows more control over timeouts when leaving a group or closing
the client, and also gives more insight into group leave errors.

Closes #556.
@twmb
Owner

twmb commented Sep 21, 2023

After implementing it, I'll just add LeaveGroupContext and document that you can speed up Close by first using LeaveGroupContext.
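
A minimal usage sketch of that shape, assuming LeaveGroupContext lands with the signature sketched above (the 5s timeout is arbitrary):

package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// shutdown bounds the group leave with a context, then closes the client.
func shutdown(cl *kgo.Client) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := cl.LeaveGroupContext(ctx); err != nil {
		// Either the context expired or the leave-group request itself failed.
		log.Printf("leave group: %v", err)
	}
	cl.Close() // the group is already left (or the leave timed out), so this returns promptly
}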

twmb closed this as completed in #569 Sep 21, 2023
@joeirimpan
Author

Thank you @twmb
