Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition/goroutine leak in partition discovery goroutine #474

Merged

Conversation

PowerStateFailure
Copy link
Contributor

@PowerStateFailure PowerStateFailure commented Feb 22, 2021

Master issue: #462

Motivation

In some cases consumer Close method might be called in nearly the same moment with background partiotion discovery routine tick. When Close acquires mutex first, it will release all underlying consumers, but that not guarantees backgound routine's not wait for the same mutex in internalTopicSubscribeToPartitions, so when Close completes, mutex gets acquired by not-finished-yet goroutine leading to new partition consumers gets created without any chance to close them.
The same logic is valid for producer partition discovery goroutine which is closed the same manner.

This PR modifies background discrovery goroutine the way it will be closed before acquiring mutex in Close method.

Modifications

Described above.

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / GoDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@omnilight
Copy link
Contributor

omnilight commented Feb 27, 2021

@PowerStateFailure I faced the same problem. My service is a proxy between foreign clients and pulsar. When foreign client connects/disconnects too often and we have changed number of partitions for topic, they could get another consumer for partition.

Could you please fix failed build?

@wolfstudy Could you please review this PR?

@PowerStateFailure
Copy link
Contributor Author

PowerStateFailure commented Feb 28, 2021

@omnilight

Could you please fix failed build?

As i can see from build log, there is race condition somewhere in connection acquire logic. As I don't change it, i'm unsure if i should fix it.

BTW, it seems that producer has the same issue with closing background partition discovery goroutine, so I'll try to fix it too.

@PowerStateFailure PowerStateFailure force-pushed the fix-partition-discovery-leak branch from 79dcbfc to aaea56d Compare February 28, 2021 15:08
…t actually gets closed before close active consumers
…t actually gets closed before close active producers
…rapping method body with sync.Once (similar to consumer.Close)
@PowerStateFailure PowerStateFailure force-pushed the fix-partition-discovery-leak branch from aaea56d to 2039c91 Compare March 29, 2021 16:58
@PowerStateFailure
Copy link
Contributor Author

Rebased to the latest master, thus avoiding undesirable data race error in tests.

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@merlimat merlimat merged commit 6ab17dc into apache:master Mar 30, 2021
@merlimat merlimat added this to the 0.5.0 milestone Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants