Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix deadlock in PendingAckHandleImpl #18989

Merged
merged 5 commits into from
Dec 20, 2022

Conversation

nicoloboschi
Copy link
Contributor

@nicoloboschi nicoloboschi commented Dec 19, 2022

Fixes #18988

Motivation

Based on stacktrace in the issue, in PendingAckHandleImpl, if the first time pendingAckHandleCompletableFuture is completed via the method addConsumer the PendingAckHandleImpl object is locked to the current thread because the method completeHandleFuture is synchronized. The addConsumer then locks the PersistentSubscription. In the reverse order the subscription closing procedure locks the PersistentSubscription object and then PendingAckHandleImpl.
It's possible that if they run concurrently it ends up in a deadlock

Modifications

  • Remove synchronized from completeHandleFuture methods. In this way the PendingAckHandleImpl object isn't locked in the stages chain.
  • Moved the init callback to use the internal executor instead of the metadata store executor

I haven't added tests since I wasn't able to reproduce it programmatically.

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 19, 2022
Copy link
Contributor

@liangyepianzhou liangyepianzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please be sure to update the PR description since the code no longer aligns with it.

Comment on lines +942 to 943
if (recoverTime.getRecoverStartTime() != 0L && recoverTime.getRecoverEndTime() == 0L) {
recoverTime.setRecoverEndTime(System.currentTimeMillis());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the synchronized keyword in the method declaration, this conditional update to recoveredEndTime is subject to data races, and the variables will not be safely published to other threads. However, at this time, the recoverTime object is only used for stats, and we have a tendency to have relaxed requirements for stats. Perhaps it is fine. We could alternatively add methods to the class that ensure proper synchronization of updates. I don't feel strongly, except to mention that it is somewhat brittle to assume the variable will only ever be used by stats.

Copy link
Contributor Author

@nicoloboschi nicoloboschi Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're totally right. This recoveryTime registers the time of the replay. It should be handled directly in the callback after the replay is completed. For the current usage, being non thread-safe is fine and I'd prefer leave it as is.
Those methods don't need synchronization at all since, at the moment, there's no data race for them.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@codecov-commenter
Copy link

codecov-commenter commented Dec 20, 2022

Codecov Report

Merging #18989 (d83daac) into master (feb3cb4) will increase coverage by 0.36%.
The diff coverage is 62.50%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18989      +/-   ##
============================================
+ Coverage     46.35%   46.72%   +0.36%     
- Complexity     8939    10525    +1586     
============================================
  Files           597      706     +109     
  Lines         56858    69004   +12146     
  Branches       5905     7392    +1487     
============================================
+ Hits          26357    32241    +5884     
- Misses        27616    33173    +5557     
- Partials       2885     3590     +705     
Flag Coverage Δ
unittests 46.72% <62.50%> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...nsaction/pendingack/impl/PendingAckHandleImpl.java 51.15% <57.14%> (-0.63%) ⬇️
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 53.68% <100.00%> (ø)
...g/apache/pulsar/client/impl/ConnectionHandler.java 50.00% <0.00%> (-5.32%) ⬇️
...ction/buffer/impl/TransactionBufferClientImpl.java 76.74% <0.00%> (-4.66%) ⬇️
.../buffer/impl/TransactionBufferClientStatsImpl.java 82.00% <0.00%> (-4.00%) ⬇️
...pulsar/broker/TransactionMetadataStoreService.java 58.51% <0.00%> (-3.94%) ⬇️
...tion/buffer/impl/TransactionBufferHandlerImpl.java 50.00% <0.00%> (-2.54%) ⬇️
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 83.90% <0.00%> (-1.15%) ⬇️
.../org/apache/pulsar/client/impl/ConnectionPool.java 37.43% <0.00%> (-1.03%) ⬇️
...roker/loadbalance/impl/ModularLoadManagerImpl.java 67.78% <0.00%> (-0.51%) ⬇️
... and 154 more

@nicoloboschi nicoloboschi merged commit 22866bd into apache:master Dec 20, 2022
@nicoloboschi nicoloboschi deleted the fix-deadlock-txn branch December 20, 2022 10:57
nicoloboschi added a commit that referenced this pull request Dec 20, 2022
nicoloboschi added a commit that referenced this pull request Dec 20, 2022
nicoloboschi added a commit that referenced this pull request Dec 20, 2022
(cherry picked from commit 22866bd)
(cherry picked from commit 2d958a9)
nicoloboschi added a commit to datastax/pulsar that referenced this pull request Jan 10, 2023
(cherry picked from commit 22866bd)
(cherry picked from commit 2d958a9)
nicoloboschi added a commit to datastax/pulsar that referenced this pull request Jan 11, 2023
(cherry picked from commit 22866bd)
(cherry picked from commit 2d958a9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Deadlock pulsar-io and metadata-store if transactions enabled
9 participants