Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix wrong double-checked locking for readOnActiveConsumerTask in dispatcher #22279

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Mar 15, 2024

Motivation

The access on readOnActiveConsumerTask is not thread safe.

if (readOnActiveConsumerTask != null) {
    return;
}
readOnActiveConsumerTask = topic.getBrokerService().executor().schedule(() -> {/* ... */});

There is a case that:

Steps Thread 1 Thread 2
1 Read readOnActiveConsumerTask: null
2 call schedule
3 Read readOnActiveConsumerTask: null
4 call schedule
5 Write readOnActiveConsumerTask to the result of schedule
6 Write readOnActiveConsumerTask to the result of schedule

Then schedule() will be called twice and only one result (it could be either step 5 or 6) will be assigned to readOnActiveConsumerTask.

Modifications

Follow the double-checked locking when readOnActiveConsumerTask is null to ensure readOnActiveConsumerTask cannot be called concurrently.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…erTask in dispatcher

### Motivation

The access on `readOnActiveConsumerTask` is not thread safe.

```java
if (readOnActiveConsumerTask != null) {
    return;
}
readOnActiveConsumerTask = topic.getBrokerService().executor().schedule(() -> {/* ... */});
```

There is a case that:

| Steps | Thread 1 | Thread 2 |
| :- | :- | :- |
| 1 | Read `readOnActiveConsumerTask`: null | |
| 2 | call `schedule` | |
| 3 | | Read `readOnActiveConsumerTask`: null |
| 4 | | call `schedule` |
| 5 | Write `readOnActiveConsumerTask` to the result of `schedule` | |
| 6 | | Write `readOnActiveConsumerTask` to the result of `schedule` |

Then `schedule()` will be called twice and only one result will be
assigned to `readOnActiveConsumerTask`.

### Modifications

Follow the double-checked locking when `readOnActiveConsumerTask` is
null to ensure `readOnActiveConsumerTask` cannot be called concurrently.
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 15, 2024
@BewareMyPower BewareMyPower self-assigned this Mar 15, 2024
@BewareMyPower BewareMyPower added type/bug The PR fixed a bug or issue reported a bug area/broker labels Mar 15, 2024
lhotari
lhotari previously approved these changes Mar 15, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good catch @BewareMyPower . I wonder if the same bug pattern is in other locations in the Pulsar code base?

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BewareMyPower Please also check that possible concurrent calls on scheduleReadOnActiveConsumer method aren't causing issues.

@lhotari lhotari dismissed their stale review March 15, 2024 17:12

Need to check scheduleReadOnActiveConsumer method.

@BewareMyPower
Copy link
Contributor Author

@lhotari scheduleReadOnActiveConsumer is thread safe. The steps of scheduleReadOnActiveConsumer are:

  1. Call cancelPendingRead(), which might set havePendingRead with false. (The concurrent call of cancelPendingRead() does not matter because it does not set havePendingRead with true so that following logic will be skipped.)
  2. Skip the following logic if havePendingRead is true, which could only be caused by readMoreEntries.
  3. Call readMoreEntries() immediately or lazily according to the subscription type and the config.

Yes, there is a chance that readMoreEntries is called concurrently.

Steps Thread 1 Thread 2
1 Read havePendingRead: false
2 Write havePendingRead: true
3 Call readMoreEntries

However, it's still safe because in readMoreEntries, before reading messages through cursor, it checks havePendingRead again in the synchronized block.

synchronized (this) {
if (havePendingRead) {
if (log.isDebugEnabled()) {
log.debug("[{}] Skipping read for the topic, Due to we have pending read.", topic.getName());
}
return;
}

if the same bug pattern is in other locations in the Pulsar code base?

BTW, yes, the same bug pattern is use in many places.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job and awesome explanations @BewareMyPower

@lhotari lhotari merged commit 4e0c145 into apache:master Mar 16, 2024
60 of 61 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/dispatch-double-checked-lock branch March 16, 2024 07:03
@BewareMyPower BewareMyPower added this to the 3.3.0 milestone Mar 16, 2024
@BewareMyPower
Copy link
Contributor Author

@lhotari This PR is merged too quickly without a 2nd review. After revisiting the code, I found the double-checked locking is unnecessary.

scheduleReadOnActiveConsumer() is always called in pickAndScheduleActiveConsumer(), which is always called in addConsumer() and removeConsumer(). However, both of them are synchronized methods. So scheduleReadOnActiveConsumer could never be called concurrently.

The original motivation is the IDE warning:

image

It's true that it's a non-atomic operation on volatile field 'readOnActiveConsumerTask'. It's a commonly seen error to access a volatile field so IDE gives such warnings. However, we don't need to make scheduleReadOnActiveConsumer() thread safe. Unfortunately, there is no comment that notes this point. The lack of such comments makes code hard to maintain. Generally, developers assume all methods must be thread safe so the code might become unnecessarily complicated.

I opened another PR (#22285) that improves the code quality and reduces the unnecessary atomicity. That PR also reverts the changes of this PR. PTAL again.

@lhotari
Copy link
Member

lhotari commented Mar 18, 2024

@lhotari This PR is merged too quickly without a 2nd review. After revisiting the code, I found the double-checked locking is unnecessary.

@BewareMyPower We don't have a policy that there should be a 2nd review. This PR didn't contain any risky changes and wasn't harmful at all and that's why I merged. It's better to keep on moving instead of optimizing for perfect PRs. The resolution that you have done is a good one: opening another PR with follow-up changes.

@lhotari
Copy link
Member

lhotari commented Mar 18, 2024

Generally, developers assume all methods must be thread safe so the code might become unnecessarily complicated.

@BewareMyPower I agree that generalizing thread safety to a certain style in the Pulsar code base won't be very useful. However there are real problems where thread safety has been ignored and fixing the issues isn't prioritized. For example the topic level policies contain real problems, #21303.

lhotari pushed a commit that referenced this pull request Mar 27, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 15, 2024
…erTask in dispatcher (apache#22279)

(cherry picked from commit 4e0c145)
(cherry picked from commit e2070a8)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…erTask in dispatcher (apache#22279)

(cherry picked from commit 4e0c145)
(cherry picked from commit e2070a8)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…erTask in dispatcher (apache#22279)

(cherry picked from commit 4e0c145)
(cherry picked from commit e2070a8)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 19, 2024
…erTask in dispatcher (apache#22279)

(cherry picked from commit 4e0c145)
(cherry picked from commit e2070a8)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2024
…erTask in dispatcher (apache#22279)

(cherry picked from commit 4e0c145)
(cherry picked from commit e2070a8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants