Fix the buggy Future and Promise implementations #299

Merged

BewareMyPower
Contributor

@BewareMyPower BewareMyPower commented Jul 3, 2023

Fixes #298

Motivation

Currently, Future and Promise are implemented manually by managing condition variables. However, the condition variable sometimes behaves incorrectly on macOS, while the existing future and promise from the C++ standard library work well.

Modifications

Redesign Future and Promise based on the utilities in the standard <future> header. In addition, fix the possible race condition when addListener is called after setValue or setFailed:

  • Thread 1: call setValue, swap out the existing listeners and call them one by one outside the lock.
  • Thread 2: call addListener, detect that complete_ is true and call the listener directly.

As a result, the previous listeners and the new listener can be called concurrently in threads 1 and 2.
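
To make the redesign on top of `<future>` concrete, here is a minimal sketch, not the code in lib/Future.h. It assumes a hypothetical `SimpleState` class that stores a result/value pair in a `std::promise` and reads it back through a `std::shared_future`; all names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <memory>
#include <string>
#include <thread>
#include <utility>

// SimpleState is a hypothetical name used for illustration only.
template <typename Result, typename Type>
class SimpleState {
   public:
    // Completes the state once; later calls return false and change nothing.
    bool complete(Result result, const Type& value) {
        bool expected = false;
        if (!completed_.compare_exchange_strong(expected, true)) {
            return false;
        }
        promise_.set_value(std::make_pair(result, value));
        return true;
    }

    // Blocks until complete() has stored a value, then copies it out.
    Result get(Type& value) const {
        const auto& pair = future_.get();
        value = pair.second;
        return pair.first;
    }

   private:
    std::atomic_bool completed_{false};
    std::promise<std::pair<Result, Type>> promise_;
    // Declared after promise_ so that promise_ is constructed first.
    std::shared_future<std::pair<Result, Type>> future_{promise_.get_future()};
};

// Completes the state from a worker thread while the caller blocks in get().
bool runDemo() {
    auto state = std::make_shared<SimpleState<int, std::string>>();
    std::thread worker([state] { state->complete(0, "done"); });

    std::string value;
    const int result = state->get(value);
    worker.join();

    const bool completedAgain = state->complete(1, "ignored");
    assert(!completedAgain);
    return result == 0 && value == "done" && !completedAgain;
}
```

The waiting and wake-up logic is delegated entirely to the standard library here, so no hand-written condition variable is involved.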

Verifications

Ran the reproduction code from #298 10 times; it never failed or hung.

Documentation

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@BewareMyPower BewareMyPower added the bug Something isn't working label Jul 3, 2023
@BewareMyPower BewareMyPower added this to the 3.3.0 milestone Jul 3, 2023
@BewareMyPower BewareMyPower self-assigned this Jul 3, 2023
@BewareMyPower BewareMyPower marked this pull request as draft July 4, 2023 02:49
Fixes apache#298

### Motivation

Currently, `Future` and `Promise` are implemented manually by
managing condition variables. However, the condition variable
sometimes behaves incorrectly on macOS, while the existing `future`
and `promise` from the C++ standard library work well.

### Modifications

Redesign `Future` and `Promise` based on the utilities in the standard
`<future>` header. In addition, fix the possible race condition when
`addListener` is called after `setValue` or `setFailed`:
- Thread 1: call `setValue`, swap out the existing listeners and call
  them one by one outside the lock.
- Thread 2: call `addListener`, detect that `complete_` is true and call
  the listener directly.

As a result, the previous listeners and the new listener can be called
concurrently in threads 1 and 2.

This patch fixes the problem by adding a future that waits until all
listeners added before completion are done.
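
One way to picture this fix is the following sketch. It is an assumption about the general shape, not the actual patch: a hypothetical `ListenerState` class completes its internal future only after the earlier listeners have returned, so a listener added after completion waits on that future before it runs.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// ListenerState is a hypothetical name used for illustration only.
template <typename Result, typename Type>
class ListenerState {
   public:
    using Listener = std::function<void(Result, const Type&)>;

    void addListener(Listener listener) {
        std::unique_lock<std::mutex> lock{mutex_};
        if (!completed_) {
            listeners_.push_back(std::move(listener));
            return;
        }
        lock.unlock();
        // Already completed: future_ becomes ready only after the earlier
        // listeners have returned, so this wait keeps the two groups apart.
        const auto& pair = future_.get();
        listener(pair.first, pair.second);
    }

    bool complete(Result result, const Type& value) {
        std::unique_lock<std::mutex> lock{mutex_};
        if (completed_) {
            return false;
        }
        completed_ = true;
        std::vector<Listener> listeners;
        listeners.swap(listeners_);
        lock.unlock();

        for (auto& listener : listeners) {
            listener(result, value);
        }
        // Set the value last, which releases any waiting addListener call.
        promise_.set_value(std::make_pair(result, value));
        return true;
    }

   private:
    std::mutex mutex_;
    bool completed_{false};
    std::vector<Listener> listeners_;
    std::promise<std::pair<Result, Type>> promise_;
    std::shared_future<std::pair<Result, Type>> future_{promise_.get_future()};
};

// Returns the order in which the two listeners ran.
std::vector<std::string> runDemo() {
    std::vector<std::string> order;
    ListenerState<int, std::string> state;
    state.addListener([&order](int, const std::string&) { order.push_back("before"); });
    const bool completed = state.complete(0, "value");
    assert(completed);
    state.addListener([&order](int, const std::string&) { order.push_back("after"); });
    return order;
}
```

Note that the blocking `get` inside `addListener` means a caller can stall for as long as an earlier listener takes to return.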

### Verifications

Ran the reproduction code from apache#298 10 times; it never failed or
hung.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-macos-future-wait branch from c2453ef to 00473ec Compare July 4, 2023 07:50
@BewareMyPower BewareMyPower marked this pull request as ready for review July 4, 2023 07:50
@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-macos-future-wait branch from 96f4e20 to 3d16f50 Compare July 4, 2023 08:52
@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-macos-future-wait branch from 3d16f50 to 8ff31fe Compare July 4, 2023 08:59
Member

@Demogorgon314 Demogorgon314 left a comment

Great work! After this change, the reproduction code never failed or hung.

@BewareMyPower
Contributor Author

@RobertIndie PTAL again.

Member

@RobertIndie RobertIndie left a comment

LGTM!

Co-authored-by: Zike Yang <[email protected]>
@BewareMyPower BewareMyPower merged commit 20f6fa0 into apache:main Jul 5, 2023
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-macos-future-wait branch July 5, 2023 06:49
BewareMyPower added a commit to BewareMyPower/pulsar-client-cpp that referenced this pull request Oct 26, 2023
### Motivation

A deadlock can happen with a `Future` in the following case. Assume
there is a `Promise` and its `Future`.

1. Call `Future::addListener` to add a listener that tries to acquire a
   user-provided mutex (`lock`).
2. Thread 1: Acquire `lock` first.
3. Thread 2: Call `Promise::setValue`. The listener is triggered before
   the future is completed. Since `lock` is held by Thread 1, the
   listener blocks.
4. Thread 1: Call `Future::addListener`. Since it detects that
   `InternalState::completed_` is true, it calls `get` to retrieve
   the result and value.

Then, the deadlock happens:
- Thread 2 waits for `lock` to be released before it can complete
  `InternalState::future_`.
- Thread 1 holds `lock` but waits for `InternalState::future_` to be
  completed.

In a real-world case, if we acquire a lock before calling
`ProducerImpl::closeAsync`, and another thread then calls `setValue` in
`ClientConnection::handleSuccess` while the callback of
`createProducerAsync` tries to acquire the same lock, `handleSuccess` is
blocked. Then, in `closeAsync`, the current thread is blocked at:

```c++
    cnx->sendRequestWithId(Commands::newCloseProducer(producerId_, requestId), requestId)
        .addListener([self, callback](Result result, const ResponseData&) { callback(result); });
```

The stacks (the thread numbers below are swapped relative to the steps above):

```
Thread 1:
#11 0x00007fab80da2173 in pulsar::InternalState<...>::complete (this=0x3d53e7a10, result=..., value=...) at lib/Future.h:61
#13 pulsar::ClientConnection::handleSuccess (this=this@entry=0x2214bc000, success=...) at lib/ClientConnection.cc:1552

Thread 2:
#8  get (result=..., this=0x3d53e7a10) at lib/Future.h:69
#9  pulsar::InternalState<...>::addListener (this=this@entry=0x3d53e7a10, listener=...) at lib/Future.h:51
#11 0x00007fab80e8dc4e in pulsar::ProducerImpl::closeAsync at lib/ProducerImpl.cc:794
```

Two things cause the deadlock:
1. We use `completed_` to represent whether the future is completed.
   However, after it becomes true, the future might not be completed yet
   because the value is not set and the listeners have not finished.
2. If `addListener` is called after completion, we still push the
   listener to `listeners_` so that previous listeners are executed
   before the new listener. This guarantee is unnecessarily strong.

### Modifications

First, complete the future before calling the listeners.

Then, use an enum to represent the status:
- INITIAL: `complete` has not been called.
- COMPLETING: the first time `complete` is called, the status changes
  from INITIAL to COMPLETING.
- COMPLETED: the future is completed.
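
The two changes above can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the merged implementation: `StatusState`, `Status`, and the member names are hypothetical. The value is stored and the status set to COMPLETED before any listener runs, so a listener added afterwards never has to wait.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// StatusState and its members are hypothetical names for illustration only.
template <typename Result, typename Type>
class StatusState {
   public:
    using Listener = std::function<void(Result, const Type&)>;
    enum class Status : int { INITIAL, COMPLETING, COMPLETED };

    void addListener(Listener listener) {
        std::unique_lock<std::mutex> lock{mutex_};
        if (status_.load() == Status::COMPLETED) {
            // The value is already stored, so no waiting is needed here.
            const Result result = result_;
            const Type value = value_;
            lock.unlock();
            listener(result, value);
        } else {
            listeners_.push_back(std::move(listener));
        }
    }

    bool complete(Result result, const Type& value) {
        Status expected = Status::INITIAL;
        if (!status_.compare_exchange_strong(expected, Status::COMPLETING)) {
            return false;  // Another call already started completing.
        }
        std::unique_lock<std::mutex> lock{mutex_};
        result_ = result;
        value_ = value;
        status_ = Status::COMPLETED;  // Complete the future first...
        condition_.notify_all();
        std::vector<Listener> listeners;
        listeners.swap(listeners_);
        lock.unlock();
        for (auto& listener : listeners) {  // ...then call the listeners.
            listener(result, value);
        }
        return true;
    }

    // Blocks until the status is COMPLETED, then copies the value out.
    Result get(Type& value) {
        std::unique_lock<std::mutex> lock{mutex_};
        condition_.wait(lock, [this] { return status_.load() == Status::COMPLETED; });
        value = value_;
        return result_;
    }

   private:
    std::mutex mutex_;
    std::condition_variable condition_;
    std::atomic<Status> status_{Status::INITIAL};
    Result result_{};
    Type value_{};
    std::vector<Listener> listeners_;
};

// A listener that adds another listener to the same state while it runs.
std::vector<std::string> runDemo() {
    std::vector<std::string> order;
    StatusState<int, std::string> state;
    state.addListener([&](int, const std::string&) {
        order.push_back("first");
        // Runs immediately because the status is already COMPLETED.
        state.addListener([&](int, const std::string& v) { order.push_back(v); });
    });
    const bool completed = state.complete(0, "second");
    assert(completed);
    return order;
}
```

In this sketch a listener registered after completion is called right away with the stored value, without waiting for earlier listeners, which matches dropping the ordering guarantee described as unnecessarily strong.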

In addition, the implementation of `Future` is simplified.
apache#299 fixes a possible
mutex crash by introducing `std::future`. However, the root cause is
that the condition variable was not used correctly:

> Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.

See https://en.cppreference.com/w/cpp/thread/condition_variable

The simplest way to fix
apache#298 is to add
`lock.lock()` before `state->condition.notify_all();`.
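
As a rough illustration of the rule quoted from cppreference, the sketch below modifies the shared flag and notifies while holding the mutex. `WaitState`, `notifyComplete`, and `waitComplete` are hypothetical names and this is not the pre-#299 code.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// WaitState is a hypothetical stand-in for a hand-written future state.
struct WaitState {
    std::mutex mutex;
    std::condition_variable condition;
    std::atomic_bool complete{false};
};

// Sets the flag and notifies while the mutex is held, so a waiter cannot
// check the predicate and go to sleep between the store and the notify.
void notifyComplete(WaitState& state) {
    std::lock_guard<std::mutex> lock{state.mutex};
    state.complete = true;
    state.condition.notify_all();
}

// Blocks until the flag is set; the predicate also handles spurious wake-ups.
void waitComplete(WaitState& state) {
    std::unique_lock<std::mutex> lock{state.mutex};
    state.condition.wait(lock, [&state] { return state.complete.load(); });
}

// Starts a waiter thread, completes the state, and reports what it observed.
bool runDemo() {
    WaitState state;
    bool observed = false;
    std::thread waiter([&] {
        waitComplete(state);
        observed = state.complete.load();
    });
    notifyComplete(state);
    waiter.join();
    assert(observed);
    return observed;
}
```

Notifying while still holding the lock mirrors the `lock.lock()` suggestion above; notifying after releasing the lock is also valid as long as the flag itself was changed under the mutex.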
merlimat pushed a commit that referenced this pull request Oct 30, 2023
#334)

* Fix possible deadlock of Future when adding a listener after completed

* Acquire lock again

* Add initial value
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[Bug] Synchronous send might stuck or crash when sending many messages
4 participants