forked from apache/pulsar-client-cpp
Test Windows debug build #13
Closed
Conversation
BewareMyPower added a commit that referenced this pull request on Sep 18, 2023
### Motivation

We observed several null `ClientConnection` accesses in the test environment. See the `this=0x0` outputs in the following two typical stacks.

```
#8 bytesWritten (this=0xb8, size=371) at lib/SharedBuffer.h:166
#9 pulsar::ClientConnection::handleRead (this=0x0, err=..., bytesTransferred=371, minReadSize=4) at lib/ClientConnection.cc:609
```

```
#12 0x00007f33202933d2 in unique_lock (__m=..., this=0x7f3311c82800) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_mutex.h:197
#13 pulsar::ClientConnection::sendPendingCommands (this=0x0) at lib/ClientConnection.cc:1071
#14 0x00007f3320293d2d in pulsar::ClientConnection::handleSendPair (this=0x0, err=...) at lib/ClientConnection.cc:1066
```

Though `shared_from_this()` is always passed to the `std::bind` call, the pointer is still null by the time the `ClientConnection` method is invoked.

### Modifications

First, replace all `std::bind` calls with lambda expressions that capture a `std::weak_ptr<ClientConnection>` and perform an explicit null check on the value returned by its `lock()` method. Since the asio callbacks no longer hold a `shared_ptr`, the owner of the `ClientConnection` object becomes the `ConnectionPool`: the pool maintains the connections, while the asio callbacks use the `weak_ptr` to test whether the connection is still present.

Second, make `ClientConnection::getConnection` return a `shared_ptr` rather than a `weak_ptr` so that the caller does not need to check whether `lock()` returns null in the callback of this future. We cannot make `ConnectionPool::getConnectionAsync` return a `shared_ptr` because it could return the future of `connectPromise_`, which is held by the `ClientConnection` itself, and we should avoid holding a `shared_ptr` to a `ClientConnection` since its owner is the `ConnectionPool`.
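To make the capture pattern concrete, here is a minimal, self-contained sketch of the `weak_ptr`-plus-`lock()` idiom described above. `Connection` and the callback signature are illustrative stand-ins, not the actual `pulsar::ClientConnection` or asio APIs:

```c++
#include <functional>
#include <iostream>
#include <memory>

// Illustrative stand-in for ClientConnection; not the real class.
class Connection : public std::enable_shared_from_this<Connection> {
   public:
    void handleRead(int bytesTransferred) {
        std::cout << "read " << bytesTransferred << " bytes\n";
    }
};

int main() {
    auto conn = std::make_shared<Connection>();

    // Before the fix: std::bind(&Connection::handleRead, conn->shared_from_this(), ...)
    // kept a shared_ptr alive inside every pending callback, so the pool was
    // never the sole owner of the connection.

    // After the fix: capture a weak_ptr and null-check lock() before use.
    std::weak_ptr<Connection> weakConn = conn;
    auto callback = [weakConn](int bytesTransferred) {
        if (auto self = weakConn.lock()) {
            self->handleRead(bytesTransferred);
        }  // else: the pool has already dropped the connection; do nothing
    };

    callback(371);  // connection alive: handleRead runs
    conn.reset();   // the pool releases its ownership
    callback(371);  // now a safe no-op instead of a call on a null `this`
}
```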
BewareMyPower added a commit that referenced this pull request on Sep 19, 2023
BewareMyPower added a commit that referenced this pull request on Sep 22, 2023
BewareMyPower added a commit that referenced this pull request on Oct 26, 2023
### Motivation

There is a case where deadlock can happen on a `Future`. Assume there is a `Promise` and its `Future`.

1. Call `Future::addListener` to add a listener that tries to acquire a user-provided mutex (`lock`).
2. Thread 1: acquires `lock` first.
3. Thread 2: calls `Promise::setValue`; the listener is triggered before the future is marked completed. Since `lock` is held by Thread 1, the listener blocks.
4. Thread 1: calls `Future::addListener`; since it sees `InternalState::completed_` is true, it calls `get` to retrieve the result and value.

Then deadlock happens:
- Thread 1 waits for `lock` to be released before it can complete `InternalState::future_`.
- Thread 2 holds `lock` but waits for `InternalState::future_` to be completed.

In a real-world case, if we acquire a lock before `ProducerImpl::closeAsync` while another thread calls `setValue` in `ClientConnection::handleSuccess` and the callback of `createProducerAsync` tries to acquire the lock, `handleSuccess` will be blocked. Then, in `closeAsync`, the current thread blocks in:

```c++
cnx->sendRequestWithId(Commands::newCloseProducer(producerId_, requestId), requestId)
    .addListener([self, callback](Result result, const ResponseData&) { callback(result); });
```

The stacks:

```
Thread 1:
#11 0x00007fab80da2173 in pulsar::InternalState<...>::complete (this=0x3d53e7a10, result=..., value=...) at lib/Futre.h:61
#13 pulsar::ClientConnection::handleSuccess (this=this@entry=0x2214bc000, success=...) at lib/ClientConnection.cc:1552

Thread 2:
#8 get (result=..., this=0x3d53e7a10) at lib/Future.h:69
#9 pulsar::InternalState<...>::addListener (this=this@entry=0x3d53e7a10, listener=...) at lib/Future.h:51
#11 0x00007fab80e8dc4e in pulsar::ProducerImpl::closeAsync at lib/ProducerImpl.cc:794
```

Two points combine to cause the deadlock:

1. We use `completed_` to represent whether the future is completed. However, after it becomes true, the future might not actually be completed, because the value has not been set and the listeners have not finished.
2. If `addListener` is called after the future is completed, we still push the listener to `listeners_` so that previous listeners run before the new listener. This guarantee is unnecessarily strong.

### Modifications

First, complete the future before calling the listeners. Then, use an enum to represent the status:

- INITIAL: `complete` has not been called.
- COMPLETING: when `complete` is called for the first time, the status changes from INITIAL to COMPLETING.
- COMPLETED: the future is completed.

Besides, the implementation of `Future` is simplified. apache#299 fixed a possible mutex crash by introducing a `std::future`. However, the root cause is that the condition variable was not used correctly:

> Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.

See https://en.cppreference.com/w/cpp/thread/condition_variable

The simplest way to fix apache#298 is simply to add `lock.lock()` before `state->condition.notify_all();`.
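Below is a minimal sketch of the three-state completion protocol described above, assuming a simplified `InternalState` that holds only a value and its listeners; the names and member layout are illustrative, not the actual `lib/Future.h` implementation:

```c++
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <utility>
#include <vector>

enum class Status { INITIAL, COMPLETING, COMPLETED };

template <typename T>
class InternalState {
   public:
    bool complete(const T& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (status_ != Status::INITIAL) {
            return false;  // only the first complete() call wins
        }
        status_ = Status::COMPLETING;  // the value is being set
        value_ = value;
        status_ = Status::COMPLETED;
        // Notify while still owning the mutex so the status change is
        // correctly published to any thread blocked in a wait().
        cond_.notify_all();
        auto listeners = std::move(listeners_);
        lock.unlock();
        // Listeners run only after the future is completed, so a listener
        // blocked on a user-provided lock can no longer delay completion.
        for (auto& listener : listeners) {
            listener(value_);
        }
        return true;
    }

    void addListener(std::function<void(const T&)> listener) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (status_ == Status::COMPLETED) {
            lock.unlock();
            listener(value_);  // already completed: run immediately, no queueing
            return;
        }
        listeners_.emplace_back(std::move(listener));
    }

   private:
    std::mutex mutex_;
    std::condition_variable cond_;
    Status status_ = Status::INITIAL;
    T value_{};
    std::vector<std::function<void(const T&)>> listeners_;
};

int main() {
    InternalState<int> state;
    state.addListener([](const int& v) { std::cout << "listener before: " << v << "\n"; });
    state.complete(42);
    // Added after completion: runs immediately instead of being queued,
    // which removes the unnecessarily strong ordering guarantee.
    state.addListener([](const int& v) { std::cout << "listener after: " << v << "\n"; });
}
```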
BewareMyPower added a commit that referenced this pull request on Oct 31, 2023
Fix possible deadlock of Future when adding a listener after completed (apache#334)

* Fix possible deadlock of Future when adding a listener after completed
* Acquire lock again
* Add initial value