Release the read lock while creating connections in refresh_connections
#191
Conversation
redis/src/cluster_async/mod.rs
Outdated
)
.await;
tasks.push(async move {
    let connections_container = inner.conn_lock.read().await;
Making a lock "public" is not a good idea. We should expose an atomic API, not the lock itself.
For example:
fn do_something(&self) -> Result<(), Box<dyn Error>> {
    let _lk = self.lock.write()?;
    // ...
    Ok(())
}
If the "something" is complex, we should add an API:
fn write_lock_and_do<F>(&self, callback: F) -> Result<(), Box<dyn Error>>
where
    F: Fn() -> Result<(), Box<dyn Error>>,
{
    let _lk = self.lock.write()?;
    callback()
}
This way we have full control over the lock and can avoid misusing it.
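For illustration, a call site could then look like this (do_refresh is a hypothetical helper, not code from this PR):

self.write_lock_and_do(|| {
    // The guard lives only for the duration of the callback, so callers
    // cannot accidentally hold the lock across an .await point.
    do_refresh()
})?;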
I think that's a good idea, let's do it in a separate PR
match result {
    (address, Ok(node)) => {
        let connections_container = inner.conn_lock.read().await;
        connections_container.replace_or_add_connection_for_address(address, node);
We should expose an API on inner for this function (replace_or_add_connection_for_address) and avoid exposing the lock here.
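For example, a thin delegate along these lines would keep conn_lock private (a sketch with assumed type names, not the actual API of this crate):

impl InnerCore {
    // Hypothetical wrapper: acquires the read lock internally so call sites
    // never touch conn_lock directly.
    async fn replace_or_add_connection_for_address(&self, address: String, node: AsyncClusterNode) {
        let connections_container = self.conn_lock.read().await;
        connections_container.replace_or_add_connection_for_address(address, node);
    }
}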
same as above
redis/src/cluster_async/mod.rs
Outdated
            );
        }
    }
}
info!("refresh connections completed");
Is this something that happens often? If it does, please move this to debug!
Changed.
PR Description:
Main Changes:
Lock Management Improvement:
In the previous implementation, the read lock (inner.conn_lock.read()) was held throughout the entire connection refresh process (for all connections sent for refresh), including while attempting to establish connections (via get_or_create_conn). If connections were slow or timed out, the lock was held for an extended duration, blocking other tasks that required a write lock.

The new implementation releases the read lock before making connection attempts. If a connection is successfully established, the read lock is reacquired to update the connection container. This ensures that other operations needing the lock (e.g., write operations) can proceed while connections are being established, as sketched below.
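A minimal sketch of the new flow (simplified; connection_for_address and the exact signatures here are assumptions for illustration, not the verbatim code of this PR):

async fn refresh_connection(inner: Arc<InnerCore>, address: String) {
    // Hold the read lock only long enough to look up the existing connection.
    let old_conn = {
        let connections_container = inner.conn_lock.read().await;
        connections_container.connection_for_address(&address)
    }; // read guard dropped here, before any network I/O

    // The slow part (connecting, handshakes, timeouts) now runs without the
    // lock, so tasks waiting for the write lock are no longer blocked by it.
    match get_or_create_conn(&address, old_conn).await {
        Ok(node) => {
            // Reacquire the read lock briefly to publish the new connection.
            let connections_container = inner.conn_lock.read().await;
            connections_container.replace_or_add_connection_for_address(address, node);
        }
        Err(err) => warn!("Failed to refresh connection to {}: {:?}", address, err),
    }
}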
Unclear Deadlock Behavior:
A deadlock scenario was observed while testing the update_slotmap_moved branch (on amazon-contributing/redis-rs) during failover testing. The root cause of the deadlock remains unclear. The branch introduces changes that attempt to acquire a write lock on the connection container, which leads to the issue. However, even after removing the content of the update_upon_moved function (leaving only the lock acquisition), the deadlock persisted, suggesting that the problem isn't directly tied to the logic in the function itself.

It seems there is an unusual race condition causing the lock to enter an undefined state in which neither readers nor writers are able to acquire it. This lock state leads to the deadlock, with every task attempting to use the lock getting blocked.
The issue arose in the following situation:

1. refresh_connections is triggered and acquires the read lock, while get_or_create_conn is waiting for a connection attempt to complete.
2. update_upon_moved tries to acquire the write lock but is blocked, since the read lock is held by refresh_connections.
3. refresh_connections fails with a Connection refused (os error 111) and exits, but the lock is not properly released.

Important: It is unclear why this "deadlock" occurs and why the lock isn't released after the function exits. Despite attempts to explicitly drop the lock right before the function returns, the issue persisted. However, with the new lock-release-before-connection strategy, the problem no longer appears.
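To make the contention concrete: holding a tokio::sync::RwLock read guard across a long .await blocks every writer for that whole time. The following self-contained sketch (illustrative only, not code from this PR) reproduces the shape of the problem:

use std::{sync::Arc, time::Duration};
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    // Plays the role of refresh_connections: holds the read guard while
    // awaiting slow I/O (simulated here with a sleep).
    let reader = {
        let lock = Arc::clone(&lock);
        tokio::spawn(async move {
            let _guard = lock.read().await;
            tokio::time::sleep(Duration::from_secs(5)).await;
        })
    };

    // Plays the role of update_upon_moved: this write cannot make progress
    // until the reader above drops its guard.
    tokio::time::sleep(Duration::from_millis(100)).await;
    let mut slots = lock.write().await;
    *slots += 1;

    reader.await.unwrap();
}

Note that tokio's RwLock queues lock requests fairly, so once a writer is waiting, readers that arrive after it also block behind it; that can make the contention above look like every lock operation is stuck at once, matching the observed state where neither reads nor writes succeed.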
Testing:
This issue and change were tested by simulating node failovers on the update_slotmap_moved branch, verifying that the client successfully recovers without getting stuck, allowing the system to quickly find the promoted replica and maintain operations.

We still need to investigate the root cause of the lock issue (it looks like a tokio bug?), but this change resolves the deadlock and improves lock management.
Deadlock Test Logs: