Release the read lock while creating connections in refresh_connections
#191
Conversation
redis/src/cluster_async/mod.rs
Outdated
)
.await;
tasks.push(async move {
    let connections_container = inner.conn_lock.read().await;
Making a lock "public" is not a good idea. We should expose an atomic API, not the lock itself.
For example:
fn do_something(&self) -> Result<(), Box<dyn Error>> {
    let _lk = self.lock.write()?;
    // ...
    Ok(())
}
If the "something" is complex, we should add an API:
fn write_lock_and_do<F>(&self, callback: F) -> Result<(), Box<dyn Error>>
where
    F: Fn() -> Result<(), Box<dyn Error>>,
{
    let _lk = self.lock.write()?;
    callback()
}
This way we have full control over the lock and can avoid misusing it.
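For illustration, a call site could then look like this (do_refresh is a hypothetical helper, not code from this PR):

self.write_lock_and_do(|| {
    // The guard lives only for the duration of the callback, so callers
    // cannot accidentally hold the lock across an .await point.
    do_refresh()
})?;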
I think that's a good idea, let's do it in a separate PR
match result {
    (address, Ok(node)) => {
        let connections_container = inner.conn_lock.read().await;
        connections_container.replace_or_add_connection_for_address(address, node);
We should expose an API on inner for this function (replace_or_add_connection_for_address) and avoid exposing the lock here.
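For example, a thin delegate along these lines would keep conn_lock private (a sketch with assumed type names, not the actual API of this crate):

impl InnerCore {
    // Hypothetical wrapper: acquires the read lock internally so call sites
    // never touch conn_lock directly.
    async fn replace_or_add_connection_for_address(&self, address: String, node: AsyncClusterNode) {
        let connections_container = self.conn_lock.read().await;
        connections_container.replace_or_add_connection_for_address(address, node);
    }
}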
same as above
redis/src/cluster_async/mod.rs
Outdated
            );
        }
    }
}
info!("refresh connections completed");
Is this something that happens often? If it does, please move this to debug!
Changed.
PR Description:
Main Changes:
Lock Management Improvement:
In the previous implementation, the read lock (inner.conn_lock.read()) was held throughout the entire connection refresh process (for all connections sent for refresh), including while attempting to establish connections (via get_or_create_conn). If connections were slow or timed out, the lock was held for an extended duration, blocking other tasks that required a write lock.

The new implementation releases the read lock before making connection attempts. If a connection is successfully established, the read lock is reacquired to update the connection container. This ensures that other operations needing the lock (e.g., write operations) can proceed while connections are being established, as sketched below.
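A minimal sketch of the new flow (simplified; connection_for_address and the exact signatures here are assumptions for illustration, not the verbatim code of this PR):

async fn refresh_connection(inner: Arc<InnerCore>, address: String) {
    // Hold the read lock only long enough to look up the existing connection.
    let old_conn = {
        let connections_container = inner.conn_lock.read().await;
        connections_container.connection_for_address(&address)
    }; // read guard dropped here, before any network I/O

    // The slow part (connecting, handshakes, timeouts) now runs without the
    // lock, so tasks waiting for the write lock are no longer blocked by it.
    match get_or_create_conn(&address, old_conn).await {
        Ok(node) => {
            // Reacquire the read lock briefly to publish the new connection.
            let connections_container = inner.conn_lock.read().await;
            connections_container.replace_or_add_connection_for_address(address, node);
        }
        Err(err) => warn!("Failed to refresh connection to {}: {:?}", address, err),
    }
}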
Unclear Deadlock Behavior:
A deadlock scenario was observed while testing the update_slotmap_moved branch (on amazon-contributing/redis-rs) during failover testing. The root cause of the deadlock remains unclear. The branch introduces changes that attempt to acquire a write lock on the connection container, which leads to the issue. However, even after removing the content of the update_upon_moved function (leaving only the lock acquisition), the deadlock persisted, suggesting that the problem isn't directly tied to the logic in the function itself.

It seems there is an unusual race condition causing the lock to enter an undefined state in which neither readers nor writers are able to acquire it. This lock state leads to the deadlock, with every task attempting to use the lock getting blocked.
The issue arose in the following situation:

1. refresh_connections is triggered and acquires the read lock, while get_or_create_conn is waiting for a connection attempt to complete.
2. update_upon_moved tries to acquire the write lock but is blocked, since the read lock is held by refresh_connections.
3. refresh_connections fails with a Connection refused (os error 111) and exits, but the lock is not properly released.

Important: It is unclear why this "deadlock" occurs and why the lock isn't released after the function exits. Despite attempts to explicitly drop the lock right before the function returns, the issue persisted. However, with the new lock-release-before-connection strategy, the problem no longer appears.
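To make the contention concrete: holding a tokio::sync::RwLock read guard across a long .await blocks every writer for that whole time. The following self-contained sketch (illustrative only, not code from this PR) reproduces the shape of the problem:

use std::{sync::Arc, time::Duration};
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    // Plays the role of refresh_connections: holds the read guard while
    // awaiting slow I/O (simulated here with a sleep).
    let reader = {
        let lock = Arc::clone(&lock);
        tokio::spawn(async move {
            let _guard = lock.read().await;
            tokio::time::sleep(Duration::from_secs(5)).await;
        })
    };

    // Plays the role of update_upon_moved: this write cannot make progress
    // until the reader above drops its guard.
    tokio::time::sleep(Duration::from_millis(100)).await;
    let mut slots = lock.write().await;
    *slots += 1;

    reader.await.unwrap();
}

Note that tokio's RwLock queues lock requests fairly, so once a writer is waiting, readers that arrive after it also block behind it; that can make the contention above look like every lock operation is stuck at once, matching the observed state where neither reads nor writes succeed.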
Testing:
This issue and change were tested by simulating node failovers on the update_slotmap_moved branch, verifying that the client successfully recovers without getting stuck, allowing the system to quickly find the promoted replica and maintain operations.

We still need to investigate the root cause of the lock issue (it looks like a tokio bug?), but this change resolves the deadlock and improves lock management.
Deadlock Test Logs: