Blockstore should drop signals before validator exit #24025
Co-authored-by: Trent Nelson <[email protected]>
ledger/src/blockstore.rs
@@ -444,8 +444,8 @@ impl Blockstore {
block_height_cf,
program_costs_cf,
bank_hash_cf,
new_shreds_signals: vec![],
completed_slots_senders: vec![],
new_shreds_signals: Mutex::<Vec<Sender<bool>>>::default(),
I think you can say Mutex::default() here.
Is this solving the test hang? I don't recall the context.
Codecov Report
@@ Coverage Diff @@
## master #24025 +/- ##
=======================================
Coverage 81.7% 81.7%
=======================================
Files 589 589
Lines 160603 160622 +19
=======================================
+ Hits 131236 131256 +20
+ Misses 29367 29366 -1
Yes. This one is a follow-up to #24007. In #24007, we fixed the hang by using a nonblocking call inside RpcCompletedSlotService. This change implements the idea suggested by @sakridge: when shutting down, we drop all the signal senders in blockstore, so that all receivers, whether blocking or nonblocking, will return on receive. In this case, I don't think choosing RwLock over Mutex will make much difference. The vec inside the lock is only written twice: once at initialization, and once when it is cleared at shutdown. And there is only one thread reading it inside blockstore. Let me add @carllin to see if there are any other similar patterns in the validator.
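A minimal sketch of the idea described above, not the actual Blockstore code. It assumes std::sync::mpsc channels in place of whatever channel type the real code uses, and the names SignalHub, add_new_shred_signal, notify, drop_signals and shutdown_demo are hypothetical. It shows that clearing the vec of senders wakes a blocked receiver even while another Arc to the owner is still alive.

```rust
use std::sync::mpsc::{channel, RecvError, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the part of Blockstore that holds signal senders.
pub struct SignalHub {
    // Interior mutability lets us clear the senders through a shared reference.
    new_shreds_signals: Mutex<Vec<Sender<bool>>>,
}

impl SignalHub {
    pub fn new() -> Self {
        Self {
            // The compiler infers the full Mutex<Vec<Sender<bool>>> type here.
            new_shreds_signals: Mutex::default(),
        }
    }

    pub fn add_new_shred_signal(&self, sender: Sender<bool>) {
        self.new_shreds_signals.lock().unwrap().push(sender);
    }

    pub fn notify(&self) {
        for sender in self.new_shreds_signals.lock().unwrap().iter() {
            // Ignore send errors: a receiver may already be gone.
            let _ = sender.send(true);
        }
    }

    /// Drop every sender so that receivers observe a disconnected channel.
    pub fn drop_signals(&self) {
        self.new_shreds_signals.lock().unwrap().clear();
    }
}

/// Returns the results of two blocking receives made by a consumer thread.
pub fn shutdown_demo() -> (Result<bool, RecvError>, Result<bool, RecvError>) {
    let hub = Arc::new(SignalHub::new());
    let (tx, rx) = channel();
    hub.add_new_shred_signal(tx);

    // A lingering clone: the hub itself is NOT dropped during this demo.
    let _lingering = Arc::clone(&hub);

    let consumer = thread::spawn(move || {
        let first = rx.recv(); // receives the notification
        let second = rx.recv(); // would block forever if the sender stayed alive
        (first, second)
    });

    hub.notify();
    hub.drop_signals(); // simulates the shutdown path
    consumer.join().unwrap()
}
```

Calling shutdown_demo() should yield Ok(true) for the first receive and an error for the second, since the only sender is dropped by drop_signals() rather than by the hub going out of scope.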
Hmm, is this going to fix anything? I feel like the validator The blockstore signals are created here: Lines 495 to 496 in 51b37f0
The consumers are:
Thus it seems like setting the Line 1006 in c6ef522
@carllin This one can be thought of as more of a refactor and a safeguard.
Also, while refactoring this, I noticed another potential issue: #24051
@HaoranYi understood, thanks for explaining!
* timeout for validator exits
* clippy
* print backtrace when panic
* add backtrace package
* increase time out to 30s
* debug logging
* make rpc complete service non blocking
* reduce log level
* remove logging
* recv_timeout
* remove backtrace
* remove sleep
* wip
* remove unused variable
* add comments
* Update core/src/validator.rs
  Co-authored-by: Trent Nelson <[email protected]>
* Update core/src/validator.rs
  Co-authored-by: Trent Nelson <[email protected]>
* whitespace
* more whitespace
* fix build
* clean up import
* add mutex for signal senders in blockstore
* remove mut
* refactor: extract add signal functions
* make blockstore signal private
* let compiler infer mutex type

Co-authored-by: Trent Nelson <[email protected]>
(cherry picked from commit 6ba4e87)
# Conflicts:
# ledger/src/blockstore.rs
#25326)
* Blockstore should drop signals before validator exit (#24025)
  * timeout for validator exits
  * clippy
  * print backtrace when panic
  * add backtrace package
  * increase time out to 30s
  * debug logging
  * make rpc complete service non blocking
  * reduce log level
  * remove logging
  * recv_timeout
  * remove backtrace
  * remove sleep
  * wip
  * remove unused variable
  * add comments
  * Update core/src/validator.rs
    Co-authored-by: Trent Nelson <[email protected]>
  * Update core/src/validator.rs
    Co-authored-by: Trent Nelson <[email protected]>
  * whitespace
  * more whitespace
  * fix build
  * clean up import
  * add mutex for signal senders in blockstore
  * remove mut
  * refactor: extract add signal functions
  * make blockstore signal private
  * let compiler infer mutex type

  Co-authored-by: Trent Nelson <[email protected]>
  (cherry picked from commit 6ba4e87)
  # Conflicts:
  # ledger/src/blockstore.rs
* fix conflicts

Co-authored-by: HaoranYi <[email protected]>
Problem
Blockstore holds two sets of signal senders, and it is shared through Arc in many places, including RpcSubscription. During shutdown, a deadlock may occur when a signal receiver is using a blocking call and the blockstore ref count never goes to 0: the senders are then never dropped, so the blocked receiver never returns.
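The problem can be reproduced in miniature. The following is only an illustrative sketch under stated assumptions: it uses std::sync::mpsc, a hypothetical Store struct standing in for Blockstore, and recv_timeout instead of a truly blocking recv so that the demo terminates. While any Arc clone of the owner is alive, the sender inside it is alive too, and the receiver only ever times out.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical stand-in: owns a sender the way Blockstore owns its signal senders.
struct Store {
    _signal: Sender<bool>,
}

/// Returns the receive error seen while an Arc clone lingers, and the one
/// seen after the last clone is dropped.
pub fn lingering_arc_demo() -> (RecvTimeoutError, RecvTimeoutError) {
    let (tx, rx) = channel::<bool>();
    let store = Arc::new(Store { _signal: tx });
    // Another component keeps its own reference to the store.
    let lingering = Arc::clone(&store);

    // The "main" owner goes away, but the ref count is still 1,
    // so the sender is not dropped and the receiver keeps waiting.
    drop(store);
    let while_held = rx.recv_timeout(Duration::from_millis(50)).unwrap_err();

    // Only once the last reference is gone does the channel disconnect.
    drop(lingering);
    let after_drop = rx.recv_timeout(Duration::from_millis(50)).unwrap_err();

    (while_held, after_drop)
}
```

With a plain recv() in place of recv_timeout(), the first wait would never return, which is the shape of the hang described in the problem statement.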
Summary of Changes
Fixes #