Blockstore should drop signals before validator exit #24025

Merged · 28 commits · Apr 4, 2022

Conversation

@HaoranYi (Contributor) commented Mar 31, 2022

Problem

The blockstore holds two sets of signal senders, and the blockstore is held in an Arc in many places, including RpcSubscriptions. During shutdown, a deadlock can occur: if a signal receiver is blocked on a receive call while the blockstore's ref count never goes to 0, the senders are never dropped and the receiver never wakes up.

Summary of Changes

  • Add a drop_signals function on the blockstore. When exiting, drop the signals to avoid the deadlock (a sketch of the idea follows below).
  • Change the signals to be protected by a Mutex.

Fixes #
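To make the shape of the change concrete, here is a minimal sketch assuming crossbeam_channel (the field names follow the diff below, but the method names, the drop_signal spelling, and the Sender<Vec<u64>> payload type are simplifications, not the actual solana-ledger code):

```rust
use crossbeam_channel::Sender;
use std::sync::Mutex;

struct Blockstore {
    // The senders now live behind a Mutex so they can be cleared through a
    // shared (&self) reference even while the Blockstore sits inside an Arc.
    new_shreds_signals: Mutex<Vec<Sender<bool>>>,
    completed_slots_senders: Mutex<Vec<Sender<Vec<u64>>>>,
}

impl Blockstore {
    fn add_new_shred_signal(&self, signal: Sender<bool>) {
        self.new_shreds_signals.lock().unwrap().push(signal);
    }

    fn add_completed_slots_signal(&self, signal: Sender<Vec<u64>>) {
        self.completed_slots_senders.lock().unwrap().push(signal);
    }

    // Dropping every sender disconnects the channels, so any receiver
    // blocked in recv() returns Err(RecvError) instead of hanging forever.
    fn drop_signal(&self) {
        self.new_shreds_signals.lock().unwrap().clear();
        self.completed_slots_senders.lock().unwrap().clear();
    }
}
```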

@HaoranYi changed the title from "Drop signal" to "Blockstore should drop signals before validator exit" on Apr 1, 2022
@@ -444,8 +444,8 @@ impl Blockstore {
             block_height_cf,
             program_costs_cf,
             bank_hash_cf,
-            new_shreds_signals: vec![],
-            completed_slots_senders: vec![],
+            new_shreds_signals: Mutex::<Vec<Sender<bool>>>::default(),
A contributor commented on this line of the diff:

I think you can say Mutex::default() here.
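(An illustrative snippet, not the PR's code: when the field's declared type already fixes the generic parameter, the turbofish is unnecessary, which is what the later "let compiler infer mutex type" commit in the final commit list appears to adopt.)

```rust
use std::sync::Mutex;

struct Signals {
    new_shreds_signals: Mutex<Vec<u64>>, // element type simplified here
}

impl Signals {
    fn new() -> Self {
        Self {
            // The compiler infers Mutex<Vec<u64>> from the field's declared
            // type, so this is equivalent to Mutex::<Vec<u64>>::default().
            new_shreds_signals: Mutex::default(),
        }
    }
}
```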

@jeffwashington (Contributor) commented:

Is this solving the test hang? I don't recall the context.
Does RwLock work better here, or is Mutex appropriate? I'm not familiar with the usage patterns in the normal validator.

codecov bot commented Apr 1, 2022

Codecov Report

Merging #24025 (a3e8a33) into master (51b37f0) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head a3e8a33 differs from pull request most recent head c6ef522. Consider uploading reports for the commit c6ef522 to get more accurate results

@@           Coverage Diff           @@
##           master   #24025   +/-   ##
=======================================
  Coverage    81.7%    81.7%           
=======================================
  Files         589      589           
  Lines      160603   160622   +19     
=======================================
+ Hits       131236   131256   +20     
+ Misses      29367    29366    -1     

@HaoranYi (Contributor, Author) commented Apr 1, 2022

> Is this solving the test hang? I don't recall the context. Does RwLock work better here, or is Mutex appropriate? I'm not familiar with the usage patterns in the normal validator.

Yes. This one is a follow-up to #24007.

In #24007, we fixed the hang by using a nonblocking call inside RpcCompletedSlotsService.

This change implements the idea suggested by @sakridge: when shutting down, we drop all the signal senders in the blockstore, so that all receivers, whether blocking or nonblocking, will return from receive.

In this case, I don't think RwLock vs. Mutex makes much difference. The vec inside the lock is only written twice: once at initialization and once when it is cleared at shutdown. And there is only one thread reading it inside the blockstore.

Let me add @carllin to see if there are any other similar patterns in the validator.
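For reference, a sketch of that nonblocking pattern, assuming crossbeam_channel and simplified types (the real RpcCompletedSlotsService also forwards the slots to RPC subscribers; completed_slots_loop and the Vec<u64> payload are illustrative names):

```rust
use crossbeam_channel::{Receiver, RecvTimeoutError};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

fn completed_slots_loop(receiver: Receiver<Vec<u64>>, exit: Arc<AtomicBool>) {
    loop {
        // Re-check the exit flag at least once per second instead of
        // blocking forever in recv().
        if exit.load(Ordering::Relaxed) {
            break;
        }
        match receiver.recv_timeout(Duration::from_secs(1)) {
            Ok(slots) => println!("completed slots: {slots:?}"),
            Err(RecvTimeoutError::Timeout) => continue,
            // All senders dropped (e.g. by the blockstore clearing its
            // signal vecs), so there is nothing left to wait for.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}
```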

@HaoranYi requested a review from carllin on April 1, 2022 19:40
@carllin (Contributor) commented Apr 1, 2022

Hmm, is this going to fix anything? I feel like the validator exit bool should be sufficient for triggering shutdown.

The blockstore signals are created here:

    ledger_signal_receiver,
    completed_slots_receiver,

The consumers are:

  1. RPC:

         let rpc_completed_slots_service = RpcCompletedSlotsService::spawn(
             completed_slots_receiver,
             rpc_subscriptions.clone(),
             exit.clone(),
         );

     which also gets an exit signal.

  2. The TVU:

         ledger_signal_receiver,

     which also gets an exit signal.

Thus it seems like setting the exit flag from the validator exit:

    self.validator_exit.write().unwrap().exit();

should be sufficient.
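A hedged sketch of the exit-flag mechanism being described, under the assumption that validator_exit is a lock-protected registry of exit callbacks that flip the AtomicBool each service loop polls (the Exit struct here is a simplification, not the exact solana_sdk type):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};

#[derive(Default)]
struct Exit {
    exits: Vec<Box<dyn FnOnce() + Send + Sync>>,
}

impl Exit {
    fn register_exit(&mut self, f: Box<dyn FnOnce() + Send + Sync>) {
        self.exits.push(f);
    }

    // Run every registered callback exactly once.
    fn exit(&mut self) {
        for f in self.exits.drain(..) {
            f();
        }
    }
}

fn main() {
    let exit_flag = Arc::new(AtomicBool::new(false));
    let validator_exit = Arc::new(RwLock::new(Exit::default()));

    // Each service's exit flag is registered with the shared registry.
    let flag = exit_flag.clone();
    validator_exit
        .write()
        .unwrap()
        .register_exit(Box::new(move || flag.store(true, Ordering::Relaxed)));

    // Shutdown path, mirroring the line quoted above.
    validator_exit.write().unwrap().exit();
    assert!(exit_flag.load(Ordering::Relaxed));
}
```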

@HaoranYi (Contributor, Author) commented Apr 1, 2022

@carllin
You are right. This one doesn't fix anything by itself, since #24007 already fixed the hang. The RPC code you listed above is the recent change from #24007 that fixed the hang on the blocking call.

This one can be thought of as more of a refactor and a safeguard:

  • We make the signal vecs private to the blockstore and expose an API for accessing them.
  • We clear the signal vecs before shutdown, which protects against potential future changes that block on a receiver, and it might help speed up the shutdown (no need to wait until a timeout; see the toy example after this comment).

Also, while refactoring this, I noticed another potential issue, #24051: we could potentially miss waking up signals in the replay stage. Please take a look at that as well and let me know what you think. Thanks!
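A toy example of that speedup, under the same crossbeam_channel assumption as the sketch in the PR description: clearing the Mutex-protected sender vec disconnects the channel, so even a fully blocking receiver returns immediately instead of waiting out a timeout:

```rust
use crossbeam_channel::unbounded;
use std::sync::Mutex;
use std::thread;

fn main() {
    let (sender, receiver) = unbounded::<bool>();
    // Stand-in for the blockstore's Mutex-protected signal list.
    let signals = Mutex::new(vec![sender]);

    let handle = thread::spawn(move || {
        // A blocking receiver: without a timeout, this would hang forever
        // if the senders were never dropped.
        let _ = receiver.recv(); // returns Err once the senders are cleared
    });

    // What the drop-signals step does at shutdown: clearing the vec drops
    // the senders and disconnects the channel, waking the blocked recv().
    signals.lock().unwrap().clear();
    handle.join().unwrap();
}
```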

@carllin (Contributor) left a comment:

@HaoranYi understood, thanks for explaining!

@HaoranYi merged commit 6ba4e87 into solana-labs:master on Apr 4, 2022
@HaoranYi added the v1.10 label on May 18, 2022
mergify bot pushed a commit that referenced this pull request May 18, 2022
* timeout for validator exits

* clippy

* print backtrace when panic

* add backtrace package

* increase time out to 30s

* debug logging

* make rpc complete service non blocking

* reduce log level

* remove logging

* recv_timeout

* remove backtrace

* remove sleep

* wip

* remove unused variable

* add comments

* Update core/src/validator.rs

Co-authored-by: Trent Nelson <[email protected]>

* Update core/src/validator.rs

Co-authored-by: Trent Nelson <[email protected]>

* whitespace

* more whitespace

* fix build

* clean up import

* add mutex for signal senders in blockstore

* remove mut

* refactor: extract add signal functions

* make blockstore signal private

* let compiler infer mutex type

Co-authored-by: Trent Nelson <[email protected]>
(cherry picked from commit 6ba4e87)

# Conflicts:
#	ledger/src/blockstore.rs
HaoranYi added a commit that referenced this pull request May 20, 2022
mergify bot added a commit that referenced this pull request May 20, 2022
#25326)

* Blockstore should drop signals before validator exit (#24025)
  (cherry picked from commit 6ba4e87)

* fix conflicts

Co-authored-by: HaoranYi <[email protected]>