ReplayStage: No More Clone SanitizedTransaction #3058
Conversation
I don't spot anything wrong, but I do have a couple of questions to help me understand what's happening
```rust
{
    // unlock before sending to scheduler.
    bank.unlock_accounts(transactions.iter().zip(lock_results.iter()));
```
Why is it okay to unlock accounts here? Do they not need to complete execution?
unified scheduler does its own lock management.
@ryoqun correct me if I'm wrong, but afaik the batches in the original code are dropped soon after (and unlocked) since scheduling here is non-blocking anyway.
yeah, that's correct.
Probably outside the scope of this PR, but I could use an education on how this locking mechanism works. Like how do we make sure we execute things in the right order on replay? Do we guarantee the batches get executed in order once scheduled?
Like how do we make sure we execute things in the right order on replay?
Unified scheduler realizes this with two things: the event loop and SchedulingStateMachine.
The event loop is the high-level driver of the state machine, starting from receiving tasks (= transactions) via new_task_receiver, which are sent from the blockstore:
agave/unified-scheduler-pool/src/lib.rs
Lines 1031 to 1081 in 17346cd
```rust
select_biased! {
    recv(finished_blocked_task_receiver) -> executed_task => {
        let Some(executed_task) = Self::accumulate_result_with_timings(
            &mut result_with_timings,
            executed_task.expect("alive handler")
        ) else {
            break 'nonaborted_main_loop;
        };
        state_machine.deschedule_task(&executed_task.task);
    },
    recv(dummy_unblocked_task_receiver) -> dummy => {
        assert_matches!(dummy, Err(RecvError));
        let task = state_machine
            .schedule_next_unblocked_task()
            .expect("unblocked task");
        runnable_task_sender.send_payload(task).unwrap();
    },
    recv(new_task_receiver) -> message => {
        assert!(!session_ending);
        match message {
            Ok(NewTaskPayload::Payload(task)) => {
                sleepless_testing::at(CheckPoint::NewTask(task.task_index()));
                if let Some(task) = state_machine.schedule_task(task) {
                    runnable_task_sender.send_aux_payload(task).unwrap();
                }
            }
            Ok(NewTaskPayload::CloseSubchannel) => {
                session_ending = true;
            }
            Ok(NewTaskPayload::OpenSubchannel(_context_and_result_with_timings)) =>
                unreachable!(),
            Err(RecvError) => {
                // Mostly likely is that this scheduler is dropped for pruned blocks of
                // abandoned forks...
                // This short-circuiting is tested with test_scheduler_drop_short_circuiting.
                break 'nonaborted_main_loop;
            }
        }
    },
    recv(finished_idle_task_receiver) -> executed_task => {
        let Some(executed_task) = Self::accumulate_result_with_timings(
            &mut result_with_timings,
            executed_task.expect("alive handler")
        ) else {
            break 'nonaborted_main_loop;
        };
        state_machine.deschedule_task(&executed_task.task);
    },
};
```
and this is the relevant impl code of SchedulingStateMachine:
agave/unified-scheduler-logic/src/lib.rs
Lines 657 to 755 in 17346cd
```rust
/// Schedules given `task`, returning it if successful.
///
/// Returns `Some(task)` if it's immediately scheduled. Otherwise, returns `None`,
/// indicating the scheduled task is blocked currently.
///
/// Note that this function takes ownership of the task to allow for future optimizations.
#[must_use]
pub fn schedule_task(&mut self, task: Task) -> Option<Task> {
    self.total_task_count.increment_self();
    self.active_task_count.increment_self();
    self.try_lock_usage_queues(task)
}

#[must_use]
pub fn schedule_next_unblocked_task(&mut self) -> Option<Task> {
    self.unblocked_task_queue.pop_front().inspect(|_| {
        self.unblocked_task_count.increment_self();
    })
}

/// Deschedules given scheduled `task`.
///
/// This must be called exactly once for all scheduled tasks to uphold both
/// `SchedulingStateMachine` and `UsageQueue` internal state consistency at any given moment of
/// time. It's serious logic error to call this twice with the same task or none at all after
/// scheduling. Similarly, calling this with not scheduled task is also forbidden.
///
/// Note that this function intentionally doesn't take ownership of the task to avoid dropping
/// tasks inside `SchedulingStateMachine` to provide an offloading-based optimization
/// opportunity for callers.
pub fn deschedule_task(&mut self, task: &Task) {
    self.active_task_count.decrement_self();
    self.handled_task_count.increment_self();
    self.unlock_usage_queues(task);
}

#[must_use]
fn try_lock_usage_queues(&mut self, task: Task) -> Option<Task> {
    let mut blocked_usage_count = ShortCounter::zero();
    for context in task.lock_contexts() {
        context.with_usage_queue_mut(&mut self.usage_queue_token, |usage_queue| {
            let lock_result = if usage_queue.has_no_blocked_usage() {
                usage_queue.try_lock(context.requested_usage)
            } else {
                LockResult::Err(())
            };
            if let Err(()) = lock_result {
                blocked_usage_count.increment_self();
                let usage_from_task = (context.requested_usage, task.clone());
                usage_queue.push_blocked_usage_from_task(usage_from_task);
            }
        });
    }
    // no blocked usage count means success
    if blocked_usage_count.is_zero() {
        Some(task)
    } else {
        task.set_blocked_usage_count(&mut self.count_token, blocked_usage_count);
        None
    }
}

fn unlock_usage_queues(&mut self, task: &Task) {
    for context in task.lock_contexts() {
        context.with_usage_queue_mut(&mut self.usage_queue_token, |usage_queue| {
            let mut unblocked_task_from_queue = usage_queue.unlock(context.requested_usage);
            while let Some((requested_usage, task_with_unblocked_queue)) =
                unblocked_task_from_queue
            {
                // When `try_unblock()` returns `None` as a failure of unblocking this time,
                // this means the task is still blocked by other active task's usages. So,
                // don't push task into unblocked_task_queue yet. It can be assumed that every
                // task will eventually succeed to be unblocked, and enter in this condition
                // clause as long as `SchedulingStateMachine` is used correctly.
                if let Some(task) = task_with_unblocked_queue.try_unblock(&mut self.count_token)
                {
                    self.unblocked_task_queue.push_back(task);
                }
                match usage_queue.try_lock(requested_usage) {
                    LockResult::Ok(()) => {
                        // Try to further schedule blocked task for parallelism in the case of
                        // readonly usages
                        unblocked_task_from_queue =
                            if matches!(requested_usage, RequestedUsage::Readonly) {
                                usage_queue.pop_unblocked_readonly_usage_from_task()
                            } else {
                                None
                            };
                    }
                    LockResult::Err(()) => panic!("should never fail in this context"),
                }
            }
        });
    }
}
```
Do we guarantee the batches get executed in order once scheduled?
Yes, while maximally parallelizing task execution. Order here is the order transactions are sent over the crossbeam channel.
it uses crossbeam order, but it also verifies the index is never out of order for conflicts, iirc.
index order is guaranteed with the way we send process entries in order (until conflict)
well, at the last minute, we removed the index verification: solana-labs#35286 (comment)
index order is guaranteed with the way we send process entries in order (until conflict)
this is true.
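To make that ordering property concrete, here is a minimal, self-contained sketch of the idea (the names and types are simplified stand-ins, not the actual Task/UsageQueue code): each write-locked account keeps a FIFO of tasks in submission order, so conflicting tasks complete in the order they arrive over the channel while non-conflicting tasks are free to run in parallel.

```rust
use std::collections::{HashMap, VecDeque};

// Simplified stand-in for a transaction task: an index (submission order) plus
// the accounts it write-locks.
#[derive(Clone, Debug)]
struct Task {
    index: usize,
    writes: Vec<&'static str>,
}

#[derive(Default)]
struct ToyScheduler {
    // One FIFO per account: front = current lock holder, rest = blocked tasks
    // in arrival (channel) order.
    queues: HashMap<&'static str, VecDeque<usize>>,
}

impl ToyScheduler {
    // Returns true if the task can run immediately; otherwise it waits behind
    // earlier conflicting tasks, preserving submission order per account.
    fn schedule(&mut self, task: &Task) -> bool {
        let mut runnable = true;
        for &account in &task.writes {
            let queue = self.queues.entry(account).or_default();
            if !queue.is_empty() {
                runnable = false; // blocked by an earlier conflicting task
            }
            queue.push_back(task.index);
        }
        runnable
    }

    // Called when a task finishes: releases its usages and reports which task
    // index is next in line on each of its accounts.
    fn deschedule(&mut self, task: &Task) -> Vec<usize> {
        let mut next_in_line = Vec::new();
        for &account in &task.writes {
            let queue = self.queues.get_mut(account).unwrap();
            queue.pop_front();
            if let Some(&next) = queue.front() {
                next_in_line.push(next);
            }
        }
        next_in_line
    }
}

fn main() {
    let t0 = Task { index: 0, writes: vec!["alice"] };
    let t1 = Task { index: 1, writes: vec!["alice"] }; // conflicts with t0
    let t2 = Task { index: 2, writes: vec!["bob"] };   // independent of both

    let mut scheduler = ToyScheduler::default();
    assert!(scheduler.schedule(&t0));   // runs immediately
    assert!(!scheduler.schedule(&t1));  // blocked behind t0
    assert!(scheduler.schedule(&t2));   // runs in parallel with t0
    assert_eq!(scheduler.deschedule(&t0), vec![1]); // t0 finishing unblocks t1, in order
}
```

The real SchedulingStateMachine additionally counts blocked usages per task and distinguishes readonly from writable usage, as shown in the quoted code above.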
ledger/src/blockstore_processor.rs (outdated)
```diff
@@ -496,7 +507,7 @@ fn rebatch_and_execute_batches(
     let target_batch_count = get_thread_count() as u64;

     let mut tx_batches: Vec<TransactionBatchWithIndexes<SanitizedTransaction>> = vec![];
     let rebatched_txs = if total_cost > target_batch_count.saturating_mul(minimal_tx_cost) {
```
What's this change about?
What this was doing before was checking some minimum cost, and if below that, using the original batches passed in instead of rebatching.
But in the new case, ownership was passed, so we don't have any original batches to fall back to. I opted to simplify and just always run the same logic.
Possibly I could re-order the above "flattening logic" with the total_cost check.
Then we could more easily keep the original "if {} else {}" rebatching - no need to flatten in the case it's below the threshold cost either.
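To illustrate the trade-off being discussed, a hedged, self-contained sketch (simplified stand-in types, not the actual rebatch_and_execute_batches code): the first function mirrors "always flatten and rebatch", the second the alternative of checking total_cost before flattening so the original batches can be kept below the threshold.

```rust
// A u64 stands in for a transaction in this sketch.
struct Batch {
    txs: Vec<u64>,
    cost: u64,
}

// What the simplified approach does: flatten (taking ownership) and always rebatch.
fn always_rebatch(batches: Vec<Batch>, target_batch_count: u64) -> Vec<Vec<u64>> {
    let txs: Vec<u64> = batches.into_iter().flat_map(|b| b.txs).collect(); // originals consumed
    rebatch(txs, target_batch_count)
}

// The alternative mentioned above: check the cost *before* flattening, so the
// original batches can be returned untouched when rebatching isn't worthwhile.
fn maybe_rebatch(batches: Vec<Batch>, target_batch_count: u64, minimal_tx_cost: u64) -> Vec<Vec<u64>> {
    let total_cost: u64 = batches.iter().map(|b| b.cost).sum();
    if total_cost > target_batch_count.saturating_mul(minimal_tx_cost) {
        let txs: Vec<u64> = batches.into_iter().flat_map(|b| b.txs).collect();
        rebatch(txs, target_batch_count)
    } else {
        // below the threshold: keep the original batch boundaries, no flattening needed
        batches.into_iter().map(|b| b.txs).collect()
    }
}

// Naive even split, only to keep the sketch self-contained.
fn rebatch(txs: Vec<u64>, target_batch_count: u64) -> Vec<Vec<u64>> {
    let chunk_size = ((txs.len() as u64).div_ceil(target_batch_count.max(1)) as usize).max(1);
    txs.chunks(chunk_size).map(|c| c.to_vec()).collect()
}

fn main() {
    let batches = vec![
        Batch { txs: vec![1, 2], cost: 10 },
        Batch { txs: vec![3], cost: 5 },
    ];
    // With a high threshold, maybe_rebatch keeps the original shape: [[1, 2], [3]].
    assert_eq!(maybe_rebatch(batches, 4, 100), vec![vec![1, 2], vec![3]]);

    let batches = vec![Batch { txs: vec![4, 5, 6, 7], cost: 40 }];
    // always_rebatch splits regardless of cost: with 2 target batches -> [[4, 5], [6, 7]].
    assert_eq!(always_rebatch(batches, 2), vec![vec![4, 5], vec![6, 7]]);
}
```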
I think this messed up some tests because it rebatches unexpectedly. I'll take a shot at preserving the previous behavior.
restored: 8eae1cc
```diff
@@ -445,18 +446,18 @@ impl BankWithScheduler {
     /// Calling this will panic if the installed scheduler is Unavailable (the bank is
     /// wait_for_termination()-ed or the unified scheduler is disabled in the first place).
     // 'a is needed; anonymous_lifetime_in_impl_trait isn't stabilized yet...
```
could you remove the last line of this comment, because 'a is gone: // 'a is needed; anonymous_lifetime_in_impl_trait isn't stabilized yet...
```rust
    transaction: SanitizedTransaction,
    index: usize,
```
hehe, this is a remnant of a never-shipped attempt at sneaking GAT into our code base... thanks for cleaning up.
ledger/src/blockstore_processor.rs (outdated)
```rust
for hash in &tick_hashes {
    bank.register_tick(hash);
}
tick_hashes.clear();
```
nit: for hash in tick_hashes.drain(..), for API use-pattern consistency with the batches vec?
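As a side note, a tiny standalone illustration of the suggested pattern (the println is a placeholder for bank.register_tick): drain(..) yields owned items and empties the vec, so the trailing clear() becomes unnecessary.

```rust
fn main() {
    let mut tick_hashes = vec![String::from("hash-0"), String::from("hash-1")];

    // Suggested pattern: drain the vec instead of iterating by reference and
    // clearing afterwards; the vec is left empty once the loop finishes.
    for hash in tick_hashes.drain(..) {
        println!("register_tick({hash})"); // placeholder for bank.register_tick(&hash)
    }
    assert!(tick_hashes.is_empty()); // no separate tick_hashes.clear() needed
}
```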
Force-pushed df29bfc to d02a08b: did a force push since @yihau disabled the Windows CI that was breaking. Wasn't sure how else to get around it.
```rust
) -> ScheduleResult {
    let task = SchedulingStateMachine::create_task(transaction.clone(), index, &mut |pubkey| {
```
byebye: .clone()
```diff
@@ -2672,7 +2673,9 @@ mod tests {
             .take(10000)
         {
             let scheduler = pool.take_scheduler(context.clone());
-            scheduler.schedule_execution(&(dummy_tx, index)).unwrap();
+            scheduler
+                .schedule_execution(dummy_tx.clone(), index)
```
nit: seems this .clone() isn't needed anymore.
i miss clippy's redundant_clone... ;)
this one is necessary because it's in a loop
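For context on why this particular clone has to stay, a minimal standalone example of the ownership rule involved (DummyTx and the function here are placeholders, not the real scheduler API): a value moved into a call inside a loop body must be cloned, because the next iteration would otherwise use an already-moved value.

```rust
#[derive(Clone, Debug)]
struct DummyTx(u64); // placeholder for an owned transaction type

// Placeholder mirroring the new ownership-taking shape of schedule_execution.
fn schedule_execution(transaction: DummyTx, index: usize) {
    println!("scheduled {transaction:?} at index {index}");
}

fn main() {
    let dummy_tx = DummyTx(42);
    for index in 0..3 {
        // Without .clone() this would not compile: `dummy_tx` would be moved on the
        // first iteration and no longer available on the next one.
        schedule_execution(dummy_tx.clone(), index);
    }
}
```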
lgtm with a nit.
thanks for largely simplifying the api even around unified scheduler.
Problem
- With RuntimeTransaction, we do not want to implement Clone for RuntimeTransaction
- Cloning happens in ReplayStage and it's probably slowing us down quite a bit

Summary of Changes
- Pass ownership of entries through process_entries and process_batches
- schedule_execution takes ownership rather than cloning

Fixes #