Reduce ProgramCache write lock contention #1037

Merged · 3 commits · Apr 27, 2024
7 changes: 7 additions & 0 deletions program-runtime/src/loaded_programs.rs
@@ -638,6 +638,7 @@ pub struct ProgramCacheForTxBatch {
/// The epoch of the last rerooting
pub latest_root_epoch: Epoch,
pub hit_max_limit: bool,
pub loaded_missing: bool,
[Lichtso marked this conversation as resolved.]
}

impl ProgramCacheForTxBatch {
@@ -654,6 +655,7 @@ impl ProgramCacheForTxBatch {
upcoming_environments,
latest_root_epoch,
hit_max_limit: false,
loaded_missing: false,
}
}

@@ -669,6 +671,7 @@ impl ProgramCacheForTxBatch {
upcoming_environments: cache.get_upcoming_environments_for_epoch(epoch),
latest_root_epoch: cache.latest_root_epoch,
hit_max_limit: false,
loaded_missing: false,
}
}

@@ -725,6 +728,10 @@ impl ProgramCacheForTxBatch {
self.replenish(*key, entry.clone());
})
}

pub fn is_empty(&self) -> bool {
self.entries.is_empty()
}
}

pub enum ProgramCacheMatchCriteria {
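The hunk above adds a `loaded_missing` flag and an `is_empty()` helper to the batch-local cache. A minimal self-contained sketch of how such a batch-local view is meant to be consulted (placeholder `TxBatchCache` type and `u64`/`String` entry types are stand-ins, not the real `ProgramCacheForTxBatch` API):

```rust
use std::collections::HashMap;

// Simplified stand-in for a batch-local program cache. `loaded_missing`
// records whether this batch had to pull any program in from storage, so
// callers can skip global-cache maintenance when nothing new was loaded.
#[derive(Default)]
struct TxBatchCache {
    entries: HashMap<u64, String>, // key -> program (placeholder types)
    hit_max_limit: bool,
    loaded_missing: bool,
}

impl TxBatchCache {
    fn replenish(&mut self, key: u64, program: String) {
        self.entries.insert(key, program);
    }

    // Mirrors the `is_empty` helper added in this hunk: lets the commit
    // path skip the global write lock for batches that modified nothing.
    fn is_empty(&self) -> bool {
        self.entries.is_empty()
    }
}

fn main() {
    let mut cache = TxBatchCache::default();
    assert!(cache.is_empty());
    cache.loaded_missing = true; // a missing program was cooperatively loaded
    cache.replenish(1, "program".to_string());
    assert!(!cache.is_empty());
    println!("loaded_missing={} empty={}", cache.loaded_missing, cache.is_empty());
}
```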
2 changes: 1 addition & 1 deletion runtime/src/bank.rs
@@ -4162,7 +4162,7 @@ impl Bank {
programs_modified_by_tx,
} = execution_result
{
if details.status.is_ok() {
if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
@ryoqun (Member, Author) · Apr 25, 2024:
hope this one is fairly uncontroversial. haha

let mut cache = self.transaction_processor.program_cache.write().unwrap();
@ryoqun (Member, Author) · Apr 25, 2024:
actually, I noticed while writing this that this write lock is taken per transaction when the batch contains 2 or more transactions: #1037 (comment)
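The contention being pointed out here can be sketched with `std::sync::RwLock`: taking the write lock once per transaction versus once for the whole batch. The `Cache` type and the `commit_per_tx`/`commit_per_batch` helpers are hypothetical illustrations, not the real `ProgramCache` API:

```rust
use std::sync::RwLock;

// Toy cache that just counts merges; stands in for the global ProgramCache.
struct Cache {
    merges: usize,
}

impl Cache {
    fn merge(&mut self, _modified: &[u64]) {
        self.merges += 1;
    }
}

// Write lock acquired N times for N transactions (the pattern flagged above).
fn commit_per_tx(cache: &RwLock<Cache>, batch: &[Vec<u64>]) {
    for modified in batch {
        cache.write().unwrap().merge(modified);
    }
}

// Write lock acquired once and held across the whole batch.
fn commit_per_batch(cache: &RwLock<Cache>, batch: &[Vec<u64>]) {
    let mut guard = cache.write().unwrap();
    for modified in batch {
        guard.merge(modified);
    }
}

fn main() {
    let batch = vec![vec![1], vec![2, 3]];
    let per_tx = RwLock::new(Cache { merges: 0 });
    commit_per_tx(&per_tx, &batch);
    let per_batch = RwLock::new(Cache { merges: 0 });
    commit_per_batch(&per_batch, &batch);
    // Both merge the same data; only the lock-acquisition count differs.
    assert_eq!(per_tx.read().unwrap().merges, 2);
    assert_eq!(per_batch.read().unwrap().merges, 2);
    println!("both merged 2 tx results");
}
```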

@Lichtso · Apr 25, 2024:
You are right. How about:

    if execution_results.iter().any(|execution_result| matches!(execution_result, TransactionExecutionResult::Executed { details, programs_modified_by_tx } if details.status.is_ok() && !programs_modified_by_tx.is_empty())) {
        let mut cache = self.transaction_processor.program_cache.write().unwrap();
        for execution_result in &execution_results {
            if let TransactionExecutionResult::Executed { programs_modified_by_tx, .. } = execution_result {
                cache.merge(programs_modified_by_tx);
            }
        }
    }

@ryoqun (Member, Author) · Apr 26, 2024:
hmm, that incurs two-pass looping in the worst case (O(2N) in total), plus fairly heavy code duplication.

Considering that !programs_modified_by_tx.is_empty() should be rare (absent malice), I think a quick-and-dirty memoization like this will be enough (the worst case's overall cost is O(Cm*N), where Cm << 2 is the memoization overhead): cce3075
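The "quick and dirty memoization" idea (one pass, with the write lock acquired lazily on the first transaction that actually modified programs) can be sketched as follows. Types and the `merge_modified` helper are placeholders for illustration, not necessarily the code in cce3075:

```rust
use std::sync::RwLock;

// Stand-in for the global ProgramCache.
struct Cache;

impl Cache {
    fn merge(&mut self, _modified: &[u64]) {}
}

// One pass over the batch results: the write guard is acquired only once,
// and only if some successful transaction actually modified programs.
// Each result is (status_is_ok, programs_modified_by_tx).
fn merge_modified(cache: &RwLock<Cache>, results: &[(bool, Vec<u64>)]) -> bool {
    let mut guard = None; // memoized write guard
    for (ok, modified) in results {
        if *ok && !modified.is_empty() {
            guard
                .get_or_insert_with(|| cache.write().unwrap())
                .merge(modified);
        }
    }
    guard.is_some() // true iff the lock was ever taken
}

fn main() {
    let cache = RwLock::new(Cache);
    // No successful tx modified programs: the lock is never taken.
    assert!(!merge_modified(&cache, &[(true, vec![]), (false, vec![1])]));
    // One qualifying tx: the lock is taken exactly once for the batch.
    assert!(merge_modified(&cache, &[(true, vec![1]), (true, vec![2])]));
    println!("ok");
}
```

This avoids both the `any()` pre-scan (the two-pass worst case) and the per-transaction lock acquisition.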

cache.merge(programs_modified_by_tx);
}
23 changes: 15 additions & 8 deletions svm/src/transaction_processor.rs
@@ -292,14 +292,20 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {

execution_time.stop();

const SHRINK_LOADED_PROGRAMS_TO_PERCENTAGE: u8 = 90;
self.program_cache
.write()
.unwrap()
.evict_using_2s_random_selection(
Percentage::from(SHRINK_LOADED_PROGRAMS_TO_PERCENTAGE),
self.slot,
);
// Skip eviction when there's no chance this particular tx batch has increased the size of
// ProgramCache entries. Note that this flag is deliberately defined, so that there's still
// at least one other batch, which will evict the program cache, even after the occurrences
// of cooperative loading.
if programs_loaded_for_tx_batch.borrow().loaded_missing {
@ryoqun (Member, Author) · Apr 25, 2024:
i guess this one isn't so straightforward. better ideas are very welcome.

A reviewer replied:

Global cache can also grow via cache.merge(programs_modified_by_tx); above, not just by loading missing.

@ryoqun (Member, Author) replied:
how about this? d50c11c

@Lichtso · Apr 25, 2024:

Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.
Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

@ryoqun (Member, Author) · Apr 25, 2024:

> Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.

I intentionally chose a boolean, thinking the number of insertions and evictions doesn't need to be balanced. That's because evict_using_2s_random_selection() continues to evict entries until they're under 90% of MAX_LOADED_ENTRY_COUNT (= 256) with just a single invocation. So we only need to ensure it is called with sufficient frequency/timing to avoid a cache-bomb DoS attack.

> Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

This is possible, and it looks appealing; however, it isn't trivial. Firstly, load_and_execute_sanitized_transactions can be entered via 3 code paths: replaying, banking, and RPC tx simulation. I guess that's why this eviction was placed here to begin with, as the most shared code path for all transaction executions.

The place where we merge in new deployments is commit_transactions(), which isn't touched by RPC tx simulation for obvious reasons. So moving this eviction there would expose an unbounded program-cache entry-growth DoS (theoretically; it assumes no new blocks for an extended duration). Also, replaying and banking take the commit code path under slightly different semantics, so moving this eviction would need a bit of care nevertheless, even if we ignore the RPC concern...

All that said, I think the current code change should be good enough and safe enough?
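The claim that a boolean trigger suffices rests on a single eviction pass driving the cache back under the target percentage, regardless of how many insertions preceded it. A toy sketch of that behavior (hypothetical `Cache` type and a stubbed RNG; the real code uses evict_using_2s_random_selection with random selection over cache entries):

```rust
// Toy cache: a flat list of entries plus a capacity, standing in for the
// real program cache and its MAX_LOADED_ENTRY_COUNT (= 256).
struct Cache {
    entries: Vec<u64>,
    max_entries: usize,
}

impl Cache {
    // Keep evicting randomly chosen entries until the cache is back under
    // `percent` of capacity. One invocation fully restores the bound, so
    // eviction calls don't need to be balanced 1:1 with insertions.
    fn evict_to_percentage(&mut self, percent: usize, mut rng: impl FnMut(usize) -> usize) {
        let target = self.max_entries * percent / 100; // e.g. 90% of 256 = 230
        while self.entries.len() > target {
            let victim = rng(self.entries.len());
            self.entries.swap_remove(victim);
        }
    }
}

fn main() {
    let mut cache = Cache {
        entries: (0..256).collect(),
        max_entries: 256,
    };
    // Deterministic "rng" for the sketch: always evict index 0.
    cache.evict_to_percentage(90, |_| 0);
    assert_eq!(cache.entries.len(), 230);
    println!("after eviction: {} entries", cache.entries.len());
}
```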

const SHRINK_LOADED_PROGRAMS_TO_PERCENTAGE: u8 = 90;
self.program_cache
.write()
.unwrap()
.evict_using_2s_random_selection(
Percentage::from(SHRINK_LOADED_PROGRAMS_TO_PERCENTAGE),
self.slot,
);
}

debug!(
"load: {}us execute: {}us txs_len={}",
@@ -395,6 +401,7 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {
}
// Submit our last completed loading task.
if let Some((key, program)) = program_to_store.take() {
loaded_programs_for_txs.as_mut().unwrap().loaded_missing = true;
if program_cache.finish_cooperative_loading_task(self.slot, key, program)
&& limit_to_load_programs
{