Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(engine): cleanup leaked context #13749

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

fgimenez
Copy link
Member

@fgimenez fgimenez commented Jan 9, 2025

After the introduction on std::thread::scope for creating StateRootTask, we started using Box::leak to keep alive values required to return and move around the state hook closure. This caused the variables in the kept context to cause errors like:

WARN Long-lived read transaction has been timed out

This PR includes changes to the parameters needed to create the blinded provider factory required by StateRootTask using Arc instead of references, which don't have a big performance impact and are automatically cleaned up without lifetime issues.

A provider transaction reference was one of the parameters required to create the blinded provider factory. Instead of it, we are now passing an Arc<Provider>, which can obtain the transaction reference as needed and doesn't impose additional lifetime constraints. This change requires further changes in all the code that use the cursors, limited to the trie crates.

@fgimenez fgimenez added A-blockchain-tree Related to sidechains, reorgs and pending blocks C-perf A change motivated by improving speed, memory usage or disk footprint labels Jan 9, 2025
@fgimenez fgimenez marked this pull request as ready for review January 9, 2025 10:33
@fgimenez fgimenez requested a review from shekhirin January 9, 2025 10:34
Copy link
Collaborator

@shekhirin shekhirin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works

@fgimenez fgimenez added this pull request to the merge queue Jan 9, 2025
@fgimenez fgimenez removed this pull request from the merge queue due to a manual request Jan 9, 2025
@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch 3 times, most recently from 7ebcaa8 to 227dc8b Compare January 12, 2025 15:33
Copy link

codspeed-hq bot commented Jan 12, 2025

CodSpeed Performance Report

Merging #13749 will degrade performances by 14.87%

Comparing fgimenez/cleanup-leaked-context (cd63408) with main (20a3a6a)

Summary

❌ 3 regressions
✅ 74 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main fgimenez/cleanup-leaked-context Change
hash builder[init size 10000 | update size 100 | num updates 1] 8.9 ms 10.2 ms -12.23%
hash builder[init size 10000 | update size 100 | num updates 3] 26.2 ms 30.1 ms -12.84%
hash builder[init size 10000 | update size 100 | num updates 5] 42.8 ms 50.3 ms -14.87%

Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes makes sense overall, but I assume the lifetime issue is more tricky because we use a borrowed tx for the HashedPostStateCursor etc.

which wouldn't be 'static that the roottask requires.

although we have ownership of the consistentview, so this problem seems to arise from us using tx_ref, if I understood this correctly we could perhaps consume the provider_ro and its tx (into_tx instead of tx_ref)?

Comment on lines 2276 to 2288
let provider_ro = consistent_view.provider_ro()?;
let nodes_sorted = state_root_config.nodes_sorted.clone();
let state_sorted = state_root_config.state_sorted.clone();
let prefix_sets = state_root_config.prefix_sets.clone();

let trie_cursor = Arc::new(InMemoryTrieCursorFactory::new(
DatabaseTrieCursorFactory::new(provider_ro.tx_ref()),
nodes_sorted,
));
let state_cursor = Arc::new(HashedPostStateCursorFactory::new(
DatabaseHashedCursorFactory::new(provider_ro.tx_ref()),
state_sorted,
));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the problematic section because this borrows and is not 'static

2276 | let provider_ro = consistent_view.provider_ro()?;
| ----------- binding provider_ro declared here
...
2286 | DatabaseHashedCursorFactory::new(provider_ro.tx_ref()),
| ^^^^^^^^^^^---------
| |
| borrowed value does not live long enough
| argument requires that provider_ro is borrowed for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, we previously had the same issue with nodes_sorted and state_sorted, was fixed by turning reference parameters into Arcs here a5103d5, currently in progress passing ownership of the provider factory as you mentioned

Comment on lines 288 to 291
BPF: BlindedProviderFactory + Send + Sync + 'static,
BPF::AccountNodeProvider: BlindedProvider + Send + Sync + 'static,
BPF::StorageNodeProvider: BlindedProvider + Send + Sync + 'static,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which would clash with these here I assume

@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch 3 times, most recently from 5f4e69d to a23771d Compare January 15, 2025 11:03
@fgimenez fgimenez requested a review from gakonst as a code owner January 15, 2025 11:03
@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from a14c037 to b531bf7 Compare January 16, 2025 10:56
@fgimenez fgimenez requested a review from DaniPopes as a code owner January 16, 2025 13:23
@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from af808c2 to 4131ec8 Compare January 16, 2025 15:35
Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, what's the main reason why we need Arc now in a bunch of places. this appears a bit invasive now

is this because of

pub struct DatabaseTrieCursorFactory<'a, TX>(&'a TX);

pub struct DatabaseHashedCursorFactory<'a, TX>(&'a TX);

?
but we need an owned instance?

because we previously could use both factories with a single reference and we're forced to own so we can clone+spawn, this then does not work for this pr.

what if we instead of having two factory types use only one and implement both traits for that,
another idea would be to not require &Tx and Tx: DbTx and instead use a new helper trait that can handle both &Tx and owned provider?

@@ -162,7 +162,7 @@ impl<C: ChainSpecParser<ChainSpec = ChainSpec>> Command<C> {

td += sealed_block.difficulty();
let mut executor = executor_provider.batch_executor(StateProviderDatabase::new(
LatestStateProviderRef::new(&provider_rw),
LatestStateProviderRef::new(provider_rw.clone()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, why does this Ref type now requre an owned instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, LatestStateProviderRef and LatestStateProvider can be merged with the current approach, will address this depending on how we decide to go forward

@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from 624d05e to 8d68b54 Compare January 17, 2025 08:42
@fgimenez
Copy link
Member Author

because we previously could use both factories with a single reference and we're forced to own so we can clone+spawn, this then does not work for this pr.

yes, that's exactly the case. moreover, several places use the from_tx method for creating both factories and other types that depend on them, for instance reth_trie_db::Proof

/// Create a new [Proof] instance from database transaction.
fn from_tx(tx: &'a TX) -> Self {
Self::new(DatabaseTrieCursorFactory::new(tx), DatabaseHashedCursorFactory::new(tx))
}

given that we are forced to own, having the ability to create the factories and related types from an owned instance is pervasive and affects a lot of places.

what if we instead of having two factory types use only one and implement both traits for that,
another idea would be to not require &Tx and Tx: DbTx and instead use a new helper trait that can handle both &Tx and owned provider?

thx a lot for the suggestions! will give them a try, i think the problem with the first one is that in any case we need an owned instance, and the changes would cascade as we have now, will let you know how it goes

@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from 8d68b54 to c2f6d17 Compare January 17, 2025 16:32
@fgimenez
Copy link
Member Author

another idea would be to not require &Tx and Tx: DbTx and instead use a new helper trait that can handle both &Tx and owned provider?

Finally went for this second option, i already had in place a DatabaseRef trait that was being implemented for the provider, i've added a wrapper type for the transaction that also implements it. Now the cursor factories keep the old new constructor that accepts a &Tx (and keeps being used the same as it was) and have a new from_provider used in the engine. PTAL, still a bunch of changes but limited to the trie crates

@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from a4c1aaf to 14ec3ef Compare January 17, 2025 17:39
@fgimenez
Copy link
Member Author

i've added a wrapper type for the transaction that also implements it.

a bit simpler now without the wrapper, implementing the helper trait directly on &Tx

@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from 14ec3ef to a3dc40c Compare January 18, 2025 10:38
@fgimenez fgimenez force-pushed the fgimenez/cleanup-leaked-context branch from a3dc40c to cd63408 Compare January 18, 2025 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-blockchain-tree Related to sidechains, reorgs and pending blocks C-perf A change motivated by improving speed, memory usage or disk footprint
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants