This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Introduce tiled tx execution model instead of batched entries at banking/replaying #23548

Closed
ryoqun opened this issue Mar 9, 2022 · 13 comments
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@ryoqun
Contributor

ryoqun commented Mar 9, 2022

(still a draft; also i may be missing an obvious point; this is just a rough idea)

Problem

The current execution model (a batch of transactions collectively locks all touched accounts as a form of coarse locking) is suboptimal for full cpu utilization in general, and it delays unrelated transactions, given the vastly-varying processing times of transactions.

We're trying to fix this with compute units. However, I think it's hard to correctly model the parallelization factor (i.e. a serially-executed IDO transaction's processing time should be weighted heavier than other similarly-heavy transactions that can run in parallel).

I guess the current design and implementation assumed a rather uniform tx load and prioritized ease of batching (primarily for the gpu).

Proposed Solution

At the very bottom of the story, the leader's job/economic incentive is to saturate N cores as much as possible to collect tx fees. So, let's design from that starting point with a fresh mind.

Current (NOTE: I have limited knowledge of the banking stage; I'm writing this based on how the replay stage works):

[figure: current batched-entry execution model]

Assume tx1{a,b} are compute-heavy transactions writing to a common account, tx2{a,b} are light transactions writing to another common account, etc.

So, completion of batch1 is blocked by tx1a, and we can't execute tx2b in batch2 even if tx2a has already finished.

note that execution of the transactions in a single entry is indeed parallelized by the rayon pool, but i think an absurdly-long-running outlier transaction still causes cpu under-utilization. this could be further exacerbated by requested (increased) compute units.
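A toy sketch (purely illustrative, not the actual banking/replay code) of why coarse batch-level locking under-utilizes the cpu: a batch's locks are only released once its slowest transaction finishes, so a conflicting transaction in the next batch waits on the whole batch even if its own conflict cleared long ago.

```rust
// Hypothetical illustration: with coarse batch locking, a batch holds every
// touched account's lock until *all* of its transactions finish, so the batch's
// wall-clock time is driven by its slowest member, and a conflicting batch
// cannot start any of its transactions until then.

#[derive(Clone, Copy)]
struct Tx {
    write_account: u64, // single write-locked account, for simplicity
    exec_us: u64,       // execution time in microseconds
}

fn batch_duration(batch: &[Tx]) -> u64 {
    // All txs in a batch run in parallel, but their locks are released together.
    batch.iter().map(|tx| tx.exec_us).max().unwrap_or(0)
}

fn main() {
    let batch1 = [
        Tx { write_account: 1, exec_us: 9_000 }, // tx1a: compute-heavy
        Tx { write_account: 2, exec_us: 100 },   // tx2a: light
    ];
    let batch2 = [
        Tx { write_account: 1, exec_us: 9_000 }, // tx1b conflicts with tx1a
        Tx { write_account: 2, exec_us: 100 },   // tx2b conflicts only with tx2a
    ];

    let t1 = batch_duration(&batch1);
    // tx2b could start at t=100 (when tx2a finished), but under batch-level
    // locking all of batch2 waits for batch1's slowest tx.
    println!("batch2 starts at {} us instead of 100 us", t1);
    println!("total: {} us", t1 + batch_duration(&batch2));
}
```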

(Pre-)Proposed:

[figure: proposed tiled span execution model]

Obviously, we need a deterministic algorithm for determining the boundaries of this variable/dynamic span. However, i think we can do this by closing a span as soon as the oldest-started transaction finishes execution (i.e. the longest-running tx so far). Each transaction in spanN can then be identified by coreN and its local index within coreN inside spanN.
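A rough sketch of that boundary rule, with hypothetical names (SpanTracker, on_start, on_finish are not existing code): a span closes the moment the oldest still-running transaction completes, and every transaction started inside the span gets a deterministic (span, core, local index) identity.

```rust
use std::collections::VecDeque;

#[derive(Debug)]
struct Placement {
    span: usize,
    core: usize,
    local_index: usize,
}

struct SpanTracker {
    num_cores: usize,
    current_span: usize,
    per_core_count: Vec<usize>, // txs each core has started in the current span
    running: VecDeque<u64>,     // tx ids in start order (front = oldest)
}

impl SpanTracker {
    fn new(num_cores: usize) -> Self {
        Self {
            num_cores,
            current_span: 0,
            per_core_count: vec![0; num_cores],
            running: VecDeque::new(),
        }
    }

    fn on_start(&mut self, tx_id: u64, core: usize) -> Placement {
        self.running.push_back(tx_id);
        let local_index = self.per_core_count[core];
        self.per_core_count[core] += 1;
        Placement { span: self.current_span, core, local_index }
    }

    fn on_finish(&mut self, tx_id: u64) {
        // If the *oldest started* transaction just finished, close the span.
        if self.running.front() == Some(&tx_id) {
            self.running.pop_front();
            self.current_span += 1;
            self.per_core_count = vec![0; self.num_cores];
        } else {
            self.running.retain(|id| *id != tx_id);
        }
    }
}

fn main() {
    let mut tracker = SpanTracker::new(2);
    println!("{:?}", tracker.on_start(1, 0)); // span 0, core 0, index 0
    println!("{:?}", tracker.on_start(2, 1)); // span 0, core 1, index 0
    tracker.on_finish(1);                     // oldest tx finished -> span closes
    println!("{:?}", tracker.on_start(3, 0)); // span 1, core 0, index 0
}
```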

then, at a high level, the banking stage pools N threads, and each thread fetches pending transactions as long as it's idle and the tx's accounts aren't contended.

In that way, highly-contended transactions are naturally isolated and parallelizable transactions are processed as fast as possible.

note that inter-core account dependencies will be represented as a vector clock (vclock) to encode the proper execution ordering.
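A minimal sketch of the per-thread fetch rule described above (hypothetical types; the vclock ordering is omitted): an idle worker takes the next pending transaction whose write accounts aren't currently locked, so contended flows serialize on one core while everything else keeps the remaining cores busy.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug)]
struct PendingTx {
    id: u64,
    write_accounts: Vec<u64>,
}

/// Pick the first pending tx that does not touch a currently-locked account,
/// lock its accounts, and return it. `None` means every pending tx is contended.
fn take_uncontended(
    pending: &mut Vec<PendingTx>,
    locked: &mut HashSet<u64>,
) -> Option<PendingTx> {
    let pos = pending
        .iter()
        .position(|tx| tx.write_accounts.iter().all(|a| !locked.contains(a)))?;
    let tx = pending.remove(pos);
    locked.extend(tx.write_accounts.iter().copied());
    Some(tx)
}

fn main() {
    let mut pending = vec![
        PendingTx { id: 1, write_accounts: vec![100] },
        PendingTx { id: 2, write_accounts: vec![100] }, // contended with id 1
        PendingTx { id: 3, write_accounts: vec![200] }, // independent
    ];
    let mut locked = HashSet::new();

    // A free worker grabs tx 1; the next free worker skips tx 2 and grabs tx 3.
    println!("{:?}", take_uncontended(&mut pending, &mut locked)); // id 1
    println!("{:?}", take_uncontended(&mut pending, &mut locked)); // id 3
}
```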

related: #23438

@ryoqun ryoqun changed the title pre-rfc: Introduce tiled spans tx execution model instead of batched entries pre-rfc: Introduce tiled span tx execution model instead of batched entries Mar 9, 2022
@nikhayes

nikhayes commented Mar 9, 2022

I had posted a somewhat tangential idea related to processing of transactions in discord yesterday, but maybe these ideas can complement each other. I'm definitely not an expert on any of this... but this is my understanding so far and what I think might be able to help:

Continued Description of Problem
-Assuming each of the 2 current non-vote banking threads process their own batch (128 packets) and are executing groups of parallelizable transactions which are then converted into an entry, what seems to be the case is that the entire group of transactions being executed would be rate limited by the slowest transaction included in an entry (with speed correlating to compute units). During a Raydium IDO, it's likely that the Raydium IDO transactions fill up batches on both threads to a high degree, so you may have a few initial entries that contain some parallelizable transactions that are executed, but the rest of the batch will need to iterate through all the single Raydium transactions until the batch is emptied, and the second thread (which also has many Raydium txns presumably) will have already executed a few entries but is probably stuck due to the Raydium transaction write-lock from the first thread.
-What I would expect to see is that during a Raydium IDO, you have: A) a small spike in transactions each time a new batch is loaded, which thereafter generates smaller, single-Raydium-transaction entries, B) slower entry generation on each thread due to high-compute transactions rate limiting everything else, and C) fewer entries being generated by the second thread, which is probably write-locked for long periods of time. Adding more banking threads will help a little here I think, but the issue is that they'll all still be filled up with a very high degree of Raydium transactions and will thus be stalled after initially processing a few parallelizable transactions.
*I think it would be cool to see more information about the entries generated over the course of a slot and also more details about replay, such as how many threads are being utilized to process certain entries, how long certain entries take to process, etc.

Proposed Solution to speed things up
Step 1: A compute budget is declared for transactions of all sizes rather than only when they exceed 200k compute. I'm not sure if it's possible, but if RPC nodes could generate an estimated compute budget for each transaction when a client wants to submit one, that would be ideal. If a transaction exceeds the compute declared by its compute budget, it fails in banking stage and is charged. If a transaction undershoots its compute budget, it is still charged the full compute budget. The onus to accurately estimate the compute required for a transaction would be on the RPC service, and market forces would favor those RPC services that keep compute budgets cheap while still landing successful transactions.
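A tiny sketch of the charging rule being proposed here (not existing runtime behavior): the declared budget is always the amount charged, and exceeding it is a failure.

```rust
// Illustrative only: every transaction declares a compute budget; exceeding it
// fails the transaction but still charges it, and undershooting it still
// charges the full declared budget.

struct ExecutionOutcome {
    succeeded: bool,
    charged_cu: u64,
}

fn settle(declared_budget_cu: u64, consumed_cu: u64) -> ExecutionOutcome {
    ExecutionOutcome {
        // Exceeding the declared budget fails the transaction in banking stage.
        succeeded: consumed_cu <= declared_budget_cu,
        // Either way, the full declared budget is charged.
        charged_cu: declared_budget_cu,
    }
}

fn main() {
    let over = settle(50_000, 60_000);
    let under = settle(50_000, 10_000);
    assert!(!over.succeeded && over.charged_cu == 50_000);
    assert!(under.succeeded && under.charged_cu == 50_000);
}
```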

Step 2: Earlier on in transaction processing, transactions would be sorted into groups based on the compute budget ranges they fall into. Non-vote banking threads are delegated the job of processing transactions that fall within a certain compute range (i.e. <5k, 5k to 20k, 20k to 100k, 100k to 200k, etc). A banking thread dealing with smaller transactions (i.e. 5k compute and below --- i.e. transfers/payments, pyth, etc) would only be rate limited in its execution by the slowest of these smaller transactions (rather than a large Raydium transaction or something else), and would thus be free to quickly iterate through its batches (assuming threads can generate entries asynchronously from one another). I think giving transfers/payments more freedom to bypass log jams on other threads is crucial for reliable payment infrastructure. Other banking threads would handle transactions of other sizes (i.e. 5k to 20k, 20k to 100k, 100k to 200k, etc). One thing that I think is nice here is that a lot of the non-parallelizable transactions will tend to fall within a similar compute range, and thus will litter other batches/threads with fewer write-locked packets/txns.
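A minimal sketch of the bucketing in Step 2, using the ranges from the comment (bucket_for is a hypothetical helper): each declared budget maps to a queue that would be owned by one banking thread.

```rust
// Hypothetical routing: a transaction's declared compute budget selects the
// banking thread (queue) that will process it, so small transfers never sit
// behind a 1M-CU transaction.

fn bucket_for(declared_cu: u64) -> usize {
    // Example ranges from the comment: <5k, 5k-20k, 20k-100k, 100k-200k, >200k.
    match declared_cu {
        0..=4_999 => 0,
        5_000..=19_999 => 1,
        20_000..=99_999 => 2,
        100_000..=199_999 => 3,
        _ => 4,
    }
}

fn main() {
    let mut queues: Vec<Vec<u64>> = vec![Vec::new(); 5];
    for declared_cu in [1_200, 8_000, 150_000, 1_000_000] {
        queues[bucket_for(declared_cu)].push(declared_cu);
    }
    // Each queue would be drained by its own banking thread.
    println!("{queues:?}");
}
```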

Other thoughts: If there is a candymachine NFT drop, or a Raydium IDO, I'm assuming a lot of people will be adding fee prioritization to their transactions and you'll have exactly the same TPS drop as we see now. If fee prioritization is added to a Raydium txn, for example, with this new model it would be isolated to the compute-budget-specific thread and wouldn't be cutting in front of the line of all transactions, but just those within its own compute range. I'm sure there are a lot of constraints and concurrency/timing considerations that I'm missing here... but hopefully the gist of this idea might help spawn some other ideas...

@buffalu
Contributor

buffalu commented Mar 11, 2022

i think the threadpool used in the replay stage should automatically figure this out? it's not super optimal when batch sizes are small however (which they are). another thing to note is that transactions in an entry aren't currently executed in parallel, so we need to get to that first too.

see this comment for more context on small batch sizes: #22096 (comment)
see execute_batches_internal:

fn execute_batches_internal(

would love to get to this but don't have enough time in the day. have a branch somewhere that added max parallelism, might pull that out

@sakridge
Contributor

another thing to note is that transactions in an entry aren't currently executed in parallel, so we need to get to that first too.

Yes they are. The code now tries to group transactions across batches into evenly-sized parallelizable chunks:

let rebatched_txs = if total_cost > target_batch_count.saturating_mul(minimal_tx_cost) {
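For illustration, a simplified version of that rebatching idea (not the real execute_batches_internal logic): flatten the transactions and cut new batches at roughly total_cost / target_batch_count, so no single batch is dominated by one outlier.

```rust
// Hypothetical simplification of cost-based rebatching: cut batches so each one
// carries roughly total_cost / target_batch_count CUs.

fn rebatch_by_cost(tx_costs: &[u64], target_batch_count: u64) -> Vec<Vec<usize>> {
    let total_cost: u64 = tx_costs.iter().sum();
    let target_cost = (total_cost / target_batch_count.max(1)).max(1);

    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut current_cost = 0u64;
    for (index, &cost) in tx_costs.iter().enumerate() {
        current.push(index);
        current_cost += cost;
        if current_cost >= target_cost {
            batches.push(std::mem::take(&mut current));
            current_cost = 0;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // One heavy tx and many light ones get split into comparable-cost batches.
    let costs = [200_000, 5_000, 5_000, 5_000, 190_000, 5_000];
    println!("{:?}", rebatch_by_cost(&costs, 2));
}
```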

@carllin
Contributor

carllin commented Mar 16, 2022

heh @ryoqun this is for optimizing replay to guarantee parallelism right?

Yeah, for replay you can do this where you essentially schedule all the conflicting transactions into the same "span", and then schedule the "spans" to individual threads.

This completely removes the requirement that all transactions in an entry must be non-conflicting. Instead, replay stage can figure out how to optimize for concurrency.
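One simple way to realize "conflicting transactions land in the same span" is to union transactions that share a write account and treat each connected component as a span; a rough sketch (illustrative only, not the actual replay scheduler):

```rust
use std::collections::HashMap;

struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }
    fn find(&mut self, x: usize) -> usize {
        let parent = self.parent[x];
        if parent == x {
            x
        } else {
            let root = self.find(parent);
            self.parent[x] = root; // path compression
            root
        }
    }
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        self.parent[ra] = rb;
    }
}

/// Group transaction indices so any two txs sharing a write account end up in
/// the same group ("span"); distinct groups can replay on different threads.
fn group_conflicting(tx_write_accounts: &[Vec<u64>]) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(tx_write_accounts.len());
    let mut first_writer: HashMap<u64, usize> = HashMap::new();
    for (tx, accounts) in tx_write_accounts.iter().enumerate() {
        for &account in accounts {
            match first_writer.get(&account).copied() {
                Some(other) => uf.union(tx, other),
                None => {
                    first_writer.insert(account, tx);
                }
            }
        }
    }
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();
    for tx in 0..tx_write_accounts.len() {
        let root = uf.find(tx);
        groups.entry(root).or_default().push(tx);
    }
    groups.into_values().collect()
}

fn main() {
    // txs 0 and 2 conflict on account 7; tx 1 is independent.
    let groups = group_conflicting(&[vec![7], vec![9], vec![7, 8]]);
    println!("{groups:?}");
}
```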

Also, I ask a very similar question to this in my interviews, stop giving the answer away :)

@ryoqun
Contributor Author

ryoqun commented Mar 16, 2022

heh @ryoqun this is for optimizing replay to guarantee parallelism right?

@carllin

Well, as i commented there, i'm thinking of changing the banking stage this way and encoding spans instead of entries at the broadcast stage. so, this is a more aggressive proposal.

I'm planning to visualize the situation, but i'm guessing both the banking stage and the replay stage alike are needlessly throttled when the load contains many heavy and contended transactions. I thought that's why blocks are small when the cluster is unstable, right?

also related: #22096 (comment) and #21883

@carllin
Contributor

carllin commented Mar 16, 2022

Changing banking stage to do this might not be feasible due to fee priorities, see discussion here about how lower fee transactions might starve higher fee transactions: #23211 (comment)

Essentially, lower-fee transactions in one span could starve higher-fee transactions in a different span if the spans are scheduled greedily.

@nikhayes

nikhayes commented Mar 17, 2022

Another thing that is tough with having only one larger span size is that a stream of small non-parallelizable transactions would be rate-limited by the duration of the spans (which looks like it would be set by the longest transaction's time?) --- assuming write-lock updates are made during the gap between spans.

I was trying to combine the span idea with my idea of threads that handled certain sized transactions (i.e. one thread just defaults to processing batches of small compute transactions, another thread defaults to somewhat larger ones, and so on). One way of reworking spans (maybe) and reducing the issue of rate limiting for the smaller non-parallelizable transactions could be to:

  1. have multiple span lengths (big to small) that are divisible by each other. I'm not sure if compute is the best way to time things, but imagine it is for a second... you could have the largest span be 1.2 million compute units (for the largest transaction size), and in order from largest to smallest... 1.2 million, 600k, 300k, 150k, 75k. The smallest span time, 75k, would occur at a regular time interval that should accommodate 75k compute worth of transactions (again, I'm not sure if there's something that can be used to approximate execution time well), and in that span you would default to selecting from a transaction pool of smaller compute transactions (i.e. 10k or less) and pack a group of smaller parallelizable transactions into the small span (see the arithmetic sketch after this list).

  2. threads are dedicated to a particular span size and there is redundancy here (so multiple threads for a particular span size) but the timing of these spans are staggered in some way (i.e. offset by 75k span interval). The 75k spans would be completely aligned though. Staggering the spans, but still aligning subsets could result in less variable entry sizes and consistent timing of entry generation (?). For there to be at least one large span that finishes at each span interval, it would be 1.2mil/75k = 16 threads. Ideally smaller span threads always grab smaller transactions because you wouldn't want a very small transaction to be grabbed by a large thread and in turn write lock things for the duration of the longer span if not necessary.

  3. Write-lock related decisions would occur each time the 75k-span timing ends, and only a subset of the spans would have finished executing their span and share the same gap. It's possible that a transaction might have taken longer and wasn't able to finish in time... in this case it would be retried in the next span perhaps. Otherwise, in this gap the spans would become loaded with new transactions by some form of transaction scheduling, which could also take into account fee-priority. Here, I like the idea of separating transactions into pools based on compute ranges that certain threads would default to. If compute is an ok proxy for timing decisions... then a thread (i.e. a 600k span) could go straight for the 200k to 500k compute pool first, then work its way down to the pools of smaller compute transactions to pack its span better. I imagine there would be a need to limit the amount of search time here so that it could go back to executing in its span, or the scheduling could be done elsewhere somehow while that thread is busy executing.
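Taking the numbers above at face value, a tiny arithmetic sketch of how the nested span sizes and the 16-thread stagger line up (hypothetical, just the divisibility math):

```rust
// Span sizes taken from the comment; everything else is hypothetical. All span
// boundaries land on a 75k tick, and covering every tick with one finishing
// 1.2M span takes 1_200_000 / 75_000 = 16 staggered threads.

fn main() {
    let span_sizes_cu: [u64; 5] = [1_200_000, 600_000, 300_000, 150_000, 75_000];
    let tick = *span_sizes_cu.last().unwrap();

    // All larger spans are whole multiples of the smallest tick.
    assert!(span_sizes_cu.iter().all(|s| s % tick == 0));

    for size in span_sizes_cu {
        let threads_to_cover_every_tick = size / tick;
        println!(
            "span {size:>9} CU: stagger {threads_to_cover_every_tick} threads at {tick} CU offsets"
        );
    }
}
```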

Anyway, I need to think about that more but just thought I'd add some ideas... not at all my field so I'm sure I'm unaware of a lot of things here.

@ryoqun
Contributor Author

ryoqun commented Mar 18, 2022

Changing banking stage to do this might not be feasible due to fee priorities, see discussion here about how lower fee transactions might starve higher fee transactions: #23211 (comment)

Essentially lower-fee transactions in different spans could starve higher fee transactions in a different span if scheduled greedily

@carllin thanks for the pointer. i'll grok the details later

i think we can just interrupt the lower-fee transaction's execution to avoid starvation as soon as the higher-fee transaction enters the scheduling stage. the current impl is like pessimistic locking, but we can lean towards optimistic locking a la preemptive scheduling. (assuming wasted out-of-band(-from-consensus) cpu cycles are cheap)
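A toy sketch of that preemption rule (hypothetical types, not the current scheduler): when a newly arriving higher-fee transaction conflicts with a lower-fee transaction that is executing speculatively, abort and requeue the lower-fee one.

```rust
// Illustrative only: treat execution as speculative and preempt lower-fee
// in-flight transactions that conflict with a newly scheduled higher-fee one.

#[derive(Clone, Debug)]
struct RunningTx {
    id: u64,
    priority_fee: u64,
    write_accounts: Vec<u64>,
}

enum Decision {
    /// Abort and requeue these lower-fee in-flight transactions first.
    Preempt(Vec<u64>),
    /// No lower-fee conflict: schedule (or queue) normally.
    Schedule,
}

fn on_arrival(new_fee: u64, new_writes: &[u64], in_flight: &[RunningTx]) -> Decision {
    let victims: Vec<u64> = in_flight
        .iter()
        .filter(|tx| {
            tx.priority_fee < new_fee
                && tx.write_accounts.iter().any(|a| new_writes.contains(a))
        })
        .map(|tx| tx.id)
        .collect();
    if victims.is_empty() { Decision::Schedule } else { Decision::Preempt(victims) }
}

fn main() {
    let in_flight = [RunningTx { id: 1, priority_fee: 10, write_accounts: vec![42] }];
    match on_arrival(1_000, &[42], &in_flight) {
        Decision::Preempt(ids) => println!("abort and requeue {ids:?}"),
        Decision::Schedule => println!("no preemption needed"),
    }
}
```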

(@nikhayes also, thanks for spending time on this with such enthusiasm. i'll reply later)

@ryoqun ryoqun changed the title pre-rfc: Introduce tiled span tx execution model instead of batched entries pre-rfc: Introduce tiled span tx execution model instead of batched entries at banking Apr 1, 2022
@CherryWorm
Contributor

Not sure if this is the right issue to post this. I've been thinking a bit about how best to approximate this. I haven't yet come up with a good linear-time approximation, which means this is probably too slow in practice; however, I think this is the best approach to measure how well a block could have been packed:

  1. Let G = (V, E) be the conflict graph, where V is the set of txs and two txs are connected by an edge in E if and only if they take a conflicting lock

  2. Find the largest independent set S in G, weighted by CU of the txs

  3. Remove all nodes from S in G

  4. Distribute all txs from S optimally across k cores (O(|S|log|S|), since no 2 txs in S have conflicts)

  5. Repeat 2-4 until G is empty or block is too long

Weighted independent set can either be computed optimally when the number of txs is not too large (it is NP-hard), or approximated (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.4678&rep=rep1&type=pdf) if that would take too long
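A compilable sketch of steps 1-5, with the NP-hard step 2 replaced by a simple greedy heuristic (the linked paper describes a proper approximation); conflicts are modeled as shared write accounts, weights as CUs, and step 4 uses longest-processing-time-first:

```rust
use std::collections::HashSet;

#[derive(Clone)]
struct Tx {
    cu: u64,
    write_accounts: Vec<u64>,
}

fn conflicts(a: &Tx, b: &Tx) -> bool {
    a.write_accounts.iter().any(|acc| b.write_accounts.contains(acc))
}

/// Greedy stand-in for step 2: repeatedly take the heaviest remaining tx that
/// conflicts with nothing already selected.
fn greedy_weighted_independent_set(txs: &[Tx], remaining: &HashSet<usize>) -> Vec<usize> {
    let mut order: Vec<usize> = remaining.iter().copied().collect();
    order.sort_by_key(|&i| std::cmp::Reverse(txs[i].cu));
    let mut chosen: Vec<usize> = Vec::new();
    for i in order {
        if chosen.iter().all(|&j| !conflicts(&txs[i], &txs[j])) {
            chosen.push(i);
        }
    }
    chosen
}

/// Step 4: spread a conflict-free set across k cores, heaviest first onto the
/// least-loaded core; returns the resulting makespan in CUs.
fn distribute(txs: &[Tx], set: &[usize], k: usize) -> u64 {
    let mut loads = vec![0u64; k];
    let mut sorted = set.to_vec();
    sorted.sort_by_key(|&i| std::cmp::Reverse(txs[i].cu));
    for i in sorted {
        let min = loads.iter().enumerate().min_by_key(|(_, l)| **l).unwrap().0;
        loads[min] += txs[i].cu;
    }
    loads.into_iter().max().unwrap_or(0)
}

/// Steps 2-5 in a loop: pull an independent set, schedule it, remove it, repeat.
fn pack(txs: &[Tx], k: usize) -> u64 {
    let mut remaining: HashSet<usize> = (0..txs.len()).collect();
    let mut total_makespan = 0;
    while !remaining.is_empty() {
        let set = greedy_weighted_independent_set(txs, &remaining);
        total_makespan += distribute(txs, &set, k);
        for i in &set {
            remaining.remove(i);
        }
    }
    total_makespan
}

fn main() {
    let txs = vec![
        Tx { cu: 200_000, write_accounts: vec![1] },
        Tx { cu: 200_000, write_accounts: vec![1] }, // conflicts with the first
        Tx { cu: 5_000, write_accounts: vec![2] },
        Tx { cu: 5_000, write_accounts: vec![3] },
    ];
    println!("estimated packed makespan: {} CU", pack(&txs, 4));
}
```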

@ryoqun
Contributor Author

ryoqun commented Jun 26, 2022

i got a tip from @carllin that @steviez is working on a similar replay stage improvement. :)

also #transaction-scheduler on discord has some updates.

@ryoqun ryoqun changed the title pre-rfc: Introduce tiled span tx execution model instead of batched entries at banking Introduce tiled span tx execution model instead of batched entries at banking/replaying Sep 28, 2022
@ryoqun ryoqun changed the title Introduce tiled span tx execution model instead of batched entries at banking/replaying Introduce tiled tx execution model instead of batched entries at banking/replaying Sep 28, 2022
@Huzaifa696

(quoting @nikhayes's earlier comment in full)

@nikhayes we have done some experiments on your idea of distributing txs based on CUs, but we didn't see any major improvements. For example, when we divide incoming txs into two CU ranges and then measure the inter-bucket conflict for a specific window, it is roughly 30-90% depending on what window size you choose.
do you have any proof of this approach, or other findings that would support it?

@buffalu
Contributor

buffalu commented May 5, 2023

jito-foundation/jito-solana#294 might be good inspo for this

@7layermagik

(quoting @Huzaifa696's comment above, including its quote of @nikhayes, in full)

Haha this was a while ago and just the idea of a non-technical person -- I'm not sure if/how transaction scheduling has changed since to be honest. Buffalu's comment and the transaction scheduler channel on discord are probably the best places to go for updates.

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label May 6, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 13, 2024