Introduce tiled tx execution model instead of batched entries at banking/replaying #23548
Comments
I had posted a somewhat tangential idea related to processing of transactions in Discord yesterday, but maybe these ideas can complement each other. I'm definitely not an expert on any of this... but this is my understanding so far and what I think might be able to help:

Continued Description of Problem

Proposed Solution to speed things up

Step 2: Earlier on in transaction processing, transactions would be sorted into groups based on the compute budget ranges they fall into. Non-vote banking threads are delegated the job of processing transactions that fall within a certain compute range (e.g. <5k, 5k to 20k, 20k to 100k, 100k to 200k, etc.). A banking thread dealing with smaller transactions (e.g. 5k compute and below: transfers/payments, pyth, etc.) would only be rate limited in its execution by the slowest of these smaller transactions (rather than a large Raydium transaction or something else), and would thus be free to quickly iterate through its batches (assuming threads can generate entries asynchronously from one another). I think giving transfers/payments more freedom to bypass log jams on other threads is crucial for reliable payment infrastructure. Other banking threads would handle transactions of other sizes (5k to 20k, 20k to 100k, 100k to 200k, etc.). One thing that I think is nice here is that a lot of the non-parallelizable transactions will tend to fall within a similar compute range, and thus will litter other batches/threads with fewer write-locked packets/txns.

Other thoughts: If there is a candymachine NFT drop or a Raydium IDO, I'm assuming a lot of people will be adding fee prioritization to their transactions and you'll have exactly the same TPS drop as we see now. If fee prioritization is added to a Raydium txn, for example, with this new model it would be isolated to the compute-budget-specific thread and wouldn't be cutting in front of the line of all transactions, just those within its own compute range. I'm sure there are a lot of constraints and concurrency/timing considerations that I'm missing here... but hopefully the gist of this idea might help spawn some other ideas...
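To make the bucketing idea above concrete, here is a minimal Rust sketch of the sorting step; the bucket boundaries, the `CuBucket` type, and the `bucketize` helper are all hypothetical names for illustration, not the actual banking-stage code:

```rust
use std::collections::HashMap;

// Minimal sketch (hypothetical, not the real banking-stage API): bucket transactions
// by their requested compute units so each worker thread only sees one CU range.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum CuBucket {
    Tiny,   // < 5k CU (transfers/payments, pyth, etc.)
    Small,  // 5k..20k CU
    Medium, // 20k..100k CU
    Large,  // >= 100k CU
}

fn bucket_for(requested_compute_units: u64) -> CuBucket {
    match requested_compute_units {
        0..=4_999 => CuBucket::Tiny,
        5_000..=19_999 => CuBucket::Small,
        20_000..=99_999 => CuBucket::Medium,
        _ => CuBucket::Large,
    }
}

/// Split incoming transactions into per-bucket queues; each queue would then be
/// drained by its own banking thread so small transactions are never stuck
/// behind a long-running one.
fn bucketize<T>(txs: Vec<(T, u64)>) -> HashMap<CuBucket, Vec<T>> {
    let mut queues: HashMap<CuBucket, Vec<T>> = HashMap::new();
    for (tx, cu) in txs {
        queues.entry(bucket_for(cu)).or_default().push(tx);
    }
    queues
}
```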
I think the threadpool used in the replay stage should automatically figure this out? It's not super optimal if your batch size is small, however (which they are). Another thing to note is that transactions in an entry aren't currently executed in parallel, so we need to get to that first too. See this comment for more context on small batch sizes: #22096 (comment)

solana/ledger/src/blockstore_processor.rs Line 260 in 3c68400
Would love to get to this but don't have enough time in the day. Have a branch somewhere that added max parallelism, might pull that out.
Yes they are. The code now tries to group transactions across batches into evenly-sized parallelizable chunks:

solana/ledger/src/blockstore_processor.rs Line 340 in 3c68400
heh @ryoqun this is for optimizing replay to guarantee parallelism, right? Yeah, for replay you can do this where you essentially schedule all the conflicting transactions into the same "span", and then schedule the "spans" to individual threads. This completely removes the requirement that all transactions in an entry must be non-conflicting. Instead, replay stage can figure out how to optimize for concurrency. Also, I ask a very similar question to this in my interviews, stop giving the answer away :)
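For illustration, a minimal sketch of that span-building step (hypothetical `Tx` type; real replay would operate on sanitized transactions, and a tx that bridges two existing spans would require merging them, which is omitted here):

```rust
use std::collections::HashMap;

// Sketch: transactions that write-conflict are chained into the same span so they
// stay ordered on one thread; non-conflicting transactions land in separate spans
// that can run in parallel.
struct Tx {
    id: usize,
    writes: Vec<u64>, // stand-in for write-locked account pubkeys
}

fn build_spans(txs: &[Tx]) -> Vec<Vec<usize>> {
    let mut account_to_span: HashMap<u64, usize> = HashMap::new();
    let mut spans: Vec<Vec<usize>> = Vec::new();
    for tx in txs {
        // Reuse a span that already writes one of this tx's accounts, else open a new one.
        // (A tx touching accounts in two different spans would need a span merge.)
        let existing = tx.writes.iter().find_map(|a| account_to_span.get(a).copied());
        let span_idx = match existing {
            Some(idx) => idx,
            None => {
                spans.push(Vec::new());
                spans.len() - 1
            }
        };
        spans[span_idx].push(tx.id);
        for a in &tx.writes {
            account_to_span.insert(*a, span_idx);
        }
    }
    spans // each span can then be handed to its own worker thread
}
```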
Well, as I commented there, I'm thinking of changing the banking stage this way and encoding spans instead of entries at the broadcast stage, so this is a more aggressive proposal. I'm planning to visualize the situation, but I'm guessing both the banking stage and the replay stage alike are needlessly throttled when there are many heavy and contended transaction loads. I thought that's why blocks are small when the cluster is unstable, right? Also related: #22096 (comment) and #21883
Changing the banking stage to do this might not be feasible due to fee priorities; see the discussion here about how lower-fee transactions might starve higher-fee transactions: #23211 (comment). Essentially, lower-fee transactions in one span could starve higher-fee transactions in another span if scheduled greedily.
Another thing that is tough with having only one larger span size is that a stream of small non-parallelizable transactions would be rate-limited by the duration of the spans (which looks like it would be the longest transaction's time?), assuming write-lock updates are made during the gap between spans. I was trying to combine the span idea with my idea of threads that handle certain-sized transactions (i.e. one thread just defaults to processing batches of small-compute transactions, another thread defaults to somewhat larger ones, and so on). One way of reworking spans (maybe) and reducing the issue of rate limiting for the smaller non-parallelizable transactions could be to:
Anyway, I need to think about that more but just thought I'd add some ideas... not at all my field so I'm sure I'm unaware of a lot of things here.
@carllin thanks for the pointer, I'll grok the details later. I think we can just interrupt the lower-fee transaction's execution to avoid starvation as soon as the higher-fee transaction enters the scheduling stage. The current impl is like pessimistic locking, but we can lean towards optimistic locking à la preemptive scheduling (assuming wasted out-of-band(-from-consensus) CPU cycles are cheap). (@nikhayes also, thanks for spending time on this with such enthusiasm. I'll reply later.)
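As a rough sketch of that interruption idea, assuming hypothetical scheduler types rather than the real banking-stage ones, a higher-fee arrival could flip an abort flag that the executing worker polls:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch: each in-flight low-fee execution carries an abort flag; when the scheduler
// sees a higher-fee transaction that needs the same write account, it flips the flag,
// the worker discards its speculative results, and the account goes to the
// higher-fee transaction first. All names are hypothetical.
struct InFlight {
    account: u64,
    fee: u64,
    abort: Arc<AtomicBool>,
}

fn maybe_preempt(in_flight: &[InFlight], incoming_account: u64, incoming_fee: u64) {
    for exec in in_flight {
        if exec.account == incoming_account && exec.fee < incoming_fee {
            // Signal the worker to throw away its optimistic state and requeue the tx.
            exec.abort.store(true, Ordering::Relaxed);
        }
    }
}

// Worker side: periodically poll the flag during execution.
fn worker_should_abort(abort: &AtomicBool) -> bool {
    abort.load(Ordering::Relaxed)
}
```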
Not sure if this is the right issue to post this. I've been thinking a bit about how best to approximate this. I haven't yet come up with a good linear-time approximation, which means this is probably too slow in practice; however, I think this is the best approach to measure how well a block could have been packed:
Weighted independent set can either be computed optimally when the number of txs is not too large (it is NP-hard), or approximated (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.4678&rep=rep1&type=pdf) if this would take too long.
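As an illustration of that measurement (a plain greedy heuristic, not the approximation algorithm from the linked paper), one could build a write-conflict graph and greedily pick a heavy independent set; all names here are made up:

```rust
// Sketch: nodes are txs weighted by fee, an edge means a shared write account.
// The heaviest non-conflicting set is a rough lower bound on the fee that could
// have been packed into one fully parallel "layer". Purely illustrative, O(n^2).
struct Node {
    weight: u64,      // e.g. priority fee
    writes: Vec<u64>, // write-locked accounts
}

fn greedy_weighted_independent_set(nodes: &[Node]) -> (Vec<usize>, u64) {
    let mut order: Vec<usize> = (0..nodes.len()).collect();
    // Classic heuristic: consider heaviest nodes first.
    order.sort_by_key(|&i| std::cmp::Reverse(nodes[i].weight));

    let mut chosen: Vec<usize> = Vec::new();
    let mut total = 0u64;
    for i in order {
        let conflicts = chosen
            .iter()
            .any(|&j| nodes[i].writes.iter().any(|a| nodes[j].writes.contains(a)));
        if !conflicts {
            total += nodes[i].weight;
            chosen.push(i);
        }
    }
    (chosen, total)
}
```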
@nikhayes we have done some experiments on your idea of distributing txs based on CUs, but we didn't see any major improvements. For example, when we divide incoming txs into two CU ranges and then measure the inter-bucket conflict for a specific window, it is around 30-90% depending on what window size you choose.
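For reference, a minimal sketch of how such an inter-bucket conflict measurement could be computed over a window (hypothetical `TxInfo` type, purely illustrative):

```rust
use std::collections::HashSet;

// Sketch: split a window of txs into two CU buckets and report the fraction of txs
// in the small bucket whose write accounts collide with some tx in the large bucket.
struct TxInfo {
    compute_units: u64,
    writes: Vec<u64>,
}

fn inter_bucket_conflict_ratio(window: &[TxInfo], cu_split: u64) -> f64 {
    let (small, large): (Vec<_>, Vec<_>) =
        window.iter().partition(|t| t.compute_units < cu_split);
    let large_writes: HashSet<u64> =
        large.iter().flat_map(|t| t.writes.iter().copied()).collect();
    if small.is_empty() {
        return 0.0;
    }
    let conflicting = small
        .iter()
        .filter(|t| t.writes.iter().any(|a| large_writes.contains(a)))
        .count();
    conflicting as f64 / small.len() as f64
}
```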
jito-foundation/jito-solana#294 might be good inspiration for this.
Haha, this was a while ago and just the idea of a non-technical person -- I'm not sure if/how transaction scheduling has changed since, to be honest. Buffalu's comment and the transaction scheduler channel on Discord are probably the best places to go for updates.
(still draft; also I may be missing the obvious point; this is just a rough idea)
Problem
The current execution model (batched transactions collectively lock all touched accounts as one coarse lock) is suboptimal for full CPU utilization in general, and heavy transactions end up affecting other unrelated transactions, given the vastly varying processing times of transactions.
We're trying to fix this with compute units. However, I think it's hard to correctly model the parallelization factor (i.e. a single-threaded IDO transaction's processing time should be weighted heavier than other similarly-heavy, non-single-threaded transactions).
I guess the current design and implementation assumed a rather uniform tx load and prioritized ease of batching (for the GPU).
Proposed Solution
At the very bottom of the story, the leader's job/economic incentive is to saturate N cores as much as possible to collect tx fees. So, let's design from that with a fresh mind.
Current (NOTE: I have limited knowledge of the banking stage; I'm just writing this based on how the replay stage works...):
Assume:

- `tx1{a,b}` are compute-heavy transactions writing to a common account.
- `tx2{a,b}` are light transactions writing to another common account. etc.

So, completion of `batch1` is blocked by `tx1a`, and we can't execute `tx2b` in `batch2` even if `tx2a` finished already.

Note that execution of the transactions within a single entry is indeed parallelized by a rayon pool, but I think the absurdly-long-running outlier transaction causes CPU under-utilization. This could be exaggerated even further with the requested (increased) compute units.
(Pre-)Proposed:
Obviously, we need a deterministic algorithm for determining the boundaries of this variable/dynamic span thing. However, I think we can do this by closing a span as soon as the oldest-started transaction finishes execution (i.e. the longest-running tx). And each transaction in `spanN` can be mapped to a `coreN` and its local index on `coreN` inside `spanN`.
Then, at a high level, the banking stage pools N threads, and each thread fetches pending transactions as long as it's idling and the tx's accounts aren't contended.
In that way, highly-contended transactions are isolated by nature and parallelizable transactions are processed as fast as possible.
Note that inter-core account dependencies will be represented as a vclock to encode the proper execution ordering.
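A minimal sketch of that per-thread fetch loop (hypothetical types, shown without threading or the vclock/determinism requirements above):

```rust
use std::collections::HashSet;

// Sketch: each idle worker pulls the next pending transaction whose write accounts
// are not currently locked by another in-flight transaction, so contended txs
// naturally queue behind each other while everything else keeps all cores busy.
struct PendingTx {
    id: usize,
    writes: Vec<u64>,
}

struct Scheduler {
    pending: Vec<PendingTx>,
    locked_accounts: HashSet<u64>,
}

impl Scheduler {
    /// Called by an idle worker thread (behind a mutex in a real implementation).
    fn take_next_runnable(&mut self) -> Option<PendingTx> {
        let pos = self
            .pending
            .iter()
            .position(|tx| tx.writes.iter().all(|a| !self.locked_accounts.contains(a)))?;
        let tx = self.pending.remove(pos);
        for a in &tx.writes {
            self.locked_accounts.insert(*a);
        }
        Some(tx)
    }

    /// Called when a worker finishes a transaction, releasing its account locks.
    fn complete(&mut self, tx: &PendingTx) {
        for a in &tx.writes {
            self.locked_accounts.remove(a);
        }
    }
}
```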
related: #23438