
Performance degradation (small blocks) after updating from 1.8 to 1.9 #24163

Closed

antonya86 opened this issue Apr 7, 2022 · 5 comments

Labels: stale

Comments

@antonya86

Problem

After upgrading our mainnet-beta validator from v1.8 to v1.9, we found that it started producing blocks with a much lower number of transactions than average. We investigated the situation, fixed the issue on our side, and identified a significant number of validators whose transactions per leader block dropped by 25% after upgrading to v1.9. We posted our investigation on the Solana forum here.

Although the situation is not a disaster, it would be great if someone from the Solana Foundation responsible for cluster/validator performance looked at the data we gathered. Thanks in advance.

@sakridge
Member

sakridge commented Apr 7, 2022

CC @taozhu-chicago

@tao-stones
Contributor

Thank you @antonya86 for sharing the finding. It's interesting that increasing SOLANA_BANKING_THREADS had the opposite effect with 1.9. Just to clarify: lowering the banking thread count to 4 restored 1.9 block-production performance. How many threads was it before?

@antonya86
Author

@taozhu-chicago when we restored the banking thread count to the default value of 4, the number of transactions returned to normal, around 1.5k on average.

How many threads was it before?
Initially it was extremely high: 24. We have a lot of CPU cores on our server, so we could afford such a value. On v1.8 it worked without any problems; I monitored our validator from time to time and saw it produce decent blocks with up to 4.4k transactions per block.

After updating to v1.9 our validator started to produce blocks like this (a run of 4 leader slots):

block_1 7 transactions
block_2 300 transactions
block_3 500 transactions
block_4 9 transactions

Next, I lowered SOLANA_BANKING_THREADS to 8 and saw that at least 1 block out of 4 was OK. The situation looked like this:

block_1 7 transactions
block_2 300 transactions
block_3 2200 transactions
block_4 450 transactions

Finally, I set SOLANA_BANKING_THREADS to 4 and the validator started to produce blocks with an average number of transactions (~1.5k).
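
For reference, a minimal sketch of how an env-var override like this can be parsed with a default of 4; the names below (num_banking_threads, DEFAULT_BANKING_THREADS) are illustrative and are not the actual banking_stage identifiers:

```rust
use std::env;

// Illustrative default, matching the value that restored normal block production.
const DEFAULT_BANKING_THREADS: u32 = 4;

// Hypothetical helper: parse SOLANA_BANKING_THREADS, falling back to the
// default on a missing or malformed value.
fn num_banking_threads() -> u32 {
    env::var("SOLANA_BANKING_THREADS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_BANKING_THREADS)
}

fn main() {
    // e.g. launched as: SOLANA_BANKING_THREADS=4 solana-validator ...
    println!("banking threads: {}", num_banking_threads());
}
```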

@nikhayes

nikhayes commented Apr 7, 2022

Interesting that your examples show the first block often has the lowest number of transactions, as does the last one. I think I've noticed a similar pattern when scanning blocks... any theories, Tao? It seems like leaders are being sent transactions well ahead of their leader slots these days (someone from Mango said they forward 8 slots ahead, I think), which makes it odd that their first slot would have so few txns? Might also interest @ckamm

Also, if y'all have the time @antonya86, it could be interesting to look at the mean txns/block across the four leader slots, with stats, to see if it's a legitimate phenomenon 😅
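
As a rough sketch of that computation (made-up data and illustrative function names only):

```rust
// Given per-block transaction counts grouped into a leader's 4-slot window,
// compute the mean for each slot position (1st..4th) to check whether the
// first and last slots are systematically lighter.
fn mean_per_slot_position(leader_windows: &[[u64; 4]]) -> [f64; 4] {
    let mut sums = [0u64; 4];
    for window in leader_windows {
        for (i, &txs) in window.iter().enumerate() {
            sums[i] += txs;
        }
    }
    let n = leader_windows.len().max(1) as f64;
    [
        sums[0] as f64 / n,
        sums[1] as f64 / n,
        sums[2] as f64 / n,
        sums[3] as f64 / n,
    ]
}

fn main() {
    // Example windows built from the counts quoted above (illustrative only).
    let windows = [[7, 300, 500, 9], [7, 300, 2200, 450]];
    println!("{:?}", mean_per_slot_position(&windows));
}
```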

@buffalu
Contributor

buffalu commented May 15, 2022

Found the issue last night after some discussion with @sakridge and @carllin.

If the cost model limit is exceeded, the banking stage leaves those packets in the buffer. This causes the while !buffered_packet_batches.is_empty() loop to be hit continually instead of reading in new packets from the channel. It might resolve itself when the cost model is reset on the next slot, but if you have more than 4 slots' worth of transactions that saturate the cost model for an account, it won't read in new packets.

The same situation could also happen with lots of txs locking the same account and returning an AccountInUse error, but that is more likely to resolve itself after many iterations of that loop.
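
A minimal sketch of that pattern, with made-up types and costs (this is not the real banking_stage code, just an illustration of how retained packets can starve the receive path):

```rust
use std::collections::VecDeque;

struct Packet {
    cost: u64,
}

// Attempt to schedule buffered packets; anything that would exceed the block
// cost limit is retained for a later retry rather than dropped.
fn consume_buffered(
    buffered: &mut VecDeque<Packet>,
    cost_limit: u64,
    cost_used: &mut u64,
) -> usize {
    let mut retained = VecDeque::new();
    let mut executed = 0;
    while let Some(p) = buffered.pop_front() {
        if *cost_used + p.cost > cost_limit {
            retained.push_back(p); // over budget: keep it buffered for a retry
        } else {
            *cost_used += p.cost;
            executed += 1;
        }
    }
    *buffered = retained;
    executed
}

fn main() {
    // Two packets saturate the limit; the rest stay buffered unless the cost
    // tracker is reset (e.g. at the next slot boundary).
    let mut buffered: VecDeque<Packet> = (0..6).map(|_| Packet { cost: 40 }).collect();
    let mut cost_used = 0;
    let mut iterations = 0;
    // As long as retained packets keep the buffer non-empty, this loop keeps
    // replaying the same packets; in the real code, new packets are only read
    // in once the loop exits.
    while !buffered.is_empty() && iterations < 3 {
        let executed = consume_buffered(&mut buffered, 100, &mut cost_used);
        iterations += 1;
        println!(
            "iteration {}: executed {}, {} still buffered",
            iterations,
            executed,
            buffered.len()
        );
    }
}
```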
