tpu-client: Speed up performance by awaiting all futures at once #32945
Conversation
```rust
// Send them all at once
let results = join_all(futures).await;
```
Can we transpose all those timeout wrappers by just timeout-wrapping the JoinAll future here, or do we want the granularity?
We want the granularity unfortunately.
We're mimicking the old behavior by waiting 10ms between each send, and a timeout around the JoinAll would encompass that waiting too. So, for example, if we set a global timeout of 5 seconds and want to send 800 transactions, the last 300 will never be sent, since they'll time out just waiting their turn.
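A minimal sketch of that trade-off, assuming the 10ms pacing and 5-second timeout from the example above (`send` is a placeholder, not the actual tpu-client API):

```rust
use std::time::Duration;

use futures_util::future::join_all;
use tokio::time::{sleep, timeout};

async fn send_all(wire_transactions: Vec<Vec<u8>>) {
    const PACING: Duration = Duration::from_millis(10);
    const SEND_TIMEOUT: Duration = Duration::from_secs(5);

    let futures = wire_transactions
        .into_iter()
        .enumerate()
        .map(|(i, tx)| async move {
            // Each future's total elapsed time includes its turn in line:
            // at 10ms pacing, the 800th future waits ~8s before it even sends.
            sleep(PACING * i as u32).await;
            // So the timeout wraps only the send itself, not the wait.
            timeout(SEND_TIMEOUT, send(tx)).await
        });

    // A single 5s timeout around this JoinAll would instead cancel every
    // transaction still waiting its turn when the deadline hits.
    let _results = join_all(futures).await;
}

async fn send(_tx: Vec<u8>) { /* placeholder for the actual send */ }
```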
Codecov Report
```
@@             Coverage Diff             @@
##           master   #32945       +/-   ##
===========================================
- Coverage    82.0%    81.9%      -0.1%
===========================================
  Files         784      784
  Lines      212534   212597        +63
===========================================
- Hits       174370   174328        -42
- Misses      38164    38269       +105
```
```rust
// Send them all at once
let results = join_all(futures).await;
```
Since we send all the transactions at the same time, won't it create spiky traffic?
We don't actually send them all at once! We wait the proper amount of time before each round of sending, just like before. The gain is that we await all of the futures at once, so if one leader takes a long time on a send, it doesn't hold up the rest of the transactions.
... but I can see why that's confusing given the comment; I'll update it.
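For contrast, a rough sketch of the old sequential behavior being mimicked (again with a placeholder `send`); the pacing on the wire is the same, only the awaiting changed:

```rust
use std::time::Duration;

use tokio::time::sleep;

// Old behavior: send one transaction, wait 10ms, send the next. A slow
// send to one leader holds up every transaction behind it in the loop.
async fn send_sequentially(wire_transactions: Vec<Vec<u8>>) {
    const PACING: Duration = Duration::from_millis(10);
    for tx in wire_transactions {
        send(tx).await;
        sleep(PACING).await;
    }
}

async fn send(_tx: Vec<u8>) { /* placeholder for the actual send */ }
```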
I've tested this new implementation and the results look slower than before. I thought maybe it was a bad time for testnet, so I tried the same with the previous PR, which showed a time of ~26s. I used the following procedure to deploy:

```bash
#!/usr/bin/bash
SPL_DIR="../solana-program-library"
SO_FILE="${SPL_DIR}/target/deploy/spl_governance.so"
PROGRAM_KEYPAIR="test-program-id.json"
CLUSTER_ARG="-ut"
SOLANA="target/release/solana"

# Check if the .so file is there
if [ ! -f "${SO_FILE}" ]; then
    echo "${SO_FILE} was not found, build it"
    cd "${SPL_DIR}/governance"
    cargo build-sbf
    cd -
    # our file is ./target/deploy/spl_governance.so
fi

${SOLANA} ${CLUSTER_ARG} balance
du -sh "${SO_FILE}"
${SOLANA} ${CLUSTER_ARG} --version

if [ ! -f "${PROGRAM_KEYPAIR}" ]; then
    echo "${PROGRAM_KEYPAIR} was not found, generate it"
    solana-keygen new --no-bip39-passphrase -o "${PROGRAM_KEYPAIR}"
fi

time ${SOLANA} ${CLUSTER_ARG} program deploy --program-id "${PROGRAM_KEYPAIR}" "${SO_FILE}"
```
Ah, that's because program deploy uses the new method now on master, which still has slow sends. I wanted to address that side separately, since backporting this will fix 1.16.
Right, if I modify …
* tpu-client: Await all futures at once
* Add timeout when sending to not waste time on down nodes
* Update comment to make it clearer that we're not spiking

(cherry picked from commit b42249f)

tpu-client: Speed up performance by awaiting all futures at once (backport of #32945) (#32972)

* tpu-client: Speed up performance by awaiting all futures at once (#32945)
  * tpu-client: Await all futures at once
  * Add timeout when sending to not waste time on down nodes
  * Update comment to make it clearer that we're not spiking
  (cherry picked from commit b42249f)
* Add additional trait bounds required in 1.16

Co-authored-by: Jon Cinque <[email protected]>
Problem
This is a less invasive version of #32844, which aims to solve the regression of slow deployments in 1.16.
Summary of Changes
5672758: Rather than detaching tasks and risking overloading the runtime, the solution here is to create all of the sending futures right from the start and await them all together. The performance here is roughly the same as #32844: 2.5-3.5 seconds for 100 self-transfers.
cb76257: On top of that, I investigated the error found at #32844 (comment) -- if we can't connect to a node for some reason, there are lengthy timeouts and retries. This makes sense for validators trying to maintain a persistent connection to the network, but users just want to fire off their messages. So as a simple fix, we time out the future, which solves the issue.
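A hedged sketch of that fix, with `connect_and_send` standing in for the real connect-and-send path:

```rust
use std::time::Duration;

use tokio::time::timeout;

// Bound the whole connect-and-send future so an unreachable node costs at
// most SEND_TIMEOUT, instead of riding out lengthy connection retries.
async fn send_with_timeout(tx: Vec<u8>) -> Result<(), &'static str> {
    const SEND_TIMEOUT: Duration = Duration::from_secs(5);
    match timeout(SEND_TIMEOUT, connect_and_send(tx)).await {
        Ok(result) => result,
        // Fire-and-forget for users: give up on down nodes quickly rather
        // than maintaining a persistent connection like a validator would.
        Err(_elapsed) => Err("send timed out"),
    }
}

async fn connect_and_send(_tx: Vec<u8>) -> Result<(), &'static str> {
    Ok(()) // stand-in for the real QUIC connect + send
}
```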
Once this lands, we can look into doing something similar with the work done in #32638.
Fixes #32595