
tpu-client: Speed up performance by awaiting all futures at once #32945

Merged: 3 commits into solana-labs:master from tpufaster, Aug 24, 2023

Conversation

joncinque (Contributor) commented Aug 22, 2023

Problem

This is a less invasive version of #32844, which aims to solve the regression of slow deployments in 1.16.

Summary of Changes

5672758: Rather than detaching tasks and risking overloading the runtime, the solution here is to create all of the sending futures right from the start and await them all together (see the sketch after this summary). The performance here is roughly the same as #32844: 2.5-3.5 seconds for 100 self-transfers.

cb76257: On top of that, I investigated the error found at #32844 (comment): if we can't connect to a node for some reason, there are lengthy timeouts and retries. This makes sense for validators trying to maintain a persistent connection to the network, but users just want to fire off their messages. So as a simple fix, we time out the future, which solves the issue.

Once this lands, we can look into doing something similar with the work done in #32638.

Fixes #32595 (Solana program deploys on devnet with Solana 1.16 are slow)
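Below is a minimal sketch of the pattern described above, assuming the tokio and futures-util crates. The names send_transaction, SEND_DELAY, and SEND_TIMEOUT are hypothetical stand-ins for illustration, not the PR's actual code:

use std::time::Duration;
use futures_util::future::join_all;
use tokio::time::{sleep, timeout};

const SEND_DELAY: Duration = Duration::from_millis(10);
const SEND_TIMEOUT: Duration = Duration::from_secs(5);

// Hypothetical stand-in for the actual wire send.
async fn send_transaction(_wire_transaction: Vec<u8>) {}

async fn send_all(transactions: Vec<Vec<u8>>) {
    // Create every sending future up front. Each future waits out its
    // staggered slot before sending, so traffic stays spread out.
    let futures = transactions.into_iter().enumerate().map(|(i, tx)| async move {
        sleep(SEND_DELAY * i as u32).await;
        // Time out the individual send so an unreachable node can't
        // stall us with lengthy connection retries.
        timeout(SEND_TIMEOUT, send_transaction(tx)).await
    });
    // Await them all together: a slow leader only delays its own
    // transaction, not the ones queued behind it.
    let results = join_all(futures).await;
    let timed_out = results.iter().filter(|r| r.is_err()).count();
    println!("{timed_out} sends timed out");
}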

}

// Send them all at once
let results = join_all(futures).await;
Contributor:

Can we transpose all those timeout wrappers by just timeout-wrapping the JoinAll future here, or do we want the granularity?

joncinque (Author) replied Aug 22, 2023:

We want the granularity, unfortunately.

We're mimicking the old behavior by waiting 10ms between each send, and that timeout encompasses the waiting too. So for example, if we set a global timeout of 5 seconds and we want to send 800 transactions, the last 300 transactions will never get sent, since they'll time out just waiting their turn (800 sends staggered at 10ms apiece take 8 seconds, so everything queued after the 5-second mark is cut off).
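To make the trade-off concrete, here is a hedged sketch of the rejected global-timeout option, reusing the hypothetical names from the sketch in the PR description above:

// One timeout around the whole JoinAll: the 10ms stagger counts
// against the budget, so 800 sends (8 seconds of staggering alone)
// can't fit in a 5-second budget and the tail is cancelled unsent.
match timeout(Duration::from_secs(5), join_all(futures)).await {
    Ok(_results) => { /* every send finished within the budget */ }
    Err(_elapsed) => { /* still-pending sends were dropped */ }
}

Wrapping each send individually, after its staggered sleep, keeps the budget scoped to the send itself, which is why the per-future timeouts stay.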

codecov bot commented Aug 23, 2023

Codecov Report

Merging #32945 (8550207) into master (e1972f0) will decrease coverage by 0.1%.
The diff coverage is 0.0%.

@@            Coverage Diff            @@
##           master   #32945     +/-   ##
=========================================
- Coverage    82.0%    81.9%   -0.1%     
=========================================
  Files         784      784             
  Lines      212534   212597     +63     
=========================================
- Hits       174370   174328     -42     
- Misses      38164    38269    +105     

}

// Send them all at once
let results = join_all(futures).await;
Contributor:

Since we send all the transactions at the same time, won't it create spiky traffic?

joncinque (Author):

We don't actually send them all at once! We wait the proper amount of time before each round of sending, just like before. The gain is that we await all of the futures at once, so if one leader takes a long time for the send, it doesn't hold up the rest of the transactions.

joncinque (Author):

... but I can see why that's confusing given the comment, I'll update it.
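A rough model of the timing difference described above, again using the hypothetical names from the earlier sketch (an illustration of the discussion, not the PR's code):

// Old behavior: each send is awaited before the next begins, so
// per-send latencies add up on top of the stagger.
for tx in transactions {
    sleep(SEND_DELAY).await;
    send_transaction(tx).await; // one slow leader stalls everything behind it
}
// total time ~ n * SEND_DELAY + sum of all send latencies

// New behavior: sends still begin SEND_DELAY apart, but are awaited
// together, so slow sends overlap instead of queueing.
// total time ~ n * SEND_DELAY + the single slowest send latency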

KirillLykov (Contributor) commented Aug 23, 2023

I've tested this new implementation and the results look slower than before: [1m11s, 2m48s, 2m59s, 1m14s] (#32595 (comment))

I thought it might just be a bad time for testnet, so I tried the same with the previous PR; it showed a time of ~26s.

I used the following procedure to deploy:

#!/usr/bin/bash
SPL_DIR="../solana-program-library"
SO_FILE="${SPL_DIR}/target/deploy/spl_governance.so"
PROGRAM_KEYPAIR="test-program-id.json"
CLUSTER_ARG="-ut"
SOLANA="target/release/solana"

# Check if so file is there
if [ ! -f ${SO_FILE} ]; then
    echo "${SO_FILE} was not found, build it"
    cd "${SPL_DIR}/governance"
    cargo build-sbf
    cd -
    # our file is ./target/deploy/spl_governance.so
fi

${SOLANA} ${CLUSTER_ARG} balance
du -sh ${SO_FILE}

${SOLANA} ${CLUSTER_ARG} --version
if [ ! -f ${PROGRAM_KEYPAIR} ]; then
    echo "${PROGRAM_KEYPAIR} was not found, generate it"
    solana-keygen new --no-bip39-passphrase -o ${PROGRAM_KEYPAIR}
fi
time ${SOLANA} ${CLUSTER_ARG} program deploy --program-id ${PROGRAM_KEYPAIR} ${SO_FILE}

joncinque (Author) commented:

> I've tested this new implementation and the results look slower than before: [1m11s, 2m48s, 2m59s, 1m14s] (#32595 (comment))

Ah, that's because program deploy uses the new method now on master, which still has slow sends. I wanted to address that side separately, since backporting this will fix 1.16.

KirillLykov (Contributor) commented:

> I've tested this new implementation and the results look slower than before: [1m11s, 2m48s, 2m59s, 1m14s] (#32595 (comment))

> Ah, that's because program deploy uses the new method now on master, which still has slow sends. I wanted to address that side separately, since backporting this will fix 1.16.

Right, if I modify cli/src/program.rs to use the old method to deploy the program, I get [0m58s, 0m23s, 0m36s, 0m21s].

joncinque added the v1.16 label (PRs that should be backported to v1.16) on Aug 24, 2023
joncinque merged commit b42249f into solana-labs:master on Aug 24, 2023
joncinque deleted the tpufaster branch on Aug 24, 2023 at 11:04
mergify bot pushed a commit that referenced this pull request on Aug 24, 2023:

tpu-client: Speed up performance by awaiting all futures at once (#32945)

* tpu-client: Await all futures at once

* Add timeout when sending to not waste time on down nodes

* Update comment to make it clearer that we're not spiking

(cherry picked from commit b42249f)
joncinque added a commit that referenced this pull request on Aug 25, 2023:

tpu-client: Speed up performance by awaiting all futures at once (backport of #32945) (#32972)

* tpu-client: Speed up performance by awaiting all futures at once (#32945)

* tpu-client: Await all futures at once

* Add timeout when sending to not waste time on down nodes

* Update comment to make it clearer that we're not spiking

(cherry picked from commit b42249f)

* Add additional trait bounds required in 1.16

---------

Co-authored-by: Jon Cinque <[email protected]>