
Leader qos #23257

Closed
wants to merge 11 commits into from

Conversation

tao-stones
Contributor

@tao-stones tao-stones commented Feb 21, 2022

Problem

The leader should prioritize transactions by fee-per-cu and/or sender stake weight.

Summary of Changes

  • Refactor the banking_stage buffer into its own module
  • Deserialize packets into VersionedTransaction upon receipt and store them in the buffer
  • Add a tx weighting stage
  • Add a function to get packet locators from the buffer, along with their sender stakes
  • Add a weighted shuffle function
  • Add a function to sanitize VersionedTransactions in chunks
  • Drop lower-priority packets based on an eviction strategy instead of FIFO, to make room for new packet batches
  • Refactor consume_buffered_packets to prioritize SanitizedTransactions by fee-per-cu (still WIP), then by stake-shuffled order, and send them to the bank for processing in chunks
  • Refactor forwarding with fee-per-cu prioritization and per-write-account bucketing

Fixes #23207
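
For illustration only, here is a minimal sketch of the intended ordering, with hypothetical names rather than this PR's actual code: each buffered packet is addressed by a locator (batch index, packet index), and locators are sorted by fee-per-cu with sender stake as a tie-breaker.

    // Hypothetical sketch, not this PR's implementation: order buffered packets by
    // fee-per-cu (descending), breaking ties by sender stake (descending).
    struct PacketLocator {
        batch_index: usize,
        packet_index: usize,
    }

    fn prioritize(
        // (locator, fee_per_cu, sender_stake) for each buffered packet
        mut candidates: Vec<(PacketLocator, u64, u64)>,
    ) -> Vec<PacketLocator> {
        candidates.sort_by(|a, b| (b.1, b.2).cmp(&(a.1, a.2)));
        candidates.into_iter().map(|(locator, _, _)| locator).collect()
    }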

@tao-stones
Contributor Author

This is still a work in progress: it needs to actually use the base fee to calculate fee-per-cu, and to add/merge metrics back. But I just wanted to put all the parts together to get a sense of performance after changing how banking_stage processes packets.

The banking-bench shows some regression: mean TPS goes from 55k -> 53k.

The 1-hour perf test shows a more pronounced regression: the mean goes from 24K -> 14K. This is from a single sampling; I need to add metrics back to zoom in.

Regarding performance, the overhead added here (e.g. sorting packets instead of linearly traversing packet_batches, plus the additional fee-per-cu and stake-weight functions) is going to have a negative impact. One suggestion is to reduce the buffer size from the current 500_000 packets to 4 slots' worth (~6,400 packets) and increase the transaction-processing threads from the current 2 to 4, which should make a good difference. These numbers of course need to be experimented with.

This is a rather big change; even though it is still WIP, I'd appreciate more eyes starting to look at it.
@sakridge @aeyakovenko @carllin @jstarry @behzadnouri @t-nelson

@tao-stones
Contributor Author

@buffalu ^^^

@buffalu
Contributor

buffalu commented Feb 22, 2022

skimming through right now, this is my understanding of the changes:

  • calc stake weight on a per-batch basis before sigverify
  • all packets still run through sigverify w/ 10k discard.
  • inserting packets into the deque does some type of replacement where low-staked packets are evicted in favor of high-staked ones when you reach the max deque capacity.
  • processing buffered packets prioritizes by "fee" then stake weight (TODO, probably need to deserialize into one of the tx structs?)
  • seems like the rest mostly continues as normal.

stream of consciousness/messy notes:

  • locator thing felt weird at first, but seems better on second re-read. nice to avoid shuffling around memory?
  • wonder if better to calc stake weight after sigverify? going to be faster when you hit that MAX_SIGVERIFY_BATCH limit in sigverify, unless you wanna use it to discard excess based on stake weight. that would probably incentivize high-staked nodes to spam though, but i guess then you can find that out + people can pull stake weight from whoever does that (if not private stake or entities on that validator are making lots of $$$). that would probably be hard to figure out though if there's a bunch of stake-weighted forwarding + some type of relaying going on.
  • create_evictioin_locators typo
  • packet replacement could also be susceptible to spam by high-staked validators? but same point above applies i think.
  • is this going to cause a race for validators to get the most stake weight? high staked = faster blocks + faster tx processing.
  • will probably incentivize higher staked validators to spam right before the validator runs as it seems like most time is spent in consume_buffered_packets anyway, so it'll get sorted once and spend the entire time processing vs. pulling in new txs.

going to take another look in the AM, but initial thoughts above. hard problem to solve; i still feel like we wouldn't need all this if entries were packed better (can always be added later) + executed in parallel properly, but only going on intuition and don't have any data for this lol

@t-nelson
Contributor

I raised this in the issue that Toly opened last week, but I'm wondering how we get a benefit from stake-weighted QoS on the TPU port, given that staked validators typically only submit transactions on TpuFwd and TpuVote rather than Tpu, where we're seeing the congestion.

@tao-stones
Contributor Author

locator thing felt weird at first, but seems better on second re-read. nice to avoid shuffling around memory?

Yea, using locators avoids making additional copies of packet_batch.

@tao-stones
Contributor Author

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.
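
For illustration, a minimal sketch of one way to do a stake-weighted shuffle (Efraimidis-Spirakis weighted sampling; this is not necessarily how the weighted shuffle function in this PR works, and all names here are hypothetical):

    // Hypothetical sketch: each packet draws key ln(u)/stake with u ~ Uniform(0,1);
    // sorting keys in descending order yields a stake-weighted permutation in which
    // higher-staked senders tend to come first but every packet still has a chance.
    use rand::Rng;

    fn stake_weighted_shuffle(indexed_stakes: &[(usize, u64)]) -> Vec<usize> {
        let mut rng = rand::thread_rng();
        let mut keyed: Vec<(f64, usize)> = indexed_stakes
            .iter()
            .map(|&(index, stake)| {
                let u: f64 = rng.gen_range(f64::EPSILON..1.0);
                // treat zero stake as weight 1 so unstaked senders are not excluded
                let weight = stake.max(1) as f64;
                (u.ln() / weight, index)
            })
            .collect();
        keyed.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        keyed.into_iter().map(|(_, index)| index).collect()
    }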

@t-nelson
Contributor

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.

Sorry, that's not my concern at all. My concern is that staked validators rarely submit transactions to the TPU port, so stake-weighting transactions there will have little to no effect. That is, nearly all TXs coming into the TPU port are from unstaked nodes.

@buffalu
Contributor

buffalu commented Feb 22, 2022

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.

Sorry, that's not my concern at all. My concern is that staked validators rarely submit transactions to the TPU port, so stake-weighting transactions there will have little to no effect. That is, nearly all TXs coming into the TPU port are from unstaked nodes.

you are right, but wasn't it suggested that people could forward to high-staked validators and they could do QoS and submit? not totally sure what problem that solves right now tho

@t-nelson
Contributor

wasn't it suggested that people could forward to high-staked validators and they could do QoS and submit?

Maybe. Kinda need that before qos though. There's no port to accept these transactions on today, nor logic to discern them later in the pipeline (nor even a reliable way to discern them?)

@tao-stones tao-stones force-pushed the leader-qos-2 branch 2 times, most recently from 3c2465f to 5af8e7c on February 23, 2022 at 08:32
@jstarry
Member

jstarry commented Feb 25, 2022

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first. I personally have little interest in reading thousands of lines of refactorings and changes that we're not sure about yet and that are very impactful to tx ingestion. That said, drafts like this are useful for checking perf implications and can aid the proposal discussion.

@tao-stones
Contributor Author

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first. I personally have little interest in reading thousands of lines of refactorings and changes that we're not sure about yet and that are very impactful to tx ingestion. That said, drafts like this are useful for checking perf implications and can aid the proposal discussion.

It is indeed a lot of code change; I wanted to put everything together, at least as a proof of concept, so the discussions would be more concrete. I'm also trying to keep commits rather self-contained, so they can be reviewed/discussed individually and reverted/cherry-picked if necessary.

While the refactoring is a large change, the motivation behind it is rather simple, as described in the PR comments. We should continue the discussion at #23207.

Just as a side note wrt performance, this branch's 1-hour perf test is in line with what we have now at 22k TPS (https://solanalabs.slack.com/archives/CP2L2S4KV/p1645779722062819), and banking-bench is at 93K TPS.

@t-nelson
Contributor

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first.

gm 🤝

#23211 (comment)

@tao-stones
Contributor Author

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first.

gm 🤝

#23211 (comment)

What should the proposal be about? The idea of prioritizing by fee/CU, the implementation of it, or ideas for leader QoS in general?
I feel fee-based prioritization is a simple idea, as @aeyakovenko described in the comment. Not sure if I have more to add to it.
I'll be more than happy to write up the implementation details in this PR, if that's what we want to discuss. I do want more eyes on this change. Like I said elsewhere, while the idea is simple, the refactoring is more involved.

@@ -0,0 +1,1205 @@
use {
Contributor
@carllin carllin Mar 5, 2022

+1 for refactoring this, but I think this can be a separate PR that moves the code into this file and gets checked in first.

Otherwise it's hard to tell which functions are new and which code was just migrated.

Comment on lines +65 to +69
let mut result = Vec::<&'a Packet>::new();
self.forwardable_packets
.iter()
.for_each(|v| result.extend(v));
result
Contributor
@carllin carllin Mar 5, 2022

this would probably be more efficient as an iter().collect::<Vec<&'a Packet>>() instead of multiple calls to extend()
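
For example, a minimal sketch of that suggestion (using a generic placeholder for the packet type; not the actual code in the PR):

    // Flatten Vec<Vec<&Packet>> into Vec<&Packet> in a single pass, without the
    // intermediate mutable Vec and repeated extend() calls.
    fn flatten_forwardable_packets<'a, Packet>(
        forwardable_packets: &[Vec<&'a Packet>],
    ) -> Vec<&'a Packet> {
        forwardable_packets.iter().flatten().copied().collect()
    }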

cost: &TransactionCost,
packet: &'a Packet,
) {
// try to sort the `transaction` into one of outbound (virtual) blocks
Contributor
@carllin carllin Mar 5, 2022

It's unclear what a "virtual" block is referring to here, as that term is never defined; a more concrete definition would be helpful.

Comment on lines +14 to +15
cost_trackers: Vec<CostTracker>,
forwardable_packets: Vec<Vec<&'a Packet>>,
Contributor
@carllin carllin Mar 5, 2022

This seems like it could be condensed into a single Vec<ForwardableBlock>, where a ForwardableBlock looks like:

struct ForwardableBlock {
  cost_tracker: CostTracker,
  forwardable_packets: Vec<&'a Packet>
}

prioritize_by_fee(unprocessed_packet_batches, &locators, working_bank)
}

fn sort_into_buckets(
Contributor

nit: sort_into_buckets -> insert_into_forwardable_blocks

unprocessed_packet_batches: &'a UnprocessedPacketBatches,
working_bank: Option<Arc<Bank>>,
) -> Vec<PacketLocator> {
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor

nit: locators -> unforwarded_packets_locators

pub struct DeserializedPacketBatch {
pub packet_batch: PacketBatch,
pub forwarded: bool,
// indexes of valid packets in batch, and their corrersponding deserialized_packet
Contributor

nit: corrersponding -> corresponding

Comment on lines +43 to +60
#[allow(dead_code)]
pub batch_index: usize,
#[allow(dead_code)]
pub packet_index: usize,
}

// hold deserialized messages, as well as computed message_hash and other things needed to create
// SanitizedTransaction
#[derive(Debug, Default)]
pub struct DeserializedPacket {
#[allow(dead_code)]
pub versioned_transaction: VersionedTransaction,

#[allow(dead_code)]
pub message_hash: Hash,

#[allow(dead_code)]
pub is_simple_vote: bool,
Contributor

dead_code tag doesn't seem necessary

@@ -0,0 +1,427 @@
use {
Contributor

It seems like all the forwarding changes can also be broken into a separate PR after the main prioritization change lands

.collect()
}

// to comute (addition_fee + base_fee / requested_cu) for packet identified by `locator`
Contributor

nit: comute -> compute

Comment on lines +192 to +195
.unprocessed_packets
.keys()
.for_each(|packet_index| {
let p = &packet_batch.packets[*packet_index];
Contributor
@carllin carllin Mar 5, 2022

This pattern seems to come up a lot; let's add a new function to UnprocessedPacketBatches:

    pub fn packet_iter(&self) -> impl Iterator<Item = (usize, &Packet)> {
        self.unprocessed_packets
            .keys()
            .map(|index| (*index, &self.packet_batch.packets[*index]))
    }

Comment on lines +185 to +186
let mut stakes = Vec::<u64>::with_capacity(TOTAL_BUFFERED_PACKETS);
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor

TOTAL_BUFFERED_PACKETS seems excessively large for most cases; we could just not preallocate, or figure out the size beforehand.

unprocessed_packet_batches: &'a UnprocessedPacketBatches,
working_bank: Option<Arc<Bank>>,
) -> Vec<PacketLocator> {
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor
@carllin carllin Mar 5, 2022

This TOTAL_BUFFERED_PACKETS seems unnecessarily large. We can instead collect over an iterator:

        let unforwarded_packets_locators: Vec<PacketLocator> =
            unprocessed_packet_batches.iter().enumerate().filter_map(
                |(batch_index, deserialized_packet_batch)| {
                    if !deserialized_packet_batch.forwarded {
                        Some((batch_index, deserialized_packet_batch))
                    } else {
                        None
                    }
                },
            ).flat_map(|(batch_index, deserialized_packet_batch)| {
                deserialized_packet_batch
                    .unprocessed_packets
                    .keys()
                    .map(move |packet_index| {
                        PacketLocator {
                            batch_index,
                            packet_index: *packet_index,
                        }
                    })
            }).collect();

)?;
let total_fee = bank.get_fee_for_message(&sanitized_message)?;

// TODO update bank to get_fee_and_cu_for_message() to avoid calling compute_budget again
Contributor
@carllin carllin Mar 7, 2022

Was going to comment this as well 😃, process_message() seems unnecessary here.

Comment on lines +250 to +252
// if unable to compute fee-per-cu for the packet, put it to the `0` bucket
0u64
};
Contributor

If we're unable to compute the fee-per-cu, doesn't that mean the packet was corrupted and we can drop it entirely?

Comment on lines +38 to +51
let prioritized_forwardable_packet_locators = Self::prioritize_unforwarded_packets_by_fee(
unprocessed_packet_batches,
Some(working_bank.clone()),
);

// if we have a bank to work with, then sort packets by write-account buckets
let (transactions, sanitized_locators) = sanitize_transactions(
unprocessed_packet_batches,
&prioritized_forwardable_packet_locators,
&working_bank.feature_set,
working_bank.vote_only_bank(),
working_bank.as_ref(),
);
let transactions_costs = qos_service.compute_transaction_costs(transactions.iter());
Contributor
@carllin carllin Mar 7, 2022

Let's add a comment describing the steps here:

  1. Compute the fee per cu and sort transactions by this priority for the unprocessed_packet_batches queue in prioritize_unforwarded_packets_by_fee()
  2. Lock cost model, clone every transaction from the unprocessed queue, deserialize all of these transactions, compute cost model cost in compute_transaction_costs()
  3. Using the cost model costs from 2, split the unprocessed_packet_batches into separate chunks, where each chunk represents a block filled with transactions whose costs add up to the block cost limit
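
For illustration, a hypothetical sketch of the chunking in step 3 (greedy packing against a block cost limit; names are illustrative, not the PR's code):

    // Split cost-model costs (already in priority order) into chunks whose total
    // cost stays within the block cost limit; returns indices into `costs`.
    fn split_into_block_chunks(costs: &[u64], block_cost_limit: u64) -> Vec<Vec<usize>> {
        let mut chunks: Vec<Vec<usize>> = vec![];
        let mut current: Vec<usize> = vec![];
        let mut current_cost: u64 = 0;
        for (index, &cost) in costs.iter().enumerate() {
            if !current.is_empty() && current_cost.saturating_add(cost) > block_cost_limit {
                chunks.push(std::mem::take(&mut current));
                current_cost = 0;
            }
            current.push(index);
            current_cost = current_cost.saturating_add(cost);
        }
        if !current.is_empty() {
            chunks.push(current);
        }
        chunks
    }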

Contributor
@carllin carllin Mar 7, 2022

Some perf implications:

  1. If the unprocessed_packet_batches were already a heap sorted by fee-per-cu priority, updated as new transactions are added, we could avoid the sort here (see the sketch after this list).
  2. The unprocessed buffer could be huge, and calculating all this metadata might be expensive. Because there's already a limit on the number of packet bytes we actually forward here:

    data_budget.update(INTERVAL_MS, |bytes| {
        std::cmp::min(
            bytes.saturating_add(MAX_BYTES_PER_INTERVAL),
            MAX_BYTES_BUDGET,
        )
    });
    let packet_vec: Vec<_> = packets
        .iter()
        .filter_map(|p| {
            if !p.meta.forwarded() && data_budget.take(p.meta.size) {
                Some((&p.data[..p.meta.size], tpu_forwards))
            } else {
                None
            }
        })
        .collect();

    it might be good at the beginning to truncate the buffer to a sane number of packets that will actually be forwarded.
  3. If we do 2, I don't think we need to compute the costs of all these transactions; just forward up to X bytes from the buffer to avoid all the steps/cost needed to clone/deserialize every transaction and run the cost model.
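
For illustration of point 1 above, a hypothetical max-heap keyed by fee-per-cu (types and names are illustrative, not the PR's code); insertion keeps the buffer ordered, so no full sort is needed when consuming or forwarding:

    use std::cmp::Ordering;
    use std::collections::BinaryHeap;

    // A buffered packet reference keyed by its fee-per-cu priority.
    struct PrioritizedPacket {
        fee_per_cu: u64,
        batch_index: usize,
        packet_index: usize,
    }

    impl PartialEq for PrioritizedPacket {
        fn eq(&self, other: &Self) -> bool {
            self.fee_per_cu == other.fee_per_cu
        }
    }
    impl Eq for PrioritizedPacket {}
    impl PartialOrd for PrioritizedPacket {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }
    impl Ord for PrioritizedPacket {
        fn cmp(&self, other: &Self) -> Ordering {
            self.fee_per_cu.cmp(&other.fee_per_cu)
        }
    }

    // BinaryHeap is a max-heap, so pop() returns the highest fee-per-cu packet first.
    fn pop_highest(buffer: &mut BinaryHeap<PrioritizedPacket>) -> Option<PrioritizedPacket> {
        buffer.pop()
    }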

@carllin
Contributor

carllin commented Mar 7, 2022

@taozhu-chicago still in the middle of reviewing this, but before you start resolving comments let's start breaking this up into pieces that can be checked in:

  1. Move unprocessed_packet_batches from BankingStage into its own module
  2. TransactionWeightingStage changes
  3. The core BankingStage changes for fee prioritization, without the forwarding changes
  4. The forwarding changes

fn apply_weights(batches: &mut [PacketBatch], ip_to_stake: &HashMap<IpAddr, u64>) {
batches.into_par_iter().for_each(|batch| {
batch.packets.par_iter_mut().for_each(|packet| {
packet.meta.weight = *ip_to_stake.get(&packet.meta.addr().ip()).unwrap_or(&0);
Contributor
@carllin carllin Mar 7, 2022

is this weight always just the stake of the sender? If so, does it make sense to:

  1. rename this weight field to sender_stake, and all function names with the word weight to sender_stake
  2. rename this entire stage to FindPacketSenderStakeStage, since we're not really calculating a "weight" (like a fee) and we are operating on packets, not transactions

@tao-stones
Contributor Author

breaking this up into pieces that can be checked in:

Sounds good. All the commits in this PR provide an idea of the "end product" and allow testing it end-to-end, but that does make reviews much harder. I'll break it up as suggested for review.

@tao-stones
Contributor Author

  1. Move unprocessed_packet_batches from BankingStage into its own module

#23508

@tao-stones
Contributor Author

2. TransactionWeightingStage changes

the original PR: #21953
Do we want to reopen it and review there? The reason I closed it and moved its review here is to put it into context.

@tao-stones
Contributor Author

breaking this up into pieces that can be checked in:

Sounds good. All the commits in this PR provide an idea of the "end product" and allow testing it end-to-end, but that does make reviews much harder. I'll break it up as suggested for review.

Having second thoughts while trying to cherry-pick commits from this PR into separate PRs. Other than the first two commits, all the other commits are built on top of each other, which means the new smaller PRs would depend on (and therefore include) the previous PRs; that doesn't look like it makes review any easier than reviewing commit-by-commit. wdyt @carllin?

@carllin
Contributor

carllin commented Mar 7, 2022

@taozhu-chicago I think the code is complex enough and the reviews will be long enough that it would be good to break up those dependencies between commits into individual PRs, even though it might be a PITA.

@tao-stones
Contributor Author

@taozhu-chicago I think the code is complex enough and the reviews will be long enough that it would be good to break up those dependencies between commits into individual PRs, even though it might be a PITA.

Alright, PITA it is. Let's do it

@stale

stale bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 16, 2022
@stale

stale bot commented Apr 25, 2022

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Apr 25, 2022
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Successfully merging this pull request may close these issues.

Implement leader prioritizing packets by fee per CU