
Leader qos #23257

Closed
wants to merge 11 commits into from

Conversation

tao-stones
Contributor

@tao-stones tao-stones commented Feb 21, 2022

Problem

The leader should prioritize transactions by fee-per-cu and/or sender stake weight.

Summary of Changes

  • Refactor the banking_stage buffer into its own module
  • Deserialize packets into VersionedTransaction upon receipt and store them in the buffer
  • Add a tx weighting stage
  • Add a function to get packet locators from the buffer, along with their sender stakes
  • Add a weighted shuffle function
  • Add a function to sanitize VersionedTransactions in chunks
  • Drop lower-priority packets based on an eviction strategy instead of FIFO, to make room for new packet batches
  • Refactor consume_buffered_packets to prioritize SanitizedTransactions by fee-per-cu (still WIP), then by stake-shuffled order, and send them to the bank for processing in chunks
  • Refactor forwarding with fee-per-cu prioritization and per-write-account bucketing

Fixes #23207
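
For illustration only, here is a minimal sketch of the intended ordering, with hypothetical names rather than this PR's actual code: each buffered packet is addressed by a locator (batch index, packet index), and locators are sorted by fee-per-cu with sender stake as a tie-breaker.

    // Hypothetical sketch, not this PR's implementation: order buffered packets by
    // fee-per-cu (descending), breaking ties by sender stake (descending).
    struct PacketLocator {
        batch_index: usize,
        packet_index: usize,
    }

    fn prioritize(
        // (locator, fee_per_cu, sender_stake) for each buffered packet
        mut candidates: Vec<(PacketLocator, u64, u64)>,
    ) -> Vec<PacketLocator> {
        candidates.sort_by(|a, b| (b.1, b.2).cmp(&(a.1, a.2)));
        candidates.into_iter().map(|(locator, _, _)| locator).collect()
    }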

@tao-stones
Contributor Author

This is still a work in progress: it needs to actually use the base fee to calculate fee-per-cu, and to add/merge metrics back. But I just wanted to put all the parts together to get a sense of performance after changing how banking_stage processes packets.

The banking-bench shows some regression: mean TPS goes from 55k -> 53k.

The 1-hour perf test shows a more pronounced regression: the mean goes from 24K -> 14K. This is from a single sampling; I need to add metrics back to zoom in.

Regarding performance, the overhead added here (e.g. sorting packets instead of linearly traversing packet_batches, plus the additional fee-per-cu and stake-weight functions) is going to have a negative impact. One suggestion is to reduce the buffer size from the current 500_000 packets to 4 slots' worth (~6,400 packets) and increase the transaction-processing threads from the current 2 to 4, which should make a good difference. These numbers of course need to be experimented with.

This is a rather big change; even though it is still WIP, I'd appreciate more eyes starting to look at it.
@sakridge @aeyakovenko @carllin @jstarry @behzadnouri @t-nelson

@tao-stones
Contributor Author

@buffalu ^^^

@buffalu
Contributor

buffalu commented Feb 22, 2022

skimming through right now, this is my understanding of the changes:

  • calc stake weight on a per-batch basis before sigverify
  • all packets still run through sigverify w/ 10k discard.
  • inserting packets into the deque does some type of replacement where low-staked packets are evicted in favor of high-staked ones when you reach the max deque capacity.
  • processing buffered packets prioritizes by "fee" then stake weight (TODO, probably need to deserialize into one of the tx structs?)
  • seems like the rest mostly continues as normal.

stream of consciousness/messy notes:

  • locator thing felt weird at first, but seems better on second re-read. nice to avoid shuffling around memory?
  • wonder if better to calc stake weight after sigverify? going to be faster when you hit that MAX_SIGVERIFY_BATCH limit in sigverify, unless you wanna use it to discard excess based on stake weight. that would probably incentivize high-staked nodes to spam though, but i guess then you can find that out + people can pull stake weight from whoever does that (if not private stake or entities on that validator are making lots of $$$). that would probably be hard to figure out though if there's a bunch of stake-weighted forwarding + some type of relaying going on.
  • create_evictioin_locators typo
  • packet replacement could also be susceptible to spam by high-staked validators? but same point above applies i think.
  • is this going to cause a race for validators to get the most stake weight? high staked = faster blocks + faster tx processing.
  • will probably incentivize higher staked validators to spam right before the validator runs as it seems like most time is spent in consume_buffered_packets anyway, so it'll get sorted once and spend the entire time processing vs. pulling in new txs.

going to take another look in the AM, but initial thoughts above. hard problem to solve; i still feel like we wouldn't need all this if entries were packed better (can always be added later) + executed in parallel properly, but only going on intuition and don't have any data for this lol

@t-nelson
Contributor

I raised this in the issue that Toly opened last week, but I'm wondering how we get a benefit from stake-weighted QoS on the TPU port, given that staked validators typically only submit transactions on TpuFwd and TpuVote rather than Tpu, where we're seeing the congestion.

@tao-stones
Contributor Author

locator thing felt weird at first, but seems better on second re-read. nice to avoid shuffling around memory?

Yea, using locators avoids making additional copies of packet_batch.

@tao-stones
Contributor Author

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.
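
For illustration, a minimal sketch of one way to do a stake-weighted shuffle (Efraimidis-Spirakis weighted sampling; this is not necessarily how the weighted shuffle function in this PR works, and all names here are hypothetical):

    // Hypothetical sketch: each packet draws key ln(u)/stake with u ~ Uniform(0,1);
    // sorting keys in descending order yields a stake-weighted permutation in which
    // higher-staked senders tend to come first but every packet still has a chance.
    use rand::Rng;

    fn stake_weighted_shuffle(indexed_stakes: &[(usize, u64)]) -> Vec<usize> {
        let mut rng = rand::thread_rng();
        let mut keyed: Vec<(f64, usize)> = indexed_stakes
            .iter()
            .map(|&(index, stake)| {
                let u: f64 = rng.gen_range(f64::EPSILON..1.0);
                // treat zero stake as weight 1 so unstaked senders are not excluded
                let weight = stake.max(1) as f64;
                (u.ln() / weight, index)
            })
            .collect();
        keyed.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        keyed.into_iter().map(|(_, index)| index).collect()
    }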

@t-nelson
Contributor

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.

Sorry, that's not my concern at all. My concern is that staked validators rarely submit transactions to the TPU port, so stake-weighting transactions there will have little to no effect. That is, nearly all TXs coming into the TPU port are from unstaked nodes.

@buffalu
Contributor

buffalu commented Feb 22, 2022

The main motivation for sender-stake-weighted shuffling is to allow (some) valid transactions to get through under bot-spamming conditions. I can see your concern about incentivizing high-staked validators to spam. This change can be discussed more.

Sorry, that's not my concern at all. My concern is that staked validators rarely submit transactions to the TPU port, so stake-weighting transactions there will have little to no effect. That is, nearly all TXs coming into the TPU port are from unstaked nodes.

you are right, but wasn't it suggested that people could forward to high-staked validators and they could do QoS and submit? not totally sure what problem that solves right now tho

@t-nelson
Contributor

wasn't it suggested that people could forward to high-staked validators and they could do QoS and submit?

Maybe. Kinda need that before qos though. There's no port to accept these transactions on today, nor logic to discern them later in the pipeline (nor even a reliable way to discern them?)

@tao-stones tao-stones force-pushed the leader-qos-2 branch 2 times, most recently from 3c2465f to 5af8e7c on February 23, 2022 at 08:32
@jstarry
Member

jstarry commented Feb 25, 2022

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first. I personally have little interest in reading thousands of lines of refactorings and changes that we're not sure about yet and that are very impactful to tx ingestion. That said, drafts like this are useful for checking perf implications and can aid the proposal discussion.

@tao-stones
Contributor Author

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first. I personally have little interest in reading thousands of lines of refactorings and changes that we're not sure about yet and that are very impactful to tx ingestion. That said, drafts like this are useful for checking perf implications and can aid the proposal discussion.

It is indeed a lot of code change; I wanted to put everything together, at least as a proof of concept, so the discussions would be more concrete. I'm also trying to keep commits rather self-contained, so they can be reviewed/discussed individually and reverted/cherry-picked if necessary.

While the refactoring is a large change, the motivation behind it is rather simple, as described in the PR comments. We should continue the discussion at #23207.

Just as a side note wrt performance, this branch's 1-hour perf test is in line with what we have now at 22k TPS (https://solanalabs.slack.com/archives/CP2L2S4KV/p1645779722062819), and banking-bench is at 93K TPS.

@t-nelson
Contributor

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first.

gm 🤝

#23211 (comment)

@tao-stones
Contributor Author

I feel like there is still a lot of discussion required for the scope of this change. And this is a very critical change. I think it'd be more efficient if we got consensus through a proposal first.

gm 🤝

#23211 (comment)

What should the proposal be about? The idea of prioritizing by fee/CU, the implementation of it, or ideas for leader QoS in general?
I feel fee-based prioritization is a simple idea, as @aeyakovenko described in the comment. Not sure if I have more to add to it.
I'll be more than happy to write up the implementation details in this PR, if that's what we want to discuss. I do want more eyes on this change. Like I said elsewhere, while the idea is simple, the refactoring is more involved.

@@ -0,0 +1,1205 @@
use {
Contributor
@carllin carllin Mar 5, 2022

+1 for refactoring this, but I think this can be a separate PR that moves the code into this file and gets checked in first.

Otherwise it's hard to tell which functions are new and which code was just migrated.

Comment on lines +65 to +69
let mut result = Vec::<&'a Packet>::new();
self.forwardable_packets
.iter()
.for_each(|v| result.extend(v));
result
Contributor
@carllin carllin Mar 5, 2022

this would probably be more efficient as an iter().collect::<Vec<&'a Packet>>() instead of multiple calls to extend()
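
For example, a minimal sketch of that suggestion (using a generic placeholder for the packet type; not the actual code in the PR):

    // Flatten Vec<Vec<&Packet>> into Vec<&Packet> in a single pass, without the
    // intermediate mutable Vec and repeated extend() calls.
    fn flatten_forwardable_packets<'a, Packet>(
        forwardable_packets: &[Vec<&'a Packet>],
    ) -> Vec<&'a Packet> {
        forwardable_packets.iter().flatten().copied().collect()
    }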

cost: &TransactionCost,
packet: &'a Packet,
) {
// try to sort the `transaction` into one of outbound (virtual) blocks
Contributor
@carllin carllin Mar 5, 2022

It's unclear what a "virtual" block is referring to here, as that term is never defined; a more concrete definition would be helpful.

Comment on lines +14 to +15
cost_trackers: Vec<CostTracker>,
forwardable_packets: Vec<Vec<&'a Packet>>,
Contributor
@carllin carllin Mar 5, 2022

This seems like it could be condensed into a single Vec<ForwardableBlock>, where a ForwardableBlock looks like:

struct ForwardableBlock {
  cost_tracker: CostTracker,
  forwardable_packets: Vec<&'a Packet>
}

prioritize_by_fee(unprocessed_packet_batches, &locators, working_bank)
}

fn sort_into_buckets(
Contributor

nit: sort_into_buckets -> insert_into_forwardable_blocks

unprocessed_packet_batches: &'a UnprocessedPacketBatches,
working_bank: Option<Arc<Bank>>,
) -> Vec<PacketLocator> {
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor

nit: locators -> unforwarded_packets_locators

pub struct DeserializedPacketBatch {
pub packet_batch: PacketBatch,
pub forwarded: bool,
// indexes of valid packets in batch, and their corrersponding deserialized_packet
Contributor

nit: corrersponding -> corresponding

Comment on lines +43 to +60
#[allow(dead_code)]
pub batch_index: usize,
#[allow(dead_code)]
pub packet_index: usize,
}

// hold deserialized messages, as well as computed message_hash and other things needed to create
// SanitizedTransaction
#[derive(Debug, Default)]
pub struct DeserializedPacket {
#[allow(dead_code)]
pub versioned_transaction: VersionedTransaction,

#[allow(dead_code)]
pub message_hash: Hash,

#[allow(dead_code)]
pub is_simple_vote: bool,
Contributor

dead_code tag doesn't seem necessary

@@ -0,0 +1,427 @@
use {
Contributor

It seems like all the forwarding changes can also be broken into a separate PR after the main prioritization change lands

.collect()
}

// to comute (addition_fee + base_fee / requested_cu) for packet identified by `locator`
Contributor

nit: comute -> compute

Comment on lines +192 to +195
.unprocessed_packets
.keys()
.for_each(|packet_index| {
let p = &packet_batch.packets[*packet_index];
Contributor
@carllin carllin Mar 5, 2022

This pattern seems to come up a lot; let's add a new function to UnprocessedPacketBatches:

    pub fn packet_iter(&self) -> impl Iterator<Item = (usize, &Packet)> {
        self.unprocessed_packets
            .keys()
            .map(|index| (*index, &self.packet_batch.packets[*index]))
    }

Comment on lines +185 to +186
let mut stakes = Vec::<u64>::with_capacity(TOTAL_BUFFERED_PACKETS);
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor

TOTAL_BUFFERED_PACKETS seems excessively large for most cases; we could just not preallocate, or figure out the size beforehand.

unprocessed_packet_batches: &'a UnprocessedPacketBatches,
working_bank: Option<Arc<Bank>>,
) -> Vec<PacketLocator> {
let mut locators = Vec::<PacketLocator>::with_capacity(TOTAL_BUFFERED_PACKETS);
Contributor
@carllin carllin Mar 5, 2022

This TOTAL_BUFFERED_PACKETS seems unnecessarily large. We can instead collect over an iterator:

        let unforwarded_packets_locators: Vec<PacketLocator> =
            unprocessed_packet_batches.iter().enumerate().filter_map(
                |(batch_index, deserialized_packet_batch)| {
                    if !deserialized_packet_batch.forwarded {
                        Some((batch_index, deserialized_packet_batch))
                    } else {
                        None
                    }
                },
            ).flat_map(|(batch_index, deserialized_packet_batch)| {
                deserialized_packet_batch
                    .unprocessed_packets
                    .keys()
                    .map(move |packet_index| {
                        PacketLocator {
                            batch_index,
                            packet_index: *packet_index,
                        }
                    })
            }).collect();

)?;
let total_fee = bank.get_fee_for_message(&sanitized_message)?;

// TODO update bank to get_fee_and_cu_for_message() to avoid calling compute_budget again
Contributor
@carllin carllin Mar 7, 2022

Was going to comment this as well 😃, process_message() seems unnecessary here.

Comment on lines +250 to +252
// if unable to compute fee-per-cu for the packet, put it to the `0` bucket
0u64
};
Contributor

If we're unable to compute the fee-per-cu, doesn't that mean the packet was corrupted and we can drop it entirely?

Comment on lines +38 to +51
let prioritized_forwardable_packet_locators = Self::prioritize_unforwarded_packets_by_fee(
unprocessed_packet_batches,
Some(working_bank.clone()),
);

// if we have a bank to work with, then sort packets by write-account buckets
let (transactions, sanitized_locators) = sanitize_transactions(
unprocessed_packet_batches,
&prioritized_forwardable_packet_locators,
&working_bank.feature_set,
working_bank.vote_only_bank(),
working_bank.as_ref(),
);
let transactions_costs = qos_service.compute_transaction_costs(transactions.iter());
Contributor
@carllin carllin Mar 7, 2022

Let's add a comment describing the steps here:

  1. Compute the fee per cu and sort transactions by this priority for the unprocessed_packet_batches queue in prioritize_unforwarded_packets_by_fee()
  2. Lock cost model, clone every transaction from the unprocessed queue, deserialize all of these transactions, compute cost model cost in compute_transaction_costs()
  3. Using the cost model costs from 2, split the unprocessed_packet_batches into separate chunks, where each chunk represents a block filled with transactions whose costs add up to the block cost limit
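
For illustration, a hypothetical sketch of the chunking in step 3 (greedy packing against a block cost limit; names are illustrative, not the PR's code):

    // Split cost-model costs (already in priority order) into chunks whose total
    // cost stays within the block cost limit; returns indices into `costs`.
    fn split_into_block_chunks(costs: &[u64], block_cost_limit: u64) -> Vec<Vec<usize>> {
        let mut chunks: Vec<Vec<usize>> = vec![];
        let mut current: Vec<usize> = vec![];
        let mut current_cost: u64 = 0;
        for (index, &cost) in costs.iter().enumerate() {
            if !current.is_empty() && current_cost.saturating_add(cost) > block_cost_limit {
                chunks.push(std::mem::take(&mut current));
                current_cost = 0;
            }
            current.push(index);
            current_cost = current_cost.saturating_add(cost);
        }
        if !current.is_empty() {
            chunks.push(current);
        }
        chunks
    }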

Contributor
@carllin carllin Mar 7, 2022

Some perf implications:

  1. If the unprocessed_packet_batches were already a heap sorted by fee-per-cu priority, updated as new transactions are added, we could avoid the sort here (see the sketch after this list).
  2. The unprocessed buffer could be huge, and calculating all this metadata might be expensive. Because there's already a limit on the number of packet bytes we actually forward here:

    data_budget.update(INTERVAL_MS, |bytes| {
        std::cmp::min(
            bytes.saturating_add(MAX_BYTES_PER_INTERVAL),
            MAX_BYTES_BUDGET,
        )
    });
    let packet_vec: Vec<_> = packets
        .iter()
        .filter_map(|p| {
            if !p.meta.forwarded() && data_budget.take(p.meta.size) {
                Some((&p.data[..p.meta.size], tpu_forwards))
            } else {
                None
            }
        })
        .collect();

    it might be good at the beginning to truncate the buffer to a sane number of packets that will actually be forwarded.
  3. If we do 2, I don't think we need to compute the costs of all these transactions; just forward up to X bytes from the buffer to avoid all the steps/cost needed to clone/deserialize every transaction and run the cost model.
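
For illustration of point 1 above, a hypothetical max-heap keyed by fee-per-cu (types and names are illustrative, not the PR's code); insertion keeps the buffer ordered, so no full sort is needed when consuming or forwarding:

    use std::cmp::Ordering;
    use std::collections::BinaryHeap;

    // A buffered packet reference keyed by its fee-per-cu priority.
    struct PrioritizedPacket {
        fee_per_cu: u64,
        batch_index: usize,
        packet_index: usize,
    }

    impl PartialEq for PrioritizedPacket {
        fn eq(&self, other: &Self) -> bool {
            self.fee_per_cu == other.fee_per_cu
        }
    }
    impl Eq for PrioritizedPacket {}
    impl PartialOrd for PrioritizedPacket {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }
    impl Ord for PrioritizedPacket {
        fn cmp(&self, other: &Self) -> Ordering {
            self.fee_per_cu.cmp(&other.fee_per_cu)
        }
    }

    // BinaryHeap is a max-heap, so pop() returns the highest fee-per-cu packet first.
    fn pop_highest(buffer: &mut BinaryHeap<PrioritizedPacket>) -> Option<PrioritizedPacket> {
        buffer.pop()
    }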

@carllin
Contributor

carllin commented Mar 7, 2022

@taozhu-chicago still in the middle of reviewing this, but before you start resolving comments let's start breaking this up into pieces that can be checked in:

  1. Move unprocessed_packet_batches from BankingStage into its own module
  2. TransactionWeightingStage changes
  3. The core BankingStage changes for fee prioritization, without the forwarding changes
  4. The forwarding changes

fn apply_weights(batches: &mut [PacketBatch], ip_to_stake: &HashMap<IpAddr, u64>) {
batches.into_par_iter().for_each(|batch| {
batch.packets.par_iter_mut().for_each(|packet| {
packet.meta.weight = *ip_to_stake.get(&packet.meta.addr().ip()).unwrap_or(&0);
Contributor
@carllin carllin Mar 7, 2022

is this weight always just the stake of the sender? If so, does it make sense to:

  1. rename this weight field to sender_stake, and all function names with the word weight to sender_stake
  2. rename this entire stage to FindPacketSenderStakeStage, since we're not really calculating a "weight" (like a fee) and we are operating on packets, not transactions

@tao-stones
Contributor Author

breaking this up into pieces that can be checked in:

Sounds good. All the commits in this PR provide an idea of the "end product" and allow testing it end-to-end, but that does make reviews much harder. I'll break it up as suggested for review.

@tao-stones
Contributor Author

  1. Move unprocessed_packet_batches from BankingStage into its own module

#23508

@tao-stones
Contributor Author

2. TransactionWeightingStage changes

the original PR: #21953
Do we want to reopen it and review there? The reason I closed it and moved its review here is to put it into context.

@tao-stones
Contributor Author

breaking this up into pieces that can be checked in:

Sounds good. All the commits in this PR provide an idea of the "end product" and allow testing it end-to-end, but that does make reviews much harder. I'll break it up as suggested for review.

Having second thoughts while trying to cherry-pick commits from this PR into separate PRs. Other than the first two commits, all the other commits are built on top of each other, which means the new smaller PRs would depend on (and therefore include) the previous PRs; that doesn't look like it makes review any easier than reviewing commit-by-commit. wdyt @carllin?

@carllin
Contributor

carllin commented Mar 7, 2022

@taozhu-chicago I think the code is complex enough and the reviews will be long enough that it would be good to break up those dependencies between commits into individual PRs, even though it might be a PITA.

@tao-stones
Contributor Author

@taozhu-chicago I think the code is complex enough and the reviews will be long enough that it would be good to break up those dependencies between commits into individual PRs, even though it might be a PITA.

Alright, PITA it is. Let's do it

@stale

stale bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 16, 2022
@stale

stale bot commented Apr 25, 2022

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Apr 25, 2022
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Successfully merging this pull request may close these issues.

Implement leader prioritizing packets by fee per CU