
Add Support for "Extended" Transaction Packets #20691

Closed
steviez opened this issue Oct 14, 2021 · 18 comments

@steviez
Contributor

steviez commented Oct 14, 2021

Problem

At the moment, our codebase limits the size of transactions to what can fit into a single UDP packet:

/// Maximum over-the-wire size of a Transaction
/// 1280 is IPv6 minimum MTU
/// 40 bytes is the size of the IPv6 header
/// 8 bytes is the size of the fragment header
pub const PACKET_DATA_SIZE: usize = 1280 - 40 - 8;

While this limit is believed to help us maintain reliability and speed over the network, it also constrains what can be done in a single transaction.

This issue discusses some possible approaches to allow for larger transactions. It should also be noted that this proposal is only discussing larger packets in the context of forwarding transactions to the leader; we do not intend to change other aspects such as shred distribution through turbine.

Related Items

There have been several issues / discussions on related topics; for the sake of completeness, they are:

  • Some discussion on Discord
  • How we serialize / partition Entries into Shreds
  • Previous discussion on transaction size being limiting
    • Transaction size restriction limits program composability #17102
    • A proposal document came out of the above issue
      • One of the proposed solutions involved compressing account inputs for a specific scenario where a large number of account inputs ate up space. This could work, but it doesn't solve the general problem of a transaction being large due to items besides account inputs.
      • Some of the other ideas in the document discuss various on-chain approaches; the document covers those cases well, so I won't elaborate on them here beyond pointing out that on-chain is an option.
  • Cloudflare has a blog post that is a good primer on MTU and the issues that may arise in the wild

Proposal 1: Send Larger Packets and Rely on OS/IP Level Fragmentation & Reassembly

A quick and easy solution would be to simply push larger packets into our socket send() calls. While it is hypothetically possible that some hosts may have channels that support larger MTUs throughout the entire route, in the general case it is probably safer to assume there is a hop on the path with an MTU in the 1280 ballpark (maybe 1500). In that case, IP fragmentation would take over and break the original packet into an appropriate number of fragments.

In theory, our nodes can tune their interface to limit MTU to 1280 bytes (even if the physical interface could support more). Doing so should (I think) force fragmentation to occur on the client when calling send for large transactions. If that is correct, then it seems that ensuring fragmentation occurs on the host (and not left up to the discretion of a random router along the path) would mitigate lots of the MTU concerns.
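
To make Proposal 1 concrete, here is a minimal client-side sketch, assuming a hypothetical serialized wire_transaction and a placeholder leader TPU address; the only change from today is that the payload handed to send_to() may exceed PACKET_DATA_SIZE, with fragmentation and reassembly left entirely to the IP layer:

use std::net::UdpSocket;

// Minimal sketch: send a serialized transaction that may be larger than
// PACKET_DATA_SIZE and let the OS/IP layer handle fragmentation. The address
// and payload here are placeholders, not real values.
fn send_large_transaction(wire_transaction: &[u8]) -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    let leader_tpu_addr = "203.0.113.10:8003"; // hypothetical leader TPU address
    // A single send; if wire_transaction exceeds the path MTU, the kernel
    // emits multiple IP fragments and the receiving kernel reassembles them
    // before the validator's recv() returns the full datagram.
    socket.send_to(wire_transaction, leader_tpu_addr)?;
    Ok(())
}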

Pros:

  • Lower complexity to implement; we get the fragmentation & reassembly logic for free

Cons:

  • IP fragmentation makes it such that a receive socket won't return a packet until all fragments have arrived.
    • This exacerbates any packet drop rates / issues since reassembly is an all-or-nothing affair
    • Storing fragments in kernel buffers takes resources and could open us up to some level of DoS attack if a node started spraying a large number of incomplete fragment sets

Proposal 2: Implement Fragmentation & Reassembly In Our Stack

The idea here is that we could replicate something similar to what IP fragmentation & reassembly does in our protocol. That is, our client-side code would break a transaction into the proper number of fragments (where 1 fragment is a valid option for transactions that fit within a single PACKET_DATA_SIZE packet) and validators would be responsible for reassembly.

A validator would need to maintain a buffer of fragments received. As fragments are received, a validator could take several approaches for forwarding that fragment on:

  • Send the fragment immediately with the assumption that other fragments will show up soon
  • Collect fragments until the entire transaction has been collected and then forward the complete set
To avoid DoS attack vectors, we could specifically limit the max size of this buffer, institute a TTL on fragments, or both.

One possible implementation of this buffer might use a LinkedHashMap. Ignoring locks and so on, this might look like:
struct PacketFragmentMeta {
    first_insert: UnixTimestamp,
    num_expected: u8,
    num_received: u8,
    fragments: BTreeSet<Fragment>,
}
type PacketFragmentBuffer = LinkedHashMap<Signature, PacketFragmentMeta>;

fn insert_fragment(buffer: &mut PacketFragmentBuffer, fragment: Fragment) {
    if let Some(entry) = buffer.get_mut(&fragment.signature) {
        // We've already received some fragments for this transaction
        let new_fragment = entry.fragments.insert(fragment);
        if new_fragment {
            entry.num_received += 1;
            if entry.num_received == entry.num_expected {
                // All fragments are here, do something ...
            }
        }
    } else {
        // Make a new entry
        let signature = fragment.signature;
        let num_expected = fragment.num_expected();
        let mut fragments = BTreeSet::new();
        fragments.insert(fragment);
        buffer.insert(
            signature,
            PacketFragmentMeta {
                first_insert: timestamp(),
                num_expected,
                num_received: 1,
                fragments,
            },
        );
    }
}

fn purge_fragment_buffer(
    buffer: &mut PacketFragmentBuffer,
    max_eviction_time: UnixTimestamp,
) {
    // LinkedHashMap tracks insertion order and iterates in that order,
    // so the front entry is always the oldest
    while let Some((_signature, meta)) = buffer.front() {
        if meta.first_insert < max_eviction_time {
            buffer.pop_front();
        } else {
            return;
        }
    }
}

LinkedHashMap documentation

These fragmented packets would require some additional metadata for reassembly & tracking. This might look something like:

struct TransactionFragmentHeader {
    pub fragment_index: u8,
    pub num_fragments: u8,
    pub payload_size: u16
}
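
As a rough sketch of the client side, fragmentation with such a header might look like the following (the packed 4-byte little-endian header layout and the helper name fragment_transaction are assumptions for illustration, not a decided wire format):

// Rough sketch only: split a serialized transaction into fragments that each
// fit in PACKET_DATA_SIZE, prepending a TransactionFragmentHeader to each.
fn fragment_transaction(wire_transaction: &[u8]) -> Vec<Vec<u8>> {
    const HEADER_SIZE: usize = 4; // u8 index + u8 count + u16 payload size
    let max_payload = PACKET_DATA_SIZE - HEADER_SIZE;
    let chunks: Vec<&[u8]> = wire_transaction.chunks(max_payload).collect();
    let num_fragments = chunks.len() as u8;
    chunks
        .iter()
        .enumerate()
        .map(|(fragment_index, chunk)| {
            // Prepend the header fields, then the payload
            let mut packet = Vec::with_capacity(HEADER_SIZE + chunk.len());
            packet.push(fragment_index as u8);
            packet.push(num_fragments);
            packet.extend_from_slice(&(chunk.len() as u16).to_le_bytes());
            packet.extend_from_slice(chunk);
            packet
        })
        .collect()
}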

Similar to shreds, erasure coding is another lever we could pull if we were seeing heavy packet loss.
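
As the simplest possible illustration of that idea (a toy single-parity scheme rather than the Reed-Solomon coding shreds use), the sender could append one XOR parity fragment so that the receiver can reconstruct any single lost data fragment:

// Toy erasure-coding sketch: XOR all equally sized data fragments into one
// parity fragment. If exactly one data fragment is lost, XOR-ing the parity
// with the fragments that did arrive recovers it. Assumes at least one fragment.
fn xor_parity(fragments: &[Vec<u8>]) -> Vec<u8> {
    let mut parity = vec![0u8; fragments[0].len()];
    for fragment in fragments {
        for (p, byte) in parity.iter_mut().zip(fragment) {
            *p ^= *byte;
        }
    }
    parity
}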

Pros:

  • We have more control over the policies of what we do with fragments (when we forward, buffer size, buffer eviction algorithm, etc)

Cons:

  • Additional complexity of managing fragmentation & reassembly in our stack
    • Possibly reinventing the wheel a bit?

Proposal 3: Build Up Large Transaction On Chain

Similar to the ideas mentioned in this document, we could potentially build up transactions on chain and then execute them once they are complete. I wasn't as in sync with the discussion on this issue, so maybe someone else can chime in, but this seems to be a good bit more complicated than the other two options.

@steviez
Contributor Author

steviez commented Oct 14, 2021

@jbiseda @t-nelson @jstarry @mvines @ryoqun @sakridge - You all commented in the Discord thread about this topic, so asking anyone who has thoughts on this to chime in. I've gone back and forth on this a little, but am currently leaning towards option 1.

In theory, our nodes can tune their interface to limit MTU to 1280 bytes (even if the physical interface could support more). Doing so should (I think) force fragmentation to occur on the client when calling send for large transactions. If that is correct, then it seems that ensuring fragmentation occurs on the host (and not left up to the discretion of a random router along the path) would mitigate lots of the MTU concerns.

Calling specific attention to the above snippet - if I understand how this works correctly, forcing fragmentation on the host (down from what most of our nodes have as 1500 by default) should mitigate the MTU concerns as all of our packets will leave the host at the size that we've already been using.

@t-nelson
Contributor

It's unclear to me how configuring the network interface to fragment is an improvement over fragmentation occurring elsewhere on the route. AFAIK, there's no signal sent back up to the app layer in either case, so no way to remediate

@jbiseda
Contributor

jbiseda commented Oct 15, 2021

By forcing fragmentation on the client side before the first hop it seems like you're assuming that some routers may not fragment packets themselves but will forward fragmented packets without issues. I'm not sure if this is the case. Firewalls/NAT devices that want to reassemble the entire packet for inspection may have their own rules that drop fragments for performance/memory reasons, irrespective of which host initially fragmented the packet.

I think running some tests to measure packet loss with larger packet sizes would inform your decision. If you have an idea of the payload size that you want to target for a single packet you could do some testing with iperf... for example between two GCE machines I see 0% packet loss with 60000 byte packets. From my home machine to a GCE machine I see:

iperf3 -c 34.127.79.141 -p 1234 -u -4 -l 10000
...
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.26 MBytes  1.06 Mbits/sec  0.000 ms  0/132 (0%)  sender
[  5]   0.00-10.02  sec  1.25 MBytes  1.05 Mbits/sec  23160653.720 ms  0/131 (0%)  receiver

iperf3 -c 34.127.79.141 -p 1234 -u -4 -l 15000
...
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.26 MBytes  1.06 Mbits/sec  0.000 ms  0/88 (0%)  sender
[  5]   0.00-10.02  sec  1.04 MBytes   874 Kbits/sec  978175346.561 ms  14/87 (16%)  receiver

iperf3 -c 34.127.79.141 -p 1234 -u -4 -l 20000
...
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  1.26 MBytes  1.06 Mbits/sec  0.000 ms  0/66 (0%)  sender
[  5]   0.00-10.03  sec  0.00 Bytes  0.00 bits/sec  0.000 ms  0/0 (0%)  receiver

You probably also want to consider that blocking IP fragments in some way may be a common configuration. For example, using a docker instance on my home machine to GCE instance I see 0% packet loss for 1470 byte payload but 100% loss for 1480 byte payload. So I assume the docker interface just isn't handling fragments.

What is the fallback mechanism if relying on IP fragmentation between the hosts fails? If you need to implement a fallback mechanism then it may make sense to just rely on that as the primary mechanism.

@steviez
Contributor Author

steviez commented Oct 15, 2021

AFAIK, there's no signal sent back up to the app layer in either case, so no way to remediate

In response to t-nelson, this is the status quo for our existing transactions, right?

By forcing fragmentation on the client side before the first hop it seems like you're assuming that some routers may not fragment packets themselves but will forward fragmented packets without issues. I'm not sure if this is the case.

Yes, this was my assumption, and I now understand that it was an incorrect one. So, my rebuttal to t-nelson's first sentence is invalid.

I think running some tests to measure packet loss with larger packet sizes would inform your decision. If you have an idea of the payload size that you want to target for a single packet you could do some testing with iperf.

Nice, I didn't know about this tool; thanks for sharing.

You probably also want to consider that blocking IP fragments in some way may be a common configuration. For example, using a docker instance on my home machine to GCE instance I see 0% packet loss for 1470 byte payload but 100% loss for 1480 byte payload. So I assume the docker interface just isn't handling fragments.

What is the fallback mechanism if relying on IP fragmentation between the hosts fails? If you need to implement a fallback mechanism then it may make sense to just rely on that as the primary mechanism.

Hmm yeah, good points. As we discussed in DMs, Docker may be configurable, but it seems like there may always be "one more case" where this doesn't work. And, it seems like just about everyone else avoids IP fragmentation like the plague, so maybe rolling it ourselves is the move. Rolling it ourselves would also allow for additional features like erasure coding that could make things more robust.

@jbiseda
Contributor

jbiseda commented Oct 15, 2021

Hmm yeah, good points. As we discussed in DMs, Docker may be configurable, but it seems like there may always be "one more case" where this doesn't work. And, it seems like just about everyone else avoids IP fragmentation like the plague, so maybe rolling it ourselves is the move. Rolling it ourselves would also allow for additional features like erasure coding that could make things more robust.

QUIC (runs on top of UDP) decided to avoid IP fragmentation: https://datatracker.ietf.org/doc/html/draft-ietf-quic-transport-05#section-9

   All QUIC packets SHOULD be sized to fit within the estimated PMTU to
   avoid IP fragmentation or packet drops.  To optimize bandwidth
   efficiency, endpoints SHOULD use Packetization Layer PMTU Discovery
   ([RFC4821]) and MAY use PMTU Discovery ([RFC1191], [RFC1981]) for
   detecting the PMTU, setting the PMTU appropriately, and storing the
   result of previous PMTU determinations.

@mvines
Member

mvines commented Oct 18, 2021

Thanks for this @steviez, here's my high-level take:

Proposal 1: Send Larger Packets and Rely on OS/IP Level Fragmentation & Reassembly

Do this first and get it running on testnet ASAP (master/v1.9 branch). We can then use real-world metrics data to inform us about the frequency of packet drops/etc. This is basically a prerequisite for Proposal 2 anyway.

Proposal 2: Implement Fragmentation & Reassembly In Our Stack

Forwarding transactions between leaders seems like the first place we'd want to implement our own solution here. This could benefit transactions that fit in a single UDP datagram as well, if we aggregate and run erasure over the batch of transactions.

Proposal 3: Build Up Large Transaction On Chain

I don't believe this approach is viable in all use cases, in particular the payment use case where payment confirmation must be reached as quickly as possible. The user does not want to wait multiple seconds for a large transaction to be built up on chain. There may be ways to overcome this, but it is significantly more complex for the validator software, program developers, and the front end.

@steviez
Contributor Author

steviez commented Oct 18, 2021

Do this first and get it running on testnet ASAP (master/v1.9 branch).

Ignoring the scenarios where fragments may get dropped/blocked/etc (and thus the entire packet), is anyone majorly opposed to introducing changes that would most likely cause IP-level fragmentation on testnet? I know it is TESTnet, but are there any snags for why we absolutely shouldn't do this on testnet?

This is basically a prerequisite for Proposal 2 anyway.

Good point, the majority of the work (if not all of it) for Proposal 1 will be needed for Proposal 2.

@mvines
Member

mvines commented Oct 18, 2021

I know it is TESTnet, but are there any snags for why we absolutely shouldn't do this on testnet?

Nope. Proposal 1 should only affect the submission/forwarding of transactions that are larger than the current max size. Such transactions currently fail 100%, so anything less is an improvement.

@sakridge
Member

I know it is TESTnet, but are there any snags for why we absolutely shouldn't do this on testnet?

Nope. Proposal 1 should only affect the submission/forwarding of transactions that are larger than the current max size. Such transactions currently fail 100%, so anything less is an improvement.

Do we have a fee structure that makes sense for these larger packets? I would prefer the comprehensive fees are at least in place before we have this.

@mvines
Member

mvines commented Oct 18, 2021

I would prefer the comprehensive fees are at least in place before we have this.

Fee structure changes can be part of this development or run in parallel and don't block getting Proposal 1 running on testnet. But I agree, larger transactions probably deserve a higher fee before devnet/mainnet. I was crudely thinking 2x the fee, since there are at worst 2x more bytes to sig verify.

@leoluk
Contributor

leoluk commented Dec 23, 2021

And, it seems like just about everyone else avoids IP fragmentation like the plague [...]

From a BGP network operator's point of view, this is the correct take - fragmentation is generally considered a measure of very last resort due to the various problems it has:

  • It requires stateful reassembly, which is great for DDoS attacks. For this reason alone, many networks simply drop fragments altogether (and being able to do stateless DoS mitigation is particularly important for TPU/TPUfwd!).
  • All-or-nothing delivery, no retransmits - it's a very dumb protocol and the transport layer (i.e. TCP/QUIC/our custom UDP protocol...) can do much better by reducing its MSS.
  • Fragments aren't sized efficiently - if you're one byte over the threshold, then you get a single-byte fragment. This results in many very small fragments, considerably reducing effective throughput.
  • Performance ranges from great to abysmal and the application has zero insight into what's happening.
  • Weird interactions with packet level load balancers.
  • IPv6 basically doesn't support fragmentation at all.
  • (and more I can't remember right now)

The open internet runs on an MTU of exactly 1500, by convention. It's never larger than 1500 (jumbo frames are exclusively used in private networks for specialty use cases). Inside the internet backbone, it is never smaller than 1500, either, which extends to most hosting providers. Things only start getting funny inside eyeball and enterprise networks, with various overlay networks stealing some of it.

So, practically, the state of the art is everyone on the internet pretty much blindly assuming that the MTU is 1500 and hoping for the best when it's not. Sometimes fragmentation works, sometimes Path MTU Discovery works, and sometimes eyeball providers with MTUs less than 1500 have funny workarounds like MSS Clamping, where routers rewrite TCP packets to reduce the effective MSS, avoiding reliance on either fragmentation or PMTUD while simply ignoring anything that isn't TCP.

IPv6 effectively has no support for fragmentation, so it has to rely only on PMTUD or hacks like MSS Clamping... neither of which works reliably 100% (or even 80%) of the time.

Real-world deployments therefore often end up reducing MTU to 1280 like we do right now. Yep... it's a mess.

This is why QUIC is taking the approach of doing its own trial-and-error MTU Discovery, avoiding both fragmentation and PMTUD.

For our use case, it should be safe to assume a 1500 MTU instead of 1280. Most nodes run in data centers, and clients on funny eyeball networks shouldn't be sending raw TPU packets anyway, they should be using an RPC node via TCP.

If we want to go beyond 1500 bytes, we absolutely cannot rely on IP Fragmentation and need a smarter application protocol.

@mvines
Member

mvines commented Dec 23, 2021

Sadly, we do have use cases that create transactions over 1500 bytes.

@leoluk
Contributor

leoluk commented Dec 23, 2021

I'd argue that we definitely need a smarter protocol, then (or use an existing one like QUIC). Relying on fragmentation would exclude many validators, performance would be suboptimal, and we'd back ourselves into a corner relying on an IPv4-only feature.

@t-nelson
Contributor

t-nelson commented Feb 9, 2022

Death by QUIC here? Seems like we can relax the wire limit as soon as QUIC and comprehensive fees are in (probably enough for SPL Token CT with just QUIC)

@steviez
Contributor Author

steviez commented Feb 9, 2022

Death by QUIC here? Seems like we can relax the wire limit as soon as QUIC and comprehensive fees are in (probably enough for SPL Token CT with just QUIC)

Yes, I believe that is the route we're taking although I'll defer to @ryleung-solana for final confirmation.

@ryleung-solana
Contributor

Yes, I believe that is the route we are taking right now. After Quic support, we may still need some extra work to minimize the memory impact of making the packet structs bigger, but I think the general plan is to support this via Quic.

@steviez
Contributor Author

steviez commented Feb 9, 2022

Yes, I believe that is the route we are taking right now. After Quic support, we may still need some extra work to minimize the memory impact of making the packet structs bigger, but I think the general plan is to support this via Quic.

Good point on packet structs; this issue was mostly discussing the networking aspect so I'm going to close it and we can open a separate issue for handling variable size packets elsewhere.

@steviez steviez closed this as completed Feb 9, 2022
@github-actions
Contributor

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2022