Add Support for "Extended" Transaction Packets #20691
@jbiseda @t-nelson @jstarry @mvines @ryoqun @sakridge - You all commented in the Discord thread about this topic, so I'm asking anyone who has thoughts on this to chime in. I've gone back and forth on this a little, but am currently leaning towards option 1.
Calling specific attention to the above snippet - if I understand how this works correctly, forcing fragmentation on the host (by lowering the interface MTU below the 1500 bytes most of our nodes default to) should mitigate the MTU concerns, as all of our packets will leave the host at the size we've already been using.
It's unclear to me how configuring the network interface to fragment is an improvement over fragmentation occurring elsewhere on the route. AFAIK, there's no signal sent back up to the app layer in either case, so no way to remediate.
By forcing fragmentation on the client side before the first hop it seems like you're assuming that some routers may not fragment packets themselves but will forward fragmented packets without issues. I'm not sure if this is the case. For firewalls/NAT which may want to reassemble the entire packet for inspection they may have their own rules which will drop fragments for performance/memory irrespective of which host initially fragmented the packet. I think running some tests to measure packet loss with larger packet sizes would inform your decision. If you have an idea of the payload size that you want to target for a single packet you could do some testing with iperf... for example between two GCE machines I see 0% packet loss with 60000 byte packets. From my home machine to a GCE machine I see:
You probably also want to consider that blocking IP fragments in some way may be a common configuration. For example, using a Docker instance on my home machine to a GCE instance, I see 0% packet loss for a 1470 byte payload but 100% loss for a 1480 byte payload. So I assume the Docker interface just isn't handling fragments. What is the fallback mechanism if relying on IP fragmentation between the hosts fails? If you need to implement a fallback mechanism, then it may make sense to just rely on that as the primary mechanism.
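As an aside, this kind of loss measurement is also easy to script. Here is a minimal probe sketch in Rust, assuming a simple UDP echo service is running on the far end; the address, payload sizes, and counts are placeholders, not anything from this thread:

```rust
use std::net::UdpSocket;
use std::time::Duration;

/// Send `count` datagrams of `payload_size` bytes to `target` and report the
/// fraction that never get echoed back within the timeout.
fn measure_loss(target: &str, payload_size: usize, count: u32) -> std::io::Result<f64> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_read_timeout(Some(Duration::from_millis(250)))?;
    let payload = vec![0u8; payload_size];
    let mut buf = vec![0u8; 65536];
    let mut received = 0u32;

    for _ in 0..count {
        socket.send_to(&payload, target)?;
        if socket.recv_from(&mut buf).is_ok() {
            received += 1;
        }
    }
    Ok(1.0 - received as f64 / count as f64)
}

fn main() -> std::io::Result<()> {
    // Compare loss just below and above the typical 1500-byte MTU.
    for size in [1400, 1472, 1480, 4096] {
        let loss = measure_loss("203.0.113.10:9000", size, 100)?;
        println!("{size} byte payload: {:.1}% loss", loss * 100.0);
    }
    Ok(())
}
```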
In response to t-nelson, this is the status quo for our existing transactions, right?
Yes, this was my assumption, and I now understand it was an incorrect one. So, my rebuttal to t-nelson's first sentence is invalid.
Nice, I didn't know about this tool; thanks for sharing.
Hmm yeah, good points. As we discussed in DMs, Docker may be configurable, but it seems like there may always be "one more case" where this doesn't work. And, it seems like just about everyone else avoids IP fragmentation like the plague, so maybe rolling it ourselves is the move. Rolling it ourselves would allow for additional features like erasure coding that could make things more robust.
QUIC (runs on top of UDP) decided to avoid IP fragmentation: https://datatracker.ietf.org/doc/html/draft-ietf-quic-transport-05#section-9
Thanks for this @steviez, here's my high-level take:
Do this first and get it running on testnet ASAP (master/v1.9 branch). We can then use real-world metrics data to inform us about the frequency of packet drops/etc. This is basically a prerequisite for Proposal 2 anyway.
Forwarding transactions between leaders seems like the first place we'd want to implement our own solution here. This could benefit transactions that fit in a single UDP datagram as well, if we aggregate and run erasure over the batch of transactions.
I don't believe this approach is viable in all use cases, in particular the payment use case where payment confirmation must be reached as quickly as possible. The user does not want to wait multiple seconds for a large transaction to be built up on chain. There may be ways to overcome this, but it is significantly more complex for the validator software, program developers, and the front-end.
Ignoring the scenarios where fragments may get dropped/blocked/etc (and thus the entire packet), is anyone majorly opposed to introducing changes that would most likely cause IP-level fragmentation on testnet? I know it is TESTnet, but is there any reason why we absolutely shouldn't do this on testnet?
Good point, the majority of the work (if not all of it) for Proposal 1 will be needed for Proposal 2.
Nope. Proposal 1 should only affect the submission/forwarding of transactions that are larger than the current max size. Such transactions currently fail 100% of the time, so anything less is an improvement.
Do we have a fee structure that makes sense for these larger packets? I would prefer that comprehensive fees are at least in place before we have this.
Fee structure changes can be part of this development or run in parallel, and they don't block getting Proposal 1 running on testnet. But I agree, larger transactions probably deserve a higher fee before devnet/mainnet. I was crudely thinking 2x the fee, since there are at worst 2x more bytes to sig verify.
From a BGP network operator's point of view, this is the correct take - fragmentation is generally considered a measure of very last resort due to the various problems it has:
The open internet runs on an MTU of exactly 1500, by convention. It's never larger than 1500 (jumbo frames are exclusively used in private networks for specialty use cases). Inside the internet backbone, it is never smaller than 1500 either, which extends to most hosting providers. Things only start getting funny inside eyeball and enterprise networks, with various overlay networks stealing some of it.

So, practically, the state of the art is everyone on the internet just pretty much blindly assuming that the MTU is 1500, and hoping for the best when it's not. Sometimes fragmentation works, sometimes Path MTU Discovery works, and sometimes eyeball providers with MTUs less than 1500 have funny workarounds like MSS Clamping, where routers will rewrite TCP packets to reduce the effective MSS - avoiding relying on either fragmentation or PMTUD, and simply ignoring anything that isn't TCP.

IPv6 effectively has no support for fragmentation, so it has to rely only on PMTUD or application-level hacks like MSS Clamping... neither of which works reliably 100% (or even 80%) of the time. Real-world deployments therefore often end up reducing the MTU to 1280, like we do right now. Yep... it's a mess. This is why QUIC is taking the approach of doing its own trial-and-error MTU discovery, avoiding both fragmentation and PMTUD.

For our use case, it should be safe to assume a 1500 MTU instead of 1280. Most nodes run in data centers, and clients on funny eyeball networks shouldn't be sending raw TPU packets anyway; they should be using an RPC node via TCP. If we want to go beyond 1500 bytes, we absolutely cannot rely on IP fragmentation and need a smarter application protocol.
Sadly we do have use cases that create transactions over 1500 bytes |
I'd argue that we definitely need a smarter protocol, then (or use an existing one like QUIC). Relying on fragmentation would exclude many validators, performance would be suboptimal, and we'd back ourselves into a corner relying on an IPv4-only feature.
Death by QUIC here? Seems like we can relax the wire limit as soon as QUIC and comprehensive fees are in (probably enough for SPL Token CT with just QUIC) |
Yes, I believe that is the route we're taking although I'll defer to @ryleung-solana for final confirmation. |
Yes, I believe that is the route we are taking right now. After Quic support, we may still need some extra work to minimize the memory impact of making the packet structs bigger, but I think the general plan is to support this via Quic. |
Good point on packet structs; this issue was mostly discussing the networking aspect so I'm going to close it and we can open a separate issue for handling variable size packets elsewhere.
Problem
At the moment, our codebase limits the size of transactions to what can fit into an MTU-sized UDP packet:

solana/sdk/src/packet.rs, lines 9 to 13 at commit 13462d6
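For context, the referenced lines define `PACKET_DATA_SIZE`; at that revision they read roughly as follows (reproduced here from the linked file, modulo exact comment wording):

```rust
/// Maximum over-the-wire size of a Transaction
///   1280 is IPv6 minimum MTU
///   40 bytes is the size of the IPv6 header
///   8 bytes is the size of the fragment header
pub const PACKET_DATA_SIZE: usize = 1280 - 40 - 8;
```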
While this limit is believed to help us maintain reliability and speed over the network, it also constrains what can be done in a single transaction.
This issue discusses some possible approaches to allow for larger transactions. It should also be noted that this proposal is only discussing larger packets in the context of forwarding transactions to the leader; we do not intend to change other aspects such as shred distribution through turbine.
Related Items
There have been several issues / discussions on related topics; for the sake of completeness, they are:
- How `Entries` are encoded into `Shreds`
- How each `Entry` is distributed over the network and verified by nodes

While modifications to transactions could affect the two items above, they are not the topic of this discussion.

Proposal 1: Send Larger Packets and Rely on OS/IP Level Fragmentation & Reassembly
A quick and easy solution would be to simply push larger packets into our socket `send()` calls. While it is hypothetically possible that some hosts may have channels that support larger MTUs throughout the entire route, for the general case, it is probably safer to assume there is a hop on the path with an MTU in the 1280 ballpark (maybe 1500). In this case, IP fragmentation would take over and break the original packet into an appropriate number of fragments.

In theory, our nodes can tune their interface to limit the MTU to 1280 bytes (even if the physical interface could support more). Doing so should (I think) force fragmentation to occur on the client when calling `send()` for large transactions. If that is correct, then ensuring fragmentation occurs on the host (and not left up to the discretion of a random router along the path) would mitigate many of the MTU concerns.
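As a rough illustration of Proposal 1, the sender-side change is essentially just relaxing the size check; the constant name and limit below are hypothetical placeholders, not a settled design:

```rust
use std::net::UdpSocket;

// Hypothetical relaxed limit; the real value would come out of this design
// discussion (the current limit is PACKET_DATA_SIZE = 1232 bytes).
const EXTENDED_PACKET_DATA_SIZE: usize = 4 * 1232;

fn send_transaction(socket: &UdpSocket, tpu_addr: &str, wire_tx: &[u8]) -> std::io::Result<usize> {
    assert!(wire_tx.len() <= EXTENDED_PACKET_DATA_SIZE);
    // A single send of a datagram larger than the path MTU: the IP layer
    // (on the host or at some hop) splits it into fragments, and the
    // receiving kernel reassembles them before the validator sees the packet.
    socket.send_to(wire_tx, tpu_addr)
}
```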
Pros:
Cons:
Proposal 2: Implement Fragmentation & Reassembly In Our Stack
The idea here is that we could replicate something similar to what IP fragmentation & reassembly does within our own protocol. That is, our client side code would break a transaction into the proper number of fragments (where 1 fragment is a valid option for transactions that fit within one `PACKET_DATA_SIZE`-sized packet) and validators would be responsible for reassembly.

A validator would need to maintain a buffer of fragments received. As fragments are received, a validator could take several approaches for forwarding each fragment on:
To avoid DoS attack vectors, we could specifically limit the max size of this buffer, or institute a TTL on fragments (or both).
One possible implementation of this buffer might be with a `LinkedHashMap` (see the LinkedHashMap documentation). Ignoring locks, etc., this might look like the sketch below.
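Here is one such sketch using the `linked-hash-map` crate; the keying scheme, limits, and field names are illustrative assumptions, not a concrete design. Insertion order in the map conveniently doubles as an eviction queue for the TTL and size bounds mentioned above:

```rust
use linked_hash_map::LinkedHashMap; // linked-hash-map crate
use std::time::Instant;

/// Hypothetical key identifying a logical transaction, e.g. derived from
/// the sender's pubkey plus a per-transaction nonce.
type TxId = ([u8; 32], u64);

struct PartialTx {
    first_seen: Instant,
    fragments: Vec<Option<Vec<u8>>>, // indexed by fragment_index
}

const MAX_BUFFERED_TXS: usize = 16_384; // illustrative size cap
const FRAGMENT_TTL_SECS: u64 = 2; // illustrative TTL

struct FragmentBuffer {
    partials: LinkedHashMap<TxId, PartialTx>,
}

impl FragmentBuffer {
    /// Insert one fragment; return the reassembled payload once all of the
    /// pieces have arrived.
    fn insert(&mut self, id: TxId, index: u8, total: u8, data: Vec<u8>) -> Option<Vec<u8>> {
        // Enforce the TTL and size bounds (DoS protection): the front of the
        // map is always the oldest entry, so evict from there.
        while let Some((_, oldest)) = self.partials.front() {
            if oldest.first_seen.elapsed().as_secs() > FRAGMENT_TTL_SECS
                || self.partials.len() >= MAX_BUFFERED_TXS
            {
                self.partials.pop_front();
            } else {
                break;
            }
        }

        let partial = self.partials.entry(id).or_insert_with(|| PartialTx {
            first_seen: Instant::now(),
            fragments: vec![None; total as usize],
        });
        partial.fragments[index as usize] = Some(data);

        if partial.fragments.iter().all(Option::is_some) {
            // All fragments present: remove the entry and reassemble.
            let partial = self.partials.remove(&id)?;
            Some(partial.fragments.into_iter().flatten().flatten().collect())
        } else {
            None
        }
    }
}
```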
These fragmented packets would require some additional metadata for reassembly & tracking; one possibility is sketched below.
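For example, each fragment could carry a small header like the following; the field names and widths are illustrative assumptions, not a finalized wire format:

```rust
/// Hypothetical per-fragment header prepended to each UDP payload.
#[derive(Clone, Copy, Debug)]
struct FragmentHeader {
    /// Identifies the logical transaction these fragments belong to,
    /// e.g. derived from the transaction's first signature.
    transaction_id: [u8; 32],
    /// Position of this fragment within the transaction (0-based).
    fragment_index: u8,
    /// Total fragment count, so the receiver can size its buffer and
    /// detect completion without a trailing marker.
    total_fragments: u8,
    /// Number of payload bytes in this fragment (the last fragment is
    /// typically short).
    payload_len: u16,
}
```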
Similar to shreds, using erasure coding is another string we could pull if we were seeing heavy packet loss.
Pros:
Cons:
Proposal 3: Build Up Large Transaction On Chain
Similar to the ideas mentioned in this document, we could potentially build up transactions on chain and then execute them once they are complete. I wasn't as in sync with the discussion on this issue, so maybe someone else can chime in, but this seems to be a good bit more complicated than the other two options.