impose configurable hard limit on message sizes #4280

Open
warner opened this issue Jan 11, 2022 · 9 comments
Labels
audit-zestival (Vulnerability assessment of ERTP + Zoe) · cosmic-swingset (package: cosmic-swingset) · enhancement (New feature or request) · resource-exhaustion (Threats to availability from resource exhaustion attacks) · security · SwingSet (package: SwingSet)

Comments

@warner
Member

warner commented Jan 11, 2022

What is the Problem Being Solved?

One defense against #4276 is to limit message sizes. This depends upon #3269 to remove the important use case for large contract-installation messages, but once that's done, we feel comfortable rejecting messages larger than, say, 1MB in length. Our general expectation is that messages should be shorter than 10kB, since the chain is not the place to be processing large amounts of data, but that will probably change over time.

There are at least half a dozen places where a message size limit might be imposed:

  • adding inbound messages to the Mailbox device (which can accept multiple messages at once)
  • the mailbox device queues individual messages to vat-vattp
  • the kernel accepts those syscall.sends from the mailbox device and puts them on the run-queue
  • vat-vattp receives the message and routes it to vat-comms
    • causing another syscall.send
  • vat-comms receives the message, unpacks the wire format, and routes it to a local vat
    • causing another syscall.send
  • the message arrives at e.g. vat-zoe

There are small formatting differences at the boundaries between the different layers. For example, when vat-comms receives $seqnum:$ackSeqNum:deliver:${remoteTargetSlot}:${method}:${remoteResultSlot}:${slots..};${args.body}, it parses out everything, maps targets and slots through the per-remote c-list, then performs a syscall.send with many of the same arguments (including a verbatim copy of args.body), which requires doing a JSON.stringify of a structure containing those pieces. We can count the number of copies of args.body to estimate the short-term memory needs, and we can estimate the size of all the other parts to get a sense of how large the syscall.send will be as a function of the incoming dispatch.deliver arguments. Because of this overhead, a 1_000_000-byte hard limit on the Mailbox-side message will still occasionally allow a 1_000_005-byte message at syscall.send. If we think the variation in overhead can't be exploited, then we could rely upon the Mailbox limit and not have the kernel impose an additional one, or e.g. have the kernel impose a 2MB limit internally.
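
To make that accounting concrete, here is a hypothetical estimator (a sketch only; the function name and the fixed-cost numbers are illustrative, not SwingSet code):

```js
// Hypothetical estimator: given the pieces parsed out of an inbound comms
// `deliver`, guess the size of the JSON the resulting syscall.send must build.
function estimateSyscallSendSize({ method, body, slots, hasResult }) {
  const fixedOverhead = 100; // tags, braces, quotes: a guess, to be measured
  const slotsOverhead = slots.reduce((sum, slot) => sum + slot.length + 3, 0);
  const resultOverhead = hasResult ? 20 : 0;
  // args.body is copied verbatim into the new capdata, so it contributes 1:1
  return fixedOverhead + method.length + slotsOverhead + resultOverhead + body.length;
}
```

The point is only that a limit enforced at the Mailbox boundary does not translate into exactly the same number of bytes at the syscall boundary.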

All limits need to be configurable by the host application, so our chain can use a governance process to raise them later.

Description of the Design

I'm thinking that the host application (which decides when to write inbound messages into the Mailbox device) could impose the limit, in which case we might not need anything else.

If we choose to have the kernel impose a message limit on all vats, then we should add a config knob like sendMessageBodyLimit, holding the highest allowable msg.body size that will be accepted on any syscall.send, syscall.resolve, or syscall.invoke, plus a controller.setSendMessageBodyLimit() to update it later. The config default should be 1_000_000.
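
A minimal sketch of what that might look like (using the proposed names above; the surrounding config and controller shapes are only illustrative):

```js
const config = {
  // proposed knob: largest msg.body accepted on any syscall.send/resolve/invoke
  sendMessageBodyLimit: 1_000_000,
  // ...the rest of the usual swingset config (vats, bundles, devices, etc.)
};

// later, e.g. after a governance action decides to raise the limit:
controller.setSendMessageBodyLimit(2_000_000);
```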

Security Considerations

We should implement this now, without solving the full problem (the perfect is the enemy of the good), but we should also think through the myriad ways to sidestep this limit or trick it into doing something surprising. In other protocols, having validation or limits applied at multiple points is a great bug farm because, e.g., if parsing differences between the first and second checks mean that only a small set of messages ever reaches the second check, that check will be under-tested. If the limit really is critical (e.g. the second step will die if it ever gets 1_000_001 bytes), then the check must be repeated at each step, as long as there is any possibility of variation.

We should also be clear about what we think this is protecting. By setting the limit at N, we're claiming confidence that we can survive anything <= N. We are not claiming that we know of an attack which would be enabled by N + 1, and we're not claiming to be happy about imposing such a small limit. From a "providing features" point of view, we'd like to have as high a limit as possible, or no limit at all. From a "defending against unexpected attacks" point of view, we'd make the limit as small as we can get away with.

Our general plan is to impose fees that rise (possibly super-linearly) with size. If messages become painfully expensive well before they reach the hard limits, that will provide a clear incentive to do smaller things, or to break large things into smaller pieces, both of which I think match our "do big things off-chain" story.

Test Plan

Unit tests at each of the component boundaries, showing that a message of size exactly N passes and a message of size N+1 fails.
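
For example (a sketch assuming an AVA-style harness and a hypothetical enforceBodyLimit helper at one of those boundaries):

```js
import test from 'ava';
// hypothetical helper: throws if body.length exceeds the configured limit
import { enforceBodyLimit } from '../src/bodyLimit.js';

const N = 1_000_000;

test('body of exactly N is accepted', t => {
  t.notThrows(() => enforceBodyLimit('x'.repeat(N), N));
});

test('body of N+1 is rejected', t => {
  t.throws(() => enforceBodyLimit('x'.repeat(N + 1), N));
});
```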

@warner added the enhancement (New feature or request), SwingSet (package: SwingSet), and cosmic-swingset (package: cosmic-swingset) labels on Jan 11, 2022
@dckc added the audit-zestival (Vulnerability assessment of ERTP + Zoe), security, Zoe (package: Zoe), and ERTP (package: ERTP) labels and removed the Zoe (package: Zoe) label on Jan 12, 2022
@erights added the resource-exhaustion (Threats to availability from resource exhaustion attacks) label on Jan 18, 2022
@Tartuffo added the MN-1 label on Jan 20, 2022
@Tartuffo
Contributor

Tartuffo commented Jan 26, 2022

This ticket needs to be split between Swingset and cosmic-swingset. @warner please help come up with a plan.

@Tartuffo removed the ERTP (package: ERTP) label on Feb 2, 2022
@Tartuffo
Contributor

Tartuffo commented Feb 2, 2022

@warner This needs an estimate.

@Tartuffo
Contributor

@mhofman does this need to get done in the Cosmic Swingset layer? In Go or JS?

@mhofman
Member

mhofman commented May 12, 2022

We probably should do it as early as possible, so in Go.

@mhofman
Member

mhofman commented May 12, 2022

Probably something to discuss at the new kernel meeting.

@warner
Member Author

warner commented Jul 13, 2022

At today's kernel meeting, we figured:

  • @FUDCo added a liveslots limit that prevents vat userspace from making syscalls with capdata whose body.length is greater than 10M
    • for most JS engines, this is counting UTF-16 code units, but I think XS might violate the spec a bit (storing UTF-8 internally) and might report codepoints instead (so e.g. a codepoint that requires a surrogate pair might report .length === 2 in Node.js but .length === 1 in XS)
  • he also added a check at the kernel-worker netstring layer that imposes a 12MB limit on the payload of kernel-inbound netstrings
  • from the liveslots check to the netstring encoding, the methargs capdata (including .body) of an outbound send will be:
    • wrapped in a VatSyscallObject (adding properties for the send tag, the target object vref, the result promise vref, and some array/object wrapping)
    • wrapped in a ['syscall', vso] command object
    • encoded to JSON, which will escape backslashes and double-quotes, which could nearly double the length of the string
    • encoded to UTF-8 with TextEncoder, which could expand astral-plane codepoints by 2-4x (although if the original is in UTF-16, then we might really be looking at 2 code units expanding to 3 bytes... needs investigation); a rough worst-case calculation is sketched after this list
  • we'd like the liveslots guard to always trip before the netstring limit (because hitting the netstring limit will currently panic the kernel, as it looks just like a worker dying for unknown reasons, and we can't afford to let such deaths be treated as a consensus failure)
    • so we probably need to increase the netstring limit
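
A rough, pessimistic version of that expansion chain, as promised above (the wrapper overhead is a guess, and the 3x factor is the per-code-unit worst case for UTF-8, so the real expansion is probably smaller):

```js
const liveslotsBodyLimit = 10_000_000; // body.length (UTF-16 code units) allowed by liveslots
const wrapperOverhead = 1_000;         // VatSyscallObject + ['syscall', vso] wrapping: a guess
const utf8BytesPerCodeUnit = 3;        // worst case for a BMP code unit; JSON escapes add at most 2x on other characters

const worstCaseNetstringPayload =
  (liveslotsBodyLimit + wrapperOverhead) * utf8BytesPerCodeUnit;
// ≈ 30MB, well above the current 12MB netstring limit, which is why the
// liveslots check cannot be relied upon to trip first unless one of the limits moves.
```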

Also:

  • the threat to Zoe (that motivated this ticket) is memory usage caused by incoming deliveries
  • if we can guarantee that Zoe will never receive a message larger than X, and if Zoe performs enough input validation to avoid holding on to attacker-supplied data larger than Y, then Zoe's memory usage is basically the sum of Y for all outstanding operations that require retaining that data
  • if we can guarantee that the mailbox device will never receive a message larger than Z, and we make the MN-1 assumption of no malicious contracts or vats, then we can examine the encoding/decoding that occurs between mailbox and vat deliveries to compute Y = f(Z)
  • we believe that cosmos already imposes a size limit on signed transactions (call it C), which motivated the "contract fragmentation/reassembly" work (where a larger contract bundle is broken up into pieces, each piece sent to the chain in a separate signed transaction, and a special reassembly vat stitches the pieces back together into the complete bundle, before sending the bundle to zoe for installation)
    • @JimLarson do you know what this limit is?
    • we believe it must be around 2-3MB, because that's about the size of our larger contract bundles; if they were smaller than the limit, nobody would be working on reassembly
  • given the encoding/decoding that occurs between cosmos transactions and mailbox device writes, we can compute Z = g(C)
  • then the combination of limit C, and the expansion/contraction limits in f() and g(), might tell us that we're already meeting our goal of limiting the size of a message that arrives at zoe (a toy composition of these bounds is sketched after this list)
  • so if we meet that limit, and zoe doesn't accumulate attacker-provided data, then we're protecting zoe from externally-provoked death by memory exhaustion
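
Here is that toy composition (everything below is a placeholder: f, g, and the constants are assumptions to be replaced by looked-up or measured values):

```js
const C = 2_000_000;             // assumed cosmos signed-transaction size limit (to be confirmed)
const g = c => c;                // cosmos tx -> mailbox write expansion (placeholder: 1x)
const f = z => z;                // mailbox -> vat delivery expansion (placeholder: 1x)

const Z = g(C);                  // largest message the mailbox device should ever see
const Y = f(Z);                  // largest attacker-supplied blob Zoe might retain per operation
const maxOutstandingOps = 1_000; // assumed cap on concurrent operations holding such data
const zoeWorstCaseBytes = Y * maxOutstandingOps;
// If zoeWorstCaseBytes fits comfortably in the vat's memory budget, the existing
// cosmos limit may already give us the protection this ticket asks for.
```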

There is a lot more depth to this sort of limiting (and some sort of economic metering is the long-term defense: maybe you can give zoe large arguments but you have to pay for it, and zoe uses those funds to pay for the resources it needs to hold them internally). But the part that we care about for MN-1 is protecting Zoe and other infrastructure vats from external attack. The liveslots/netstring limits put a sharp edge behind these vats, and attackers can push those vats up against that edge.

So our action items are:

  • compute f() and g(), look up C, figure out the resulting limits X/Y/Z
  • check how Zoe handles inbound arguments for storage accumulation threats
  • then either close this ticket or push out the deeper analysis to MN-1.1 or -2

@mhofman
Member

mhofman commented Jul 14, 2022

  • which could expand astral-plane codepoints by 2-4x (although if the original is in UTF-16, then we might really be looking at 2 code units expanding to 3 bytes.. needs investigation)

Oh yeah, if XS reports the length in codepoints then it can technically go 4x, instead of at worst going 3x, between .length and the number of bytes.
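
For concreteness, a small Node.js illustration of the three ways of counting (the open question above is which of the first two numbers XS's .length reports):

```js
const s = '\u{1F600}';                // one astral codepoint (a surrogate pair in UTF-16)
s.length;                             // 2 (UTF-16 code units, per spec)
[...s].length;                        // 1 (codepoints)
new TextEncoder().encode(s).length;   // 4 (UTF-8 bytes)
// Counted in codepoints, this character expands 4x into UTF-8; counted in
// UTF-16 code units it expands only 2x (and BMP characters expand at most 3x).
```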

@Tartuffo
Contributor

We decided that the existing cosmos message size limit might be sufficient.

We need to figure out the expansion factor from that end of the stack to Zoe, and make sure we are safe for MN-1. If so, we can move this ticket to MN-1.1.

@warner
Member Author

warner commented Aug 23, 2022

Note: the netstring code can accept a limit, but we don't yet set one, pending a decision about what the limit ought to be.
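
For reference, a generic sketch (not the actual SwingSet netstring code) of how a reader can reject an oversized payload from the length prefix alone, before buffering the body:

```js
// Decode one netstring (`<len>:<payload>,`) from a Node.js Buffer, enforcing a
// maximum payload length. Throws rather than buffering an oversized payload.
function decodeNetstring(buf, maxLength) {
  const colon = buf.indexOf(0x3a); // ':'
  if (colon === -1) throw Error('incomplete length prefix');
  const length = Number(buf.subarray(0, colon).toString('ascii'));
  if (!Number.isSafeInteger(length) || length < 0) throw Error('bad length prefix');
  if (length > maxLength) throw Error(`netstring payload ${length} exceeds limit ${maxLength}`);
  const end = colon + 1 + length;
  if (buf.length < end + 1) throw Error('incomplete payload');
  if (buf[end] !== 0x2c) throw Error('missing trailing comma'); // ','
  return buf.subarray(colon + 1, end);
}
```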
