impose configurable hard limit on message sizes #4280

Open
warner opened this issue Jan 11, 2022 · 9 comments
Labels
audit-zestival (Vulnerability assessment of ERTP + Zoe) · cosmic-swingset (package: cosmic-swingset) · enhancement (New feature or request) · resource-exhaustion (Threats to availability from resource exhaustion attacks) · security · SwingSet (package: SwingSet)

Comments

@warner
Member

warner commented Jan 11, 2022

What is the Problem Being Solved?

One defense against #4276 is to limit message sizes. This depends upon #3269 to remove the important use case for large contract-installation messages, but once that's done, we feel comfortable rejecting messages larger than, say, 1MB in length. Our general expectation is that messages should be shorter than 10kB, since the chain is not the place to be processing large amounts of data, but that will probably change over time.

There are at least half a dozen places where a message size limit might be imposed:

  • adding inbound messages to the Mailbox device (which can accept multiple messages at once)
  • the mailbox device queues individual messages to vat-vattp
  • the kernel accepts those syscall.sends from the mailbox device and puts them on the run-queue
  • vat-vattp receives the message and routes it to vat-comms
    • causing another syscall.send
  • vat-comms receives the message, unpacks the wire format, and routes it to a local vat
    • causing another syscall.send
  • the message arrives at e.g. vat-zoe

There are small formatting differences at the boundaries between the different layers. For example, when vat-comms receives $seqnum:$ackSeqNum:deliver:${remoteTargetSlot}:${method}:${remoteResultSlot}:${slots..};${args.body}, it parses out everything, maps targets and slots through the per-remote c-list, then performs a syscall.send with many of the same arguments (including a verbatim copy of args.body), which requires doing a JSON.stringify of a structure containing those pieces. We can count the number of copies of args.body to estimate the short-term memory needs, and we can estimate the size of all the other parts to get a sense of how large the syscall.send will be as a function of the incoming dispatch.deliver arguments. Because of this overhead, a 1_000_000-byte hard limit on the Mailbox-side message will still occasionally allow a 1_000_005-byte message at syscall.send. If we think the variation in overhead can't be exploited, then we could rely upon the Mailbox limit and not have the kernel impose an additional one, or e.g. have the kernel impose a 2MB limit internally.
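
To make that accounting concrete, here is a hypothetical estimator (a sketch only; the function name and the fixed-cost numbers are illustrative, not SwingSet code):

```js
// Hypothetical estimator: given the pieces parsed out of an inbound comms
// `deliver`, guess the size of the JSON the resulting syscall.send must build.
function estimateSyscallSendSize({ method, body, slots, hasResult }) {
  const fixedOverhead = 100; // tags, braces, quotes: a guess, to be measured
  const slotsOverhead = slots.reduce((sum, slot) => sum + slot.length + 3, 0);
  const resultOverhead = hasResult ? 20 : 0;
  // args.body is copied verbatim into the new capdata, so it contributes 1:1
  return fixedOverhead + method.length + slotsOverhead + resultOverhead + body.length;
}
```

The point is only that a limit enforced at the Mailbox boundary does not translate into exactly the same number of bytes at the syscall boundary.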

All limits need to be configurable by the host application, so our chain can use a governance process to raise them later.

Description of the Design

I'm thinking that the host application (which decides when to write inbound messages into the Mailbox device) could impose the limit, in which case we might not need anything else.

If we choose to have the kernel impose a message limit on all vats, then we should add a config knob like sendMessageBodyLimit, holding the highest allowable msg.body size that will be accepted on any syscall.send, syscall.resolve, or syscall.invoke, plus a controller.setSendMessageBodyLimit() to update it later. The config default should be 1_000_000.
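
A minimal sketch of what that might look like (using the proposed names above; the surrounding config and controller shapes are only illustrative):

```js
const config = {
  // proposed knob: largest msg.body accepted on any syscall.send/resolve/invoke
  sendMessageBodyLimit: 1_000_000,
  // ...the rest of the usual swingset config (vats, bundles, devices, etc.)
};

// later, e.g. after a governance action decides to raise the limit:
controller.setSendMessageBodyLimit(2_000_000);
```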

Security Considerations

We should implement this now, without solving the full problem (the perfect is the enemy of the good), but we should also think through the myriad ways to sidestep this limit or trick it into doing something surprising. In other protocols, having validation or limits applied at multiple points is a great bug farm because, e.g., if parsing differences between the first and second checks mean that only a small set of messages ever reaches the second check, that check will be under-tested. If the limit really is critical (e.g. the second step will die if it ever gets 1_000_001 bytes), then the check must be repeated at each step, as long as there is any possibility of variation.

We should also be clear about what we think this is protecting. By setting the limit at N, we're claiming confidence that we can survive anything <= N. We are not claiming that we know of an attack which would be enabled by N + 1, and we're not claiming to be happy about imposing such a small limit. From a "providing features" point of view, we'd like to have as high a limit as possible, or no limit at all. From a "defending against unexpected attacks" point of view, we'd make the limit as small as we can get away with.

Our general plan is to impose fees that rise (possibly super-linearly) with size. If messages become painfully expensive well before they reach the hard limits, that will provide a clear incentive to do smaller things, or to break large things into smaller pieces, both of which I think match our "do big things off-chain" story.

Test Plan

Unit tests at each of the component boundaries, showing that a message of size exactly N passes and a message of size N+1 fails.
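
For example (a sketch assuming an AVA-style harness and a hypothetical enforceBodyLimit helper at one of those boundaries):

```js
import test from 'ava';
// hypothetical helper: throws if body.length exceeds the configured limit
import { enforceBodyLimit } from '../src/bodyLimit.js';

const N = 1_000_000;

test('body of exactly N is accepted', t => {
  t.notThrows(() => enforceBodyLimit('x'.repeat(N), N));
});

test('body of N+1 is rejected', t => {
  t.throws(() => enforceBodyLimit('x'.repeat(N + 1), N));
});
```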

@warner added the enhancement (New feature or request), SwingSet (package: SwingSet), and cosmic-swingset (package: cosmic-swingset) labels on Jan 11, 2022
@dckc added the audit-zestival (Vulnerability assessment of ERTP + Zoe), security, Zoe (package: Zoe), and ERTP (package: ERTP) labels and removed the Zoe (package: Zoe) label on Jan 12, 2022
@erights added the resource-exhaustion (Threats to availability from resource exhaustion attacks) label on Jan 18, 2022
@Tartuffo added the MN-1 label on Jan 20, 2022
@Tartuffo
Contributor

Tartuffo commented Jan 26, 2022

This ticket needs to be split between Swingset and cosmic-swingset. @warner please help come up with a plan.

@Tartuffo removed the ERTP (package: ERTP) label on Feb 2, 2022
@Tartuffo
Contributor

Tartuffo commented Feb 2, 2022

@warner This needs an estimate.

@Tartuffo
Contributor

@mhofman does this need to get done in the Cosmic Swingset layer? In Go or JS?

@mhofman
Member

mhofman commented May 12, 2022

We probably should do it as early as possible, so in Go.

@mhofman
Member

mhofman commented May 12, 2022

Probably something to discuss at the new kernel meeting.

@warner
Member Author

warner commented Jul 13, 2022

At today's kernel meeting, we figured:

  • @FUDCo added a liveslots limit that prevents vat userspace from making syscalls with capdata whose body.length is greater than 10M
    • for most JS engines, this is counting UTF-16 code units, but I think XS might violate the spec a bit (storing UTF-8 internally) and might report codepoints instead (so e.g. a codepoint that requires a surrogate pair might report .length === 2 in Node.js but .length === 1 in XS)
  • he also added a check at the kernel-worker netstring layer that imposes a 12MB limit on the payload of kernel-inbound netstrings
  • from the liveslots check to the netstring encoding, the methargs capdata (including .body) of an outbound send will be:
    • wrapped in a VatSyscallObject (adding properties for the send tag, the target object vref, the result promise vref, and some array/object wrapping)
    • wrapped in a ['syscall', vso] command object
    • encoded to JSON, which will escape backslashes and double-quotes, which could nearly double the length of the string
    • encoded to UTF-8 with TextEncoder, which could expand astral-plane codepoints by 2-4x (although if the original is in UTF-16, then we might really be looking at 2 code units expanding to 3 bytes... needs investigation); a rough worst-case calculation is sketched after this list
  • we'd like the liveslots guard to always trip before the netstring limit (because hitting the netstring limit will currently panic the kernel, as it looks just like a worker dying for unknown reasons, and we can't afford to let such deaths be treated as a consensus failure)
    • so we probably need to increase the netstring limit
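
A rough, pessimistic version of that expansion chain, as promised above (the wrapper overhead is a guess, and the 3x factor is the per-code-unit worst case for UTF-8, so the real expansion is probably smaller):

```js
const liveslotsBodyLimit = 10_000_000; // body.length (UTF-16 code units) allowed by liveslots
const wrapperOverhead = 1_000;         // VatSyscallObject + ['syscall', vso] wrapping: a guess
const utf8BytesPerCodeUnit = 3;        // worst case for a BMP code unit; JSON escapes add at most 2x on other characters

const worstCaseNetstringPayload =
  (liveslotsBodyLimit + wrapperOverhead) * utf8BytesPerCodeUnit;
// ≈ 30MB, well above the current 12MB netstring limit, which is why the
// liveslots check cannot be relied upon to trip first unless one of the limits moves.
```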

Also:

  • the threat to Zoe (that motivated this ticket) is memory usage caused by incoming deliveries
  • if we can guarantee that Zoe will never receive a message larger than X, and if Zoe performs enough input validation to avoid holding on to attacker-supplied data larger than Y, then Zoe's memory usage is basically the sum of Y for all outstanding operations that require retaining that data
  • if we can guarantee that the mailbox device will never receive a message larger than Z, and we make the MN-1 assumption of no malicious contracts or vats, then we can examine the encoding/decoding that occurs between mailbox and vat deliveries to compute Y = f(Z)
  • we believe that cosmos already imposes a size limit on signed transactions (call it C), which motivated the "contract fragmentation/reassembly" work (where a larger contract bundle is broken up into pieces, each piece sent to the chain in a separate signed transaction, and a special reassembly vat stitches the pieces back together into the complete bundle, before sending the bundle to zoe for installation)
    • @JimLarson do you know what this limit is?
    • we believe it must be around 2-3MB, because that's about the size of our larger contract bundles; if they were smaller than the limit, nobody would be working on reassembly
  • given the encoding/decoding that occurs between cosmos transactions and mailbox device writes, we can compute Z = g(C)
  • then the combination of limit C, and the expansion/contraction limits in f() and g(), might tell us that we're already meeting our goal of limiting the size of a message that arrives at zoe (a toy composition of these bounds is sketched after this list)
  • so if we meet that limit, and zoe doesn't accumulate attacker-provided data, then we're protecting zoe from externally-provoked death by memory exhaustion
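
Here is that toy composition (everything below is a placeholder: f, g, and the constants are assumptions to be replaced by looked-up or measured values):

```js
const C = 2_000_000;             // assumed cosmos signed-transaction size limit (to be confirmed)
const g = c => c;                // cosmos tx -> mailbox write expansion (placeholder: 1x)
const f = z => z;                // mailbox -> vat delivery expansion (placeholder: 1x)

const Z = g(C);                  // largest message the mailbox device should ever see
const Y = f(Z);                  // largest attacker-supplied blob Zoe might retain per operation
const maxOutstandingOps = 1_000; // assumed cap on concurrent operations holding such data
const zoeWorstCaseBytes = Y * maxOutstandingOps;
// If zoeWorstCaseBytes fits comfortably in the vat's memory budget, the existing
// cosmos limit may already give us the protection this ticket asks for.
```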

There is a lot more depth to this sort of limiting (and some sort of economic metering is the long-term defense: maybe you can give zoe large arguments but you have to pay for it, and zoe uses those funds to pay for the resources it needs to hold them internally). But the part that we care about for MN-1 is protecting Zoe and other infrastructure vats from external attack. The liveslots/netstring limits put a sharp edge behind these vats, and attackers can push those vats up against that edge.

So our action items are:

  • compute f() and g(), look up C, figure out the resulting limits X/Y/Z
  • check how Zoe handles inbound arguments for storage accumulation threats
  • then either close this ticket or push out the deeper analysis to MN-1.1 or -2

@mhofman
Member

mhofman commented Jul 14, 2022

  • which could expand astral-plane codepoints by 2-4x (although if the original is in UTF-16, then we might really be looking at 2 code units expanding to 3 bytes.. needs investigation)

Oh yeah, if XS reports the length in codepoints then it can technically go 4x, instead of at worst going 3x, between .length and the number of bytes.
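
For concreteness, a small Node.js illustration of the three ways of counting (the open question above is which of the first two numbers XS's .length reports):

```js
const s = '\u{1F600}';                // one astral codepoint (a surrogate pair in UTF-16)
s.length;                             // 2 (UTF-16 code units, per spec)
[...s].length;                        // 1 (codepoints)
new TextEncoder().encode(s).length;   // 4 (UTF-8 bytes)
// Counted in codepoints, this character expands 4x into UTF-8; counted in
// UTF-16 code units it expands only 2x (and BMP characters expand at most 3x).
```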

@Tartuffo
Contributor

We decided that the existing cosmos message size limit might be sufficient.

We need to figure out the expansion factor from that end of the stack to Zoe, and make sure we are safe for MN-1. If so, we can move this ticket to MN-1.1.

@warner
Member Author

warner commented Aug 23, 2022

Note: the netstring code can accept a limit, but we don't yet set one, pending a decision about what the limit ought to be.
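
For reference, a generic sketch (not the actual SwingSet netstring code) of how a reader can reject an oversized payload from the length prefix alone, before buffering the body:

```js
// Decode one netstring (`<len>:<payload>,`) from a Node.js Buffer, enforcing a
// maximum payload length. Throws rather than buffering an oversized payload.
function decodeNetstring(buf, maxLength) {
  const colon = buf.indexOf(0x3a); // ':'
  if (colon === -1) throw Error('incomplete length prefix');
  const length = Number(buf.subarray(0, colon).toString('ascii'));
  if (!Number.isSafeInteger(length) || length < 0) throw Error('bad length prefix');
  if (length > maxLength) throw Error(`netstring payload ${length} exceeds limit ${maxLength}`);
  const end = colon + 1 + length;
  if (buf.length < end + 1) throw Error('incomplete payload');
  if (buf[end] !== 0x2c) throw Error('missing trailing comma'); // ','
  return buf.subarray(colon + 1, end);
}
```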
