Constant slices #584

graydon · 2022-11-22T20:38:29Z

We should add a new variant of RawVal called CSlice that is a constant (and lazy) version of Bytes, with the following structure in its 60 bits:

[8 bits] - Bytes subtype (assuming we do Object subtypes). This limits subtypes to 256 possibilities, which is likely still far more than we'll ever need.
[12 bits] - Numeric identifier for currently-running contract. This imposes a new (but I think fairly reasonable) limit of 4096 contracts loaded per Host.
[20 bits] - Offset in the constant data of the contract. This imposes a new (but I think fairly reasonable) size limit of 1MB on each contract.
[20 bits] - Length in the constant data of the contract. Same limit.

(Comments welcome below on fiddling with these limits up and down -- I can imagine setting them differently, they just have to add up to 60 bits)

Numeric identifiers are assigned dynamically by the host as it loads contracts. Numeric identifier 0 is reserved to mean "the currently-running contract" and any time it's passed to a host function it's rewritten to use the actual assigned numeric identifier of that contract.
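To make the layout concrete, here is a minimal sketch of packing and unpacking the four fields into the low 60 bits of a u64, plus the "identifier 0 means the currently-running contract" rewrite. The struct name, method names, and the choice to put the subtype in the most-significant bits are assumptions made for illustration only; nothing here is taken from the actual RawVal code.

```rust
// Field widths from the proposal: 8 + 12 + 20 + 20 = 60 bits.
// The ordering (subtype highest, length lowest) is an assumption.
const CONTRACT_BITS: u32 = 12;
const OFFSET_BITS: u32 = 20;
const LEN_BITS: u32 = 20;

/// Reserved identifier meaning "the currently-running contract".
pub const CURRENT_CONTRACT: u16 = 0;

const fn mask(bits: u32) -> u64 {
    (1u64 << bits) - 1
}

/// Hypothetical unpacked view of a CSlice payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CSlice {
    pub subtype: u8,
    pub contract: u16,
    pub offset: u32,
    pub len: u32,
}

impl CSlice {
    /// Packs the fields into the low 60 bits of a u64.
    /// Returns None if any field exceeds its bit budget.
    /// This is a const fn, so a guest could evaluate it at compile time.
    pub const fn pack(self) -> Option<u64> {
        if (self.contract as u64) > mask(CONTRACT_BITS)
            || (self.offset as u64) > mask(OFFSET_BITS)
            || (self.len as u64) > mask(LEN_BITS)
        {
            return None;
        }
        Some(
            ((self.subtype as u64) << (CONTRACT_BITS + OFFSET_BITS + LEN_BITS))
                | ((self.contract as u64) << (OFFSET_BITS + LEN_BITS))
                | ((self.offset as u64) << LEN_BITS)
                | (self.len as u64),
        )
    }

    /// Inverse of `pack`; bits above the low 60 are ignored.
    pub const fn unpack(bits: u64) -> CSlice {
        CSlice {
            subtype: ((bits >> (CONTRACT_BITS + OFFSET_BITS + LEN_BITS)) & 0xff) as u8,
            contract: ((bits >> (OFFSET_BITS + LEN_BITS)) & mask(CONTRACT_BITS)) as u16,
            offset: ((bits >> LEN_BITS) & mask(OFFSET_BITS)) as u32,
            len: (bits & mask(LEN_BITS)) as u32,
        }
    }

    /// Host-side rewrite: replace the reserved identifier 0 with the
    /// identifier the host assigned to the running contract.
    pub fn resolve(self, running: u16) -> CSlice {
        if self.contract == CURRENT_CONTRACT {
            CSlice { contract: running, ..self }
        } else {
            self
        }
    }
}
```

Changing the limits would only mean changing the three width constants, as long as the total stays at 60 bits.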

The point of CSlice is to occupy a space somewhere in-between Object(Bytes(...)) and Symbol:

Like Symbol it can be calculated at compile time in the guest, embedded in const structures or emitted as a literal in the instruction stream.
Like Symbol it can be pushed very cheaply into the debug-event buffer on the host, and if it's never inspected further it does not cause any host-side allocations. The host does not need to worry about the memory it points to being short-lived in the guest, because the guest is promising that it is stored in the guest's constant-data section.
Like Object(Bytes(...)) it can point to arbitrarily-long strings. When converted to an XDR SCVal for returning from a contract or emitting into the event list, it's treated as though it was Object(Bytes(...)), just constructed on-demand.

This might be hard to implement (or even impossible) depending on how easy or hard it is to correlate constant linear-memory addresses with offsets of the originating data, in the WASM data section. It might require an extra step of address-conversion when converting a "contract identity number 0" input-CSlice.

We might also need to use a special host-side representation that handles native-testing mode transparently (i.e. so that a &'static str or &'static [u8] can be used through EnvBase).

This, along with simply allowing Bytes arguments (or CSlices) in any host functions we currently restrict to Symbol arguments, represents my best-guess about how to solve #463 . It's not perfect, since CSlice is still Host-relative, whereas Symbol is more universal, but since it's confined to RawVal rather than bleeding into the XDR, I think it's probably something we can live with.


graydon · 2022-11-22T20:39:59Z

(I should also note that this is orthogonal to / complementary to making byte-slicing itself reuse underlying data. I.e. taking a slice of a CSlice should yield another CSlice, and we should make Bytes refcounted so that multiple slices into the same Bytes don't need to duplicate the data)
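As a rough illustration of the refcounted half of that idea, the sketch below keeps one shared buffer behind an Rc and represents each slice as a (start, len) view into it, so slicing never copies the bytes. SharedBytes and its methods are hypothetical names, not the host's actual Bytes type.

```rust
use std::rc::Rc;

/// Hypothetical refcounted byte buffer: every slice shares `buf`
/// and only records which window of it is visible.
#[derive(Clone, Debug)]
pub struct SharedBytes {
    buf: Rc<[u8]>,
    start: usize,
    len: usize,
}

impl SharedBytes {
    /// Copies the data once into a refcounted buffer.
    pub fn new(data: &[u8]) -> Self {
        SharedBytes { buf: Rc::from(data), start: 0, len: data.len() }
    }

    /// Returns a view of `start..end` (relative to this view) that
    /// shares the underlying buffer; None if the range is out of bounds.
    pub fn slice(&self, start: usize, end: usize) -> Option<Self> {
        if start > end || end > self.len {
            return None;
        }
        Some(SharedBytes {
            buf: Rc::clone(&self.buf),
            start: self.start + start,
            len: end - start,
        })
    }

    /// The bytes visible through this view.
    pub fn as_slice(&self) -> &[u8] {
        &self.buf[self.start..self.start + self.len]
    }

    /// True if both views point at the same allocation.
    pub fn shares_buffer_with(&self, other: &Self) -> bool {
        Rc::ptr_eq(&self.buf, &other.buf)
    }
}
```

Slicing a CSlice would be analogous but even cheaper, since it would only adjust the offset and length fields of the value.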

graydon · 2022-12-01T21:03:49Z

I've done a little more investigation into this and while I still think it will work, there are a couple more limitations that will restrict it:

Data sections may have multiple segments. They usually have just one, and rarely have many, but a single data segment is not guaranteed. I've seen examples with 2, 4, or even a couple dozen segments, though usually only 1 or 2. It depends on which linker you run it through (lld will merge segments) and various other mystery factors in the producer.
Data segments might be positioned at nontrivial offsets that are hard to evaluate as constants in the runtime. I've never seen an example of this but the file format allows it.

So .. I still think this is basically ok and I'm going to try prototyping it anyways, but it might need to be a slightly-more-partial function, which falls back to either Bytes (if it's targeting a use case where that's acceptable) or a silent or noisy failure (such as debug-logging with a format string -- probably best to fail mostly-silently, maybe put in an "unknown message" placeholder or such).

We can also reduce the odds of the first issue causing a problem by reserving a few bits to identify the data segment, eg. support "up to 16" segments or something. That will probably handle all real-world cases.
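A sketch of what that partial lookup might look like: given the segments the host could resolve to constant addresses, find the one that fully contains a guest address range and return its index plus the offset inside it. The Segment type, the locate function, and the cap of 16 are all illustrative assumptions. A None result stands for the failure case, where the caller would fall back to Bytes or a placeholder.

```rust
/// Hypothetical record of one data segment whose placement in linear
/// memory the host was able to evaluate as a constant.
pub struct Segment {
    pub mem_start: u32, // address in guest linear memory
    pub len: u32,       // length of the segment in bytes
}

/// Assumed cap, matching the "up to 16 segments" idea (4 bits of index).
pub const MAX_SEGMENTS: usize = 16;

/// Finds the segment that entirely contains `addr..addr + len`.
/// Returns (segment index, offset within that segment), or None if the
/// range is outside every known segment or straddles a boundary.
pub fn locate(segments: &[Segment], addr: u32, len: u32) -> Option<(u8, u32)> {
    // Use u64 arithmetic so the end-of-range sums cannot overflow.
    let end = addr as u64 + len as u64;
    for (i, seg) in segments.iter().enumerate().take(MAX_SEGMENTS) {
        let seg_start = seg.mem_start as u64;
        let seg_end = seg_start + seg.len as u64;
        if addr as u64 >= seg_start && end <= seg_end {
            return Some((i as u8, addr - seg.mem_start));
        }
    }
    None
}
```

Segments beyond the cap are simply ignored here, so slices that live in them take the failure path rather than producing a wrong index.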

graydon · 2023-02-16T07:49:45Z

I've done a lot of experimenting with this and concluded it's not worthwhile. At least not unless we get feedback that it's really hurting us.

Storing on the guest (i.e. inside a RawVal) is theoretically possible but in practice just too finicky; it requires baking-in assumptions about the beginning of the data segment (defined here but like do we want to bake that in?), and also doesn't work if you compile in native mode (so you need another special case), and puts size limits on slices and the number of contracts loaded, and requires that we fastidiously translate all RawVal module references to anchor their owning module when they cross the guest/host boundary.
Storing on the host works, but it still requires making a dynamic host call (so then we're out of the world of const fns in the guest) as well as allocating a host object (16 bytes minimum probably more like 32 or 64) and then doing memory-coordinate translation on the slice (locating the spanning data segment) and then storing something like a 4-tuple of {Rc,segment,pos,len} which is another 32 bytes or so. My guess is that the great majority of strings won't be big enough for this to be any smaller or much less work than just .. allocating a buffer and copying the string. And the codepaths are much simpler if we do that, don't need to special-case the type.
Storing on the host gives some opportunities for interning by 4-tuple, and I .. went a ways into exploring that also (see here) but in the end I just .. couldn't really convince myself all the extra work was going to pay for itself. These are very small programs that run for very short amounts of time. I think it's probably a better bet to try to stay out of their way, and if they want to cache an object for frequent reuse, they can just .. cache the object.

graydon added the enhancement New feature or request label Nov 22, 2022

graydon mentioned this issue Nov 22, 2022

Consider alternative forms for Symbol #463

Closed

graydon self-assigned this Nov 22, 2022

leighmcculloch mentioned this issue Dec 2, 2022

new post: comparing NEAR, Soroban, & CosmWasm SDKs AhaLabs/ahalabs.github.io#11

Merged

sisuresh mentioned this issue Dec 16, 2022

Use single balances in the Stellar Asset Contract #615

Merged

graydon mentioned this issue Jan 28, 2023

Replace ScObject with ScOption. Unify everything under ScVal. stellar/stellar-xdr#64

Closed

graydon mentioned this issue Feb 9, 2023

Value representation overhaul #679

Closed

graydon closed this as completed Feb 16, 2023

graydon mentioned this issue Feb 16, 2023

Log / diagnostics from guest memory static strings #229

Closed
