Consider alternative forms for Symbol #463

Closed
Tracked by #679
leighmcculloch opened this issue Sep 21, 2022 · 17 comments · Fixed by stellar/rs-soroban-sdk#655

Comments

@leighmcculloch
Member

leighmcculloch commented Sep 21, 2022

Symbol is a 60-bit encoding of a 10 character string with a limited character set of a-zA-Z0-9_.

The value is stored in a u64 and so is efficient to pass around.
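
To make the 60-bit packing concrete, here is a minimal sketch. It assumes each of the 63 allowed characters maps to a non-zero 6-bit code (the particular ordering below is an assumption for illustration, not taken from the SDK), so 10 characters fit in the low 60 bits of a u64. The names `char_code`, `encode_symbol`, and `decode_symbol` are hypothetical.

```rust
/// 10 characters x 6 bits = 60 bits, which fits in a u64.
const MAX_CHARS: usize = 10;

/// Map one allowed character to a non-zero 6-bit code.
/// The ordering (_, 0-9, A-Z, a-z) is an assumption for this sketch.
fn char_code(c: u8) -> Option<u64> {
    match c {
        b'_' => Some(1),
        b'0'..=b'9' => Some(2 + (c - b'0') as u64),
        b'A'..=b'Z' => Some(12 + (c - b'A') as u64),
        b'a'..=b'z' => Some(38 + (c - b'a') as u64),
        _ => None,
    }
}

/// Pack up to 10 characters into the low 60 bits of a u64.
/// Returns None if the string is too long or has a disallowed character.
fn encode_symbol(s: &str) -> Option<u64> {
    if s.len() > MAX_CHARS {
        return None;
    }
    let mut bits: u64 = 0;
    for &b in s.as_bytes() {
        bits = (bits << 6) | char_code(b)?;
    }
    Some(bits)
}

/// Unpack the 6-bit codes back into a string.
/// Returns None if a zero code appears between characters.
fn decode_symbol(mut bits: u64) -> Option<String> {
    let mut out = Vec::new();
    while bits != 0 {
        let code = (bits & 0x3f) as u8;
        let c = match code {
            0 => return None,
            1 => b'_',
            2..=11 => b'0' + (code - 2),
            12..=37 => b'A' + (code - 12),
            _ => b'a' + (code - 38), // codes 38..=63
        };
        out.push(c);
        bits >>= 6;
    }
    out.reverse();
    String::from_utf8(out).ok()
}
```

Under these assumptions, `encode_symbol("contrct_id")` succeeds because the name has exactly 10 characters, while the 11-character `contract_id` is rejected with `None`.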

We use it in a lot of places:

  • Function names.
  • Type names, field names.
  • Enum variant names.
  • As keys to data storage.
  • And so on.

However, the 10 character limit is at times annoying. It can sometimes cause us to write code that looks like a human error, e.g. contrct_id.

We should consider other ways we could encode strings into Symbols that would give us longer strings, more descriptive function names, variants, type names, etc.

cc @tomerweller @graydon

@leighmcculloch
Member Author

One option I've heard discussed is something similar to how Ethereum uses the first 4-bytes of a hash of function names.

We could make Symbol a 60-bit hash of the string.

This would work in contexts where we could easily detect collisions, such as function names, but would be harder to enforce in broader contexts like type names, enum variant names, etc.
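
As a rough illustration of the hashing idea, the sketch below truncates a 64-bit hash to 60 bits and checks a closed set of function names for collisions. FNV-1a is used here only as a dependency-free stand-in; the thread does not say which hash would be chosen, and `hash60` and `build_dispatch_table` are hypothetical names.

```rust
use std::collections::HashMap;

/// Keep only the low 60 bits so the value fits the Symbol payload.
const MASK_60: u64 = (1u64 << 60) - 1;

/// 64-bit FNV-1a truncated to 60 bits (a stand-in hash, chosen only
/// because it needs no external crates).
fn hash60(name: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in name.as_bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h & MASK_60
}

/// Build a hash -> name table for a known set of function names.
/// Returns Err with the colliding pair if two distinct names share a hash.
fn build_dispatch_table<'a>(
    names: &[&'a str],
) -> Result<HashMap<u64, &'a str>, (&'a str, &'a str)> {
    let mut table = HashMap::new();
    for &name in names {
        let h = hash60(name);
        if let Some(&prev) = table.get(&h) {
            if prev != name {
                return Err((prev, name));
            }
        }
        table.insert(h, name);
    }
    Ok(table)
}
```

This kind of check is only possible when the full set of names is known in one place, as with a contract's exported functions. For type names or enum variants spread across contracts there is no single table to check against.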

@leighmcculloch
Member Author

Another option I've heard discussed (cc @graydon?) is that we store the full length strings in a section of the WASM and reference them somehow. The host would be able to use those references to load the full string out of the data section of the WASM so we would only ever be transmitting to/from the host a handle.

@tomerweller

Thanks. It seems like a lot of developers are running into the 10-char limit, based on questions on Discord.

@graydon
Contributor

graydon commented Sep 22, 2022

I think there are several viable options here but they all have certain challenges:

  1. We use hashes. This makes symbols useless for diagnostics / debugging, which I think is actually fairly serious. At present they can be used fairly casually and informatively, provide useful meaning. We also have to deal with collision probability separate from the issue of just organizing namespaces. Two names might collide / be equal in hash-space even if they look superficially different in source code.
  2. We put names in data sections of contracts and then reference them by linear-memory address or wasm global number or whatever. This makes symbols relative to a specific wasm, which has lots of problems: they can't be compared in any context outside of host functions while their owning guest VM is active (not in the guest, not in stellar-core outside of an activation, not in XDR function-invocation transaction inputs, not when captured in error buffers or emitted-events). It also represents a bit of an implicit information-flow / trust-risk between wasms: if wasm A passes a symbol to wasm B, it's implicitly telling wasm B it should read bytes out of A's linear memory, which might be cleverly malformed in such a way as to confuse B.
  3. We put names in the enclosing transaction (or operation) XDR and then reference them by number inside the wasm. This is maybe a bit easier in some contexts but harder in others. Mostly same sort of concerns as case 2.
  4. We expand the size of RawVal to 2 words, letting us have bigger symbols overall. This has a number of advantages: it lets us change other cramped RawVal cases, such as having a richer "default number type" such as "fixed96" (fixed-point 96 bit with 8 bit scale factor, covering all reasonable "money-like numbers" as well as subsuming u63 and the i64 & u64 boxed types as special cases -- this has been the default number type in Excel for ages), pushes the symbol length-limit up to something more like 16 (easily) or 20 (awkwardly) characters, and potentially gives us a codesize win somewhere down the line (though not immediately, see following drawbacks). This is still on balance probably my favourite option, and it's possible today but has one big drawback: it requires us to commit to today's core-wasm not-especially-great ABI for returning small structs (i.e. a 2-word RawVal) from host functions -- the guest allocates a bit of scratch linear memory, passes its address, then the host writes the RawVal into that location -- because the "multivalues" wasm extension required to do this nicely is not presently 100% supported by rust and/or llvm. In the future when multivalue is supported we'd then probably wind up having to support separate sets of dispatch functions and separate ABIs for wasm code. Not a killer issue, but annoying. It would also only get us to 16 characters, and maybe that's not really enough for users anyways?

None of these are great but as I say I think I'd lean towards case 4 if the various wrinkles involved are all ironable-out (and if 16 or 20 chars is "enough"). WDYT?
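
The scratch-memory return convention described in option 4 can be sketched roughly as follows. This is a simulation in plain Rust, not real wasm host code: `LinearMemory` stands in for the guest's linear memory, and `RawVal2`, `host_return_rawval2`, and `guest_read_rawval2` are hypothetical names. The little-endian two-u64 layout is also an assumption.

```rust
/// Hypothetical 2-word value: two u64 words, 16 bytes in memory.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RawVal2 {
    lo: u64,
    hi: u64,
}

/// Stand-in for the guest's linear memory.
struct LinearMemory {
    bytes: Vec<u8>,
}

impl LinearMemory {
    fn new(size: usize) -> Self {
        Self { bytes: vec![0; size] }
    }

    /// Bounds-checked little-endian write of one word.
    fn write_u64(&mut self, addr: u32, v: u64) -> Option<()> {
        let start = addr as usize;
        let end = start.checked_add(8)?;
        self.bytes.get_mut(start..end)?.copy_from_slice(&v.to_le_bytes());
        Some(())
    }

    /// Bounds-checked little-endian read of one word.
    fn read_u64(&self, addr: u32) -> Option<u64> {
        let start = addr as usize;
        let end = start.checked_add(8)?;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(self.bytes.get(start..end)?);
        Some(u64::from_le_bytes(buf))
    }
}

/// "Host" side: instead of returning the struct directly, write it to the
/// scratch address the guest passed in.
fn host_return_rawval2(mem: &mut LinearMemory, out_addr: u32, val: RawVal2) -> Option<()> {
    mem.write_u64(out_addr, val.lo)?;
    mem.write_u64(out_addr.checked_add(8)?, val.hi)
}

/// "Guest" side: read the two words back from the scratch slot.
fn guest_read_rawval2(mem: &LinearMemory, addr: u32) -> Option<RawVal2> {
    Some(RawVal2 {
        lo: mem.read_u64(addr)?,
        hi: mem.read_u64(addr.checked_add(8)?)?,
    })
}
```

The guest picks an address, the host writes 16 bytes there, and the guest reads them back. With multivalue support the two words could instead be returned directly, which is why a second ABI might be needed later.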

@graydon
Contributor

graydon commented Sep 22, 2022

Investigating a little more, it seems likely rust is using an ABI similar-to (or identical-to) https://github.com/WebAssembly/tool-conventions/blob/main/BasicCABI.md which .. we could probably adapt to on the dispatch-function side. One interesting thing that occurs to me here -- especially around trying to recover the codesize difference and avoid passing around huge values with only a few bits set -- is that .. it'd potentially open up a bit of design-space for, say, a 128-bit or even 256-bit BigRawVal and a 32-bit SmallRawVal mixed together in arg-lists, without a lot of additional effort, since the BigRawVal would be passed by-address, and a linear-memory address is 32 bit. So N args -- be they BigRawVal or SmallRawVal -- would still be N u32 args, just with different (type-directed) unmarshalling. Then you'd only need to pass BigRawVals for args that needed them: symbols, maybe hashes or giant crypto numbers or such, but not small numbers, booleans, statics, error codes, object handles or the like, those could all remain as u32s.
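
A minimal sketch of the "N u32 args with type-directed unmarshalling" idea, assuming a 256-bit BigRawVal passed by linear-memory address. The types `ArgKind` and `Arg` and the function `unmarshal` are hypothetical and exist only for this illustration.

```rust
/// What the callee's signature says about each argument position.
enum ArgKind {
    Small, // the u32 word is the value itself
    Big,   // the u32 word is an address of a 32-byte value
}

/// An unmarshalled argument.
#[derive(Debug, PartialEq)]
enum Arg {
    Small(u32),
    Big([u8; 32]),
}

/// Turn N raw u32 words into N arguments, guided by the expected kinds.
/// `mem` stands in for the guest's linear memory.
fn unmarshal(mem: &[u8], kinds: &[ArgKind], words: &[u32]) -> Option<Vec<Arg>> {
    if kinds.len() != words.len() {
        return None;
    }
    kinds
        .iter()
        .zip(words)
        .map(|(kind, &word)| match kind {
            ArgKind::Small => Some(Arg::Small(word)),
            ArgKind::Big => {
                // Bounds-checked read of 32 bytes at the given address.
                let start = word as usize;
                let end = start.checked_add(32)?;
                let mut buf = [0u8; 32];
                buf.copy_from_slice(mem.get(start..end)?);
                Some(Arg::Big(buf))
            }
        })
        .collect()
}
```

Every argument crosses the boundary as one u32, so the arity of the call is unchanged. Only the positions declared `Big` cost a memory read.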

@leighmcculloch
Member Author

I think 20 characters make a huge difference over 10. I agree with all the downsides in every approach though, and I'm not convinced the solutions we know of are a good tradeoff to living with the symbols we have.

@graydon
Contributor

graydon commented Sep 22, 2022

Following this along: we could make BigRawVal be, say, a u32 tag + u256 body and SmallRawVal be a packed u8 tag and a u24 body. Plenty of room for object references -- no contract is going to make more than 16 million objects -- and fine for all the statics and error codes and so forth. Then you could pass hashes and 256-bit bignums as BigRawVal, as well as 42-char symbols (or even simplify things and relax the definition of "symbol" from six-bit [a-zA-Z0-9_]-character strings to "any 32 bytes you like" or "UTF-8 strings" or such).
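
The numbers in this proposal can be checked with a small sketch. It assumes the SmallRawVal tag sits in the top 8 bits of the u32 (the thread does not specify the bit layout), and the function names are hypothetical.

```rust
const BODY_BITS: u32 = 24;
const BODY_MASK: u32 = (1 << BODY_BITS) - 1; // 16_777_215, i.e. ~16 million values

/// Pack a u8 tag and a 24-bit body into one u32.
/// Placing the tag in the top byte is an assumption for this sketch.
fn pack_small(tag: u8, body: u32) -> Option<u32> {
    if body > BODY_MASK {
        return None;
    }
    Some(((tag as u32) << BODY_BITS) | body)
}

/// Split a packed u32 back into (tag, body).
fn unpack_small(v: u32) -> (u8, u32) {
    ((v >> BODY_BITS) as u8, v & BODY_MASK)
}

/// How many six-bit characters fit in a body of the given width.
const fn max_six_bit_chars(body_bits: u32) -> u32 {
    body_bits / 6
}
```

With six bits per character, `max_six_bit_chars(60)` is 10 (today's Symbol) and `max_six_bit_chars(256)` is 42, which matches the 42-char figure for a 256-bit body.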

@leighmcculloch
Member Author

leighmcculloch commented Sep 22, 2022

This model sounds compelling. Could these values be used everywhere Symbols could be used today, or are we talking only for function inputs?

@leighmcculloch
Member Author

As a holdover / alternative to making a change to RawVal/Symbol, I've opened a change that makes the error people see when their Symbol is too long easier to digest: stellar/rs-soroban-sdk#655.

@graydon
Contributor

graydon commented Sep 22, 2022

@leighmcculloch it'd work .. I think pretty much everywhere. It might actually be hard to pass data originating in the host into the guest, like for symbols or hashes as args in incoming contract-invocation calls, since we'd need to allocate some guest memory to deposit the incoming BigRawVals into. There might be a way to make this work, not sure. Will think about it more.

Anyway, in general "small structs that are multiple words long" are a normal thing in Rust, in either the Rust spoken in the world of the host or the world of the guest. The awkward interface we're dealing with in soroban is just "the wasm guest-to-host and host-to-guest invocation interfaces, and whatever ABI / calling conventions make of them".

We're currently dealing with that interface as a thing that's easiest to traverse with "a sequence of repr(transparent) u64 rust values" because .. that's easy to predict the mapping of / bidirectionally map: each u64 word in the Rust arg list maps 1:1 to a u64 word in the wasm invocation interfaces. A more complex structure is possible if (big if) we can figure out what Rust's going to generate on the guest side and intercept-and-unpack it in the dispatch functions on the guest side.

@graydon
Contributor

graydon commented Sep 22, 2022

(Update: yeah, it looks like we can just push the guest stack pointer from the host, so we can put things on its linear memory before calling it, so it should basically work everywhere)

@graydon
Contributor

graydon commented Sep 22, 2022

It would also generally increase guest codesize to be throwing around lots of 256-bit / 4-word values in the guest. In the host it probably won't register much, but I'd expect it could be problematic in the guest if a large majority of values are of this sort. Of course, if most values are still SmallRawVal u32s it might be fairly harmless (polymorphic containers probably only happen on the host anyway -- how often does a user deal with an unknown-concrete-type RawVal?), or it might even be a win in some cases. I'm not totally sure about bignum-and-small-hash-bytes cases. I think for most ops you'll be getting them done cheaper on the host and passing around an object reference by small-number in the guest, but there might be some patterns that have the opposite character. It's not easy to know exactly where to draw the boundary. I'd be happy to take the experiment further (assuming we ever have adequate time!)

@leighmcculloch
Member Author

I'd be happy to take the experiment further (assuming we ever have adequate time!)

Given how much better symbol-too-long errors are since stellar/rs-soroban-sdk#655, I don't think we should rush the solution for this. Maybe this is a good thing to explore after we get FutureNet deployed.

@leighmcculloch
Member Author

leighmcculloch commented Sep 22, 2022

if most values are still SmallRawVal u32s it might be fairly harmless (polymorphic containers probably only happen on the host anyway -- how often does a user deal with an unknown-concrete-type RawVal?)

I see a lot of UDT and rich types in our examples. We're heavily using BigInt, Map, and Vec types to the point where any optimizations we may have designed for i64, u32, or i32 seem somewhat meaningless. The only small type we use frequently is u32 and that's because we use it internally in host functions like vec_len, and those values don't even need to be RawVals.

Given that, I think maybe we should rethink the 7-bit tag space. The 1-bit u63 might be better spent on something else. The u32 bit, i32 bit, and the BitSet bit also seem to be relatively unused, and these types could all go. That's 4 bits of the 7 bits I don't think we're leveraging enough to warrant them.

Would it be a better tradeoff to have that BigRawVal more often, or host fns more often? Right now we're doing a lot of host fns instead of BigRawVal.

@graydon
Contributor

graydon commented Sep 22, 2022

Nit: it's not a "7-bit" tag-space, it's a 1 + 3 = 4-bit tag-space, with 1 primary case of u63/others then 8 other-cases switched on the next 3 bits. You're saying we can probably part with some of those cases, but it's only really useful to free up a full bit at a time, i.e. part with powers-of-2 worth. If there are 4 cases to lose, we could free up a bit and move to 2-bit other-tags say, but .. eh .. 1 extra bit doesn't win us anything in the data payloads.
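
The 1 + 3 bit tag-space can be sketched as below. The exact bit positions (low bit as the u63 discriminator, next three bits as the other-case tag, remaining 60 bits as the body) are an assumption made for illustration, and `Decoded`, `decode`, `encode_u63`, and `encode_tagged` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Decoded {
    /// Primary case: a 63-bit unsigned number.
    U63(u64),
    /// One of 8 other cases, selected by a 3-bit tag, with a 60-bit body.
    Tagged { tag: u8, body: u64 },
}

/// Decode a 64-bit word under the assumed layout.
fn decode(raw: u64) -> Decoded {
    if raw & 1 == 0 {
        Decoded::U63(raw >> 1)
    } else {
        Decoded::Tagged {
            tag: ((raw >> 1) & 0b111) as u8,
            body: raw >> 4,
        }
    }
}

/// Encode a number that fits in 63 bits.
fn encode_u63(v: u64) -> Option<u64> {
    if v >> 63 != 0 {
        None
    } else {
        Some(v << 1)
    }
}

/// Encode one of the 8 tagged cases with a body of at most 60 bits.
fn encode_tagged(tag: u8, body: u64) -> Option<u64> {
    if tag > 7 || body >> 60 != 0 {
        return None;
    }
    Some((body << 4) | ((tag as u64) << 1) | 1)
}
```

Under this layout the tagged body is 60 bits, which is where the 10-character Symbol comes from. Dropping to a 2-bit other-tag would give a 61-bit body, and 61 / 6 is still 10 six-bit characters, so the extra bit buys nothing for symbols.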

Anyway, I agree all this is post-FutureNet. It's something I wanted to spend a little time revising my understanding of, and I spent that time yesterday, I'm happy to set it down until late-fall / pre-finalization. This should all be fairly invisible to users, aside from "some codesize improvements, fewer size restrictions on symbols, and maybe fewer random unused datatypes".

(Out of all of this I'm most interested in the various ways large-but-not-absurd numbers -- u256 / literal-32-byte-binary -- get used, and how people would want to use them if we could anticipate their preferences / support the uses adequately)

@graydon
Contributor

graydon commented Nov 22, 2022

See #584

@graydon
Contributor

graydon commented Mar 9, 2023

As of #682 this is done, we support up to 32-char symbols (and that limit is somewhat arbitrary, just set to "something reasonable" for keys/topics/function names and local u8 buffers to hold them in no-alloc builds)
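
A minimal sketch of what a 32-character limit with a fixed local buffer might look like. It is not the SDK's actual type: `SymbolBuf` and `SymbolError` are hypothetical, and the character set is assumed to carry over unchanged from the original a-zA-Z0-9_ definition.

```rust
/// The limit mentioned above; described there as somewhat arbitrary.
const MAX_SYMBOL_LEN: usize = 32;

/// Fixed-size storage, so no heap allocation is needed.
#[derive(Debug)]
struct SymbolBuf {
    bytes: [u8; MAX_SYMBOL_LEN],
    len: usize,
}

#[derive(Debug, PartialEq)]
enum SymbolError {
    TooLong(usize),
    BadChar(char),
}

impl SymbolBuf {
    /// Validate length and characters, then copy into the local buffer.
    fn new(s: &str) -> Result<Self, SymbolError> {
        if s.len() > MAX_SYMBOL_LEN {
            return Err(SymbolError::TooLong(s.len()));
        }
        // Assumed character set: ASCII letters, digits, and underscore.
        if let Some(c) = s.chars().find(|c| !(c.is_ascii_alphanumeric() || *c == '_')) {
            return Err(SymbolError::BadChar(c));
        }
        let mut bytes = [0u8; MAX_SYMBOL_LEN];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Ok(Self { bytes, len: s.len() })
    }

    /// View the stored characters as a string slice.
    fn as_str(&self) -> &str {
        core::str::from_utf8(&self.bytes[..self.len]).unwrap_or("")
    }
}
```

With a 32-character limit, a name like `contract_id` no longer has to be abbreviated to `contrct_id`.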

graydon closed this as completed Mar 9, 2023