initial compact serialization proposal #131
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,6 @@ | |
|
||
This document serves as an entry point for understanding all of the data structures in filecoin. | ||
|
||
TODO: this should also include, or reference, how each data structure is serialized precisely. | ||
|
||
## Address | ||
|
||
|
@@ -29,6 +28,7 @@ For most objects referenced by Filecoin, a Content Identifier (CID for short) is | |
|
||
CIDs are serialized by applying binary multibase encoding, then encoding that as a CBOR byte array with a tag of 42. | ||
|
||
|
||
## Block | ||
|
||
A block represents an individual point in time that the network may achieve consensus on. It contains (via merkle links) the full state of the system, references to the previous state, and some notion of a 'weight' for deciding which block is the 'best'. | ||
|
@@ -71,10 +71,6 @@ type Block struct { | |
} | ||
``` | ||
|
||
### Serialization | ||
|
||
Blocks are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names. | ||
|
||
## Message | ||
|
||
```go | ||
|
@@ -85,9 +81,9 @@ type Message struct { | |
// When receiving a message from a user account the nonce in | ||
// the message must match the expected nonce in the from actor. | ||
// This prevents replay attacks. | ||
Nonce Integer | ||
Nonce Uint64 | ||
|
||
Value Integer | ||
Value BigInteger | ||
|
||
GasPrice Integer | ||
GasLimit Integer | ||
|
@@ -100,6 +96,7 @@ type Message struct { | |
### Parameter Encoding | ||
|
||
Parameters to methods get encoded as described in the [basic types](#basic-type-encodings) section below, and then put into a CBOR encoded array. | ||
(TODO: thinking about this, it might make more sense to just have `Params` be an array of things) | ||
|
||
### Signing | ||
|
||
|
@@ -114,9 +111,17 @@ type SignedMessage struct { | |
|
||
The signature is a serialized signature over the serialized base message. For more details on how the signature itself is done, see the [signatures spec](signatures.md). | ||
|
||
### Serialization | ||
## MessageReceipt | ||
|
||
```go | ||
type MessageReceipt struct { | ||
ExitCode uint8 | ||
|
||
Return [][]byte | ||
|
||
Messages and SignedMessages are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names. | ||
GasUsed BigInteger | ||
} | ||
``` | ||
|
||
## Message Receipt | ||
|
||
|
@@ -147,20 +152,13 @@ type Actor struct { | |
Head Cid | ||
|
||
// Nonce is a counter of the number of messages this actor has sent | ||
Nonce Integer | ||
Nonce Uint64 | ||
|
||
// Balance is this actor's current balance of filecoin | ||
Balance AttoFIL | ||
Balance BigInteger | ||
} | ||
``` | ||
|
||
|
||
|
||
|
||
### Serialization | ||
|
||
Actors are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names. | ||
|
||
## State Tree | ||
|
||
The state trie keeps track of all state in Filecoin. It is a map of addresses to `actors` in the system. It is implemented using a HAMT. | ||
|
@@ -169,6 +167,7 @@ The state trie keeps track of all state in Filecoin. It is a map of addresses to | |
|
||
TODO: link to spec for our CHAMP HAMT | ||
|
||
|
||
# Basic Type Encodings | ||
|
||
Types that appear in messages or in state must be encoded as described here. | ||
|
@@ -295,3 +294,41 @@ if ((shift <size) && (sign bit of byte is set)) | |
/* sign extend */ | ||
result |= - (1 << shift); | ||
``` | ||
|
||
# Filecoin Compact Serialization | ||
|
||
Data structures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag. | ||
> is it a goal to restrict the cbor data item types, eg to not encode to maps? if so we should say so.

> what do you mean?

> is it the case that we will ever encode something to major type 5? if not and there is a design reason for this eg concern about suitability of key types across languages then we should say so. also if we are not going to use type 5 they can turn it off for safety.

> I think that for the filecoin compact encoding format, we can leave that out. Though for the general cbor ipld stuff, people will still want to use that (i.e. in actor storage). For the purpose of this doc, i think we could explicitly say it's not used and should invalidate any object

> should note that we cannot distinguish between unset/missing and zero values given this encoding (which is fine). the encoder must always encode every field for every object. maybe that is obvious but doesn't hurt to be explicit.

> Yeah, probably worth pointing out. I did like that this scheme removed that ambiguity.
||
|
||
> this encoding scheme doesn't allow for forward compatibility so for example if we have a new object version old binaries are not going to be able to read it. this as opposed to a scheme where fields are tagged and old implementations can just ignore new ones they don't understand. this is a feature i'm guessing, for size efficiency?

> this does allow for forward compatibility. Every object is tagged, if you change an object, you pick a new tag. In general in Filecoin, old binaries will not be able to understand new protocol formats, and it won't make sense for them to try to do so. though maybe i can see a case where a thing wants to try and read messages, and if we were to add a new field for some weird reason, it wouldn't be able to read the old fields. I don't consider that much of an issue though, as my main concern is that the old impl knows the object is a different type, even if it can't understand it.
||
| FCS Type | tag | | ||
|---|---| | ||
> lets have a very specific name for the left hand side that we use nowhere else so we can refer to things of that type unambiguously. is it object?

> how about 'FCE type' for Filecoin Compact Encoding type?

> sure, anything that is unique like that works

> implementers will need a bunch of examples to ensure their implementation is working. uh, including us i suppose. also prolly a good idea to link to something that makes it easy to decode cbor to diagnostic output.

> I'll link this somewhere too, but this tool is really great for debug printing cbor, or even converting it to json
|
||
| block v1 | 43 | | ||
> Any chance we could use multicodecs? Too big?

> we could use multicodecs, but i was just trying to fit into the cbor tags table. Unclear if thats worth doing...

> I’ll +1 using multicodecs with the caveat that I’m not sure offhand what the difference in size would be.

> @mikeal Yeah, Using multicodecs would be nice. Though it moves us out of the realm of using cbor, and into using our own custom encoding format.

> given that we already use cbor I would really like to stick with cbor tags, as this ensures cbor implementations are compatible out of the box and this is well defined cbor making other tools work well
||
| message v1 | 44 | | ||
| signedMessage v1 | 45 | | ||
|
||
For example, a message would be encoded as: | ||
|
||
```cbor | ||
tag<44>[msg.To, msg.From, msg.Nonce, msg.Value, msg.Method, msg.Params] | ||
> what about changes over time? would that result in a new tag? or should there be a version field?

> new tag

> Is that really worth the few bytes? The object's already going to be at least 55 bytes, likely much more. We can turn this into a map of numeric tags to fields with 5 additional bytes (a single byte key per field). Really, it's not a huge deal but something to consider.

> The 5 bytes aren't really worth it for the block, but they are for the messages, and I wanted to have a consistent format. messages (with IDs) using this format will likely be under 20 bytes, and messages will take up the bulk of data being moved around. adding another 5 bytes is rather painful. Also, does a map of numeric tags to fields really buy us that much? it ends up looking like a hacky array. (though i guess it does allow us to leave fields out)

> Under 20 bytes? Are you somehow skipping fields?
> Really, this just allow us to write

> To and From fields both around 4-5 bytes, nonce will be 1 byte most of the time, value will be a few bytes, up to 8 bytes in some bad cases, but hopefully around 2-3 bytes if we can help it. Method is one byte, and then params will be more. But a simple 'send money' transaction won't have any params.

> @Stebalien also, with our own custom CID type, we can translate nice pathing schemes out of it. this is how we do pathing through other blockchain formats (and unixfs :/ )

> i'm a little confused by the array notation in here, does it literally mean a major type 4 array to hold all the data items or does it denote just the sequence of data items? the difference is the presence or absence of major type 4 data item following the outer object tag, which contains the elements of the array.

> Yeah, major type 4 to hold all the data items, that way it's a valid cbor object.
||
``` | ||
|
||
Each individual type should be encoded as specified: | ||
|
||
| type | encoding | | ||
whyrusleeping marked this conversation as resolved.
|
||
| --- | ---- | | ||
| Uint64 | CBOR major type 0 | | ||
whyrusleeping marked this conversation as resolved.
|
||
| BigInteger | [CBOR bignum](https://tools.ietf.org/html/rfc7049#section-2.4.2) | | ||
| Address | CBOR major type 2 | | ||
| Uint8 | CBOR Major type 0 | | ||
| []byte | CBOR Major type 2 | | ||
| string | CBOR Major type 3 | | ||
> missing boolean values ( not yet used but a good thing to have around I hear)
||
| bool | [CBOR Major type 7, value 20/21](https://tools.ietf.org/html/rfc7049#section-2.3) | | ||
|
||
## Encoding Considerations | ||
|
||
Objects should be encoded using [canonical CBOR](https://tools.ietf.org/html/rfc7049#section-3.9), and decoders should operate in [strict mode](https://tools.ietf.org/html/rfc7049#section-3.10). The maximum size of an FCS Object should be 1 MiB (2^20 bytes). Objects larger than this are invalid. | ||
|
||
Additionally, CBOR Major type 5 is not used. If an FCS object contains it, that object is invalid. | ||
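A decoder could enforce the two validity rules above before doing any real decoding. The Go sketch below, with hypothetical names, checks the size limit and walks the data items to reject any major type 5 map. It also rejects indefinite-length items, which canonical CBOR excludes, but it does not verify shortest-form integers or limit nesting depth, so it is only a partial strict-mode check.

```go
package main

import (
	"errors"
	"fmt"
)

const maxObjectSize = 1 << 20 // 2^20 bytes, the limit stated in the text

var errMap = errors.New("CBOR major type 5 (map) is not allowed")

// readHead parses one CBOR head and returns the major type, its argument,
// and the bytes that follow the head.
func readHead(b []byte) (byte, uint64, []byte, error) {
	if len(b) == 0 {
		return 0, 0, nil, errors.New("unexpected end of input")
	}
	major, info := b[0]>>5, b[0]&0x1f
	b = b[1:]
	if info < 24 {
		return major, uint64(info), b, nil
	}
	if info > 27 {
		return 0, 0, nil, errors.New("reserved or indefinite-length head")
	}
	n := 1 << (info - 24) // 1, 2, 4, or 8 argument bytes
	if len(b) < n {
		return 0, 0, nil, errors.New("truncated head")
	}
	var arg uint64
	for _, c := range b[:n] {
		arg = arg<<8 | uint64(c)
	}
	return major, arg, b[n:], nil
}

// walk skips over one complete data item and returns the remaining bytes.
func walk(b []byte) ([]byte, error) {
	major, arg, rest, err := readHead(b)
	if err != nil {
		return nil, err
	}
	switch major {
	case 0, 1, 7: // integers, simple values, floats: the head is the whole item
		return rest, nil
	case 2, 3: // byte and text strings: skip the payload
		if arg > uint64(len(rest)) {
			return nil, errors.New("truncated string")
		}
		return rest[arg:], nil
	case 4: // array: walk each element in turn
		for i := uint64(0); i < arg; i++ {
			if rest, err = walk(rest); err != nil {
				return nil, err
			}
		}
		return rest, nil
	case 5:
		return nil, errMap
	default: // major type 6: a tag wraps exactly one item
		return walk(rest)
	}
}

// validateFCS applies the size limit and the no-maps rule to one encoded object.
func validateFCS(data []byte) error {
	if len(data) > maxObjectSize {
		return errors.New("object larger than 2^20 bytes")
	}
	rest, err := walk(data)
	if err == nil && len(rest) != 0 {
		err = errors.New("trailing bytes after object")
	}
	return err
}

func main() {
	msg := []byte{0xd8, 0x2c, 0x86, 0x42, 1, 2, 0x42, 3, 4, 1, 0xc2, 0x42, 3, 0xe8, 0, 0x40}
	fmt.Println(validateFCS(msg))                            // tagged array of six fields
	fmt.Println(validateFCS([]byte{0xd8, 0x2c, 0x81, 0xa0})) // array containing a map
}
```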
|
||
## IPLD Considerations | ||
|
||
CIDs for FCS objects should use the FCS multicodec (`0x1f`). |
> i think we need to say that encodings must follow the rules for canonicalization per https://tools.ietf.org/html/rfc7049#section-3.9 and that decoders should use strict mode https://tools.ietf.org/html/rfc7049#section-3.10. also decoders should disable unused features like streaming for security's sake, and set explicit limits (suggestions?) on the size of byte arrays, lists etc as a precaution against dos.

> absolutely

> i would set limits on sizes of objects, canonicalize, strict decoding, and so on. i'll make a mention of that in the doc