initial compact serialization proposal #131

Merged
merged 4 commits into from
Mar 23, 2019
73 changes: 55 additions & 18 deletions data-structures.md
@@ -2,7 +2,6 @@

This document serves as an entry point for understanding all of the data structures in filecoin.

TODO: this should also include, or reference, how each data structure is serialized precisely.

## Address

@@ -29,6 +28,7 @@ For most objects referenced by Filecoin, a Content Identifier (CID for short) is

CIDs are serialized by applying binary multibase encoding, then encoding that as a CBOR byte array with a tag of 42.
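
A minimal Go sketch of the CID serialization described above, assuming the "binary multibase" prefix is the single byte `0x00` (as in IPLD's CBOR format) and that the payload is shorter than 256 bytes. The helper names and the placeholder CID bytes are hypothetical, not part of the spec.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// encodeCIDField wraps raw CID bytes as: CBOR tag 42, then a byte string
// holding the binary multibase prefix (0x00) followed by the CID bytes.
// Sketch limit: payloads of 256 bytes or more are rejected.
func encodeCIDField(cid []byte) ([]byte, error) {
	// +1 accounts for the multibase prefix byte.
	payloadLen := len(cid) + 1
	// 0xd8 0x2a is major type 6 (tag) with the one-byte tag value 42.
	out := []byte{0xd8, 0x2a}
	switch {
	case payloadLen < 24:
		// Major type 2 with the length stored in the initial byte.
		out = append(out, 0x40|byte(payloadLen))
	case payloadLen < 256:
		// Major type 2 with a one-byte length.
		out = append(out, 0x58, byte(payloadLen))
	default:
		return nil, fmt.Errorf("payload of %d bytes is too long for this sketch", payloadLen)
	}
	out = append(out, 0x00)
	return append(out, cid...), nil
}

// encodeCIDFieldHex returns the encoding as hex, or "" on error.
func encodeCIDFieldHex(cid []byte) string {
	enc, err := encodeCIDField(cid)
	if err != nil {
		return ""
	}
	return hex.EncodeToString(enc)
}

func main() {
	// Placeholder bytes, not a real CID.
	fmt.Println(encodeCIDFieldHex([]byte{0x01, 0x1f, 0x12}))
}
```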


## Block

A block represents an individual point in time that the network may achieve consensus on. It contains (via merkle links) the full state of the system, references to the previous state, and some notion of a 'weight' for deciding which block is the 'best'.
@@ -71,10 +71,6 @@ type Block struct {
}
```

### Serialization

Blocks are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names.

## Message

```go
@@ -85,9 +81,9 @@ type Message struct {
// When receiving a message from a user account the nonce in
// the message must match the expected nonce in the from actor.
// This prevents replay attacks.
Nonce Integer
Nonce Uint64

Value Integer
Value BigInteger

GasPrice Integer
GasLimit Integer
@@ -100,6 +96,7 @@ type Message struct {
### Parameter Encoding

Parameters to methods get encoded as described in the [basic types](#basic-type-encodings) section below, and then put into a CBOR encoded array.
(TODO: thinking about this, it might make more sense to just have `Params` be an array of things)

### Signing

@@ -114,9 +111,17 @@ type SignedMessage struct {

The signature is a serialized signature over the serialized base message. For more details on how the signature itself is done, see the [signatures spec](signatures.md).

### Serialization
## MessageReceipt

```go
type MessageReceipt struct {
ExitCode uint8

Return [][]byte

Messages and SignedMessages are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names.
GasUsed BigInteger
}
```

## Message Receipt

@@ -147,20 +152,13 @@ type Actor struct {
Head Cid

// Nonce is a counter of the number of messages this actor has sent
Nonce Integer
Nonce Uint64

// Balance is this actor's current balance of filecoin
Balance AttoFIL
Balance BigInteger
}
```




### Serialization

Actors are currently serialized simply by CBOR marshaling them, using lower-camel-cased field names.

## State Tree

The state trie keeps track of all state in Filecoin. It is a map of addresses to `actors` in the system. It is implemented using a HAMT.
@@ -169,6 +167,7 @@ The state trie keeps track of all state in Filecoin. It is a map of addresses to

TODO: link to spec for our CHAMP HAMT


# Basic Type Encodings

Types that appear in messages or in state must be encoded as described here.
@@ -295,3 +294,41 @@ if ((shift <size) && (sign bit of byte is set))
/* sign extend */
result |= - (1 << shift);
```
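
The fragment above looks like the tail of the standard signed LEB128 decoding loop. Assuming that is the scheme in use, a Go sketch of the full decoder might look like this (the function name is hypothetical, and the result is limited to 64 bits):

```go
package main

import (
	"errors"
	"fmt"
)

// decodeSLEB128 decodes one signed LEB128 integer from the start of buf.
// It returns the value and the number of bytes consumed.
func decodeSLEB128(buf []byte) (int64, int, error) {
	var result int64
	var shift uint
	for i, b := range buf {
		if shift >= 64 {
			return 0, 0, errors.New("value does not fit in 64 bits")
		}
		// The low 7 bits of each byte carry payload.
		result |= int64(b&0x7f) << shift
		shift += 7
		if b&0x80 == 0 {
			// Last byte: sign extend if its sign bit (0x40) is set.
			if shift < 64 && b&0x40 != 0 {
				result |= -(int64(1) << shift)
			}
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("truncated input")
}

func main() {
	v, n, err := decodeSLEB128([]byte{0xc0, 0xbb, 0x78})
	fmt.Println(v, n, err)
}
```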

# Filecoin Compact Serialization

Data structures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.
Contributor

i think we need to say that encodings must follow the rules for canonicalization per https://tools.ietf.org/html/rfc7049#section-3.9 and that decoders should use strict mode https://tools.ietf.org/html/rfc7049#section-3.10. also decoders should disable unused features like streaming for security's sake, and set explicit limits (suggestions?) on the size of byte arrays, lists etc as a precaution against dos.

Member Author

absolutely

Member Author

@whyrusleeping whyrusleeping Feb 23, 2019

i would set limits on sizes of objects, canonicalize, strict decoding, and so on. i'll make a mention of that in the doc

  • TODO: this

Contributor

is it a goal to restrict the cbor data item types, eg to not encode to maps? if so we should say so.

Member Author

what do you mean?

Contributor

is it the case that we will ever encode something to major type 5? if not and there is a design reason for this eg concern about suitability of key types across languages then we should say so. also if we are not going to use type 5 they can turn it off for safety.

Member Author

I think that for the filecoin compact encoding format, we can leave that out. Though for the general cbor ipld stuff, people will still want to use that (i.e. in actor storage).

For the purpose of this doc, i think we could explicitly say its not used and should invalidate any object

Contributor

should note that we cannot distinguish between unset/missing and zero values given this encoding (which is fine). the encoder must always encode every field for every object. maybe that is obvious but doesn't hurt to be explicit.

Member Author

Yeah, probably worth pointing out. I did like that this scheme removed that ambiguity.


Contributor

@phritz phritz Feb 22, 2019

this encoding scheme doesn't allow for forward compatibility, so for example if we have a new object version, old binaries are not going to be able to read it. this is as opposed to a scheme where fields are tagged and old implementations can just ignore new ones they don't understand. this is a feature i'm guessing, for size efficiency?

Member Author

this does allow for forward compatibility. Every object is tagged, if you change an object, you pick a new tag.

In general in Filecoin, old binaries will not be able to understand new protocol formats, and it won't make sense for them to try to do so.

though maybe i can see a case where a thing wants to try and read messages, and if we were to add a new field for some weird reason, it wouldn't be able to read the old fields. I don't consider that much of an issue though, as my main concern is that the old impl knows the object is a different type, even if it can't understand it.

| FCS Type | tag |
|---|---|
Contributor

lets have a very specific name for the left hand side that we use nowhere else so we can refer to things of that type unambiguously. is it object?

Member Author

how about 'FCE type' for Filecoin Compact Encoding type?

Contributor

sure, anything that is unique like that works

Member Author

@whyrusleeping whyrusleeping Feb 23, 2019

  • Name this something

Contributor

implementers will need a bunch of examples to ensure their implementation is working. uh, including us i suppose. also prolly a good idea to link to something that makes it easy to decode cbor to diagnostic output.

Member Author

I'll link this somewhere too, but this tool is really great for debug printing cbor, or even converting it to json

Member Author

  • TODO: add examples and link diagnostics tool.

| block v1 | 43 |
Member

Any chance we could use multicodecs? Too big?

Member Author

we could use multicodecs, but i was just trying to fit into the cbor tags table. Unclear if thats worth doing...

I’ll +1 using multicodecs with the caveat that I’m not sure offhand what the difference in size would be.

Member Author

@mikeal Yeah, Using multicodecs would be nice. Though it moves us out of the realm of using cbor, and into using our own custom encoding format.

Contributor

given that we already use cbor I would really like to stick with cbor tags, as this ensures cbor implementations are compatible out of the box and this is well defined cbor making other tools work well

| message v1 | 44 |
| signedMessage v1 | 45 |

For example, a message would be encoded as:

```cbor
tag<44>[msg.To, msg.From, msg.Nonce, msg.Value, msg.Method, msg.Params]
Contributor

what about changes over time? would that result in a new tag? or should there be a version field?

Member Author

new tag

Member

Is that really worth the few bytes? The object's already going to be at least 55 bytes, likely much more. We can turn this into a map of numeric tags to fields with 5 additional bytes (a single byte key per field).

Really, it's not a huge deal but something to consider.

Member Author

The 5 bytes arent really worth it for the block, but they are for the messages, and I wanted to have a consistent format.

messages (with IDs) using this format will likely be under 20 bytes, and messages will take up the bulk of data being moved around. adding another 5 bytes is rather painful.

Also, does a map of numeric tags to fields really buy us that much? it ends up looking like a hacky array. (though i guess it does allow us to leave fields out)

Member

Under 20 bytes? Are you somehow skipping fields?

> Also, does a map of numeric tags to fields really buy us that much? it ends up looking like a hacky array. (though i guess it does allow us to leave fields out)

Really, this just allows us to write /ipld/... paths. However, we can also just introduce a new pathing scheme, /filecoin/ for filecoin paths. However, this does mean that graphsync will need to support filecoin paths out of the box.

Member Author

@Stebalien

To and From fields both around 4-5 bytes, nonce will be 1 byte most of the time, value will be a few bytes, up to 8 bytes in some bad cases, but hopefully around 2-3 bytes if we can help it. Method is one byte, and then params will be more. But a simple 'send money' transaction won't have any params.

Member Author

@Stebalien also, with our own custom CID type, we can translate nice pathing schemes out of it. this is how we do pathing through other blockchain formats (and unixfs :/ )

Contributor

i'm a little confused by the array notation in here, does it literally mean a major type 4 array to hold all the data items or does it denote just the sequence of data items? the difference is the presence or absence of a major type 4 data item following the outer object tag, which contains the elements of the array.

Member Author

Yeah, major type 4 to hold all the data items, that way its a valid cbor object.

```
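
To make the `tag<44>[...]` notation concrete, here is a rough Go sketch that wraps already-encoded fields in a major type 4 array and prepends the object tag, as confirmed in the thread above. The field values are placeholders, and treating `Method` as an unsigned integer and `Params` as an empty byte string is an assumption, not something the proposal states.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// wrapFCS prepends a CBOR tag and a definite-length array header (major
// type 4) to fields that are already CBOR-encoded, in declaration order.
// Sketch limits: the tag must be in 24..255 and there must be fewer than
// 24 fields, so both headers fit the short forms used here.
func wrapFCS(tag uint8, fields [][]byte) ([]byte, error) {
	if tag < 24 {
		return nil, errors.New("tags below 24 use a different header form")
	}
	if len(fields) >= 24 {
		return nil, errors.New("too many fields for this sketch")
	}
	// 0xd8 = tag with one-byte value; 0x80|n = array of n items.
	out := []byte{0xd8, tag, 0x80 | byte(len(fields))}
	for _, f := range fields {
		out = append(out, f...)
	}
	return out, nil
}

// exampleMessageHex encodes a made-up message with tag 44 ("message v1").
func exampleMessageHex() string {
	fields := [][]byte{
		{0x41, 0x01},       // To: placeholder 1-byte address (major type 2)
		{0x41, 0x02},       // From: placeholder 1-byte address (major type 2)
		{0x07},             // Nonce: 7 (major type 0)
		{0xc2, 0x41, 0x64}, // Value: bignum 100 (tag 2 + byte string)
		{0x00},             // Method: 0 (assumed unsigned integer)
		{0x40},             // Params: assumed empty byte string
	}
	enc, err := wrapFCS(44, fields)
	if err != nil {
		return ""
	}
	return hex.EncodeToString(enc)
}

func main() {
	fmt.Println(exampleMessageHex())
}
```

With these placeholder values the whole message comes to 13 bytes, which is in line with the "under 20 bytes" estimate discussed above.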

Each individual type should be encoded as specified:

| type | encoding |
| --- | ---- |
| Uint64 | CBOR major type 0 |
| BigInteger | [CBOR bignum](https://tools.ietf.org/html/rfc7049#section-2.4.2) |
| Address | CBOR major type 2 |
| Uint8 | CBOR Major type 0 |
| []byte | CBOR Major type 2 |
| string | CBOR Major type 3 |
Contributor

missing boolean values ( not yet used but a good thing to have around I hear)

| bool | [CBOR Major type 7, value 20/21](https://tools.ietf.org/html/rfc7049#section-2.3) |
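
The table above maps onto a handful of small encoders. The following Go sketch is one possible reading of it, using the shortest-form argument encoding that canonical CBOR requires; all function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math/big"
)

// encodeHead writes a CBOR initial byte plus the shortest argument encoding.
func encodeHead(major byte, n uint64) []byte {
	m := major << 5
	switch {
	case n < 24:
		return []byte{m | byte(n)}
	case n <= 0xff:
		return []byte{m | 24, byte(n)}
	case n <= 0xffff:
		out := []byte{m | 25, 0, 0}
		binary.BigEndian.PutUint16(out[1:], uint16(n))
		return out
	case n <= 0xffffffff:
		out := []byte{m | 26, 0, 0, 0, 0}
		binary.BigEndian.PutUint32(out[1:], uint32(n))
		return out
	default:
		out := []byte{m | 27, 0, 0, 0, 0, 0, 0, 0, 0}
		binary.BigEndian.PutUint64(out[1:], n)
		return out
	}
}

// Uint64 and Uint8: major type 0.
func encodeUint(n uint64) []byte { return encodeHead(0, n) }

// Address and []byte: major type 2.
func encodeBytes(b []byte) []byte { return append(encodeHead(2, uint64(len(b))), b...) }

// string: major type 3.
func encodeString(s string) []byte { return append(encodeHead(3, uint64(len(s))), s...) }

// bool: major type 7, simple value 20 (false) or 21 (true).
func encodeBool(v bool) []byte {
	if v {
		return []byte{0xf5}
	}
	return []byte{0xf4}
}

// BigInteger: tag 2 for non-negative values, tag 3 for negative values
// (which carry -1-v), followed by the big-endian magnitude as a byte string.
func encodeBigInt(v *big.Int) []byte {
	tag := byte(0xc2)
	mag := new(big.Int).Set(v)
	if v.Sign() < 0 {
		tag = 0xc3
		mag.Neg(mag).Sub(mag, big.NewInt(1))
	}
	return append([]byte{tag}, encodeBytes(mag.Bytes())...)
}

// hexOf and bigIntHex are small helpers for printing results.
func hexOf(b []byte) string { return hex.EncodeToString(b) }

func bigIntHex(decimal string) string {
	v, ok := new(big.Int).SetString(decimal, 10)
	if !ok {
		return ""
	}
	return hexOf(encodeBigInt(v))
}

func main() {
	fmt.Println(hexOf(encodeUint(500)), hexOf(encodeString("fil")), hexOf(encodeBool(true)))
	fmt.Println(bigIntHex("18446744073709551616"))
}
```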

## Encoding Considerations

Objects should be encoded using [canonical CBOR](https://tools.ietf.org/html/rfc7049#section-3.9), and decoders should operate in [strict mode](https://tools.ietf.org/html/rfc7049#section-3.10). The maximum size of an FCS object is 1 MiB (2^20 bytes); larger objects are invalid.

Additionally, CBOR Major type 5 is not used. If an FCS object contains it, that object is invalid.
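
As a rough illustration of these rules, the Go sketch below checks the size limit and walks an object, rejecting maps (major type 5) and indefinite-length items. It does not check every canonical-form rule (for example shortest-length arguments), and a production decoder would likely also cap nesting depth; the names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxFCSObjectSize is the 2^20 byte limit stated above.
const maxFCSObjectSize = 1 << 20

// validateFCS checks the size limit, then walks exactly one CBOR item.
func validateFCS(obj []byte) error {
	if len(obj) > maxFCSObjectSize {
		return errors.New("object exceeds 2^20 bytes")
	}
	rest, err := walkItem(obj)
	if err != nil {
		return err
	}
	if len(rest) != 0 {
		return errors.New("trailing bytes after object")
	}
	return nil
}

// walkItem consumes one CBOR data item and returns the remaining bytes.
func walkItem(b []byte) ([]byte, error) {
	if len(b) == 0 {
		return nil, errors.New("unexpected end of input")
	}
	major, info := b[0]>>5, b[0]&0x1f
	b = b[1:]
	var arg uint64
	switch {
	case info < 24:
		arg = uint64(info)
	case info <= 27:
		// 1, 2, 4 or 8 argument bytes follow the initial byte.
		n := 1 << (info - 24)
		if len(b) < n {
			return nil, errors.New("truncated argument")
		}
		var buf [8]byte
		copy(buf[8-n:], b[:n])
		arg = binary.BigEndian.Uint64(buf[:])
		b = b[n:]
	default:
		return nil, errors.New("indefinite-length or reserved encoding")
	}
	switch major {
	case 2, 3:
		// Byte string or text string: skip the payload.
		if uint64(len(b)) < arg {
			return nil, errors.New("truncated string")
		}
		return b[arg:], nil
	case 4:
		// Array: walk each element in turn.
		for i := uint64(0); i < arg; i++ {
			var err error
			if b, err = walkItem(b); err != nil {
				return nil, err
			}
		}
		return b, nil
	case 5:
		return nil, errors.New("major type 5 (map) is not allowed")
	case 6:
		// Tag: walk the tagged item.
		return walkItem(b)
	default:
		// Integers (0, 1) and simple values or floats (7) have no payload.
		return b, nil
	}
}

func main() {
	fmt.Println(validateFCS([]byte{0xd8, 0x2c, 0x82, 0x01, 0x41, 0x02}))
	fmt.Println(validateFCS([]byte{0x81, 0xa0}))
}
```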

## IPLD Considerations

CIDs for FCS objects should use the FCS multicodec (`0x1f`).
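
A small Go sketch of what such a CID could look like in binary form. The text only fixes the multicodec; the use of CIDv1 and a sha2-256 multihash here are assumptions made for illustration, and the function name is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fcsCodec is the multicodec value named in the text above.
const fcsCodec = 0x1f

// fcsCID builds binary CID bytes as <version><codec><multihash>.
// Assumptions: CID version 1, and a sha2-256 multihash (code 0x12,
// digest length 0x20). All four header values are below 0x80, so each
// fits in a single varint byte.
func fcsCID(obj []byte) []byte {
	digest := sha256.Sum256(obj)
	cid := []byte{0x01, fcsCodec, 0x12, 0x20}
	return append(cid, digest[:]...)
}

func main() {
	// 0x80 is an empty CBOR array, used here as a stand-in object.
	cid := fcsCID([]byte{0x80})
	fmt.Printf("%x\n", cid[:4])
	fmt.Println(len(cid))
}
```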
2 changes: 1 addition & 1 deletion definitions.md
@@ -97,7 +97,7 @@ See [Filecoin Proofs](proofs.md)

#### Leader

A leader, in the context of Filecoin consensus, is a node that is chosen to propose the next block in the blockchain.
A leader, in the context of Filecoin consensus, is a node that is chosen to propose the next block in the blockchain.

#### Leader election
