Propose ADR 027: Deterministic Protobuf Serialization (#6979)
* Add ADR 026: Protocol Buffer Regencode
* Use Regencode in ADR-020
* Apply suggestions from code review
  Co-authored-by: Aaron Craelius <[email protected]>
* Add reasoning behind omitting empty fields
* Rename to ADR 026: Deterministic Protobuf Serialization
* i++
* explicitely -> explicitly
* Fix description of optional values
* Update reasoning behind omitting default values
* Fix empty -> defaut
* Update docs/architecture/adr-027-deterministic-protobuf-serialization.md
  Co-authored-by: Amaury Martiny <[email protected]>
* Add rules to avoid duplicate encoding and specify packed fields
* Add rule for variant encoding
* Add surrogate pair to test UTF-8 encoding

Co-authored-by: Aaron Craelius <[email protected]>
Co-authored-by: Amaury Martiny <[email protected]>
1 parent bf0142b, commit 1b9f144. Showing 3 changed files with 256 additions and 9 deletions.
docs/architecture/adr-027-deterministic-protobuf-serialization.md
# ADR 027: Deterministic Protobuf Serialization

## Changelog

- 2020-08-07: Initial Draft

## Status

Proposed

## Context

[Protobuf](https://developers.google.com/protocol-buffers/docs/proto3)
serialization is not unique (i.e. there exists a practically unlimited number of
valid binary representations for a protobuf document)<sup>1</sup>. For signature
verification in Cosmos SDK, signer and verifier need to agree on the same
serialization of a SignDoc as defined in
[ADR-020](./adr-020-protobuf-transaction-encoding.md) without transmitting the
serialization. This document describes a deterministic serialization scheme for
a subset of protobuf documents that covers this use case but can be reused in
other cases as well.

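To make the non-uniqueness concrete, the following Go sketch decodes four
different byte sequences that all describe the same two field values. The
decoder `decodeVarintFields` is a hypothetical helper written for this
illustration only: it understands nothing but varint fields and is not part of
any protobuf library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeVarintFields parses a message that consists only of varint fields
// (wire type 0) into a map from field number to value. As in protobuf, a
// later occurrence of a scalar field overrides an earlier one.
func decodeVarintFields(data []byte) (map[uint64]uint64, error) {
	fields := make(map[uint64]uint64)
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, fmt.Errorf("invalid tag")
		}
		data = data[n:]
		if tag&7 != 0 {
			return nil, fmt.Errorf("unsupported wire type %d", tag&7)
		}
		value, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, fmt.Errorf("invalid value")
		}
		data = data[n:]
		fields[tag>>3] = value
	}
	return fields, nil
}

func main() {
	encodings := [][]byte{
		{0x08, 0x01, 0x10, 0x02},             // field 1, then field 2
		{0x10, 0x02, 0x08, 0x01},             // field 2, then field 1
		{0x08, 0x07, 0x08, 0x01, 0x10, 0x02}, // field 1 written twice, last wins
		{0x08, 0x81, 0x00, 0x10, 0x02},       // value 1 as an over-long varint
	}
	for _, e := range encodings {
		fields, err := decodeVarintFields(e)
		fmt.Printf("%x -> %v (err: %v)\n", e, fields, err)
	}
}
```

All four inputs decode to field 1 = 1 and field 2 = 2, so a signature over the
raw bytes only verifies if both sides agree on which of these encodings to use.
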
## Decision

The following encoding scheme is proposed to be used by other ADRs.

### Scope

This ADR defines a protobuf3 serializer. The output is a valid protobuf
serialization, such that every protobuf parser can parse it.

No maps are supported in version 1 due to the complexity of defining a
deterministic serialization. This might change in the future. Implementations
must reject documents containing maps as invalid input.

### Serialization rules

The serialization is based on the
[protobuf 3 encoding](https://developers.google.com/protocol-buffers/docs/encoding)
with the following additions:

1. Fields must be serialized only once, in ascending order of field number.
2. Extra fields or any extra data must not be added.
3. [Default values](https://developers.google.com/protocol-buffers/docs/proto3#default)
   must be omitted.
4. `repeated` fields of scalar numeric types must use
   [packed encoding](https://developers.google.com/protocol-buffers/docs/encoding#packed)
   by default.
5. Varint encoding of integers must not be longer than needed.

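As an illustration of rules 4 and 5, the following Go sketch writes a
`repeated uint64` field in packed form. The helper `appendPackedUint64` is a
name chosen for this example, and `binary.AppendUvarint` assumes Go 1.19 or
newer. The input values are the ones used in the packed-encoding example of the
protobuf encoding guide.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendPackedUint64 appends a repeated uint64 field in packed encoding:
// one tag with wire type 2, the payload length, then all values as varints
// back to back. An empty list is not written at all.
func appendPackedUint64(buf []byte, fieldNumber uint64, values []uint64) []byte {
	if len(values) == 0 {
		return buf
	}
	var payload []byte
	for _, v := range values {
		// AppendUvarint always emits the shortest varint form (rule 5).
		payload = binary.AppendUvarint(payload, v)
	}
	buf = binary.AppendUvarint(buf, fieldNumber<<3|2)
	buf = binary.AppendUvarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

func main() {
	fmt.Printf("%x\n", appendPackedUint64(nil, 4, []uint64{3, 270, 86942}))
	// prints 2206038e029ea705
}
```

The unpacked alternative would repeat the tag before every element, which is
also valid protobuf but yields different bytes for the same list.
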
Rules 1 and 2 should be straightforward and describe the default behaviour of
all protobuf encoders the author is aware of. The 3rd rule is more interesting.
After a protobuf 3 deserialization you cannot differentiate between unset
fields and fields set to the default value<sup>2</sup>. At the serialization
level, however, it is possible to either write a field with an empty value or
omit it entirely. This is a significant difference from e.g. JSON, where a
property can be empty (`""`, `0`), `null` or undefined, leading to 3 different
documents.

Omitting fields set to default values is valid because the parser must assign
the default value to fields missing in the serialization<sup>3</sup>. For scalar
types, omitting defaults is required by the spec<sup>4</sup>. For `repeated`
fields, not serializing them is the only way to express empty lists. Enums must
have a first element of numeric value 0, which is the default<sup>5</sup>.
Message fields default to unset<sup>6</sup>.

Omitting defaults allows for some amount of forward compatibility: users of
newer versions of a protobuf schema produce the same serialization as users of
older versions as long as newly added fields are not used (i.e. set to their
default value).

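Because the rules are stated at the byte level, compliance can be checked
without a reference implementation. The following Go sketch checks rules 1, 3
and 5 under a strong simplifying assumption: the message contains only singular
varint fields. The function names are hypothetical, and `binary.AppendUvarint`
assumes Go 1.19 or newer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readMinimalUvarint reads one varint and rejects encodings that use more
// bytes than the shortest form of the decoded value.
func readMinimalUvarint(data []byte) (uint64, int, error) {
	v, n := binary.Uvarint(data)
	if n <= 0 {
		return 0, 0, errors.New("truncated or overflowing varint")
	}
	if len(binary.AppendUvarint(nil, v)) != n {
		return 0, 0, errors.New("varint longer than needed (rule 5)")
	}
	return v, n, nil
}

// checkCanonical validates a message made of singular varint fields only.
// Field number 0 is rejected as well, since it can never follow a smaller one.
func checkCanonical(data []byte) error {
	var lastField uint64
	for len(data) > 0 {
		tag, n, err := readMinimalUvarint(data)
		if err != nil {
			return err
		}
		data = data[n:]
		if tag&7 != 0 {
			return errors.New("only varint fields are supported in this sketch")
		}
		field := tag >> 3
		if field <= lastField {
			return errors.New("field repeated or not in ascending order (rule 1)")
		}
		lastField = field
		value, n, err := readMinimalUvarint(data)
		if err != nil {
			return err
		}
		data = data[n:]
		if value == 0 {
			return errors.New("default value must be omitted (rule 3)")
		}
	}
	return nil
}

func main() {
	fmt.Println(checkCanonical([]byte{0x08, 0x01, 0x10, 0x02})) // <nil>
	fmt.Println(checkCanonical([]byte{0x10, 0x02, 0x08, 0x01})) // violates rule 1
	fmt.Println(checkCanonical([]byte{0x08, 0x00}))             // violates rule 3
	fmt.Println(checkCanonical([]byte{0x08, 0x81, 0x00}))       // violates rule 5
}
```

A full checker would also need the schema, for example to tell packed fields
from strings and to recurse into nested messages.
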
### Implementation

There are three main implementation strategies, ordered from the least to the
most custom development:

- **Use a protobuf serializer that follows the above rules by default.** E.g.
  [gogoproto](https://pkg.go.dev/github.com/gogo/protobuf/gogoproto) is known to
  be compliant in most cases, but not when certain annotations such as
  `nullable = false` are used. It might also be an option to configure an
  existing serializer accordingly.
- **Normalize default values before encoding them.** If your serializer follows
  rules 1 and 2 and allows you to explicitly unset fields for serialization,
  you can normalize default values to unset. This can be done when working with
  [protobuf.js](https://www.npmjs.com/package/protobufjs):

  ```js
  const bytes = SignDoc.encode({
    bodyBytes: body.length > 0 ? body : null, // normalize empty bytes to unset
    authInfoBytes: authInfo.length > 0 ? authInfo : null, // normalize empty bytes to unset
    chainId: chainId || null, // normalize "" to unset
    accountNumber: accountNumber || null, // normalize 0 to unset
    accountSequence: accountSequence || null, // normalize 0 to unset
  }).finish();
  ```

- **Use a hand-written serializer for the types you need.** If none of the above
  ways works for you, you can write a serializer yourself. For SignDoc this
  would look something like this in Go, building on the varint helpers in
  `encoding/binary`:

  ```go
  import "encoding/binary" // binary.AppendUvarint needs Go 1.19+

  // SignDoc and its Go field names are assumed here for illustration.
  type SignDoc struct {
  	BodyBytes, AuthInfo            []byte
  	ChainID                        string
  	AccountNumber, AccountSequence uint64
  }

  // serializeSignDoc writes all non-default fields in ascending field order.
  func serializeSignDoc(signDoc SignDoc) []byte {
  	var buf []byte
  	if len(signDoc.BodyBytes) > 0 {
  		buf = binary.AppendUvarint(buf, 0x0a) // wire type and field number for body_bytes
  		buf = binary.AppendUvarint(buf, uint64(len(signDoc.BodyBytes)))
  		buf = append(buf, signDoc.BodyBytes...)
  	}
  	if len(signDoc.AuthInfo) > 0 {
  		buf = binary.AppendUvarint(buf, 0x12) // wire type and field number for auth_info
  		buf = binary.AppendUvarint(buf, uint64(len(signDoc.AuthInfo)))
  		buf = append(buf, signDoc.AuthInfo...)
  	}
  	if len(signDoc.ChainID) > 0 {
  		buf = binary.AppendUvarint(buf, 0x1a) // wire type and field number for chain_id
  		buf = binary.AppendUvarint(buf, uint64(len(signDoc.ChainID)))
  		buf = append(buf, signDoc.ChainID...)
  	}
  	if signDoc.AccountNumber != 0 {
  		buf = binary.AppendUvarint(buf, 0x20) // wire type and field number for account_number
  		buf = binary.AppendUvarint(buf, signDoc.AccountNumber)
  	}
  	if signDoc.AccountSequence != 0 {
  		buf = binary.AppendUvarint(buf, 0x28) // wire type and field number for account_sequence
  		buf = binary.AppendUvarint(buf, signDoc.AccountSequence)
  	}
  	return buf
  }
  ```

### Test vectors

Given the protobuf definition `Article.proto`

```protobuf
syntax = "proto3";
package blog;

enum Type {
  TYPE_UNSPECIFIED = 0;
  IMAGES = 1;
  NEWS = 2;
}

enum Review {
  REVIEW_UNSPECIFIED = 0;
  ACCEPTED = 1;
  REJECTED = 2;
}

message Article {
  string title = 1;
  string description = 2;
  uint64 created = 3;
  uint64 updated = 4;
  bool public = 5;
  bool promoted = 6;
  Type type = 7;
  Review review = 8;
  repeated string comments = 9;
  repeated string backlinks = 10;
}
```

serializing the values

```yaml
title: "The world needs change 🌳"
description: ""
created: 1596806111080
updated: 0
public: true
promoted: false
type: Type.NEWS
review: Review.REVIEW_UNSPECIFIED
comments: ["Nice one", "Thank you"]
backlinks: []
```

must result in the serialization

```
0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75
```

When inspecting the serialized document, you see that every second field is
omitted:

```
$ echo 0a1b54686520776f726c64206e65656473206368616e676520f09f8cb318e8bebec8bc2e280138024a084e696365206f6e654a095468616e6b20796f75 | xxd -r -p | protoc --decode_raw
1: "The world needs change \360\237\214\263"
3: 1596806111080
5: 1
7: 2
9: "Nice one"
9: "Thank you"
```

## Consequences

Having such an encoding available allows us to get deterministic serialization
for all protobuf documents we need in the context of Cosmos SDK signing.

### Positive

- Well defined rules that can be verified independently of a reference
  implementation
- Simple enough to keep the barrier to implement transaction signing low
- It allows us to continue to use 0 and other empty values in SignDoc, avoiding
  the need to work around 0 sequences. This does not imply that the change from
  https://github.com/cosmos/cosmos-sdk/pull/6949 should not be merged, but it
  is no longer as important.

### Negative

- When implementing transaction signing, the encoding rules above must be
  understood and implemented.
- The need for rule number 3 adds some complexity to implementations.

### Neutral

## References

- <sup>1</sup> _When a message is serialized, there is no guaranteed order for | ||
how its known or unknown fields should be written. Serialization order is an | ||
implementation detail and the details of any particular implementation may | ||
change in the future. Therefore, protocol buffer parsers must be able to parse | ||
fields in any order._ from | ||
https://developers.google.com/protocol-buffers/docs/encoding#order | ||
- <sup>2</sup> _Note that for scalar message fields, once a message is parsed | ||
there's no way of telling whether a field was explicitly set to the default | ||
value (for example whether a boolean was set to false) or just not set at all: | ||
you should bear this in mind when defining your message types. For example, | ||
don't have a boolean that switches on some behaviour when set to false if you | ||
don't want that behaviour to also happen by default._ from | ||
https://developers.google.com/protocol-buffers/docs/proto3#default | ||
- <sup>3</sup> _When a message is parsed, if the encoded message does not | ||
contain a particular singular element, the corresponding field in the parsed | ||
object is set to the default value for that field._ from | ||
https://developers.google.com/protocol-buffers/docs/proto3#default | ||
- <sup>4</sup> _Also note that if a scalar message field is set to its default, | ||
the value will not be serialized on the wire._ from | ||
https://developers.google.com/protocol-buffers/docs/proto3#default | ||
- <sup>5</sup> _For enums, the default value is the first defined enum value, | ||
which must be 0._ from | ||
https://developers.google.com/protocol-buffers/docs/proto3#default | ||
- <sup>6</sup> _For message fields, the field is not set. Its exact value is | ||
language-dependent._ from | ||
https://developers.google.com/protocol-buffers/docs/proto3#default | ||
- Encoding rules and parts of the reasoning taken from | ||
[canonical-proto3 Aaron Craelius](https://github.com/regen-network/canonical-proto3) |