The unified multicodecs theory #16
Changes from 16 commits
5d3d477
7d08aa4
2160506
d8c45d0
b923744
a3fc375
90aa4fb
885959d
38ea31b
76ed15d
36bd99e
66dcfa4
fa3197f
ab1b196
ce94334
6d7f49d
0c9d2df
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,15 +4,14 @@ | |
[![](https://img.shields.io/badge/project-multiformats-blue.svg?style=flat-square)](http://github.com/multiformats/multiformats) | ||
[![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](http://webchat.freenode.net/?channels=%23ipfs) | ||
|
||
> self-describing codecs | ||
> compact self-describing codecs. Save space by using predefined multicodec tables. | ||
|
||
## Table of Contents | ||
|
||
- [Motivation](#motivation) | ||
- [How does it work? - Protocol Description](#how-does-it-work---protocol-description) | ||
- [Prefix examples](#prefix-examples) | ||
- [prefix - codec - desc](#prefix---codec---desc) | ||
- [The protocol path](#the-protocol-path) | ||
- [Multicodec tables](#multicodec-tables) | ||
- [Standard multicodec table](#standard-mcp-protocol-table) | ||
- [Implementations](#implementations) | ||
- [FAQ](#faq) | ||
- [Maintainers](#maintainers) | ||
|
@@ -21,136 +20,162 @@ | |
|
||
## Motivation | ||
|
||
Multicodecs are self-describing protocol/encoding streams. (Note that a file is a stream). It's designed to address the perennial problem: | ||
[Multistreams](https://github.com/multiformats/multistream) are self-describing protocol/encoding streams. Multicodec uses an agreed-upon "protocol table". It is designed for use in short strings, such as keys or identifiers (e.g., [CID](https://github.com/ipld/cid)). | ||
|
||
> I have a bitstring, what codec is the data coded with!? | ||
## Protocol Description - How does the protocol work? | ||
|
||
Instead of arguing about which data serialization library is the best, let's just pick the simplest one now, and build _upgradability_ into the system. Choices are never _forever_. Eventually all systems are changed. So, embrace this fact of reality, and build change into your system now. | ||
`multicodec` is a _self-describing multiformat_: it wraps other formats with a tiny bit of self-description. A multicodec identifier is both a varint and the code identifying the following data. This means that the most significant bit of every byte of a multicodec code is reserved to signal continuation. | ||
Review comment: It describes only the packed format, I think we aren't yet removing the path format, are we? |
||
|
||
Multicodec frees you from the tyranny of past mistakes. Instead of trying to figure it all out beforehand, or continue using something that we can all agree no longer fits, why not allow the system to _evolve_ and _grow_ with the use cases of today, not yesterday. | ||
|
||
To decode an incoming stream of data, a program must either (a) know the format of the data a priori, or (b) learn the format from the data itself. (a) precludes running protocols that may provide one of many kinds of formats without prior agreement on which. multistream makes (b) neat using self-description. | ||
|
||
Moreover, this self-description allows straightforward layering of protocols without having to implement support in the parent (or encapsulating) one. | ||
|
||
## How does it work? - Protocol Description | ||
|
||
`multicodec` is a _self-describing multiformat_, it wraps other formats with a tiny bit of self-description: | ||
This way, a chunk of data identified by multicodec will look like this: | ||
|
||
```sh | ||
<multicodec-header><encoded-data> | ||
# or | ||
<varint-len><code>\n<encoded-data> | ||
<multicodec-varint><encoded-data> | ||
# To reduce the cognitive load, we sometimes might write the same line as: | ||
<mcp><data> | ||
``` | ||
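The `<multicodec-varint><encoded-data>` layout can be sketched in Node as follows. This is only an illustration under stated assumptions: the helper names (`encodeVarint`, `decodeVarint`, `addPrefix`, `splitPrefix`) are hypothetical and are not the js-multicodec API, and the `0x71` code for `dag-cbor` is taken from the table proposed in this diff.

```javascript
// Sketch only: helper names are hypothetical, not the js-multicodec API.

// Encode a non-negative integer as an unsigned varint: 7 bits per byte,
// least significant group first, high bit set on every byte except the last.
function encodeVarint(n) {
  const bytes = [];
  while (n >= 0x80) {
    bytes.push((n % 128) | 0x80);
    n = Math.floor(n / 128);
  }
  bytes.push(n);
  return Buffer.from(bytes);
}

// Decode an unsigned varint from the start of buf and report how many
// bytes it occupied.
function decodeVarint(buf) {
  let value = 0;
  let multiplier = 1;
  for (let i = 0; i < buf.length; i++) {
    value += (buf[i] & 0x7f) * multiplier;
    if ((buf[i] & 0x80) === 0) {
      return { value, length: i + 1 };
    }
    multiplier *= 128;
  }
  throw new Error('truncated varint');
}

// <multicodec-varint><encoded-data>
function addPrefix(code, data) {
  return Buffer.concat([encodeVarint(code), data]);
}

// Split a prefixed buffer back into its code and the remaining data.
function splitPrefix(buf) {
  const { value, length } = decodeVarint(buf);
  return { code: value, data: buf.slice(length) };
}

const DAG_CBOR = 0x71; // assumed from the table proposed in this diff
const packed = addPrefix(DAG_CBOR, Buffer.from('hello'));
console.log(packed.toString('hex')); // 7168656c6c6f
console.log(splitPrefix(packed).code === DAG_CBOR); // true
```

Because the code is a varint, a reader needs no separate length field for the prefix: it consumes bytes until it finds one with the high bit clear, and everything after that is the encoded data.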
|
||
For example, let's encode a json doc: | ||
|
||
```node | ||
> // encode some json | ||
> var str = JSON.stringify({"hello":"world"}) | ||
> var buf = multicodec.encode('json', str) // prepends multistream.header('/json') | ||
> buf | ||
<Buffer 06 2f 6a 73 6f 6e 2f 7b 22 68 65 6c 6c 6f 22 3a 22 77 6f 72 6c 64 22 7d> | ||
> buf.toString('hex') | ||
062f6a736f6e2f7b2268656c6c6f223a22776f726c64227d | ||
> // decode, and find out what is in buf | ||
> multicodec.decode(buf) | ||
{ "codec": "json", "data": '{"hello": "world"}' } | ||
``` | ||
Another useful scenario is using multicodec-packed as part of the keys used to access data. For example: | ||
|
||
So, `buf` is: | ||
|
||
``` | ||
hex: 062f6a736f6e2f7b2268656c6c6f223a22776f726c64227d | ||
ascii: json/\n{"hello":"world"} | ||
``` | ||
# suppose we have a value and a key to retrieve it | ||
"<key>" -> <value> | ||
|
||
The more you know! Let's try it again, this time with protobuf: | ||
|
||
``` | ||
cat proto.c | ||
# we can use multicodec-packed with the key to know what codec the value is in | ||
Review comment: put back the newlines |
Review comment: ✔️ |
||
"<mcp><key>" -> <value> | ||
``` | ||
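A minimal sketch of the `"<mcp><key>" -> <value>` idea, assuming single-byte codes only so that no varint handling is needed. The names `CODEC_NAMES`, `packKey` and `codecOfKey` are hypothetical, and the two codes are taken from the table proposed in this diff.

```javascript
// Sketch only: names are hypothetical; codes come from the table in this diff.
const CODEC_NAMES = { 0x70: 'dag-pb', 0x71: 'dag-cbor' };

// Build "<mcp><key>" as a hex string. This sketch assumes the code fits in
// one byte (below 0x80), so the varint is just the code itself.
function packKey(code, key) {
  if (code >= 0x80) {
    throw new RangeError('this sketch only handles single-byte codes');
  }
  return Buffer.concat([Buffer.from([code]), Buffer.from(key)]).toString('hex');
}

// Read the codec of the stored value back from the key alone.
function codecOfKey(packedKey) {
  const code = parseInt(packedKey.slice(0, 2), 16);
  return CODEC_NAMES[code];
}

const store = new Map();
const key = packKey(0x71, 'some-key');
store.set(key, Buffer.from([0xa0])); // 0xa0 is an empty CBOR map

console.log(codecOfKey(key)); // dag-cbor
```

The point of the design is that the codec can be determined from the key before the value is fetched or parsed.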
|
||
See also: [multicodec-packed](./multicodec-packed.md). | ||
|
||
## Prefix examples | ||
|
||
|
||
| prefix | codec | desc | type | [packed encoding](https://github.com/multiformats/multicodec/blob/master/multicodec-packed.md)| | ||
|----------------|-------|-------------|-------|---------------------------------------| | ||
|0x052f62696e2f | /bin/ |raw binary |binary | 0x00 | | ||
|0x042f62322f | /b2/ |ascii base2 |binary | | | ||
|0x052f6231362f | /b16/ |ascii base16 |hex | | | ||
|0x052f6233322f | /b32/ |ascii base32 | | | | ||
|0x052f6235382f | /b58/ |ascii base58 | | | | ||
|0x052f6236342f | /b64/ |ascii base64 | | | | ||
|0x062f6a736f6e2f |/json/ | |json | | | ||
|0x062f63626f722f |/cbor/ | |json | | | ||
|0x062f62736f6e2f |/bson/ | |json | | | ||
|0x072f626a736f6e2f|/bjson/| |json | | | ||
|0x082f75626a736f6e2f| /ubjson/| |json | | | ||
|0x182f6d756c7469636f6465632f | /multicodec/ | | multiformat | 0x40 | | ||
|0x162f6d756c7469686173682f | /multihash/ | | multiformat | 0x41 | | ||
|0x162f6d756c7469616464722f | /multiaddr/ | | multiformat | 0x42 | | ||
|0x0a2f70726f746f6275662f |/protobuf/ | Protocol Buffers |protobuf| | | ||
|0x072f6361706e702f | /capnp/ | Cap-n-Proto |protobuf| | | ||
|0x092f666c61746275662f |/flatbuf/ | FlatBuffers |protobuf| | | ||
|0x052f7461722f |/tar/ | | archive | | | ||
|0x052f7a69702f |/zip/ | | archive | | | ||
|0x052f706e672f | /png/ | | archive | | | ||
|0x052f726c702f | /rlp/ | recursive length prefix | ethereum | 0x60 | | ||
## The protocol path | ||
|
||
`multicodec` allows us to specify different protocols in a universal namespace, that way being able to recognize, multiplex, and embed them easily. We use the notion of a `path` instead of an `id` because it is meant to be a Unix-friendly URI. | ||
|
||
A good path name should be decipherable -- meaning that if some machine or developer -- who has no idea about your protocol -- encounters the path string, they should be able to look it up and resolve how to use it. | ||
|
||
An example of a good path name is: | ||
|
||
``` | ||
/bittorrent.org/1.0 | ||
It is worth noting that multicodec-packed works very well in conjunction with [multihash](https://github.com/multiformats/multihash) and [multiaddr](https://github.com/multiformats/multiaddr), as you can prefix those values with a multicodec-packed to tell what they are. | ||
|
||
## Multicodec Protocol Tables | ||
|
||
Multicodec uses "protocol tables" to agree upon the mapping from a multicodec code (a single varint) to the protocol or format it identifies. These tables can be application specific, though -- like [with](https://github.com/multiformats/multihash) [other](https://github.com/multiformats/multibase) [multiformats](https://github.com/multiformats/multiaddr) -- we will keep a globally agreed upon table with common protocols and formats. | ||
|
||
## Multicodec table | ||
|
||
```csv | ||
codec, description, code | ||
|
||
miscellaneous | ||
bin, raw binary, 0x55 | ||
|
||
base encodings | ||
base1, unary, 0x01 | ||
base2, binary (0 and 1), 0x55 | ||
base8, octal, 0x07 | ||
base10, decimal, 0x09 | ||
base16, hexadecimal, 0x | ||
base32, rfc4648, 0x | ||
base32hex, rfc4648, 0x | ||
base58flickr, base58 flickr, 0x | ||
base58btc, base58 bitcoin, 0x | ||
base64, rfc4648, 0x | ||
base64url, rfc4648, 0x | ||
Review comment: The codes of multibase should be the same as https://github.com/multiformats/multibase/blob/master/multibase.csv |
Review comment: Not possible, the multibase.csv are chars available in the encodec space of the base, not hex values. |
Review comment: Right. |
Review comment: IIRC we were removing multibases from multicodec packed. Am I misremembering, @diasdavid? |
||
|
||
serialization formats | ||
cbor, CBOR, 0x | ||
bson, Binary JSON, 0x | ||
ubjson, Universal Binary JSON, 0x | ||
protobuf, Protocol Buffers, 0x | ||
capnp, Cap-n-Proto, 0x | ||
flatbuf, FlatBuffers, 0x | ||
rlp, recursive length prefix, 0x60 | ||
|
||
multiformats | ||
multicodec, , 0x30 | ||
multihash, , 0x31 | ||
multiaddr, , 0x32 | ||
multibase, , 0x33 | ||
|
||
multihashes | ||
sha1, , 0x11 | ||
sha2-256, , 0x12 | ||
sha2-512, , 0x13 | ||
sha3-224, , 0x17 | ||
sha3-256, , 0x16 | ||
sha3-384, , 0x15 | ||
sha3-512, , 0x14 | ||
shake-128, , 0x18 | ||
shake-256, , 0x19 | ||
keccak-224, , 0x1A | ||
keccak-256, , 0x1B | ||
keccak-384, , 0x1C | ||
keccak-512, , 0x1D | ||
Note: keccak has variable output length, instead the number specifies the core length,, | ||
blake2b, , 0x40 | ||
blake2s, , 0x41 | ||
reserved for apps, app-specific range, 0x4000-0x40f0 | ||
|
||
multiaddrs | ||
ip4, , 0x04 | ||
ip6, , 0x29 | ||
tcp, , 0x06 | ||
udp, , 0x0111 | ||
dccp, , 0x21 | ||
sctp, , 0x84 | ||
udt, , 0x012D | ||
utp, , 0x012E | ||
ipfs, , 0x2A | ||
http, , 0x01E0 | ||
https, , 0x01BB | ||
ws, , 0x01DD | ||
onion, , 0x01BC | ||
|
||
archiving formats | ||
tar, , 0x | ||
zip, , 0x | ||
|
||
image formats | ||
png, , 0x | ||
jpg, , 0x | ||
|
||
video formats | ||
mp4, , 0x | ||
mkv, , 0x | ||
|
||
IPLD formats | ||
dag-pb, MerkleDAG protobuf, 0x70 | ||
dag-cbor, MerkleDAG cbor, 0x71 | ||
eth-block, Ethereum Block (RLP), 0x90 | ||
Review comment: I don't think these should be above |
Review comment: why is that? |
Review comment: Because that means they take up two bytes when encoded. Anything that is going to be commonly used we would really love to have take only a single byte. |
Review comment: 😢 |
Review comment: I noticed |
Review comment: Transport multicodecs are 'more okay' because they get transferred less; on the other hand, format multicodecs get transferred every time a block is transferred, so a byte actually means a lot. |
Review comment: @whyrusleeping is this still a concern? I believe it stopped being as soon as you merged IPLD in go-ipfs into master, correct? |
Review comment: and, there are simply more than 127 things we "really care about". so 2 bytes is in the long run unavoidable. |
||
eth-tx, Ethereum Tx (RLP), 0x91 | ||
bitcoin-block, Bitcoin Block, 0xb0 | ||
bitcoin-tx, Bitcoin Tx, 0xb1 | ||
stellar-block, Stellar Block, 0xd0 | ||
stellar-tx, Stellar Tx, 0xd1 | ||
``` | ||
Review comment: this table should move to a CSV, and the readme should point to it or embed it. |
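The review thread's concern about codes that take two bytes can be checked with a short calculation. The sketch below assumes the unsigned varint layout of 7 payload bits per byte; `varintLength` is a hypothetical helper, and the sample codes are copied from the table proposed in this diff.

```javascript
// Sketch only: varintLength is a hypothetical helper.
// Number of bytes an unsigned varint needs for a code (7 payload bits per byte).
function varintLength(code) {
  let length = 1;
  while (code >= 0x80) {
    code = Math.floor(code / 128);
    length++;
  }
  return length;
}

// Sample codes copied from the table proposed in this diff.
const samples = {
  'dag-pb': 0x70,
  'dag-cbor': 0x71,
  'eth-block': 0x90,
  'bitcoin-block': 0xb0,
  http: 0x01e0,
};

for (const [name, code] of Object.entries(samples)) {
  console.log(`${name}: ${varintLength(code)} byte(s)`);
}
// dag-pb: 1 byte(s)
// dag-cbor: 1 byte(s)
// eth-block: 2 byte(s)
// bitcoin-block: 2 byte(s)
// http: 2 byte(s)
```

Any code of `0x80` or above needs a second byte, which is why only values up to `0x7f` (127) fit in a single-byte prefix.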
||
|
||
An example of a _great_ path name is: | ||
|
||
``` | ||
/ipfs/Qmaa4Rw81a3a1VEx4LxB7HADUAXvZFhCoRdBzsMZyZmqHD/ipfs.protocol | ||
/http/w3id.org/ipfs/ipfs-1.1.0.json | ||
``` | ||
## Implementations | ||
|
||
These path names happen to be resolvable -- not just in a "multicodec muxer(e.g [multistream]())" but -- in the internet as a whole (provided the program (or OS) knows how to use the `/ipfs` and `/http` protocols). | ||
- [go](https://github.com/multiformats/go-multicodec/) | ||
- [JavaScript](https://github.com/multiformats/js-multicodec) | ||
- [Add yours today!](https://github.com/multiformats/multicodec/edit/master/multicodec.md) | ||
|
||
## Implementations | ||
## Multicodec Path, also known as [`multistream`](https://github.com/multiformats/multistream) | ||
|
||
- [go-multicodec](https://github.com/multiformats/go-multicodec) | ||
- [go-multistream](https://github.com/multiformats/go-multistream) - Implements multistream, which uses multicodec for stream negotiation | ||
- [js-multistream](https://github.com/multiformats/js-multistream-select) - Implements multistream, which uses multicodec for stream negotiation | ||
- [clj-multicodec](https://github.com/multiformats/clj-multicodec) | ||
Multicodec defines a table for the most common data serialization formats, which can be expanded over time or on a per-application basis. However, in order for two programs to talk with each other, they need to know beforehand which table or table extension is being used. | ||
|
||
In order to enable self-descriptive data formats or streams that can be dynamically described, without the formal step of adding a binary packed code to a table, we have [`multistream`](https://github.com/multiformats/multistream), so that applications can adopt multiple data formats for their streams and with that create different protocols. | ||
|
||
## FAQ | ||
|
||
> **Q. Why?** | ||
> **Q. I have questions on multicodec, not listed here.** | ||
|
||
Today, people speak many languages, and use common ones to interface. But every "common language" has evolved over time, or even fundamentally switched. Why should we expect programs to be any different? | ||
That's not a question. But, have you checked the proper [multicodec FAQ](./README.md#faq)? Maybe your question is answered there. This FAQ is only specifically for multicodec-packed. | ||
|
||
And the reality is they're not. Programs use a variety of encodings. Today we like JSON. Yesterday, XML was all the rage. XDR solved everything, but it's kinda retro. Protobuf is still too cool for school. capnp ("cap and proto") is | ||
for cerealization hipsters. | ||
> **Q. Why?** | ||
|
||
The one problem is figuring out what we're speaking. Humans are pretty smart, we pick up all sorts of languages over time. And we can always resort to pointing and grunting (the ascii of humanity). | ||
Because [multistream](https://github.com/multiformats/multistream) is too long for identifiers. We needed something shorter. | ||
|
||
Programs have a harder time. You can't keep piping json into a protobuf decoder and hope they align. So we have to help them out a bit. That's what multicodec is for. | ||
> **Q. Why varints?** | ||
|
||
> **Q. Why "codec" and not "encoder" and "decoder"?** | ||
So that we have no limitation on protocols. Implementation note: you do not need to implement varints until the standard multicodec table has more than 127 functions. | ||
Review comment: That is the case already. |
Review comment: True |
Review comment: yes, we probably need to remove this, and add multicodec impls to all the use-case specific multiformats IF their table requires it. |
||
|
||
Because they're the same thing. Which one of these is the encoder and which the decoder? | ||
> **Q. What kind of varints?** | ||
|
||
5555 ----[ THING ]---> 8888 | ||
5555 <---[ THING ]---- 8888 | ||
A Most Significant Bit (MSB) unsigned varint, as defined by [multiformats/unsigned-varint](https://github.com/multiformats/unsigned-varint). | ||
|
||
> **Q. Full paths are too big for my use case, is there something smaller?** | ||
> **Q. Don't we have to agree on a table of protocols?** | ||
|
||
Yes, check out [multicodec-packed](./multicodec-packed.md). It uses a varint and a table to achieve the same thing. | ||
Yes, but we already have to agree on what protocols themselves are, so this is not so hard. The table even leaves some room for custom protocol paths, or you can use your own tables. The standard table is only for common things. | ||
|
||
## Maintainers | ||
|
||
|
Review comment: Do all protocols need to be in 'agreed upon "protocol table"' or only those using packed format?