From 9ae15732be1248e618b014717d2c9c0ee37c7498 Mon Sep 17 00:00:00 2001 From: Jorropo Date: Mon, 10 Oct 2022 17:05:17 +0200 Subject: [PATCH] docs: Write UNIXFSv1 spec --- UNIXFS.md | 233 ------------------------- UNIXFSv1.md | 487 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 487 insertions(+), 233 deletions(-) delete mode 100644 UNIXFS.md create mode 100644 UNIXFSv1.md diff --git a/UNIXFS.md b/UNIXFS.md deleted file mode 100644 index a53c7af2c..000000000 --- a/UNIXFS.md +++ /dev/null @@ -1,233 +0,0 @@ -# ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) UnixFS - -**Author(s)**: -- NA - -* * * - -**Abstract** - -UnixFS is a [protocol-buffers](https://developers.google.com/protocol-buffers/) based format for describing files, directories, and symlinks in IPFS. The current implementation of UnixFS has grown organically and does not have a clear specification document. See [“implementations”](#implementations) below for reference implementations you can examine to understand the format. - -Draft work and discussion on a specification for the upcoming version 2 of the UnixFS format is happening in the [`ipfs/unixfs-v2` repo](https://github.com/ipfs/unixfs-v2). Please see the issues there for discussion and PRs for drafts. When the specification is completed there, it will be copied back to this repo and replace this document. - -## Table of Contents - -- [Implementations](#implementations) -- [Data Format](#data-format) -- [Metadata](#metadata) - - [Deduplication and inlining](#deduplication-and-inlining) -- [Importing](#importing) - - [Chunking](#chunking) - - [Layout](#layout) -- [Exporting](#exporting) -- [Design decision rationale](#design-decision-rationale) - - [Metadata](#metadata-1) - - [Separate Metadata node](#separate-metadata-node) - - [Metadata in the directory](#metadata-in-the-directory) - - [Metadata in the file](#metadata-in-the-file) - - [Side trees](#side-trees) - - [Side database](#side-database) - -## Implementations - -- JavaScript - - Data Formats - [unixfs](https://github.com/ipfs/js-ipfs-unixfs) - - Importer - [unixfs-importer](https://github.com/ipfs/js-ipfs-unixfs-importer) - - Exporter - [unixfs-exporter](https://github.com/ipfs/js-ipfs-unixfs-exporter) -- Go - - [`ipfs/go-ipfs/unixfs`](https://github.com/ipfs/go-ipfs/tree/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs) - - Protocol Buffer Definitions - [`ipfs/go-ipfs/unixfs/pb`](https://github.com/ipfs/go-ipfs/blob/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs/pb/unixfs.proto) - -## Data Format - -The UnixfsV1 data format is represented by this protobuf: - -```protobuf -message Data { - enum DataType { - Raw = 0; - Directory = 1; - File = 2; - Metadata = 3; - Symlink = 4; - HAMTShard = 5; - } - - required DataType Type = 1; - optional bytes Data = 2; - optional uint64 filesize = 3; - repeated uint64 blocksizes = 4; - optional uint64 hashType = 5; - optional uint64 fanout = 6; - optional uint32 mode = 7; - optional UnixTime mtime = 8; -} - -message Metadata { - optional string MimeType = 1; -} - -message UnixTime { - required int64 Seconds = 1; - optional fixed32 FractionalNanoseconds = 2; -} -``` - -This `Data` object is used for all non-leaf nodes in Unixfs. - -For files that are comprised of more than a single block, the 'Type' field will be set to 'File', the 'filesize' field will be set to the total number of bytes in the file (not the graph structure) represented by this node, and 'blocksizes' will contain a list of the filesizes of each child node. 
- -This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object. - -For files comprised of a single block, the 'Type' field will be set to 'File', 'filesize' will be set to the total number of bytes in the file and the file data will be stored in the 'Data' field. - -## Metadata - -UnixFS currently supports two optional metadata fields: - -* `mode` -- The `mode` is for persisting the file permissions in [numeric notation](https://en.wikipedia.org/wiki/File_system_permissions#Numeric_notation) \[[spec](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html)\]. - - If unspecified this defaults to - - `0755` for directories/HAMT shards - - `0644` for all other types where applicable - - The nine least significant bits represent `ugo-rwx` - - The next three least significant bits represent `setuid`, `setgid` and the `sticky bit` - - The remaining 20 bits are reserved for future use, and are subject to change. Spec implementations **MUST** handle bits they do not expect as follows: - - For future-proofing the (de)serialization layer must preserve the entire uint32 value during clone/copy operations, modifying only bit values that have a well defined meaning: `clonedValue = ( modifiedBits & 07777 ) | ( originalValue & 0xFFFFF000 )` - - Implementations of this spec must proactively mask off bits without a defined meaning in the implemented version of the spec: `interpretedValue = originalValue & 07777` - -* `mtime` -- A two-element structure ( `Seconds`, `FractionalNanoseconds` ) representing the modification time in seconds relative to the unix epoch `1970-01-01T00:00:00Z`. - - The two fields are: - 1. `Seconds` ( always present, signed 64bit integer ): represents the amount of seconds after **or before** the epoch. - 2. `FractionalNanoseconds` ( optional, 32bit unsigned integer ): when specified represents the fractional part of the mtime as the amount of nanoseconds. The valid range for this value are the integers `[1, 999999999]`. - - - Implementations encoding or decoding wire-representations must observe the following: - - An `mtime` structure with `FractionalNanoseconds` outside of the on-wire range `[1, 999999999]` is **not** valid. This includes a fractional value of `0`. Implementations encountering such values should consider the entire enclosing metadata block malformed and abort processing the corresponding DAG. - - The `mtime` structure is optional - its absence implies `unspecified`, rather than `0` - - For ergonomic reasons a surface API of an encoder must allow fractional 0 as input, while at the same time must ensure it is stripped from the final structure before encoding, satisfying the above constraints. - - - Implementations interpreting the mtime metadata in order to apply it within a non-IPFS target must observe the following: - - If the target supports a distinction between `unspecified` and `0`/`1970-01-01T00:00:00Z`, the distinction must be preserved within the target. E.g. if no `mtime` structure is available, a web gateway must **not** render a `Last-Modified:` header. - - If the target requires an mtime ( e.g. a FUSE interface ) and no `mtime` is supplied OR the supplied `mtime` falls outside of the targets accepted range: - - When no `mtime` is specified or the resulting `UnixTime` is negative: implementations must assume `0`/`1970-01-01T00:00:00Z` ( note that such values are not merely academic: e.g. 
the OpenVMS epoch is `1858-11-17T00:00:00Z` ) - - When the resulting `UnixTime` is larger than the targets range ( e.g. 32bit vs 64bit mismatch ) implementations must assume the highest possible value in the targets range ( in most cases that would be `2038-01-19T03:14:07Z` ) - -### Deduplication and inlining - -Where the file data is small it would normally be stored in the `Data` field of the UnixFS `File` node. - -To aid in deduplication of data even for small files, file data can be stored in a separate node linked to from the `File` node in order for the data to have a constant [CID] regardless of the metadata associated with it. - -As a further optimization, if the `File` node's serialized size is small, it may be inlined into its v1 [CID] by using the [`identity`](https://github.com/multiformats/multicodec/blob/master/table.csv) [multihash]. - -## Importing - -Importing a file into unixfs is split up into two parts. The first is chunking, the second is layout. - -### Chunking - -Chunking has two main parameters, chunking strategy and leaf format. - -Leaf format should always be set to 'raw', this is mainly configurable for backwards compatibility with earlier formats that used a Unixfs Data object with type 'Raw'. Raw leaves means that the nodes output from chunking will be just raw data from the file with a CID type of 'raw'. - -Chunking strategy currently has two different options, 'fixed size' and 'rabin'. Fixed size chunking will chunk the input data into pieces of a given size. Rabin chunking will chunk the input data using rabin fingerprinting to determine the boundaries between chunks. - - -### Layout - -Layout defines the shape of the tree that gets built from the chunks of the input file. - -There are currently two options for layout, balanced, and trickle. -Additionally, a 'max width' must be specified. The default max width is 174. - -The balanced layout creates a balanced tree of width 'max width'. The tree is formed by taking up to 'max width' chunks from the chunk stream, and creating a unixfs file node that links to all of them. This is repeated until 'max width' unixfs file nodes are created, at which point a unixfs file node is created to hold all of those nodes, recursively. The root node of the resultant tree is returned as the handle to the newly imported file. - -If there is only a single chunk, no intermediate unixfs file nodes are created, and the single chunk is returned as the handle to the file. - -## Exporting - -To read the file data out of the unixfs graph, perform an in order traversal, emitting the data contained in each of the leaves. - -## Design decision rationale - -### Metadata - -Metadata support in UnixFSv1.5 has been expanded to increase the number of possible use cases. These include rsync and filesystem based package managers. - -Several metadata systems were evaluated: - -#### Separate Metadata node - -In this scheme, the existing `Metadata` message is expanded to include additional metadata types (`mtime`, `mode`, etc). It then contains links to the actual file data but never the file data itself. - -This was ultimately rejected for a number of reasons: - -1. You would always need to retrieve an additional node to access file data which limits the kind of optimizations that are possible. - - For example many files are under the 256KiB block size limit, so we tend to inline them into the describing UnixFS `File` node. This would not be possible with an intermediate `Metadata` node. - -2. The `File` node already contains some metadata (e.g. 
the file size) so metadata would be stored in multiple places which complicates forwards compatibility with UnixFSv2 as to map between metadata formats potentially requires multiple fetch operations - -#### Metadata in the directory - -Repeated `Metadata` messages are added to UnixFS `Directory` and `HAMTShard` nodes, the index of which indicates which entry they are to be applied to. - -Where entries are `HAMTShard`s, an empty message is added. - -One advantage of this method is that if we expand stored metadata to include entry types and sizes we can perform directory listings without needing to fetch further entry nodes (excepting `HAMTShard` nodes), though without removing the storage of these datums elsewhere in the spec we run the risk of having non-canonical data locations and perhaps conflicting data as we traverse through trees containing both UnixFS v1 and v1.5 nodes. - -This was rejected for the following reasons: - -1. When creating a UnixFS node there's no way to record metadata without wrapping it in a directory. - -2. If you access any UnixFS node directly by its [CID], there is no way of recreating the metadata which limits flexibility. - -3. In order to list the contents of a directory including entry types and sizes, you have to fetch the root node of each entry anyway so the performance benefit of including some metadata in the containing directory is negligible in this use case. - -#### Metadata in the file - -This adds new fields to the UnixFS `Data` message to represent the various metadata fields. - -It has the advantage of being simple to implement, metadata is maintained whether the file is accessed directly via its [CID] or via an IPFS path that includes a containing directory, and by keeping the metadata small enough we can inline root UnixFS nodes into their CIDs so we can end up fetching the same number of nodes if we decide to keep file data in a leaf node for deduplication reasons. - -Downsides to this approach are: - -1. Two users adding the same file to IPFS at different times will have different [CID]s due to the `mtime`s being different. - - If the content is stored in another node, its [CID] will be constant between the two users but you can't navigate to it unless you have the parent node which will be less available due to the proliferation of [CID]s. - -2. Metadata is also impossible to remove without changing the [CID], so metadata becomes part of the content. - -3. Performance may be impacted as well as if we don't inline UnixFS root nodes into [CID]s, additional fetches will be required to load a given UnixFS entry. - -#### Side trees - -With this approach we would maintain a separate data structure outside of the UnixFS tree to hold metadata. - -This was rejected due to concerns about added complexity, recovery after system crashes while writing, and having to make extra requests to fetch metadata nodes when resolving [CID]s from peers. - -#### Side database - -This scheme would see metadata stored in an external database. - -The downsides to this are that metadata would not be transferred from one node to another when syncing as [Bitswap] is not aware of the database, and in-tree metadata - -### UnixTime protobuf datatype rationale - -#### Seconds - -The integer portion of UnixTime is represented on the wire using a varint encoding. While this is -inefficient for negative values, it avoids introducing zig-zag encoding. 
Values before the year 1970 -will be exceedingly rare, and it would be handy having such cases stand out, while at the same keeping -the "usual" positive values easy to eyeball. The varint representing the time of writing this text is -5 bytes long. It will remain so until October 26, 3058 ( 34,359,738,367 ) - -#### FractionalNanoseconds -Fractional values are effectively a random number in the range 1 ~ 999,999,999. Such values will exceed -2^28 nanoseconds ( 268,435,456 ) in most cases. Therefore, the fractional part is represented as a 4-byte -`fixed32`, [as per Google's recommendation](https://developers.google.com/protocol-buffers/docs/proto#scalar). - -[multihash]: https://tools.ietf.org/html/draft-multiformats-multihash-00 -[CID]: https://docs.ipfs.io/guides/concepts/cid/ -[Bitswap]: https://github.com/ipfs/specs/blob/master/BITSWAP.md -[MFS]: https://docs.ipfs.io/guides/concepts/mfs/ diff --git a/UNIXFSv1.md b/UNIXFSv1.md new file mode 100644 index 000000000..fd230cfb9 --- /dev/null +++ b/UNIXFSv1.md @@ -0,0 +1,487 @@ +# ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) UnixFS + +**Author(s)**: +- NA + +* * * + +**Abstract** + +UnixFS is a [protocol-buffers](https://developers.google.com/protocol-buffers/) based format for describing files, directories, and symlinks as merkle-dags in IPFS. + +Draft work and discussion on a specification for the upcoming version 2 of the UnixFS format is happening in the [`ipfs/unixfs-v2` repo](https://github.com/ipfs/unixfs-v2). Please see the issues there for discussion and PRs for drafts. + +## Table of Contents + +- [Implementations](#implementations) +- [Data Format](#data-format) +- [Metadata](#metadata) + - [Deduplication and inlining](#deduplication-and-inlining) +- [Importing](#importing) + - [Chunking](#chunking) + - [Layout](#layout) +- [Exporting](#exporting) +- [Design decision rationale](#design-decision-rationale) + - [Metadata](#metadata-1) + - [Separate Metadata node](#separate-metadata-node) + - [Metadata in the directory](#metadata-in-the-directory) + - [Metadata in the file](#metadata-in-the-file) + - [Side trees](#side-trees) + - [Side database](#side-database) + +## Implementations + +- JavaScript + - Data Formats - [unixfs](https://github.com/ipfs/js-ipfs-unixfs) + - Importer - [unixfs-importer](https://github.com/ipfs/js-ipfs-unixfs-importer) + - Exporter - [unixfs-exporter](https://github.com/ipfs/js-ipfs-unixfs-exporter) +- Go + - [`ipfs/go-ipfs/unixfs`](https://github.com/ipfs/go-ipfs/tree/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs) + - Protocol Buffer Definitions - [`ipfs/go-ipfs/unixfs/pb`](https://github.com/ipfs/go-ipfs/blob/b3faaad1310bcc32dc3dd24e1919e9edf51edba8/unixfs/pb/unixfs.proto) + +## Data Format + +The UnixfsV1 data format is represented by this protobuf: + +```protobuf +message Data { + enum DataType { + Raw = 0; + Directory = 1; + File = 2; + Metadata = 3; + Symlink = 4; + HAMTShard = 5; + } + + required DataType Type = 1; + optional bytes Data = 2; + optional uint64 filesize = 3; + repeated uint64 blocksizes = 4; + optional uint64 hashType = 5; + optional uint64 fanout = 6; + optional uint32 mode = 7; + optional UnixTime mtime = 8; +} + +message Metadata { + optional string MimeType = 1; +} + +message UnixTime { + required int64 Seconds = 1; + optional fixed32 FractionalNanoseconds = 2; +} +``` + +### IPLD `dag-pb` + +A very important other spec for unixfs is the [`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/) IPLD spec: + +```protobuf +message PBLink { + // binary CID 
+  // (with no multibase prefix) of the target object
+  optional bytes Hash = 1;
+
+  // UTF-8 string name
+  optional string Name = 2;
+
+  // cumulative size of target object
+  optional uint64 Tsize = 3; // also known as dagsize
+}
+
+message PBNode {
+  // refs to other objects
+  repeated PBLink Links = 2;
+
+  // opaque user data
+  optional bytes Data = 1;
+}
+```
+
+The two schemas work together, and it is important to understand how they relate:
+- `dag-pb`, also named `PBNode`, is the "outside" protobuf message, the first one you decode. It contains the list of links and some "opaque user data".
+- The `Data` message is the "inside" protobuf message. It is obtained by first decoding the `PBNode` object and then decoding the bytes of the `PBNode.Data` field; it contains all of the remaining information.
+
+In other words, we deal with a protobuf nested inside a protobuf, for historical reasons.
+For clarity, this spec sometimes shows the two as a single object; when that happens it is implied that the `PBNode.Data` field holds the protobuf-encoded `Data` message.
+
+## Glossary
+
+- Node, Block
+  A node, in the graph-theory sense, is the smallest unit present in the graph.
+  Due to how UnixFS works, there is a 1-to-1 mapping between nodes and blocks.
+- File
+  A file is a container over an arbitrarily sized amount of bytes.
+  Files can be single block or multi block; in the latter case they are the concatenation of multiple child files.
+- Directory, Folder
+  A named collection of child nodes.
+- HAMT Directory
+  A Hashed-Array-Mapped-Trie datastructure representing a Directory; these are used to split directories into multiple blocks when they get too big.
+- Symlink
+  Represents a POSIX symlink.
+
+## Paths
+
+A path is a collection of path components (some bytes) separated by `/` (0x2f), read from left to right; it is inspired by POSIX paths.
+
+Components MUST NOT contain the `/` character, as it would break the component into two.
+
+Components SHOULD be valid UTF-8.
+
+### Escaping
+
+The `\` character might be expected to start an escape sequence.
+
+Escaping may become a thing, but it is currently broken and inconsistent across implementations.
+Until we agree on a spec for it, you SHOULD NOT use escape sequences or non-ASCII characters.
+
+### SHOULD NOT names
+
+These names SHOULD NOT be used:
+
+- `.` (as this represents the self node in POSIX pathing)
+- `..` (as this represents the parent node in POSIX pathing)
+- the empty string (we don't actually know the failure mode for this, but it really feels like it shouldn't be a thing)
+
+## How to read a Node
+
+You start from some CID; this is what we will be trying to decode.
+
+As a recap, every CID MUST include:
+1. A [multicodec](https://github.com/multiformats/multicodec), also called a codec.
+2. A [Multihash](https://github.com/multiformats/multihash) (used to specify a hashing algorithm, some hashing parameters and a digest).
+
+### Get the block
+
+The first step is to get the block, that is, the actual bytes which, when hashed with the multihash function from the CID, give you the same digest back.
+
+### Decoding the bytes
+
+#### Block limit
+
+A so-called "block limit" is in place: we do not allow any single block to be bigger than 2MiB.
+
+Implementations SHOULD try not to emit blocks bigger than 1MiB, but MUST decode blocks of up to 2MiB.
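+
+As a rough, non-normative illustration of the two checks above (content addressing and the block limit), here is a small Python sketch; it assumes the common sha2-256 (`0x12`) multihash, other hash functions would have to be dispatched on the multihash code found in the CID:
+
+```python
+import hashlib
+
+MAX_BLOCK_SIZE = 2 * 1024 * 1024  # blocks bigger than 2MiB MUST be rejected
+
+def check_block(block: bytes, expected_digest: bytes) -> bytes:
+    """Verify fetched bytes against the digest carried in the CID's multihash."""
+    if len(block) > MAX_BLOCK_SIZE:
+        raise ValueError("block exceeds the 2MiB block limit")
+    # Sketch assumes a sha2-256 multihash; a real implementation dispatches
+    # on the multihash code instead of hard-coding the hash function.
+    if hashlib.sha256(block).digest() != expected_digest:
+        raise ValueError("block bytes do not match the CID's multihash digest")
+    return block
+```
+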
+#### Multicodecs
+
+With UnixFS we deal with two codecs, which are decoded differently:
+- `Raw`, single-block files
+- `Dag-PB`, any node type
+
+##### `Raw` blocks
+
+The simplest node is a `Raw` node.
+
+They are always of type file.
+
+They can be recognised because their CIDs have the `Raw` codec.
+
+The file content is purely the block body.
+
+They never have any children, and thus are also known as single-block files.
+
+Their size (both `dagsize` and `blocksize`) is the length of the block body.
+
+###### `Raw` Example
+
+Let's build a `Raw` file whose content is `test`.
+
+1. First hash the data:
+```console
+$ echo -n "test" | sha256sum
+9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  -
+```
+
+2. Add the CID header:
+```
+f this is the multibase prefix, needed because we are working with a hex CID; it is omitted for binary CIDs
+  01 the CID version, here one
+  55 the codec, here we MUST use Raw because this is a Raw file
+  12 the hashing function used, here sha256
+  20 the digest length, 32 bytes
+  9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 the digest we computed earlier
+```
+
+3. Profit
+Assuming we stored this block in an implementation of our choice which makes it accessible to our client, we can try to read it back:
+```console
+$ ipfs cat f015512209f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
+test
+```
+
+##### `Dag-PB` nodes
+
+These nodes support many different types (found in `decodeData(PBNode.Data).Type`); every type is handled differently.
+
+###### `File` type
+
+**The sister-lists `PBNode.Links` and `decodeMessage(PBNode.Data).blocksizes`**
+
+The sister-lists are the key reason why `dag-pb` is important for files.
+
+They allow us to concatenate smaller files together.
+
+Linked files are loaded recursively with the same process, following a DFS (Depth-First-Search) order.
+
+Child nodes must be of type file (so `Dag-PB` whose type is `File`, or `Raw`).
+
+For example, consider this pseudo-json block:
+```json
+{
+  "Links": [{"Hash":"Qmfoo"}, {"Hash":"Qmbar"}],
+  "Data": {
+    "Type": "File",
+    "blocksizes": [20, 30]
+  }
+}
+```
+
+This indicates that this file is the concatenation of the `Qmfoo` and `Qmbar` files.
+
+When reading this file, the `blocksizes` array gives us the size in bytes of the content of each child file; each index in `blocksizes` corresponds to the same index in `Links`.
+
+This allows fast indexing into the file. For example, if someone is trying to read bytes 25 to 35, we can compute an offset list by summing all previous entries in `blocksizes`, then search for the indexes that contain the range we are interested in.
+
+Here the offset list would be `[0, 20]`, and thus we know we only need to download `Qmbar` to get the range we are interested in.
+
+If `blocksizes` and `Links` are not of the same length you MUST error.
+
+**`decodeMessage(PBNode.Data).Data`**
+
+This field is an array of bytes; it is file content and comes before the content of the links.
+
+This must be taken into account when doing offset calculations (the length of the `Data.Data` field defines the value of the zeroth element of the offset list).
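+
+Putting the above together (inline `Data.Data` bytes first, then each child in link order), here is a minimal, non-normative Python sketch of reading a whole file depth-first; `get_block`, `decode_pbnode` and `decode_unixfs_data` are hypothetical helpers standing in for your block store and for the protobuf decoding described earlier:
+
+```python
+# Non-normative sketch; the helper functions and the string comparison on the
+# DataType enum are illustrative, not a real library API.
+
+def read_file(cid) -> bytes:
+    block = get_block(cid)
+    if cid.codec == "raw":                 # single-block file: the block body is the content
+        return block
+
+    node = decode_pbnode(block)            # outer dag-pb message
+    data = decode_unixfs_data(node.Data)   # inner UnixFS Data message
+    if data.Type != "File":
+        raise ValueError("not a file")
+    if len(node.Links) != len(data.blocksizes):
+        raise ValueError("unmatched sister lists")  # MUST error, as stated above
+
+    content = bytes(data.Data or b"")      # inline file bytes come first
+    for link in node.Links:                # then each child file, depth-first
+        content += read_file(link.Hash)
+    return content
+```
+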
+**Offset list**
+
+The offset list isn't the only way to use blocksizes and reach a correct implementation, but it is a simple, canonical one. Python pseudocode to compute it looks like this:
+```python
+def offsetlist(node):
+    unixfs = decodeDataField(node.Data)
+    if len(node.Links) != len(unixfs.Blocksizes):
+        raise ValueError("unmatched sister lists")  # error messages are implementation details
+
+    cursor = len(unixfs.Data) if unixfs.Data else 0
+    l = [0] * len(node.Links)
+    for i, size in enumerate(unixfs.Blocksizes):
+        l[i] = cursor
+        cursor += size
+
+    return l
+```
+
+This tells you at which offset inside this node the child at the corresponding index starts.
+
+**`PBNode.Links[].Name` with Files**
+
+Names SHOULD be absent from the protobuf message; however, for historic reasons we allow empty file names.
+
+If this field is present and non-empty, the file is invalid and you MUST error.
+
+**`Blocksize` of a dag-pb file**
+
+This is not a field present in the block directly, but rather a computable property of a file which is used in parent files.
+It is the length of the `Data.Data` field plus the sum of all the links' blocksizes.
+
+**`PBNode.Data.Filesize`**
+
+If present, this field must be equal to the `Blocksize` computation above, else the file is invalid.
+For now, the usage of this field is unclear given that it can be computed from other fields, but some implementations emit it.
+
+**Path resolution**
+
+Any attempt at path resolution on files MUST error.
+
+###### `Directory` Type
+
+A directory node is a named collection of child nodes.
+
+The minimum valid `PBNode.Data` field for a directory is (pseudo-json) `{"Type":"Directory"}`; other values are covered in Metadata.
+
+Every link in the Links list is an entry (child) of the directory, and the `PBNode.Links[].Name` field gives you its name.
+
+**Link ordering**
+
+The canonical sorting order is lexicographical over the names.
+
+In theory there is no reason an encoder couldn't use another ordering, however this loses some of its meaning when mapped into most file systems today (most file systems treat directories as unordered key-value objects).
+
+A decoder SHOULD, if it can, preserve the order of the original entries in however it consumes those names.
+
+However, when an implementation decodes, modifies, then re-encodes a directory, the original link order fully loses its meaning (given that there is no way to indicate which sorting was used originally).
+
+**Path Resolution**
+
+Pop the leftmost component of the path and try to match it against the Names in Links.
+
+If you find a match you can then remember the CID. You MUST continue your search; if you find another match you MUST error.
+
+Assuming no errors were raised, you can continue path resolution with the remaining components on the CID you popped.
+
+**Duplicate names**
+
+Duplicate names are not allowed; if two identical names are present in a directory, the decoder MUST error.
+
+##### `Symlink` type
+
+Symlinks MUST NOT have children.
+
+Their `Data.Data` field is a path; if the symlink is followed, resolution continues on that path, prepended to the remaining path components.
+
+###### Path resolution on symlinks
+
+Currently symlinks are not followed: implementations need to return the symlink object and fail if a consumer tries to resolve a path through it.
+
+This is a SHOULD-level requirement; you probably won't break many things if you start following them.
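+
+To tie the `Directory` rules above together (match the path component against link names, duplicates are an error), here is a minimal, non-normative Python sketch of path resolution through plain directories; the decode helpers are the same hypothetical ones as before, and HAMT directories and symlinks are deliberately left out:
+
+```python
+def resolve_path(root_cid, path: str):
+    """Walk `path` (no leading or trailing '/') starting from a directory CID."""
+    cid = root_cid
+    for component in path.split("/"):
+        node = decode_pbnode(get_block(cid))
+        data = decode_unixfs_data(node.Data)
+        if data.Type != "Directory":
+            raise ValueError("can only resolve paths through directories")
+
+        matches = [link.Hash for link in node.Links if link.Name == component]
+        if len(matches) > 1:
+            raise ValueError("duplicate names in directory")  # MUST error
+        if not matches:
+            raise KeyError(component)                         # no such entry
+        cid = matches[0]                                      # descend and continue
+    return cid
+```
+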
+##### `HAMTDirectory`
+
+`node.Data.hashType` indicates a multihash function used to digest the path components for sharding.
+If this field is missing, murmur3-x64-64 (`0x22`) is implied.
+
+`node.Data.Data` is a bitfield; set bits indicate whether the corresponding links are part of this HAMT (further shards) or leaves of the HAMT.
+The usage of this field is unclear given that you can deduce the same information from the link names.
+
+###### Path resolution on HAMTs
+
+Steps:
+1. Take the current path component, then hash it using the multihash id provided in `Data.hashType`.
+2. Pop the lowest byte from the path component's hash digest, hex encode this byte (using 0-F), and find the link whose name starts with this hex-encoded byte.
+3. If the link name is exactly two bytes (that is, only the hex-encoded element), follow the link and repeat step 2 with the child node. The child node MUST be a HAMT directory, else the directory is invalid. Otherwise continue.
+4. You have successfully resolved a path component; everything past the hex-encoded prefix is the name of that element (useful when listing the children of this directory).
+
+##### `TSize` / `DagSize`
+
+This is an optional field on the Links of `dag-pb` nodes, **it does not represent any meaningful information about the underlying structure**, and it has no known usage to this day (although some implementations emit it).
+
+To compute the `dagsize` of a node (which would be stored in its parents) you take the binary length of the outer dag-pb message and add the dagsizes of all child nodes.
+
+An example of where this could be useful is as a hint to smart download clients: for example, if you are downloading a file concurrently from two sources that have radically different speeds, it would probably be more efficient to download bigger links from the fastest source, and smaller ones from the slowest source.
+
+There is no known failure mode for this field, so your implementation should be able to decode nodes where this field is wrong (not the value you expect), partially present, or completely missing. This also allows smarter encoders to give a more accurate picture (for example, not counting duplicate blocks, ...).
+
+##### Traversal order
+
+The traversal order is the node itself, then each link traversed recursively, from low index to high index.
+
+### Metadata
+
+UnixFS currently supports two optional metadata fields:
+
+* `mode` -- The `mode` is for persisting the file permissions in [numeric notation](https://en.wikipedia.org/wiki/File_system_permissions#Numeric_notation) \[[spec](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html)\].
+  - If unspecified this defaults to
+    - `0755` for directories/HAMT shards
+    - `0644` for all other types where applicable
+  - The nine least significant bits represent `ugo-rwx`
+  - The next three least significant bits represent `setuid`, `setgid` and the `sticky bit`
+  - The remaining 20 bits are reserved for future use, and are subject to change.
Spec implementations **MUST** handle bits they do not expect as follows: + - For future-proofing the (de)serialization layer must preserve the entire uint32 value during clone/copy operations, modifying only bit values that have a well defined meaning: `clonedValue = ( modifiedBits & 07777 ) | ( originalValue & 0xFFFFF000 )` + - Implementations of this spec must proactively mask off bits without a defined meaning in the implemented version of the spec: `interpretedValue = originalValue & 07777` + +* `mtime` -- A two-element structure ( `Seconds`, `FractionalNanoseconds` ) representing the modification time in seconds relative to the unix epoch `1970-01-01T00:00:00Z`. + - The two fields are: + 1. `Seconds` ( always present, signed 64bit integer ): represents the amount of seconds after **or before** the epoch. + 2. `FractionalNanoseconds` ( optional, 32bit unsigned integer ): when specified represents the fractional part of the mtime as the amount of nanoseconds. The valid range for this value are the integers `[1, 999999999]`. + + - Implementations encoding or decoding wire-representations must observe the following: + - An `mtime` structure with `FractionalNanoseconds` outside of the on-wire range `[1, 999999999]` is **not** valid. This includes a fractional value of `0`. Implementations encountering such values should consider the entire enclosing metadata block malformed and abort processing the corresponding DAG. + - The `mtime` structure is optional - its absence implies `unspecified`, rather than `0` + - For ergonomic reasons a surface API of an encoder must allow fractional 0 as input, while at the same time must ensure it is stripped from the final structure before encoding, satisfying the above constraints. + + - Implementations interpreting the mtime metadata in order to apply it within a non-IPFS target must observe the following: + - If the target supports a distinction between `unspecified` and `0`/`1970-01-01T00:00:00Z`, the distinction must be preserved within the target. E.g. if no `mtime` structure is available, a web gateway must **not** render a `Last-Modified:` header. + - If the target requires an mtime ( e.g. a FUSE interface ) and no `mtime` is supplied OR the supplied `mtime` falls outside of the targets accepted range: + - When no `mtime` is specified or the resulting `UnixTime` is negative: implementations must assume `0`/`1970-01-01T00:00:00Z` ( note that such values are not merely academic: e.g. the OpenVMS epoch is `1858-11-17T00:00:00Z` ) + - When the resulting `UnixTime` is larger than the targets range ( e.g. 32bit vs 64bit mismatch ) implementations must assume the highest possible value in the targets range ( in most cases that would be `2038-01-19T03:14:07Z` ) + +## Design decision rationale + +### Metadata + +Metadata support in UnixFSv1.5 has been expanded to increase the number of possible use cases. These include rsync and filesystem based package managers. + +Several metadata systems were evaluated: + +#### Separate Metadata node + +In this scheme, the existing `Metadata` message is expanded to include additional metadata types (`mtime`, `mode`, etc). It then contains links to the actual file data but never the file data itself. + +This was ultimately rejected for a number of reasons: + +1. You would always need to retrieve an additional node to access file data which limits the kind of optimizations that are possible. + + For example many files are under the 256KiB block size limit, so we tend to inline them into the describing UnixFS `File` node. 
This would not be possible with an intermediate `Metadata` node. + +2. The `File` node already contains some metadata (e.g. the file size) so metadata would be stored in multiple places which complicates forwards compatibility with UnixFSv2 as to map between metadata formats potentially requires multiple fetch operations + +#### Metadata in the directory + +Repeated `Metadata` messages are added to UnixFS `Directory` and `HAMTShard` nodes, the index of which indicates which entry they are to be applied to. + +Where entries are `HAMTShard`s, an empty message is added. + +One advantage of this method is that if we expand stored metadata to include entry types and sizes we can perform directory listings without needing to fetch further entry nodes (excepting `HAMTShard` nodes), though without removing the storage of these datums elsewhere in the spec we run the risk of having non-canonical data locations and perhaps conflicting data as we traverse through trees containing both UnixFS v1 and v1.5 nodes. + +This was rejected for the following reasons: + +1. When creating a UnixFS node there's no way to record metadata without wrapping it in a directory. + +2. If you access any UnixFS node directly by its [CID], there is no way of recreating the metadata which limits flexibility. + +3. In order to list the contents of a directory including entry types and sizes, you have to fetch the root node of each entry anyway so the performance benefit of including some metadata in the containing directory is negligible in this use case. + +#### Metadata in the file + +This adds new fields to the UnixFS `Data` message to represent the various metadata fields. + +It has the advantage of being simple to implement, metadata is maintained whether the file is accessed directly via its [CID] or via an IPFS path that includes a containing directory, and by keeping the metadata small enough we can inline root UnixFS nodes into their CIDs so we can end up fetching the same number of nodes if we decide to keep file data in a leaf node for deduplication reasons. + +Downsides to this approach are: + +1. Two users adding the same file to IPFS at different times will have different [CID]s due to the `mtime`s being different. + + If the content is stored in another node, its [CID] will be constant between the two users but you can't navigate to it unless you have the parent node which will be less available due to the proliferation of [CID]s. + +2. Metadata is also impossible to remove without changing the [CID], so metadata becomes part of the content. + +3. Performance may be impacted as well as if we don't inline UnixFS root nodes into [CID]s, additional fetches will be required to load a given UnixFS entry. + +#### Side trees + +With this approach we would maintain a separate data structure outside of the UnixFS tree to hold metadata. + +This was rejected due to concerns about added complexity, recovery after system crashes while writing, and having to make extra requests to fetch metadata nodes when resolving [CID]s from peers. + +#### Side database + +This scheme would see metadata stored in an external database. + +The downsides to this are that metadata would not be transferred from one node to another when syncing as [Bitswap] is not aware of the database, and in-tree metadata + +### UnixTime protobuf datatype rationale + +#### Seconds + +The integer portion of UnixTime is represented on the wire using a varint encoding. While this is +inefficient for negative values, it avoids introducing zig-zag encoding. 
Values before the year 1970 +will be exceedingly rare, and it would be handy having such cases stand out, while at the same keeping +the "usual" positive values easy to eyeball. The varint representing the time of writing this text is +5 bytes long. It will remain so until October 26, 3058 ( 34,359,738,367 ) + +#### FractionalNanoseconds +Fractional values are effectively a random number in the range 1 ~ 999,999,999. Such values will exceed +2^28 nanoseconds ( 268,435,456 ) in most cases. Therefore, the fractional part is represented as a 4-byte +`fixed32`, [as per Google's recommendation](https://developers.google.com/protocol-buffers/docs/proto#scalar). + +[multihash]: https://tools.ietf.org/html/draft-multiformats-multihash-00 +[CID]: https://docs.ipfs.io/guides/concepts/cid/ +[Bitswap]: https://github.com/ipfs/specs/blob/master/BITSWAP.md +[MFS]: https://docs.ipfs.io/guides/concepts/mfs/