unixfs v1.5 metadata support #223

Closed
wants to merge 7 commits into from
62 changes: 61 additions & 1 deletion UNIXFS.md
@@ -46,10 +46,15 @@ message Data {

optional uint64 hashType = 5;
optional uint64 fanout = 6;

repeated Metadata metadata = 7;
Member Author

I chose repeated metadata structs like this as it allows us to use one byte to represent every missing piece of metadata. We could alternatively use:

  1. A map of integers to metadata.
  2. A bitmap to indicate which files have metadata.

However, both feel like overkill. I'm guessing directories will either have metadata or won't.

Ideally, we'd be able to just embed this metadata in the links section of the parent object but that is not to be...

Member

Just to clarify, does the order in which the metadata appears here map directly to the ordering of the Links in the dag-pb node? i.e. metadata[3] applies to Links[3].

...and this only applies when Type is Directory right? Can this be documented please?

Also, if we have one or more files in a directory with metadata then every file has to have metadata right so the order doesn't get mixed up?

Separately, did we consider allowing a single metadata property on the root node for each file so that we can still know metadata when the file is accessed directly? If the reason is in order to not change a file's hash because of meta then maybe mention the reason behind this design decision (perhaps this is not something you typically put in a spec, but, where else is appropriate)?

How does this work for HAMT shards? Links in a sharded dag-pb node don't map directly to files, and listing perms for every file in a sharded directory would make a directory node infeasibly large.

It's very likely metadata will be used for HAMT shards so any solution we come up with here should work with them. Can we please document the solution?

Member Author

> Also, if we have one or more files in a directory with metadata then every file has to have metadata right so the order doesn't get mixed up?

Yes. I've expanded the spec a bit.

> Separately, did we consider allowing a single metadata property on the root node for each file so that we can still know metadata when the file is accessed directly? If the reason is in order to not change a file's hash because of meta then maybe mention the reason behind this design decision (perhaps this is not something you typically put in a spec, but, where else is appropriate)?

It was considered and dropped for the reason you gave. I'll document it.

> How does this work for HAMT shards? Links in a sharded dag-pb node don't map directly to files, and listing perms for every file in a sharded directory would make a directory node infeasibly large.
> It's very likely metadata will be used for HAMT shards so any solution we come up with here should work with them. Can we please document the solution?

Should be fixed.

Member

please no int maps

Contributor

@rvagg could you expand on that? ( I am clearly missing prior discussion of a downside... )

Member

This probably isn't the place to litigate this, it always ends in too much text being spilled, but int maps are something we're trying to avoid for IPLD. They're just not straightforward across codecs and programming languages. https://github.com/ipld/specs/blob/master/data-model-layer/data-model.md#motivation

optional Metadata defaultMetadata = 8;
}

message Metadata {
-optional string MimeType = 1;
+optional string mimeType = 1;
optional uint32 mode = 2;
optional int64 mtime = 3;
Member Author

64 bits because:

  1. We're using varints anyways.
  2. I'd rather not deal with this again in 2038 even if we've moved on to a new format.

(also, it's signed because the world did not start in 1970)
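
The varint point can be checked with a small sketch. It assumes the standard protobuf wire encoding for `int64` (a base-128 varint, with negative values written as their 64-bit two's complement); `varint_len` is a hypothetical helper for illustration, not part of any UnixFS implementation.

```python
def varint_len(value: int) -> int:
    """Bytes needed to encode `value` as a protobuf int64 varint."""
    # A negative int64 is encoded as its 64-bit two's complement,
    # so negative values always take the maximum of 10 bytes.
    unsigned = value & 0xFFFFFFFFFFFFFFFF
    length = 1
    while unsigned >= 0x80:  # more than 7 bits left: emit another byte
        unsigned >>= 7
        length += 1
    return length


print(varint_len(1_573_000_000))  # a timestamp from late 2019
print(varint_len(2**31))          # first value past the signed 32-bit limit
print(varint_len(-1))             # one second before the epoch
```

Under these assumptions a present-day timestamp costs the same five bytes whether the field is declared 32-bit or 64-bit, while pre-1970 values cost ten bytes each.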

Member

> I'd rather not deal with this again in 2038 even if we've moved on to a new format.

❤️ this is some dedication to the project 😂

Agree that the world didn't start in 1970 :) In that case should the default value be something other than 01/01/1970?

}
```

@@ -59,6 +64,61 @@ For files that are comprised of more than a single block, the 'Type' field will

This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object.

## Metadata

UnixFS currently supports three metadata fields:

* `mimeType` -- The mime-type of the file. This generally shouldn't be used.
Member

perhaps this "shouldn't" could be explained? is it deprecated? does it have a special case use?

Member

I'm also very confused by this. I didn't know it exists in any existing data; it won't exist in Unixfsv2; and I thought we'd had multiple roundabouts in github issues in which multiple people highlighted the issue that mimetypes are not generally considered a property of a filesystem (rather, they tend to show up in applications, servers, and other things that are routing control flow in some way) and that they shouldn't thus be in unixfsv{n}.

Member Author

This type was already reserved by the existing metadata feature.

Member

Is there any reason not to leave it there, in the deprecated state that it becomes there?

If it's deprecated, it can still be used -- but "shouldn't" be. Which is what this text says anyway. If it's deprecated and we don't want it to be used, we should prefer to draw proportionally less attention to it, and leaving it in the deprecated notes should be sufficient for that.

* `mode` -- The `mode` is for optionally persisting the [file permissions in numeric notation](https://en.wikipedia.org/wiki/File_system_permissions#Numeric_notation) \[[spec](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html)\]. It defaults to 0755 if unspecified.
* `mtime` -- The modification time in seconds since the epoch. This defaults to the unix epoch if unspecified.
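
As a rough illustration of the two defaults listed above, the sketch below fills in an unset `mode` or `mtime` and renders the result. The helper names are hypothetical and `None` is assumed to stand for an unspecified field.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_MODE = 0o755  # default stated in the text above
DEFAULT_MTIME = 0     # the unix epoch, in seconds

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def effective_mode(mode: Optional[int]) -> int:
    """Return the stored mode, or the default when the field is unset."""
    return DEFAULT_MODE if mode is None else mode


def effective_mtime(mtime: Optional[int]) -> datetime:
    """Return the stored mtime as a UTC datetime, defaulting to the epoch."""
    seconds = DEFAULT_MTIME if mtime is None else mtime
    # Adding a timedelta (rather than calling fromtimestamp) also handles
    # negative values, i.e. times before 1970.
    return EPOCH + timedelta(seconds=seconds)


def permission_string(mode: int) -> str:
    """Render the low nine permission bits in `rwxr-xr-x` form."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )


print(permission_string(effective_mode(None)))  # rwxr-xr-x
print(effective_mtime(None).isoformat())        # 1970-01-01T00:00:00+00:00
print(effective_mtime(-86400).isoformat())      # 1969-12-31T00:00:00+00:00
```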

There are three ways to specify file metadata:

### Embedded in the directory

Each entry in the repeated metadata field corresponds to the linked file/directory at the same offset in the Links section of the outer dag-pb. This field is appropriate in nodes with type `Directory` and `HAMTShard`. However, metadata should _not_ be specified for links that point to other HAMT shards.
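
The offset pairing described above can be sketched as follows. Plain dicts and lists stand in for the decoded dag-pb `Links` and the `metadata` field, and `metadata_for` is a hypothetical helper; the equal-length check reflects the review discussion that once any entry is present, every link needs one. HAMT shard traversal is not modelled.

```python
from typing import Optional


def metadata_for(links: list, metadata: list, name: str) -> Optional[dict]:
    """Return the metadata entry paired with the link called `name`."""
    if not metadata:
        return None  # the directory carries no per-link metadata at all
    if len(metadata) != len(links):
        # Once any entry is present, every link needs one so offsets line up.
        raise ValueError("metadata and Links must have the same length")
    for link, entry in zip(links, metadata):
        if link["Name"] == name:
            return entry
    return None


links = [{"Name": "notes.txt"}, {"Name": "run.sh"}]
metadata = [{}, {"mode": 0o755, "mtime": 1573000000}]  # {} = nothing recorded

print(metadata_for(links, metadata, "run.sh"))
print(metadata_for(links, metadata, "notes.txt"))
```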
Stebalien marked this conversation as resolved.

For some context, this solution was chosen over embedding the metadata in the file itself because:

1. It allows accessing the metadata without downloading the target files.
Member

When we do a directory listing, we have to fetch at least the root node of every directory entry (or traverse a HAMT) in order to say whether it's a file entry or a sub directory entry so we already have a mechanism for accessing metadata without having to materialise the whole file, so I'm not sure what advantage is gained here.

Member Author

  1. You're right.
  2. We could consider additionally storing the file type but I'm worried that will be even more complicated.

Member

@achingbrain Nov 1, 2019

I'm not sure storing the file type in the directory is a good idea since we'd lose that information if we referenced the DAGNode directly (e.g. via /ipfs/Qmfoo) - I think it's fine as it is.

Member

c/o @warpfork, quite a good discussion about metadata+files that's relevant here I think, feeding into unixfsv2 design decisions https://hackmd.io/4lqtycvdQN2WTspBLpy3qw

2. More importantly, it avoids changing the root hash of the target files to allow for better deduplication.
Member

Should we mention the tradeoff? If I have QmDir and file.txt (QmFileTxt) then I'll only be able to see metadata for the file if it's accessed via /ipfs/QmDir/file.txt. The file will have no metadata when accessed like /ipfs/QmFileTxt.

@alanshaw from #220 (comment) :

As far as I can tell, for the original set of users waiting on this, there is no usecase where a piece of data needs POSIX attributes without a filename. The filename is already only available as a directory wrapper node, there is no other way to attach name to an "anonymous stream of data". What makes other attributes different enough to warrant a non-directory anonymous wrapper-node?

I.e. what is the use-case of needing to know mtime/nodetype/permissions of an object without knowing its name?

Member

I don't think metadata is tied to file names - you could be copying /ipfs/QmHashOfSomeBinaryFile from ipfs-at-large to a location in your mfs, at which point it would receive a filename, but if it was executable, you'd want to retain that information.

Member Author

@Stebalien Nov 1, 2019

In this design, the metadata is stored in the parent directory. That doesn't mean that it's tied to the file name.

It does mean that simplifying a path from /ipfs/QmDir/file.txt to /ipfs/QmFile will drop the metadata (along with the filename).

Member

This doesn't seem like the right way to do it to me. If I copy a file to another location on my filesystem, it doesn't lose its permissions, nor will it lose its permissions if I remove it from its containing directory (to put it in a tarball, for example).

It will involve changing the CID of a file if its metadata changes, but that seems like a reasonable tradeoff to me, since the CIDs of the underlying file chunks will not change (assuming the file size is > 256KB). We'll probably see the most CID churn in a user's mfs, but we generally don't care about CIDs in mfs as much as paths.

Contributor

A counter thought: why should metadata that changes independently from the data (e.g. permissions) be anywhere in the directory tree, couldn't it live outside the tree?

> If I copy a file to another location on my filesystem, it doesn't lose its permissions

True, but when I email you a file or upload it to Dropbox it does lose its permissions. Similarly, using unix time as a metadata attribute makes very little sense in a distributed context. For example, if Alice and Bob both download an Ubuntu ISO and add it to IPFS, the timestamps will be different.

> It will involve changing the CID of a file if its metadata changes, but that seems like a reasonable tradeoff to me, since the CIDs of the underlying file chunks will not change

If I recall correctly, the way IPFS currently works makes this a pretty big performance hit. For example, say Alice and Bob both have the same files, but with different metadata. Then if Charlie tries to download Alice's CID while Alice is offline then he gets no data. Similarly, even if Alice is online Charlie will end up downloading the data from Alice but not Bob since his node probably won't make another DHT request for the file CID (which would be expensive anyway).

There are alternative solutions for people that want to share metadata that do not involve putting OS metadata near files for everyone else, which I think are worth discussing. However, if unixfsv1.5 is just a temporary measure to tide us over until unixfsv2 lands then as long as we only use the metadata we actually need (i.e. mtime + execute bit) then this is fine by me.

Member

> when I email you a file or upload it to Dropbox it does lose its permissions.

Those aren't filesystem operations, so I would not expect them to retain permissions. OK, tarballing something up isn't specifically a filesystem operation either, but it is sensitive to its environment and so does preserve perms. I think we should endeavour to do the same thing.

> Similarly, using unix time as a metadata attribute makes very little sense in a distributed context.

I agree, distributed time is a bit weird. That said, the file chunks of the ISO will be the same, just the root node containing the metadata will be different. That said Alice and Bob could use different chunking strategies, or whatever so it's already easy to have two CIDs that point to the same file.

> the way IPFS currently works

This doesn't seem like a good thing to base spec decisions on. Changing how IPFS works is just another spec decision away, after all.

Contributor

> Those aren't filesystem operations so I would not expect them to retain permissions.

I think that people having mounted Dropbox folders on their machine almost exactly mimics the behavior that users would expect from IPFS.

> could use different chunking strategies, or whatever so it's already easy to have two CIDs that point to the same file.

That's true, but A) if they use the chunking defaults there's no problem, and B) there are possible solutions for dealing with multiple chunkers, since fundamentally the two files can have a canonical CID (e.g. sha256 of the entire file without any unixfs intermediate blocks). However, these solutions cannot properly work if the actual content is different (which it would be, since metadata contains information).
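
The canonical-hash idea mentioned here can be illustrated with a short sketch. It only shows that a whole-file sha256 is independent of how the bytes are chunked; it does not build real CIDs or unixfs blocks, and the helper names and the `mtime=` prefix are made up for illustration.

```python
import hashlib


def chunks(data: bytes, size: int) -> list:
    """Split `data` into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def whole_file_sha256(pieces: list) -> str:
    """Hash the concatenation of `pieces` without joining them in memory."""
    digest = hashlib.sha256()
    for piece in pieces:
        digest.update(piece)
    return digest.hexdigest()


data = bytes(range(256)) * 4096          # 1 MiB of sample content
alice = chunks(data, 256 * 1024)         # default-sized chunks
bob = chunks(data, 100_000)              # a different chunking strategy

# Same bytes, different chunking: the whole-file hash still matches.
print(whole_file_sha256(alice) == whole_file_sha256(bob))

# Mixing metadata into the hashed content breaks the match.
with_meta = [b"mtime=1573000000"] + alice
print(whole_file_sha256(with_meta) == whole_file_sha256(bob))
```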

> Changing how IPFS works is just another spec decision away, after all

Unless there's a proposal for how to fix the problem I'm wary of "we'll fix it later". Also, my understanding is that unixfsv1.5 is supposed to be a temporary fix and so the probability of "later" coming before unixfsv1.5 is deprecated is even lower.

Contributor

> This doesn't seem like the right way to do it to me. If I copy a file to another location on my filesystem, it doesn't lose its permissions, nor will it lose its permissions if I remove it from its containing directory (to put it in a tarball, for example).

@achingbrain re-reading the thread I realized this is also a flawed model. In general it is the tool that does the copying that preserves various bits of metadata, and moreover only does so if the context allows it to. Consider:

rabbit@Ahasver:/dev/shm/stuff$ ls -l
total 0

rabbit@Ahasver:/dev/shm/stuff$ touch source

rabbit@Ahasver:/dev/shm/stuff$ chmod 666 source

rabbit@Ahasver:/dev/shm/stuff$ umask
0022

rabbit@Ahasver:/dev/shm/stuff$ cp source dest

rabbit@Ahasver:/dev/shm/stuff$ ls -l
total 0
-rw-r--r-- 1 rabbit rabbit 0 Nov 13 10:29 dest
-rw-rw-rw- 1 rabbit rabbit 0 Nov 13 10:29 source
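
The arithmetic behind that listing, as a minimal sketch: it assumes a plain `cp` (no `-p`) requests the source file's mode for the new file and that the process umask then clears bits from it. `created_mode` is a hypothetical helper and touches no real files.

```python
def created_mode(requested: int, umask: int) -> int:
    """Mode a new file ends up with once the umask has been applied."""
    return requested & ~umask


source_mode = 0o666   # set by `chmod 666 source`
umask = 0o022         # reported by `umask` in the session above

dest_mode = created_mode(source_mode, umask)
print(oct(dest_mode))  # 0o644, i.e. -rw-r--r--
```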

Member

The default behaviour is to create the new file with the same permissions, umask changes the defaults.

If we access a file via its CID we won't have the option to preserve permissions as they are simply not there.

Member

Since the file content is by far larger than the UnixFS data describing the file and is not affected by changing the metadata, is this a de-duplication win that's worth only having metadata available if a file/dir is accessed via a directory?

Member

I guess it depends on whether the file fits into a single chunk or not?

Member

Yes, but I don't know if that's a bad thing or not. My assumption is that most non-trivial uses of IPFS will see files that are more than one block in size.

Member Author

Most files are actually pretty small and 256KiB (the default block size) is pretty large.


### Using an intermediate node

DEPRECATED

Metadata can be applied to a single file/directory with an intermediate "metadata node":

1. A `Type` of `Metadata`.
2. A `Data` containing an encoded `Metadata` message.
3. A single `Link` (in the outer dag-pb node) pointing to the actual file/directory.

This solution has been _deprecated_ because it requires an additional intermediate node just for the metadata. It is also poorly supported.

### Default metadata

The `defaultMetadata` field can be used in conjunction with the repeated `metadata` field to specify metadata "defaults".

1. The first `defaultMetadata` field encountered when traversing a sharded directory takes precedence.
Member

Somewhat related point, does defaultMetadata apply to just the directory it is defined in, or can I specify it in a parent directory and have a parent/sub or parent/sub/sub inherit it? I assume no, but maybe worth clarifying?

Member Author

No. But I'd like to revisit #223 (comment) and consider dropping it before moving on.

2. Default metadata and explicit per-file metadata are merged field-wise.

Given the `defaultMetadata` field, the _actual_ default metadata is determined as follows:

* For directories:
  * The default mode is `defaultMetadata.mode | 0111` (sets the execute bit).
  * The default mime type is ignored.
  * The default mtime is as specified in `defaultMetadata`.
* For symlinks, the defaults are ignored.
* For regular files, the defaults are as specified in `defaultMetadata`.
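
The per-type rules above could look roughly like this. The sketch uses a plain dict for the decoded `defaultMetadata` message and string tags for the node type; both are assumptions for illustration, as is the helper name `actual_defaults`.

```python
def actual_defaults(default_metadata: dict, node_type: str) -> dict:
    """Adjust `defaultMetadata` for the kind of node it is applied to."""
    if node_type == "symlink":
        return {}  # defaults are ignored for symlinks
    if node_type == "directory":
        adjusted = {}
        if "mode" in default_metadata:
            # Set the execute bits so the directory can be traversed.
            adjusted["mode"] = default_metadata["mode"] | 0o111
        if "mtime" in default_metadata:
            adjusted["mtime"] = default_metadata["mtime"]
        return adjusted  # mimeType is deliberately dropped
    return dict(default_metadata)  # regular files take the defaults as-is


defaults = {"mode": 0o644, "mtime": 1573000000, "mimeType": "text/plain"}

print(actual_defaults(defaults, "directory"))
print(actual_defaults(defaults, "symlink"))
print(actual_defaults(defaults, "file"))
```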

To determine the metadata for a file:

1. The per-file metadata is looked up in the `metadata` list as usual.
2. For each field in the file's metadata:
   1. If the field is specified, use it.
   2. If the field is unspecified but a default is specified, use it.
   3. Otherwise, use the global default.
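
The lookup steps above can be sketched as a field-wise merge. Dicts stand in for decoded `Metadata` messages, `defaults` is assumed to be the already-adjusted default metadata for the node, and the global defaults are taken from the field descriptions earlier in this section (`mode` 0755, `mtime` at the epoch). Leaving `mimeType` absent when nothing specifies it is an assumption, since no global default is stated for it.

```python
GLOBAL_DEFAULTS = {"mode": 0o755, "mtime": 0}
FIELDS = ("mimeType", "mode", "mtime")


def resolve(per_file: dict, defaults: dict) -> dict:
    """Merge per-file metadata with the directory defaults, field by field."""
    resolved = {}
    for field in FIELDS:
        if field in per_file:            # 1. explicitly specified
            resolved[field] = per_file[field]
        elif field in defaults:          # 2. directory-level default
            resolved[field] = defaults[field]
        elif field in GLOBAL_DEFAULTS:   # 3. global default
            resolved[field] = GLOBAL_DEFAULTS[field]
    return resolved


defaults = {"mtime": 1573000000}
print(resolve({"mode": 0o700}, defaults))  # own mode, default mtime
print(resolve({}, defaults))               # global mode, default mtime
print(resolve({}, {}))                     # global defaults only
```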

## Importing

Importing a file into unixfs is split up into two parts. The first is chunking, the second is layout.