unixfs v1.5 metadata support #223
UNIXFS.md
Outdated
1. The repeated `metadata` field in a directory applies metadata to each file in the directory.
2. An intermediary node with a `Type` of `Metadata` applies metadata to an individual file. While this feature should be supported, it has been deprecated.
3. An optional `defaultMetadata` field to specify the _default_ metadata for files in the directory. If unspecified, the default mode is 0755 and the default modification time is the epoch.
This can save us a ton of space as most files in a directory have the same mode. However, it is slightly complicated.
@mib-kd743naq suggested keeping a reverse mapping of, e.g., `mode -> [file offsets]` but, while very compact, that felt too complicated.
On the other hand, we could also just rely on future compression. Repeated identical metadata fields will compress really well.
> but, while very compact, that felt too complicated.
This is mostly because of huge/sharded directories, right? Otherwise, it seems reasonably straightforward.
> we could also just rely on future compression
That seems like a bad idea because that future should hopefully come after unixfsv2. Also, compression and encryption sometimes don't play nicely together, which might cause problems if we rely too heavily on it.
I'll channel @mib-kd743naq's alter-ego here ;P The idea of `defaultMetadata` looks deceptively simple, but would be hard to make "convergent". The typical mode of producing UFS1.5 datasets would be tar-file ( or equivalent ) importers. A tar importer works best ( and most faithfully ) in linear streaming fashion, which in turn means that the "best" `defaultMetadata` is difficult to know ahead of time. This is especially important for mega-directories like https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts and is even harder to determine in cases of sharded directories.
The flip side of doing "deduplication of metadata" as we go is to simply keep a map during encoding of "what metadata have we seen so far" keyed off the encoded protobuf content itself, together with a plain integer increment counter.
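The running map described above can be sketched roughly as follows. This is a minimal Python illustration, not the importer's actual code: `MetadataInterner` and its method names are hypothetical, and the byte strings merely stand in for encoded protobuf metadata blobs.

```python
class MetadataInterner:
    """Assigns a small integer to each distinct encoded metadata blob."""

    def __init__(self):
        self._index_by_blob = {}   # encoded bytes -> integer id
        self.blobs = []            # integer id -> encoded bytes

    def intern(self, encoded: bytes) -> int:
        # Reuse the existing id if this exact encoding was seen before.
        idx = self._index_by_blob.get(encoded)
        if idx is None:
            idx = len(self.blobs)  # plain incrementing counter
            self._index_by_blob[encoded] = idx
            self.blobs.append(encoded)
        return idx


# Streaming use: entries arrive one at a time, as from a tar importer.
# The byte strings are placeholders, not real protobuf encodings.
interner = MetadataInterner()
entries = [b"mode=0644", b"mode=0755", b"mode=0644", b"mode=0644"]
ids = [interner.intern(e) for e in entries]
```

Because the ids are handed out in order of first appearance, nothing needs to be known about later entries, which is what makes this approach compatible with a linear streaming importer.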
With all that said - it appears to NOT matter. Using the dataset of the current kernel, with each individual entry decorated with the proper metadata based on a spec similar[*] to the proposed one, the numbers are way closer than I anticipated. Also, yes, transparent block compression will help a lot here, but there will still be a lot of indirection for a compressor to deal with, and there is nothing simpler than simply having nothing extra included ;) Additionally, transparent compression is a separate :canofworms: that may take longer than UFS2.0 to shake out :(
So, the data is based on https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.3.7.tar.sign
To see individual directory makeups simply run:
for h in \
bafybeif7ceybwlonoct3bnjtnnfookm2iezslo2ypwrsae6ilkc2ucvsfm \
bafybeigoe7cnxqy6hjlcci3ae62l5fmws5pxeq62d77lhbvgphhm5ugtnq \
bafybeico6ybaiaqohfsydgqxexwcjb4vnlhzojyxhjcw6s3xhm7weaikb4 \
bafybeibj2hpul5vgomel56rl6xf63rviyym4go5pg2fmwwqioaltshh6fa
do tgt=$h/linux-5.3.7/Documentation/PCI/endpoint
echo -e "\n\nhttps://ipfs.io/ipfs/$tgt\n==============\n"
ipfs block get $( ipfs resolve $tgt | cut -b 7- ) | protoc --decode_raw
done | less
And the stats:
dumb ( each directory entry gets exactly one "struct 7" ):
ipid : bafybeif7ceybwlonoct3bnjtnnfookm2iezslo2ypwrsae6ilkc2ucvsfm
count_data_blocks : 230,613
count_link_blocks : 42,341
count_all_blocks : 272,954
size_data_blocks : 843,884,873
size_link_blocks : 13,694,638
size_all_blocks : 857,579,511
size_payload : ^ dedup ^
stebalien-first ( Stebalien's proposal, if more than 2 entries are present in a directory, the first one becomes the "default struct 8" ):
ipid : bafybeigoe7cnxqy6hjlcci3ae62l5fmws5pxeq62d77lhbvgphhm5ugtnq
count_data_blocks : 230,613
count_link_blocks : 42,341
count_all_blocks : 272,954
size_data_blocks : 843,884,873
size_link_blocks : 13,036,969
size_all_blocks : 856,921,842
size_payload : ^ dedup ^
stebalien-most ( Steven's proposal, if more than 2 entries are present in a directory, an analysis takes place of which is the most-encountered, and that one becomes default ):
ipid : bafybeico6ybaiaqohfsydgqxexwcjb4vnlhzojyxhjcw6s3xhm7weaikb4
count_data_blocks : 230,613
count_link_blocks : 42,341
count_all_blocks : 272,954
size_data_blocks : 843,884,873
size_link_blocks : 13,013,814
size_all_blocks : 856,898,687
size_payload : ^ dedup ^
olcbean ( the proposal to just have a running reverse index per metadata blob ):
ipid : bafybeibj2hpul5vgomel56rl6xf63rviyym4go5pg2fmwwqioaltshh6fa
count_data_blocks : 230,613
count_link_blocks : 42,341
count_all_blocks : 272,954
size_data_blocks : 843,884,873
size_link_blocks : 12,941,959
size_all_blocks : 856,826,832
size_payload : ^ dedup ^
- Once I am back from a trip, we really need to chat some time next week about why I keep pushing for a slightly different organization of the metadata.
Note: this is unixfsv1.5. We don't have to be perfect, just good enough.
Given this data, I'm starting to think that we might just want to drop the default metadata field and default to 0644 for files and 0755 for directories. Those numbers show a 0.09% overhead which I can easily live with.
> in linear streaming fashion
Note: the reverse index is just as tricky to compute.
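For reference, the 0.09% figure can be reproduced from the `size_all_blocks` totals posted above. The sketch below assumes the comparison is between the "dumb" encoding and the most compact variant ("olcbean"); which pair was actually compared is not stated in the thread, and `overhead_percent` is a hypothetical helper.

```python
# size_all_blocks values copied from the stats posted earlier in this thread.
sizes = {
    "dumb": 857_579_511,
    "stebalien-first": 856_921_842,
    "stebalien-most": 856_898_687,
    "olcbean": 856_826_832,
}


def overhead_percent(name: str, baseline: str = "olcbean") -> float:
    """Extra bytes of `name` relative to `baseline`, as a percentage."""
    return 100.0 * (sizes[name] - sizes[baseline]) / sizes[baseline]


# Assumption: "dumb" versus the smallest encoding is the pair being compared.
dumb_overhead = overhead_percent("dumb")
extra_bytes = sizes["dumb"] - sizes["olcbean"]
```

Under that assumption the difference is 752,679 bytes out of roughly 857 MB, which rounds to 0.09%.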
> This is mostly because of huge/sharded directories right? Otherwise, seems reasonably straightforward.
I'm not concerned about sharded directories. We'd just store the reverse map in every terminal shard.
My main concern is that I'd like to just stick this metadata inside the directory entries in unixfsv2, right next to the links to the files.
- This is "cleaner" (subjectively).
- It makes it easy to move a directory entry from one directory to another (just copy the node).
- When decoding the directory, we'd need to reverse the reverse index (turn it back into something where we can easily look up metadata by file name and/or offset). At the moment, in all of the proposals for UnixFSv2 that I've seen, we'll be able to define the mapping between Go structs and encoded IPLD objects using a nice IPLD schema. If we use this reverse map, we're going to either:
- Need to introduce a second step and a second intermediate object.
- Turn a "reverse index" into an advanced layout.
One data point from Peergos, though I'm not sure how relevant it is to unixfsv2 given the lack of encryption: we decided to put file metadata (name, modification time, mime type etc.) on the file itself, rather than the directory. This is because of using capability based access control - we want to be able to easily grant someone access to modify a file without them being able to read or modify or even identify sibling files in the same directory. So our directories are just encrypted links to files (and the encrypted name of the directory) which might be under a different IPNS signing key and an encrypted link to any subsequent block in the same directory (large directories).
That's good to know.
In that case, we'd probably have a special "encrypted" file type. Normal files will have a "data" field referring to the actual file data but this file type would have a "file" field pointing to the actual encrypted file.
However, we'd still likely want some unencrypted metadata ("file type", capabilities?, etc.).
We take it even further. Unless you have a read cap to a directory we don't want you to even be able to know how many files are in the directory (we pad and bulk encrypt all the child links together), and we even go so far as to make a directory indistinguishable from a small file at the ipld level (without a read cap). That way you can grant read access to a file, but even with that capability the only thing you can figure out about the parent directory is the name (to get a well defined path).
Ah, interesting. So you encrypt the directory entries themselves. At that point, I think we'll need to change the structures entirely.
unixfsv3 then :-)
UNIXFS.md
Outdated
There are three ways to specify file metadata:

1. The repeated `metadata` field in a directory applies metadata to each file in the directory.
2. An intermediary node with a `Type` of `Metadata` applies metadata to an individual file. While this feature should be supported, it has been deprecated.
I left this in here because we technically support it. However, I'd be happy to remove it if it's not in use. go-ipfs has some internal support but it's hard to actually use this feature, so I doubt anyone is doing so (and it doesn't work properly in many cases).
...and completely unspecified, does `Data` contain the metadata or would it now be the `metadata` field?
(Not expecting that question to be answered - just stating the state of affairs 😉)
`Data` contains the metadata object. Note: this entire thing was completely unspecified until this spec was written...
optional string MimeType = 1;
optional string mimeType = 1;
optional uint32 mode = 2;
optional int64 mtime = 3;
64 bits because:
- We're using varints anyways.
- I'd rather not deal with this again in 2038 even if we've moved on to a new format.
(also, it's signed because the world did not start in 1970)
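To make the two points concrete, here is a small Python sketch of what a signed 32-bit seconds counter can and cannot represent. `from_unix_seconds` is a hypothetical helper written for this illustration; it is not part of any unixfs implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_unix_seconds(seconds: int) -> datetime:
    """Convert signed seconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


INT32_MAX = 2**31 - 1

# Last instant a signed 32-bit counter can hold.
last_int32_moment = from_unix_seconds(INT32_MAX)

# One second later no longer fits in 32 signed bits, but fits easily in int64.
after_rollover = from_unix_seconds(INT32_MAX + 1)

# A negative value addresses a time before 1970, which an unsigned field cannot.
before_epoch = from_unix_seconds(-86_400)
```

The helper adds a `timedelta` to the epoch instead of calling `datetime.fromtimestamp`, since the latter can reject negative values on some platforms.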
> I'd rather not deal with this again in 2038 even if we've moved on to a new format.
❤️ this is some dedication to the project 😂
Agree that the world didn't start in 1970 :) In that case should the default value be something other than 01/01/1970?
@@ -46,10 +46,15 @@ message Data {

optional uint64 hashType = 5;
optional uint64 fanout = 6;

repeated Metadata metadata = 7;
I chose repeated metadata structs like this as it allows us to use one byte to represent every missing piece of metadata. We could alternatively use:
- A map of integers to metadata.
- A bitmap to indicate which files have metadata.
However, both feel like overkill. I'm guessing directories will either have metadata or won't.
Ideally, we'd be able to just embed this metadata in the links section of the parent object but that is not to be...
Just to clarify, does the order in which the metadata appears here map directly to the ordering of the `Links` in the dag-pb node? I.e. `metadata[3]` applies to `Links[3]`.
...and this only applies when `Type` is `Directory`, right? Can this be documented please?
Also, if we have one or more files in a directory with metadata then every file has to have metadata, right, so the order doesn't get mixed up?
Separately, did we consider allowing a single metadata property on the root node for each file so that we can still know metadata when the file is accessed directly? If the reason is in order to not change a file's hash because of meta then maybe mention the reason behind this design decision (perhaps this is not something you typically put in a spec, but, where else is appropriate)?
How does this work for HAMT shards? `Links` in a sharded dag-pb node don't map directly to files and, secondly, listing perms for every file in a sharded directory would make a directory node infeasibly large.
It's very likely metadata will be used for HAMT shards so any solution we come up with here should work with them. Can we please document the solution?
> Also, if we have one or more files in a directory with metadata then every file has to have metadata right so the order doesn't get mixed up?
Yes. I've expanded the spec a bit.
> Separately, did we consider allowing a single metadata property on the root node for each file so that we can still know metadata when the file is accessed directly? If the reason is in order to not change a file's hash because of meta then maybe mention the reason behind this design decision (perhaps this is not something you typically put in a spec, but, where else is appropriate)?
It was considered and dropped for the reason you gave. I'll document it.
> How does this work for HAMT shards? Links in a sharded dag-pb node don't map directly to files and secondly listing perms for every file in a sharded directory would make a directory node infeasibly large.
> It's very likely metadata will be used for HAMT shards so any solution we come up with here should work with them. Can we please document the solution?
Should be fixed.
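A minimal sketch of the positional rule confirmed above, where `metadata[i]` describes `Links[i]` and a directory carries either one entry per link or none. It is written in Python for brevity, ignores sharded (HAMT) directories entirely, and the `Metadata` tuple and `pair_links_with_metadata` function are hypothetical names, not part of the spec.

```python
from typing import NamedTuple, Optional


class Metadata(NamedTuple):
    # Both fields are optional, mirroring the optional protobuf fields.
    mode: Optional[int] = None
    mtime: Optional[int] = None


def pair_links_with_metadata(links, metadata):
    """Return (link_name, Metadata) pairs, matching entries by position."""
    if not metadata:
        # No metadata recorded for this directory at all.
        return [(name, None) for name in links]
    if len(metadata) != len(links):
        raise ValueError("metadata must have exactly one entry per link")
    return list(zip(links, metadata))


links = ["README", "run.sh"]
# An empty struct is a placeholder meaning "nothing recorded for this entry".
metadata = [Metadata(), Metadata(mode=0o755)]
pairs = pair_links_with_metadata(links, metadata)
```

The length check is what keeps the order from getting mixed up: a file without metadata still occupies its slot with an empty struct.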
please no int maps
@rvagg could you expand on that? ( I am clearly missing prior discussion of a downside... )
This probably isn't the place to litigate this, it always ends in too much text being spilled, but int maps are something we're trying to avoid for IPLD. They're just not straightforward across codecs and programming languages. https://github.com/ipld/specs/blob/master/data-model-layer/data-model.md#motivation
In general this sounds like it's going in the right direction. My only real concern is HAMT shards - see comment in review for more info.
UNIXFS.md
Outdated
1. The repeated `metadata` field in a directory applies metadata to each file in the directory.
2. An intermediary node with a `Type` of `Metadata` applies metadata to an individual file. While this feature should be supported, it has been deprecated.
3. An optional `defaultMetadata` field to specify the _default_ metadata for files in the directory. If unspecified, the default mode is 0755 and the default modification time is the epoch.
* Files: The default metadata is applied as-is.
What is the default `mimeType` string? `""`? No default (field is optional anyway)?
"unspecified"/null? Good question.
Note: We've also discussed just dropping support for mime types this way. The original idea was to do this so that the gateway could use the metadata but it's quite a bit simpler to just use a `.ipfsattr` file.
I think it'll be useful to know in the directory listing which links are directories, which are symlinks and which are regular files. For example, WebUI currently has to stat every file in a directory listing to determine this and display the correct icon.
Yeah, please consider storing the entire UNIX 32bit st_mode structure. Or alternatively just go with https://golang.org/pkg/os/#FileMode
I agree. Including the actual file type is definitely a good idea.
For those following along - this is what including the type ( i.e. full st_mode/mode_t ) allows. This is the same CID as the one used earlier in #223 (comment). Observe how a user now has clear visual cues what is a directory and thus explorable. All can be displayed, backwards-compatibly, without downloading any extra blocks.
Note: temporary hacked together gateway, will not persist: http://5.9.79.235:2020/ipfs/bafybeif7ceybwlonoct3bnjtnnfookm2iezslo2ypwrsae6ilkc2ucvsfm/linux-5.3.7/
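As an illustration of why the full `st_mode` value is enough to pick an icon, the sketch below uses Python's standard `stat` module to classify entries. The `describe` helper and the sample mode values are made up for this example and are not taken from the dataset above.

```python
import stat


def describe(st_mode: int) -> str:
    """Classify a directory entry from its full st_mode value."""
    if stat.S_ISDIR(st_mode):
        return "directory"
    if stat.S_ISLNK(st_mode):
        return "symlink"
    if stat.S_ISREG(st_mode):
        return "file"
    return "other"


# Illustrative values: file-type bits combined with permission bits.
dir_mode = stat.S_IFDIR | 0o755
link_mode = stat.S_IFLNK | 0o777
file_mode = stat.S_IFREG | 0o644

kinds = [describe(m) for m in (dir_mode, link_mode, file_mode)]
# `ls -l` style strings, e.g. for a gateway directory listing.
listing = [stat.filemode(m) for m in (dir_mode, link_mode, file_mode)]
```

With only the permission bits stored (no type bits), `describe` could not tell these three apart, which is the case where a client has to fetch each entry to find out.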
UNIXFS.md
Outdated
1. The repeated `metadata` field in a directory applies metadata to each file in the directory.
2. An intermediary node with a `Type` of `Metadata` applies metadata to an individual file. While this feature should be supported, it has been deprecated.
3. An optional `defaultMetadata` field to specify the _default_ metadata for files in the directory. If unspecified, the default mode is 0755 and the default modification time is the epoch.
Suggestion: Instead of `defaultMetadata`, what about by convention if you specify only 1 `metadata` property it applies to all directory contents?
UNIXFS.md
Outdated
Fields:

* `mimeType` -- The mime-type of the file. This generally shouldn't be used.
> This generally shouldn't be used.
Please can we document why? We wouldn't have to do sniffing in many cases if it's explicit.
2 minor clarifications you might want to add but assuming @achingbrain is ok with the implementation for HAMT this SGTM 🚢🚢🚢
For some context, this solution was chosen over embedding the metadata in the file itself because:

1. It allows accessing the metadata without downloading the target files.
2. More importantly, it avoids changing the root hash of the target files to allow for better deduplication.
Should we mention the tradeoff? If I have `QmDir` and `file.txt` (`QmFileTxt`) then I'll only be able to see metadata for the file if it's accessed via `/ipfs/QmDir/file.txt`. The file will have no metadata when accessed like `/ipfs/QmFileTxt`.
@alanshaw from #220 (comment) :
As far as I can tell, for the original set of users waiting on this, there is no usecase where a piece of data needs POSIX attributes without a filename. The filename is already only available as a directory wrapper node, there is no other way to attach name to an "anonymous stream of data". What makes other attributes different enough to warrant a non-directory anonymous wrapper-node?
I.e. what is the use-case of needing to know mtime/nodetype/permissions of an object without knowing its name?
I don't think metadata is tied to file names - you could be copying `/ipfs/QmHashOfSomeBinaryFile` from ipfs-at-large to a location in your `mfs`, at which point it would receive a filename, but if it was executable, you'd want to retain that information.
In this design, the metadata is stored in the parent directory. That doesn't mean that it's tied to the file name.
It does mean that simplifying a path from `/ipfs/QmDir/file.txt` to `/ipfs/QmFile` will drop the metadata (along with the filename).
This doesn't seem like the right way to do it to me. If I copy a file to another location on my filesystem, it doesn't lose its permissions, nor will it lose its permissions if I remove it from its containing directory (to put it in a tarball, for example).
It will involve changing the `CID` of a file if its metadata changes but that seems like a reasonable tradeoff to me, since the `CID`s of the underlying file chunks will not change (assuming the file size is > 256KB). We'll probably see the most `CID` churn in a user's `mfs`, but we generally don't care about `CID`s in `mfs` as much as paths.
A counter thought: why should metadata that changes independently from the data (e.g. permissions) be anywhere in the directory tree, couldn't it live outside the tree?
> If I copy a file to another location on my filesystem, it doesn't lose its permissions
True, but when I email you a file or upload it to Dropbox it does lose its permissions. Similarly, using unix time as a metadata attribute makes very little sense in a distributed context. For example, if Alice and Bob both download an Ubuntu ISO and add it to IPFS, the timestamps will be different.
> It will involve changing the CID of a file if it's metadata changes but that seems like a reasonable tradeoff to me, since the CIDs of the underlying file chunks will not change
If I recall correctly, the way IPFS currently works makes this a pretty big performance hit. For example, say Alice and Bob both have the same files, but with different metadata. Then if Charlie tries to download Alice's CID while Alice is offline then he gets no data. Similarly, even if Alice is online Charlie will end up downloading the data from Alice but not Bob since his node probably won't make another DHT request for the file CID (which would be expensive anyway).
There are alternative solutions for people that want to share metadata that do not involve putting OS metadata near files for everyone else, which I think are worth discussing. However, if unixfsv1.5 is just a temporary measure to tide us over until unixfsv2 lands then as long as we only use the metadata we actually need (i.e. mtime + execute bit) then this is fine by me.
> when I email you a file or upload it to Dropbox it does lose its permissions.
Those aren't filesystem operations so I would not expect them to retain permissions. OK, tarballing something up isn't specifically a filesystem operation either, but it is sensitive to its environment so does preserve perms. I think we should endeavour to do the same thing.
> Similarly, using unix time as a metadata attribute makes very little sense in a distributed context.
I agree, distributed time is a bit weird. That said, the file chunks of the ISO will be the same; just the root node containing the metadata will be different. Also, Alice and Bob could use different chunking strategies, or whatever, so it's already easy to have two CIDs that point to the same file.
> the way IPFS currently works
This doesn't seem like a good thing to base spec decisions on. Changing how IPFS works is just another spec decision away, after all.
> Those aren't filesystem operations so I would not expect them to retain permissions.
I think that people having mounted Dropbox folders on their machine almost exactly mimics the behavior that users would expect from IPFS.
> could use different chunking strategies, or whatever so it's already easy to have two CIDs that point to the same file.
That's true, but A) if they use the chunking defaults there's no problem, and B) there are possible solutions for dealing with multiple chunkers, since fundamentally the two files can have a canonical CID (e.g. sha256 of the entire file without any unixfs intermediate blocks). However, these solutions cannot properly work if the actual content is different (which it would be, since metadata contains information).
> Changing how IPFS works is just another spec decision away, after all
Unless there's a proposal for how to fix the problem I'm wary of "we'll fix it later". Also, my understanding is that unixfsv1.5 is supposed to be a temporary fix and so the probability of "later" coming before unixfsv1.5 is deprecated is even lower.
> This doesn't seem like the right way to do it to me. If I copy a file to another location on my filesystem, it doesn't lose its permissions, nor will it lose its permissions if I remove it from its containing directory (to put it in a tarball, for example).
@achingbrain re-reading the thread I realized this is also a flawed model. In general it is the tool that does the copying that preserves various bits of metadata, and moreover only does so if the context allows it to. Consider:
rabbit@Ahasver:/dev/shm/stuff$ ls -l
total 0
rabbit@Ahasver:/dev/shm/stuff$ touch source
rabbit@Ahasver:/dev/shm/stuff$ chmod 666 source
rabbit@Ahasver:/dev/shm/stuff$ umask
0022
rabbit@Ahasver:/dev/shm/stuff$ cp source dest
rabbit@Ahasver:/dev/shm/stuff$ ls -l
total 0
-rw-r--r-- 1 rabbit rabbit 0 Nov 13 10:29 dest
-rw-rw-rw- 1 rabbit rabbit 0 Nov 13 10:29 source
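The modes in the listing above can be reproduced with a little arithmetic. The sketch below is in Python, models only the permission bits, and does not touch the filesystem; the variable names are chosen for this illustration.

```python
import stat

source_mode = 0o666   # after `chmod 666 source`
umask = 0o022         # as printed by `umask` in the transcript

# A plain `cp` (without -p) creates the destination with the source's
# permission bits filtered through the process umask.
dest_mode = source_mode & ~umask

# Render both the way `ls -l` would for a regular file.
source_listing = stat.filemode(stat.S_IFREG | source_mode)
dest_listing = stat.filemode(stat.S_IFREG | dest_mode)
```

The group and other write bits are cleared by the umask, which is why `dest` shows `-rw-r--r--` while `source` keeps `-rw-rw-rw-`.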
The default behaviour is to create the new file with the same permissions; `umask` changes the defaults.
If we access a file via its CID we won't have the option to preserve permissions as they are simply not there.
UNIXFS.md
Outdated
|
||
The `defaultMetadata` field can be used in conjunction with the repeated `metadata` field to specify metadata "defaults". | ||
|
||
1. The first `defaultMetadata` field encountered when traversing a sharded directory takes precedence. |
Somewhat related point: does `defaultMetadata` apply to just the directory it is defined in, or can I specify it in a parent directory and have a parent/sub or parent/sub/sub inherit it? I assume no, but maybe worth clarifying?
No. But I'd like to revisit #223 (comment) and consider dropping it before moving on.
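For readers following along, here is one way the `defaultMetadata` lookup discussed in this thread could be sketched. This is an illustration, not the spec: the `Metadata` class, the `resolve` helper, and the per-field fallback are assumptions, and the final fallback values (0755, epoch) are taken from the draft text under review. Nothing is inherited from parent directories, matching the answer above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    mode: Optional[int] = None
    mtime: Optional[int] = None


# Fallback from the draft text: mode 0755 and mtime at the epoch when unspecified.
SPEC_DEFAULT = Metadata(mode=0o755, mtime=0)


def resolve(entry: Optional[Metadata], directory_default: Optional[Metadata]) -> Metadata:
    """Per-field fallback (an assumption): entry metadata, then the
    containing directory's defaultMetadata, then the spec default."""
    layers = [m for m in (entry, directory_default, SPEC_DEFAULT) if m is not None]

    def first(field: str) -> int:
        # SPEC_DEFAULT sets every field, so a value is always found.
        return next(getattr(m, field) for m in layers if getattr(m, field) is not None)

    return Metadata(mode=first("mode"), mtime=first("mtime"))
```

For example, `resolve(Metadata(mode=0o600), Metadata(mode=0o644, mtime=100))` keeps the entry's own mode and takes `mtime` from the directory default.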
For some context, this solution was chosen over embedding the metadata in the file itself because: | ||
|
||
1. It allows accessing the metadata without downloading the target files. | ||
2. More importantly, it avoids changing the root hash of the target files to allow for better deduplication. |
I don't think metadata is tied to file names - you could be copying `/ipfs/QmHashOfSomeBinaryFile` from `ipfs-at-large` to a location in your `mfs`, at which point it would receive a filename, but if it was executable, you'd want to retain that information.
|
||
For some context, this solution was chosen over embedding the metadata in the file itself because: | ||
|
||
1. It allows accessing the metadata without downloading the target files. |
When we do a directory listing, we have to fetch at least the root node of every directory entry (or traverse a HAMT) in order to say whether it's a file entry or a sub directory entry so we already have a mechanism for accessing metadata without having to materialise the whole file, so I'm not sure what advantage is gained here.
- You're right.
- We could consider additionally storing the file type but I'm worried that will be even more complicated.
I'm not sure storing the file type in the directory is a good idea since we'd lose that information if we referenced the DAGNode directly (e.g. via `/ipfs/Qmfoo`) - I think it's fine as it is.
c/o @warpfork, quite a good discussion about metadata+files that's relevant here I think, feeding into unixfsv2 design decisions https://hackmd.io/4lqtycvdQN2WTspBLpy3qw
For some context, this solution was chosen over embedding the metadata in the file itself because: | ||
|
||
1. It allows accessing the metadata without downloading the target files. | ||
2. More importantly, it avoids changing the root hash of the target files to allow for better deduplication. |
Since the file content is by far larger than the UnixFS data describing the file and is not affected by changing the metadata, is this a de-duplication win that's worth only having metadata available if a file/dir is accessed via a directory?
I guess it depends on whether the file fits into a single chunk or not?
Yes, but I don't know if that's a bad thing or not. My assumption is that most non-trivial uses of IPFS will see files that are more than one block in size.
Most files are actually pretty small and 256KiB (the default block size) is pretty large.
One extra byte per file in a directory isn't worth the complexity.
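To make the single-chunk vs. multi-chunk point concrete, here is a toy sketch. It is deliberately NOT the real UnixFS/dag-pb encoding: blocks are plain SHA-256 hashes, and `toy_root` / `toy_single_block` are hypothetical helpers. It only shows which hashes change when metadata lives in the file's root node.

```python
import hashlib


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def toy_root(chunks, metadata: bytes = b""):
    """Toy multi-chunk file: the root holds metadata plus the leaf hashes."""
    leaves = [h(c) for c in chunks]
    root = h(metadata + "".join(leaves).encode())
    return root, leaves


def toy_single_block(content: bytes, metadata: bytes = b"") -> str:
    """Toy single-block file: metadata and content share the one block."""
    return h(metadata + content)


chunks = [b"a" * 10, b"b" * 10]
root_a, leaves_a = toy_root(chunks, b"mode=0644")
root_b, leaves_b = toy_root(chunks, b"mode=0755")
print(root_a != root_b, leaves_a == leaves_b)  # True True
```

In this model a multi-chunk file with different metadata still shares all of its leaf blocks and only the small root differs, whereas a file that fits in one block shares nothing once its metadata differs.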
I just re-read this and it looks to me like a good enough incremental improvement until unixfs2 lands. Shall we ship it 🚢?
optional string MimeType = 1; | ||
optional string mimeType = 1; | ||
optional uint32 mode = 2; | ||
optional int64 mtime = 3; |
Agree that the world didn't start in 1970 :) In that case should the default value be something other than 01/01/1970?
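As a small illustration of why a signed `int64` matters here: assuming `mtime` counts seconds relative to the Unix epoch (the unit is an assumption, the hunk above only gives the type), negative values address dates before 1970. The helper name `mtime_to_datetime` is made up for this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def mtime_to_datetime(mtime: int) -> datetime:
    """Interpret a signed mtime as seconds relative to the Unix epoch."""
    return EPOCH + timedelta(seconds=mtime)


print(mtime_to_datetime(-86400).isoformat())  # 1969-12-31T00:00:00+00:00
```

A default of `0` would therefore still mean 01/01/1970; the signed type only widens the range of representable times.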
{} | ||
}, | ||
Links: [ | ||
{Name: "foo", ...}, // mode: 0644 (default) |
Above we say that the mode defaults to 0755 if unspecified, but here we say `mode: 0644 (default)`.
The apparent lack of alignment on what the IPFS project is "locking" itself into by this spec makes me uneasy. The bottom line is that anything that ships here is going to be supported forever ( n.b. an uncomfortably long time), just like Given the above we are not really discussing a "stop-gap" solution, we are discussing another fork of the ipfs web gateway and cli tools. Distinct from I wonder if we can put together a 2d matrix of sorts ( in the wiki? ). On the P.S. I would have put this matrix together myself, if my free time wasn't vapourized by a set of unrelated events all happening at the same time :/ |
I may be in the minority here, but why not? Gateways are just one way of viewing data and UnixFS is just one of the multitude of IPLD formats people can use to store their data. Would it be so bad if even immediately after UnixFS v1.5 the gateways still didn't show any ACL metadata? @ribasushi Would having a document/proposal for how IPFS plans to deal with the transitions of updating UnixFS help with your concerns? I am certainly wary of giving this spec first party support indefinitely so hearing about how the transition is going to work would ease my concerns. |
The answer is naturally "no". What would be bad is if they start showing it, and then stop doing so at some point in the future. Same for various
Absolutely! Now that you verbalized this - it seems obvious that most ( all? ) current and future directly user-visible specs need to include an "Obsolescence statement" akin to that of the Raspberry Pi foundation. This would allow a project like e.g. Peergos to make informed decisions regarding which specs they can follow and which ones they would need to re-implement in-house to de-risk their product offerings ( which may have very different content half-lives compared to the typical user of a certain spec ). |
|
||
UnixFS currently supports three metadata fields: | ||
|
||
* `mimeType` -- The mime-type of the file. This generally shouldn't be used. |
perhaps this "shouldn't" could be explained? is it deprecated? does it have a special case use?
I'm also very confused by this. I didn't know it exists in any existing data; it won't exist in Unixfsv2; and I thought we'd had multiple roundabouts in github issues in which multiple people highlighted the issue that mimetypes are not generally considered a property of a filesystem (rather, they tend to show up in applications, servers, and other things that are routing control flow in some way) and that they shouldn't thus be in unixfsv{n}.
This type was already reserved by the existing metadata feature.
Is there any reason not to leave it there, in the deprecated state that it becomes there?
If it's deprecated, it can still be used -- but "shouldn't" be. Which is what this text says anyway. If it's deprecated and we don't want it to be used, we should prefer to draw proportionally less attention to it, and leaving it in the deprecated notes should be sufficient for that.
So, we just had yet another discussion and have decided to put the metadata in files. TL;DR:
@achingbrain will write up a more thorough decision† along with a new spec. †Decision for everyone at the meeting. Nothing is final till it's in a spec on GitHub, merged, and implemented. But we need to move forward on this ASAP. |
I didn't include the pros and cons of the other options because I don't think they should be included in the spec. That is, if I'm implementing this, I read the spec to understand what I should implement and how it should behave, I don't want to read a bunch of exposition on how we got to where we are. The decision log is important though, should it go in `ipfs/notes`? Or maybe we could have accompanying `*-notes.md` files? Follows on from/supersedes #223
Superseded by #226
This deviates slightly from what was discussed but should be enough to get us started. We can tweak as necessary as we go along.