Requirements #1
Comments
also cc @diasdavid @dignifiedquire @Kubuxu |
also cc @pgte |
Need:
Maybe:
Question: Do we want to duplicate the file type? That is, store it in the file and in the directory or just in the directory. |
I would say it would be nice to preserve all the information we can pull from the filesystem. My guess is that basing this on the tar format (as in, what it stores) would be a good start. Storing extended attributes would be very nice too. It would allow for storage of, for example, ACLs. They could be a special case of generic metadata. |
I think there should be a base mode of maximal deduplication, as well as a
"preserve everything" mode.
|
@whyrusleeping actually, we can get the best of both worlds by storing this metadata in the directories but not the files. If we really need to attach the metadata to the files themselves, we could add a special metadata node that adds metadata to a file (although this would only be used when linking to files directly which probably won't be that common). |
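To illustrate the point about keeping metadata in directories: a minimal Python sketch, assuming a hypothetical node layout and using a SHA-256 of canonical JSON as a stand-in for a real CID. None of the field names below come from an actual unixfs spec. It shows that two directories can record different mtimes for the same file while still linking to one shared, deduplicated file block.

```python
import hashlib
import json


def block_id(obj) -> str:
    """Hypothetical stand-in for a CID: SHA-256 of canonical JSON."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def file_node(content: bytes) -> dict:
    # File nodes carry only content, so identical bytes always hash the same.
    return {"data": content.hex()}


def dir_node(entries: dict) -> dict:
    # Each entry holds the link plus per-entry metadata (here just mtime).
    return {"entries": entries}


readme_id = block_id(file_node(b"hello"))

# Same file, imported at two different times into two directories.
dir_a = dir_node({"README": {"link": readme_id, "mtime": 1505730900}})
dir_b = dir_node({"README": {"link": readme_id, "mtime": 1600000000}})

same_file = dir_a["entries"]["README"]["link"] == dir_b["entries"]["README"]["link"]
same_dir = block_id(dir_a) == block_id(dir_b)
```

Here `same_file` is true and `same_dir` is false: the metadata changes only the directory blocks, never the file block.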
I think it might also be useful to start producing directory-wrapped files by default. It would be an even more breaking change, but it would clear up more confusion and preserve more metadata. |
Also needed is support for X-Attrs |
@whyrusleeping and anyone, I can create a draft proposal to get things moving. However, I am a little confused on the format the spec should be in. Since there is
but the ipld spec does not spell out how this will be serialized in binary formats, as far as I can tell. The translation to cbor is pretty straightforward, but I don't like the idea of including string hash keys in what should be a compact binary format; I would rather assign integers to properties such as "/", "mtime", "size", "isdir", "isexec". If we want to serialize to a protobuf, we have to assign integer values to the properties anyway or have a very inefficient protobuf encoding. Notes on the design choices of the spec above: (1) I believe very strongly that some sort of timestamp should be included by default. (2) There are multiple ways to define the "size" of a directory. The current unixfs defines this as the size of the directory and its contents; real unix filesystems define the size of a directory as the size of the directory entry, but not its contents. I think it is better just to leave it out. |
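To make the size argument about string keys versus integer keys concrete, here is a rough Python sketch. It hand-encodes a tiny subset of CBOR (booleans, unsigned integers, text strings, maps) following RFC 8949 so that it needs no third-party library. The integer assigned to each property in `KEY_IDS` and the `QmExampleHash` link value are purely hypothetical; the thread does not fix any numbering.

```python
import struct


def _head(major: int, n: int) -> bytes:
    """Encode a CBOR head: major type plus unsigned argument (RFC 8949)."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 2**8:
        return bytes([(major << 5) | 24, n])
    if n < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def encode(value) -> bytes:
    """Tiny CBOR subset: bools, unsigned ints, text strings, maps."""
    if isinstance(value, bool):  # must come before the int check
        return b"\xf5" if value else b"\xf4"
    if isinstance(value, int) and value >= 0:
        return _head(0, value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, dict):
        out = _head(5, len(value))
        for key, item in value.items():
            out += encode(key) + encode(item)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical key numbering, for illustration only.
KEY_IDS = {"/": 0, "mtime": 1, "size": 2, "isdir": 3, "isexec": 4}

# "QmExampleHash" is a placeholder, not a real hash.
entry = {"/": "QmExampleHash", "mtime": 1505730900, "isdir": False, "isexec": True}
compact = {KEY_IDS[name]: value for name, value in entry.items()}

string_keyed = encode(entry)
int_keyed = encode(compact)
saved_bytes = len(string_keyed) - len(int_keyed)
```

For this one entry the integer-keyed form is 17 bytes shorter, and the saving repeats for every entry in a directory. The trade-off is that the integer-keyed form is no longer self-describing without the key table.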
As for what to include, the tar spec might be a good start: https://en.wikipedia.org/wiki/Tar_%28computing%29 Although there are some fields, such as the owner and user ID, that are probably not generally useful and should not be included by default. There is also the question of how Unix-like we want this to be; for example, the "file mode" is very Unix-specific. |
@kevina, a few notes:
We may be able to use xattrs for this, but it would be nice to support linking to "related" files in IPFS. This way, we can write importers that can, e.g., parse a markdown/HTML file, find all linked IPFS content, and link to it from the metadata (so it gets pinned along with the file). |
CARs are (to be specced) content addressable archives. That is, archives of IPLD DAGs. The primary motivation is to be able to dump an IPLD DAG onto a hard drive, ship it somewhere, and then import it back into IPFS. However, I believe we can use them to bridge IPLD and IPFS. That is, when I call The power here is not the ability to generate CARs on the fly (although that's really convenient), it's the ability to map structured linked data into unixfs without losing its structure. For now, most tools will just see it as a byte stream (a CAR). However, we can give it some extra metadata marking it as an IPLD DAG so tools that understand IPLD can operate on it that way. This may also be useful for access control (ACLs on files lead to ACLs on structured data) but I haven't put much thought into that yet. |
What we have to take into account is not only the format for unixfs itself, but also that we have to create arbitrarily sized, seekable byte arrays on IPLD. |
IPLD doesn't have built-in sharding by design (so it can't really have arbitrary sized byte arrays). We decided to punt on that and build a sharded DAG system on top of it later (leave IPLD objects as "atoms"). However, this is a good opportunity to tackle that. The idea was to abstract the sharding logic from IPFS into a middle layer between IPFS and IPLD. |
@whyrusleeping @Stebalien and others, can I have a concrete example of what you envision the ipld unixfs looking like? I have not been following the IPLD development closely, and right now the spec seems more like a collection of notes than a formal spec. |
Writing up an IPLD spec is one of my tasks this quarter (right behind putting out fires on the gateways).
Concrete Problems
There are two concrete issues with the current system: poor abstractions and the DagProtobuf IPLD format.
Abstractions
We've implemented sharding directly in unixfs. This means that other applications can't take advantage of this work to shard up their own structured data.
DagProtobuf
This IPLD format is cumbersome, rigid, and not self describing.
Goals
Personally, I'd also like unixfs to interoperate with structured data better than other filesystems. That's why I want the CAR support. |
This is exactly the "mission statement" of the various tar formats: we will try to preserve as much as we can, uid/gid included, but we might not be able to actually recreate it on the client side. See also my thoughts in ipfs/kubo#4292 (comment) |
/cc @mguentner as once upon a time he expressed the same frustrations as mine: ipfs/notes#60 (comment) |
This is probably a minority opinion but I think that storing permissions, timestamps, and extended attributes is archaic and backwards. I say make a simple standard for simple file-system DAGs, and then a superset unixfs standard for storing cruft metadata. Unix is only as relevant as computer science is lazy and unimaginative. UIDs, GIDs, and permissions are features of administrated file-systems, not distributed ones. The executable bit is just arbitrary metadata that can be replaced by conventions in where executable data is stored. If the exec bit is present, what would a proper #! line look like? The more Unix stuff that is explicitly supported the more there will be odd corner cases to fumble through. |
My understanding is that this is exactly what this issue is trying to define. IPLD on its own is well defined and already has apps running on it... |
I'm just trying to argue for a simple file and directory standard that works across different operating systems that addresses things like small file packing but defers the Unix stuff. |
@ehmry I agree that storing permissions and extended attributes is not useful for other operating systems but storing the last modified time is useful. The zip file format also stores timestamps: https://en.wikipedia.org/wiki/Zip_(file_format). |
I really disagree that last modified time is useful in this context. @kevina what is the use case for that? |
@whyrusleeping I don't have a "use case" I just consider it a good best practice. Many, many, many times when trying to figure out what a file is several years later the timestamp can give important clues to doing so. I would really like to see this information preserved by default and not as part of some extended metadata package that can easily be discarded and lost forever. Almost every filesystem and every archive format has some sort of timestamp on files and I consider it basic information. I was quite surprised that the current unixfs format does not have it. |
Personally, I'd like to specify a minimum format and a way of storing metadata. Then, we can say that specific metadata fields mean specific things. Minimal implementations can ignore metadata they don't care about. This also means that, for the purposes of archiving, we can record and extract anything.
I kind of agree as well. Personally, I see IPFS data as kind of "timeless". However, I have also found modification times to be useful... Honestly, I don't know the right solution. I'd say at least support them, even if "support" just means "stick them in some extended attribute field if requested" (e.g., by a tar-like program). |
Other idea: Pinning. This came up elsewhere but we've discussed replacing the current pinning system with mfs. That is, you wouldn't "pin" files, you'd link them into your own personal filesystem. This way, all pinned files are named and applications can manage their own named pin sets by managing their own application-specific sub-directories (each app would get a data folder). However, we'd probably want a way to specify what should be pinned. There are many cases where one would want to pin certain parts of one's filesystem but not all of it (it can be useful to link to data without necessarily persisting it). Even beyond pinning, it would be useful to be able to include hints about what to download first/what data is "important" in IPFS. Therefore, I think it may be useful to allow one to attach IPLD selectors to directory entries describing the relative importance of the data within it (where pin-selectors attached to higher level directories can override selectors attached to lower-level directories). For now, we probably don't have to tackle this (we don't even have IPLD selectors yet). However, we may want to keep it in mind. This is yet another piece of metadata we'll likely want to be able to attach to files. |
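One possible reading of the override rule above, as a rough Python sketch. Since IPLD selectors do not exist yet, plain string hints ("pin", "skip") stand in for them, and the rule "the highest-level directory that sets a hint wins" is an assumption about what "override" would mean. The function name and the shape of the hint table are hypothetical.

```python
from typing import Dict, Optional


def effective_pin_hint(hints: Dict[str, str], path: str) -> Optional[str]:
    """Return the hint set by the highest-level ancestor of `path`, if any.

    `hints` maps absolute directory paths to a hint string. Ancestors are
    checked from the root downwards, so a hint higher in the tree overrides
    hints attached further down.
    """
    parts = [p for p in path.split("/") if p]
    ancestors = ["/"] + [
        "/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)
    ]
    for ancestor in ancestors:  # root first
        if ancestor in hints:
            return hints[ancestor]
    return None


# Hypothetical hint table for a personal filesystem.
hints = {"/apps": "pin", "/apps/cache": "skip", "/media/raw": "skip"}

overridden = effective_pin_hint(hints, "/apps/cache/tmp.bin")
local_only = effective_pin_hint(hints, "/media/raw/clip.mov")
unset = effective_pin_hint(hints, "/docs/notes.txt")
```

With this table, the hint on `/apps` takes precedence over the one on `/apps/cache`, while `/media/raw` keeps its own hint because nothing above it sets one.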
And this is exactly what I don't want to happen, because it can easily be disregarded. If I understand how we want this to be implemented, extended attributes may be in a separate IPFS block from the core data (to, say, aid in deduplication). And even if they are not, they would be lumped in with all the other extended attributes, which could easily be stripped. The motivation for stripping them could be to save space, or it could be privacy concerns if the metadata includes other information such as the Unix username. For an example of how metadata can get lost, look at pictures shared on the net. JPEG has all sorts of metadata as part of the format, but this information is often intentionally stripped due to privacy concerns. And even when it is not, all the metadata is lost anyway when we convert the image to various other formats or when it is shared by taking a screenshot. For archival purposes, when something was created is the single most useful piece of context. For example, if someone wrote a note to themselves (but didn't include the date in the text) and then looks back on it several years later, the date can help determine exactly what that note was referring to. Since the creation date is not really stored by most filesystems, we have to go with the next best thing, the last modified date. Even with that, the combined modification times of all the files in a directory could be useful in determining when the files were created. For example, if the backup copy of a file was modified 4 years before the current copy and all the other files in the directory were modified 3-4 years ago, that gives a pretty good idea that most of the files were likely created 3-4 years ago. |
That's exactly why I want to do it this way. I'd like to keep the core of IPFS simple (ish). Now, I wouldn't just stick it in an arbitrary metadata field, we'd have an agreed upon metadata field for storing modification dates (and possibly others for, e.g., linking to past versions of the file). However, I'd like to support simple clients that simply treat all metadata as "extra stuff" to be preserved and copied but not necessarily understood. That way, the protocol is extensible but simple at its core.
That's another reason I'd like extensible metadata fields.
Honestly, I'm not sure what's the best way to do this. My current thinking is that metadata would usually be inlined into the directory except when linking to a file directly (in which case, it would be broken out into a separate block). Usually, I'd discourage linking to individual files directly. However, this does introduce some complexity. |
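The "simple core, extensible metadata" idea can be sketched in a few lines of Python. The entry shape (`link` plus a `meta` map) and the field names `mtime` and `x-unknown` are assumptions made for illustration, not part of any agreed format. A minimal client carries metadata along without interpreting it, while a richer client reads the agreed-upon field when it is present.

```python
from typing import Any, Dict, Optional

Entry = Dict[str, Any]  # hypothetical shape: {"link": ..., "meta": {...}}


def rename(directory: Dict[str, Entry], old: str, new: str) -> Dict[str, Entry]:
    """Minimal client: move an entry and carry its metadata along untouched."""
    updated = dict(directory)  # leave the caller's directory unmodified
    updated[new] = updated.pop(old)
    return updated


def mtime_of(entry: Entry) -> Optional[int]:
    """Richer client: interpret the agreed-upon 'mtime' field if present."""
    return entry.get("meta", {}).get("mtime")


directory = {
    "a.txt": {
        "link": "cid-1",  # placeholder link value
        "meta": {"mtime": 1505730900, "x-unknown": "kept"},
    }
}

renamed = rename(directory, "a.txt", "b.txt")
```

The rename never looks inside `meta`, so fields it does not understand survive the operation.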
I am a visual kind of person, so I put together this graph, want to make sure I understood what both @kevina and @Stebalien are proposing. This assumes 3 directories in each case, each containing a single entry, pointing to a file that is small enough to fit in a single block. What did I get right, and what did I get wrong? |
@mib-kd743naq could you also post your dot source? |
@mib-kd743naq that is not really what I am proposing. I am proposing that the mtime be just another field in the directory entry, just like the file name, the hash, the execution bit, etc. |
Timestamps are probably most efficiently stored as data derived from journal. If a major file-system is backed by IPLD then append a timestamp and the top-level CID to a journal every few minutes. If you want a revision history then work your way backwards through the journal. If that is too slow then you should be using a native unix file-system as scratch-space anyway. |
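A rough Python sketch of the journal idea described above, assuming an in-memory list and placeholder CID strings; the class and method names are hypothetical. Each record pairs a timestamp with the top-level CID, and a lookup walks backwards through the journal to find the root that was current at a given time.

```python
from typing import List, Optional, Tuple


class Journal:
    """Append-only log of (timestamp, root CID) pairs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, str]] = []

    def append(self, timestamp: int, root_cid: str) -> None:
        # Keep the log ordered so a backwards scan finds the latest match.
        if self._entries and timestamp < self._entries[-1][0]:
            raise ValueError("journal entries must be appended in time order")
        self._entries.append((timestamp, root_cid))

    def root_at(self, when: int) -> Optional[str]:
        """Walk backwards to the newest root recorded at or before `when`."""
        for timestamp, root_cid in reversed(self._entries):
            if timestamp <= when:
                return root_cid
        return None


journal = Journal()
journal.append(1000, "root-cid-a")  # placeholder CIDs
journal.append(1300, "root-cid-b")
journal.append(1600, "root-cid-c")

current_at_1450 = journal.root_at(1450)
before_first_entry = journal.root_at(999)
```

Per-file timestamps are then derived rather than stored: a file changed somewhere between the last journal entry whose root contains the old version and the first whose root contains the new one.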
Everyone, there is now an initial spec to comment on in #2. |
Another thing it would be nice to support: hints about where to find files (e.g., a URL, a bittorrent tracker, etc.). This would probably just go in the metadata section but we should keep it in mind. |
I am a little confused by this. Doesn't this go against what IPFS is about? That is, given a CID, it should be IPFS's job to find it, wherever it is. |
At the end of the day, it is. However, IMO, hints are fair game. Giving IPFS hints at where to find it could make IPFS much faster. |
Moving these comments here from #2: I think I'd like to narrow down the scope of this. Metadata is a pretty I propose for now we look only into migrating unixfs from Moving unixfs to dag-cbor is an We should concentrate on making resolution and traversal work nicely.
What I'm trying to say is: let's narrow the requirements for the first stage. We can add metadata at the second stage. A CBOR-based unixfs is much more approachable if we don't mix it with extensions of the contained metadata right away. Trying to do everything at once will make this harder and longer to pull off. |
I agree with many of the points above, however I do not understand why just creating dag-cbor nodes instead of dag-pb is a pressing issue by itself if they are going to have the same format (Links + Data). What am I missing? |
We could just start creating dag-cbor nodes instead of dag-pb nodes, but my understanding is that we want to get rid of protobuf structures (unixfs would still be protobufs in a dag-cbor node), not carry them over. "Pressing" might have been too strong a word -- there are just UX issues with unixfs still being based on a double protobuf structure that I think are significant. We can solve them sooner if we break up this endeavour into smaller, more digestible pieces. Unixfs is the data structure that almost every user is exposed to right from the start, and it's confusing that there's an additional layer, instead of it just being straightforward IPLD, like the filesystem examples in the IPLD spec. We're just still in the middle of the IPLD migration, and we have these two separate UX strains: This is basically the UX problem that motivated the unixfs/IPLD session during the labweek unconf in October (but I think unfortunately we didn't take good notes there). This problem makes it really complicated to explain how IPFS represents data. Most people eventually get it, but it's still a complication we should solve sooner rather than later. tl;dr I'm thinking moving forward with the general IPLD migration should take priority over extending unixfs features, for now. |
@Stebalien how important is the executable bit? To help move things along, I am very tempted to go with @lgierth's suggestion and not add new attributes. |
closing for archival |
This issue is to gather the requirements and/or desired features for the next generation unixfs directory.
@whyrusleeping @Stebalien @magik6k
ref: ipfs/kubo#4229