Proposal: Dat mounts / symlinks (revisited) #32
Comments
I really like this! Some thoughts:
Yeah good catch, I've been thinking about that but don't have an answer yet. I think it's pretty important that we cluster the swarms to make this perform the way it needs to, but I'm not sure how you avoid a segmented network (so to speak).
That's a good question. I'd be inclined to say any recursive behavior should either not recurse into mounts by default but have flags to do so, or never recurse into mounts ever.
I just dealt with this for symlinks in the local filesystem; it's actually not too difficult to detect and abort recursion if this occurs. (You just have to maintain a set of folders you visit, which means there's a memory cost.)
I do think we'd need some kind of field on the stat output.
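As an aside, here is a minimal sketch of that cycle-detection idea, assuming a hypothetical `getMountTarget()` helper that identifies the dat a folder is mounted from; none of these names are an existing hyperdrive API:

```js
// Walk a drive recursively, refusing to re-enter a mount we've already visited.
// `getMountTarget`, `readdir`, and `stat` are stand-ins for whatever the real API exposes.
async function walk (drive, dir, visited = new Set()) {
  const target = await drive.getMountTarget(dir) // hypothetical: '<key>:<path>' or null
  if (target) {
    if (visited.has(target)) return [] // cycle detected: abort recursion into this mount
    visited.add(target)
  }
  let results = []
  for (const name of await drive.readdir(dir)) {
    const entry = (dir === '/' ? '' : dir) + '/' + name
    const st = await drive.stat(entry)
    if (st.isDirectory()) results = results.concat(await walk(drive, entry, visited))
    else results.push(entry)
  }
  return results
}
```

The `visited` set is the memory cost mentioned above: one entry per mount encountered during the traversal.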
Maybe when you commit to "seeding" a dat, you join the networks for all the mounts? I think this shows that there should be a way to notify the application of all mounted dats. Either something like ...
That's a thought. I'm sure we'll figure something out.
Probably, yeah, because the replication code would need it.
One other thought: it might be useful to be able to pin the version of the upstream.
What about using a URL for mounting instead of a separate key / path? i.e. ...
@RangerMauve That's really just a question of API. The URL is a serialization of the same info. In the node ...
That's a good point.
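Since the thread keeps coming back to URLs vs. a separate `{ key, path }`, here is a small sketch of that serialization equivalence. The parser is hand-rolled for illustration and the `+version` suffix follows Beaker's URL convention; neither is an existing dat module API:

```js
// Convert between a mount descriptor and its dat:// URL serialization.
function mountToUrl ({ key, path = '/', version = null }) {
  return 'dat://' + key + (version != null ? '+' + version : '') + path
}

function urlToMount (url) {
  const m = /^dat:\/\/([0-9a-f]{64})(?:\+(\d+))?(\/.*)?$/.exec(url)
  if (!m) throw new Error('not a valid dat URL')
  return { key: m[1], version: m[2] ? Number(m[2]) : null, path: m[3] || '/' }
}

// e.g. mountToUrl({ key, path: '/assets', version: 42 }) -> 'dat://<key>+42/assets'
```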
Should the API look something like ...
Oh, yes, that was a typo. I edited the example in the original.
I was thinking about this too lately and quite like your proposal. One point might need some more thought: if implemented at the hyperdb level (which opens many doors, so ++), then we might also have to deal with the different value encodings of different dbs (i.e. what happens if, in your example, hyperdb ...).

I'm thinking of a use case I have, and which I presume will become not uncommon: combining a hyperdrive with another data structure. E.g. a hyperdrive with metadata in a hypergraph, or a forum in DatDB with JSON-encoded values, with the ability to browse uploaded files via regular hyperdrive-based dat tools (e.g. Beaker's filesystem view or the dat CLI), or some yet-upcoming dat-based, file-based version control system with embedded issue management. I think this proposal could allow for that. Or is this out of scope? I don't want to derail this. I am in the process of doing a writeup/proposal to integrate some support for subhyperdb into the surrounding dat tooling (which is the other way I am exploring).
@Frando yeah I think that's a fair point, we should give that some thought. Hypercores are having data structure identifiers added to them (DEP on the way) so it should be possible to semi-automatically resolve those lookups to the correct managing code.
An appealing possible feature of this would be that multi-writer permissions ...
This could be used to "hide" hypercore feeds from metadata analysis; one ...
Could wrapper feeds be used to solve the breakage (corruption) and upgrade ...
In the hyperdrive case, I think this could be prototyped at the application ...
My gut instinct is always to keep the protocol and protobuf schemas as simple ...
I also think there is a "semantic burden" of adding complexity to the ...
Yes. Elaborating on this for posterity: current discussion around multi-writer has been looking at dat-wide permission policies. Mounts would make it possible to create prefix-scoped permission schemes by using multiple dats while still using dat-wide permission policies.
That occurred to me RE the discussion around reader policy. I'm not sure I'd state it as a goal, however, because of the other discussion around possibly needing to swarm the archives individually at times.
I.e. by mounting to root? Probably not. If the wrapper feed gets corrupted, then there'd be no wrapper to fix that, whereas the version pointer in the other proposal has semantics which cannot be corrupted. (You can always fix the situation by publishing a pointer with a higher version... for some value of "always." We'd probably need to limit the number of major versions allowed before we end up with 1024-bit majors.)
I'd be okay with this, but we'll also need a way to talk with the replication layer so that it can exchange the additional cores.
I had a similar concern and that's why I've been a "no" on this for a while. The thing that ultimately swayed me was ...
Love this idea
A spitball thought: if we had an excellent FUSE (userspace filesystem mounting) implementation, would we even need this? Or, staying in some non-operating-system API, should we explode the generality here and allow mounting many different types of things into the same namespace? E.g. git repositories, hyperdrives, tar archives, HTTP remote directories, local filesystem folders; similar to the random-access-storage API.

And a design reference: would this address the design needs that sciencefair has? cc @blahah for feedback. IIRC they wanted to have a stable root hypercore to reference, but be able to swap out sub-hyperdrives (so they could refactor the content schema without being burdened with the full storage-history data size?), and to work around performance/scaling issues with many files in non-hyperdb hyperdrives.
I appreciate the spitballing, but I'm a hard no on that idea. Adding other protocol mounts is a ton of added complexity we don't need to consider right now. Each one of those alternative mount targets is a protocol that a dat-based platform needs to support, and each one is going to have its own characteristics that would require additional consideration. (For instance, a git repo has a ton of semantics around version control, while HTTP is a remote target.) The door isn't closed to it (I plan to use URLs to reference the mounts in dat.json in my proposed implementation), but I just don't think that's a productive space to consider at this point.
Another thought: re-using the existing connection for additional hypercore feeds (instead of doing a separate swarm) could result in discovery and history problems in some cases. E.g.: ...
This is sort of an existing issue with hypercore (how do you find peers with full history vs. just recent history?), and not all applications really care about having full history. A solution would be to have nodes join the swarm for sub-feeds, even if they are mostly using the efficiency trick Paul mentions (opening new channels on an existing connection instead of discovering peers the regular way); this would ensure that the historical sub-feeds are discoverable. Maybe a user-agent configuration option if there are discovery scaling issues ("the number of total active swarms can be reduced"); this may require real-world experimentation and testing to sort out.

User agents should also notice if they aren't able to find sub-feeds after several attempts at opening channels with existing peers, and then go through the full discovery process. I think that mechanism is a great optimization, but there should be an elegant fallback.

Platforms like hashbase.io will need to support automatically fetching sub-feeds; should they tie these together, or auto-create a new hypercore "item"/path? That needs design thought but is solvable. Such platforms would be natural places to store full history (and broadcast as such for discovery).

@pfrazee could you clarify the "number of total active swarms can be reduced" motivation? Is it to reduce the number of open TCP/uTP connections Beaker has to keep up with (causing battery/wifi waste), a performance issue with many swarms, or load on discovery servers/networks? I think we should try to minimize all of these, but I'm curious which, if any, are the most acute today.
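A sketch of that fallback behavior, under the assumption that the user agent has some swarm-manager component; `requestFromExistingPeers` and `joinSwarm` are placeholder names, not real hyperdiscovery calls:

```js
// Cheap path first: ask peers we already have for the sub-feed over new protocol channels.
// If that fails a few times, fall back to full discovery for the sub-feed's own swarm.
async function loadSubFeed (swarmManager, parentArchive, subFeedKey, { attempts = 3 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const peer = await swarmManager.requestFromExistingPeers(parentArchive, subFeedKey) // hypothetical
    if (peer) return peer
  }
  // Elegant fallback: nobody in the parent's swarm has this (possibly historical) sub-feed,
  // so announce/lookup on the discovery network directly.
  return swarmManager.joinSwarm(subFeedKey) // hypothetical
}
```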
@bnewbold you just articulated everything I've been thinking, including the potential approach of having the swarm-manager decide to join the swarm on not-found, or just prioritizing the subfeeds lower once it has to start juggling open connections.
There are really two things that interest me: ...
Right now, point 2 is the higher-value one, and we actually get that no matter what. Point 1 becomes important once we start scaling.
Will we get change events for mounted dats through the DatArchive API?
We'll have to see what the performance looks like if we bubble the events.
Glad this idea got picked up again. I had a naive implementation of symlinks on top of hyperdrive. It worked, but in order to re-use a swarm to sync all linked archives, the API becomes really awkward. It would be nice to have this built into hypercore/hyperdrive, so we can have a cleaner API.
So, this would be at the hyperdrive level, not hyperdb, right?
It appears that's the plan for now. |
I think this could play well with content addressability. I looked at the hyperdrive DEP a little and it seems to be pretty sparse at the moment. How would this relate to the Node type?
@mafintosh has been working on "strong" links with https://github.com/mafintosh/hypercore-strong-identifier, which might be what you're looking for.
I'm also a big 👍 on this. It'd be a very powerful feature, but it does add some non-obvious complexity. For one, versioning dbs containing many non-versioned symlinks could be confusing. The naive approach here is to store all links in an index, such that you can iterate over them and grab their versions during every call to the parent's ...

The latter seems like the right place to start, but it might lead to confusion if that behavior is not obvious to a user (i.e. they might check out the parent to a specific version, expecting a mounted library, say, to be at the correct version). While this obviates the need for the cross-db version computation, it's also limiting, as a user cannot have both a "live" link and the guarantee that checkouts will consistently work as expected (to do so requires checking out all links as well).

(P.S. Very much looking forward to being able to use this in upstream hyperdb, so I can get rid of the kludgy code in my approach here 😉)
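To make the "naive approach" concrete, here is a sketch of a link index being folded into the parent's version. `db.listMounts()`, the record shape, and the promise-style `version()` are assumptions for illustration, not hyperdb's actual API:

```js
// Compute a composite version: the parent's own version plus the current version
// of every mounted link, taken from a link index. Pinned links report their pin.
async function compositeVersion (db) {
  const links = await db.listMounts() // hypothetical: [{ path, db: linkedDb, pinnedVersion }]
  const linkVersions = await Promise.all(links.map(async (link) => ({
    path: link.path,
    version: link.pinnedVersion != null ? link.pinnedVersion : await link.db.version()
  })))
  return { version: await db.version(), links: linkVersions }
}
```

The cost pointed out above is visible here: every call touches every link, which is why pinning links (and skipping the cross-db computation) is the simpler place to start.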
@pfrazee, re: "strong links", I'm not sure those really address the need for content addressability, since they're tied to the feed and to the representation of the merkle tree rather than to the file contents.
@RangerMauve depends on whether you're just looking for a guarantee about what data you receive, or if you're also trying to get deduplication.
@andrewosh that's a good observation about versioning. I suppose it's not too different from package.json. If you consistently put ...
@pfrazee I definitely want deduplication. I think it's one of the things that Dat is lacking at the moment compared to IPFS. Although de-duplication between multiple Dats would require something fancy in the storage layer, I think that getting it within a hyperdrive is a great first step and will help with stuff like duplicate files and reverted changes. Plus, content addressability being in hyperdrive will make it easier to build storage layers that do global de-duplication. Also, content addressability will be useful for quick diffing, which I don't think strong identifiers will be good for, unless I'm misunderstanding something, since the merkle tree will be different per fork.
@RangerMauve one possibility for de-duplication across multiple Dats would be "flat symlinking" (for lack of a better term), which merges all files at the top level of the link (i.e. mounting ...). This approach doesn't have an analogue in the standard symlinking world, but for this particular use-case it might work well. If this were available, content-based deduping could be added easily in userland.
What about this idea: Dat mounts as a way to migrate to multiwriter. A lot of people have invested in existing, single-writer archives for stuff like their websites and fritter. Once we get multiwriter support, people will likely need to create new archives in order to use it. What if migrating your dat to a new URL was as easy as mounting your new archive URL on ...
@RangerMauve That's a pretty interesting thought. |
Question: The original description said it maps a folder to a folder; is there a reason not to allow linking a single file?

Use case: I think there is one more issue this addresses that has not been explicitly called out. Right now, Beaker apps that include assets from other dats totally break when mirrored over http(s). This would allow mounting those assets instead, so apps could link to them relatively.

Question 2: What would the write semantics be for mounted content? Presumably, write access to the mounted drive should allow writing. But what if the mount is pinned to a specific version?

Aside: I did write down my thoughts / wishes regarding module management for Beaker apps, and I think this proposal would greatly improve things.
No reason. Would it add much complexity?
IMO: you can write to the mount if it's not pinned to a version. Otherwise, no writing.
I agree. I think it's the best solution for a module system.
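A minimal sketch of that write rule (writable mounts only when not pinned to a version), using a hypothetical `resolveMount()` lookup; none of these names come from an existing hyperdrive release:

```js
// Route a write through a mount, rejecting it if the mount is pinned to a version.
async function writeThroughMounts (drive, path, data) {
  const mount = await drive.resolveMount(path) // hypothetical: { drive, path, version } or null
  if (!mount) return drive.writeFile(path, data)
  if (mount.version != null) {
    throw new Error('cannot write ' + path + ': mount is pinned to version ' + mount.version)
  }
  return mount.drive.writeFile(mount.path, data)
}
```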
This would also enable support for symlinks for stuff like BrowserFS, which would in turn enable better compatibility with something like isomorphic-git.
This idea isn't new, but I've recently realized there's a potential optimization that might make this worth prioritizing.
Let's call this proposal a "mount." It's conceptually simple, like a symlink but for dats. It could apply to both hyperdb and to hyperdrive. It is a pointer which maps a prefix/folder to a prefix/folder of another hyperdb or hyperdrive.
When a hyperdb/drive is replicated, the client would request any mounted dbs/drives using the same connection & swarm as the parent db/drive. This means that a mount does not have to incur additional swarming overhead. (This is the optimization I was referring to.)
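A rough sketch of that optimization, assuming the hypercore-era pattern in which `replicate()` can reuse an existing protocol stream (the exact option names may differ between hyperdrive versions, and `mounted` is just an array of mounted archives the caller has already resolved):

```js
// Replicate the parent archive and every mounted archive over one connection,
// instead of joining a separate swarm per mount.
function replicateWithMounts (archive, mounted, opts = {}) {
  const stream = archive.replicate(Object.assign({ live: true }, opts))
  for (const sub of mounted) {
    // Reuse the parent's protocol stream rather than creating a new swarm connection.
    sub.replicate(Object.assign({ live: true }, opts, { stream }))
  }
  return stream
}
```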
Mounting is generally useful to applications. It has the following uses:
- A `/vendor` directory could be populated with mounts to libraries.
- A `/users` directory could be populated with mounts to a site's active users; `/users/bob` could point to `bob`'s user.
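To ground those examples, here is what the call site might look like if hyperdrive/hyperdb grew a `mount()` method; the method name, option names, and placeholder keys are purely illustrative, not a committed API:

```js
// Hypothetical API: map a folder of this archive onto (a folder of) another dat.
await archive.mount('/vendor/some-lib', 'dat://<library-key>/dist', { version: 12 }) // pinned dependency
await archive.mount('/users/bob', 'dat://<bobs-key>/') // live mount of bob's user dat
```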