COW cp (--reflink) support #405
Link to zfs-discuss thread for this feature request: |
I have my doubts about whether this is needed, because the functionality is already available at the dataset (filesystem) level, and ZFS is intended to be used with fine-grained datasets (filesystems).
COW is already implemented across specific datasets -- e.g. clones or datasets with snapshots (promoted or otherwise). Therefore, I propose a more generally useful version of this request: implement or allow COW for all copy operations in the same pool, based on a setting implemented at both the dataset and pool level. |
Just to reiterate here what has been discussed in https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups=#!msg/zfs-discuss/mvGB7QEpt3w/r4xZ3eD7nn0J -- snapshots, dedup and so on already provide most of the benefits of COW hardlink breaking, but not all. Specifically, binaries and libs with the same inode number get mapped to the same memory locations for execution, which results in memory savings. This matters a lot with container-based virtualisation, where you may have dozens or hundreds of identical copies of libc6 or sshd or apache in memory. KSM helps there, but it needs madvise() and is not free (it costs CPU at runtime). COW hardlinks are essentially free. linux-vserver (http://linux-vserver.org/) already patches ext[234] and I think jfs to enable this sort of COW hardlink breaking functionality by abusing the immutable attribute and an otherwise unused attribute that linux-vserver repurposes as "break-hardlink-and-remove-immutable-bit-when-opened-for-writing". It would be great to have the same feature in zfsonlinux; its lack is one of the only reasons I can't put my vservers volume on zfs (the lack of posix or nfsv4 acl support is the other reason). So, to clarify, I'm requesting the following semantics:
This wouldn't break existing applications because the feature would not be enabled by default (you'd have to set the special xattr on a file to use it). |
As was brought up in the thread, we are currently using http://www.xmailserver.org/flcow.html on ext4 for file/dir level COW. This works, but if we were using ZFS we would much prefer to have the filesystem take care of the COW goodness. (For our narrow use case we can probably do everything we need to do with a filesystem per directory, but having our code just work with `cp` would be nice to have.) |
I would like to bump this feature. When I submitted it two years ago there were about 30 issues ahead of it; now there are about 500. It keeps moving farther and farther away as new issues are created ahead of it, like the expansion of the universe. I do understand this is a feature request and not a bugfix, but it would make ZFS a helluva lot more appealing for people. |
One of the problems with implementing this is that the directory entries are implemented as name-value pairs, which, at a glance, provides no obvious way of doing this. I just noticed today that the value is divided into 3 sections. The top 4 bits indicate the file type, the bottom 48 bits are the actual object and the middle 12 bits are unused. One of those unused bits could be repurposed to implement reflinks. Implementing reflinks would require much more than marking a bit, but the fact that we have spare bits available should be useful information to anyone who decides to work on this. |
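To make the layout described in the comment above easier to picture, here is a small, self-contained C sketch. It is not the actual OpenZFS code; the constant and helper names are invented for illustration, and only the field positions (4-bit type at the top, 48-bit object number at the bottom, 12 spare bits in between) come from the comment.

```c
/*
 * Illustrative sketch of the 64-bit directory-entry value described above:
 * the top 4 bits carry the file type, the bottom 48 bits the object number,
 * and the 12 bits in between are currently unused. Names and helpers are
 * invented for this example; the real definitions live in the OpenZFS headers.
 */
#include <stdint.h>
#include <stdio.h>

#define DE_OBJ_BITS   48
#define DE_SPARE_BITS 12
#define DE_TYPE_SHIFT (DE_OBJ_BITS + DE_SPARE_BITS)   /* 60 */

static uint64_t de_object(uint64_t de) { return de & ((1ULL << DE_OBJ_BITS) - 1); }
static uint64_t de_spare(uint64_t de)  { return (de >> DE_OBJ_BITS) & ((1ULL << DE_SPARE_BITS) - 1); }
static uint64_t de_type(uint64_t de)   { return de >> DE_TYPE_SHIFT; }

int main(void)
{
	/* Pack a hypothetical entry: type 4, object number 12345. */
	uint64_t de = (4ULL << DE_TYPE_SHIFT) | 12345ULL;

	printf("type=%llu spare=%llu object=%llu\n",
	    (unsigned long long)de_type(de),
	    (unsigned long long)de_spare(de),
	    (unsigned long long)de_object(de));
	return 0;
}
```

Repurposing one of those 12 spare bits would be enough to flag a reflinked entry, which is the observation being made above.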
Hi Ryao, thanks for noticing this :-) If deduplication does not need such bits or a different directory structure, then reflink should not need them either. I see reflink as a way to copy+deduplicate a specific file without the costs associated with copying and with deduplication, but with the same final result... Is that not so? |
@torn5 I believe you're correct, this could be done relatively easily by leveraging dedup. Basically, the reflink ioctl() would provide a user interface for per-file deduplication. As long as we're talking about a relatively small number of active files, the entire deduplication table should be easily cacheable and performance will be good. If implemented this way we'd inherit the existing dedup behavior for quotas and such. This makes the most sense for individual files in larger filesystems; for directories, creating a new dataset would still be best. |
Here is a scenario that I think this feature would be very helpful for: I take regular snapshots of my video collection. Because of COW, these snapshots do not take any space. However, a (young relative|virus|hostile alien) comes for a visit and deletes some videos from my collection, and I would like to recover them from my handy snapshots. If I use cp normally, each recovered video is duplicated in snapshots and in the active space. With cp --reflink, the file system would be signaled to COW the file to a new name, without taking any additional space, along with making recovery instantaneous. As an aside, is there a way to scan a ZFS volume and run an offline deduplication? If I had copied the data, is there a way to recover the space other than deleting all snapshots that contained the data? |
On Thu, Jan 16, 2014 at 05:16:38PM -0800, hufman wrote:
I'm not sure I see how that would work; it would need cross-filesystem reflinks. Normally, to recover from such situations, you'd just roll back to the snapshot. If this is a frequent occurrence, maybe you'd like to turn on deduplication.
None that I know of. What you could do is: enable dedup, then copy each file
Other than rolling back to a snapshot, no, I don't think so. Andras
|
Thank you for your response! |
My use-case for reflink is that we are building a record and replay debugging tool (https://github.com/mozilla/rr) and every time a debuggee process mmaps a file, we need to make a copy of that file so we can map it again later unchanged. Reflink makes that free. Filesystem snapshots won't work for us; they're heavier-weight, clumsy for this use-case, and far more difficult to implement in our tool. |
I also have a use for this: I use different zfs filesystems to control differing IO requirements. Currently, for my app to move items between these filesystems, it must do a full copy, which works, but makes things fairly unresponsive fairly regularly, as it copies dozens of gigabytes of data. I would consider using deduplication, but I'm fairly resource constrained as it is. |
If someone does decide to work on this I'm happy to help point them in the right direction. |
I had jotted down a possible way of doing this in #2554 in response to an inquiry about btrfs-style deduplication via reflinks. I am duplicating it here for future reference:
Thinking about it, the indirection tree itself is always significantly smaller than the data itself, so the act of writing would be bounded to some percentage of the data. We already suffer from this sort of penalty from the existing deduplication implementation, but at a much higher penalty as we must perform a random seek on each data access. Letting users copy only metadata through reflinks seems preferable to direct data copies that shuffle data through userland. This could avoid the penalties in the presence of dedup=on because all of the data has been preprocessed by our deduplication algorithm. That being said, the DDT has its own reference counts for each block, so we would either need to implement this in a way compatible with that or change it. We also need to consider the interaction with snapshots. |
Here is a technique that might be possible to apply: https://www.usenix.org/legacy/event/lsf07/tech/rodeh.pdf There are a few caveats:
|
I'm late to this party but I want to give a definitive and real use-case for this that is not satisfied by clones. We have a process sandbox which imports immutable files. Ideally, each imported file may be modified by the process but those modifications shouldn't change the immutable originals. Okay, that could be solved with a cloned file system and COW. However, from this sandbox we also want to extract (possibly very large) output files. This throws a wrench in our plans. We can't trivially import data from a clone back into the parent file system without doing a byte-by-byte copy. I think this is a problem worth solving. Clones (at least, as they currently exist) are not the right answer. |
Sorry if I lost something in this long-running discussion, but it seems to me that everybody is thinking in terms of snapshots, clones, dedup, etc. I am, personally, a big fan of dedup. But I know it has many memory drawbacks, because it is done online. In BTRFS and WAFL, dedup is done offline, so all that memory is only used during the dedup process. I think that the original intent of this request is to add "clone/dedup" functionality to cp. But not by enabling online dedup on the filesystem, nor by first copying and then deduping the file. Let the filesystem just create another "file" instance whose data is a set of CoW references to another file's blocks. Depending on how ZFS lays out data internally, I can imagine this even being used to "move" files between filesystems on the same pool. No payload disk block needs to be touched, only metadata. OK, there are cases in which blocksize, compression, etc. are set up differently. But IIRC, those settings only apply to newer files; old files keep what is already on disk. So it appears to me that this is not a problem. Maybe crypto, as @darthur already pointed out back in 2011, which is NOT even a real feature yet... There's already such a feature in WAFL. Please take a look at this: http://community.netapp.com/t5/Microsoft-Cloud-and-Virtualization-Discussions/What-is-SIS-Clone/td-p/6462 Let's start cloning single files!!! |
@dioni21 perhaps you should consider a zvol formatted with btrfs. |
Yeah, this is on me - in my haste I said (also, run time is not a great indicator of whether or not a clone happened, but that's by the by). Basically: |
As kernel 5.19+ removed the restriction about crossing mountpoints, what now prevents reflink copies between datasets? Thanks.
They have to have the same superblock. Btrfs subvolumes share a superblock, so this works there, but ZFS datasets have separate superblocks, so they fail that check. |
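For anyone who wants to see that check from userspace, here is a minimal C sketch of my own (not OpenZFS or kernel code; the paths are whatever you pass on the command line). It asks the kernel to clone one file into another via the FICLONE ioctl, the call behind `cp --reflink=always` on Linux; when the two files sit on different superblocks, the request comes back as EXDEV.

```c
/*
 * Minimal userspace sketch (not OpenZFS code): ask the kernel to clone src
 * into dst with the FICLONE ioctl. When the two files live on different
 * superblocks (e.g. two ZFS datasets) the kernel rejects the request
 * with EXDEV.
 */
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 2;
	}
	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(dst, FICLONE, src) == -1) {
		if (errno == EXDEV)
			fprintf(stderr, "clone refused: src and dst are on "
			    "different superblocks (e.g. different datasets)\n");
		else
			fprintf(stderr, "clone failed: %s\n", strerror(errno));
		return 1;
	}
	printf("cloned without copying any data\n");
	close(src);
	close(dst);
	return 0;
}
```

Run with both files in the same dataset on a pool with block cloning enabled, the clone can succeed; pointed at two different datasets, it reproduces the EXDEV behaviour described above.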
Would it be possible to use a similar trick as btrfs does or to remove the limitation upstream? |
I don't think that's a "trick," more like a fundamental design difference. I may be wrong, of course, and defer to more knowledgeable people on this one. |
Yeah, no trick. The superblock structure is where a lot of the Linux-side metadata and accounting are held for the mount, as well as being the "key" in a superblock->dataset mapping, so when Linux calls into OpenZFS we can find the right dataset. Trying to share a single superblock across multiple datasets is going to significantly complicate things inside OpenZFS, assuming it's even possible to do without causing real issues. ("How does Btrfs do it then?", you'll ask. And I don't know, but I do know it is a fundamentally different system with a very different scope, so I can easily imagine its needs are quite different internally. I don't think it's useful to compare.) Removing the limitation upstream is technically trivial, but working with Linux upstream takes a lot of time and energy I'm not sure many have available (I don't). It's also of limited benefit to anyone not on bleeding-edge kernels, which is the majority of OpenZFS users. (Fun fact: if you install 2.2-rc4 on RHEL7, |
I can assure you that soon enough LTS users will face the very same issue: bleeding edge doesn't remain so forever :) |
Be that as it may, it doesn't change anything. There isn't one neat trick to solve this particular issue, and the block cloning feature is still very useful even with this shortcoming. |
Unfortunately not so much for my use case, and there is nothing I can do to work around it, because the dataset is actually the same:
I want to use reflinks to restore data from snapshots (mainly virtual machine images) without having to roll back the dataset. Rolling back, I would lose subsequent snapshots, and I cannot use clones because that would break |
Could you create a new issue on cross-dataset reflinks? I think that's far past this issue's scope. |
A filesystem and a snapshot are not the same dataset. |
Have you tried using |
Yes I did, but unfortunately it doesn't work. I've tried timing how long it takes to copy the file and even checking the pool's available space before and after the copy. |
I am on stable LTS distros such as Debian 12 and Rocky 9, so I don't have first-hand experience with the latest kernels. On a Debian 12 box,
Notice how |
@shodanshok it seems like
@scineram since you've put a "thumbs down" on my previous comment may I ask you why you're against using reflinks to restore data from snapshots? Is there any reason why that would be considered a bad practice? |
It is perfectly normal for Please provide the output of |
Unfortunately
|
Add You also want the path relative to dataset on the calls to
|
Anyway, the problem is encryption. We can't clone encrypted blocks across datasets because the key material is partially bound to the source dataset (actually its encryption root). #14705 has a start on this. |
I already tried relative paths but ended up copy pasting the absolute ones in the end.
Unfortunately that's not it, because the same happens for the src one, which has been on disk for a very long time:
Maybe this has something to do with encryption as well?
Thanks, I will follow that issue. |
Ahh yeah, you'll need |
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
Closes #405
Closes openzfs#13349
is there something special needed to enable reflinks? I ran a |
Yeah, I didn't realize it was still so unstable, definitely gonna hold off for a bit, but thanks for clarifying my confusion. |
This is a feature request for an implementation of the BTRFS_IOC_CLONE in zfsonlinux, or something similar, so that it's possible to COW-copy a single file in zero space, zero RAM and zero time, without having to enable super-expensive things like filesystem-wide deduplication (which btw is not zero RAM or zero time).
If it can be done at the directory level, so to clone entire directory trees with one call, even better.
On the mailing list, doubts were raised regarding semantics on:
Firstly, I don't expect this to work across datasets. Secondly, I'd suggest that the same semantics as deduplication be used. It should just be a shortcut for 1) enabling deduplication + 2) copying the file by reading it byte-by-byte and writing it byte-by-byte elsewhere.
If you can implement exactly the BTRFS_IOC_CLONE, the same cp with --reflink that works for btrfs could be used here. If the IOC is different, we will also need patches to the cp program or another cp program.
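As a closing note on how the requested behaviour is reachable from applications today, here is a minimal sketch of my own (not code from this issue or from cp; the paths are placeholders): copy_file_range() lets the filesystem satisfy a copy by cloning blocks when it can and fall back to a plain data copy when it cannot.

```c
/*
 * Sketch of the modern user-facing interface for this request:
 * copy_file_range() gives the filesystem the chance to turn the copy into a
 * block clone and falls back to copying data otherwise. Whether a clone
 * actually happens is up to the kernel and the filesystem; the caller just
 * sees a completed copy.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 2;
	}
	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(src, &st) == -1) {
		perror("fstat");
		return 1;
	}

	off_t remaining = st.st_size;
	while (remaining > 0) {
		/* The kernel decides whether this clones or copies. */
		ssize_t n = copy_file_range(src, NULL, dst, NULL, remaining, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return 1;
		}
		remaining -= n;
	}
	printf("copied %lld bytes (cloned if the filesystem allowed it)\n",
	    (long long)st.st_size);
	close(src);
	close(dst);
	return 0;
}
```

On OpenZFS the clone path additionally requires the block_cloning pool feature and is subject to the dataset and encryption restrictions discussed earlier in the thread.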