kernel: add support for unionfs 2.6

firmadyne · Sep 1, 2020 · 7affcae · 7affcae
1 parent 2df9c50
commit 7affcae
Show file tree

Hide file tree

Showing 42 changed files with 11,161 additions and 12 deletions.
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
@@ -110,6 +110,8 @@ udf.txt
 	- info and mount options for the UDF filesystem.
 ufs.txt
 	- info on the ufs filesystem.
+unionfs/
+	- info on the unionfs filesystem
 vfat.txt
 	- info on using the VFAT filesystem used in Windows NT and Windows 95
 vfs.txt

diff --git a/Documentation/filesystems/unionfs/00-INDEX b/Documentation/filesystems/unionfs/00-INDEX
@@ -0,0 +1,10 @@
+00-INDEX
+	- this file.
+concepts.txt
+	- A brief introduction of concepts.
+issues.txt
+	- A summary of known issues with unionfs.
+rename.txt
+	- Information regarding rename operations.
+usage.txt
+	- Usage information and examples.
diff --git a/Documentation/filesystems/unionfs/concepts.txt b/Documentation/filesystems/unionfs/concepts.txt
@@ -0,0 +1,287 @@
+Unionfs 2.x CONCEPTS:
+=====================
+
+This file describes the concepts needed by a namespace unification file
+system.
+
+
+Branch Priority:
+================
+
+Each branch is assigned a unique priority - starting from 0 (highest
+priority).  No two branches can have the same priority.
+
+
+Branch Mode:
+============
+
+Each branch is assigned a mode - read-write or read-only. This allows
+directories on media mounted read-write to be used in a read-only manner.
+
+
+Whiteouts:
+==========
+
+A whiteout removes a file name from the namespace. Whiteouts are needed when
+one attempts to remove a file on a read-only branch.
+
+Suppose we have a two-branch union, where branch 0 is read-write and branch
+1 is read-only. And a file 'foo' on branch 1:
+
+./b0/
+./b1/
+./b1/foo
+
+The unified view would simply be:
+
+./union/
+./union/foo
+
+Since 'foo' is stored on a read-only branch, it cannot be removed. A
+whiteout is used to remove the name 'foo' from the unified namespace. Again,
+since branch 1 is read-only, the whiteout cannot be created there. So, we
+try on a higher priority (lower numerically) branch and create the whiteout
+there.
+
+./b0/
+./b0/.wh.foo
+./b1/
+./b1/foo
+
+Later, when Unionfs traverses branches (due to lookup or readdir), it
+eliminate 'foo' from the namespace (as well as the whiteout itself.)
+
+
+Opaque Directories:
+===================
+
+Assume we have a unionfs mount comprising of two branches.  Branch 0 is
+empty; branch 1 has the directory /a and file /a/f.  Let's say we mount a
+union of branch 0 as read-write and branch 1 as read-only.  Now, let's say
+we try to perform the following operation in the union:
+
+	rm -fr a
+
+Because branch 1 is not writable, we cannot physically remove the file /a/f
+or the directory /a.  So instead, we will create a whiteout in branch 0
+named /.wh.a, masking out the name "a" from branch 1.  Next, let's say we
+try to create a directory named "a" as follows:
+
+	mkdir a
+
+Because we have a whiteout for "a" already, Unionfs behaves as if "a"
+doesn't exist, and thus will delete the whiteout and replace it with an
+actual directory named "a".
+
+The problem now is that if you try to "ls" in the union, Unionfs will
+perform is normal directory name unification, for *all* directories named
+"a" in all branches.  This will cause the file /a/f from branch 1 to
+re-appear in the union's namespace, which violates Unix semantics.
+
+To avoid this problem, we have a different form of whiteouts for
+directories, called "opaque directories" (same as BSD Union Mount does).
+Whenever we replace a whiteout with a directory, that directory is marked as
+opaque.  In Unionfs 2.x, it means that we create a file named
+/a/.wh.__dir_opaque in branch 0, after having created directory /a there.
+When unionfs notices that a directory is opaque, it stops all namespace
+operations (including merging readdir contents) at that opaque directory.
+This prevents re-exposing names from masked out directories.
+
+
+Duplicate Elimination:
+======================
+
+It is possible for files on different branches to have the same name.
+Unionfs then has to select which instance of the file to show to the user.
+Given the fact that each branch has a priority associated with it, the
+simplest solution is to take the instance from the highest priority
+(numerically lowest value) and "hide" the others.
+
+
+Unlinking:
+=========
+
+Unlink operation on non-directory instances is optimized to remove the
+maximum possible objects in case multiple underlying branches have the same
+file name.  The unlink operation will first try to delete file instances
+from highest priority branch and then move further to delete from remaining
+branches in order of their decreasing priority.  Consider a case (F..D..F),
+where F is a file and D is a directory of the same name; here, some
+intermediate branch could have an empty directory instance with the same
+name, so this operation also tries to delete this directory instance and
+proceed further to delete from next possible lower priority branch.  The
+unionfs unlink operation will smoothly delete the files with same name from
+all possible underlying branches.  In case if some error occurs, it creates
+whiteout in highest priority branch that will hide file instance in rest of
+the branches.  An error could occur either if an unlink operations in any of
+the underlying branch failed or if a branch has no write permission.
+
+This unlinking policy is known as "delete all" and it has the benefit of
+overall reducing the number of inodes used by duplicate files, and further
+reducing the total number of inodes consumed by whiteouts.  The cost is of
+extra processing, but testing shows this extra processing is well worth the
+savings.
+
+
+Copyup:
+=======
+
+When a change is made to the contents of a file's data or meta-data, they
+have to be stored somewhere.  The best way is to create a copy of the
+original file on a branch that is writable, and then redirect the write
+though to this copy.  The copy must be made on a higher priority branch so
+that lookup and readdir return this newer "version" of the file rather than
+the original (see duplicate elimination).
+
+An entire unionfs mount can be read-only or read-write.  If it's read-only,
+then none of the branches will be written to, even if some of the branches
+are physically writeable.  If the unionfs mount is read-write, then the
+leftmost (highest priority) branch must be writeable (for copyup to take
+place); the remaining branches can be any mix of read-write and read-only.
+
+In a writeable mount, unionfs will create new files/dir in the leftmost
+branch.  If one tries to modify a file in a read-only branch/media, unionfs
+will copyup the file to the leftmost branch and modify it there.  If you try
+to modify a file from a writeable branch which is not the leftmost branch,
+then unionfs will modify it in that branch; this is useful if you, say,
+unify differnet packages (e.g., apache, sendmail, ftpd, etc.) and you want
+changes to specific package files to remain logically in the directory where
+they came from.
+
+Cache Coherency:
+================
+
+Unionfs users often want to be able to modify files and directories directly
+on the lower branches, and have those changes be visible at the Unionfs
+level.  This means that data (e.g., pages) and meta-data (dentries, inodes,
+open files, etc.) have to be synchronized between the upper and lower
+layers.  In other words, the newest changes from a layer below have to be
+propagated to the Unionfs layer above.  If the two layers are not in sync, a
+cache incoherency ensues, which could lead to application failures and even
+oopses.  The Linux kernel, however, has a rather limited set of mechanisms
+to ensure this inter-layer cache coherency---so Unionfs has to do most of
+the hard work on its own.
+
+Maintaining Invariants:
+
+The way Unionfs ensures cache coherency is as follows.  At each entry point
+to a Unionfs file system method, we call a utility function to validate the
+primary objects of this method.  Generally, we call unionfs_file_revalidate
+on open files, and __unionfs_d_revalidate_chain on dentries (which also
+validates inodes).  These utility functions check to see whether the upper
+Unionfs object is in sync with any of the lower objects that it represents.
+The checks we perform include whether the Unionfs superblock has a newer
+generation number, or if any of the lower objects mtime's or ctime's are
+newer.  (Note: generation numbers change when branch-management commands are
+issued, so in a way, maintaining cache coherency is also very important for
+branch-management.)  If indeed we determine that any Unionfs object is no
+longer in sync with its lower counterparts, then we rebuild that object
+similarly to how we do so for branch-management.
+
+While rebuilding Unionfs's objects, we also purge any page mappings and
+truncate inode pages (see fs/unionfs/dentry.c:purge_inode_data).  This is to
+ensure that Unionfs will re-get the newer data from the lower branches.  We
+perform this purging only if the Unionfs operation in question is a reading
+operation; if Unionfs is performing a data writing operation (e.g., ->write,
+->commit_write, etc.) then we do NOT flush the lower mappings/pages: this is
+because (1) a self-deadlock could occur and (2) the upper Unionfs pages are
+considered more authoritative anyway, as they are newer and will overwrite
+any lower pages.
+
+Unionfs maintains the following important invariant regarding mtime's,
+ctime's, and atime's: the upper inode object's times are the max() of all of
+the lower ones.  For non-directory objects, there's only one object below,
+so the mapping is simple; for directory objects, there could me multiple
+lower objects and we have to sync up with the newest one of all the lower
+ones.  This invariant is important to maintain, especially for directories
+(besides, we need this to be POSIX compliant).  A union could comprise
+multiple writable branches, each of which could change.  If we don't reflect
+the newest possible mtime/ctime, some applications could fail.  For example,
+NFSv2/v3 exports check for newer directory mtimes on the server to determine
+if the client-side attribute cache should be purged.
+
+To maintain these important invariants, of course, Unionfs carefully
+synchronizes upper and lower times in various places.  For example, if we
+copy-up a file to a top-level branch, the parent directory where the file
+was copied up to will now have a new mtime: so after a successful copy-up,
+we sync up with the new top-level branch's parent directory mtime.
+
+Implementation:
+
+This cache-coherency implementation is efficient because it defers any
+synchronizing between the upper and lower layers until absolutely needed.
+Consider the example a common situation where users perform a lot of lower
+changes, such as untarring a whole package.  While these take place,
+typically the user doesn't access the files via Unionfs; only after the
+lower changes are done, does the user try to access the lower files.  With
+our cache-coherency implementation, the entirety of the changes to the lower
+branches will not result in a single CPU cycle spent at the Unionfs level
+until the user invokes a system call that goes through Unionfs.
+
+We have considered two alternate cache-coherency designs.  (1) Using the
+dentry/inode notify functionality to register interest in finding out about
+any lower changes.  This is a somewhat limited and also a heavy-handed
+approach which could result in many notifications to the Unionfs layer upon
+each small change at the lower layer (imagine a file being modified multiple
+times in rapid succession).  (2) Rewriting the VFS to support explicit
+callbacks from lower objects to upper objects.  We began exploring such an
+implementation, but found it to be very complicated--it would have resulted
+in massive VFS/MM changes which are unlikely to be accepted by the LKML
+community.  We therefore believe that our current cache-coherency design and
+implementation represent the best approach at this time.
+
+Limitations:
+
+Our implementation works in that as long as a user process will have caused
+Unionfs to be called, directly or indirectly, even to just do
+->d_revalidate; then we will have purged the current Unionfs data and the
+process will see the new data.  For example, a process that continually
+re-reads the same file's data will see the NEW data as soon as the lower
+file had changed, upon the next read(2) syscall (even if the file is still
+open!)  However, this doesn't work when the process re-reads the open file's
+data via mmap(2) (unless the user unmaps/closes the file and remaps/reopens
+it).  Once we respond to ->readpage(s), then the kernel maps the page into
+the process's address space and there doesn't appear to be a way to force
+the kernel to invalidate those pages/mappings, and force the process to
+re-issue ->readpage.  If there's a way to invalidate active mappings and
+force a ->readpage, let us know please (invalidate_inode_pages2 doesn't do
+the trick).
+
+Our current Unionfs code has to perform many file-revalidation calls.  It
+would be really nice if the VFS would export an optional file system hook
+->file_revalidate (similarly to dentry->d_revalidate) that will be called
+before each VFS op that has a "struct file" in it.
+
+Certain file systems have micro-second granularity (or better) for inode
+times, and asynchronous actions could cause those times to change with some
+small delay.  In such cases, Unionfs may see a changed inode time that only
+differs by a tiny fraction of a second: such a change may be a false
+positive indication that the lower object has changed, whereas if unionfs
+waits a little longer, that false indication will not be seen.  (These false
+positives are harmless, because they would at most cause unionfs to
+re-validate an object that may need no revalidation, and print a debugging
+message that clutters the console/logs.)  Therefore, to minimize the chances
+of these situations, we delay the detection of changed times by a small
+factor of a few seconds, called UNIONFS_MIN_CC_TIME (which defaults to 3
+seconds, as does NFS).  This means that we will detect the change, only a
+couple of seconds later, if indeed the time change persists in the lower
+file object.  This delayed detection has an added performance benefit: we
+reduce the number of times that unionfs has to revalidate objects, in case
+there's a lot of concurrent activity on both the upper and lower objects,
+for the same file(s).  Lastly, this delayed time attribute detection is
+similar to how NFS clients operate (e.g., acregmin).
+
+Finally, there is no way currently in Linux to prevent lower directories
+from being moved around (i.e., topology changes); there's no way to prevent
+modifications to directory sub-trees of whole file systems which are mounted
+read-write.  It is therefore possible for in-flight operations in unionfs to
+take place, while a lower directory is being moved around.  Therefore, if
+you try to, say, create a new file in a directory through unionfs, while the
+directory is being moved around directly, then the new file may get created
+in the new location where that directory was moved to.  This is a somewhat
+similar behaviour in NFS: an NFS client could be creating a new file while
+th NFS server is moving th directory around; the file will get successfully
+created in the new location.  (The one exception in unionfs is that if the
+branch is marked read-only by unionfs, then a copyup will take place.)
+
+For more information, see <http://unionfs.filesystems.org/>.
diff --git a/Documentation/filesystems/unionfs/issues.txt b/Documentation/filesystems/unionfs/issues.txt
@@ -0,0 +1,28 @@
+KNOWN Unionfs 2.x ISSUES:
+=========================
+
+1. Unionfs should not use lookup_one_len() on the underlying f/s as it
+   confuses NFSv4.  Currently, unionfs_lookup() passes lookup intents to the
+   lower file-system, this eliminates part of the problem.  The remaining
+   calls to lookup_one_len may need to be changed to pass an intent.  We are
+   currently introducing VFS changes to fs/namei.c's do_path_lookup() to
+   allow proper file lookup and opening in stackable file systems.
+
+2. Lockdep (a debugging feature) isn't aware of stacking, and so it
+   incorrectly complains about locking problems.  The problem boils down to
+   this: Lockdep considers all objects of a certain type to be in the same
+   class, for example, all inodes.  Lockdep doesn't like to see a lock held
+   on two inodes within the same task, and warns that it could lead to a
+   deadlock.  However, stackable file systems do precisely that: they lock
+   an upper object, and then a lower object, in a strict order to avoid
+   locking problems; in addition, Unionfs, as a fan-out file system, may
+   have to lock several lower inodes.  We are currently looking into Lockdep
+   to see how to make it aware of stackable file systems.  For now, we
+   temporarily disable lockdep when calling vfs methods on lower objects,
+   but only for those places where lockdep complained.  While this solution
+   may seem unclean, it is not without precedent: other places in the kernel
+   also do similar temporary disabling, of course after carefully having
+   checked that it is the right thing to do.  Anyway, you get any warnings
+   from Lockdep, please report them to the Unionfs maintainers.
+
+For more information, see <http://unionfs.filesystems.org/>.
diff --git a/Documentation/filesystems/unionfs/rename.txt b/Documentation/filesystems/unionfs/rename.txt
@@ -0,0 +1,31 @@
+Rename is a complex beast. The following table shows which rename(2) operations
+should succeed and which should fail.
+
+o: success
+E: error (either unionfs or vfs)
+X: EXDEV
+
+none = file does not exist
+file = file is a file
+dir  = file is a empty directory
+child= file is a non-empty directory
+wh   = file is a directory containing only whiteouts; this makes it logically
+		empty
+
+                      none    file    dir     child   wh
+file                  o       o       E       E       E
+dir                   o       E       o       E       o
+child                 X       E       X       E       X
+wh                    o       E       o       E       o
+
+
+Renaming directories:
+=====================
+
+Whenever a empty (either physically or logically) directory is being renamed,
+the following sequence of events should take place:
+
+1) Remove whiteouts from both source and destination directory
+2) Rename source to destination
+3) Make destination opaque to prevent anything under it from showing up
+