
GrootFS

GrootFS is a tool with a command line interface (CLI) that provides filesystem isolation for containers. Isolated filesystems are also called root filesystems (or rootfses). Each Garden container references one rootfs, which is mounted as its root mountpoint.

CLI

The command line interface (CLI) implements Garden's image plugin binary interface and is used as an image plugin by Garden's volumizer. The CLI consists of the following commands (a typical invocation sequence is sketched after this list):

  • InitStore: used to create a new grootfs store. Required before creating any rootfses.
  • DeleteStore: deletes a store created by InitStore.
  • GenerateVolumeSizeMetadata: TODO
  • Capacity: returns store_size_bytes from the config file
  • Create: creates a new rootfs and returns information about it as a JSON document on standard out. Garden parses that JSON object and uses its data when building the container config.json.
  • Delete: deletes a rootfs, invoked when Garden destroys a container
  • Clean: cleans all unused layers from the cache in the store, invoked when the store size reaches a threshold
  • List: lists all the images in the store
  • Stats: returns store stats
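
For orientation, a typical lifecycle using the CLI might look like the following sketch. The config file path, rootfs ID and image URL are illustrative, and the exact flags should be checked against the grootfs help output:

grootfs --config /path/to/config.yml init-store
grootfs --config /path/to/config.yml create docker://cfgarden/strace my-rootfs-id
grootfs --config /path/to/config.yml delete my-rootfs-id
grootfs --config /path/to/config.yml clean
grootfs --config /path/to/config.yml delete-store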

What is a Root Filesystem (rootfs)?

The root filesystem is an overlay filesystem that is mounted under a directory in the GrootFS store. Every rootfs starts with a base image (a tar file or an OCI image) that consists of layers which are downloaded during rootfs creation. The base image layers are mounted as the read-only lower dirs of the overlay, while the writable upper dir and the workdir are created by GrootFS on rootfs creation. Thus containers can only change their writable upper layer but cannot change the base layers.

As base image layers are read-only for containers, they can be shared across different rootfses by simply mounting the same lower dirs into different overlay mounts, which optimises disk usage and avoids re-downloading layers that are already present. For example, two containers whose rootfses are based on the same base image (such as ubuntu) would each have their own rootfs with their own upper and work dirs, but the lower dirs in the overlay mounts would be the same (see the sketch below). Furthermore, when the second container is created, the layers from the ubuntu base image will not be downloaded again (they were downloaded when the first container was created), which also helps performance.
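
GrootFS performs the equivalent mounts internally; as a rough illustration of how two rootfses can share the same lower layer (the paths here are purely illustrative):

# first container's rootfs
mount -t overlay overlay -o lowerdir=/store/l/ubuntu,upperdir=/store/images/app-1/diff,workdir=/store/images/app-1/workdir /store/images/app-1/rootfs

# second container's rootfs reuses the same lowerdir
mount -t overlay overlay -o lowerdir=/store/l/ubuntu,upperdir=/store/images/app-2/diff,workdir=/store/images/app-2/workdir /store/images/app-2/rootfs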

Note: GrootFS has been designed to be filesystem agnostic, so it can be implemented on top of any Linux filesystem. Historically the team experimented with aufs, btrfs, overlayfs, xfs and ext4. For more information on the outcomes of those experiments and insight into why we settled on overlayfs on top of xfs, please read this blog post.

Notable Command Details

Init Store

GrootFS requires a place to create its rootfses. This is achieved by creating a sparse file, creating an xfs filesystem on it, and loop mounting it. Using a sparse file means that we can quickly generate a huge filesystem that initially takes very little space on the parent filesystem. Of course this also leads to confusion about how much space is actually used; see understanding grootfs store disk usage for a discussion. Beware that talk of reclaiming space only applies to the xfs filesystem within the sparse file. Once a sparse file expands to accommodate more content, it cannot be shrunk again. To do that you would need to copy the sparse file to a new sparse file, and that is not possible with GrootFS and Garden. So if a filesystem containing a GrootFS store is full because of the size of the sparse file, there is no easy fix.
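
Conceptually the store setup is similar to the following steps, which GrootFS performs itself (the size and paths are illustrative):

# create a sparse file: it reports a large size but initially occupies almost no disk space
truncate -s 10G /tmp/store/unprivileged.backing-store

# create an xfs filesystem inside the sparse file
mkfs.xfs /tmp/store/unprivileged.backing-store

# loop mount it at the store path
mount -o loop /tmp/store/unprivileged.backing-store /tmp/store/unprivileged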

When the command grootfs --config <CONFIG_FILE> init-store is run (with store set to an appropriate path in the config file), GrootFS creates the following directory structure. E.g. for store: /tmp/store/unprivileged:

/tmp/store/
├── unprivileged/
│   ├── images/
│   ├── l/
│   ├── locks/
│   ├── meta/
│   │   ├── dependencies/
│   │   └── namespace.json
│   ├── projectids/
│   ├── tmp/
│   ├── volumes/
│   └── whiteout_dev
└── unprivileged.backing-store

The unprivileged.backing-store file is the sparse file containing the xfs filesystem, which GrootFS mounts at /tmp/store/unprivileged.

The subdirectories are used as follows:

  • images: contains a directory per image. Inside there are the workdir and diff directories used by the overlayfs mount, and rootfs where the overlayfs is mounted.
  • l: shortname symlinks to volumes
  • locks: contains lock files
  • meta: contains details about uid/gid mappings, volumes used by images, and volume sizes
  • projectids: used by xfs to manage quotas
  • tmp: temporary storage
  • volumes: where the volumes (image layers) are stored

UID/GID Mapping

Garden uses user namespaces as a security measure, unless using privileged containers. This means all users in the container will be mapped to unprivileged users on the host. Root in particular is mapped to a userid we call maximus, which is the top of the UID range on the system.

The UID/GID mappings are the same for all containers, and the rootfses created by grootfs need to set appropriate file ownership according to these mappings. So if root in the container is mapped to 50000 on the host, a volume in grootfs containing a file that should be owned by root in the container must have uid 50000 set on it. The UID/GID mappings are passed as options to the init-store command and are stored in the meta/namespace.json file where they can be used during the create command.
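
As an illustration, an init-store invocation with mappings might look roughly like the following. The --uid-mapping/--gid-mapping flag names, the container-id:host-id:size format and the specific host IDs are assumptions made for this example and should be checked against the grootfs help output:

# container root (0) maps to a single high host UID ("maximus")
# all other container users map 1:1 to unprivileged host UIDs
grootfs --config /path/to/config.yml init-store \
  --uid-mapping 0:4294967294:1 \
  --uid-mapping 1:1:4294967293 \
  --gid-mapping 0:4294967294:1 \
  --gid-mapping 1:1:4294967293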

Existing Backing Stores

It is possible to init a store in a location where the backing-store file already exists. If it already contains an xfs filesystem, that will be mounted. If not, the filesystem will be created first.

Direct IO

The GrootFS store is a loop mounted file. Reads and writes to the xfs filesystem inside the store will be cached, and the backing file's reads and writes will be cached again by the filesystem it lives on. Better performance can be obtained by setting the direct-io flag on the loop device used for the backing file mount. This avoids caching at the loop device level, saving memory, and stops implicit syncing to the disk.

You can choose to enable direct IO as an argument to the init-store call.
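
Outside of GrootFS, roughly the same effect can be seen with util-linux's losetup; this is only an illustration of the mechanism, as GrootFS configures the loop device itself:

# attach the backing file to a loop device with direct IO enabled, printing the device used
losetup --direct-io=on --find --show /tmp/store/unprivileged.backing-store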

delete-store

The delete-store command removes all the images created in the store and all volumes, then unmounts the backing-store and deletes the mount path (i.e. the store path).

The backing-store file is not deleted, although it possibly should be if you need to reclaim space in it.

Create

The create command generates a rootfs in the store. The store must have been previously created using init-store.

The command takes a URL for the rootfs image, and an ID for the rootfs. The URL can be either:

  • a filesystem path to a tarball containing the container base image, e.g. /path/to/fs.tar
  • a Docker URL, e.g. docker://cfgarden/strace
  • an OCI URL pointing to an OCI image on the local filesystem, e.g. oci:///path/to/oci/file:version

The tarball URL will result in a single volume / layer, whereas the other two types may result in multiple volumes / layers. In all three cases, the layers are stored as tarballs, and grootfs will extract them.
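
For example, the three URL forms could be used like this (the config file path and rootfs IDs are illustrative):

grootfs --config /path/to/config.yml create /path/to/fs.tar tar-rootfs-id
grootfs --config /path/to/config.yml create docker://cfgarden/strace docker-rootfs-id
grootfs --config /path/to/config.yml create oci:///path/to/oci/file:version oci-rootfs-id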

Pulling the Base Image

The first main step of image creation is pulling the base image. First the image information is grabbed using either the tar_fetcher or the layer_fetcher, depending on the image URL. In the case of tar, the info is just the path to the tarball, and a chain ID constructed from some of its attributes. The layer_fetcher leans on containers/image to do the work.

Next the image layers are pulled. Since we need to process whiteouts, we must process the layers from lowest to highest. Layers have unique IDs, so we can check if we already have a layer, in which case we can exit early. Otherwise creating the layer involves creating a new volume for it, backed by a temporary directory in the store's tmp directory. When creating the temp directory we use the Re-execer in case container root permissions are required to change directory ownership. Then the local or remote layer, which is always in tar format, is streamed to a tar unpacker. This also uses the Re-execer, both for potential ownership requirements and to extract within a chrooted environment to protect against untar vulnerabilities.

Interestingly, if the user namespace is used in the re-execer, file ownership works transparently, and so a no-op uid/gid converter is used. When a user namespace is not used, uid/gid conversion is done explicitly. Why not always use the user namespace and get the correct uids/gids for free?

Whiteouts

In multi-layer OCI or docker images, when files or directories present in a lower layer are deleted in an upper layer, whiteout files are used as a mask. For instance, the file /usr/bin/danger might be whited-out using a file named /usr/bin/.wh.danger. Or the directory /home/to-delete and its contents could be removed with a file named /home/to-delete/.wh..wh.opq. When we extract tarred layers, we have to consider these whiteout files.

Our target format is overlayfs, which also uses whiteouts to mask files and directories in lower layers, though the format is of course different; see the definition here. So as we encounter OCI style whiteouts, we convert them to overlayfs whiteouts, which are implemented with extended file attributes and a special whiteout device.
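
In overlayfs a deleted file is represented by a character device with device numbers 0/0, and a fully replaced (opaque) directory by the trusted.overlay.opaque extended attribute. GrootFS performs the equivalent of the following during extraction, shown here as shell commands purely for illustration (paths are illustrative):

# the OCI whiteout /usr/bin/.wh.danger becomes an overlayfs whiteout device in the volume
mknod /path/to/volume/usr/bin/danger c 0 0

# the OCI opaque marker /home/to-delete/.wh..wh.opq becomes an opaque directory
setfattr -n trusted.overlay.opaque -v y /path/to/volume/home/to-delete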

Finalisation

When the layer is completely extracted and overlayfs whiteouts applied, the temporary directory is moved to its permanent location in the store volumes directory. The link in the l directory is also recreated to point to the new location.

Image Creation

Once all the image layers exist as volumes in the store, the image is created. A directory matching the requested image ID is created in the store images directory. A size quota is applied using tardis. Inside, directories are created for rootfs, diff and workdir.

By default, grootfs will mount the image. It creates an overlayfs mount of the base layers with diff as the upper layer, workdir as the work dir, and rootfs as the mount point.

If the --without-mount option was passed to create, then grootfs does not perform the mount itself; the overlayfs mount information is instead included in the spec returned to the user. This was intended for rootless operation, where grootfs runs as a non-root user and won't have permission to create a mount; instead, runc creates the mount in the container namespace.

Output

The create command sends a fragment of an OCI spec to stdout on completion. E.g.

{
    "ociVersion": "",
    "process": {
        "user": {
            "uid": 0,
            "gid": 0
        },
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ],
        "cwd": ""
    },
    "root": {
        "path": "/tmp/store/unprivileged/images/my-id/rootfs"
    }
}

If the --without-mount option was specified, then mount information will be included:

{
    "ociVersion": "",
    "process": {
        "user": {
            "uid": 0,
            "gid": 0
        },
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ],
        "cwd": ""
    },
    "root": {
        "path": "/tmp/store/unprivileged/images/layers/rootfs"
    },
    "mounts": [
        {
            "destination": "/",
            "type": "overlay",
            "source": "overlay",
            "options": [
                "lowerdir=/tmp/store/unprivileged/l/vWEirZk7g1401963,upperdir=/tmp/store/unprivileged/images/layers/diff,workdir=/tmp/store/unprivileged/images/layers/workdir"
            ]
        }
    ]
}

Delete

The delete command removes an image. It unmounts the image's rootfs, then recursively deletes the contents of the image ID directory in the store, also removing any quota files. Base volumes are left intact. They might be garbage collected at a later time if unused by other images.

Clean

The clean command takes a threshold in bytes. It calculates the sum of the quotas allowed for all existing images plus the size of all the volumes present. If this sum is greater than the threshold, the garbage collector is invoked.
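
For example (the --threshold-bytes flag name is taken from the create options mentioned below and is assumed here to apply to clean as well):

# with a 10 GiB threshold, the garbage collector only runs if image quotas plus volume sizes exceed it
grootfs --config /path/to/config.yml clean --threshold-bytes 10737418240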

The garbage collector first marks all volumes not currently used by any image as eligible for garbage collection. Then it iterates through these volumes, removing them.

The garbage collector may also be invoked after a create command, depending on the --with-clean and --threshold-bytes options.

In a Garden deployment, the thresholder is used to calculate the threshold value to use.