GrootFS
GrootFS is a tool with a command line interface (CLI) that provides filesystem
isolation for containers. Isolated filesystems are also called root filesystems
(or rootfses).
Each Garden container references one rootfs that is mounted as its root mountpoint.
The CLI implements Garden's image plugin binary interface and is used as an image plugin by Garden's volumizer. It consists of the following commands:
- InitStore: creates a new grootfs store. Required before creating any rootfses.
- DeleteStore: deletes a store created by InitStore.
- GenerateVolumeSizeMetadata: TODO
- Capacity: returns store_size_bytes from the config file.
- Create: creates a new rootfs and returns information about it (a JSON document) on standard output. Garden parses that JSON object and uses its data when building the container's config.json.
- Delete: deletes a rootfs. Invoked when Garden destroys a container.
- Clean: cleans all unused layers from the cache in the store. Invoked when the store size reaches a threshold.
- List: lists all the images in the store.
- Stats: returns store stats.
The root filesystem is an overlay filesystem that is mounted under a directory
in the GrootFS store. Every rootfs starts with a base image (a tar file, or an
OCI image) that consists of layers that are downloaded during rootfs creation.
The base image layers are mounted as the read-only lower dirs of the overlay,
while the writable upper dir (diff) and the workdir are created by GrootFS on
rootfs creation. Thus containers can only write to their own upper dir and
cannot change the base layers.
As base image layers are read-only for containers, they can be shared across
different rootfses by simply mounting the same lower dirs into different
overlay mounts. This optimises disk usage and avoids downloading layers that
are already present. For example, two containers whose rootfses are based on
the same base image (such as ubuntu) would each have their own rootfs with its
own diff and workdir, but the lower dirs in the overlay mount would be the
same. Furthermore, when the second container is created, the layers from the
ubuntu base image are not downloaded again (they were downloaded when the
first container was created), which also helps performance.
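To make the sharing concrete, here is a minimal Python sketch that builds the overlay mount options for two images based on the same layers. The helper name overlay_options, the layer names and the image IDs are hypothetical. The sketch points lowerdir at the volumes directories for readability, whereas GrootFS references layers through short symlinks. The ordering follows the kernel's overlayfs convention that the leftmost lowerdir entry is the topmost layer.

```python
import posixpath


def overlay_options(store, image_id, layer_volumes):
    """Build the option string for an overlay mount of one image.

    layer_volumes lists the base layers from lowest to highest. The kernel
    stacks lowerdir entries with the leftmost on top, so the list is reversed.
    """
    image_dir = posixpath.join(store, "images", image_id)
    lower = ":".join(
        posixpath.join(store, "volumes", v) for v in reversed(layer_volumes)
    )
    return ",".join([
        "lowerdir=" + lower,
        "upperdir=" + posixpath.join(image_dir, "diff"),
        "workdir=" + posixpath.join(image_dir, "workdir"),
    ])


# Hypothetical layer names for an ubuntu-like base image.
ubuntu_layers = ["layer-base", "layer-apt", "layer-config"]
store = "/tmp/store/unprivileged"

opts_a = overlay_options(store, "container-a", ubuntu_layers)
opts_b = overlay_options(store, "container-b", ubuntu_layers)
print(opts_a)
print(opts_b)
```

Both option strings carry the same lowerdir value and differ only in upperdir and workdir, which is why the shared layers need to be stored only once.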
Note: GrootFS has been designed to be filesystem agnostic, so it can be implemented on top of any Linux filesystem. Historically the team has experimented with aufs, btrfs, overlayfs, xfs and ext4. For more information on the outcomes of those experiments, and insight into why we settled on overlayfs on top of xfs, please read this blog post.
GrootFS requires a place to create its rootfses. This is achieved by creating a sparse file, creating an xfs filesystem on it, and loop mounting it. Using a sparse file means that we can quickly generate a huge filesystem that initially takes very little space on the parent filesystem. This also makes it harder to tell how much space is actually used. See understanding grootfs store disk usage for a discussion. Beware that talk of reclaiming space only applies to the xfs filesystem within the sparse file. Once a sparse file has expanded to accommodate more content, it cannot be shrunk again. To shrink it you would need to copy its contents to a new sparse file, which is not possible with GrootFS and Garden. So if the filesystem containing a GrootFS store is full because of the size of the sparse file, there is no easy fix.
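The gap between apparent size and allocated size can be seen with a short Python sketch. It only creates the sparse file and does not create the xfs filesystem or loop mount it. The helper names and the example.backing-store file name are illustrative, and the sketch assumes a Linux filesystem that supports sparse files.

```python
import os
import tempfile


def create_sparse_file(path, size_bytes):
    """Create a file whose apparent size is size_bytes without writing data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def disk_usage(path):
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    backing = os.path.join(tmp, "example.backing-store")
    create_sparse_file(backing, 64 * 1024 * 1024)
    apparent, allocated = disk_usage(backing)
    print("apparent:", apparent, "allocated:", allocated)
```

Tools that report the apparent size will show the full 64 MiB, while tools that count allocated blocks will show far less until data is written into the file.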
When the command grootfs --config <CONFIG_FILE> init-store is run (with store set to an appropriate path in the config file), GrootFS creates the following directory structure. For example, for store: /tmp/store/unprivileged:
/tmp/store/
├── unprivileged/
│ ├── images/
│ ├── l/
│ ├── locks/
│ ├── meta/
│ │ ├── dependencies/
│ │ └── namespace.json
│ ├── projectids/
│ ├── tmp/
│ ├── volumes/
│ └── whiteout_dev
└── unprivileged.backing-store
The unprivileged.backing-store file is the sparse file containing the xfs filesystem.
The xfs filesystem it contains is mounted by GrootFS at /tmp/store/unprivileged.
The subdirectories are used as follows:
- images: contains a directory per image. Inside there are the workdir and diff directories used by the overlayfs mount, and rootfs, where the overlayfs is mounted.
- l: shortname symlinks to volumes
- locks: contains lock files
- meta: contains details about uid/gid mappings, volumes used by images, and volume sizes
- projectids: used by xfs to manage quotas
- tmp: temporary storage
- volumes: where the volumes (image layers) are stored
Garden uses user namespaces as a security measure, unless using privileged containers. This means all users in the container will be mapped to unprivileged users on the host. Root in particular is mapped to a userid we call maximus, which is the top of the UID range on the system.
The UID/GID mappings are the same for all containers, and the rootfses created by grootfs need to set appropriate file ownership according to these mappings.
So if root in the container is mapped to 50000 on the host, a volume in grootfs containing a file that should be owned by root in the container must have uid 50000 set on it.
The UID/GID mappings are passed as options to the init-store command and are stored in the meta/namespace.json file, where they can be used during the create command.
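The translation from a container ID to a host ID can be sketched in a few lines of Python. The (namespace start, host start, length) triple is the layout used by /proc/<pid>/uid_map. The on-disk format of namespace.json is not shown here, and every mapping value other than root to 50000 is a hypothetical example.

```python
def map_id(container_id, mappings):
    """Translate an ID inside the user namespace to the host ID.

    Each mapping is (first_id_in_namespace, first_id_on_host, length), the
    same triple layout used by /proc/<pid>/uid_map.
    """
    for ns_start, host_start, length in mappings:
        if ns_start <= container_id < ns_start + length:
            return host_start + (container_id - ns_start)
    raise ValueError(f"id {container_id} is not mapped")


# Hypothetical mappings: container root maps to host uid 50000 (the example
# used in the text); container ids 1..999 map to host ids 100000..100998.
uid_mappings = [(0, 50000, 1), (1, 100000, 999)]

print(map_id(0, uid_mappings))   # 50000
print(map_id(33, uid_mappings))  # 100032
```

A file that should appear as owned by uid 33 inside the container would therefore be stored in the volume with owner 100032 under these example mappings.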
It is possible to run init-store in a location where the backing-store file already exists. If the file already contains an xfs filesystem, that filesystem is mounted. If not, the filesystem is created first.
The GrootFS store is a loop mounted file. Reads and writes to the xfs filesystem inside the store are cached by that xfs filesystem, and reads and writes to the backing file are cached again by the filesystem that holds it. Better performance can be obtained by setting the direct-io flag on the loop device used for the backing file mount. This avoids caching on the loop device, saving memory, and stops implicit syncing to the disk.
You can choose to enable direct IO as an argument to the init-store call.
The delete store command removes all the images and volumes in the store, then unmounts the backing-store and deletes the mount path (i.e. the store path).
The backing-store file is not deleted, although arguably it should be when its space needs to be reclaimed.
The create command generates a rootfs in the store. The store must have been previously created using init-store.
The command takes a URL for the rootfs image, and an ID for the rootfs. The URL can be either:
- a filesystem path to a tarball containing the container base image, e.g. /path/to/fs.tar
- a Docker URL, e.g. docker://cfgarden/strace
- an OCI URL pointing to an OCI image on the local filesystem, e.g. oci:///path/to/oci/file:version
The tarball URL will result in a single volume / layer, whereas the other two types may result in multiple volumes / layers. In all three cases, the layers are stored as tarballs, and grootfs will extract them.
The first main step of image creation is pulling the base image.
First the image information is grabbed using either the tar_fetcher or the layer_fetcher, depending on the image URL.
In the case of tar, the info is just the path to the tarball, and a chain ID constructed from some of its attributes.
The layer_fetcher leans on containers/image to do the work.
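The choice between the two fetchers can be illustrated with a short Python sketch that dispatches on the URL scheme. The exact rule GrootFS applies is not stated here, so treating a scheme-less path as a tarball and docker or oci as layered images is an assumption, and pick_fetcher is a hypothetical helper.

```python
from urllib.parse import urlparse


def pick_fetcher(image_url):
    """Return which fetcher would handle image_url (illustrative rule only)."""
    scheme = urlparse(image_url).scheme
    if scheme == "":
        # A plain filesystem path is assumed to be a local tarball.
        return "tar_fetcher"
    if scheme in ("docker", "oci"):
        return "layer_fetcher"
    raise ValueError(f"unsupported image URL scheme: {scheme!r}")


for url in ("/path/to/fs.tar", "docker://cfgarden/strace",
            "oci:///path/to/oci/file:version"):
    print(url, "->", pick_fetcher(url))
```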
Next the image layers are pulled. Since we need to process whiteouts, we must process the layers from lowest to highest. Layers have unique IDs, so we can check if we already have the layer, in which case we can exit early. Otherwise creating the layer involves creating a new volume for each layer, backed by a temporary directory in the store/tmp directory. When creating the temp directory we use the Re-execer in case container root permissions are required to change directory ownership. Then the local or remote layer, which is always in tar format, is streamed to a tar unpacker. This also uses the Re-execer, both for potential ownership requirements, and also to extract within a chrooted environment to protect against untar vulnerabilities.
Interestingly, if the user namespace is used in the re-execer, file ownership works transparently, and so a no-op uid/gid converter is used. When a user namespace is not used, uid/gid conversion is done explicitly. Why not always use the user namespace and get the correct uids/gids for free?
In multi-layer OCI or docker images, when files or directories present in a lower layer are deleted in an upper layer, whiteout files are used as a mask.
For instance, the file /usr/bin/danger might be whited-out using a file named /usr/bin/.wh.danger.
Or the contents that the directory /home/to-delete inherits from lower layers could be hidden with an opaque whiteout file named /home/to-delete/.wh..wh..opq.
When we extract tarred layers, we have to consider these whiteout files.
Our target format is overlayfs. This also uses whiteouts to mask files and directories in lower levels. The format is different of course. See the definition here. So as we encounter OCI style whiteouts, we convert them to overlayfs whiteouts, which are implemented with extended file attributes and a special whiteout device.
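The first half of that conversion, recognising OCI style whiteouts, can be sketched in Python as a pure function over tar entry paths. The function name classify_entry is hypothetical. The opaque marker name .wh..wh..opq comes from the OCI image layer specification, and the overlayfs equivalents named in the comments (a character device with device number 0/0, and the trusted.overlay.opaque extended attribute) come from the kernel's overlayfs documentation. The sketch does not create devices or set attributes.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def classify_entry(path):
    """Classify a tar entry path from an OCI layer.

    Returns (kind, target) where kind is "opaque", "whiteout" or "regular".
    """
    directory, name = posixpath.split(path)
    if name == OPAQUE_MARKER:
        # overlayfs equivalent: set trusted.overlay.opaque=y on the directory.
        return "opaque", directory
    if name.startswith(WHITEOUT_PREFIX):
        # overlayfs equivalent: a character device 0/0 at the target path.
        target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
        return "whiteout", target
    return "regular", path


for entry in ("/usr/bin/.wh.danger", "/home/to-delete/.wh..wh..opq",
              "/etc/passwd"):
    print(entry, "->", classify_entry(entry))
```

The opaque marker is checked first because its name also starts with the ordinary whiteout prefix.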
When the layer is completely extracted and overlayfs whiteouts applied, the temporary directory is moved to its permanent location in the store volumes directory.
The link in the l directory is also recreated to point to the new location.
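The unpack-then-move pattern can be sketched in Python as follows. This is a simplified illustration and not the GrootFS implementation: it skips the Re-execer, the chroot, ownership translation, whiteout conversion and the l symlink, and ensure_volume is a hypothetical helper name.

```python
import os
import tarfile
import tempfile


def ensure_volume(store, layer_id, layer_tar):
    """Unpack layer_tar into <store>/volumes/<layer_id> unless it exists.

    Returns (volume_path, created).
    """
    volume = os.path.join(store, "volumes", layer_id)
    if os.path.isdir(volume):
        return volume, False  # layer already present, nothing to do

    tmp_root = os.path.join(store, "tmp")
    os.makedirs(tmp_root, exist_ok=True)
    os.makedirs(os.path.dirname(volume), exist_ok=True)
    staging = tempfile.mkdtemp(prefix=layer_id + "-", dir=tmp_root)

    with tarfile.open(layer_tar) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe archive members.
            tar.extractall(staging, filter="data")
        else:
            tar.extractall(staging)

    # The rename is atomic when both paths are on the same filesystem, so
    # other processes never observe a half-extracted volume.
    os.rename(staging, volume)
    return volume, True
```

Extracting into the tmp directory first means that a failed or interrupted extraction leaves nothing behind in volumes, so the existence check at the top remains a reliable signal that a layer is complete.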
Once all the image layers exist as volumes in the store, the image is created.
A directory matching the requested image ID is created in the store images directory.
A size quota is applied using tardis.
Inside, directories are created for rootfs, diff and workdir.
By default, grootfs will mount the image.
It creates an overlayfs mount of the base layers with diff as the upper layer, workdir as the work dir, and rootfs as the mount point.
If the --without-mount option was passed to create, grootfs does not mount the image, and the overlayfs mount information is instead included in the spec returned to the user.
This was intended for rootless operation, where grootfs runs as a non-root user and does not have permission to create a mount.
Instead, runc would create the mount in the container namespace.
The create command sends a fragment of an OCI spec to stdout on completion. E.g.
{
  "ociVersion": "",
  "process": {
    "user": {
      "uid": 0,
      "gid": 0
    },
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "cwd": ""
  },
  "root": {
    "path": "/tmp/store/unprivileged/images/my-id/rootfs"
  }
}
If the --without-mount option was specified, then mount information will be included:
{
  "ociVersion": "",
  "process": {
    "user": {
      "uid": 0,
      "gid": 0
    },
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "cwd": ""
  },
  "root": {
    "path": "/tmp/store/unprivileged/images/layers/rootfs"
  },
  "mounts": [
    {
      "destination": "/",
      "type": "overlay",
      "source": "overlay",
      "options": [
        "lowerdir=/tmp/store/unprivileged/l/vWEirZk7g1401963,upperdir=/tmp/store/unprivileged/images/layers/diff,workdir=/tmp/store/unprivileged/images/layers/workdir"
      ]
    }
  ]
}
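A consumer of this output could read it along the following lines. This is a minimal Python sketch, not Garden's actual parsing code: summarize_spec is a hypothetical helper, the embedded document is a trimmed copy of the example output, and the option parsing assumes that paths contain no commas.

```python
import json


def summarize_spec(spec_json):
    """Return (rootfs_path, overlay_options) from a create output document.

    overlay_options is a dict of the overlay mount options, or None when the
    spec carries no overlay mount (GrootFS already mounted the rootfs).
    """
    spec = json.loads(spec_json)
    rootfs = spec["root"]["path"]
    for mount in spec.get("mounts", []):
        if mount.get("type") == "overlay":
            options = {}
            for entry in mount.get("options", []):
                for pair in entry.split(","):
                    key, _, value = pair.partition("=")
                    options[key] = value
            return rootfs, options
    return rootfs, None


example = """
{
  "root": {"path": "/tmp/store/unprivileged/images/layers/rootfs"},
  "mounts": [
    {"destination": "/", "type": "overlay", "source": "overlay",
     "options": ["lowerdir=/tmp/store/unprivileged/l/vWEirZk7g1401963,upperdir=/tmp/store/unprivileged/images/layers/diff,workdir=/tmp/store/unprivileged/images/layers/workdir"]}
  ]
}
"""
rootfs, options = summarize_spec(example)
print(rootfs)
print(options["lowerdir"].split(":"))
```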
The delete command removes an image. It unmounts the image's rootfs, then recursively deletes the contents of the image ID directory in the store, also removing any quota files. Base volumes are left intact. They might be garbage collected at a later time if unused by other images.
The clean command takes a threshold in bytes. It calculates the allowed quotas in all the existing images, and the size of all the volumes present. If the sum of these is greater than the threshold, then the garbage collector is invoked.
The garbage collector first marks all volumes currently unused in any images as eligible for garbage collection. Then it iterates through these volumes, removing them.
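The threshold check and the mark phase can be sketched together in Python. The data structures here are hypothetical stand-ins for the store metadata, and volumes_to_collect only reports which volumes would be removed without deleting anything.

```python
def volumes_to_collect(images, volume_sizes, threshold_bytes):
    """Return the sorted unused volumes to delete, or [] if under threshold.

    images maps image ID -> {"quota": bytes, "volumes": [volume IDs]}.
    volume_sizes maps volume ID -> size in bytes.
    """
    total = sum(img["quota"] for img in images.values())
    total += sum(volume_sizes.values())
    if total <= threshold_bytes:
        return []  # below the threshold, the collector is not invoked

    # Mark: every volume referenced by at least one image is in use.
    in_use = set()
    for img in images.values():
        in_use.update(img["volumes"])
    # Sweep candidates: everything else.
    return sorted(v for v in volume_sizes if v not in in_use)


# Hypothetical store contents.
images = {"image-a": {"quota": 100, "volumes": ["v1", "v2"]}}
volume_sizes = {"v1": 50, "v2": 50, "v3": 200}

print(volumes_to_collect(images, volume_sizes, threshold_bytes=300))
print(volumes_to_collect(images, volume_sizes, threshold_bytes=500))
```

In this example the quotas and volume sizes add up to 400 bytes, so only the lower threshold triggers collection, and only v3 is eligible because no image references it.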
The garbage collector may also be invoked after a create command, depending on the --with-clean and --threshold-bytes options.
In a Garden deployment, the thresholder is used to calculate the threshold value to use.