container: support splitting inputs #69

cgwalters · 2021-08-27T14:24:47Z

Optimizing ostree-native containers into layers

This project started by creating a simple and straightforward mechanism to bidirectionally map between an ostree commit object and an OCI/Docker container - we serialize the whole commit/image into a single tarball, which is then wrapped as a container layer.

In other words, we generate a base image with one big layer. This was simple and straightforward to implement, relatively speaking. (Note that derived containers naturally do have distinct layers. But this issue is about the "base image".)

The core problem is that any change to the base image (e.g. kernel security update) requires each client to redownload the entire base image which could be 1GB or more. See e.g. this spreadsheet for Fedora CoreOS content size.

Prior art and related projects

See https://grahamc.com/blog/nix-and-layered-docker-images for a Nix based build system that knows how to intelligently generate a container image with multiple layers.

There is also work in the baseline container ecosystem for optimizing transport of containers.

Copying in bits of this comment:

https://www.balena.io/docs/learn/deploy/delta/ is a custom fork of docker with custom deltas that has been around a while
https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md is a bit more recent
Support layer deltas containers/image#902 is bsdiff for layers
https://www.scrivano.org/posts/2021-10-26-compose-fs/ and zstd chunked
https://github.com/linuxkit/linuxkit changes the default OS to be a pile of containers (very interesting direction)

estargz has some design points worth looking at, but largely speaking I think few people want to have their base operating system lazily fetched. (Although, there is clearly an intersection with estargz's per-file digests and ostree)

Proposed initial solution: support for splitting layers

I think right now, we cannot rely on any container-level deltas. As a baseline, we should support splitting ostree commits into distinct layers, because that works with every container runtime and registry today. Further, splitting layers optimizes both pulls and pushes.

In addition, split layers means that any other "per layer" deltas actually just work more efficiently and better, whether that's zstd-chunked or layer-bsdiff.

Aside: today with Dockerfile, one cannot easily do this. Discussion: containers/podman#12605

Implementing split layers

There is an initial PR for this here: #123

Conceptually there are two parts:

Generating split layers

There are multiple competing things we need to optimize. A first theoretically obvious thing to do is have a single layer per e.g. RPM/deb package. But we can't do this because there are simply too many packages and container limits are around 100. See the above nix-related blog.

And even if we wanted to go close to a theroetical limit of 100, because we want to support people generating derived images, we must reserve space for that. At least 30 layers. Conservatively, let's say we shouldn't go beyond 50 layers.

In the general case, because ostree is a low-level library, we need to support higher level software telling us how to split things - or at a minimum, which files are likely to change together.

Principles/constraints:

Content which is large and/or frequently changing should be in a layer
Conversely, small and infrequently changing content could be grouped together
We shouldn't get more than N layers, for N=50 or so

The "over time" problem

We should also be robust to changes in the content set itself. For example, if a component which was previously small grew large, we want to avoid that change "cascading" into affecting how we change multiple layers. Another way to say this is that any major change in how we chunk things also implies clients will need to redownload many layers.

Initial proposal

In the initial code, ostree has some hardcoded knowledge of things like /usr/lib/firmware from linux-firmware, which is by far the largest single chunk of the OS. It is large, and while it doesn't change too often, it clearly makes sense to chunk by itself.

The current auto-chunking logic also handles the kernel/initramfs in /usr/lib/modules.

Past that, we have code which then tries to cherry pick large files (as a percentage of the total remaining size) which captures things like large statically linked Go binaries.

Supporting injected mappings

What would clearly aid this is support for having e.g. rpm-ostree inject the mapping between file paths and RPM name - or really most generically, assign a stable identifier to a set of file paths. Also supporting some sort of indication of relative change frequency would be useful.

Something like:

struct ContentSource {
    change_frequency: u32, // Percentage between 1-100 or so?
}
// Maps from e.g. "bash" or "kernel" to metadata about that content
type ContentSet = BTreeMap<String, ContentSource>;
// Maps from an ostree content object digest to e.g. "bash"
type ContentIdentification = BTreeMap<Checksum, &str>;

Some interesting subtleties here around e.g. "what happens if a file from different packages/components was deduplicated by ostree to one object?". Probably it needs to be promoted to the source with the greatest change frequency.

Current status

See https://quay.io/repository/cgwalters/fcos-chunked for an image that was generated this way. Specifically, try viewing the manifest and you can see the separate chunks - for example, there's a ~220MB chunk just for the kernel. If a kernel security update happens, you just download that chunk and the rpm database chunk.

The text was updated successfully, but these errors were encountered:

cgwalters · 2021-09-07T21:01:24Z

There's obviously nothing ostree specific about this really. It should be possible to write a tool which accepts an arbitrary OCI/Docker image and reworks the layering such that e.g. large binaries are moved into separate content-addressed layers - right?

Maybe it's a bit ostree specific in that it would be most effective when applied to a "fat" operating system that has lots of embedded binaries from multiple sources.

In contrast a simple standard multi-stage build that links to a base image is already fairly optimal. It's doubtful that it'd be valuable to split up most standard distribution base images. (But that'd be interesting to verify)

mheon · 2021-09-07T21:10:10Z

There are issues with splitting into too many layers. I recall hearing about performance suffering as layer count increased, but I can't recall details and that's likely solvable. If it does increase too much, however, you run into some fairly fundamental limits on registries and in many image-handling libraries. Some of our source image work produced images with >256 layers, and those were effectively unusable - the kernel parameter for the Overlay mount was simply too large to pass to the kernel, for example.

cgwalters · 2021-09-07T21:24:17Z

Thinking about this more, a notable wrinkle here is things like the dpkg/rpm database, which today are single-file databases. That means that when one writes such a tool to split out e.g. the openssl or glibc libraries into their own content addressed layer, the final image layer will still need to contain the updated package metadata.

This is another way of noting that single file package databases defeat the ability to do true dynamic linking - i.e. when just glibc changes we still need to regenerate the top layer to update the database, even if we had container-level dynamic linking. I think NixOS doesn't have such a single file database? But I haven't been able to figure it out.

mkenigs · 2021-10-04T19:12:53Z

NixOS does have a database: https://nixos.org/guides/nix-pills/install-on-your-running-system.html#idm140737320795232

I've discussed trying to work around the layers issue with @giuseppe in order to allow sharing packages between host and guest as well as between containers.

My best current shot at a solution is to have packages installed somewhere central (like /nix/store), and then all the necessary packages could be reflinked into a single container layer's /nix/store. To maintain compatibility, everything in the container's /nix/store could then get hardlinked to the usual paths on the root filesystem. To see how much space that would use, I tried reflinking and hardlinking on XFS, and it only takes 3% of the blocks of the original files to create the reflinked/hardlinked tree. Did that here: https://github.com/mkenigs/reflinks

cgwalters · 2021-10-04T19:44:56Z

I've discussed trying to work around the layers issue

Can you elaborate on "layers issue"?

with @giuseppe in order to allow sharing packages between host and guest as well as between containers.

Let's avoid the use of the term "guest" in the context of containers as it's kind of owned by virtualization. The term "host" is still quite relevant to both, but it's also IMO important that containers really are just processes from the perspective of the host.

Unless here you do mean "host" and "guest" in the virtualization sense?

Now I think let's be a bit more precise here and when you say "containers" you really mean "container image layers", correct? (e.g. "image layers" for short).

My best current shot at a solution is to have packages installed somewhere central (like /nix/store), and then all the necessary packages could be reflinked into a single container layer's /nix/store.

This seems to be predicated on the idea of a single "package" system, whether that's nix or RPM or dpkg or whatever inside the containers. I don't think that's going to realistically happen.

mkenigs · 2021-10-04T20:18:28Z

Can you elaborate on "layers issue"?

Sorry the issue of having too many container layers leading to decreased performance.

Unless here you do mean "host" and "guest" in the virtualization sense?
Now I think let's be a bit more precise here and when you say "containers" you really mean "container image layers", correct? (e.g. "image layers" for short).

Yep hopefully this is more precise: I mean sharing between host and container image layers and between different container image layers (if some sort of linking within the layer is used) or between containers (if the entire layer is shared)

This seems to be predicated on the idea of a single "package" system, whether that's nix or RPM or dpkg or whatever inside the containers. I don't think that's going to realistically happen.

I don't think it has to - just think of it as subdividing OCI layers into noncolliding chunks with checksums. For our purposes, it would probably be easiest if each chunk was a single RPM, and that would make sharing with rpm-based hosts easier. But the single packaging system is already provided by containerization - we already have /var/lib/containers/storage, and we already store unique layers based on checksums there. This would just be an optimization marking part of a layer as a chunk that can probably be deduplicated and shared with the host. I don't fully understand it, but I think @giuseppe's current approach for zstd:chunked requires a lot more searching for files to deduplicate (https://github.com/containers/storage/blob/48e2ca0ba67b5e7b8c04542690ac4e5d6ec73809/pkg/chunked/storage_linux.go#L310). Whereas this would just involve checking if /var/lib/containers/storage/some_chunk_checksum exists

cgwalters · 2021-10-14T20:58:39Z

I am still very wary of intersecting host and container content and updates. That's not opposition exactly, but I think we need to do it very carefully and thoughtfully. I touched on this in containers/image#1084 (comment)

But the single packaging system is already provided by containerization - we already have /var/lib/containers/storage, and we already store unique layers based on checksums there.

(I wouldn't quite call this a "packaging system"; when one says "package" I think more of apt/yum)

Ultimately I think a key question here is:

Does the container stack attempt to dynamically (client side) perform deduplication, or do we try to create a "postprocessing tool" that transforms an image as I'm suggesting? Or, a variant of this is we do something like teach yum to write the pristine input RPM files to /usr/containers/layers/$digest, and then podman build --auto-layers looks at the final image generated, and then re-works the layering to have one blob for each entry in /usr/containers/layers/ or something?

edit1:
Actually I think operating under the assumption there's no really good use case for having higher levels fully remove lower levels, podman build --auto-layers could actually do this incrementally as a build progresses too.

edit2:
Actually we can't sanely make a tarball inside the container without duplicating all the space (barring reflink support). So it'd probably need to be /usr/containers/layers/<fstree>.

mkenigs · 2021-10-15T01:09:32Z

I am still very wary of intersecting host and container content and updates. That's not opposition exactly, but I think we need to do it very carefully and thoughtfully. I touched on this in containers/image#1084 (comment)

That makes a lot of sense. Would you say there's already a hole in the firewall between containers because of shared layers? Would doing this make that hole worse, or is it the same hole?

Ultimately I think a key question here is:

...

Agreed

mkenigs · 2021-10-15T01:15:18Z

Returning to the issue of too many layers decreasing performance:
@mtrmac I can't remember for sure if it was you who brought up performance overhead of too many layers when we discussed it this summer, but am I correct that you don't think it's solvable? Just checking because of @mheon saying that it's likely solvable

mtrmac · 2021-10-15T17:25:34Z

I’m afraid I really can’t remember if I brought up something like that or if so, what it was. I’m not aware of layer count scaling horribly, at least.

Do note I read @mheon ’s comment above as performance being solvable up to the ~128-layer limit for a container. Right now, I wouldn’t recommend any design that can get anywhere close to that value, without having a way forward if the limit is hit.

(That 128-layer limit is also why I don’t think it’s too likely we would get horrible performance: there would have to be an exponential algorithm or something like that for O(f(N)) to be unacceptable with N≤128; at these scales, I’d be much more worried about scaling to large file counts, in hundreds of thousands or whatever.)

mkenigs · 2021-10-15T18:52:50Z

@mtrmac thanks!

Do note I read @mheon ’s comment above as performance being solvable up to the ~128-layer limit for a container. Right now, I wouldn’t recommend any design that can get anywhere close to that value, without having a way forward if the limit is hit.

The article @cgwalters linked discusses how to combine layers once that limit is hit

cgwalters · 2021-10-22T22:01:17Z

Some initial code in #123

cgwalters · 2021-10-25T13:46:49Z

Some of our source image work produced images with >256 layers, and those were effectively unusable - the kernel parameter for the Overlay mount was simply too large to pass to the kernel, for example.

One thing I'd note here: I think we should be able to separate logical layers from physical layers.
Not every layer needs to be a derivation source, i.e. a source for overlayfs.

I think it significantly simplifies things if we remove support for whiteouts etc. in these layers. IOW, the layers are "pure union" sources. That means that given layers L1, L2, L3 that are known to be union-able and also not a source, instead of using overlayfs at runtime, the layers are physically re-unioned by hardlink/reflinking. (Perhaps a corollary here really is that these union-only layers should also not care about their hardlink count)

cgwalters · 2021-10-26T16:17:07Z

I wanted to comment on https://www.scrivano.org/posts/2021-10-26-compose-fs/ and https://github.com/giuseppe/composefs briefly.

Broadly speaking, I think we're going to get a lot of mileage out of better splitting up container blobs as is discussed here. For one thing, such an approach benefits container runtimes today without any kernel/etc changes.

(I still have a TODO item to more deeply investigate buildkit because I think they may already do this)

One major downside of ostree (and this proposal) is that garbage collection is much more expensive. I think what we probably want is to at least only do deduplication inside related images to start. It seems very unlikely to me that the benefits of e.g. sharing files across a rhel8 image and a fedora34 image are worth the overhead. (But, it'd be interesting to be proven wrong)

mkenigs · 2021-10-29T14:57:27Z

The only issue that post identifies with overlayfs is that it can't deduplicate the same file present in multiple layers, right? Would there be any advantage to composefs compared to overlayfs with files split into separate layers? Would it scale better for large numbers of layers?

cgwalters · 2021-10-29T18:18:31Z

I once saw an overlayfs maintainer describe it as a "just in time" cp -a. So I wouldn't say this is a limitation of overlayfs - it's how the container stack is doing things (using overlayfs). Which in turn I think is somewhat driven by the compatibility constraint of having st_nlink inside a container image apparently match only files inside that container image as Guiseppe's post touches on.

That said, I suspect the vast majority of images and use cases would be completely fine seeing a higher st_nlink for e.g. glibc. I mean, ostree has been doing this for a long time now. If it's just tools like e.g. rpm -V saying "hmm glibc has a higher st_nlink than I expected", I think we should just fix those to accept it (optionally detecting that they're inside a container).

mkenigs · 2021-10-29T18:23:39Z

Isn't another st_nlink concern security? e.g. being able to detect what version of glibc the host is running.

cgwalters · 2021-11-01T16:52:21Z

Isn't another st_nlink concern security? e.g. being able to detect what version of glibc the host is running.

Well...containers (by default) know which version of the kernel the host is running, which I would say is far more security sensitive. But OTOH, generalizing this into leaking any arbitrary file (executable/shared library) does seems like a potentially valid concern.

mkenigs · 2021-12-07T21:25:57Z

Not sure if comments about sharing layers between the host and container image layers belong on a separate issue, but if that was possible would experimental-image-proxy (containers/skopeo#1476) be unnecessary? Since we would want containers/image involved for anything we pull for the host

Wondering since we'll be using experimental-image-proxy for #121

cgwalters · 2021-12-07T22:06:23Z

Not sure if comments about sharing layers between the host and container image layers belong on a separate issue, but if that was possible

I think that is probably a separate issue. It's certainly related to this, but it would be a rather profound change from the current system architecture.

One thing I do want to make more erogonomic though is pulling from to containers-storage: - that relates to #153 And we should clearly support writing via the proxy too (no issue tracks that yet).

Prep for ostreedev#69 where we'll split up the input ostree commit into content-addressed blobs. We want to inject something useful into the the `history` in the config git that describes each chunk, so add support for that into our OCI writer. Change the default description for the (currently single) layer to include the commit subject, if present; otherwise the commit hash. The description of the layer shouldn't change as this tool changes, just as the input changes. (Side note; today rpm-ostree isn't adding a subject description, but hey, maybe someone else is)

This came out of some prep work on ostreedev#69 Right now it's confusing, the layering code ended up re-implementing the "fetch and unpack tarball" logic from the unencapsulation path unnecessarily. I think it's much clearer if the layering path just calls down into the unencapsulation path first. Among other things this will also ensure we're honoring the image verification string.

Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.

In ostree we aim to provide generic mechanisms that can be consumed by any package or build system. Hence we often use the term "component" instead of "package". This new `objectsource` module is an abstraction over basic metadata for a component/package, currently name, identifier, and last change time. This will be used for splitting up a single ostree commit back into "chunks" or container image layers, grouping objects that come from the same component together. ostreedev#69

Closes: ostreedev#69 This is initial basic support for splitting files (objects) from a commit into separate container image layers, and reassembling those layers into a commit on the client. We retain our present logic around e.g. GPG signature verification. There's a new `chunking.rs` file which has logic to automatically factor out things like the kernel/initramfs and large files. In order to fetch these images client side, we now heavily intermix/cross the previous code for fetching non-ostree layers.

cgwalters mentioned this issue Oct 22, 2021

Support for split layers #123

Merged

This was referenced Nov 16, 2021

Add tooling to generate delta containers #40

Closed

implement webserver #34

Closed

This was referenced Dec 6, 2021

Support layer deltas containers/image#902

Open

half-baked idea: Dockerfile support for explicit layer generation containers/podman#12530

Closed

cgwalters mentioned this issue Dec 10, 2021

container: Add history struct to oci writing, tweak history description #183

Merged

cgwalters mentioned this issue Dec 13, 2021

container: Make layering more directly re-use unencapsulation #184

Merged

cgwalters mentioned this issue Mar 7, 2022

Add new objectsource module #260

Merged

cgwalters closed this as completed in #123 Mar 24, 2022

cgwalters mentioned this issue May 11, 2022

Rework build process to generate rhel-coreos-base distinct from ocp-rhel-coreos openshift/os#799

Closed

This was referenced Jun 8, 2022

Tracker for switching to chunked ostree container images coreos/fedora-coreos-tracker#1226

Closed

image: Switch to oci-chunked format openshift/os#808

Merged

cgwalters mentioned this issue Jul 19, 2022

Boot partition can easily run out of space on upgrade coreos/fedora-coreos-tracker#1247

Closed

owtaylor mentioned this issue Nov 16, 2022

[Feature request]: Consider impact of https://fedoraproject.org/wiki/Changes/OstreeNativeContainerStable flatpak/flatpak#5182

Open

2 tasks

This was referenced Aug 28, 2024

Scoping in more things? microsoft/rpmoci#153

Open

feature request: image merge/"rechunking" containers/buildah#5717

Open

cgwalters mentioned this issue Sep 30, 2024

idea: "upper layer" (erofs) inside of initramfs containers/composefs#332

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

container: support splitting inputs #69

container: support splitting inputs #69

cgwalters commented Aug 27, 2021 •

edited

Loading

cgwalters commented Sep 7, 2021

mheon commented Sep 7, 2021

cgwalters commented Sep 7, 2021

mkenigs commented Oct 4, 2021

cgwalters commented Oct 4, 2021

mkenigs commented Oct 4, 2021

cgwalters commented Oct 14, 2021 •

edited

Loading

mkenigs commented Oct 15, 2021

mkenigs commented Oct 15, 2021

mtrmac commented Oct 15, 2021

mkenigs commented Oct 15, 2021

cgwalters commented Oct 22, 2021

cgwalters commented Oct 25, 2021

cgwalters commented Oct 26, 2021

mkenigs commented Oct 29, 2021

cgwalters commented Oct 29, 2021

mkenigs commented Oct 29, 2021

cgwalters commented Nov 1, 2021

mkenigs commented Dec 7, 2021

cgwalters commented Dec 7, 2021

container: support splitting inputs #69

container: support splitting inputs #69

Comments

cgwalters commented Aug 27, 2021 • edited Loading

Optimizing ostree-native containers into layers

Prior art and related projects

Proposed initial solution: support for splitting layers

Implementing split layers

Generating split layers

The "over time" problem

Initial proposal

Supporting injected mappings

Current status

cgwalters commented Sep 7, 2021

mheon commented Sep 7, 2021

cgwalters commented Sep 7, 2021

mkenigs commented Oct 4, 2021

cgwalters commented Oct 4, 2021

mkenigs commented Oct 4, 2021

cgwalters commented Oct 14, 2021 • edited Loading

mkenigs commented Oct 15, 2021

mkenigs commented Oct 15, 2021

mtrmac commented Oct 15, 2021

mkenigs commented Oct 15, 2021

cgwalters commented Oct 22, 2021

cgwalters commented Oct 25, 2021

cgwalters commented Oct 26, 2021

mkenigs commented Oct 29, 2021

cgwalters commented Oct 29, 2021

mkenigs commented Oct 29, 2021

cgwalters commented Nov 1, 2021

mkenigs commented Dec 7, 2021

cgwalters commented Dec 7, 2021

cgwalters commented Aug 27, 2021 •

edited

Loading

cgwalters commented Oct 14, 2021 •

edited

Loading