
Additional Layer Store no longer works for layers without TOC Digest annotations #2188

Open
GerrySeidman opened this issue Dec 3, 2024 · 6 comments


@GerrySeidman

Background: AuriStor's Additional Layer Store (ALS) FUSE driver simply mounts layer 'diff' directories directly from the AuriStorFS distributed file system (based on image names and layer digests). The AuriStor ALS FUSE driver currently works with unmodified OCI images; it does not require or use any TOC digest or index file (as stargz does).

The AuriStorFS ALS FUSE driver currently serves layer content keyed by the Layer Digest:

<ALS Root>/<base64(imageName)>/<LayerDigest>/{info,diff,use,blob}
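For reference, the mapping from (image name, layer digest) to that anchor path can be sketched as follows; the exact base64 variant is an assumption (StdEncoding here), and the function name is illustrative:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"path/filepath"
)

// alsAnchorPath builds the anchor path described above:
// <ALS Root>/<base64(imageName)>/<LayerDigest>. The files served
// under the anchor are info, diff, use, and blob.
func alsAnchorPath(root, imageName, layerDigest string) string {
	ref := base64.StdEncoding.EncodeToString([]byte(imageName))
	return filepath.Join(root, ref, layerDigest)
}

func main() {
	fmt.Println(alsAnchorPath("/var/lib/als",
		"registry.example.com/myimage", "sha256:abcd123"))
}
```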

First Occurrence: As of containers/image 5.31 (Podman 5.1), the ALS semantics changed such that there is no longer an ALS code path for images without a TOC Digest annotation (containerd.io/snapshot/stargz/toc.tocJSONDigestAnnotation or containers/storage/pkg/chunked/internal/compression.ManifestChecksumKey, i.e. "containerd.io/snapshot/stargz/toc.digest" and "io.github.containers.zstd-chunked.manifest-checksum" respectively).

In containers/image, storage/storage_dest.go's tryReusingBlobAsPending() now requires options.TOCDigest != "" in order to exercise/leverage ALS in its code path:

        containers/image v5.30: if options.SrcRef != nil
        containers/image v5.31: if options.SrcRef != nil && options.TOCDigest != "" && options.LayerIndex != nil

Other changes: The ALS semantics now seem to be that, for images with TOC annotations, ALS expects the FUSE driver to serve the layer keyed by the TOC Digest:

<ALS Root>/<base64(imageName)>/<TOCDigest>/{info,diff,use,blob}

Note: I did not participate in any of the design discussions around these changes to ALS, but it appears that ALS is being narrowed toward support for zstd:chunked, composefs, and fs-verity.

Issue: If the intention is to continue allowing ALS plugins (such as the AuriStorFS ALS) to support unmodified OCI images (i.e. without TOC annotations), then this is a bug that was introduced in image 5.31 and storage 1.54.

Request: Some clarity on the intended ALS paths and path semantics that an ALS FUSE driver should serve going forward would be appreciated. I could then modify the AuriStorFS ALS FUSE driver to serve layers appropriately. Thanks.

Note: While our current AuriStorFS ALS implementation uses a FUSE driver, our plan is to ultimately provide ALS support natively from the AuriStorFS kernel module (along with adding fs-verity support to AuriStorFS).

@mtrmac
Collaborator

mtrmac commented Dec 4, 2024

Cc: @ktock, this is from #1924.
Cc: @giuseppe

“Truly lazy” pulling (only reading files on-demand) requires basing layer identities on the TOC identity to authenticate the file contents.
Is AuriStorFS reading all of the files to validate the uncompressed digests?

FWIW c/image needs to know which of the digests is being used by the backend driver, see e.g. https://github.com/containers/image/blob/5263462aa97ae23d3a620b4dcdbbc557665a42c8/storage/storage_dest.go#L90-L110 . I suppose that would significantly affect the design of the filesystem interface.

@GerrySeidman
Author

FWIW c/image needs to know which of the digests is being used by the backend driver, see e.g. https://github.com/containers/image/blob/5263462aa97ae23d3a620b4dcdbbc557665a42c8/storage/storage_dest.go#L90-L110 . I suppose that would significantly affect the design of the filesystem interface.

Not necessarily. See the TL;DR below about the underlying AuriStor distributed file system and how AuriStorFS "Layer Volumes" are generated.

AuriStor's LVGS could instead

  • Create AuriStorFS volumes with names corresponding to the TOCDigest (rather than the Layer Digest)
    • This would not be backward compatible with Podman older than 5.1 (I'm not sure what version of Podman my customer is using). I could always create TWO volumes, one each named for the Layer Digest and the TOC Digest; since AuriStorFS supports cross-volume links, one of them would be very small.
  • It would also be necessary to add "toc-digest": [TOCDigest] to [Anchor Path]/info

However, there is still the question of whether it should be possible for an ALS to support images with layers that do not have a TOC (i.e. not eStargz or zstd:chunked).

TL;DR background on the AuriStor distributed file system, AuriStorFS: In AuriStorFS, the units of policy/management are named "AFS Volumes", each corresponding to a rooted directory of content (which can be replicated for high availability, for example). Volumes are scoped to a (typically organizational) 'cell', so a volume named 'foo' in the cell 'myorg.com' can be universally addressed via the path /afs/.@mount/myorg.com/foo, which is handled by the AuriStorFS kernel module. This can leverage AuriStorFS DNS service records for myorg.com.

TL;DR background on AuriStorFS ALS:

  • The AuriStor Layer Volume Generation Service (LVGS) consumes images by reference name and creates read-only AuriStorFS Volumes with names and content corresponding to each of its layers. For example, a layer with digest sha256:abcd123 would correspond to an AuriStorFS volume lv_abcd123, accessed via /afs/.@mount/myorg.com/lv_abcd123. The contents of this volume include AuriStorFS-specific metadata files, but also a 'diff' directory with the expanded layer content and an 'info' file as expected by the container runtime for ALS. Note that the AuriStor ALS does not require any special modification to the container image and can support any compression format expressed in the image manifest file.
  • The AuriStor ALS FUSE driver simply interprets the ALS anchor path [mountpoint]/[base64(image name)]/[Layer Digest] and provides corresponding symlinks (below), but ALS capability may be built into the AuriStorFS file system kernel module in the future:
    [Anchor Path]/info ==> /afs/.@mount/myorg.com/lv_abcd123/info 
    [Anchor Path]/diff ==> /afs/.@mount/myorg.com/lv_abcd123/diff

@GerrySeidman
Author

Is AuriStorFS reading all of the files to validate the uncompressed digests?

There are two places where the validation could occur:

  • During Layer Volume generation, i.e. once, when populating the read-only AuriStorFS "Layer Volume" in the AuriStorFS distributed file system (see the TL;DR in my earlier comment)
    • Layer Volume generation is triggered via either an API call or a container registry webhook notification for images pushed into registries.
    • File validation can easily be done at Layer Volume creation time.
  • During container runtime
    • Do you mean leveraging fs-verity, or explicit validation by the ALS driver (stargz/zstd:chunked)?
    • Note that the AuriStor ALS FUSE driver just symlinks to the diff content in a read-only AFS Volume and is not otherwise involved.
      • However, implementing fs-verity capabilities in AuriStorFS is on our current roadmap.

@mtrmac
Collaborator

mtrmac commented Dec 5, 2024

Ah, thanks for the explanation. To summarize, the layers are not (necessarily) pulled on-demand at all, the layer is potentially populated in advance, making the TOC not essential to the backend; ordinary tar layers can be supported.


I’m not at all authoritative (and probably not even well-informed) in this space. With that disclaimer, it seems possible to me to support both, by querying the ALS filesystem both by the compressed digest (as in the old version), and by the TOC digest (as in the new version); the backend could then support one or the other, depending on how it authenticates the layer.

This would probably have to be another change to the ALS filesystem conventions, so that the backend can tell whether the value is a layer digest or a TOC digest (they can be ambiguous, at least to not very strict readers). That's not ideal, but probably unavoidable: just reverting the change is not really an option, and would itself also be a change to the ALS filesystem conventions.

That’s the limit of my expertise, I’ll defer to @ktock and @giuseppe for anything more detailed, and figuring out coordination / migration for other backends like stargz-snapshotter.

@GerrySeidman
Author

Ah, thanks for the explanation. To summarize, the layers are not (necessarily) pulled on-demand at all

That all depends on your definition of "on-demand". The "demand" in the AuriStorFS case is triggered by regular OS open/read/write/etc. and is handled for the AFS distributed file system at the kernel level (perhaps with a bit of optimistic read-ahead).

As an FYI: AuriStorFS manages its own pre-allocated cache on the host machine. The AFS cache has some nice features, including surviving reboots and cache consistency via File Server callbacks if/when files have changed.

Since AuriStorFS maintains its own cache, this also raises the question of what the AuriStor ALS should report as the uncompressed size (in .../info). Arguably, if the uncompressed size is only used for things like helping OpenShift make image garbage-collection decisions, then the AuriStor ALS should always report the uncompressed size as ZERO.

With that disclaimer, it seems possible to me to support both, by querying the ALS filesystem both by the compressed digest (as in the old version), and by the TOC digest (as in the new version); the backend could then support one or the other, depending on how it authenticates the layer.

Yes, as I said earlier, we could support both by creating two target paths (corresponding to the compressed digest and the TOC digest), though it feels a little hacky. Regardless, this goes back to the question of validation. For example, there is always the chance that a malicious actor could poison the container runtime's cache of "read-only" diff layers. My understanding is that composefs+fs-verity would address this but, borrowing your words, "I'm not at all authoritative (and probably not even well-informed)". @cgwalters and Jeffrey Altman (the real AuriStorFS expert) discussed this at Red Hat Summit in the spring.

This would probably have to be another change to the ALS filesystem conventions...

Perhaps it could just be additive rather than a change. For example, ALS capabilities could be added to the additionallayerstores option in storage.conf, something like:

    additionallayerstores = [
        "/home/gerry/.local/share/aca/acaMountpoint:ref layerDigestOrder=toc,compressed",
    ]

with these hanging off of the additionalLayerStore struct and realized by AdditionalLayer interface functionality (along with appropriate logic in storage_dest.go, etc.)

That way the end user can configure whether or not to allow ALS endpoints that serve targets by compressed digest.

If something like this makes sense, I think it could be done surgically on the existing codebase without changing too many things. Of course, further design discussion and feedback would be necessary before proceeding.
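One possible shape for parsing such an entry; note that the layerDigestOrder option, its default, and the entry grammar are all proposals from this discussion (hypothetical, not an existing storage.conf feature):

```go
package main

import (
	"fmt"
	"strings"
)

// parseLayerDigestOrder extracts the proposed layerDigestOrder option
// from an additionallayerstores entry such as
// "/path:ref layerDigestOrder=toc,compressed".
func parseLayerDigestOrder(entry string) []string {
	for _, field := range strings.Fields(entry) {
		if v, ok := strings.CutPrefix(field, "layerDigestOrder="); ok {
			return strings.Split(v, ",")
		}
	}
	// Assumed default: only TOC-digest lookups are allowed.
	return []string{"toc"}
}

func main() {
	order := parseLayerDigestOrder(
		"/home/gerry/.local/share/aca/acaMountpoint:ref layerDigestOrder=toc,compressed")
	fmt.Println(order) // [toc compressed]
}
```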

Finally, the ability to support 'legacy' ALS is important because our customer has a huge number of containers stored in their container registries, none of which were built with `podman push --compression-format=zstd:chunked ...`. It is unrealistic to require them to rebuild or pull/re-push their existing images to leverage ALS.

@mtrmac
Collaborator

mtrmac commented Dec 9, 2024

Since AuriStorFS maintains its own cache, this also begs the question of what the AuriStor ALS should report as the uncompressed size (in .../info)? Arguably, if the uncompressed size is only used for things like helping OpenShift make image garbage collection decisions

(No, it is used to create a ~valid manifest for pushes from c/storage, and it must be exactly the size of the blob returned by /diff, which in turn must match the diff-digest field. See storage/layers.go, lines 138 to 145 at ebc19d3:

    // UncompressedSize is the length of the blob that was last passed to
    // ApplyDiff() or create(), after we decompressed it.
    //
    // - If UncompressedDigest is set, this must be set to a valid value.
    // - Otherwise, if TOCDigest is set, this is either valid or -1.
    // - If neither of this digests is set, this should be treated as if it were
    //   an uninitialized value.
    UncompressedSize int64 `json:"diff-size,omitempty"`

)

For example there is always the chance that a malicious actor could poison the container runtime cache of "read-only" diff layers.

That rather depends on who the “malicious actor” is.

It is the exclusive responsibility of the ALS backend to ensure that the layer returned by an ALS digest query (whether TOC digest or uncompressed digest) precisely matches the requested digest; c/image+c/storage can’t do that itself without triggering an immediate full read of the layer, which would rather defeat the purpose (at least for TOC-based layers which support on-demand pulls of individual files).

The user writing to c/storage, or configuring storage.conf, is trusted by definition, and users attacking themselves is not a privilege escalation.

If the “owner” of the AFS filesystem is some other person, the trust relationship between the user deciding to use that AFS filesystem and the owner of that AFS filesystem is invisible to c/image+c/storage, and up to the ALS backend to define. (As one possibility, the user choosing to use the ALS might be, by that choice, delegating trust to the AFS filesystem owner, and then the AFS filesystem owner would be responsible for running, and not interfering with, tooling that manages the layer pulls and verifies the digests. But the tooling must be verifying the digests.)

My understanding is that composeFS+fsVerity would address, but borrowing your words "I’m not at all authoritative (and probably not even well-informed)"

If the goal is to protect against the AFS filesystem’s owner, that might be possible in the future, if we were verifying image signatures per the local system’s policy, and those image signatures covered the composeFS/verity data. And we would need some way to integrate composefs with the ALS API. Most of this currently does not exist.


Finally, the ability to support 'legacy' ALS is important because our customer has a huge number of containers stored in their container registries, none of which were built with `podman push --compression-format=zstd:chunked ...`. It is unrealistic to require them to rebuild or pull/re-push their existing images to leverage ALS.

I think supporting the in-production customers, and managing the config file / API transition (if any, and possibly both the one that has already happened and one that we might need to do to fix things) is an important conversation, but probably for a different, product-specific, channel.
