Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: add platform support RFC #1065

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
111 changes: 111 additions & 0 deletions rfc/008-platform-support.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
# RFC 008: Platform Support

## Objective

Implement support for diverse platforms in Contrast, ensuring that:

1. Users can mix-and-match features throughout all platforms supported by Contrast
2. Code with a platform dependency can be deduplicated
3. New platforms and platform features can be added with low complexity (that's `O(1)`, at best)

## Problem statement

Contrast currently distinguishes between platforms based on their distinct features.
Taking the `AKS-CLH-SNP` platform as an example, this serialized platform string conveys
the following information:

- Kubernetes engine: `AKS`
- (Implicit: Kata upstream / Microsoft fork)
- Hypervisor: `CLH`
- Hardware platform: `SNP`

The user picks such a platform through its serialized string (for example `AKS-CLH-SNP`), and the
code then implements actions based upon that, such as when writing platform-specific Kata
configuration values in the [node-installer].

With new features being added to Contrast, such as GPU support, more information needs
to be contained in such a platform description.

This comes at the cost of growing complexity of conditional statements
depending on the platform throughout the implementation code, as the
current implementation doesn't distinguish between platform features
(for example which hypervisor is used) but only between full platform combinations.

[node-installer]: https://github.com/edgelesssys/contrast/blob/3f39682ea9a383b2557923e257bd065e461b8ee6/nodeinstaller/internal/constants/constants.go#L48

## Proposal

### Structured Representation

Throughout the implementation code, platforms shouldn't be represented as _individual combinations_,
but rather through a generic, structured type that contains information about their _features_.

So `AKS-CLH-SNP` could become something like this:

```go
type Platform struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional fields that come to mind: Snapshotter, ImageVariant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on ImageVariant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with GPU support or without, debug or prod, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those should be additional fields, ultimately. That's the key point I want to convey in this proposal. It seems I need to make that more clear?

KubernetesDistribution KubernetesDistribution
Hypervisor Hypervisor
HardwarePlatform HardwarePlatform
}

// AKS-CLH-SNP
foo := Platform{
KubernetesDistribution: AKS,
Hypervisor: CloudHypervisor,
HardwarePlatform: SNPMilan,
}
```

This would allow the implementation code (for example the aforementioned switch for [Kata configuration])
to be reduced to only depend on the information bits it requires.

For example, this:

```go
case platforms.MetalQEMUTDX, platforms.K3sQEMUTDX, platforms.RKE2QEMUTDX:
```

Could become this:

```go
case TDX
```

This would also enable easy validation of user error, as the individual types can be checked for
validity at parsing time, making invalid states unrepresentable [^1].
Comment on lines +75 to +76
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it changes much about the proposal at hand, but "make invalid states unrepresentable" is not achieved if I can do

myFancyPlatform := Platform{
    KubernetesDistribution: RKE2,
    Hypervisor: CloudHypervisor,
    HardwarePlatform: Unknown,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the assumption of only validating constructors being used in the program, of course. I do acknowledge that my example is counter-intuitive for that, but it's a simplification, after all.


However, to get to such a structured format, the information needs to be passed to Contrast in another
structured format, such as a configuration file or another canonical serialized format, which should
be discussed in the following section.

[Kata configuration]: https://github.com/edgelesssys/contrast/blob/3f39682ea9a383b2557923e257bd065e461b8ee6/nodeinstaller/internal/constants/constants.go#L48

### Serialization

As platforms are passed throughout multiple (programming) languages, such as the `justfile` and Go code for
developers, but - in the most primitive way - from a user invoking the CLI in a shell to Go code, there must
be a serializable, machine-readable representation of a platform.

This could be a configuration file that's read by the CLI and other dependants.

This would reduce implementation complexity, as Contrast could rely on existing YAML / TOML / JSON / etc. parsing
libraries and no canonical string representation of a platform would need to be implemented.

Additionally, adding platform features would not lead to unreasonably large string representations, which might
be hard to read for humans at some point.

## Considerations

### Node-Installer Images
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do Platforms relate to runtimes? A runtime could be used with more than one image (not easily, but possible), but not with more than one hypervisor or snapshotter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtimes in the sense of Kata runtimes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubernetes runtimes.


As it would be complex and costly to publish node-installer images for all supported platform
combinations, it might be better to ship a singular one-size-fits-all node installer with support
for all platforms. This could also be a foundation for heterogeneous deployments that utilize multiple
platforms (for example GPU- and Non-GPU machines).
Comment on lines +104 to +105
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node-installer could figure out a lot of configuration without user interaction - TEE flavour, containerd paths, maybe even hypervisor. On the other hand, outside the runtime the config is only used to decide which genpolicy to use and to populate reference values. I wonder whether we actually need to expose the entire runtime config to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think large parts of it have to be exposed. It might be possible that you have an instance that has GPUs available but don't want to run GPU-enabled Contrast on it, for example.

Also, reference value generation needs almost all values, if I'm not mistaken?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You picked the one dimension that I did not list as candidate for automation ;)

It's still an interesting question whether / how we want to deal with mixed images. As a user, how do I tell Contrast to use a GPU image for pod A, but a plain image for pod B? I imagine a config file is not granular enough here. If there is only one image per runtime, how do I mix two runtimes with one coordinator? (I think that's a scenario we might want to support eventually, though)

The reference values need the image measurements, the genpolicy choice and the hardware reference values, where the last two are only a question of AKS-CLH vs. rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference values need the image measurements, the genpolicy choice and the hardware reference values, where the last two are only a question of AKS-CLH vs. rest.

For hardware reference values, I think it's also a matter of TDX vs. SNP (Genoa) vs. SNP (Milan) and so on, right?

About mixed images, how the customer decides to do that is TBD still, I think. I don't have the solution at hand for this.

About "one image per runtime": I think we could ship all variants in the node-installer if we can make them reasonably small?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't fill hardware values for bare metal, so it's just image measurements I think.


To ensure a good developer experience, bandwidth used for transferring the node-installer images should
be kept to a minimum, which is contrary to the aforementioned idea, but could be counteracted with
minimizing the individual components in the node-installer, such as the node images.

[^1]: https://geeklaunch.io/blog/make-invalid-states-unrepresentable/
3 changes: 3 additions & 0 deletions tools/vale/styles/config/vocabularies/edgeless/accept.txt
Original file line number Diff line number Diff line change
Expand Up @@ -137,3 +137,6 @@ RTMR
RTMRs
precalculator
initrd
deduplicated
unrepresentable
serializable