KEP-1981: WindowsHostProcessContainers beta major changes #3311

Merged
merged 20 commits into from
Sep 22, 2022
Changes from 3 commits
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-windows/1981.yaml
@@ -3,3 +3,5 @@ alpha:
approver: "@deads2k"
beta:
approver: "@deads2k"
stable:
approver: "@deads2k"
@@ -478,46 +478,44 @@ Because Windows privileged containers will work much differently than Linux priv
#### Resource Limits

- Resource limits (disk, memory, cpu count) will be applied to the job and will be job wide. For example, if a limit of 10 MB is set for the job and the combined memory allocations of every process in the job exceed 10 MB, the limit is reached. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator/runtime is being used.
- Disk resource tracking may work slightly differently for `hostProcess` containers due to how these containers are bootstrapped. Resource usage will be trackable and the differences would be in how resource usage is calculated.
- Note: HostProcess containers will have access to the node's root filesystem. Disk limits and resource usage will only apply to the scratch volume provisioned for each HostProcess container.
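
As a rough illustration of the job-wide accounting described above, the sketch below sums per-process allocations and compares the total with the job limit. The function and variable names are hypothetical and this is not the Windows job object API; it only mirrors the 10 MB example.

```python
MB = 1024 * 1024


def job_limit_reached(process_allocations_bytes, job_limit_bytes):
    """Return True when the combined allocations of all processes exceed the job limit."""
    return sum(process_allocations_bytes) > job_limit_bytes


# Three processes in one job: none exceeds 10 MB on its own, but together they do.
allocations = [4 * MB, 4 * MB, 3 * MB]
print(job_limit_reached(allocations, 10 * MB))
```

The point of the sketch is that the limit is evaluated against the sum for the whole job, not against any single process.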

#### Container Lifecycle

- The container's lifecycle will be managed by the container runtime just like other Windows container types.

#### Container users

- The `hostProcess` container can run as any user that's available on the host or in the domain of the host machine.
Running privileged containers as non SYSTEM/admin accounts will be the primary way operators can restrict access to system resources (files, registry, named pipes, WMI, etc).
More information on Windows resource access can be found at https://docs.microsoft.com/en-us/archive/msdn-magazine/2008/november/access-control-understanding-windows-file-and-registry-permissions.
- Note: Support for local accounts with passwords is being investigated. Support for this scenario will most likely involve retrieving credential from the 'Windows Credential Manager' as shown in the following [proposed hcsshim changes](https://github.com/marosset/hcsshim/commit/cf42f301cf507f98d7137c8be008902306df9609). Any changes here will not require any changes to Kubernetes or pod/deployment manifests.
- By default `hostProcess` containers can run as one of the following system accounts:
- `NT AUTHORITY\SYSTEM`
- `NT AUTHORITY\Local service`
- `NT AUTHORITY\NetworkService`
- Running privileged containers as non SYSTEM/admin accounts will be the primary way operators can restrict access to system resources (files, registry, named pipes, WMI, etc).
To run a `hostProcess` container as a non SYSTEM/admin account, a local user group must first be created on the host. When a new `hostProcess` container is created with the name of a local user group set as the `runAsUserName`, a temporary user account will be created as a member of the specified group for the container to run as.
- More information on Windows resource access can be found at <https://docs.microsoft.com/archive/msdn-magazine/2008/november/access-control-understanding-windows-file-and-registry-permissions>
- Example of configuring non SYSTEM/admin account can be found at <https://github.com/microsoft/hcsshim/pull/1286#issuecomment-1030223306>
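
To make the `runAsUserName` behavior above more concrete, here is a minimal sketch that builds a Pod manifest as a plain Python dict. The pod name, image, and the group name `hpc-localgroup` are hypothetical placeholders; the `windowsOptions.hostProcess`, `windowsOptions.runAsUserName`, and `hostNetwork` fields are the Kubernetes Pod API fields as I understand them, so treat this as an illustration rather than a validated manifest.

```python
import json


def host_process_pod(name, image, run_as_user_name):
    """Build a minimal Pod manifest (as a dict) for a hostProcess container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                "windowsOptions": {
                    "hostProcess": True,
                    # Either a built-in account such as "NT AUTHORITY\\SYSTEM"
                    # or the name of a local user group created on the host.
                    "runAsUserName": run_as_user_name,
                }
            },
            # hostProcess pods are expected to use the host network.
            "hostNetwork": True,
            "nodeSelector": {"kubernetes.io/os": "windows"},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical values: the group must already exist on the node.
pod = host_process_pod("hpc-demo", "example.test/hpc-demo:latest", "hpc-localgroup")
print(json.dumps(pod, indent=2))
```

Passing a built-in account name instead of a group name would select one of the default system accounts listed above.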

#### Container Mounts

- When `hostProcess` containers are started a new Windows volume will be created on the host which will contain the contents of the container image. Containers will have a default working directory that points to this container volume.
Containers will also have full access to the host file-system (unless restricted by file-based ACLs and the run_as_username used to start the container). Processes should use absolute paths when accessing files on the host and relative paths when accessing files brought in via the container image.
- Note: there will be no `chroot` equivalent.
- An environment variable `$CONTAINER_SANDBOX_MOUNT_POINT` will be set to the absolute path where the container volume is mounted for `hostProcess` containers.
- Note: Syntax for referencing environment variables differs depending on what shell you are using. In cmd.exe env vars are surrounded by %'s (ex: `%CONTAINER_SANDBOX_MOUNT_POINT%`) and in powershell env vars are prefixed with $env: (ex: `$env:CONTAINER_SANDBOX_MOUNT_POINT`).
- This environment variable will be set to `c:\c\<containerid>\` (trailing \ included!) for each container.
- This environment variable can be used inside the Pod manifest / command line args for containers. See files in this [pull request](https://github.com/kubernetes-sigs/sig-windows-tools/pull/161/files#diff-b8195f7a2ad8f9ae9ebdd1bde8a0f3756c4508c1d9d9dd99f4a3bfa19fc3b828R135) for examples of using `$CONTAINER_SANDBOX_MOUNT_POINT` inside deployment manifests.
- `$CONTAINER_SANDBOX_MOUNT_POINT` will not be set for non-`hostProcess` containers.
- Volume mounts (including service account tokens) will be supported for privileged containers and will be mounted under the container volume. Programs running inside the container can access volume mounts either by using a relative path or by prefixing `$CONTAINER_SANDBOX_MOUNT_POINT` to their paths (example: use either `.\var\run\secrets\kubernetes.io\serviceaccount\` or `$CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\` to access service account tokens). These relative paths will be based on `Pod.containers.volumeMounts.mountPath`.
- Note: We are prototyping a new approach to how the file system is created for `hostProcess` containers that would present the filesystem in a similar manner to non-hostProcess containers running on Windows (`c:\` (trailing \ included) would be the root instead of `c:\c\<container id>\`).
This would make it so files from volume mounts would be accessible via relative paths (ex: `/foo.exe` instead of needing to specify `$CONTAINER_SANDBOX_MOUNT_POINT/foo.exe`)
HostProcess containers would still have full access to the host file-system and `$CONTAINER_SANDBOX_MOUNT_POINT` would continue to be set so that workloads which already access files from inside volume mounts using this environment variable would continue to work without modification.
https://github.com/microsoft/hcsshim/pull/1107 is tracking this exploratory work.
This functionality will most likely not be ready during Kubernetes v1.23 and any changes made to how volume mounts work would be done before this feature becomes stable.

- Client libraries such as https://pkg.go.dev/k8s.io/client-go/rest#InClusterConfig may be updated to prefix paths with `$CONTAINER_SANDBOX_MOUNT_POINT` if the environment variable is set for Windows so these libraries will work in `hostProcess` containers.
The decision to update client libraries (or not) will be postponed until the above mentioned merged container/OS filesystem investigations are concluded. Until then various workloads running in `hostProcess` containers that need to communicate with the cluster can inject the `$CONTAINER_SANDBOX_MOUNT_POINT` environment variable into a kubeconfig file manually. Here is an example of how to do this today - https://github.com/jsturtevant/sig-windows-tools/blob/c9be1f0a9e95a34fda91bb7e8fc519e3447d8d93/hostprocess/calico/kube-proxy/start.ps1#L44-L52.
- Note: it is not possible to feature-gate this behavior in client libraries and because of this the functionality should not be added to client libraries after `hostProcess` containers while this feature is in `alpha`.
- [kubernetes/kubernetes#104490](https://github.com/kubernetes/kubernetes/pull/104490) adds support for `HostProcess` containers to the golang client library.
- Named Pipe mounts will **not** be supported. Instead named pipes should be accessed via their path on the host (\\\\.\\pipe\\*).
- The following error will be returned if `hostProcess` containers attempt to use named pipe mounts - https://github.com/microsoft/hcsshim/blob/358f05d43d423310a265d006700ee81eb85725ed/internal/jobcontainers/mounts.go#L40.
- Unix domain sockets mounts support will be added before `HostProcess` containers graduate to `stable`. For `alpha` and `beta` Unix domain sockets can be accessed via their paths on the host like named pipes.
- The Windows APIs needed to support mounting Unix domain sockets in `hostProcess` containers were introduced in Windows Server, version 2004. Microsoft is planning on backporting these APIs to Windows Server 2019 (min supported Windows Server OS) to provide a consistent user experience.
- Mounting directories from the host OS into `hostProcess` containers will work just like with normal containers - kubelet will explicitly block this scenario.
- All other volume types supported for normal containers on Windows will work with `hostProcess` containers.
- The Windows bind-filter driver will be used to create a view that merges the host's OS filesystem with container-local volumes.
When `hostProcess` containers are started a new volume will be created which contains the contents of the container image.
This volume will be mounted to `c:\hpc`. The default working directory for `hostProcess` containers will also be set to `c:\hpc`.
- Volume mounts (including service account tokens) will be supported for `hostProcess` containers and can be accessed the same way as in regular Windows Server containers.
- Named Pipe mounts will **not** be supported.
Instead named pipes should be accessed via their path on the host (\\\\.\\pipe\\*).
The following error will be returned if `hostProcess` containers attempt to use named pipe mounts -
https://github.com/microsoft/hcsshim/blob/358f05d43d423310a265d006700ee81eb85725ed/internal/jobcontainers/mounts.go#L40.
- Unix domain socket mounts will also not be supported for `hostProcess` containers.
Unix domain sockets can be accessed via their paths on the host like named pipes.
- Mounting directories from the host OS into `hostProcess` containers will work just like with normal containers but this is not recommended.
Instead workloads should access the host OS's file-system as if they were not being run in a container.
- All other volume types supported for normal containers on Windows will work with `hostProcess` containers.
- `HostProcess` containers will have full access to the host file-system (unless restricted by file-based ACLs and the run_as_username used to start the container).
- There will be no `chroot` equivalent.

- Note: Behavior of volume mounts will differ between the alpha/beta (old) implementation of this feature and the stable (new) implementation.

Member

Can you summarize the differences here and how maintaining backwards compatibility is planned? I know it's detailed in the links below, but I think it would be good to state it here.

Contributor Author

Yes, I will summarize the differences and also explain the backwards compatibility.

For backwards compatibility there is an annotation you can set for the pod that will use the old or the new behavior.

This must be set during pod sandbox creation and will apply to all containers in the pod.
If the new behavior is requested and the required APIs are not present on the machine (for Windows Server 2019), the CreatePodSandbox CRI call will fail.

If the annotation is not set you'll get the new behavior if the required APIs are available, or the old behavior if they are not.
Note: The APIs will be available for Windows Server 2019 in July 2022 and have been present in Windows Server 2022 since launch.
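
The selection rules in this comment can be sketched as a small decision function. This is only an illustration of the logic as described here: the annotation name is not given in this thread, so the sketch takes the requested behavior as a plain argument, and all names are hypothetical.

```python
def select_mount_behavior(requested, apis_available):
    """Pick the volume-mount behavior for a pod sandbox.

    requested: "old", "new", or None when the annotation is not set.
    apis_available: True when the node exposes the required Windows APIs.
    """
    if requested not in ("old", "new", None):
        raise ValueError(f"unknown behavior requested: {requested!r}")
    if requested == "new" and not apis_available:
        # Mirrors the described failure of pod sandbox creation.
        raise RuntimeError("new behavior requested but required APIs are missing")
    if requested is None:
        return "new" if apis_available else "old"
    return requested


print(select_mount_behavior(None, True))
print(select_mount_behavior(None, False))
print(select_mount_behavior("old", True))
```

Because the choice is made once at sandbox creation, the result would apply to every container in the pod.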

Contributor Author

@msau42 I made some updates to address your comments PTAL and thanks!

Designing/testing/validation of an acceptable solution for handling volume mounts w.r.t. `hostProcess` containers was a primary reason for keeping the feature in `beta`.
A recording of the behavior differences from a SIG-Windows community meeting can be found [here](https://youtu.be/8GeZKXgvkdY?t=309). Also note that behavior will be the same for WS2019 onward.
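
The difference between the two volume-mount behaviors in this section can be illustrated with a small path helper. This is a sketch under assumptions: the function name and the container id `abc123` are hypothetical, and it only models how a `mountPath` would be located under the old behavior (prefixed with `$CONTAINER_SANDBOX_MOUNT_POINT`) versus the new behavior (used as-is, as in regular Windows Server containers).

```python
import ntpath


def resolve_volume_path(mount_path, sandbox_mount_point=None):
    """Return the Windows path at which a volume mount would be visible.

    sandbox_mount_point: value of CONTAINER_SANDBOX_MOUNT_POINT for the old
    behavior, or None to model the new behavior.
    """
    normalized = ntpath.normpath(mount_path)
    if sandbox_mount_point is None:
        # New behavior: the mount is reachable at its mountPath.
        return normalized
    # Old behavior: drop any drive letter and leading separator, then place
    # the remainder under the container volume.
    _, tail = ntpath.splitdrive(normalized)
    return ntpath.join(sandbox_mount_point, tail.lstrip("\\"))


token_dir = "/var/run/secrets/kubernetes.io/serviceaccount"
print(resolve_volume_path(token_dir, "c:\\c\\abc123\\"))
print(resolve_volume_path("c:" + token_dir))
```

`ntpath` is used instead of `os.path` so the sketch produces Windows-style paths regardless of the platform it runs on.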

#### Container Images

@@ -718,6 +716,11 @@ Beta
- Validate behaviors of various volume mount types as described in [Container Mounts](#container-mounts) with e2e tests
- Add e2e tests to test different ways to construct paths for container command, args, and workingDir fields for both `hostProcess` and non-hostProcess containers. These tests will include constructing paths with and without `$CONTAINER_SANDBOX_MOUNT_POINT` set and with different combinations of forward and backward slashes.

Graduation

- Add e2e tests to validate running `hostProcess` containers as non SYSTEM/admin accounts
- Update e2e tests for new volume mount behavior as desdribed in [Container Mounts](#container-mounts)
Member

Suggested change
- Update e2e tests for new volume mount behavior as desdribed in [Container Mounts](#container-mounts)
- Update e2e tests for new volume mount behavior as described in [Container Mounts](#container-mounts)

since bind mount is not yet released in containerd 1.7, do we even have the ability to add tests for this yet?

Contributor Author

Yes we do.

We have a github action that runs every night that builds containerd and hcsshim from main and publishes the package to https://github.com/kubernetes-sigs/sig-windows-tools/releases/tag/windows-containerd-nightly.
This package already has the changes to use the new bind mount behavior (if running on Windows Server 2022 for now, it will work on Windows Server 2019 in a few weeks).

We use this package in https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-nightly-master (and a few others).

My plan was to update the e2e tests to check the containerd version being used on the nodes and add skips to the e2e tests that require a different version of containerd.
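
A version-based skip like the one described could look roughly like the sketch below. The real e2e tests are written in Go; this Python version only illustrates the comparison. The helper names and the `(1, 7, 0)` threshold are hypothetical, and the input format assumes a runtime version string of the form `containerd://1.6.8`.

```python
import re


def parse_containerd_version(runtime_version):
    """Extract (major, minor, patch) from a string such as 'containerd://1.6.8'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", runtime_version)
    if match is None:
        raise ValueError(f"cannot parse version from {runtime_version!r}")
    return tuple(int(part) for part in match.groups())


def should_skip(runtime_version, minimum):
    """Return True when the node's containerd is older than the required minimum."""
    return parse_containerd_version(runtime_version) < minimum


# Hypothetical threshold for tests that need the new bind mount behavior.
REQUIRED = (1, 7, 0)
print(should_skip("containerd://1.6.8", REQUIRED))
print(should_skip("containerd://1.7.0", REQUIRED))
```

Note that this sketch ignores pre-release suffixes, so a nightly or beta build would be compared by its numeric version only.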


### Graduation Criteria

<!--
@@ -800,9 +803,9 @@ Graduation to Beta

Graduation to GA:

- Address any issues uncovered in alpha/beta
- Add documentation for running as a non-SYSTEM/admin account to k8s.io
- Update documentation on how volume mounts are set up for `hostProcess` containers on k8s.io
- Set `WindowsHostProcessContainers` feature gate to `GA`
- TBD

### Upgrade / Downgrade Strategy

@@ -1016,7 +1019,10 @@ _This section must be completed when targeting beta graduation to a release._
- **2020-12-17:** Initial KEP draft merged - [#2037](https://github.com/kubernetes/enhancements/pull/2037).
- **2021-02-17:** KEP approved for alpha release - [#2288](https://github.com/kubernetes/enhancements/pull/2288).
- **2021-05-20:** Alpha implementation PR merged - [kubernetes/kubernetes#99576](https://github.com/kubernetes/kubernetes/pull/99576).
- **2021-08-05:** K8s 1.22 released with alpha support for `HostProcess` containers.
- **2021-08-05:** K8s 1.22 released with alpha support for `WindowsHostProcessContainers` feature.
- **2021-08-21:** HostProcessContainers (via CRI) support added to containerd - [containerd/containerd#5131](https://github.com/containerd/containerd/pull/5131).
- **2021-12-07:** K8s 1.23 released with beta support for `WindowsHostProcessContainers` feature.
- **2022-02-15:** Containerd 1.6.0 released with support for HostProcessContainers.

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -23,17 +23,18 @@ replaces:


# The target maturity stage in the current dev cycle for this KEP.
stage: beta
stage: stable

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.23"
latest-milestone: "v1.25"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.22"
beta: "v1.23"
stable: "v1.25"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled