[k8s] Using AppArmor causes provisioning failures on certain k8s clusters #4174

Closed
romilbhardwaj opened this issue Oct 25, 2024 · 0 comments · Fixed by #4176

Comments

romilbhardwaj commented Oct 25, 2024

A user reported that using file_mounts for bucket mounting in sky launch caused provisioning to fail on their EKS cluster:

e.g.,

file_mounts:
  /data:
    name: romilb-sky-test
    source: ~/tmp-workdir
    mode: MOUNT

run: |
  ls -l /data

would cause:

HTTP response headers: HTTPHeaderDict({'Audit-Id': '42ea60ef-3305-43e9-a554-ef028e808b2f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7f0f3024-9f2f-4abb-af09-80757fbbd065', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a6ea3db1-56e6-4506-b031-942b90da496c', 'Date': 'Wed, 23 Oct 2024 19:11:07 GMT', 'Transfer-Encoding': 'chunked'})
(xxx-nemo-train-eks, pid=1713) W 10-23 19:11:07 instance.py:607] HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"xxx-nemo-train-eks-1-4ce9-9a3a-worker\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"secret-volume\", VolumeSource: {Secret: \u0026{SecretName: \"sky-ssh-keys\", DefaultMode: \u0026420}}}, {Name: \"dshm\", VolumeSource: {EmptyDir: \u0026{Medium: \"Memory\"}}}, {Name: \"dev-fuse\", VolumeSource: {HostPath: \u0026{Path: \"/dev/fuse\", Type: \u0026\"\"}}}, {Name: \"nvme\", VolumeSource: {HostPath: \u0026{Path: \"/nvme\", Type: \u0026\"Directory\"}}}, ...},\n  \tInitContainers: nil,\n  \tContainers: []core.Container{\n  \t\t{\n  \t\t\t... // 18 identical fields\n  \t\t\tTerminationMessagePolicy: \"File\",\n  \t\t\tImagePullPolicy:          \"IfNotPresent\",\n  \t\t\tSecurityContext: \u0026core.SecurityContext{\n  \t\t\t\t... // 9 identical fields\n  \t\t\t\tProcMount:       nil,\n  \t\t\t\tSeccompProfile:  nil,\n- \t\t\t\tAppArmorProfile: \u0026core.AppArmorProfile{Type: \"Unconfined\"},\n+ \t\t\t\tAppArmorProfile: nil,\n  \t\t\t},\n  \t\t\tStdin:     false,\n  \t\t\tStdinOnce: false,\n  \t\t\tTTY:       false,\n  \t\t},\n  \t},\n  \tEphemeralContainers: nil,\n  \tRestartPolicy:       \"Never\",\n  \t... // 28 identical fields\n  }\n","reason":"Invalid","details":{"name":"xxx-nemo-train-eks-1-4ce9-9a3a-worker","kind":"Pod","causes":[{"reason":"FieldValueForbidden","message":"Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"secret-volume\", VolumeSource: {Secret: \u0026{SecretName: \"sky-ssh-keys\", DefaultMode: \u0026420}}}, {Name: \"dshm\", VolumeSource: {EmptyDir: \u0026{Medium: \"Memory\"}}}, {Name: \"dev-fuse\", VolumeSource: {HostPath: \u0026{Path: \"/dev/fuse\", Type: \u0026\"\"}}}, {Name: \"nvme\", VolumeSource: {HostPath: \u0026{Path: \"/nvme\", Type: \u0026\"Directory\"}}}, ...},\n  \tInitContainers: nil,\n  \tContainers: []core.Container{\n  \t\t{\n  \t\t\t... // 18 identical fields\n  \t\t\tTerminationMessagePolicy: \"File\",\n  \t\t\tImagePullPolicy:          \"IfNotPresent\",\n  \t\t\tSecurityContext: \u0026core.SecurityContext{\n  \t\t\t\t... // 9 identical fields\n  \t\t\t\tProcMount:       nil,\n  \t\t\t\tSeccompProfile:  nil,\n- \t\t\t\tAppArmorProfile: \u0026core.AppArmorProfile{Type: \"Unconfined\"},\n+ \t\t\t\tAppArmorProfile: nil,\n  \t\t\t},\n  \t\t\tStdin:     false,\n  \t\t\tStdinOnce: false,\n  \t\t\tTTY:       false,\n  \t\t},\n  \t},\n  \tEphemeralContainers: nil,\n  \tRestartPolicy:       \"Never\",\n  \t... // 28 identical fields\n  }\n","field":"spec"}]},"code":422}

The error is complaining that the AppArmorProfile cannot be removed once the pod is already running (pod updates may not change fields):

- AppArmorProfile: &core.AppArmorProfile{Type: "Unconfined"},
+ AppArmorProfile: nil,
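
This matches the Kubernetes rule that most pod spec fields, including the container securityContext, are immutable after creation. As a hypothetical repro sketch (pod name and container index are placeholders; the appArmorProfile field only exists as a first-class securityContext field on recent Kubernetes versions such as the user's v1.30):

# Attempting to drop the AppArmor profile from a running pod should be rejected by the
# API server with the same "Forbidden: pod updates may not change fields ..." error.
kubectl patch pod <pod-name> --type=json \
  -p '[{"op": "remove", "path": "/spec/containers/0/securityContext/appArmorProfile"}]'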

The error was traced to the use of the container.apparmor.security.beta.kubernetes.io/ray-node: unconfined annotation in our Kubernetes pod template when storage is used:

{% if k8s_fuse_device_required %}
annotations:
  # Required for FUSE mounting to access /dev/fuse
  container.apparmor.security.beta.kubernetes.io/ray-node: unconfined
{% endif %}
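
To confirm whether the annotation (and the AppArmor profile the API server derives from it) actually landed on the running pod, something like the following can be used (pod name is a placeholder):

# Show the pod's annotations and the container's effective security context
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext}'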

This annotation is generally required for FUSE mounting (tested on GKE, EKS and Rancher). However, on the user's cluster Mandatory Access Control (MAC) is not enforced, so the annotation is not needed for FUSE mounting there; setting it instead causes the above error.
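
A quick way to check whether AppArmor is actually loaded/enforced on a node (run on the node itself; aa-status requires the AppArmor userspace utilities to be installed):

# "Y" means the AppArmor LSM is loaded in the kernel
cat /sys/module/apparmor/parameters/enabled
# Lists loaded profiles and which are in enforce/complain mode
sudo aa-status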

It's unclear whether an admission controller or something else is removing/updating the pod.
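
If an admission controller is suspected, the cluster's mutating webhooks can be listed to see what might be rewriting pod specs on admission (purely a debugging suggestion):

# Any webhook listed here could be mutating pods as they are created
kubectl get mutatingwebhookconfigurations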

User info:

System Info:
  Machine ID:                 ec2f818ba21dddda0dc3507fe141bd49
  System UUID:                ec2f818b-a21d-ddda-0dc3-507fe141bd49
  Boot ID:                    a80b2cdb-f82a-4962-97b3-80a98a88195e
  Kernel Version:             6.8.0-1015-aws
  OS Image:                   Ubuntu 22.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.12
  Kubelet Version:            v1.30.2
  Kube-Proxy Version:         v1.30.2

sky -c
bc30c0b2e5c92dc00e52c6f4240b603fad4b258a
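
For reference, node details like the System Info block above come from the node status, e.g. (node name is a placeholder):

# The "System Info" section above appears in this output
kubectl describe node <node-name>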