
multus: Add sample job manifest for multus config validation #12495

Merged · 1 commit · Aug 11, 2023
Conversation

Nikhil-Ladha
Contributor

Description of your changes:
Added a sample job manifest named multus-validation that validates the multus configuration in the cluster.

Which issue is resolved by this Pull Request:
Resolves #12172

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@travisn travisn requested a review from BlaineEXE July 10, 2023 15:05
@Nikhil-Ladha
Contributor Author

Hi, can someone please help me understand how to fix the SCC issue for the multus-validation-test-web-server pod creation during the validation process?

@parth-gr
Member

Hi, can someone please help me understand how to fix the SCC issue for the multus-validation-test-web-server pod creation during the validation process?

What is the specific error?
Is it something failing in CI?

@Nikhil-Ladha
Contributor Author

Nikhil-Ladha commented Jul 13, 2023

Not in CI, but in local testing of the job.
This is the error that I am getting in the logs of the pod deployed by the job

RESULT: multus validation test failed: failed to start web server: failed to create web server pod: pods "multus-validation-test-web-server" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .containers[0].runAsUser: Invalid value: 101: must be in the ranges: [1000580000, 1000589999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, pod.metadata.annotations[seccomp.security.alpha.kubernetes.io/pod]: Forbidden: seccomp may not be set, pod.metadata.annotations[container.seccomp.security.alpha.kubernetes.io/multus-validation-test-web-server]: Forbidden: seccomp may not be set, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "rook-ceph-csi": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

Note: I am not providing any <nad-name> here, but the error seems unrelated to that anyway.
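For context, the restricted-v2 failure in the log above is about the pod's fixed runAsUser: 101 falling outside the namespace's allotted UID range. One common OpenShift workaround (an illustrative sketch only — later in this thread the PR instead adds a dedicated SCC) is to omit the hard-coded UID so the SCC can inject one:

```yaml
securityContext:
  # Omit runAsUser entirely on OpenShift; the restricted-v2 SCC
  # assigns a UID from the namespace's allowed range
  # (e.g. 1000580000-1000589999, as seen in the error above).
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```

This only sidesteps the runAsUser check; the seccomp-annotation rejections in the same error would still need an SCC that permits them.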

Member

@BlaineEXE BlaineEXE left a comment

Thanks for the submission @Nikhil-Ladha !

deploy/examples/common.yaml Outdated Show resolved Hide resolved
deploy/examples/multus-validation.yaml Outdated Show resolved Hide resolved
deploy/examples/multus-validation.yaml Outdated Show resolved Hide resolved
deploy/examples/multus-validation.yaml Outdated Show resolved Hide resolved
deploy/examples/psp.yaml Outdated Show resolved Hide resolved
deploy/charts/library/templates/_cluster-psp.tpl Outdated Show resolved Hide resolved
build/csv/csv-gen.sh Outdated Show resolved Hide resolved
Documentation/CRDs/Cluster/ceph-cluster-crd.md Outdated Show resolved Hide resolved
Documentation/CRDs/Cluster/ceph-cluster-crd.md Outdated Show resolved Hide resolved
Comment on lines 14 to 37
- apiGroups: [""]
  resources: ["configmaps", "configmaps/finalizers", "pods"]
  verbs: ["get", "create", "update", "delete"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "delete"]
Member

These definitely aren't all the privileges the serviceaccount will need. It'll also need the ability to get,create,delete daemonsets and get,list,create,delete,deletecollection deployments. Probably list pods also.
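The reviewer's list translates to roughly the following Role rules — a sketch of the suggestion only; the exact resource/verb groupings are assumptions, not the merged result:

```yaml
# Sketch: existing rules plus the privileges the reviewer says are missing.
- apiGroups: [""]
  resources: ["configmaps", "configmaps/finalizers", "pods"]
  verbs: ["get", "list", "create", "update", "delete"] # "list" added per the review
- apiGroups: ["apps"]
  resources: ["daemonsets"]
  verbs: ["get", "create", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "delete", "deletecollection"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "delete"]
```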

@Nikhil-Ladha am I wrong, or have you not been running the job in a development environment to see if it works?

Contributor Author

@Nikhil-Ladha Nikhil-Ladha Jul 19, 2023

I have been running the job in a dev environment to confirm it works, and that's when I hit the SCC issue mentioned above.
The extra privileges may well be required, but since the job is stuck at the SCC issue, I haven't been able to confirm whether they are also needed.

@subhamkrai
Contributor

@Nikhil-Ladha are you still facing the issue?

I quickly tested your yaml in Minikube

ka multus-validation.yaml 
role.rbac.authorization.k8s.io/rook-ceph-multus-validation created
rolebinding.rbac.authorization.k8s.io/rook-ceph-multus-validation created
serviceaccount/rook-ceph-multus-validation created
job.batch/rook-ceph-multus-validation created


kc get pods
NAME                                  READY   STATUS    RESTARTS       AGE
multus-validation-test-web-server     1/1     Running   0              10m
rook-ceph-multus-validation-7jngc     0/1     Error     0              10m
rook-ceph-multus-validation-95xpn     0/1     Error     0              10m
rook-ceph-multus-validation-bkzhz     0/1     Error     0              7m43s
rook-ceph-multus-validation-btcjn     0/1     Error     0              11m
rook-ceph-multus-validation-gg556     0/1     Error     0              10m
rook-ceph-multus-validation-t79sl     0/1     Error     0              10m
rook-ceph-multus-validation-vzc97     0/1     Error     0              10m
rook-ceph-operator-58665dfccf-ft5ng   1/1     Running   4 (2m4s ago)   11m
~/go/src/github.com/rook/deploy/examples
srai@192 ~ (pr/12495) $ kc get pods rook-ceph-multus-validation-95xpn -f
error: flag needs an argument: 'f' in -f
See 'kubectl get --help' for usage.
~/go/src/github.com/rook/deploy/examples
srai@192 ~ (pr/12495) $ kc logs rook-ceph-multus-validation-95xpn -f
2023-07-19 14:55:59.452611 I | multus-validation: starting multus validation test with the following config:
namespace: rook-ceph
publicNetwork: <nad-name>
clusterNetwork: <nad-name>
daemonsPerNode: 16
resourceTimeout: 3m0s
nginxImage: nginxinc/nginx-unprivileged:stable-alpine

RESULT: multus validation test failed: failed to create validation test config object: failed to create validation test config object [{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:multus-validation-test-owner GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Immutable:<nil> Data:map[] BinaryData:map[]}]: configmaps "multus-validation-test-owner" already exists

---
---
---
To clean up resources when you are done debugging: rook multus validation cleanup --namespace rook-ceph

I don't see the rbac issue

@Nikhil-Ladha
Contributor Author

Nikhil-Ladha commented Jul 19, 2023

> @Nikhil-Ladha are you still facing the issue?
>
> I quickly tested your yaml in Minikube [...]
>
> I don't see the rbac issue

The issue occurs on an OpenShift cluster; I haven't tested on a Minikube cluster.
Also, rook-ceph-multus-validation-btcjn is the first pod created by the job (going by its age); could you also check the logs of that pod?

@subhamkrai
Contributor

@Nikhil-Ladha for OpenShift-based clusters we need to create SecurityContextConstraints; look at the operator-openshift file. We have an example use case there.

@Nikhil-Ladha
Contributor Author

Finally able to run the job successfully on a Minikube cluster :)

nikhil_ladha@nladha ~ % kubectl logs rook-ceph-multus-validation-vmqcr -n rook-ceph
2023-07-21 10:00:43.296074 I | multus-validation: starting multus validation test with the following config:
namespace: rook-ceph
publicNetwork: default/public-net
clusterNetwork: default/cluster-net
daemonsPerNode: 2
resourceTimeout: 3m0s
nginxImage: nginxinc/nginx-unprivileged:stable-alpine
2023-07-21 10:00:43.325536 I | multus-validation: continuing: expected number of image pull pods not yet ready: a daemonset expects zero scheduled pods
2023-07-21 10:00:45.330846 I | multus-validation: waiting to ensure num expected image pull pods to stabilize at 1
[... same message repeated every ~2s through 10:01:13 ...]
2023-07-21 10:01:15.382661 I | multus-validation: expecting 1 image pull pods to be 'Ready'
2023-07-21 10:01:17.387181 I | multus-validation: cleaning up all 1 'Running' image pull pods
2023-07-21 10:01:19.391805 I | multus-validation: getting web server info for clients
2023-07-21 10:01:21.394286 I | multus-validation: starting 2 clients on each node
2023-07-21 10:01:23.407632 I | multus-validation: verifying 2 client pods begin 'Running'
2023-07-21 10:01:25.420980 I | multus-validation: verifying all 2 'Running' client pods reach 'Ready' state
2023-07-21 10:01:27.424338 I | multus-validation: continuing: number of ready clients [0] is not the number expected [2]
[... same message repeated every ~2s through 10:02:07 ...]
2023-07-21 10:02:09.515994 I | multus-validation: all 2 clients are 'Ready'

RESULT: multus validation test succeeded!

cleaning up multus validation test resources in namespace "rook-ceph"
multus validation test resources were successfully cleaned up

If we want the job to work on OpenShift clusters as well, we will need to fix the SCC issue.

@Nikhil-Ladha
Contributor Author

At last, the SCC issue is resolved; I have now added the extra SCC required to deploy the job on an OpenShift cluster.

@Nikhil-Ladha
Contributor Author

@BlaineEXE can you please take a look again? TIA :)

Comment on lines 6 to 41
# ---
# scc for the Rook and Ceph daemons
# kind: SecurityContextConstraints
# apiVersion: security.openshift.io/v1
# metadata:
#   name: rook-ceph-multus-validation
# allowPrivilegedContainer: true
# allowHostDirVolumePlugin: true
# allowHostPID: false
# # set to true if running rook with host networking enabled
# allowHostNetwork: true
# # set to true if running rook with the provider as host
# allowHostPorts: true
# priority:
# allowedCapabilities: ["MKNOD"]
# allowHostIPC: true
# readOnlyRootFilesystem: false
# # drop all default privileges
# requiredDropCapabilities: ["All"]
# defaultAddCapabilities: []
# runAsUser:
#   type: RunAsAny
# seLinuxContext:
#   type: RunAsAny
# fsGroup:
#   type: RunAsAny
# supplementalGroups:
#   type: RunAsAny
# seccompProfiles:
#   - "*"
# volumes:
#   - configMap
#   - emptyDir
#   - projected
# users:
#   - system:serviceaccount:rook-ceph:rook-ceph-multus-validation # serviceaccount:namespace:cluster
Member

Let's include this at the end.

value: DEBUG
restartPolicy: Never
---
# apiVersion: rbac.authorization.k8s.io/v1
Member

Add instruction for PSP

Comment on lines 113 to 116
- "--public-network"
- "<nad-name>"
- "--cluster-network"
- "<nad-name>"
Member

Let's comment these out by default so that anyone running this without reading it won't have the tool seem to run with environment/test failures, and instead it will be failures from not providing input.

Suggested change
- "--public-network"
- "<nad-name>"
- "--cluster-network"
- "<nad-name>"
# - --public-network=<NAD-NAME> # uncomment and replace NAD name if using public network
# - --cluster-network=<NAD-NAME> # uncomment and replace NAD name if using cluster network

Comment on lines 106 to 108
# TODO: Insert the NAD name for public network and cluster network
# If you want to use any other flags along with the basic command,
# add the `--help` flag in the end to see the list of flags available, and use accordingly.
Member

TODO is more of a developer comment.

Suggested change
# TODO: Insert the NAD name for public network and cluster network
# If you want to use any other flags along with the basic command,
# add the `--help` flag in the end to see the list of flags available, and use accordingly.
# Insert the NAD name for public network and cluster network
# If you want to use any other flags along with the basic command,
# add the `--help` flag in the end to see the list of flags available, and use accordingly.

Comment on lines 109 to 116
args:
- "multus"
- "validation"
- "run"
- "--public-network"
- "<nad-name>"
- "--cluster-network"
- "<nad-name>"
Member

Another flag that is very commonly used is the --nginx-image flag. Let's include that also (commented-out) with a quick comment note as suggested above.
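Combining this with the earlier suggestion, the args block might end up looking like the following sketch (flag names are taken from the review comments; the exact spelling of the image placeholder is illustrative, not the merged text):

```yaml
args:
  - "multus"
  - "validation"
  - "run"
  # - --public-network=<NAD-NAME> # uncomment and replace NAD name if using public network
  # - --cluster-network=<NAD-NAME> # uncomment and replace NAD name if using cluster network
  # - --nginx-image=<image> # optional: override the default web server image
```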

Comment on lines +87 to +56
# A job that runs the multus validation tool
apiVersion: batch/v1
kind: Job
Member

Because this is the primary "thing" that users will be modifying, make sure this is the first yaml definition in this manifest file.

Contributor Author

But this would cause some delay (warnings) in pod creation, since the service account and roles used by the job won't exist yet when the job is created. For example, see the output below after updating the yaml and moving the job definition to the top.

Events:
  Type     Reason            Age                From            Message
  ----     ------            ----               ----            -------
  Warning  FailedCreate      46s (x2 over 46s)  job-controller  Error creating: pods "rook-ceph-multus-validation-" is forbidden: error looking up service account rook-ceph/rook-ceph-multus-validation: serviceaccount "rook-ceph-multus-validation" not found
  Normal   SuccessfulCreate  36s                job-controller  Created pod: rook-ceph-multus-validation-7924n
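The warning above arises because kubectl creates resources in the order they appear in the file; keeping the RBAC objects ahead of the Job avoids the transient FailedCreate. A minimal sketch of that ordering (names follow the manifest discussed in this thread; the container image is an assumption):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-ceph-multus-validation
  namespace: rook-ceph
---
# Role and RoleBinding definitions go here, still ahead of the Job.
---
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-multus-validation
  namespace: rook-ceph
spec:
  template:
    spec:
      # By the time the job controller looks this up, the account exists.
      serviceAccountName: rook-ceph-multus-validation
      restartPolicy: Never
      containers:
        - name: multus-validation
          image: rook/ceph:master # assumption: not necessarily the image the PR uses
          args: ["multus", "validation", "run"]
```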

Member

That's unfortunate, but I think you're right to leave where it is then. Users generally report errors like this as bugs. As long as the comment at the top of the file is the primary documentation place, users still can see the primary docs early in the file, which is good.

Comment on lines 106 to 108
# TODO: Insert the NAD name for public network and cluster network
# If you want to use any other flags along with the basic command,
# add the `--help` flag in the end to see the list of flags available, and use accordingly.
Member

Additionally, it seems best to me to move this doc to the header of the file. It will be easiest for users if there is one primary place to read about how to use the file, and that primary doc is at the very top. Any other comments in the file should be supplementary, to help users find "where" to make changes with brief reminders.

added a sample job manifest named `multus-validation` that
validates the multus configuration in the cluster.

Signed-off-by: Nikhil-Ladha <[email protected]>
@BlaineEXE BlaineEXE merged commit 853bab3 into rook:master Aug 11, 2023
45 of 49 checks passed
BlaineEXE added a commit that referenced this pull request Aug 11, 2023
multus: Add sample job manifest for multus config validation (backport #12495)
Successfully merging this pull request may close these issues:

  • Create a sample job manifest to run multus validation test