Make SCC more restrictive #513

Merged
merged 7 commits into from
Aug 22, 2022

Conversation

hvaghani221
Contributor

@hvaghani221 hvaghani221 commented Aug 12, 2022

#509

The agent doesn't write in the root directory. So it makes sense to enable readOnlyRootFilesystem.
By default, CRI-O adds the following capabilities, which can be dropped as well.

  • CHOWN
  • DAC_OVERRIDE
  • FSETID
  • FOWNER
  • SETGID
  • SETUID
  • SETPCAP
  • NET_BIND_SERVICE
  • KILL

You can check the current capabilities using:

$ oc rsh sck-otel-splunk-otel-collector-agent-pgg77 bash
bash-4.4# cat /proc/1/status | grep Cap
CapInh: 0000000000000000
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
capsh --decode=00000000000005fb
0x00000000000005fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service
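
As a rough sketch of what the description above amounts to, the restrictions could be expressed in a SecurityContextConstraints object along these lines. This is an illustrative fragment, not the chart's actual template: the metadata name is hypothetical, and a complete SCC needs further fields (user, SELinux, and volume strategies) that are omitted here.

```yaml
# Illustrative SCC fragment; not copied from the chart.
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: example-restricted-scc   # hypothetical name
# The agent does not write to the root directory, so it can be read-only.
readOnlyRootFilesystem: true
# Capabilities CRI-O grants by default, all dropped.
requiredDropCapabilities:
  - CHOWN
  - DAC_OVERRIDE
  - FSETID
  - FOWNER
  - SETGID
  - SETUID
  - SETPCAP
  - NET_BIND_SERVICE
  - KILL
```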

@hvaghani221 hvaghani221 requested review from a team as code owners August 12, 2022 12:21
@dmitryax
Contributor

The agent doesn't write in the root directory. So it makes sense to enable readOnlyRootFilesystem.

The agent writes log checkpoints to /var/addon/splunk/otel_pos. Please make sure it's still working.

@jvoravong, can you please validate this change from the Splunk O11y side? We need to make sure that all host metrics are collected correctly with the dropped capabilities.

We also need to make sure that container and journald logs are collected with both log engines, fluentd and otel.

@jvoravong
Contributor

I'll validate these changes early next week.

@mtcolman

Hi, I can see that you're dropping the default capabilities provided when CRI-O is the container engine (as per the "default_capabilities" section of https://github.com/cri-o/cri-o/blob/main/docs/crio.conf.5.md).

Docker, however, grants additional capabilities by default; see https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities. Can you please assess these additional capabilities and drop as many of them as are not required?

The additional ones granted by Docker are:

  • MKNOD
  • NET_RAW
  • SETFCAP
  • SYS_CHROOT
  • AUDIT_WRITE

Alternatively, dropping "ALL" and then adding back in only the capabilities the otel-collector needs would be best practice.

Thank you!
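
A minimal sketch of the "drop ALL, then add back" approach suggested above, written as a container-level securityContext. Which capabilities (if any) the otel-collector needs is not established in this thread, so the `add` list is left as a commented placeholder rather than filled in.

```yaml
# Hypothetical container securityContext; not taken from the chart.
securityContext:
  capabilities:
    drop:
      - ALL
    # Add back only what testing shows is required, for example:
    # add:
    #   - <CAPABILITY_NAME>
```

Dropping `ALL` covers both the CRI-O and the Docker default sets, so the result does not depend on which container engine the node runs.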

@hvaghani221
Contributor Author

I have executed functional tests manually with these changes on Red Hat OpenShift Local.

All tests are passing except 3:
[image: functional test results]

These are expected failures since the hostname would be different and docker/containerd logs wouldn't be there.

@jvoravong
Contributor

jvoravong commented Aug 18, 2022

Finished up some testing:

  • Metrics -> Look Good
  • Traces -> Look Good
  • Logs -> Found an issue. With the otel log engine everything is good. With the fluentd log engine, the agent goes into a crash loop.

Using Kubernetes 1.24 and OpenShift 4.11 (ROSA)

splunk-otel-collector-chart % k version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.4", GitCommit:"95ee5ab382d64cfe6c28967f36b53970b8374491", GitTreeState:"clean", BuildDate:"2022-08-17T18:46:11Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0+9546431", GitCommit:"0a57f1f59bda75ea2cf13d9f3b4ac5d202134f2d", GitTreeState:"clean", BuildDate:"2022-07-08T19:55:26Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}

Warning log printed with a Helm Install

W0818 15:34:17.467292   41736 warnings.go:70] would violate PodSecurity "restricted:v1.24": host namespaces (hostNetwork=true), hostPort (container "otel-collector" uses hostPorts 14250, 14268, 4317, 4318, 55681, 8006, 9080, 9411, 9943), allowPrivilegeEscalation != false (containers "prepare-fluentd-config", "fluentd", "otel-collector" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "prepare-fluentd-config", "fluentd", "otel-collector" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volumes "varlog", "varlogdest", "journallogpath", "host-dev", "host-etc", "host-proc", "host-run-udev-data", "host-sys", "host-var-run-utmp" use restricted volume type "hostPath"), runAsNonRoot != true (pod or containers "prepare-fluentd-config", "fluentd", "otel-collector" must set securityContext.runAsNonRoot=true), runAsUser=0 (containers "prepare-fluentd-config", "fluentd" must not set runAsUser=0), seccompProfile (pod or containers "prepare-fluentd-config", "fluentd", "otel-collector" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
W0818 15:34:17.553879   41736 warnings.go:70] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "otel-collector" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "otel-collector" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "otel-collector" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "otel-collector" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Fluentd containers entering a crashloop

k get pod splunk-otel-collector-agent-zz4ld
splunk-otel-collector-agent-zz4ld                            1/2     CrashLoopBackOff   5 (2m3s ago)    5m11s   XX.X.XXX.XX   ip-XX-XX-XXX-XX.ec2.internal   <none>           <none>
k describe pod splunk-otel-collector-agent-zz4ld
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  91s                default-scheduler  Successfully assigned monitoring/splunk-otel-collector-agent-27wr9 to ip-XX-XX-XXX-XX.ec2.internal by ip-XX-XX-XXX-XX
  Normal   Pulled     91s                kubelet            Container image "splunk/fluentd-hec:1.2.8" already present on machine
  Normal   Created    91s                kubelet            Created container prepare-fluentd-config
  Normal   Started    91s                kubelet            Started container prepare-fluentd-config
  Normal   Pulled     90s                kubelet            Container image "quay.io/signalfx/splunk-otel-collector:0.57.0" already present on machine
  Normal   Created    90s                kubelet            Created container otel-collector
  Normal   Started    90s                kubelet            Started container otel-collector
  Warning  Unhealthy  89s                kubelet            Readiness probe failed: Get "http://XX.X.XXX.XXX:13133/": dial tcp XX.X.XXX.XXX:13133: connect: connection refused
  Normal   Created    47s (x4 over 90s)  kubelet            Created container fluentd
  Normal   Started    47s (x4 over 90s)  kubelet            Started container fluentd
  Normal   Pulled     47s (x4 over 90s)  kubelet            Container image "splunk/fluentd-hec:1.2.8" already present on machine
  Warning  BackOff    31s (x6 over 86s)  kubelet            Back-off restarting failed container

@harshit-splunk please look into the Fluentd crashes.

@dmitryax
Contributor

Logs -> Found an issue. With the otel log engine everything is good. With the fluentd log engine, the agent goes into a crash loop.

I believe this is caused by enabling readOnlyRootFilesystem. I suspect that the otel collector also cannot write checkpoints to /var/addon/splunk/otel_pos; it just doesn't fail. @harshit-splunk please validate that checkpoints can be written.

@jvoravong
Contributor

I forgot to mention that specifically. I observed that the receiver offset files in /var/addon/splunk/otel_pos were being written successfully, which was surprising...

@dmitryax
Contributor

Interesting. If this cannot be fixed for fluentd, we can just update the template and set readOnlyRootFilesystem based on the logsEngine configuration.

Anyway, first we need to know why fluentd is failing.
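
The fallback mentioned above could look roughly like the following Helm template fragment. This is only a sketch: it assumes the engine is selected through a `.Values.logsEngine` value that is set to `otel` or `fluentd`, and it is not the template that was merged.

```yaml
# Hypothetical template fragment; the value name is an assumption.
# Renders "true" for the otel engine and "false" otherwise.
readOnlyRootFilesystem: {{ eq .Values.logsEngine "otel" }}
```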

@hvaghani221
Contributor Author

Actually, the checkpoint directory is mounted as a volume when the otel logs engine is used, so the collector is able to write there.

- name: checkpoint
  mountPath: {{ .Values.logsCollection.checkpointPath }}

It's missing in fluentd. I'll include that in this PR.

@dmitryax
Contributor

@harshit-splunk fluentd configuration is a bit confusing but it's also mounted at /var/log

- name: varlog
  mountPath: {{ .Values.fluentd.config.containers.path }}

@hvaghani221
Contributor Author

@dmitryax @jvoravong Fluentd internally writes to the /tmp/fluentd directory. Since the root filesystem was read-only, it wasn't able to create that directory. So I have mounted an empty directory at /tmp.
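
For illustration, mounting an empty directory at /tmp usually takes an `emptyDir` volume plus a matching mount on the fluentd container, as sketched below. The volume name `tmp` is hypothetical and the surrounding pod spec is omitted; the merged change may differ in its details.

```yaml
# Sketch only; names are assumptions, not the chart's actual values.
containers:
  - name: fluentd
    volumeMounts:
      - name: tmp          # hypothetical volume name
        mountPath: /tmp    # writable scratch space for /tmp/fluentd
volumes:
  - name: tmp
    emptyDir: {}
```

Because only /tmp is made writable, the rest of the container's root filesystem can stay read-only.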

@hvaghani221 hvaghani221 requested a review from jvoravong August 22, 2022 12:33
Contributor

@jvoravong jvoravong left a comment

Fluentd works with these latest changes. Good work. Giving my approval, let's also give dmitryax a chance to respond.

Contributor

@dmitryax dmitryax left a comment

LGTM

@dmitryax dmitryax merged commit 3819317 into signalfx:main Aug 22, 2022
@hvaghani221 hvaghani221 deleted the restrict-scc branch August 23, 2022 04:57