Alluxio Version:
What version of Alluxio are you using?
Version 2.5.0
Describe the bug
A clear and concise description of what the bug is.
See PR #13061 for the original discussion. When running the Alluxio worker in Kubernetes with short-circuit enabled, the DaemonSet template mounts the directory /opt/domain. With that mount in place, the worker crashes with the following error:
2021-03-13 00:32:55,274 INFO GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2021-03-13 00:32:55,275 INFO AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad.
2021-03-13 00:32:55,281 WARN RetryUtils - Failed to Starting gRPC server (attempt 1): java.io.IOException: Failed to bind
2021-03-13 00:32:55,482 WARN RetryUtils - Failed to Starting gRPC server (attempt 2): java.io.IOException: Failed to bind
2021-03-13 00:32:55,883 WARN RetryUtils - Failed to Starting gRPC server (attempt 3): java.io.IOException: Failed to bind
2021-03-13 00:32:56,384 WARN RetryUtils - Failed to Starting gRPC server (attempt 4): java.io.IOException: Failed to bind
2021-03-13 00:32:56,885 WARN RetryUtils - Failed to Starting gRPC server (attempt 5): java.io.IOException: Failed to bind
2021-03-13 00:32:57,386 WARN RetryUtils - Failed to Starting gRPC server (attempt 6): java.io.IOException: Failed to bind
2021-03-13 00:32:57,387 ERROR GrpcDataServer - Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
java.io.IOException: Failed to bind
at io.grpc.netty.NettyServer.start(NettyServer.java:252)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
at alluxio.grpc.GrpcServer.lambda$start$0(GrpcServer.java:77)
at alluxio.retry.RetryUtils.retry(RetryUtils.java:39)
at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:107)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:285)
at alluxio.worker.DataServer$Factory.create(DataServer.java:47)
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:162)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:46)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:38)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Permission denied
2021-03-13 00:32:57,391 ERROR AlluxioWorker - Fatal error: Failed to create worker process
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:168)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:46)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:38)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:287)
at alluxio.worker.DataServer$Factory.create(DataServer.java:47)
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:162)
... 3 more
Caused by: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:285)
... 5 more
Caused by: java.io.IOException: Failed to bind
at io.grpc.netty.NettyServer.start(NettyServer.java:252)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
at alluxio.grpc.GrpcServer.lambda$start$0(GrpcServer.java:77)
at alluxio.retry.RetryUtils.retry(RetryUtils.java:39)
at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:107)
... 10 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Permission denied
Inspecting the alluxio-worker container in my Alluxio worker pod shows that /opt/domain is mounted with root:root as the user:group:
[kubernetes@ip-172-31-76-132 shortcircuit]$ kubectl exec -it alluxio-worker-b28rm -c alluxio-job-worker /bin/bash
bash-4.4$ ls -l /opt
total 4
lrwxrwxrwx 1 alluxio alluxio 13 Mar 5 20:17 alluxio -> alluxio-2.5.0
drwxr-xr-x 1 alluxio alluxio 30 Mar 5 19:52 alluxio-2.5.0
drwxr-xr-x 3 root root 4096 Mar 5 20:18 arthas
drwxr-xr-x 3 root root 104 Mar 5 20:18 async-profiler
drwxr-xr-x 2 root root 6 Mar 12 23:51 domain
According to the Kubernetes SecurityContext docs, using fsGroup should have been sufficient, but it appears that the volume is still being mounted with the default user:group of root:root.
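The "Permission denied" on bind follows from ordinary POSIX directory permissions: creating the socket file under /opt/domain needs write and search permission on that directory, and drwxr-xr-x with owner root:root grants write only to root. Below is a minimal Python sketch of that check. It is not Alluxio or Kubernetes code; the function name `can_create_entry` and the uid/gid value 1000 for the alluxio user are assumptions made for illustration only.

```python
import stat


def can_create_entry(dir_mode, dir_uid, dir_gid, uid, gids):
    """Simplified POSIX check: may this process create a file (e.g. a socket) in the directory?

    Ignores ACLs, capabilities and read-only mounts.
    """
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == dir_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    # Creating an entry requires both write and search (execute) permission.
    return (dir_mode & need) == need


# Assumed IDs for the non-root alluxio user; the real values may differ.
ALLUXIO_UID = 1000
ALLUXIO_GID = 1000

# /opt/domain as listed above: drwxr-xr-x root root
print(can_create_entry(0o755, 0, 0, ALLUXIO_UID, [ALLUXIO_GID]))  # False

# Hypothetical case: the directory is group-owned by the alluxio gid and group-writable
print(can_create_entry(0o775, 0, ALLUXIO_GID, ALLUXIO_UID, [ALLUXIO_GID]))  # True
```

The second call shows the kind of ownership change fsGroup was expected to produce on the mounted directory.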
I believe fsGroup gets skipped for non-PV hostPath volumes. This is implied by this Kubernetes blog post, which says that a benefit of using a Local PV over a hostPath volume is ownership via fsGroup.
Similar observations about differing behaviour for hostPath have been made before.
As for why the container-level runAsUser and runAsGroup didn't apply to the volume:
The Kubernetes docs say container-level runAsUser and runAsGroup override Pod-level, but do not affect the volumes.
I can't find any documentation suggesting that runAsUser or runAsGroup at the Pod level is supposed to affect the permissions of the volume, but in my testing environment it seemed to do so anyway.
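The override rule quoted from the Kubernetes docs can be sketched as a small Python model: container-level runAsUser and runAsGroup replace the Pod-level values for that container's process, and the result describes only the process identity, not the ownership of any mounted volume. This is an illustration, not Kubernetes code; the helper name `effective_process_identity` and the example IDs are hypothetical.

```python
def effective_process_identity(pod_sc, container_sc):
    """Return the (runAsUser, runAsGroup) a container's process would run with.

    Container-level securityContext values take precedence over Pod-level ones.
    Nothing here changes who owns a mounted directory.
    """
    uid = container_sc.get("runAsUser", pod_sc.get("runAsUser"))
    gid = container_sc.get("runAsGroup", pod_sc.get("runAsGroup"))
    return uid, gid


# Hypothetical IDs, chosen only for illustration.
pod_sc = {"runAsUser": 1000, "runAsGroup": 1000, "fsGroup": 1000}
container_sc = {"runAsUser": 2000}

print(effective_process_identity(pod_sc, container_sc))  # (2000, 1000)
print(effective_process_identity(pod_sc, {}))            # (1000, 1000)
```

fsGroup is deliberately left out of the return value: it is a Pod-level setting about volume ownership, and whether it is honoured depends on the volume type, as discussed above.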
Conclusion
The main takeaway from all this is that we should probably change how the volume is defined for the worker so that fsGroup works as expected. Our docs would also need to state that users should not change how the PV for the worker is configured and that, if they do, they are responsible for ensuring these permissions apply to their volumes properly.
To Reproduce
Steps to reproduce the behavior (as minimally and precisely as possible)
Install Alluxio on a Kubernetes cluster following the Alluxio documentation.
kubectl tab (this issue may occur when using Helm too, but it was not tested in my set-up)
hostPath volume (still with the kubectl tab)
Expected behavior
A clear and concise description of what you expected to happen.
The Alluxio worker pod(s) should be in the "Running" state.
Urgency
Describe the impact and urgency of the bug.
The Helm chart Kubernetes templates do not correctly configure the Alluxio worker pod for short-circuit access.
Additional context
Add any other context about the problem here.
See PR #13061 for the original discussion.