Alluxio short-circuit worker volume is not writable when using Helm chart #13096

Closed
ZhuTopher opened this issue Mar 18, 2021 · 2 comments
Labels
type-bug This issue is about a bug

Comments

@ZhuTopher
Contributor

Alluxio Version:
What version of Alluxio are you using?

Version 2.5.0

Describe the bug
A clear and concise description of what the bug is.

See PR #13061 for the original discussion. When running the Alluxio worker in Kubernetes with short-circuit enabled, the DaemonSet template mounts the directory /opt/domain. With that mount in place, the worker crashes with the following error:

2021-03-13 00:32:55,274 INFO  GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2021-03-13 00:32:55,275 INFO  AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad.
2021-03-13 00:32:55,281 WARN  RetryUtils - Failed to Starting gRPC server (attempt 1): java.io.IOException: Failed to bind
2021-03-13 00:32:55,482 WARN  RetryUtils - Failed to Starting gRPC server (attempt 2): java.io.IOException: Failed to bind
2021-03-13 00:32:55,883 WARN  RetryUtils - Failed to Starting gRPC server (attempt 3): java.io.IOException: Failed to bind
2021-03-13 00:32:56,384 WARN  RetryUtils - Failed to Starting gRPC server (attempt 4): java.io.IOException: Failed to bind
2021-03-13 00:32:56,885 WARN  RetryUtils - Failed to Starting gRPC server (attempt 5): java.io.IOException: Failed to bind
2021-03-13 00:32:57,386 WARN  RetryUtils - Failed to Starting gRPC server (attempt 6): java.io.IOException: Failed to bind
2021-03-13 00:32:57,387 ERROR GrpcDataServer - Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
java.io.IOException: Failed to bind
    at io.grpc.netty.NettyServer.start(NettyServer.java:252)
    at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
    at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
    at alluxio.grpc.GrpcServer.lambda$start$0(GrpcServer.java:77)
    at alluxio.retry.RetryUtils.retry(RetryUtils.java:39)
    at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
    at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:107)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:285)
    at alluxio.worker.DataServer$Factory.create(DataServer.java:47)
    at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:162)
    at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:46)
    at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:38)
    at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Permission denied
2021-03-13 00:32:57,391 ERROR AlluxioWorker - Fatal error: Failed to create worker process
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
    at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:168)
    at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:46)
    at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:38)
    at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
    at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:287)
    at alluxio.worker.DataServer$Factory.create(DataServer.java:47)
    at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:162)
    ... 3 more
Caused by: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/b7f2d035-e081-4a55-a9d3-a87351d8daad
    at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:285)
    ... 5 more
Caused by: java.io.IOException: Failed to bind
    at io.grpc.netty.NettyServer.start(NettyServer.java:252)
    at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
    at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
    at alluxio.grpc.GrpcServer.lambda$start$0(GrpcServer.java:77)
    at alluxio.retry.RetryUtils.retry(RetryUtils.java:39)
    at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
    at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:107)
    ... 10 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Permission denied

Upon investigating the alluxio-worker container in my Alluxio worker pod, I see that /opt/domain is mounted with root:root as the user:group:

[kubernetes@ip-172-31-76-132 shortcircuit]$ kubectl exec -it alluxio-worker-b28rm -c alluxio-job-worker /bin/bash
bash-4.4$ ls -l /opt
total 4
lrwxrwxrwx    1 alluxio  alluxio         13 Mar  5 20:17 alluxio -> alluxio-2.5.0
drwxr-xr-x    1 alluxio  alluxio         30 Mar  5 19:52 alluxio-2.5.0
drwxr-xr-x    3 root     root          4096 Mar  5 20:18 arthas
drwxr-xr-x    3 root     root           104 Mar  5 20:18 async-profiler
drwxr-xr-x    2 root     root             6 Mar 12 23:51 domain

According to the Kubernetes SecurityContext docs, setting fsGroup should have been sufficient, but the volume is still mounted with the default user:group of root:root.
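To connect the directory listing to the "bind(..) failed: Permission denied" error: binding a Unix domain socket to a filesystem path creates a socket file in that directory, which requires write and search permission on the directory. The following is a minimal Python sketch of the classic POSIX mode-bit check, not Alluxio or Kubernetes code. The function name `can_create_entry` and the UID/GID value 1000 for the alluxio user are assumptions made for illustration; the mode 0o755 corresponds to the `drwxr-xr-x` shown for /opt/domain above.

```python
import stat


def can_create_entry(dir_mode, dir_uid, dir_gid, proc_uid, proc_gids):
    """Return True if a process may create a file (e.g. a socket) in a directory.

    Classic POSIX mode-bit check only: ignores ACLs, Linux capabilities,
    and read-only mounts.
    """
    if proc_uid == 0:
        return True  # root bypasses the mode bits
    if proc_uid == dir_uid:
        need = stat.S_IWUSR | stat.S_IXUSR   # owner class
    elif dir_gid in proc_gids:
        need = stat.S_IWGRP | stat.S_IXGRP   # group class
    else:
        need = stat.S_IWOTH | stat.S_IXOTH   # other class
    return (dir_mode & need) == need


ROOT = 0
ALLUXIO = 1000  # hypothetical UID/GID for the alluxio user

# /opt/domain as observed: drwxr-xr-x root root
observed = can_create_entry(0o755, ROOT, ROOT, ALLUXIO, {ALLUXIO})

# What an effective fsGroup would be expected to provide:
# group ownership by the process's group, with group write permission.
expected = can_create_entry(0o775, ROOT, ALLUXIO, ALLUXIO, {ALLUXIO})

print(observed, expected)
```

Under these assumptions, the non-root worker falls into the "other" class for a root:root directory with mode 755, so it cannot create the socket file, which is consistent with the failure in the log.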

To Reproduce
Steps to reproduce the behavior (as minimally and precisely as possible)

Install Alluxio on a Kubernetes cluster following the Alluxio documentation.

  • Follow the guide up to and including step "2.1 Deploy", using the kubectl tab (this issue may also occur when using Helm, but I did not test that in my set-up)
  • Follow the steps in "3.2 Enable Short-circuit Access" using a hostPath volume (still with the kubectl tab)

Expected behavior
A clear and concise description of what you expected to happen.

The Alluxio worker pod(s) should be in the "Running" state.

Urgency
Describe the impact and urgency of the bug.

The Helm chart Kubernetes templates do not correctly configure the Alluxio worker pod for short-circuit access.

Additional context
Add any other context about the problem here.

See PR #13061 for the original discussion.

ZhuTopher added the type-bug label on Mar 18, 2021
@ZhuTopher
Contributor Author

From the original PR

New information about why the fsGroup didn't seem to apply (@jiacheliu3 PTAL):

As for why the container-level runAsUser and runAsGroup didn't apply to the volume:

  • The Kubernetes docs say container-level runAsUser and runAsGroup override Pod-level, but do not affect the volumes.
  • I can't find any documentation suggesting that Pod-level runAsUser or runAsGroup is supposed to affect the permissions of the volume, but in my testing environment it seemed to do so anyway.
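The precedence rule in the first bullet can be sketched in Python as follows. This only models the documented behavior that container-level runAsUser and runAsGroup override the Pod-level values, and that fsGroup exists only at the Pod level; it is not Kubernetes code. The function name `effective_identity` and all numeric IDs are assumptions made for illustration.

```python
def effective_identity(pod_sc, container_sc):
    """Merge Pod-level and container-level securityContext fields.

    Container-level runAsUser/runAsGroup take precedence over Pod-level.
    fsGroup is a Pod-level field, so it is taken from the Pod only.
    """
    merged = {}
    for key in ("runAsUser", "runAsGroup"):
        if key in container_sc:
            merged[key] = container_sc[key]
        elif key in pod_sc:
            merged[key] = pod_sc[key]
    if "fsGroup" in pod_sc:
        merged["fsGroup"] = pod_sc["fsGroup"]
    return merged


# Hypothetical values, not taken from the Alluxio templates.
pod_level = {"runAsUser": 0, "runAsGroup": 0, "fsGroup": 1000}
container_level = {"runAsUser": 1000, "runAsGroup": 1000}

identity = effective_identity(pod_level, container_level)
print(identity)
```

In this model the container process runs as 1000:1000, while volume ownership still depends entirely on whether the Pod-level fsGroup is actually applied to the volume in question.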

Conclusion

The main takeaway from all this is probably to change how the worker's volume is defined so that fsGroup works as expected. Our docs would need to be updated to say that users should not change how the PV for the worker is configured, and that if they do, they are responsible for ensuring these permissions apply to their volumes properly.

@jiacheliu3
Contributor

@ZhuTopher if this is resolved by #13061 we can close it.
