Docker failed to start due to failure of creating directories under 'var' #11166

Closed
FrankChen021 opened this issue Apr 27, 2021 · 2 comments · Fixed by #11167

@FrankChen021
Member

Affected Version

This problem was first reported against 0.21.0-rc1. It also exists on the master branch.

Description

When starting a Druid cluster in Docker with the docker-compose file (distribution/docker/docker-compose.yml), all Druid service nodes fail to start with messages like the following:

coordinator      | mkdir: can't create directory 'var/tmp': Permission denied
coordinator      | mkdir: can't create directory 'var/druid/': Permission denied
coordinator      | mkdir: can't create directory 'var/druid/': Permission denied
coordinator      | mkdir: can't create directory 'var/druid/': Permission denied
coordinator      | mkdir: can't create directory 'var/druid/': Permission denied
coordinator      | mkdir: can't create directory 'var/druid/': Permission denied

Inside the container, listing the owners of all directories under /opt/druid shows:

/opt/apache-druid-0.21.0 $ ls -l
total 196
-rw-r--r--    1 druid    druid        70924 Apr 27 04:17 LICENSE
-rw-r--r--    1 druid    druid        71187 Apr 27 04:17 NOTICE
-rw-r--r--    1 druid    druid         8228 Apr 27 04:17 README
drwxr-xr-x    2 druid    druid         4096 Apr 27 09:13 bin
drwxr-xr-x    5 druid    druid         4096 Apr 27 09:13 conf
drwxr-xr-x   29 druid    druid         4096 Apr 27 09:13 extensions
drwxr-xr-x    3 druid    druid         4096 Apr 27 09:13 hadoop-dependencies
drwxr-xr-x    2 druid    druid        12288 Apr 27 09:13 lib
drwxr-xr-x    4 druid    druid         4096 Apr 16 18:33 licenses
drwxr-xr-x    4 druid    druid         4096 Apr 27 09:13 quickstart
drwxr-xr-x    2 root     root          4096 Apr 26 10:38 var

Note that the var directory belongs to root instead of druid. Since the process inside the container is launched as user druid, it has no permission to create directories under var.
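To confirm this from inside a container, a small check like the one below can help. It is only a sketch: check_writable_dir is a hypothetical helper name, not part of druid.sh, and it simply reports whether the current user can write to a given directory.

```shell
# Hypothetical pre-flight helper (an assumption, not part of druid.sh).
# Reports whether the current user can create entries in the given directory.
check_writable_dir() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "$dir: missing"
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "$dir: not writable by $(id -un)"
    ls -ld "$dir"   # show the actual owner and mode for diagnosis
    return 1
  fi
  echo "$dir: writable by $(id -un)"
}
```

Calling check_writable_dir /opt/druid/var as user druid in the affected image should take the "not writable" branch and print the root-owned listing.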

Analysis

This problem was introduced by #10506. The Dockerfile after #10506 reads:

RUN addgroup -S -g 1000 druid \
 && adduser -S -u 1000 -D -H -h /opt/druid -s /bin/sh -g '' -G druid druid \
 && mkdir -p /opt/druid/var \
 && chown -R druid:druid /opt \
 && chmod 775 /opt/druid/var

COPY --chown=druid:druid --from=builder /opt /opt
COPY distribution/docker/druid.sh /druid.sh

First, the RUN instruction creates /opt/druid/var and changes the owner of /opt and all its sub-directories to druid. This instruction looks fine on its own.

However, the following instruction, COPY --chown=druid:druid --from=builder /opt /opt, replaces the entire /opt, including its sub-directory /opt/druid/var, so the directory no longer exists in the image.

Since /opt/druid/var is declared as a VOLUME, Docker is responsible for creating the directory when the cluster is brought up. Because Docker runs as root on the user's machine, the owner of var becomes root instead of the expected druid.

Before #10506 there was no such problem. In the script below, /opt/druid/var is created after COPY, so the directory exists in the image after the build.

COPY --from=builder /opt /opt
COPY distribution/docker/druid.sh /druid.sh

RUN addgroup -S -g 1000 druid \
 && adduser -S -u 1000 -D -H -h /opt/druid -s /bin/sh -g '' -G druid druid \
 && mkdir -p /opt/druid/var \
 && chown -R druid:druid /opt \
 && chmod 775 /opt/druid/var
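
The effect of the two orderings can be imitated without Docker. The sketch below is a simulation only: plain directories stand in for image layers, and copy_opt is a hypothetical stand-in for the COPY instruction that assumes the pre-existing real directory opt/druid is replaced by the builder's symlink, which is what the build log in this report shows. The names make_builder, copy_opt, broken and fixed are all illustrative.

```shell
# Simulation only: no Docker involved, directories stand in for image layers.

# Fake "builder" stage: a versioned install dir plus the druid symlink.
make_builder() {
  mkdir -p "$1/opt/apache-druid-0.21.0/bin"
  ln -s apache-druid-0.21.0 "$1/opt/druid"
}

# Assumed stand-in for: COPY --from=builder /opt /opt
# The existing real directory opt/druid is dropped and the symlink takes its place.
copy_opt() {
  rm -rf "$2/opt/druid"
  mkdir -p "$2/opt"
  cp -a "$1/opt/." "$2/opt/"
}

work=$(mktemp -d)
make_builder "$work/builder"

# Order after #10506: mkdir first, COPY second.
mkdir -p "$work/broken/opt/druid/var"
copy_opt "$work/builder" "$work/broken"

# Order before #10506: COPY first, mkdir second (mkdir -p follows the symlink).
copy_opt "$work/builder" "$work/fixed"
mkdir -p "$work/fixed/opt/druid/var"

for img in broken fixed; do
  if [ -d "$work/$img/opt/druid/var" ]; then
    echo "$img: var exists"
  else
    echo "$img: var missing"
  fi
done
```

Under these assumptions the script reports var as missing for the first ordering and present for the second, matching the behavior described above.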

Some proof

To track down the problem, I added "ls" commands to the Dockerfile to observe the directories and their owners during the image build.

  1. Directories before the COPY command: the druid directory created by the RUN instruction before COPY is present.
Step 12/20 : RUN ["ls", "-l", "/opt"]
 ---> Running in c33a81079773
total 4
drwxr-xr-x    3 druid    druid         4096 Apr 27 09:20 druid
  2. Execute the COPY commands.
Step 13/20 : COPY --chown=druid:druid --from=builder /opt /opt
Step 14/20 : COPY distribution/docker/druid.sh /druid.sh
  3. Directories after COPY: druid is now the symbolic link created at the beginning of the Dockerfile.
Step 15/20 : RUN ["ls", "-l", "/opt/druid"]
lrwxrwxrwx    1 druid    druid           24 Apr 27 09:20 /opt/druid -> /opt/apache-druid-0.21.0
  4. Contents of /opt/apache-druid-0.21.0: note that there is NO var directory.
Step 16/20 : RUN ["ls", "-l", "/opt/apache-druid-0.21.0"]
total 192
-rw-r--r--    1 druid    druid        70924 Apr 27 04:17 LICENSE
-rw-r--r--    1 druid    druid        71187 Apr 27 04:17 NOTICE
-rw-r--r--    1 druid    druid         8228 Apr 27 04:17 README
drwxr-xr-x    2 druid    druid         4096 Apr 27 09:20 bin
drwxr-xr-x    5 druid    druid         4096 Apr 27 09:20 conf
drwxr-xr-x   29 druid    druid         4096 Apr 27 09:20 extensions
drwxr-xr-x    3 druid    druid         4096 Apr 27 09:20 hadoop-dependencies
drwxr-xr-x    2 druid    druid        12288 Apr 27 09:20 lib
drwxr-xr-x    4 druid    druid         4096 Apr 16 18:33 licenses
drwxr-xr-x    4 druid    druid         4096 Apr 27 09:20 quickstart

I'm not sure why this problem doesn't show up in some other environments. My guess is that it has something to do with VOLUME, which I'm not familiar with: since the volume also lives on the host, if such a directory already exists there (say, created by a previous image), the var dir won't be created as root.

Fix

The fix I can come up with is to move mkdir -p /opt/druid/var after the COPY command.
As for what #10506 tried to solve, the change I propose only creates a new directory and makes no changes to existing files, so it won't double the image size.

In my test environment, the image size is 547 MiB.

cc @jihoonson @gianm

@FrankChen021 FrankChen021 added Bug Docker https://hub.docker.com/r/apache/druid labels Apr 27, 2021
@xvrl
Member

xvrl commented Apr 28, 2021

@FrankChen021 this might be specific to Docker for Mac and how it mounts volumes. There might also be some issue with how COPY handles the existing var directory, complicated by the fact that /opt/druid is a symlink.
This might cause things to work a little differently across Docker versions and OSes.

A workaround is to always mount a volume on the docker run command line. At least in my testing this somehow makes /opt/druid/var take on the right druid:druid ownership instead of root when the volume is left unset.

@EwanValentine

I'm seeing this on K8s/eks, using the Druid Operator as well

xvrl pushed a commit that referenced this issue Apr 30, 2021
Docker volume directory was accidentally removed due to reordering of statements.
This causes ownership and permissions on the volume directory to be reset, preventing startup.

fixes #11166
Signed-off-by: frank chen <[email protected]>
@clintropolis clintropolis added this to the 0.21.1 milestone May 4, 2021
clintropolis pushed a commit that referenced this issue May 4, 2021