Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quarkus boot fails when vertx-cache temporary directory exists #7678

Closed
rsvoboda opened this issue Mar 8, 2020 · 49 comments · Fixed by #15541
Closed

Quarkus boot fails when vertx-cache temporary directory exists #7678

rsvoboda opened this issue Mar 8, 2020 · 49 comments · Fixed by #15541
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@rsvoboda
Copy link
Member

rsvoboda commented Mar 8, 2020

Quarkus boot fails when vertx-cache temporary directory exists.
Running on Windows 10, using GraalVM 20 JDK 11 based, native image

2020-03-08 16:15:06,354 ERROR [io.qua.application] (main) Failed to start application: java.lang.IllegalStateException: 
 Failed to create cache dir: C:\Users\ROSTIS~1\AppData\Local\Temp\\vertx-cache/file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859

First I thought the problem is in the file path which mixes Windows and Linux path separators, but that is not the problem.
The problem is that file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859 directory exists and Quarkus native binary boot fails due to that fact. This directory was generated during mvn clean install -Dnative execution but wasn't cleaned at the end.

Workaround is to remove that temporary directory and after this step Quarkus .exe app boots without problem, temporary directory gets cleaned during the shutdown. So following executions run without problem. The same file-cache directory gets generated on each run.

Few points:

  • Vert.x code should handle the case when vertx-cache temporary directory exists
  • Cleanup of the temporary dir doesn't happen on mvn clean install -Dnative execution
  • Temporary directory is resolved during the build time

The third point is the biggest issue from my perspective as it blocks portability (C:\Users\ROSTIS~1\AppData\Local\Temp\ would be expected on every system the app gets started?) I tried on another windows machine and the path is not fully hard-coded, user's tmp directory is reflected, but the final directory is always file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859,
Always the same path can be risky from a security perspective, knowledgeable users know where to look beforehand.

Steps to reproduce on sample bootstrap app:

mvn clean install -Dnative
target\getting-started-1.0-SNAPSHOT-runner.exe

Screenshot 2020-03-08 16 15 40

@rsvoboda rsvoboda added the kind/bug Something isn't working label Mar 8, 2020
@jaikiran
Copy link
Member

jaikiran commented Mar 9, 2020

but the final directory is always file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859

Looking at the code in vert.x[1], it always uses a call to UUID.randomUUID() for the directory name. Do you mean the same UUID is being generated across multiple runs of the build process? What Windows OS is this?

[1] https://github.com/eclipse-vertx/vert.x/blob/master/src/main/java/io/vertx/core/file/impl/FileResolver.java#L325

@cescoffier
Copy link
Member

I was wondering the same. While the base dir is fixed the actually name is generated at runtime.

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 9, 2020

I'm talking about native image on Windows 10.

In JVM mode different directory gets used with every run, with native image binary the directory is always the same (for that one binary).

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 9, 2020

While the base dir is fixed the actually name is generated at runtime.

True for JVM mode, not in native mode.

Native mode will make our heads hairless:)

@cescoffier
Copy link
Member

Just to be sure, it only happens on windows?

@gsmet
Copy link
Member

gsmet commented Mar 9, 2020

Maybe the boot failing is happening only on Windows but I would bet the hardcoded temp directory is a global issue if the directory name is generated at static init.

@cescoffier
Copy link
Member

The directory name is computed at runtime and is not stored in a static field (the File Resolver is created in the VertxImpl constructor). Also, the directory should be clearer on shutdown (at least it's what the code is trying to do...).

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 9, 2020

I've seen it on Windows only, maybe it's because Windows are more strict when it comes to filesystem and locks on files.

the directory should be clearer on shutdown

Sure, but that didn't happen with mvn clean install -Dnative, vertx-cache directory was left there and target\getting-started-1.0-SNAPSHOT-runner.exe (after maven command) failed because the directory existed.

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 9, 2020

So I tried on macOS and the tmp folder name is unique on every run of native app.
On macOS I'm using 19.3.1, on Windows machine I used GraalVM 20 as suggested in PR for Windows native by Tomaz.

3x execution of native binary with Installed features: [cdi, resteasy]

/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-2554416b-2b60-4a9e-bf62-4de1b02029e4
/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-def0845c-f09f-4888-8869-b7555b488519
/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-8eae8c90-220d-417b-81a3-b82c8f0de9b4

@cescoffier
Copy link
Member

cescoffier commented Mar 11, 2020 via email

@cescoffier
Copy link
Member

Unfortunately, I'm not able to build a native-image on Windows...

Error C2059: syntax error: '~'

@rsvoboda
Copy link
Member Author

@cescoffier do you have the right tools installed on your Windows box?
See #7663 (comment)

@jaikiran
Copy link
Member

@rsvoboda I forgot to ask, does this only happen with 20.x of GraalVM or even with 19.3.x?
Either way, I still can't see how this exception can happen unless either:

  • The UUID.randomString().toString() in that code someone gets inlined to a precomputed value, which seems impossible (even for native image semantics)
  • OR the UUID.randomString() on that setup seems to be generating duplicates (which apparently is possible). But for it to generate the duplicates this consistently seems really odd and points to a potentially bigger problem.
    Is this some Windows VM by the way or is Windows really the host OS?

@rsvoboda
Copy link
Member Author

It's my home laptop, Windows is the host OS, no VM. Using GraalVM 20, didn't try 19.3 as Windows support for native-image improved with GraalVM 20.

@rsvoboda
Copy link
Member Author

I tried to use 19.3.1 on Windows machine, but it can't be used with Quarkus, getting error

[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] C:\Program Files\GraalVM\graalvm-ce-java11-19.3.1\bin\native-image.cmd -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=1 -J-Duser.language=cs -J-Dfile.encoding=Cp1250 --initialize-at-build-time= -H:InitialCollectionPolicy=com.oracle.svm.core.genscavenge.CollectionPolicy$BySpaceAndTime -H:+JNI -jar getting-started-1.0-SNAPSHOT-runner.jar -H:FallbackThreshold=0 -H:+ReportExceptionStackTraces -H:-AddAllCharsets -H:-IncludeAllTimeZones -H:EnableURLProtocols=http -H:-UseServiceLoaderFeature -H:+StackTrace getting-started-1.0-SNAPSHOT-runner
Warning: the '=' character in program arguments is not fully supported.
Make sure that command line arguments using it are wrapped in double quotes.
Example:
"--vm.Dfoo=bar"

Error: Unknown arguments: org.jboss.logmanager.LogManager, io.quarkus.vertx.core.runtime.VertxLogDelegateFactory, true, DISABLED, 1, cs, Cp1250, com.oracle.svm.core.genscavenge.CollectionPolicy$BySpaceAndTime, 0, http, getting-started-1.0-SNAPSHOT-runner

@rsvoboda
Copy link
Member Author

Windows native-image based binary generates always the same UUID.

I filed an issue in GraalVM tracker - oracle/graal#2265

@karesti
Copy link
Member

karesti commented Feb 16, 2021

@rsvoboda @cescoffier I'm experiencing this issue with jvm and native mode in openshift. not sure what I'm doing wrong

@karesti
Copy link
Member

karesti commented Feb 16, 2021

not sure if this is the issue exactly
I get this:
2021-02-16 22:27:32,212 ERROR [io.qua.run.Application] (main) Failed to start application (with profile prod): java.lang.IllegalStateException: Failed to create cache dir: /tmp/vertx-cache/file-cache-840facac-d89f-4b7b-9980-0f48350ea364

@cescoffier
Copy link
Member

@karesti in your case I believe it's because you don't have the permission to write in /tmp, can you check that you have the permissions?

Any way to get a reproducer if that's not the case?

@karesti
Copy link
Member

karesti commented Feb 17, 2021

@cescoffier oh ok. I will check this with the team. How can I give those permissions in openshift ?

@karesti
Copy link
Member

karesti commented Feb 17, 2021

@cescoffier I will check with the team so I get it solved if it's a matter of permissions

@cescoffier
Copy link
Member

Normally on OpenShift /tmp is writable. Did it change?

@karesti
Copy link
Member

karesti commented Feb 18, 2021

So, the problem is that the docker file uses RUN chown 1001 /work
and in openshift the user changes so the /tmp file has no rights.
I fixed the issue by using this docker file:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.3
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && mkdir /tmp/vertx-cache \
    && chown 1001:root /work \
    && chmod 775 /tmp/vertx-cache

COPY --chown=1001:root target/*-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

@karesti
Copy link
Member

karesti commented Feb 18, 2021

@cescoffier I forgot to ping you in my comment above

@rsvoboda
Copy link
Member Author

rsvoboda commented Feb 18, 2021

The default Dockerfile.native is:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.3
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && chown 1001:root /work
COPY --chown=1001:root target/*-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

So it needs to be adjusted?

In other words: is /tmp/vertx-cache something specific just for your use-case or it's common for users to define it on their own?

@karesti
Copy link
Member

karesti commented Feb 18, 2021

@rsvoboda I did not use anything specific or custom. the quarkus app dit not start simply. I used 4 extensions: websockets, infinispan, rest and scheduler

@cescoffier
Copy link
Member

I just tried (with a different set of extensions but would still use vert.x) on an OpenShift on Azure and no problem.

D'oh didn't see the ubi-minimal version, I may have an older version.

@rsvoboda
Copy link
Member Author

Which version of OpenShift was used? Especially interested in the version used by @karesti.

@cescoffier
Copy link
Member

Just checked with the very last ubi:8.3 - no problem. So might be an OpenShift permission issue.

@karesti
Copy link
Member

karesti commented Feb 19, 2021

@rsvoboda I'm using a dedicated version 4.6.8

@cescoffier
Copy link
Member

@rsvoboda is that a setting you can reproduced? I don't have "dedicated", I only have cloud instances.

@rsvoboda
Copy link
Member Author

We were not able to reproduce. As if there were some specific permissions specified in the used OpenShift cluster.
@pjgg was trying to reproduce this trouble.

@pjgg
Copy link
Contributor

pjgg commented Feb 22, 2021

@karesti I am trying to reproduce your issue without success.
Could you try to deploy on your cluster the following image:

quay.io/pjgg/webclient:0.0.1

For example, a valid YAML could be something like this:

apiVersion: v1
kind: Pod
metadata:
  name: security-demo
spec:
  volumes:
  - name: demo
    emptyDir: {}
  containers:
  - name: demo
    image: quay.io/pjgg/webclient:0.0.1
    livenessProbe:
      httpGet:
          path: /q/health/live
          port: 8080
      initialDelaySeconds: 59
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
          path: /q/health/ready
          port: 8080
      initialDelaySeconds: 30
      timeoutSeconds: 5

@karesti
Copy link
Member

karesti commented Feb 23, 2021

@pjgg this pod deploy worked! I used the scheduler and websockets extension (if this might be an issue)

@karesti
Copy link
Member

karesti commented Feb 23, 2021

I will test the deployment in another way

@karesti
Copy link
Member

karesti commented Feb 23, 2021

@pjgg So I tested using deployment config and 2 pods and works as well. Might the scheduler or websockets extensions be related?

@pjgg
Copy link
Contributor

pjgg commented Feb 24, 2021

I am not sure, yet. Could you show me your dependencies? (maven/gradle) just to set up an scenario with the same version as you?.

When you said:

I tested using deployment config and 2 pods and works as well.

Do you mean that you use your application and the default Docker file, and works?

I have tested with a scheduler (quartz), and looks ok!.
Could be tested on your environment, with this docker image: quay.io/pjgg/webclient:0.0.1-a

@thomas-mc-work
Copy link
Contributor

Another aspect that might be relevant: I'm running two instances of the same native runner on a machine on separate user accounts. The first one can be launched normally. The second one throws the same exception as above. The cause is that the first user has created and is thus owning /tmp/vertx-cache. This makes the second process fail to write into the same path.

Both users have defined the paths to separate temp dirs in the environment, e.g. for one of them:

# env | grep -i tmp
TEMPDIR=/tmp/user/107329
TMPDIR=/tmp/user/107329
TEMP=/tmp/user/107329
TMP=/tmp/user/107329

I guess that the /tmp base path is either hard coded or taken from my development machine (or even from within the GraalVM container). So reading this value at runtime would be a benefit.

For me the workaround using -Djava.io.tmpdir=$TMPDIR has helped.

@cescoffier
Copy link
Member

The vertx-cache directory should be world-writable, and it might not be the case. I need to investigate this.

@cescoffier
Copy link
Member

So, the vertx-cache is configured by quarkus but created by vert.x.
A workaround could be to generate a unique cache at each startup.

@rsvoboda any concern with such an approach?

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 1, 2021

I spoke with @pjgg about this. In general we are ok with such change.

The only concern we see is around @karesti use-case whe Katia worked with known path /tmp/vertx-cache. On the other side we were not able to reproduce her problem. If the unique cache is based on timestamp we could create a vertx-cache/4163465465465 temporal folder, instead of vertx-cache-4163465465465 so karesti workaround could still work.

About unique cache generate at each startup - also in NATIVE mode?

@cescoffier
Copy link
Member

Yes, it's going to be generated on every application start, so even in native mode, multiple runs will generate multiple directories.

@cescoffier cescoffier self-assigned this Mar 7, 2021
cescoffier added a commit to cescoffier/quarkus that referenced this issue Mar 8, 2021
…table.

Under that directory, another random directory is created that is only readable and writable from the current user.

The vertx-cache directory creation is disabled if:

* the user specifies a cache directory
* the vertx-cache directory already exists

Fix quarkusio#7678
@quarkus-bot quarkus-bot bot added this to the 1.13 - master milestone Mar 10, 2021
@juangon
Copy link

juangon commented Feb 24, 2022

FYI @rsvoboda I am able to reproduce this error on Quarkus 2.7.X in NO NATIVE (didn't tried on native), not reproducible in Quarkus 2.6.3.Final.

@juangon
Copy link

juangon commented Feb 24, 2022

Oh BTW, not testing in Windows. It happens in a Kubernetes cluster, using a new PVC mounted in /tmp directory. None of this fails in Quarkus 2.6.3.Final

@rsvoboda
Copy link
Member Author

Please file new issue with steps to reproduce, this sounds like a new variant of the problem.

@juangon
Copy link

juangon commented Mar 1, 2022

Please file new issue with steps to reproduce, this sounds like a new variant of the problem.

Ok so it seems I had to create an init-containers in order to set the permissions properly. Thanks very much!

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 2, 2022

@juangon Can you provide a snippet of changes you did so that other people can learn from your fix? Assuming you have custom yaml descriptor and not relying on kubernetes yaml generated by Quarkus extension.

@juangon
Copy link

juangon commented Mar 2, 2022

Yes sure, I don't have a custom YAML descriptor, but will write here anyway.

Here is the PVC for a microservice:

quarkus.kubernetes.pvc-volumes.backend-persistent-storage.claim-name=pv-claim
quarkus.kubernetes.mounts.backend-persistent-storage.path=/tmp

And here the init-containers

quarkus.kubernetes.init-containers.set-tmp-permissions.image=registry.access.redhat.com/ubi8/ubi:latest
quarkus.kubernetes.init-containers.set-tmp-permissions.mounts.backend-persistent-storage.path=/tmp
quarkus.kubernetes.init-containers.set-tmp-permissions.command=chown,-R,185,/tmp

User is 185 as existing in latest Dockerfile

@rsvoboda
Copy link
Member Author

rsvoboda commented Mar 2, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
8 participants