Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace read/write lock in JarResource to avoid virtual threads pinning #42139

Merged
merged 1 commit into from
Jul 31, 2024

Conversation

mariofusco
Copy link
Contributor

@mariofusco mariofusco commented Jul 25, 2024

This pull request fixes the race condition present in this commit that, because of it, had to be reverted.

/cc @geoand @gsmet @franz1981 @jedla97 @michalvavrik

Fixes: #42067

The nature of the fix is described at #42139 (review)

@quarkus-bot quarkus-bot bot added the area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins label Jul 25, 2024

private static JarFileReference asyncLoadAcquiredJarFile(JarResource jarResource) {
CompletableFuture<JarFileReference> newJarRefFuture = new CompletableFuture<>();
CompletableFuture<JarFileReference> existingJarRefFuture = null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The introduction of this variable is the actual fix. The old algorithm wrongly reused the newJarFileRef one. This means that in the unfortunate case when a thread was about to close the old jar, but before setting the AtomicReference to null, another thread could fail the compareAndSet(null) check here, take at the end of the loop from the AtomicReference the old and no longer valid CompletableFuture and in subsequent loop (after that the close finally managed to set the AtomicReference to null) mistakenly set it back into the same AtomicReference.

@jedla97
Copy link
Contributor

jedla97 commented Jul 25, 2024

@famod can you check as well if it's fixed on your end? As you encounter that problem as well.

I'll will check it on our side but it's always good to have second confirmation.

@geoand
Copy link
Contributor

geoand commented Jul 25, 2024

Thanks for the fix @mariofusco !

@jedla97 thanks for signing up to test the fix! Hopefully @famod can do so as well

This comment has been minimized.

@jedla97
Copy link
Contributor

jedla97 commented Jul 25, 2024

@mariofusco Big thanks for the fix as I believe it was hard to find it. I tested it (run two reproducer multiple times) and the race condition didn't appear. So from my POV the original issue is fixed (hopefully we won't see any different 🙏 ).

This comment has been minimized.

@franz1981
Copy link
Contributor

It would be great if there is a test which reproduce it too, with the changes to fix

@famod
Copy link
Member

famod commented Jul 26, 2024

Sorry, I am unable to test this ATM. On my work machine, I'm getting a puzzling dependency resolution issue while building, it's related to gradle-tooling.

@mariofusco
Copy link
Contributor Author

It would be great if there is a test which reproduce it too, with the changes to fix

I did a few more checks on this fix and also added a proper unit test. The unit test reproduces exactly the same situation that caused the IllegalStateException in the former version of this class loader as I described it here. Of course, being a race condition, the test is not deterministic and it fails with the older version extremely seldomly. However it's descriptive of the problem and I'm afraid it will be impossible to make it reproducible in a more reliable way.

@geoand @franz1981 the pull request has been already rebased and squashed as usual, so it should be ready to be merged. Please let me know if this is enough or you see any room for further improvements.

Copy link
Contributor

@franz1981 franz1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM well done as usual Mario!

This comment has been minimized.

Copy link
Contributor

@geoand geoand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

This comment has been minimized.

@mariofusco
Copy link
Contributor Author

Failing Jobs

Status Name Step Failures Logs Raw logs Build scan
✖ JVM Tests - JDK 17 Build Failures Logs Raw logs 🔍
✖ JVM Tests - JDK 17 Windows Build Failures Logs Raw logs 🔍

These failures are caused by the fact that I compiled the minimal jars that I used for this test with Java 21, so they cannot be loaded with Java 17.

java.lang.UnsupportedClassVersionError: org/simple/SimplePojo1 has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0

I will recompile those jars with Java 17 and replace them in this pull request immediately.

add test

format test

recompile test jars with jdk17
@mariofusco
Copy link
Contributor Author

I will recompile those jars with Java 17 and replace them in this pull request immediately.

Done.

@geoand geoand added the triage/waiting-for-ci Ready to merge when CI successfully finishes label Jul 30, 2024

This comment has been minimized.

Copy link

quarkus-bot bot commented Jul 30, 2024

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit 0c479ce.

Failing Jobs

Status Name Step Failures Logs Raw logs Build scan
✔️ Native Tests - Messaging1 Failures Logs Raw logs 🚧

Full information is available in the Build summary check run.

Failures

⚙️ Native Tests - Messaging1 #

📦 integration-tests/kafka-avro-apicurio2

io.quarkus.it.kafka.KafkaAvroIT.testApicurioAvroConsumer - History - More details - Source on GitHub

java.lang.RuntimeException: 
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
	[error]: Build step io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor#startKafkaDevService threw an exception: java.lang.RuntimeException: org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageName=docker.io/vectorized/redpanda:v24.1.2, imagePullPolicy=DefaultPullPolicy(), imageNameSubstitutor=org.testcontainers.utility.ImageNameSubstitutor$LogWrappedImageNameSubstitutor@74c1204d)
	at io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor.startKafkaDevService(DevServicesKafkaProcessor.java:105)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:732)
	at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:854)
	at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18...

io.quarkus.it.kafka.KafkaAvroIT.testApicurioAvroProducer - History - More details - Source on GitHub

java.lang.RuntimeException: 
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
	[error]: Build step io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor#startKafkaDevService threw an exception: java.lang.RuntimeException: org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageName=docker.io/vectorized/redpanda:v24.1.2, imagePullPolicy=DefaultPullPolicy(), imageNameSubstitutor=org.testcontainers.utility.ImageNameSubstitutor$LogWrappedImageNameSubstitutor@66eb61f4)
	at io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor.startKafkaDevService(DevServicesKafkaProcessor.java:105)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:732)
	at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:854)
	at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18...

io.quarkus.it.kafka.KafkaAvroIT.testConfluentAvroConsumer - History - More details - Source on GitHub

java.lang.RuntimeException: 
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
	[error]: Build step io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor#startKafkaDevService threw an exception: java.lang.RuntimeException: org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageName=docker.io/vectorized/redpanda:v24.1.2, imagePullPolicy=DefaultPullPolicy(), imageNameSubstitutor=org.testcontainers.utility.ImageNameSubstitutor$LogWrappedImageNameSubstitutor@563e520)
	at io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor.startKafkaDevService(DevServicesKafkaProcessor.java:105)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:732)
	at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:854)
	at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)...

io.quarkus.it.kafka.KafkaAvroIT.testConfluentAvroProducer - History - More details - Source on GitHub

java.lang.RuntimeException: 
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
	[error]: Build step io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor#startKafkaDevService threw an exception: java.lang.RuntimeException: org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageName=docker.io/vectorized/redpanda:v24.1.2, imagePullPolicy=DefaultPullPolicy(), imageNameSubstitutor=org.testcontainers.utility.ImageNameSubstitutor$LogWrappedImageNameSubstitutor@171be7de)
	at io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor.startKafkaDevService(DevServicesKafkaProcessor.java:105)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:732)
	at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:854)
	at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18...

io.quarkus.it.kafka.KafkaAvroIT.testUrls - History - More details - Source on GitHub

java.lang.RuntimeException: 
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
	[error]: Build step io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor#startKafkaDevService threw an exception: java.lang.RuntimeException: org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageName=docker.io/vectorized/redpanda:v24.1.2, imagePullPolicy=DefaultPullPolicy(), imageNameSubstitutor=org.testcontainers.utility.ImageNameSubstitutor$LogWrappedImageNameSubstitutor@1717cdcc)
	at io.quarkus.kafka.client.deployment.DevServicesKafkaProcessor.startKafkaDevService(DevServicesKafkaProcessor.java:105)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:732)
	at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:854)
	at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18...

Flaky tests - Develocity

⚙️ JVM Tests - JDK 21

📦 integration-tests/opentelemetry

io.quarkus.it.opentelemetry.OpenTelemetryInjectionsTest.testOTelInjections - History

  • Condition with Lambda expression in io.quarkus.it.opentelemetry.OpenTelemetryInjectionsTest was not fulfilled within 5 seconds. - org.awaitility.core.ConditionTimeoutException
org.awaitility.core.ConditionTimeoutException: Condition with Lambda expression in io.quarkus.it.opentelemetry.OpenTelemetryInjectionsTest was not fulfilled within 5 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
	at io.quarkus.it.opentelemetry.OpenTelemetryInjectionsTest.reset(OpenTelemetryInjectionsTest.java:26)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)

@mariofusco
Copy link
Contributor Author

Is this ready to be merged? Since the 3.13.0 is already out there shouldn't be any reason to hold it, right?

@geoand
Copy link
Contributor

geoand commented Jul 31, 2024

Correct

@geoand geoand merged commit 085556a into quarkusio:main Jul 31, 2024
53 checks passed
@quarkus-bot quarkus-bot bot added kind/bugfix and removed triage/waiting-for-ci Ready to merge when CI successfully finishes labels Jul 31, 2024
@quarkus-bot quarkus-bot bot added this to the 3.14 - main milestone Jul 31, 2024
@geoand
Copy link
Contributor

geoand commented Jul 31, 2024

Let's keep an eye out on weird failures as we did with the previous iteration.

@mariofusco
Copy link
Contributor Author

Let's keep an eye out on weird failures as we did with the previous iteration.

Thank you for merging. Sure, please ping me as soon as you will notice something wrong.

@geoand
Copy link
Contributor

geoand commented Jul 31, 2024

Will do! Thanks again for tackling this!

If anything comes up, it will be by QE running their testsuite over the next few days. Otherwise if there is an issue, reports from users will start coming in when 3.14.0.CR1 is released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins kind/bugfix triage/flaky-test
Projects
None yet
5 participants