Fix randomly failures when deploying to OpenShift/K8s on JVM/Native #31503

Merged
merged 1 commit into from
Mar 3, 2023

Conversation

Sgitario
Contributor

@Sgitario Sgitario commented Mar 1, 2023

Finally, I've found out the root cause of this issue. There were several places where a new KubernetesClient was being created:

  • Some places use Clients.fromConfig.
  • Other places use KubernetesClientUtils.createConfig.
  • And other places use KubernetesClientBuildItem.getClient

And the problem is that every time we invoke any of these methods, Fabric8 KubernetesClient tries to locate one HttpClient.Factory and it seems that the logic to get one HttpClient.Factory sometimes gets the VertxHttpClientFactory implementation over the expected one QuarkusHttpClientFactory.

The previous solution of keeping the Disable Http Dns system property worked fine because this property is used by both VertxHttpClientFactory and QuarkusHttpClientFactory.

However, I've updated the pull request with a better and more efficient solution: it avoids looking up an HttpClient.Factory via the ServiceLoader logic in Fabric8 Kubernetes Client and instead always provides the expected QuarkusHttpClientFactory implementation directly.
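To illustrate the difference between discovering a factory and passing one explicitly, here is a toy sketch in plain Java. The `Factory` enum, `discover`, and `createClient` are hypothetical stand-ins and not Fabric8 or Quarkus API. The "first non-default candidate" rule is only an assumption modelled on the wording of the Fabric8 warning message.

```java
import java.util.List;

// Toy model: these names are illustrative stand-ins, not Fabric8 or Quarkus classes.
class ExplicitFactorySketch {

    enum Factory {
        VERTX(false),
        QUARKUS(false);

        final boolean isDefault;

        Factory(boolean isDefault) {
            this.isDefault = isDefault;
        }
    }

    // Discovery: take the first non-default candidate, so the winner depends on
    // the order in which the candidates happen to be found.
    static Factory discover(List<Factory> candidatesInDiscoveryOrder) {
        return candidatesInDiscoveryOrder.stream()
                .filter(f -> !f.isDefault)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no factory found"));
    }

    // Explicit wiring: the caller passes the factory it wants; nothing is looked up.
    static String createClient(Factory factory) {
        return "client backed by " + factory;
    }

    public static void main(String[] args) {
        // Same candidates, different discovery order, different winner.
        System.out.println(createClient(discover(List.of(Factory.QUARKUS, Factory.VERTX))));
        System.out.println(createClient(discover(List.of(Factory.VERTX, Factory.QUARKUS))));
        // Explicit wiring does not depend on discovery order at all.
        System.out.println(createClient(Factory.QUARKUS));
    }
}
```

In this model the discovered factory changes with the candidate order, while the explicitly wired one never does, which is the property the build-time code relies on.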

Also, the warning messages that always appeared are now gone:

[WARNING] [io.netty.resolver.dns.DefaultDnsServerAddressStreamProvider] Default DNS servers: [/[2001:4860:4860:0:0:0:0:8888]:53, /[2001:4860:4860:0:0:0:0:8844]:53] (Google Public DNS as a fallback)
[WARNING] There are multiple httpclient implementation in the classpath, choosing the first non-default implementation. You should exclude dependencies that aren't needed or use an explicit association of the HttpClient.Factory.

Fix #31476

@Sgitario Sgitario requested a review from geoand March 1, 2023 10:19
@Sgitario Sgitario requested a review from yrodiere March 1, 2023 10:20
@Sgitario
Contributor Author

Sgitario commented Mar 1, 2023

@yrodiere with these changes, it seems to be consistently working for me. Could you try it yourself too just to be sure?

@geoand
Contributor

geoand commented Mar 1, 2023

Can you please explain this change. I don't understand it

@Sgitario
Contributor Author

Sgitario commented Mar 1, 2023

Can you please explain this change. I don't understand it

This is based on my findings here: #31476 (comment)

Member

@yrodiere yrodiere left a comment

I can't easily test with a Quarkus snapshot as I've only ever deployed from CI, but @gsmet is giving it a try with another app that he usually deploys from his laptop, so let's wait for that.

As to the solution, I'm not sure I understand the content, but I added a comment about the form :)

@geoand
Contributor

geoand commented Mar 1, 2023

Can you please explain this change. I don't understand it

This is based on my findings here: #31476 (comment)

I'd like a more complete explanation please. Both to help me understand why this change fixes the problem and to have an easily accessible record of why this change was made.

@Sgitario
Contributor Author

Sgitario commented Mar 1, 2023

Can you please explain this change. I don't understand it

This is based on my findings here: #31476 (comment)

I'd like a more complete explanation please. Both to help me understand why this change fixes the problem and to have an easily accessible record of why this change was made.

My guess was that the system property was cleared and a new Vert.x instance was created with the wrong value. Anyway, @gsmet is getting a different error after trying these changes. So, I'm changing this pull request to draft and will keep working on it.

@Sgitario Sgitario marked this pull request as draft March 1, 2023 12:17
@geoand
Contributor

geoand commented Mar 1, 2023

So, I'm changing this pull request to draft and keep working on it.

👍🏼.

Let me know if you need me to look into anything

@Sgitario Sgitario marked this pull request as ready for review March 2, 2023 05:41
@Sgitario Sgitario requested a review from yrodiere March 2, 2023 05:44
@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

@geoand I've just updated the pull request, more information about the changes is in the description.
@yrodiere I've tried these changes using the GitHub lottery repo and it seems to be consistently working fine in both JVM and Native.

@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

@gsmet I've also tried to deploy the [quarkus-github-bot](https://github.com/quarkusio/quarkus-github-bot) repo into OpenShift and it's working fine too (no warning messages are printed).

@gsmet @yrodiere I would like you to try these changes yourselves too, just to be sure that all the issues are gone.

@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

There is a regression with these changes that causes an out-of-memory error. I can reproduce it locally and am investigating.

@geoand
Contributor

geoand commented Mar 2, 2023

That almost certainly means that there are multiple Vert.x instances that have not been closed

@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

It should be fixed with the latest changes.

@@ -12,7 +14,9 @@ public class KubernetesClientBuildStep {
     private KubernetesClientBuildConfig buildConfig;

     @BuildStep
-    public KubernetesClientBuildItem process(TlsConfig tlsConfig) {
-        return new KubernetesClientBuildItem(createConfig(buildConfig, tlsConfig));
+    public KubernetesClientBuildItem process(TlsConfig tlsConfig, QuarkusBuildCloseablesBuildItem closeablesBuildItem) {
Contributor

TIL about QuarkusBuildCloseablesBuildItem :)
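For readers who have not met the idea either: the general pattern is to register every resource created during the build in one place, so that all of them are closed when the build ends. Unclosed clients are the kind of leak the out-of-memory comment above points at. The sketch below is a toy in plain Java under that assumption. `BuildCloseablesSketch` and `FakeClient` are hypothetical names and do not reproduce how `QuarkusBuildCloseablesBuildItem` is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy registry: collects resources and closes them all when the build is done.
class BuildCloseablesSketch implements AutoCloseable {

    // Counter used only to make leaks visible in this example.
    static int openClients = 0;

    // Stand-in for an expensive client that must be closed after use.
    static final class FakeClient implements AutoCloseable {
        FakeClient() {
            openClients++;
        }

        @Override
        public void close() {
            openClients--;
        }
    }

    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    // Registers a resource and returns it, so creation and registration fit on one line.
    <T extends AutoCloseable> T add(T resource) {
        resources.push(resource);
        return resource;
    }

    // Closes resources in reverse registration order; one failure does not stop the rest.
    @Override
    public void close() {
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                System.err.println("Failed to close resource: " + e);
            }
        }
    }

    public static void main(String[] args) {
        try (BuildCloseablesSketch closeables = new BuildCloseablesSketch()) {
            closeables.add(new FakeClient());
            closeables.add(new FakeClient());
            System.out.println("open during build: " + openClients);
        }
        System.out.println("open after build: " + openClients);
    }
}
```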

@gsmet
Member

gsmet commented Mar 2, 2023

I will test it with the Quarkus Bot and the production cluster this afternoon.

@gsmet
Member

gsmet commented Mar 2, 2023

So... I don't think it's the same issue but I still have an error when pushing the Quarkus bot to the production cluster (error which apparently doesn't prevent the deployment as I have a success in the end):

[INFO] [io.quarkus.container.image.openshift.deployment.OpenshiftProcessor] Applied: ImageStream ubi-quarkus-native-binary-s2i
[INFO] [io.quarkus.container.image.openshift.deployment.OpenshiftProcessor] Applied: ImageStream quarkus-bot
[INFO] [io.quarkus.container.image.openshift.deployment.OpenshiftProcessor] Applied: BuildConfig quarkus-bot
[ERROR] Failed to upload archive file for the build: quarkus-bot
[ERROR] Please check cluster events via `oc get events` to see what could have possibly gone wrong
[WARNING] [io.quarkus.container.image.openshift.deployment.OpenshiftProcessor] An exception: 'Can't instantiate binary build, due to error reading/writing stream. Can be caused if the output stream was closed by the server.See if something's wrong in recent events in Cluster = Scheduled quarkus-bot-1-build.17489735390326d6 Successfully assigned prod-quarkus-bot/quarkus-bot-1-build to ip-10-0-142-232.us-west-2.compute.internal by ip-10-0-182-138
Created quarkus-bot-1-build.17489735b5f74d2d Created container git-clone
Started quarkus-bot-1-build.17489735b86fbe3a Started container git-clone
BuildStarted quarkus-bot-1.17489735e85618ad Build prod-quarkus-bot/quarkus-bot-1 is now running
AddedInterface quarkus-bot-1-build.17489735a33f5430 Add eth0 [10.128.3.49/23] from openshift-sdn

Talking about: [ERROR] Failed to upload archive file for the build: quarkus-bot.

@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

Talking about: [ERROR] Failed to upload archive file for the build: quarkus-bot.

I've also seen this issue even without this change. My guess is that the client is closing the connection after some time. How long did the build take for you? I would like to address this issue as part of this pull request as well, so I'll investigate it.

@gsmet
Member

gsmet commented Mar 2, 2023

The build is a native one so it's a bit long, something like 2-3 minutes.

@Sgitario
Contributor Author

Sgitario commented Mar 2, 2023

After deploying a Native build to OpenShift up to 10 times, I could reproduce the latter issue only once, and it seems to be caused by OpenShift rejecting the compressed file:

error: unable to extract binary build input, must be a zip, tar, or gzipped tar, or specified as a file: exit status 1 

And I think this is caused by Files.move (see here). It moves the file, but depending on your environment, the file might not be fully ready yet and hence OpenShift can't read it. Adding the ATOMIC_MOVE flag should address this issue.

With this change, I've tried the deployment to OpenShift and I could not reproduce the issue any longer.
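As a rough illustration of the flag, here is a minimal sketch using the standard `java.nio.file` API. The `publish` helper, the file names, and the idea of staging the archive in a temporary file next to the target are my own assumptions for the example, not the code changed in this pull request.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicMoveSketch {

    // Hypothetical helper: writes the payload to a temporary file in the target's
    // directory, then renames it into place as a single file system operation.
    static Path publish(Path target, byte[] payload) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "archive-", ".tmp");
        Files.write(tmp, payload);
        // With ATOMIC_MOVE the move is performed atomically, or an
        // AtomicMoveNotSupportedException is thrown if the file system cannot do so.
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("atomic-move-sketch");
        Path target = dir.resolve("build-input.tar.gz");
        publish(target, "example".getBytes(StandardCharsets.UTF_8));
        System.out.println("target exists: " + Files.exists(target));
    }
}
```

Staging the temporary file in the same directory keeps source and target on the same file store, which is the usual precondition for an atomic rename.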

@gsmet can you try again?

@gsmet
Member

gsmet commented Mar 2, 2023

Sure. I’m at the doctor right now but will have a look when I’m back.

The System.out.println should probably go away though :)

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

And the problem is that every time we invoke any of these methods, Fabric8 KubernetesClient tries to locate one HttpClient.Factory and it seems that the logic to get one HttpClient.Factory sometimes gets the VertxHttpClientFactory implementation over the expected one QuarkusHttpClientFactory.

Not really, the logic I mentioned is this one (Note that there is an extra parenthesis in the third condition, but I'm not sure if this was intentional, however this was not the issue). Maybe, the root problem is how to load the factories here.

Anyway, I thought it was better to not call this logic at all to be more efficient and directly use the factory we want at build time.

This is kind of messed up, have you identified the root cause?

From the changes, I understand that the explicit HttpClient Factory is only used at build time, The runtime Bean will continue to be created by using the SPI-provided factory, right?

Yes, this is meant to be used only at build time. At runtime, the KubernetesClient instance will still be produced by https://github.com/quarkusio/quarkus/blob/main/extensions/kubernetes-client/runtime/src/main/java/io/quarkus/kubernetes/client/runtime/KubernetesClientProducer.java#L21.

There were several places where a new KubernetesClient was being created:
- Some places use `Clients.fromConfig`.
- Other places use `KubernetesClientUtils.createConfig`.
- And other places use `KubernetesClientBuildItem.getClient`

And the problem is that every time we invoke any of these methods, Fabric8 KubernetesClient tries to locate one HttpClient.Factory and it seems that the logic to get one HttpClient.Factory sometimes gets the `VertxHttpClientFactory` implementation over the expected one `QuarkusHttpClientFactory`

This solution avoids looking up an HttpClient.Factory using the ServiceLoader logic in Fabric8 Kubernetes Client and instead always provides the expected `QuarkusHttpClientFactory` implementation.

Fix quarkusio#31476
@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

PR updated because I forgot to remove the file which after these changes is no longer needed: https://github.com/quarkusio/quarkus/pull/31503/files#diff-19ff5f9f5bb70cfddc9f0388d562c8b9ef15e74ead765a18f2ca00c4b99b1faa

@quarkus-bot

quarkus-bot bot commented Mar 3, 2023

✔️ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

@yrodiere yrodiere merged commit 4f9a65e into quarkusio:main Mar 3, 2023
@quarkus-bot quarkus-bot bot added kind/bugfix and removed triage/waiting-for-ci Ready to merge when CI successfully finishes labels Mar 3, 2023
@yrodiere
Member

yrodiere commented Mar 3, 2023

Merged, thanks!

@quarkus-bot quarkus-bot bot added this to the 3.0 - main milestone Mar 3, 2023
@Sgitario Sgitario deleted the 31476 branch March 3, 2023 09:20
@gsmet
Member

gsmet commented Mar 3, 2023

Yes, this is meant to be used only at build time. At runtime, the KubernetesClient instance will still be produced by

@Sgitario but in this case, don't we have the same problem of having the two factories to compete while we want the Quarkus one to be used at runtime?

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

Yes, this is meant to be used only at build time. At runtime, the KubernetesClient instance will still be produced by

@Sgitario but in this case, don't we have the same problem of having the two factories to compete while we want the Quarkus one to be used at runtime?

If this was not working at runtime either, then we should amend the producer to also use our custom HttpClient.Factory. Can you clarify @manusa or @geoand ?

@manusa
Contributor

manusa commented Mar 3, 2023

Note that there is an extra parenthesis in the third condition, but I'm not sure if this was intentional, however this was not the issue

🤦 OK, I think I messed up in that one. The last condition looks definitely wrong.

@Sgitario but in this case, don't we have the same problem of having the two factories to compete while we want the Quarkus one to be used at runtime?

One of the topics I discussed with @geoand while implementing the Quarkus factory was still providing the option for users to override the runtime HttpClient implementation. That's why I was specifically asking this. The Factory scoring system logic in the client impl needs to be fixed, then everything should work as expected.

@manusa
Contributor

manusa commented Mar 3, 2023

Sending the fix to the Client now.

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

So, for runtime what should be the right behavior? Use the QuarkusHttpClientFactory impl by default? If used, I have to include back the service binding file.

@manusa
Contributor

manusa commented Mar 3, 2023

So, for runtime what should be the right behavior? Use the QuarkusHttpClientFactory impl by default? If used, I have to include back the service binding file.

As I see it:
The QuarkusHttpClientFactory should score one point higher than the Vert.x factory provided by the client.
If a user wants to provide their own factory, it should score higher than any other factory.
If the scoring system works as expected (and there aren't classloader issues), then the highest scoring non-default factory should be selected.
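A toy version of that selection rule, in plain Java, might look like the sketch below. The `Candidate` class, the `select` method, and the concrete scores (0, 1, 2) are hypothetical and chosen only to mirror the three points above. This is not the Fabric8 client's actual implementation.

```java
import java.util.Comparator;
import java.util.List;

// Toy model of a scoring rule; the names and scores are illustrative assumptions.
class FactoryScoringSketch {

    static final class Candidate {
        final String name;
        final int priority;
        final boolean isDefault;

        Candidate(String name, int priority, boolean isDefault) {
            this.name = name;
            this.priority = priority;
            this.isDefault = isDefault;
        }
    }

    // Picks the highest scoring non-default candidate, whatever the discovery order.
    static Candidate select(List<Candidate> found) {
        Comparator<Candidate> byPriority = Comparator.comparingInt(c -> c.priority);
        return found.stream()
                .filter(c -> !c.isDefault)
                .max(byPriority)
                // Fall back to the best default candidate when nothing else is present.
                .or(() -> found.stream().max(byPriority))
                .orElseThrow(() -> new IllegalStateException("no factory found"));
    }

    public static void main(String[] args) {
        Candidate vertx = new Candidate("VertxHttpClientFactory", 0, false);
        Candidate quarkus = new Candidate("QuarkusHttpClientFactory", 1, false);
        Candidate user = new Candidate("UserProvidedFactory", 2, false);

        System.out.println(select(List.of(vertx, quarkus)).name);
        System.out.println(select(List.of(quarkus, vertx)).name);
        System.out.println(select(List.of(user, vertx, quarkus)).name);
    }
}
```

Because the choice depends only on the scores, the same factory wins regardless of the order in which the candidates are discovered.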

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

So, for runtime what should be the right behavior? Use the QuarkusHttpClientFactory impl by default?

If so, I'm not sure whether I should add the service binding file back.
But I'm not sure it was needed at all, because we were already registering the service on this line. I would need to test this.

@manusa
Contributor

manusa commented Mar 3, 2023

But not sure it was needed it at all because we were already registering the service on this line. I would need to test this.

AFAIU you need both. The file is used to register the SPI class. The second is used by Quarkus to make use of the SPI providers.

@gsmet
Member

gsmet commented Mar 3, 2023

@Sgitario and I think you probably need to exclude the service binding file coming from the Kubernetes Client Vert.x jar to make sure the Quarkus one is the only one seen at runtime.

@gsmet
Member

gsmet commented Mar 3, 2023

(by using RemovedResourceBuildItem to remove the service file from the jar)

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

I've been playing with the quarkus-kubernetes-client extension and:

Therefore, I think we're ok with using the VertxHttpClientFactory impl at runtime and only QuarkusHttpClientFactory at build time. So we don't need more changes. Wdyt?

Note that I'm seeing the following warning when using the Kubernetes Client:

2023-03-03 12:19:48,711 WARN  [io.ver.cor.imp.VertxImpl] (executor-thread-0) You're already on a Vert.x context, are you sure you want to create a new Vertx instance?

But this is unrelated to my changes.

@manusa
Contributor

manusa commented Mar 3, 2023

Therefore, I think we're ok with using the VertxHttpClientFactory impl at runtime and only QuarkusHttpClientFactory at build time. So we don't need more changes. Wdyt?

Besides disabling the DNS verification, the purpose of the factory is to reuse the Vertx object instance at runtime which is why it was originally added.

This is probably related to the warning you mention.

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

I see. So, if we want to let users override it via ServiceLoader and keep QuarkusHttpClientFactory, we definitely need to exclude the VertxHttpClientFactory impl using RemovedResourceBuildItem, as @gsmet suggested.
Let me prepare a pull request with this.

@gsmet
Member

gsmet commented Mar 3, 2023

Thanks @Sgitario and @manusa !

@Sgitario
Contributor Author

Sgitario commented Mar 3, 2023

Fix: #31582

Sgitario added a commit to Sgitario/quarkus that referenced this pull request Mar 3, 2023