Fix random failures when deploying to OpenShift/K8s on JVM/Native #31503
Conversation
@yrodiere with these changes, it seems to be consistently working for me. Could you try it yourself too just to be sure?
Can you please explain this change? I don't understand it.
This is based on my findings here: #31476 (comment)
I can't easily test with a Quarkus snapshot as I've only ever deployed from CI, but @gsmet is giving it a try with another app that he usually deploys from his laptop, so let's wait for that.
As to the solution, I'm not sure I understand the content, but I added a comment about the form :)
I'd like a more complete explanation, please, both to help me understand why this change fixes the problem and to have an easily accessible record of why this change was made.
My guess was that the system property was cleared and a new Vert.x instance was created with the wrong value. Anyway, @gsmet is getting a different error after trying these changes, so I'm changing this pull request to a draft and will keep working on it.
👍🏼 Let me know if you need me to look into anything.
@gsmet I've also tried to deploy the [quarkus-github-bot](https://github.com/quarkusio/quarkus-github-bot) repo into OpenShift and it's working fine too (no warning messages are printed). @gsmet @yrodiere I would like you to try these changes yourselves too, just to be sure that all the issues are gone.
There is a regression with these changes that causes an out-of-memory error. I can reproduce it locally; investigating.
That almost certainly means that there are multiple Vert.x instances that have not been closed.
It should be fixed with the latest changes.
```diff
@@ -12,7 +14,9 @@ public class KubernetesClientBuildStep {

     private KubernetesClientBuildConfig buildConfig;

     @BuildStep
-    public KubernetesClientBuildItem process(TlsConfig tlsConfig) {
-        return new KubernetesClientBuildItem(createConfig(buildConfig, tlsConfig));
+    public KubernetesClientBuildItem process(TlsConfig tlsConfig, QuarkusBuildCloseablesBuildItem closeablesBuildItem) {
```
TIL about `QuarkusBuildCloseablesBuildItem` :)
I will test it with the Quarkus Bot and the production cluster this afternoon.
So... I don't think it's the same issue, but I still have an error when pushing the Quarkus bot to the production cluster (an error which apparently doesn't prevent the deployment, as I have a success in the end):
Talking about:
I've also seen this issue even without this change. My guess is that the client is closing the connection after some time. How long did the build take for you? I would like to address this issue as part of this pull request as well, so I'll investigate it.
The build is a native one, so it's a bit long, something like 2-3 minutes.
After trying up to 10 times to deploy a native build into OpenShift, I could reproduce the latter issue only once, and it seems to be caused by OpenShift rejecting the compressed file:
And I think this is caused by:

With this change, I've tried the deployment to OpenShift and I could not reproduce the issue any longer. @gsmet can you try again?
Sure. I’m at the doctor right now but will have a look when I’m back. The `System.out.println` should probably go away though :)
Not really, the logic I mentioned is this one (note that there is an extra parenthesis in the third condition; I'm not sure if this was intentional, but it was not the issue). Maybe the root problem is how the factories are loaded here. Anyway, I thought it was better not to call this logic at all and to directly use the factory we want at build time, which is more efficient.
Yes, this is meant to be used only at build time. At runtime, the KubernetesClient instance will still be produced by https://github.com/quarkusio/quarkus/blob/main/extensions/kubernetes-client/runtime/src/main/java/io/quarkus/kubernetes/client/runtime/KubernetesClientProducer.java#L21.
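For readers who don't want to follow the link, here is a minimal, hypothetical sketch of what a CDI producer for the runtime client typically looks like (class name is mine; it assumes the Fabric8 6.x `KubernetesClientBuilder` API and `jakarta.*` CDI imports). The actual Quarkus producer referenced above does more, such as mapping Quarkus configuration and wiring its own HttpClient factory:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Disposes;
import jakarta.enterprise.inject.Produces;
import jakarta.inject.Singleton;

@ApplicationScoped
public class KubernetesClientProducerSketch {

    @Produces
    @Singleton
    KubernetesClient kubernetesClient(Config config) {
        // Config is assumed to be another CDI bean built from the application configuration.
        return new KubernetesClientBuilder().withConfig(config).build();
    }

    void closeClient(@Disposes KubernetesClient client) {
        // Ensure the single runtime client is closed when the application shuts down.
        client.close();
    }
}
```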
There were several places where a new KubernetesClient was being created:
- Some places use `Clients.fromConfig`.
- Other places use `KubernetesClientUtils.createConfig`.
- And other places use `KubernetesClientBuildItem.getClient`.

The problem is that every time we invoke any of these methods, the Fabric8 KubernetesClient tries to locate an HttpClient.Factory, and the logic to pick one sometimes selects the `VertxHttpClientFactory` implementation over the expected `QuarkusHttpClientFactory`.

With this solution, we avoid finding an HttpClient.Factory through the ServiceLoader logic in the Fabric8 Kubernetes Client and always provide the expected `QuarkusHttpClientFactory` implementation instead.

Fix quarkusio#31476
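This is not the PR's actual code, but a sketch of the idea described in the commit message, assuming `KubernetesClientBuilder.withHttpClientFactory` is available in the client version in use; the class and method names are mine, and the factory instance is passed in rather than imported because the Quarkus class lives in the extension's runtime module:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.http.HttpClient;

public class ExplicitFactoryClientSketch {

    // Passing the HttpClient.Factory explicitly bypasses the ServiceLoader lookup, so the
    // build-time code always gets the factory it was handed (e.g. QuarkusHttpClientFactory)
    // instead of whichever implementation wins the classpath lookup.
    public static KubernetesClient createClient(Config config, HttpClient.Factory quarkusFactory) {
        return new KubernetesClientBuilder()
                .withConfig(config)
                .withHttpClientFactory(quarkusFactory)
                .build();
    }
}
```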
PR updated because I forgot to remove the file which, after these changes, is no longer needed: https://github.com/quarkusio/quarkus/pull/31503/files#diff-19ff5f9f5bb70cfddc9f0388d562c8b9ef15e74ead765a18f2ca00c4b99b1faa
✔️ The latest workflow run for the pull request has completed successfully. It should be safe to merge provided you have a look at the other checks in the summary.
Merged, thanks!
@Sgitario but in this case, don't we have the same problem of the two factories competing while we want the Quarkus one to be used at runtime?
If this was not working at runtime either, then we should amend the producer to also use our custom HttpClient.Factory. Can you clarify, @manusa or @geoand?
🤦 OK, I think I messed up in that one. The last condition definitely looks wrong.
One of the topics I discussed with @geoand while implementing the Quarkus factory was still providing the option for users to override the runtime HttpClient implementation. That's why I was specifically asking this. The factory scoring logic in the client implementation needs to be fixed; then everything should work as expected.
Sending the fix to the Client now.
So, for runtime, what should be the right behavior? Use the QuarkusHttpClientFactory impl by default? If so, I have to include the service binding file back.
As I see it:
> So, for runtime, what should be the right behavior? Use the QuarkusHttpClientFactory impl by default? If so, I'm not sure if I should include the service binding file back.
AFAIU you need both. The file is used to register the SPI class. The second is used by Quarkus to make use of the SPI providers.
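For context, a small, self-contained illustration of the standard `java.util.ServiceLoader` mechanism involved here (the class name is mine, and the resource path in the comment follows the usual SPI naming convention for a nested interface; treat it as an assumption and check the jars on your classpath):

```java
import java.util.ServiceLoader;

import io.fabric8.kubernetes.client.http.HttpClient;

public class ListHttpClientFactories {

    public static void main(String[] args) {
        // Implementations are registered through a resource named after the SPI interface,
        // e.g. META-INF/services/io.fabric8.kubernetes.client.http.HttpClient$Factory, whose
        // content is the implementation class name. ServiceLoader returns every registered
        // factory on the classpath, which is how two implementations can end up competing.
        ServiceLoader<HttpClient.Factory> factories = ServiceLoader.load(HttpClient.Factory.class);
        for (HttpClient.Factory factory : factories) {
            System.out.println("Discovered factory: " + factory.getClass().getName());
        }
    }
}
```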
@Sgitario and I think you probably need to exclude the service binding file coming from the Kubernetes Client Vert.x jar to make sure the Quarkus one is the only one seen at runtime.
(by using `RemovedResourceBuildItem`)
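A hedged sketch of what such an exclusion could look like in a deployment module, assuming the `RemovedResourceBuildItem`/`ArtifactKey` API available there; the class name, the artifact coordinates, and the resource path are illustrative assumptions, not verified values:

```java
import java.util.Set;

import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.builditem.RemovedResourceBuildItem;
import io.quarkus.maven.dependency.ArtifactKey;

public class RemoveVertxFactorySpiFileSketch {

    @BuildStep
    RemovedResourceBuildItem removeVertxHttpClientSpiFile() {
        // Drop the SPI registration file from the Fabric8 Vert.x HTTP client jar so the
        // Quarkus factory is the only HttpClient.Factory visible at runtime.
        // Artifact coordinates and resource path below are assumptions for illustration.
        return new RemovedResourceBuildItem(
                ArtifactKey.ga("io.fabric8", "kubernetes-httpclient-vertx"),
                Set.of("META-INF/services/io.fabric8.kubernetes.client.http.HttpClient$Factory"));
    }
}
```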
I've been playing with the
Therefore, I think we're OK with using the VertxHttpClientFactory impl at runtime and only QuarkusHttpClientFactory at build time, so we don't need more changes. Wdyt? Note that I'm seeing the following warning when using the Kubernetes Client:
But this is unrelated to my changes.
Besides disabling the DNS verification, the purpose of the factory is to reuse the Vert.x object instance at runtime, which is why it was originally added (line 24 in f14cb40).
Which probably relates to the warning you mention.
I see. So, if we want to let users override it via ServiceLoader and keep QuarkusHttpClientFactory, we definitely need to exclude the VertxHttpClientFactory impl using RemovedResourceBuildItem, as @gsmet suggested.
Fix: #31582
Finally, I've found out the root cause of this issue. There were several places where a new KubernetesClient was being created:
- `Clients.fromConfig`
- `KubernetesClientUtils.createConfig`
- `KubernetesClientBuildItem.getClient`

And the problem is that every time we invoke any of these methods, the Fabric8 KubernetesClient tries to locate an HttpClient.Factory, and the logic to pick one sometimes selects the `VertxHttpClientFactory` implementation over the expected `QuarkusHttpClientFactory`.

The previous solution of keeping the "disable HTTP DNS" system property worked fine because this property was used by either `VertxHttpClientFactory` or `QuarkusHttpClientFactory`.

However, I've updated the pull request with a better and more efficient solution: instead of finding an HttpClient.Factory via the ServiceLoader logic in the Fabric8 Kubernetes Client, it always directly provides the expected `QuarkusHttpClientFactory` implementation.

Also, the warning message that always appeared is now gone:

Fix #31476