
Improve onFailure to ignore/log RuntimeException so that all onFailure handling completes #4353

Closed
scottmarlow opened this issue Aug 25, 2022 · 5 comments

@scottmarlow

scottmarlow commented Aug 25, 2022

Is your enhancement related to a problem? Please describe

https://issues.jenkins.io/browse/JENKINS-68234 reports an io.fabric8.kubernetes.client.http.WebSocketHandshakeException at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$BuilderImpl$1.onFailure(OkHttpWebSocketImpl.java:66) whose cause no one yet understands. We are seeing this same problem in Eclipse CI, especially for the Jakarta EE Platform TCK testing (as well as other Eclipse CI testing).

I don't know anything about the Fabric8 kubernetes-client, but I am wondering whether the code at https://github.com/fabric8io/kubernetes-client/blob/master/httpclient-okhttp/src/main/java/io/fabric8/kubernetes/client/okhttp/OkHttpWebSocketImpl.java#L66 should ignore, or just log, a RuntimeException thrown by response.close() instead of letting it propagate to the caller. The method has already been passed the Throwable t, which is likely more important to handle in the code that runs after the call to response.close().

Describe the solution you'd like

This code currently contains:

    @Override
    public void onFailure(okhttp3.WebSocket webSocket, Throwable t, Response response) {
      if (response != null) {
        response.close();
      }
      if (!opened) {
        if (response != null) {
          try {
            future.completeExceptionally(
                new WebSocketHandshakeException(new OkHttpResponseImpl<>(response, null)).initCause(t));
          } catch (IOException e) {
            // can't happen
          }
        } else {
          future.completeExceptionally(t);
        }
      } else {
        listener.onError(new OkHttpWebSocketImpl(webSocket, this::request), t);
      }
    }

Perhaps something like this would be better:

    @Override
    public void onFailure(okhttp3.WebSocket webSocket, Throwable t, Response response) {
      if (response != null) {
        try {
          response.close();
        } catch (RuntimeException e) {
          // Log the caught exception and continue handling the passed `Throwable t` parameter.
          // (`logger` is assumed to be a java.util.logging.Logger declared for this class.)
          logger.log(Level.WARNING, "OkHttpWebSocketImpl.onFailure close of response failed", e);
        }
      }
      if (!opened) {
        if (response != null) {
          try {
            future.completeExceptionally(
                new WebSocketHandshakeException(new OkHttpResponseImpl<>(response, null)).initCause(t));
          } catch (IOException e) {
            // can't happen
          }
        } else {
          future.completeExceptionally(t);
        }
      } else {
        listener.onError(new OkHttpWebSocketImpl(webSocket, this::request), t);
      }
    }

Describe alternatives you've considered

No response

Additional context

No response

@shawkins
Contributor

@scottmarlow I think there is a version mix-up here. The current master is likely not what is referenced in the Jenkins issue. There's a reference to the 5.12 plugin in a comment, and the stack trace shows PodOperationsImpl.java:332, which lines up with 5.12 waiting for the websocket to become ready.

Based upon the stack trace, it looks like the initial websocket request is getting a 500 response back, so it's failing. That would indicate a problem on the API server end. Unlike other operations, we don't currently attempt an exponential back-off retry of websocket requests, so it's more likely that you'll see these errors here.
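Since the client does not retry the websocket handshake, callers that need resilience can wrap the setup themselves. As a generic illustration only (this is not fabric8 code; `RetryDemo` and `retryWithBackoff` are made-up names), an exponential back-off helper might look like:

```java
import java.util.concurrent.Callable;

public class RetryDemo {

    // Retry `task` up to maxAttempts times, doubling the delay between attempts.
    public static <T> T retryWithBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential back-off
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a handshake that fails twice (e.g. with a 500) and then succeeds.
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated 500 response");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // connected after 3 attempts
    }
}
```

This only masks transient server-side failures; a persistent 500 from the API server would still need to be diagnosed on the server end.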

@scottmarlow
Author

The driving issue for us in the Eclipse CI environment is that we are seeing more frequent failures like https://ci.eclipse.org/kuksa/job/build-and-publish-website-staging/2287/console (reported in https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1676, with other similar failures reported in https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/1685).

Part of the https://ci.eclipse.org/kuksa/job/build-and-publish-website-staging/2287/console output is attached as consoleoutput.zip.

> @scottmarlow I think there is a version mix up here. The current master is likely not what is referenced in the Jenkin issue. There's a reference to the 5.12 plugin in a comment and in the stack trace that shows PodOperationsImpl.java:332 - which lines up to the 5.12 waiting for the websocket to become ready.

I'm guessing the exception call stack I copied is also 5.12 based as well? Is there a certain tag or branch I should use to view the 5.12 plugin source?

@shawkins
Contributor

> I'm guessing the exception call stack I copied is also 5.12 based as well? Is there a certain tag or branch I should use to view the 5.12 plugin source?

I'm assuming that the Jenkins plugin version lines up with the fabric8 client version. If so, you can use tags to browse the corresponding fabric8 client source: https://github.com/fabric8io/kubernetes-client/tree/v5.12.1

@scottmarlow
Author

> I'm assuming that the Jenkins plugin version lines up to the fabric version. If so, you can use tags to browse the corresponding fabric8 client source https://github.com/fabric8io/kubernetes-client/tree/v5.12.1

Thanks. Line 66 contains future.completeExceptionally(new WebSocketHandshakeException(new OkHttpResponseImpl<>(response, null)).initCause(t)), which looks correct. Thanks for the help! I'll close this now.

@shawkins
Contributor

@scottmarlow in the referenced Jenkins issue, it looks like this is occurring with multiple containers in a single pod. The fabric8 client does not have a concept of a default container, so it does not choose one when none is specified; that differs from the behavior of kubectl and could be opened as an issue. Without specifying a container, that would typically lead to a "bad request" response (see #4319), so I'm not sure why you're seeing a 500. If that is the problem, the fix is to add an inContainer("something") call prior to the exec() call.
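As a minimal sketch of that suggested fix against the 5.12-era client (the namespace, pod, and container names below are placeholders, and this needs a reachable cluster to actually run):

```java
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.ExecWatch;

public class ExecExample {
    public static void main(String[] args) throws Exception {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Selecting an explicit container avoids the ambiguity that arises
            // when exec() targets a pod with multiple containers.
            ExecWatch watch = client.pods()
                .inNamespace("my-namespace")   // placeholder
                .withName("my-pod")            // placeholder
                .inContainer("my-container")   // the fix suggested above
                .writingOutput(System.out)
                .exec("sh", "-c", "echo hello");
            Thread.sleep(1000);
            watch.close();
        }
    }
}
```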
