
Healthz bad gateway when pod/service ip take time to propagate #23067

Closed
batleforc opened this issue Jul 31, 2024 · 14 comments
Labels

area/install: Issues related to installation, including offline/air gap and initial setup
kind/bug: Outline of a bug - must adhere to the bug report template.
severity/P2: Has a minor but important impact to the usage or development of the system.
status/analyzing: An issue has been proposed and it is currently being analyzed for effort and implementation approach

Comments

@batleforc

Describe the bug

During the startup of a workspace's pod, the IP sometimes takes time to propagate, and the two consecutive calls to the healthz endpoint immediately come back with a bad gateway.

Che version

7.88

Steps to reproduce

  1. Start a DevSpaces/Eclipse Che environment on an OpenShift/Kubernetes cluster that can take some time to propagate the IP address of the service/pod.
  2. Start a workspace.
  3. If you are lucky, the workspace will start within seconds; if not, it will take approximately 5-10 minutes or more to start.

Expected behavior

Do not wait 5 extra minutes when the cluster propagates the corresponding IP quickly (as in most of our cases), and only wait the extra 5 minutes when side resources actually take time to load.

Runtime

Kubernetes (vanilla), OpenShift

Screenshots

No response

Installation method

chectl/latest, chectl/next, OperatorHub

Environment

Linux, Amazon

Eclipse Che Logs

No response

Additional context

No response

@batleforc added the kind/bug label on Jul 31, 2024
@che-bot added the status/need-triage label on Jul 31, 2024
@ibuziuk added the status/analyzing, area/install, and severity/P2 labels and removed the status/need-triage label on Aug 5, 2024
@ibuziuk (Member) commented Aug 5, 2024

@batleforc Exposure of the route should be relatively fast. Could you please clarify when exactly you are facing this issue (5-10 min for the route to be accessible)? The default hard startup timeout is 5 minutes, and at this point we do not plan to change it.

@batleforc (Author)

Due to this case, we upped the timeout to 900s.
In theory it should be fast, but I have encountered this case both on OpenShift on AWS (with about 7 users) and on Kubernetes on bare metal (1 to 4 users). The initial two calls to the healthz endpoint immediately return a bad gateway from the main gateway, and the user has to wait at least 5 minutes (the 10-minute case is not narrowed down precisely yet, but we need to reduce this one first).

We found out that the propagation of the service's IP to the targeted pod takes some time and sometimes completes a little after the pod is up, but not soon enough for the backend. That's why I added a small retry in eclipse-che/che-operator#1874 that should cover the propagation time, but I would love to make it a parameter that the end user could tune in case of a pretty slow CNI.
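For illustration, the kind of retry I mean would look roughly like the sketch below. This is not the actual code from eclipse-che/che-operator#1874; the endpoint URL, attempt count, and interval are placeholders for the example.

```go
// Minimal sketch of retrying the healthz probe instead of failing on the
// first bad gateway, to ride out slow service/pod IP propagation.
// Not the actual che-operator code; URL, attempts, and interval are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func waitHealthz(url string, attempts int, interval time.Duration) error {
	client := &http.Client{Timeout: 5 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // endpoint reachable: the IP has propagated
			}
			lastErr = fmt.Errorf("healthz returned %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(interval) // give the CNI a moment to propagate the IP
	}
	return fmt.Errorf("healthz not ready after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Hypothetical URL used only for this example.
	if err := waitHealthz("http://che-gateway.example/healthz", 5, 2*time.Second); err != nil {
		fmt.Println("endpoint not ready:", err)
	}
}
```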

@batleforc (Author)

To debug this, we used the different pods to trace the full chain of acknowledgement that the deployment is ready for the next step of startup. We have seen that we either need to add a small delay between the two health calls on the backend side, or add a retry directly in the gateway. (This still needs testing by replacing the different elements in the cluster.)

@batleforc (Author)

Could we get some help to check whether the change added in eclipse-che/che-operator#1874 fixes the problem we encounter? (Mostly building the image, and possibly a suggestion on how we can make the healthz retry configurable: https://github.com/eclipse-che/che-operator/pull/1874/files#diff-ebca2eefe12f7ba4a722c53d574ba1b2adee412909da8cdbc974c8f7fcbfb02fR655)
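To make the tuning concrete, the parameter could for example be read from the operator's environment, roughly as in the sketch below. The variable names CHE_HEALTHZ_RETRY_ATTEMPTS and CHE_HEALTHZ_RETRY_INTERVAL_SECONDS are hypothetical, not an existing che-operator setting.

```go
// Sketch of user-tunable retry settings for slow CNIs; the env variable
// names and defaults are hypothetical, not an existing che-operator API.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func healthzRetrySettings() (attempts int, interval time.Duration) {
	attempts, interval = 3, 2*time.Second // defaults that suit a fast CNI
	if v, err := strconv.Atoi(os.Getenv("CHE_HEALTHZ_RETRY_ATTEMPTS")); err == nil && v > 0 {
		attempts = v
	}
	if v, err := strconv.Atoi(os.Getenv("CHE_HEALTHZ_RETRY_INTERVAL_SECONDS")); err == nil && v > 0 {
		interval = time.Duration(v) * time.Second
	}
	return attempts, interval
}

func main() {
	attempts, interval := healthzRetrySettings()
	fmt.Printf("healthz retry: %d attempts, %s interval\n", attempts, interval)
}
```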

@tolusha (Contributor) commented Sep 5, 2024

Hello,
Please try this image built from the PR:
quay.io/abazko/operator:23067

@tolusha (Contributor) commented Sep 10, 2024

Hello @batleforc
Does it work for you?

@batleforc (Author)

Hello @tolusha,
I've set it up, but I think I need to fine-tune the initial interval.

@batleforc (Author)

Is the provided image (quay.io/abazko/operator:23067) automatically updated?

@tolusha (Contributor) commented Sep 10, 2024

Unfortunately, no.
You can build the image with the following command:
make docker-build docker-push IMG=<IMAGE_NAME> SKIP_TESTS=true

@batleforc (Author)

So, the build seems okay, but I encounter a "Client.Timeout exceeded while awaiting headers" error, and I can't find where the devworkspace-controller-manager makes the call to the healthz endpoint.

@batleforc (Author) commented Sep 16, 2024

@tolusha So I connected the dots.
Set up in my own environment (and starting around 10-20 workspaces), it works, and I can no longer reproduce the case where I stay stuck for 5 minutes because the endpoint returned two consecutive bad gateways (though I ended up overloading the cluster 🤣). I have both the che-operator and the devworkspace-operator set up with the corresponding branch (Work-on-timeout).

@tolusha (Contributor) commented Sep 18, 2024

Hello @batleforc
Thank you for the information.
So, does this mean that the PR is good to review and merge?

@batleforc (Author) commented Sep 18, 2024

Hello @tolusha
For me, yes.
But it will need both PRs.

@AObuchow

devfile/devworkspace-operator#1321 has now been merged, which seems to resolve this issue. This change will appear when DevWorkspace Operator 0.32.0 is released.
