502: bad gateway on `<put>` when removing cluster / fleet agent overrides from an existing rke2 cluster #9012

slickwarren · 2023-05-12T16:09:56Z

Setup

Rancher version: v2.7-head(95edd02)
Rancher UI Extensions: n/a
Browser type & version: Chrome

Describe the bug

Updating an rke2 cluster that has existing cluster and fleet agent overrides, and removing all of those settings, results in a 502 Bad Gateway error.

To Reproduce

deploy an rke2 cluster
- enter cluster agent overrides for resources, add a toleration, and add a custom affinity
- enter fleet agent overrides for resources, add a toleration, and add a custom affinity
allow the cluster to successfully get to an active state
upgrade the cluster
- remove all overrides that were entered in the first step

Result
An error is shown in the UI and the user is kept on the cluster edit screen, however the update was actually still applied and the overrides were removed

a padding to disable MSIE and chrome friendly error page

cluster did not appear to go into an updating state

Expected Result

if there's an error, I wouldn't expect the new spec to have applied to the cluster

Additional context

did not happen for an rke1 cluster


aalves08 · 2023-05-15T16:55:42Z

As per our meeting, I couldn't reproduce this issue. Moving to test.

aalves08 · 2023-05-16T08:59:49Z

Still no luck here, even with your system @slickwarren .

Try running this without the browser extensions on your system, as you suggested. I went through your system and wasn't able to reproduce it. 🙏

slickwarren · 2023-05-16T18:01:45Z

I'm able to reproduce this on both chrome and safari. Safari has a slightly different error message. DM'd with the exact payload.

{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}
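For anyone triaging a similar report, the payload above is an nginx error page wrapped in a JSON `data` field, so it can be recognised without rendering the HTML. A minimal sketch, assuming the response body always has this shape; `extract_gateway_error` is a hypothetical helper name, not part of Rancher or the dashboard:

```python
import json
import re


def extract_gateway_error(body: str):
    """Return the <title> text of an HTML error page wrapped in {"data": ...}, or None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    html = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(html, str):
        return None
    # The nginx error page carries the status text in its <title> element.
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# The payload reported in this thread, kept as a raw string so the JSON escapes survive.
sample = r'{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}'
print(extract_gateway_error(sample))
```

The `nginx` footer in the page suggests the response was produced by a proxy in front of Rancher rather than by Rancher itself, which matches the later discussion in this thread.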

aalves08 · 2023-05-18T09:23:28Z

Good day @slickwarren . Since this has been open for a few days on the UI side and it's a 502 (Bad Gateway) error, which indicates to me that this is most probably a backend issue, I would ask that this issue be reassigned to the backend team for further investigation. 🙏

On the UI side, not only could we not reproduce this, but we also couldn't find any indication that this is a UI/frontend issue.

Thanks for taking the time to go over this with me in a couple of calls. 🤜 🙇

FYI @gaktive @nwmac

snasovich · 2023-05-18T14:30:07Z

@slickwarren , could you please provide the details from the call that returned 502 from "Network" tab of Chrome Developer Tools for the complete picture? At a glance it seems the call from the UI didn't even reach Rancher backend code or Rancher was down at that exact time for some reason. Similar issue - https://www.reddit.com/r/kubernetes/comments/oaxarg/intermittent_502_bad_gateway_issue/

gaktive · 2023-05-18T15:51:19Z

Transferred from rancher/dashboard based on @snasovich's comment since it doesn't look like UI. CC: @a-blender

slickwarren · 2023-05-18T16:46:48Z

Here's the info I have right now, please lmk if you need more:
safari error:

{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}

chrome error:

chrome console logs:

Outside of these screenshots, I don't have the network data at this time. If the above info doesn't include what you're looking for, please lmk and I can get it for you. @snasovich

a-blender · 2023-05-19T17:34:47Z

@slickwarren Thanks for the info!

From this description,

An error is shown in the UI and the user is kept on the cluster edit screen, however the update was actually still applied and the overrides were removed

a padding to disable MSIE and chrome friendly error page

cluster did not appear to go into an updating state.

If the update to remove the agent customization went through, then that tells me that the cluster agent is still connected to rancher and that rancher's network connection failed for other reasons, such as local network or ingress issues. If the connection to the cluster agent had failed, you'd see something more like a "disconnected from cluster agent" error, not a 502.

Also, it's fine if sometimes you don't see the cluster go into an updating state because the update can be fast. To verify the agent was redeployed with the update, check the rancher logs for redeploying agent and verify the timer on the cluster-agent pod was reset indicating a redeploy.
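The "timer was reset" check described above can be expressed as a comparison of timestamps. A minimal sketch, assuming you have already collected the cluster-agent pod start times (for example from each pod's `status.startTime`) and the time the update request was sent; the function names and sample timestamps are hypothetical:

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Accept the trailing "Z" form used by Kubernetes timestamps.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def redeployed_after(pod_start_times, update_time) -> bool:
    """Return True if there is at least one pod and every pod started at or after the update."""
    cutoff = parse_ts(update_time)
    starts = [parse_ts(t) for t in pod_start_times]
    return bool(starts) and all(start >= cutoff for start in starts)


# Hypothetical values: the pod restarted ten minutes after the update was submitted.
print(redeployed_after(["2023-06-02T18:40:00Z"], "2023-06-02T18:30:00Z"))
```

A pod start time older than the update suggests the agent was not redeployed, although it does not say why.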

Did you try with Firefox?
Did you see the overrides get removed from the downstream cluster agent or just the rke2 cluster obj?
Has the specific 502 error you are seeing been reproduced/intermittently reproduced in any other setting? Are there other open GH issues for it?

If you were having local network issues, I'd recommend trying to repro again today on a fresh install of Rancher. Rancher logs from the time of API call may also help.

slickwarren · 2023-05-19T17:39:01Z

This was not a local network issue, and was regularly reproducible on chrome (regular and incognito), and safari, on multiple versions of v2.7-head throughout the last 2 weeks.
The request does seem to go through in every case I've reproduced this, despite the message displayed
There is no other issue filed at this time, but there was some offline (due to security concerns) conversation around this as well in internal channels

a-blender · 2023-05-19T17:46:31Z

@slickwarren When you see the request go through, do you see the overrides removed from the cluster agent? Or can you not access the cluster anymore?

Gotcha, I meant similar GH issues where a 502 issue was seen. We can also discuss offline.

slickwarren · 2023-05-19T17:50:28Z

The request does appear to update the spec / cluster / fleet agent appropriately, however the user doesn't know that until they leave the 502 error page, and go to edit the cluster again (or go and view the cluster / fleet agent deployments)
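Since the 502 here did not mean the write failed, a client or test script could re-read the cluster after a gateway error instead of trusting the status code. A minimal sketch under stated assumptions: `put_cluster` and `get_cluster` are hypothetical callables supplied by the caller, and the two spec keys are assumed names for the agent override fields, so adjust them to whatever your cluster object actually uses:

```python
# Assumed field names for the cluster / fleet agent overrides on the cluster spec.
AGENT_KEYS = (
    "clusterAgentDeploymentCustomization",
    "fleetAgentDeploymentCustomization",
)


def overrides_removed(cluster: dict, keys=AGENT_KEYS) -> bool:
    """Return True if none of the override keys is set to a non-empty value."""
    spec = cluster.get("spec", {})
    return all(not spec.get(key) for key in keys)


def save_and_verify(put_cluster, get_cluster, desired: dict) -> bool:
    """Send the update; on a 502, re-read the cluster to see whether it applied anyway."""
    status = put_cluster(desired)
    if status == 502:
        # A gateway error does not prove the write failed, so check the stored spec.
        return overrides_removed(get_cluster())
    return 200 <= status < 300
```

This only reports whether the stored spec matches the intent; it does not explain or fix the 502 itself.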

mantis-toboggan-md · 2023-05-31T17:51:25Z

I've made a follow-up issue here #9016

slickwarren · 2023-06-02T18:53:01Z

tested on v2.7-head (f0d4078):

deploy an rke2 cluster with all agentCustomization set for fleet and cluster agents
update the cluster, remove all agentCustomization / set back to default -- pass

notes:
since this has gone in, I haven't reproduced this issue
the following warnings are still in the headers of the removal requests:

Warning: 299 - unknown field "metadata.fields"
Warning: 299 - unknown field "metadata.relationships"
Warning: 299 - unknown field "metadata.state"
Warning: 299 - unknown field "spec.machineSelectorConfig"
Warning: 299 - unknown field "status.conditions[0].error"
Warning: 299 - unknown field "status.conditions[0].transitioning"
Warning: 299 - unknown field "status.conditions[10].error"
Warning: 299 - unknown field "status.conditions[10].transitioning"
Warning: 299 - unknown field "status.conditions[11].error"
Warning: 299 - unknown field "status.conditions[11].transitioning"
Warning: 299 - unknown field "status.conditions[12].error"
Warning: 299 - unknown field "status.conditions[12].transitioning"
Warning: 299 - unknown field "status.conditions[13].error"
Warning: 299 - unknown field "status.conditions[13].transitioning"
Warning: 299 - unknown field "status.conditions[14].error"
Warning: 299 - unknown field "status.conditions[14].transitioning"
Warning: 299 - unknown field "status.conditions[15].error"
Warning: 299 - unknown field "status.conditions[15].transitioning"
Warning: 299 - unknown field "status.conditions[16].error"
Warning: 299 - unknown field "status.conditions[16].transitioning"
Warning: 299 - unknown field "status.conditions[17].error"
Warning: 299 - unknown field "status.conditions[17].transitioning"
Warning: 299 - unknown field "status.conditions[18].error"
Warning: 299 - unknown field "status.conditions[18].transitioning"
Warning: 299 - unknown field "status.conditions[19].error"
Warning: 299 - unknown field "status.conditions[19].transitioning"
Warning: 299 - unknown field "status.conditions[1].error"
Warning: 299 - unknown field "status.conditions[1].transitioning"
Warning: 299 - unknown field "status.conditions[20].error"
Warning: 299 - unknown field "status.conditions[20].transitioning"
Warning: 299 - unknown field "status.conditions[21].error"
Warning: 299 - unknown field "status.conditions[21].transitioning"
Warning: 299 - unknown field "status.conditions[22].error"
Warning: 299 - unknown field "status.conditions[22].transitioning"
Warning: 299 - unknown field "status.conditions[2].error"
Warning: 299 - unknown field "status.conditions[2].transitioning"
Warning: 299 - unknown field "status.conditions[3].error"
Warning: 299 - unknown field "status.conditions[3].transitioning"
Warning: 299 - unknown field "status.conditions[4].error"
Warning: 299 - unknown field "status.conditions[4].transitioning"
Warning: 299 - unknown field "status.conditions[5].error"
Warning: 299 - unknown field "status.conditions[5].transitioning"
Warning: 299 - unknown field "status.conditions[6].error"
Warning: 299 - unknown field "status.conditions[6].transitioning"
Warning: 299 - unknown field "status.conditions[7].error"
Warning: 299 - unknown field "status.conditions[7].transitioning"
Warning: 299 - unknown field "status.conditions[8].error"
Warning: 299 - unknown field "status.conditions[8].transitioning"
Warning: 299 - unknown field "status.conditions[9].error"
Warning: 299 - unknown field "status.conditions[9].transitioning"
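A long list of warnings like the one above is easier to read once the array indices are collapsed. A minimal sketch, assuming each warning value has the exact `299 - unknown field "..."` shape shown in this comment; `summarize_unknown_fields` is a hypothetical helper, not part of any Rancher tooling:

```python
import re
from collections import Counter

# Matches one warning value of the form: 299 - unknown field "<path>"
WARNING_RE = re.compile(r'^299 - unknown field "(?P<field>[^"]+)"$')


def summarize_unknown_fields(header_values):
    """Count unknown-field warnings, collapsing list indices such as [12] to [*]."""
    counts = Counter()
    for value in header_values:
        match = WARNING_RE.match(value.strip())
        if not match:
            continue  # Skip anything that is not an unknown-field warning.
        field = re.sub(r"\[\d+\]", "[*]", match.group("field"))
        counts[field] += 1
    return dict(counts)


warnings = [
    '299 - unknown field "metadata.fields"',
    '299 - unknown field "status.conditions[0].error"',
    '299 - unknown field "status.conditions[10].error"',
    '299 - unknown field "status.conditions[0].transitioning"',
]
print(summarize_unknown_fields(warnings))
```

Applied to the full list in this comment, the distinct paths reduce to `metadata.fields`, `metadata.relationships`, `metadata.state`, `spec.machineSelectorConfig`, and the `error` / `transitioning` fields on `status.conditions` entries.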

snasovich · 2023-06-02T18:58:03Z

@slickwarren , thank you for testing these. The warnings you mentioned should be covered as part of #9016.
@mantis-toboggan-md @richard-cox , FYI I've added metadata.fields and metadata.state to the description of that issue. Not sure if spec.machineSelectorConfig should be part of it - please feel free to update as needed.

slickwarren added status/release-blocker kind/bug-qa area/provisioning-rke2 labels May 12, 2023

slickwarren self-assigned this May 12, 2023

github-actions bot added the [zube]: To Triage label May 12, 2023

gaktive added kind/bug and removed [zube]: To Triage labels May 12, 2023

aalves08 self-assigned this May 15, 2023

aalves08 added the [zube]: Working label May 15, 2023

aalves08 added [zube]: To Test and removed [zube]: Working labels May 15, 2023

slickwarren added [zube]: Reopened and removed [zube]: To Test labels May 16, 2023

gaktive transferred this issue from rancher/dashboard May 18, 2023

snasovich added the team/area2 Hostbusters label May 18, 2023

zube bot removed the [zube]: Reopened label May 18, 2023

a-blender self-assigned this May 19, 2023

a-blender added the [zube]: Working label May 19, 2023

slickwarren unassigned aalves08 May 19, 2023

mantis-toboggan-md closed this as completed in #9011 May 31, 2023

zube bot added [zube]: Done and removed [zube]: Review labels May 31, 2023

mantis-toboggan-md added [zube]: To Test and removed [zube]: Done labels May 31, 2023

mantis-toboggan-md reopened this May 31, 2023

zube bot closed this as completed May 31, 2023

zube bot added [zube]: Done and removed [zube]: To Test labels May 31, 2023

mantis-toboggan-md reopened this May 31, 2023

zube bot added [zube]: To Triage and removed [zube]: Done labels May 31, 2023

mantis-toboggan-md added [zube]: Done and removed [zube]: To Triage labels May 31, 2023

zube bot closed this as completed May 31, 2023

zube bot removed the [zube]: To Test label May 31, 2023

mantis-toboggan-md reopened this May 31, 2023

mantis-toboggan-md added the [zube]: To Test label May 31, 2023

slickwarren added [zube]: QA Working and removed [zube]: To Test labels May 31, 2023

slickwarren closed this as completed Jun 2, 2023

zube bot added [zube]: Done and removed [zube]: QA Working labels Jun 2, 2023

KevinJoiner mentioned this issue Jun 7, 2023

[BUG] Allow Steve to handle generated fields. rancher/rancher#41772

Closed

zube bot removed the [zube]: Done label Sep 1, 2023
