502: bad gateway on <put> when removing cluster / fleet agent overrides from an existing rke2 cluster #9012

Closed
slickwarren opened this issue May 12, 2023 · 23 comments · Fixed by #9011

Comments

@slickwarren
Contributor

slickwarren commented May 12, 2023

Setup

  • Rancher version: v2.7-head(95edd02)
  • Rancher UI Extensions: n/a
  • Browser type & version: Chrome

Describe the bug

When an rke2 cluster that has existing cluster agent and fleet agent overrides is updated to remove all of those settings, the UI shows a 502 Bad Gateway error.

To Reproduce

  • deploy an rke2 cluster
    • enter cluster agent overrides for resources, add a toleration, and add a custom affinity
    • enter fleet agent overrides for resources, add a toleration, and add a custom affinity
  • allow the cluster to successfully get to an active state
  • upgrade the cluster
    • remove all overrides that were entered in the first step
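As a rough illustration of what the edit in the last step amounts to, here is a minimal sketch of a cluster object before and after the overrides are removed. The key names `clusterAgentDeploymentCustomization`, `fleetAgentDeploymentCustomization`, `overrideResourceRequirements`, `appendTolerations`, and `overrideAffinity` are assumptions (the thread only refers to "agentCustomization"), and the values are made up. Whether the UI drops the keys or sends them empty is not stated in this issue.

```python
import copy

# Assumed names for the two agent override blocks; treat them as hypothetical.
AGENT_KEYS = (
    "clusterAgentDeploymentCustomization",
    "fleetAgentDeploymentCustomization",
)


def example_overrides():
    """Return one illustrative override block: resources, a toleration, an affinity."""
    return {
        "overrideResourceRequirements": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        },
        "appendTolerations": [
            {"key": "example", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "overrideAffinity": {"nodeAffinity": {}},
    }


def remove_agent_overrides(cluster):
    """Return a copy of the cluster object with both override blocks dropped."""
    updated = copy.deepcopy(cluster)
    for key in AGENT_KEYS:
        updated.get("spec", {}).pop(key, None)
    return updated


# Step 1: cluster created with overrides on both agents (illustrative values).
cluster = {
    "metadata": {"name": "my-rke2-cluster"},
    "spec": {
        "kubernetesVersion": "<rke2 version>",
        **{key: example_overrides() for key in AGENT_KEYS},
    },
}

# Step 2: the edit that removes every override again.
edited = remove_agent_overrides(cluster)
```

The `edited` object stands in for the body of the update request that returned the 502 in this report.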

Result
An error is shown in the UI and the user is kept on the cluster edit screen, however the update was actually still applied and the overrides were removed

a padding to disable MSIE and chrome friendly error page

cluster did not appear to go into an updating state

Expected Result

if there's an error, I wouldn't expect the new spec to have applied to the cluster

Screenshots

Screen Shot 2023-05-12 at 8 52 11 AM

Additional context

did not happen for an rke1 cluster

@aalves08
Member

As per our meeting, I couldn't reproduce this issue. Moving to test.

@aalves08
Member

Still no luck here, even with your system, @slickwarren.

Try running this without the browser extensions on your system, as you suggested. I went through your system and wasn't able to reproduce it. 🙏

@slickwarren
Contributor Author

I'm able to reproduce this on both chrome and safari. Safari has a slightly different error message. DM'd with the exact payload.

{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}
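For anyone triaging a payload like the one above, a minimal sketch of how it could be inspected: the JSON wrapper carries an HTML page rather than a structured API error, and the page footer names `nginx`. The helper name `describe_error_payload` is hypothetical, and reading the footer as "a proxy produced this response" is an inference, not something the payload proves.

```python
import json
import re

# The Safari payload from this thread, kept as a raw string so the \r\n
# sequences stay JSON escapes.
SAFARI_PAYLOAD = (
    r'{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n'
    r'<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n'
    r'<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}'
)


def describe_error_payload(raw):
    """Summarize an error body: is it HTML, what is its title, who signed it."""
    body = raw
    try:
        parsed = json.loads(raw)
    except ValueError:
        parsed = None
    # Unwrap {"data": "<html>..."} when present; otherwise inspect the raw text.
    if isinstance(parsed, dict) and isinstance(parsed.get("data"), str):
        body = parsed["data"]
    flags = re.IGNORECASE | re.DOTALL
    title = re.search(r"<title>(.*?)</title>", body, flags)
    server = re.search(r"<hr>\s*<center>(.*?)</center>", body, flags)
    return {
        "is_html": "<html" in body.lower(),
        "title": title.group(1).strip() if title else None,
        "server": server.group(1).strip() if server else None,
    }


summary = describe_error_payload(SAFARI_PAYLOAD)
```

Here `summary` reports an HTML body titled "502 Bad Gateway" with an `nginx` footer.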

@aalves08
Member

Good day @slickwarren. Since this has been open for a few days on the UI side and it's a 502 (Bad Gateway) error, which indicates to me that this is most probably a backend issue, I would ask that this issue be reassigned to the backend team for further investigation. 🙏

On the UI side, not only could we not reproduce this, but we also couldn't find any indication that this is a UI/frontend issue.

Thanks for taking the time to go over this with me in a couple of calls. 🤜 🙇

FYI @gaktive @nwmac

@snasovich
Contributor

@slickwarren , could you please provide the details from the call that returned 502 from "Network" tab of Chrome Developer Tools for the complete picture? At a glance it seems the call from the UI didn't even reach Rancher backend code or Rancher was down at that exact time for some reason. Similar issue - https://www.reddit.com/r/kubernetes/comments/oaxarg/intermittent_502_bad_gateway_issue/

@gaktive gaktive transferred this issue from rancher/dashboard May 18, 2023
@snasovich snasovich added the team/area2 Hostbusters label May 18, 2023
@zube zube bot removed the [zube]: Reopened label May 18, 2023
@gaktive
Member

gaktive commented May 18, 2023

Transferred from rancher/dashboard based on @snasovich's comment since it doesn't look like UI. CC: @a-blender

@slickwarren
Contributor Author

Here's the info I have right now, please lmk if you need more:
safari error:

{"data":"<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"}

chrome error:
Screen Shot 2023-05-12 at 8 52 11 AM

chrome console logs:
Screen Shot 2023-05-12 at 8 52 54 AM
Screen Shot 2023-05-12 at 8 52 39 AM

Outside of these screenshots, I don't have the network data at this time. If the above info doesn't include what you're looking for, please lmk and I can get it for you. @snasovich

@a-blender

a-blender commented May 19, 2023

@slickwarren Thanks for the info!

From this description,

An error is shown in the UI and the user is kept on the cluster edit screen, however the update was actually still applied and the overrides were removed

a padding to disable MSIE and chrome friendly error page

cluster did not appear to go into an updating state.

If the update to remove the agent customization went through, then that tells me that the cluster agent is still connected to Rancher and that Rancher's network connection failed for other reasons, such as local network or ingress issues. If the connection to the cluster agent had failed, you'd see something more like a "disconnected from cluster agent" error, not a 502.

Also, it's fine if sometimes you don't see the cluster go into an updating state because the update can be fast. To verify the agent was redeployed with the update, check the rancher logs for redeploying agent and verify the timer on the cluster-agent pod was reset indicating a redeploy.
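The two checks described above could be combined roughly as follows. This is a minimal sketch: the function names, the sample timestamps, the cluster ID, and the exact log phrase "redeploying agent" are all assumptions for illustration, and the pod start time and log lines would have to be collected separately.

```python
from datetime import datetime


def parse_k8s_time(value):
    """Parse a Kubernetes-style RFC 3339 timestamp such as 2023-05-19T15:04:05Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def agent_was_redeployed(pod_start_time, edit_time, log_lines):
    """Report both signals: pod restarted after the edit, and a redeploy log line."""
    pod_is_newer = parse_k8s_time(pod_start_time) >= parse_k8s_time(edit_time)
    # The log phrase is an assumption taken from the wording of the comment above.
    logged = any("redeploying agent" in line.lower() for line in log_lines)
    return {"pod_restarted_after_edit": pod_is_newer, "redeploy_logged": logged}


# Made-up sample values: the cluster-agent pod started shortly after the edit.
result = agent_was_redeployed(
    pod_start_time="2023-05-19T15:04:05Z",
    edit_time="2023-05-19T15:03:40Z",
    log_lines=["[INFO] Redeploying agent for cluster c-m-example"],
)
```

If both values come back true, the edit most likely reached the downstream agent even though the cluster never visibly entered an updating state.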

  • Did you try with Firefox?
  • Did you see the overrides get removed from the downstream cluster agent or just the rke2 cluster obj?
  • Has the specific 502 error you are seeing been reproduced/intermittently reproduced in any other setting? Are there other open GH issues for it?

If you were having local network issues, I'd recommend trying to repro again today on a fresh install of Rancher. Rancher logs from the time of API call may also help.

@slickwarren
Contributor Author

  • This was not a local network issue. It was regularly reproducible on Chrome (regular and incognito) and Safari, on multiple versions of v2.7-head over the last 2 weeks.
  • The request does seem to go through in every case I've reproduced this, despite the message displayed
  • There is no other issue filed at this time, but there was some offline (due to security concerns) conversation around this as well in internal channels

@a-blender

a-blender commented May 19, 2023

@slickwarren When you see the request go through, do you see the overrides removed from the cluster agent? Or can you not access the cluster anymore?

Gotcha, I meant similar GH issues where a 502 issue was seen. We can also discuss offline.

@slickwarren
Contributor Author

The request does appear to update the spec / cluster / fleet agent appropriately, however the user doesn't know that until they leave the 502 error page, and go to edit the cluster again (or go and view the cluster / fleet agent deployments)
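The manual check described here (leave the error page, then look at the cluster again) can be expressed as a small read-after-write sketch. `put` and `get` are hypothetical callables standing in for the real update and read calls, and `FlakyBackend` is a made-up stand-in that mimics the behavior reported in this issue. This is not how the dashboard handles the error.

```python
def save_and_verify(put, get, desired_spec):
    """Send the update; on a 502, re-read the resource to see whether it applied.

    put(spec) returns an HTTP status code and get() returns the current spec.
    Both are hypothetical stand-ins for real API calls.
    """
    status = put(desired_spec)
    if status == 502:
        applied = get() == desired_spec
        return "applied-despite-502" if applied else "failed"
    return "applied" if 200 <= status < 300 else "failed"


class FlakyBackend:
    """Made-up backend that applies the update but still answers 502."""

    def __init__(self, spec):
        self.spec = spec

    def put(self, spec):
        self.spec = spec
        return 502

    def get(self):
        return self.spec


backend = FlakyBackend({"agentOverrides": {"cpu": "100m"}})
outcome = save_and_verify(backend.put, backend.get, {})
```

With this stand-in, `outcome` is `"applied-despite-502"`, which matches what was observed: the spec changed even though the request reported an error.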

@mantis-toboggan-md
Member

I've made a follow-up issue here #9016

@slickwarren
Contributor Author

tested on v2.7-head (f0d4078):

  • deploy an rke2 cluster with all agentCustomization set for fleet and cluster agents
  • update the cluster, remove all agentCustomization / set back to default -- pass

notes:
since this has gone in, I haven't reproduced this issue
the following warnings are still in the headers of the removal requests:

Warning:
299 - unknown field "metadata.fields"
Warning:
299 - unknown field "metadata.relationships"
Warning:
299 - unknown field "metadata.state"
Warning:
299 - unknown field "spec.machineSelectorConfig"
Warning:
299 - unknown field "status.conditions[0].error"
Warning:
299 - unknown field "status.conditions[0].transitioning"
Warning:
299 - unknown field "status.conditions[10].error"
Warning:
299 - unknown field "status.conditions[10].transitioning"
Warning:
299 - unknown field "status.conditions[11].error"
Warning:
299 - unknown field "status.conditions[11].transitioning"
Warning:
299 - unknown field "status.conditions[12].error"
Warning:
299 - unknown field "status.conditions[12].transitioning"
Warning:
299 - unknown field "status.conditions[13].error"
Warning:
299 - unknown field "status.conditions[13].transitioning"
Warning:
299 - unknown field "status.conditions[14].error"
Warning:
299 - unknown field "status.conditions[14].transitioning"
Warning:
299 - unknown field "status.conditions[15].error"
Warning:
299 - unknown field "status.conditions[15].transitioning"
Warning:
299 - unknown field "status.conditions[16].error"
Warning:
299 - unknown field "status.conditions[16].transitioning"
Warning:
299 - unknown field "status.conditions[17].error"
Warning:
299 - unknown field "status.conditions[17].transitioning"
Warning:
299 - unknown field "status.conditions[18].error"
Warning:
299 - unknown field "status.conditions[18].transitioning"
Warning:
299 - unknown field "status.conditions[19].error"
Warning:
299 - unknown field "status.conditions[19].transitioning"
Warning:
299 - unknown field "status.conditions[1].error"
Warning:
299 - unknown field "status.conditions[1].transitioning"
Warning:
299 - unknown field "status.conditions[20].error"
Warning:
299 - unknown field "status.conditions[20].transitioning"
Warning:
299 - unknown field "status.conditions[21].error"
Warning:
299 - unknown field "status.conditions[21].transitioning"
Warning:
299 - unknown field "status.conditions[22].error"
Warning:
299 - unknown field "status.conditions[22].transitioning"
Warning:
299 - unknown field "status.conditions[2].error"
Warning:
299 - unknown field "status.conditions[2].transitioning"
Warning:
299 - unknown field "status.conditions[3].error"
Warning:
299 - unknown field "status.conditions[3].transitioning"
Warning:
299 - unknown field "status.conditions[4].error"
Warning:
299 - unknown field "status.conditions[4].transitioning"
Warning:
299 - unknown field "status.conditions[5].error"
Warning:
299 - unknown field "status.conditions[5].transitioning"
Warning:
299 - unknown field "status.conditions[6].error"
Warning:
299 - unknown field "status.conditions[6].transitioning"
Warning:
299 - unknown field "status.conditions[7].error"
Warning:
299 - unknown field "status.conditions[7].transitioning"
Warning:
299 - unknown field "status.conditions[8].error"
Warning:
299 - unknown field "status.conditions[8].transitioning"
Warning:
299 - unknown field "status.conditions[9].error"
Warning:
299 - unknown field "status.conditions[9].transitioning"
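Most of the list above repeats the same two sub-fields for every entry in `status.conditions`. A minimal sketch for collapsing such a dump into distinct field paths, assuming the text is pasted in the same two-line `Warning:` / `299 - unknown field "..."` form shown here (the function name and the shortened sample are illustrative):

```python
import re
from collections import Counter

FIELD_RE = re.compile(r'299 - unknown field "([^"]+)"')
INDEX_RE = re.compile(r"\[\d+\]")


def summarize_unknown_fields(header_text):
    """Count unknown-field warnings, ignoring the array index in each path."""
    fields = FIELD_RE.findall(header_text)
    return Counter(INDEX_RE.sub("[]", field) for field in fields)


# Shortened sample in the same shape as the list above.
sample = """Warning:
299 - unknown field "metadata.fields"
Warning:
299 - unknown field "status.conditions[0].error"
Warning:
299 - unknown field "status.conditions[10].error"
Warning:
299 - unknown field "status.conditions[0].transitioning"
"""

summary = summarize_unknown_fields(sample)
```

Applied to the full list, this kind of grouping reduces the warnings to the six paths actually involved: `metadata.fields`, `metadata.relationships`, `metadata.state`, `spec.machineSelectorConfig`, `status.conditions[].error`, and `status.conditions[].transitioning`.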

@snasovich
Contributor

@slickwarren , thank you for testing these. The warnings you mentioned should be covered as part of #9016.
@mantis-toboggan-md @richard-cox , FYI I've added metadata.fields and metadata.state to the description of that issue. Not sure if spec.machineSelectorConfig should be part of it - please feel free to update as needed.
