Fix Rolling Update with Allocated GameServers #2420
Build Failed 😱 Build Id: 873efebb-0a62-45b0-a32a-d491860e94a0 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: 118e3d6e-effc-4498-8993-1a6fec50c2a8 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
The potential solution in #2397 (comment) which would change
to
(in https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L532) does scale down the old GSS, while leaving Ready replicas up: Screen.Recording.2022-01-05.at.17.38.18.mov |
Just letting you know we haven't forgotten about this - been clearing up a bunch of flaky test issues so we're able to get better velocity overall. |
Oh man, this has been languishing for too long. Sorry! Perf season has hit, and we all got swamped. I haven't forgotten about this. If it helps at all, I'm happy for you to try your proposed fix in this PR and see if it passes all our CI checks. |
8f28bf7 to 7dba592
@markmandel updated the PR to include the proposed fix. I still need to make some other changes and clean up if this works, but will leave it as is for now to see if the tests pass.
test/e2e/fleet_test.go
Outdated
fixtures := []bool{true} // , false} // TODO Enable these again
maxSurge := []string{"25%"} // , "10%"} // TODO
doCycle := true // TODO: fixture?
doCycle should be a fixture as well (to test both with and without allocated GameServers), but that would double the number of tests to run from 4 to 8? That doesn't seem great, especially since they all run in parallel in the same cluster.
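To make the 4 versus 8 arithmetic concrete, here is a minimal sketch of how the test matrix grows when `doCycle` becomes a fixture. The `testCase` struct and `buildMatrix` helper are hypothetical names for illustration only and are not taken from `test/e2e/fleet_test.go`; the fixture values are the ones shown (partly commented out) in the snippet above.

```go
package main

import "fmt"

// testCase is a hypothetical holder for one combination of fixtures.
type testCase struct {
	fixture  bool
	maxSurge string
	doCycle  bool
}

// buildMatrix returns the cross product of the three fixture slices.
func buildMatrix(fixtures []bool, maxSurge []string, doCycle []bool) []testCase {
	var cases []testCase
	for _, f := range fixtures {
		for _, s := range maxSurge {
			for _, c := range doCycle {
				cases = append(cases, testCase{fixture: f, maxSurge: s, doCycle: c})
			}
		}
	}
	return cases
}

func main() {
	fixtures := []bool{true, false}
	maxSurge := []string{"25%", "10%"}

	// doCycle fixed to a single value: 2 x 2 x 1 combinations.
	fmt.Println(len(buildMatrix(fixtures, maxSurge, []bool{true}))) // prints 4

	// doCycle as a fixture: 2 x 2 x 2 combinations.
	fmt.Println(len(buildMatrix(fixtures, maxSurge, []bool{true, false}))) // prints 8
}
```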
Build Failed 😱 Build Id: aca927c1-717f-4e97-b529-4235b147bd09 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Finally come back around. I'm going to bump this CI test again (and upgrade to e2e-stable:
|
Build Failed 😱 Build Id: 081f46c4-bf8a-4fd0-8063-ee4fb5204e99 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Looks like e2e-stable was not happy:
I'll run it one more time, see if we get consistent failures. |
Build Failed 😱 Build Id: da010ed2-7d62-4229-aeab-40e08b5b1bf1 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Hmm, was able to reproduce the failure locally as well. Will take a look at what's going on. The most recent build does look like it failed on a different test:
but this PR's test was running at the same time, so it might be related? Could the test cluster be running out of space? |
We lock on each build so only one e2e test is running at a time. I also have some dashboards to check if the cluster is full. There is some flakiness in the e2e test suite, but this doesn't 100% look like our usual flakes. But I would focus on the reproducible issues, and then we can dig deeper. |
Build Succeeded 👏 Build Id: 97b3b297-b7c1-4889-a0be-74ae139f8dde The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Failed 😱 Build Id: 2d178aa5-e01b-477a-92c3-2cdf20fbe95b To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: 7cd45406-508d-451a-9fdc-72a344778a16 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
#2420 (comment) was a conflict in updating the Fleet again. Should hopefully be fixed now after using
Second build failure was a lint error. |
Build Failed 😱 Build Id: 44276cda-2370-44cf-bde5-9ee8b0082429 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Something got stuck maybe?
|
|
Build Succeeded 👏 Build Id: db367282-675b-4cec-abb5-f5eee48aef3d The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Succeeded 👏 Build Id: 3b17e8b6-0ab6-4036-91e1-2a8bcf475365 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
I was going to take this for a spin, but looks like it's consistently passing, so that is awesome. |
Took it for a spin! Worked!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: markmandel, WVerlaek
What type of PR is this?
What this PR does / Why we need it:
See discussion in #2397. This PR reproduces the issue there, where a Fleet rolling update can get stuck if there's a significant percentage of GameServers in an Allocated state.
This is demonstrated by updating the existing e2e test for Fleet Rolling Update to repeatedly allocate and shut down GameServers in the Fleet, keeping roughly half of the Fleet in an Allocated state.
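As a rough illustration of the allocate-and-shut-down cycling described above, the following is a purely in-memory sketch. It does not use the Agones client or reproduce the actual e2e test. The `state` type and the `cycle` and `countAllocated` helpers are hypothetical, and the sketch assumes a shut-down GameServer is immediately replaced by a fresh Ready one.

```go
package main

import "fmt"

// state is a hypothetical, simplified GameServer state.
type state int

const (
	ready state = iota
	allocated
)

// cycle models one round: every Allocated server is shut down (assumed here
// to be replaced at once by a Ready server), then Ready servers are allocated
// until target servers are Allocated.
func cycle(fleet []state, target int) {
	for i := range fleet {
		if fleet[i] == allocated {
			fleet[i] = ready
		}
	}
	n := 0
	for i := range fleet {
		if n == target {
			break
		}
		if fleet[i] == ready {
			fleet[i] = allocated
			n++
		}
	}
}

// countAllocated reports how many servers are currently Allocated.
func countAllocated(fleet []state) int {
	n := 0
	for _, s := range fleet {
		if s == allocated {
			n++
		}
	}
	return n
}

func main() {
	fleet := make([]state, 8) // all servers start Ready
	for round := 0; round < 3; round++ {
		cycle(fleet, len(fleet)/2)
		fmt.Printf("round %d: %d of %d allocated\n", round, countAllocated(fleet), len(fleet))
	}
}
```

In this toy model, half of the fleet is Allocated after every round, which is the steady condition the modified test tries to maintain while the rolling update runs.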
Recording of a run of the e2e test:
Screen.Recording.2022-01-05.at.15.44.02.mov
First, half of the Fleet gets Allocated; then the Fleet update is made, which creates the second GameServerSet (GSS). The two GSSs then get stuck indefinitely, and the old GSS never fully scales down.
Which issue(s) this PR fixes:
Closes #2397
Special notes for your reviewer: