-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GameServer container restart before Ready, move to Unhealthy state After (v2) #1099
GameServer container restart before Ready, move to Unhealthy state After (v2) #1099
Conversation
Build Failed 😱 Build Id: 5b35e2d0-dd16-4745-8bec-ff6a360671c5 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
08d6f54
to
47975e6
Compare
/hold e2e test found a bug in my implementation. Need to fix! |
I'm waiting for @markmandel to remove the hold before I review. |
47975e6
to
60a955e
Compare
/hold cancel While I think this should stay in feature freeze, as I would like to go through a development cycle for extra confidence, it is ready for review again now. |
60a955e
to
2aab79d
Compare
Build Succeeded 👏 Build Id: 2b82f742-c6c7-431b-9123-024ed8b01fb6 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
OMG! We can finally get this reviewed an in a release! 💃 I feel like I've been working on this bug forever 😄 |
Now that the release is out this is the next review up on my list. |
@@ -91,6 +91,10 @@ const ( | |||
// DevAddressAnnotation is an annotation to indicate that a GameServer hosted outside of Agones. | |||
// A locally hosted GameServer is not managed by Agones it is just simply registered. | |||
DevAddressAnnotation = "agones.dev/dev-address" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(aside) Is there a reason that this one is prefixed with "agones.dev" instead of being agones.GroupName + "/dev-address"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, we should fix this.
test/e2e/gameserver_test.go
Outdated
logger.WithError(err).WithField("gs", newGs.ObjectMeta.Name).WithField("state", newGs.Status.State).Info("no ready, trying again") | ||
return false, nil | ||
}) | ||
if !assert.NoError(t, err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elsewhere I just see assert.NoError()
but this one wraps that and calls FailNow
. Is that to short circuit running the remainder of the test? If so, would it be simpler to just write
if err != nil {
assert.FailNow(t, ...)
}
I find the statements that check the boolean result of the assert to be hard to read, and they are basically just checking if err is or isn't nil and calling another assert, which can be done more simply.
test/e2e/gameserver_test.go
Outdated
// also gives the process a chance to come back up | ||
_, err = conn.Write([]byte("CRASH")) | ||
logger.WithError(err).Info("no crash, trying again") | ||
return false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this return false, err
instead of nil? Or will returning an error short circuit the polling function? It seems weird that the error writing the udp packet is logged but otherwise ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are indeed correct - returning the error will short circuit the poll. It may return an error as the receiver may not be up and running yet, so it's useful for debugging purposes, but otherwise it is expected.
test/e2e/gameserver_test.go
Outdated
logger.WithError(err).Warn("could not send CRASH udp packet") | ||
return false, nil | ||
} | ||
return true, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is successfully sending the packet sufficient to break the loop? Shouldn't we verify that the pod actually exited (e.g. the game server received the packet)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, now we have a helper method.
test/e2e/gameserver_test.go
Outdated
// now crash, should be unhealthy, since it's after being Ready | ||
logger.Info("crashing again, should be unhealthy") | ||
// retry on crash, as with the restarts, sometimes Go takes a moment to send this through. | ||
err = wait.Poll(time.Second, 3*time.Minute, func() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All three of these udp sending loops are very similar in structure (write a packet, check for state, try again). Have you considered moving them into a helper function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Done!
pkg/gameservers/health.go
Outdated
} | ||
|
||
hc.recorder.Event(gs, corev1.EventTypeWarning, string(gsCopy.Status.State), "Issue with Gameserver pod") | ||
if !skip { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be inverted.
if skip {
// The game server hasn't been marked ready, so it doesn't transition to unhealthy.
return nil
}
To leave the other code with one less level of indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah excellent suggestion!
pkg/gameservers/health.go
Outdated
|
||
return nil | ||
} | ||
|
||
// skipUnhealthy checks the GameServer is Ready or after Ready, and what the container crash state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is really hard to read. I tried taking a stab at rewriting it but didn't have any particularly great ways to phrase it. If you can think of a way to say it more clearly that'd be awesome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a stab. PTAL when I push my changes up. I think it's better though.
v.setup(gs, pod) | ||
|
||
m.KubeClient.AddReactor("list", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) { | ||
return true, &corev1.PodList{Items: []corev1.Pod{*pod}}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this doesn't cover the branches where pod doesn't exist or k8s returns an error to put the request back into the queue.
pkg/gameservers/health_test.go
Outdated
Name: gs.Spec.Container, | ||
State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{}}, | ||
}} | ||
}, expected: true}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please put expected on a new line
}, | ||
"before ready, with no terminated container": { | ||
setup: func(gs *agonesv1.GameServer, pod *corev1.Pod) { | ||
gs.Status.State = agonesv1.GameServerStateScheduled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the before ready states are all the same, given the other test coverage it may not matter but you could vary which state is set to one of the other 4 states that are before ready.
This brings our implementation inline with what our health checking documentation states that we do. This is done by implementing extra checks in the HealthController to determine if it's appropriate to move to Unhealthy rather than allow a restart to occur. Replaced PR googleforgames#1069 Closes googleforgames#956
2aab79d
to
ee1c3dd
Compare
@roberthbailey this should be good to review again 👍 |
Build Succeeded 👏 Build Id: 8344dedc-9848-4200-8c4e-c35d71ca9555 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: markmandel, roberthbailey The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This brings our implementation inline with what our health checking documentation states that we do.
This is done by implementing extra checks in the HealthController to determine if it's appropriate to move to Unhealthy rather than allow a restart to occur.
Replaced PR #1069
Closes #956