Fix type assertion bug #910
This is fine for now, but we have to find a better way of doing this in the future. We could have all Cannot*Container errors, or just all custom error types, implement an interface that extends the error interface with an additional GetErrorName() method, and use that to check the error instead of being inconsistent with pointer/non-pointer assertions.
LGTM.
I agree with @aaithal that we may need to rethink our error wrappers to give them clearer meaning and remove the chance of the sort of inappropriate type assertion that bit us here. We should do an audit of these assertions and ensure we don't have another one of these bugs hiding in the code.
@aaithal we already have an What we really need here is a new interface type for the
agent/engine/task_manager.go
Outdated
@@ -254,7 +254,7 @@ func (mtask *managedTask) handleContainerChange(containerChange dockerContainerC
	}
	// If docker returned a transient error while trying to stop a container,
	// reset the known status to the current status and return
-	cannotStopContainerError, ok := event.Error.(*CannotStopContainerError)
+	cannotStopContainerError, ok := event.Error.(CannotStopContainerError)
While this fixes it for now, it doesn't prevent the same problem from happening again. I would much prefer that we define an interface for this and type-assert to the interface. With an interface type assertion it doesn't matter whether you have a struct or a pointer, since either can satisfy the interface. See here.
Something like this:
type maybeUnretriableError interface {
	IsUnretriableError() bool
}
You should be able to just define this above handleContainerChange and use it here.
Pretty much exactly what I'm doing... 🥇
Except that I'm taking the opportunity to change that method name!
LGTM if you fix the godoc 😄
agent/engine/errors.go
Outdated
@@ -154,15 +159,19 @@ func (err CannotStopContainerError) ErrorName() string {
	return "CannotStopContainerError"
}

func (err CannotStopContainerError) IsUnretriableError() bool {
	// When stopping a container, most errors that we can get should be
You have this in a godoc location, but it's not godoc format. Change the beginning to read "IsRetriableError returns ..."
🦑 🐙 🌊 🐋 🐟 🌊 🐚
✅
Please address Sam's godoc comment before merging this.
b1d2604 to 6a2bcb7
… status. A type assertion of event.Error assumed that the variable being tested was a reference, when we actually weren't consistent in how we assigned that variable. This caused the assertion to fail in places where it should have passed, ultimately causing us to falsely report containers as stopped in the case where docker returned an API error in response to the 'stop' call. To fix this, we define a new 'cannotStopContainerError' interface and perform the type assertion on the interface.
…ntainerError.IsUnretryableError
The only place this method was used was in a negated test. Negating negatives requires unnecessary mental work on the part of the reader. IsRetryableError is quicker to understand.
6a2bcb7 to c8405db
A type assertion of event.Error assumed that the variable being tested was a
reference, when we actually weren't consistent in how we assigned that
variable. This caused the assertion to fail in places where it should have
passed, ultimately causing us to leak containers in the case where docker
returned an API error in response to the 'stop' call.
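The failure mode described in this commit message can be reproduced in isolation. The sketch below uses a hypothetical stopError type in place of CannotStopContainerError; assertPointer and assertValue are illustrative helpers, not agent code.

```go
package main

import "fmt"

// stopError is a hypothetical stand-in for CannotStopContainerError.
type stopError struct{ msg string }

func (e stopError) Error() string { return e.msg }

// assertPointer mimics the original assertion against the pointer type.
func assertPointer(err error) bool {
	_, ok := err.(*stopError)
	return ok
}

// assertValue mimics the minimal fix, asserting against the struct type.
func assertValue(err error) bool {
	_, ok := err.(stopError)
	return ok
}

func main() {
	var byValue error = stopError{"stop failed"}
	var byPointer error = &stopError{"stop failed"}

	// A concrete type assertion matches only the exact dynamic type.
	fmt.Println(assertPointer(byValue), assertValue(byValue))     // false true
	fmt.Println(assertPointer(byPointer), assertValue(byPointer)) // true false
}
```

Either concrete assertion silently returns ok == false when the error was assigned the other way, which is why the handling code for a failed 'stop' call could be skipped.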
Summary
Fix a bug in which containers can be leaked in the event that docker returns an error in response to a 'stop' API call.
Implementation details
This is a minimal fix that ensures that we're performing assertions of the same type that we're actually assigning. It additionally makes a minor change to log output to add some additional detail.
Testing
Tested with a custom build of Docker that forces the 'stop' API call to return a 500 error. Without this change, the agent logs "Error while pulling container; will try to run anyways" and leaks a container. (Leak, in this case, means that it has indicated to the ECS service that the task has successfully transitioned to STOPPED even though at least one associated container was still running, and no further efforts would be made to terminate the container.)
With the change, the agent exhibits the expected behavior: it retries the 'stop' call indefinitely and does not signal that the task has transitioned to STOPPED.
(make release)
(go build -out amazon-ecs-agent.exe ./agent)
(make test) pass
(go test -timeout=25s ./agent/...) pass
(make run-integ-tests) pass
(.\scripts\run-integ-tests.ps1) pass
(make run-functional-tests) pass
(.\scripts\run-functional-tests.ps1) pass
New tests cover the changes:
Description for the changelog
Bug -- Fix a situation in which containers may be falsely reported as STOPPED in the case of a Docker "stop" API failure.
Licensing
This contribution is under the terms of the Apache 2.0 License: yes