-
Notifications
You must be signed in to change notification settings - Fork 613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
engine: add inactivity timeout for image pulling #1290
Conversation
agent/engine/docker_client.go
Outdated
@@ -83,6 +83,10 @@ const ( | |||
// around a docker bug which sometimes results in pulls not progressing. | |||
dockerPullBeginTimeout = 5 * time.Minute | |||
|
|||
// dockerPullInactivityTimeout is the amount of time that we will | |||
// wait when the pulling does not progress | |||
dockerPullInactivityTimeout = 5 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did we decide on 5 mins? is there any data/metrics available for us? Ideally p99 would be a good value to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently it's set arbitrarily. i'm not sure if there's any data/metrics for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yunhee-l, hm we havent figured out a good way to measure p99 in this case. the only real visibility we've had so far is from root causing hanging tasks/containers which will get stuck for arbitrary periods of time. we want to introduce as a fail fast for docker inactivity at least during image pulls.
do you know if we could get a better idea of what to set here from ecr side? 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From ECR side, P99 is in seconds not minutes (I can show you the dashboard in person) so 5 mins is more than long enough timeout. Thought 5 mins was a bit too long in general but if it's only meant to be used for hanging tasks I suppose it is not a big issue. However flip side is whether this might adversely impact large image pull that can take longer than 5 mins? Is ECR the only repo we pull from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However flip side is whether this might adversely impact large image pull that can take longer than 5 mins?
The dockerPullInactivityTimeout
shouldn't adversely affect large image pulls. This is explicitly to checking to timeout if docker is hanging during the download and large image pulls should show continuous activity during the download.
Is ECR the only repo we pull from?
No, the agent could be pulling from anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation! if timeout is specifically for the case where docker is hanging and inactive, then I don't think any of my concerns are relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just as a check, i tested it manually with a large image (3GB) and small timeout (5 secs), and the pulling worked fine with no timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5m
seems too long here. I don't think this should be greater than 1m
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is set to 1 minute now. I have commented on the SageMaker ticket to see if there's any input at their side.
@fenxiong can you please incorporate the following tests for this as well?
|
@aaithal I have finished the above tests, with timeout set to 5 seconds. For the Docker daemon test, i'm able to observe the timeout and able to restore it. For the blackhole test, i'm not able to observe the timeout because after i added the ip to blackhole the pulling failed immediately and wouldn't reach timeout. For the random drop case, i saw no timeout happen, and i think this is expected right? if 50% packets are not dropped the pulling won't stop |
agent/engine/docker_client_test.go
Outdated
docker.ErrInactivityTimeout).Times(maximumPullRetries) // expected number of retries | ||
|
||
metadata := client.PullImage("image", nil) | ||
if metadata.Error == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the assert
instead. Same below.
8712e0d
to
b953fba
Compare
b953fba
to
f35fb22
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change lgtm, thanks for the detailed testing report in the PR summary 😁
@fenxiong can you please make sure to submit a new PR that modifies Thanks, |
Summary
This pull request is for issue 1249 here: #1249, which is to introduce InactivityTimeout for image pulling.
Implementation details
By using the inactivity timeout parameter supported by go-dockerclient.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes: yes
Three manual testings are performed as follow, with timeout set to 5 seconds:
(1) Start pulling a large image, send SIGSTOP to Docker daemon process, see whether inactivity timeout is triggered; send SIGCONT to Docker daemon process, see whether pulling resumes.
Result: after SIGSTOP, timeout is triggered; after SIGCONT, pulling resumes
(2) Start pulling a large image (from ECR), blackhole ECR, see whether timeout is triggered; remove blackhole, see whether pulling resumes
Result: after adding blackhole, the pulling fails immediately due to connection failure, and doesn't reach the timeout limit.
(3) Start pulling a large image, randomly drop 50% of packets on the host during the pull
Result: inactivity timeout is not triggered; pulling succeeds normally
Description for the changelog
Enhancement - Introduce InactivityTimeout to image pulling.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.