gpu: better logging for debugging #1762

Merged
merged 1 commit into from
Dec 27, 2018

@sharanyad sharanyad commented Dec 26, 2018

Summary

Implementation details

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -78,6 +79,7 @@ func (n *NvidiaGPUManager) Initialize() error {
n.SetGPUIDs(gpuIDs)
n.SetDevices()
}
seelog.Info("Config for GPU support is enabled, but GPU information is not found; continuing without it")
Contributor

Wondering if this should be handled as a failure. For example, when a customer expects to use a GPU, shouldn't they be notified of the misconfiguration in a more explicit way?

Contributor Author

This config is not recommended for customers to add. It is a GPU instance-level config, but we also let customers override it to true/false. If, for some reason, a customer sets it to true on a non-GPU instance, there is value in not stopping the agent and continuing to register as a normal instance. I can change the log level to error.
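
The behavior described in this reply (log at error level, but keep the agent running when GPU information is missing) could look roughly like the sketch below. This is not the agent's actual code: `gpuManager`, `initialize`, the `gpu_info.txt` path, and the whitespace-separated file format are all hypothetical, and the standard `log` package stands in for seelog to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"strings"
)

// gpuManager is a hypothetical stand-in for the agent's GPU manager.
type gpuManager struct {
	gpuIDs []string
}

// initialize loads GPU IDs from infoPath (assumed format: IDs separated
// by whitespace). A missing file is logged as an error but is NOT
// returned, so the caller can continue and register as a normal instance.
func (m *gpuManager) initialize(infoPath string) error {
	data, err := os.ReadFile(infoPath)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			log.Printf("[ERROR] GPU support is enabled, but no GPU information was found at %q; continuing without it", infoPath)
			return nil
		}
		// Any other failure (for example, a permissions problem) is still reported.
		return fmt.Errorf("reading GPU info: %w", err)
	}
	m.gpuIDs = strings.Fields(string(data))
	return nil
}

func main() {
	m := &gpuManager{}
	// Hypothetical path; on a non-GPU instance this file would not exist.
	if err := m.initialize("gpu_info.txt"); err != nil {
		log.Fatalf("GPU initialization failed: %v", err)
	}
	fmt.Printf("continuing with %d GPU(s)\n", len(m.gpuIDs))
}
```

The trade-off is the one raised in the review: returning nil keeps a misconfigured non-GPU instance usable, while the error-level log line is the only signal that the GPU config did not take effect.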

@sharanyad
Contributor Author

Flaky Windows functional test failure:
--- FAIL: TestOOMContainer (20.53s)

TestTaskLevelVolume timed out in the Windows integration tests, which seems unrelated. I will add a note so that it can be investigated outside of this PR.

@sharanyad
Contributor Author

Merging the PR since the failures are not related to the changes.

@sharanyad sharanyad merged commit 86d3a26 into aws:gpu-support Dec 27, 2018