Add dist mnist model for e2e test #549
LGTM.
Does the test code have any changes compared to the official example in TensorFlow?
@gaocegege Just one small change in argument parsing: the script reads its args from ENV.
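For readers unfamiliar with the pattern, here is a minimal sketch of reading script arguments from environment variables with command-line flags as overrides. The variable names (`JOB_NAME`, `TASK_INDEX`, `TRAIN_STEPS`), their defaults, and the `build_parser` helper are hypothetical and are not taken from the PR's `dist_mnist.py`.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from environment variables.

    All variable names and default values here are illustrative
    assumptions, not the ones used by the PR.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Distributed MNIST (sketch)")
    # Each flag falls back to an environment variable, then to a constant.
    parser.add_argument("--job_name", default=env.get("JOB_NAME", "worker"))
    parser.add_argument(
        "--task_index", type=int, default=int(env.get("TASK_INDEX", "0"))
    )
    parser.add_argument(
        "--train_steps", type=int, default=int(env.get("TRAIN_STEPS", "200"))
    )
    return parser
```

With this shape, a container can be configured purely through its environment (`build_parser().parse_args()` with no flags), while an explicit flag such as `--task_index 3` still takes precedence.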
OK, could we place the file in examples/v1alpha2/mnist/? I don't think we will run e2e tests for v1alpha2 in the short term.
test/e2e/dist-mnist/README.md
Outdated
@@ -0,0 +1,39 @@
### Distributed mnist model for e2e test

This folder containers docker file and distributed mnist model for e2e test.
s/containers docker file/contains Dockerfile
Thanks, done.
Why do we need to use mnist for testing? The current E2E tests verify that ops can be assigned to all of the workers and that those ops are executed successfully. How does using a model like mnist help? Are you actually verifying that distributed training is working and that ops are properly assigned and executed on multiple workers?

For E2E tests, I think the thing we want to test is not that we can train mnist but that all the TFServers are created and properly configured to talk to each other. Can we create simpler tests that explicitly test this? For GPUs, for example, we can log device placement and then check that ops are actually assigned to the GPU. I thought we were already doing this, but it looks like we aren't. We should open an issue to track that.
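As a rough illustration of the device-placement check suggested above, the sketch below scans placement log text and reports which worker tasks never received an op. The line format is an assumption modeled on TensorFlow 1.x placement logs (`<op>: (<type>): /job:.../task:N/device:TYPE:n`) and may differ between versions; `SAMPLE_LOG` and all function names are hypothetical.

```python
import re

# Assumed log line shape (may vary by TensorFlow version):
#   MatMul: (MatMul): /job:worker/replica:0/task:0/device:GPU:0
PLACEMENT_RE = re.compile(
    r"^(?P<op>[^\s:]+): \((?P<type>\w+)\): "
    r"/job:(?P<job>\w+)/replica:\d+/task:(?P<task>\d+)"
    r"/device:(?P<device>\w+):\d+"
)

# Hypothetical sample, not real output from the PR's test run.
SAMPLE_LOG = """\
hid_w: (VariableV2): /job:ps/replica:0/task:0/device:CPU:0
MatMul: (MatMul): /job:worker/replica:0/task:0/device:GPU:0
MatMul_1: (MatMul): /job:worker/replica:0/task:1/device:GPU:0
"""


def parse_placements(log_text):
    """Return (op, job, task_index, device_type) for each matching line."""
    placements = []
    for line in log_text.splitlines():
        match = PLACEMENT_RE.match(line.strip())
        if match:
            placements.append(
                (match["op"], match["job"], int(match["task"]), match["device"])
            )
    return placements


def missing_workers(log_text, num_workers, job="worker"):
    """Return the sorted worker task indices that were assigned no ops."""
    used = {task for _, j, task, _ in parse_placements(log_text) if j == job}
    return sorted(set(range(num_workers)) - used)
```

An E2E test could collect the placement logs from the job's pods and fail if `missing_workers(log_text, num_workers)` is non-empty. A similar filter on the device type would cover the GPU case.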
/assign jlewi |
@jlewi Hi,
We can achieve this goal by launching a real distributed training job, not just a smoke test.
@jlewi We need more tests, including GPU tests, of course. I will add more tests later.
9be81fe to 59f5b15
@ScorpioCPH great thanks. |
@jlewi Thanks, any concerns about this PR? @gaocegege By the way, CI failed because the command "gometalinter --config=linter_config.json ./pkg/..." exited with status 1. It may not be related to this PR.
@ScorpioCPH no |
Yeah, Travis CI failed but the presubmit should have passed. BTW, lgtm.
gometalinter is a little weird; it seems to give false negatives. Sometimes there are errors but it doesn't report them to us.
/hold cancel |
After the rebase, Travis should pass. |
59f5b15 to cbd9b69
@gaocegege The rebase is done, but CI is still failing. Could you take a look, please?
/retest OK, sure. |
I think the CI is fixed, please rebase onto master 😄
cbd9b69 to e202a8f
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. Approval requirements bypassed by manually added approval.
/retest |
/retest |
Hi @ScorpioCPH, I think we should fix the linting issues in the Python module dist_mnist.
e202a8f to d512fdf
New changes are detected. LGTM label has been removed. |
@ScorpioCPH: The following test failed, say
Hi, could you please fix the linting issues? The second failure is not caused by your code, IMO. |
I hit an error when I ran the build script: https://pastebin.ubuntu.com/p/Z8Xv3NNhss/ 🤔 Do you have any idea what causes it?
@gaocegege Hi, please try the latest code. The Dockerfile:
FROM tensorflow/tensorflow:1.5.0
ADD . /var/tf_dist_mnist
ENTRYPOINT ["python", "/var/tf_dist_mnist/dist_mnist.py"]
And can we ignore the Python lint warnings? I don't think they are critical, and it will take some time to fix all of them.
Then we could add it to the pylint ignore list.
Could you do it? Or we could merge this and I can add it for you.
I will merge it first since I rely on the PR to test. Then I will file a PR to fix the presubmit test. |
Thanks for your contribution! |
Hi, now that the v1alpha2 code has been merged, we need some e2e tests for this new API.
This PR imports a distributed mnist model for e2e testing. @gaocegege @jlewi PTAL, thanks!