Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dist mnist model for e2e test #549

Merged
merged 1 commit into from
May 10, 2018

Conversation

ScorpioCPH
Copy link
Member

@ScorpioCPH ScorpioCPH commented Apr 21, 2018

Hi, after v1alpha2 code merged, we need some e2e test for this new API.
This PR import distributed mnist model for e2e test, @gaocegege @jlewi PTAL, thanks!


This change is Reviewable

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Does the test code have any changes compared to the official example in TensorFlow?

@ScorpioCPH
Copy link
Member Author

@gaocegege Just a little change in args parsing: read args from ENV TF_CONFIG.

@gaocegege
Copy link
Member

OK, could we place the file in examples/v1alpha2/mnist/? I think we do not run e2e test for v1alpha2 in short term.

@coveralls
Copy link

coveralls commented Apr 21, 2018

Coverage Status

Coverage remained the same at 49.861% when pulling d512fdf on ScorpioCPH:add-dist-mnist-for-e2e-test into 2a22ad4 on kubeflow:master.

@@ -0,0 +1,39 @@
### Distributed mnist model for e2e test

This folder containers docker file and distributed mnist model for e2e test.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/containers docker file/contains Dockerfile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

@jlewi
Copy link
Contributor

jlewi commented Apr 23, 2018

Why do we need to use mnist for testing?
Why can't we use the current E2E tests?

The current E2E tests verify that ops can be assigned to all of the workers and that those ops are executed successfully
https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/tf_sample/tf_smoke.py#L52

How does using a model like mnist help? Are you actually verifying that distributed training is working and that ops are properly assigned and executed on multiple workers?

For E2E tests, I think the thing we want to test is not that we can train mnist but that all the TFServers are created and properly configured to talk to each other. Can we create simpler tests that explicitly test this?

For GPUs for example we can log device placement and then check that ops are actually assigned to the GPU. I thought we were already doing this but looks like we aren't. We should open an issue to track that.

@jlewi
Copy link
Contributor

jlewi commented Apr 23, 2018

/assign jlewi

@ScorpioCPH
Copy link
Member Author

ScorpioCPH commented Apr 23, 2018

@jlewi Hi, tf_smoke.py is good, i think mnist training is simple enough and we need some real training which running a while to verify the network and other congfigs.

Are you actually verifying that distributed training is working and that ops are properly assigned and executed on multiple workers?

We can achieve this goal by launching a real distributed training not just a smoke test.

@ScorpioCPH
Copy link
Member Author

@jlewi We need more tests including GPUs of course. I will add more tests later.

@ScorpioCPH ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from 9be81fe to 59f5b15 Compare April 23, 2018 06:04
@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

@ScorpioCPH great thanks.

@ScorpioCPH
Copy link
Member Author

@jlewi Thanks, any concerns about this PR?

@gaocegege By the way, CI failed as this command "gometalinter --config=linter_config.json ./pkg/..." exited with 1. Maybe not related to this PR.

@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

@ScorpioCPH no
/lgtm
/hold
Hold to let @gaocegege approve

@gaocegege
Copy link
Member

Yeah, Travis CI failed but the presubmit should passed.

BTW lgtm.

@gaocegege
Copy link
Member

gometalinter is a little werid, seems that it is false negative. Sometimes there are errors but it won't report to us.

@gaocegege
Copy link
Member

/hold cancel

@gaocegege
Copy link
Member

After the rebase, Travis should pass.

@ScorpioCPH ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from 59f5b15 to cbd9b69 Compare April 25, 2018 13:03
@k8s-ci-robot k8s-ci-robot removed the lgtm label Apr 25, 2018
@gaocegege gaocegege added the lgtm label Apr 26, 2018
@ScorpioCPH
Copy link
Member Author

@gaocegege Rebase is done, but the CI is still failed, could you take a look please?

@gaocegege
Copy link
Member

/retest

OK, sure.

@gaocegege
Copy link
Member

I think the CI is fixed, please rebase the master 😄

@gaocegege
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Copy link
Member

/retest

@gaocegege
Copy link
Member

************* Module dist_mnist
W: 97, 9: Unused argument 'unused_argv' (unused-argument)
R: 97, 0: Too many branches (21/12) (too-many-branches)
R: 97, 0: Too many statements (104/50) (too-many-statements)
-----------------------------------
Your code has been rated at 9.77/10

@ScorpioCPH
Copy link
Member Author

/retest

@gaocegege
Copy link
Member

Hi @ScorpioCPH

I think we should fix the linting issues in the python module dist_mnist.

@ScorpioCPH ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from e202a8f to d512fdf Compare May 7, 2018 03:50
@k8s-ci-robot
Copy link

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 7, 2018
@k8s-ci-robot
Copy link

@ScorpioCPH: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
kubeflow-tf-operator-presubmit d512fdf link /test kubeflow-tf-operator-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gaocegege
Copy link
Member

Hi, could you please fix the linting issues? The second failure is not caused by your code, IMO.

@gaocegege
Copy link
Member

I met an error when I run the build script: https://pastebin.ubuntu.com/p/Z8Xv3NNhss/

🤔 Do you have any idea about it?

@ScorpioCPH
Copy link
Member Author

@gaocegege Hi, please try the latest code, the Dockerfile is more simpler:

FROM tensorflow/tensorflow:1.5.0

ADD . /var/tf_dist_mnist
ENTRYPOINT ["python", "/var/tf_dist_mnist/dist_mnist.py"]

@ScorpioCPH
Copy link
Member Author

And can we ignore the python lint warning? I think it is not the critical thing and will take some time to fix all of the warnings.

@gaocegege
Copy link
Member

Then we could add it into pylint ignore.

@gaocegege
Copy link
Member

Could you do it or we could merge it and I can add it for you.

@gaocegege
Copy link
Member

I will merge it first since I rely on the PR to test. Then I will file a PR to fix the presubmit test.

@gaocegege gaocegege merged commit e380580 into kubeflow:master May 10, 2018
@gaocegege
Copy link
Member

Thanks for your contribution!

yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants