Add dist mnist model for e2e test #549

ScorpioCPH · 2018-04-21T07:51:49Z

Hi, now that the v1alpha2 code has been merged, we need some e2e tests for the new API.
This PR imports a distributed mnist model for e2e testing. @gaocegege @jlewi PTAL, thanks!

gaocegege

LGTM.

Does the test code have any changes compared to the official example in TensorFlow?

ScorpioCPH · 2018-04-21T08:00:11Z

@gaocegege Just a little change in args parsing: read args from ENV TF_CONFIG.
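
For context, TF_CONFIG is a JSON string carrying a `cluster` map and a `task` entry. Here is a minimal sketch of reading it, assuming that standard layout. The helper name `parse_tf_config` and the keys of the returned dict are hypothetical and are not taken from the dist_mnist.py in this PR.

```python
import json
import os


def parse_tf_config(environ=None):
    """Return cluster and task settings read from the TF_CONFIG variable.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    environ = os.environ if environ is None else environ
    # Treat a missing or empty TF_CONFIG as an empty configuration.
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return {
        "ps_hosts": cluster.get("ps", []),
        "worker_hosts": cluster.get("worker", []),
        "job_name": task.get("type"),
        "task_index": int(task.get("index", 0)),
    }
```

Passing the environment in as an argument keeps the helper easy to test without mutating `os.environ`.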

gaocegege · 2018-04-21T08:02:24Z

OK, could we place the file in examples/v1alpha2/mnist/? I don't think we will run e2e tests for v1alpha2 in the short term.

coveralls · 2018-04-21T08:06:42Z

Coverage remained the same at 49.861% when pulling d512fdf on ScorpioCPH:add-dist-mnist-for-e2e-test into 2a22ad4 on kubeflow:master.

ddysher · 2018-04-21T09:08:11Z

test/e2e/dist-mnist/README.md

@@ -0,0 +1,39 @@
+### Distributed mnist model for e2e test
+
+This folder containers docker file and distributed mnist model for e2e test.


s/containers docker file/contains Dockerfile

Thanks, done.

jlewi · 2018-04-23T02:53:52Z

Why do we need to use mnist for testing?
Why can't we use the current E2E tests?

The current E2E tests verify that ops can be assigned to all of the workers and that those ops are executed successfully
https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/tf_sample/tf_smoke.py#L52

How does using a model like mnist help? Are you actually verifying that distributed training is working and that ops are properly assigned and executed on multiple workers?

For E2E tests, I think the thing we want to test is not that we can train mnist but that all the TFServers are created and properly configured to talk to each other. Can we create simpler tests that explicitly test this?

For GPUs for example we can log device placement and then check that ops are actually assigned to the GPU. I thought we were already doing this but looks like we aren't. We should open an issue to track that.
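
A rough sketch of the device placement check described above, assuming the job's logs have already been collected as text and that placement lines look like `MatMul: (MatMul): /job:worker/replica:0/task:0/device:GPU:0` (the shape TensorFlow 1.x prints when device placement logging is enabled). The helper names `parse_placements` and `ops_on_gpu` are hypothetical, and the regular expression is an assumption about the log format rather than a stable interface.

```python
import re

# Assumed shape of a placement line; an optional log prefix may precede it.
PLACEMENT_RE = re.compile(
    r"(?P<op>\S+): \((?P<type>\w+)\):?\s*(?P<device>/job:\S+)"
)


def parse_placements(log_text):
    """Map each op name found in the log text to the device it was placed on."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT_RE.search(line)
        if match:
            placements[match.group("op")] = match.group("device")
    return placements


def ops_on_gpu(placements):
    """Return the sorted names of ops whose device string names a GPU."""
    return sorted(
        op for op, device in placements.items()
        if "GPU" in device.rsplit("/", 1)[-1].upper()
    )
```

A test could then assert that `ops_on_gpu(...)` is non-empty for a GPU job and fail otherwise.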

jlewi · 2018-04-23T02:54:05Z

/assign jlewi

ScorpioCPH · 2018-04-23T05:52:05Z

@jlewi Hi, tf_smoke.py is good. I think mnist training is simple enough, and we need some real training that runs for a while to verify the network and other configs.

> Are you actually verifying that distributed training is working and that ops are properly assigned and executed on multiple workers?

We can achieve this goal by launching a real distributed training job, not just a smoke test.

ScorpioCPH · 2018-04-23T05:59:55Z

@jlewi We need more tests including GPUs of course. I will add more tests later.

jlewi · 2018-04-24T01:20:12Z

@ScorpioCPH great thanks.

ScorpioCPH · 2018-04-24T04:04:30Z

@jlewi Thanks, any concerns about this PR?

@gaocegege By the way, CI failed because the command "gometalinter --config=linter_config.json ./pkg/..." exited with 1. It may not be related to this PR.

jlewi · 2018-04-24T18:04:47Z

@ScorpioCPH no
/lgtm
/hold
Hold to let @gaocegege approve

gaocegege · 2018-04-25T02:24:43Z

Yeah, Travis CI failed, but the presubmit should pass.

BTW lgtm.

gaocegege · 2018-04-25T02:26:29Z

gometalinter is a little weird; it seems to give false negatives. Sometimes there are errors but it doesn't report them to us.

gaocegege · 2018-04-25T02:26:37Z

/hold cancel

gaocegege · 2018-04-25T08:02:53Z

After the rebase, Travis should pass.

ScorpioCPH · 2018-05-02T07:44:25Z

@gaocegege Rebase is done, but the CI is still failing. Could you take a look please?

gaocegege · 2018-05-02T07:47:00Z

/retest

OK, sure.

gaocegege · 2018-05-02T15:08:26Z

I think the CI is fixed, please rebase onto master 😄

gaocegege · 2018-05-03T02:01:20Z

/lgtm

k8s-ci-robot · 2018-05-03T02:01:46Z

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaocegege · 2018-05-05T05:13:07Z

/retest

gaocegege · 2018-05-05T05:35:33Z

************* Module dist_mnist
W: 97, 9: Unused argument 'unused_argv' (unused-argument)
R: 97, 0: Too many branches (21/12) (too-many-branches)
R: 97, 0: Too many statements (104/50) (too-many-statements)
-----------------------------------
Your code has been rated at 9.77/10

ScorpioCPH · 2018-05-07T02:35:23Z

/retest

gaocegege · 2018-05-07T03:01:57Z

Hi @ScorpioCPH

I think we should fix the linting issues in the python module dist_mnist.

k8s-ci-robot · 2018-05-07T03:50:07Z

New changes are detected. LGTM label has been removed.

k8s-ci-robot · 2018-05-07T04:10:05Z

@ScorpioCPH: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
kubeflow-tf-operator-presubmit	`d512fdf`	link	`/test kubeflow-tf-operator-presubmit`


gaocegege · 2018-05-09T10:04:13Z

Hi, could you please fix the linting issues? The second failure is not caused by your code, IMO.

gaocegege · 2018-05-09T10:10:41Z

I hit an error when I ran the build script: https://pastebin.ubuntu.com/p/Z8Xv3NNhss/

🤔 Do you have any idea about it?

ScorpioCPH · 2018-05-10T01:56:20Z

@gaocegege Hi, please try the latest code; the Dockerfile is much simpler:

FROM tensorflow/tensorflow:1.5.0

ADD . /var/tf_dist_mnist
ENTRYPOINT ["python", "/var/tf_dist_mnist/dist_mnist.py"]

ScorpioCPH · 2018-05-10T01:58:30Z

And can we ignore the Python lint warnings? I don't think they are critical, and fixing all of them will take some time.

gaocegege · 2018-05-10T02:08:19Z

Then we could add it into pylint ignore.
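
As a point of comparison with a project-level ignore, pylint also accepts inline `# pylint: disable=...` comments. Below is a minimal sketch of silencing the three messages reported above on the function definition line. The function body is a placeholder, not the real `main` from dist_mnist.py, and this in-source form is an alternative shown for illustration, not necessarily what the follow-up change did.

```python
def main(unused_argv):  # pylint: disable=unused-argument,too-many-branches,too-many-statements
    """Stand-in for the long training entry point flagged by pylint."""
    # The real training logic would go here; this placeholder only returns 0.
    return 0
```

The trade-off is scope: an inline comment keeps the exemption next to the code it covers, while an ignore entry in the pylint configuration skips the whole module.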

gaocegege · 2018-05-10T02:08:42Z

Could you do it? Or we could merge it and I can add it for you.

gaocegege · 2018-05-10T02:46:39Z

I will merge it first since I rely on the PR to test. Then I will file a PR to fix the presubmit test.

gaocegege · 2018-05-10T02:47:00Z

Thanks for your contribution!

k8s-ci-robot requested review from ddysher and willb April 21, 2018 07:51

k8s-ci-robot added the size/L label Apr 21, 2018

gaocegege approved these changes Apr 21, 2018


ddysher reviewed Apr 21, 2018


k8s-ci-robot assigned jlewi Apr 23, 2018

ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from 9be81fe to 59f5b15 Compare April 23, 2018 06:04

k8s-ci-robot added lgtm do-not-merge/hold approved labels Apr 24, 2018

k8s-ci-robot removed the do-not-merge/hold label Apr 25, 2018

ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from 59f5b15 to cbd9b69 Compare April 25, 2018 13:03

k8s-ci-robot removed the lgtm label Apr 25, 2018

gaocegege added the lgtm label Apr 26, 2018

ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from cbd9b69 to e202a8f Compare May 3, 2018 01:56

k8s-ci-robot removed lgtm approved labels May 3, 2018

k8s-ci-robot assigned gaocegege May 3, 2018

k8s-ci-robot added the lgtm label May 3, 2018

gaocegege added the approved label May 3, 2018

Add dist mnist model for e2e test

d512fdf

ScorpioCPH force-pushed the add-dist-mnist-for-e2e-test branch from e202a8f to d512fdf Compare May 7, 2018 03:50

k8s-ci-robot removed the lgtm label May 7, 2018

gaocegege merged commit e380580 into kubeflow:master May 10, 2018

gaocegege mentioned this pull request May 10, 2018

.pylinrc: Add dist_mnist #581

Merged

yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018

Add dist mnist model for e2e test (kubeflow#549)

8190508

jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018

Add dist mnist model for e2e test (kubeflow#549)

62c19fc
