Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helm test tf-job does not pass validation #351

Closed
jose5918 opened this issue Jan 29, 2018 · 3 comments
Closed

Helm test tf-job does not pass validation #351

jose5918 opened this issue Jan 29, 2018 · 3 comments
Assignees

Comments

@jose5918
Copy link
Contributor

I'm on gke with Kubernetes 1.8.6.

I have found that helm test e2e test does not pass validation.

Here are the steps I did to recreate:

▶ helm install tf-job-operator-chart -n tf-job --wait --replace --set cloud=gke
▶ helm test tf-job
RUNNING: tf-job-tfjob-test-r2ikp2
FAILED: tf-job-tfjob-test-r2ikp2, run `kubectl logs tf-job-tfjob-test-r2ikp2 --namespace jgaguirr` for more info
Error: 1 test(s) failed
▶ kubectl logs tf-job-tfjob-test-r2ikp2 | grep reason
    "reason": "",
    "reason": "invalid job spec: Replica is missing Template; {\n  \"replicas\": 1,\n  \"tfPort\": 2222,\n  \"tfReplicaType\": \"PS\"\n}",
    "reason": "invalid job spec: Replica is missing Template; {\n  \"replicas\": 1,\n  \"tfPort\": 2222,\n  \"tfReplicaType\": \"PS\"\n}",

It also appears to be fail validation for the example:

▶ kubectl create -f https://raw.githubusercontent.com/tensorflow/k8s/master/examples/tf_job.yaml
tfjob "example-job" created
▶ k get tfjobs example-job -o json | grep reason
        "reason": "invalid job spec: Replica is missing Template; {\n  \"replicas\": 2,\n  \"tfPort\": 2222,\n  \"tfReplicaType\": \"PS\"\n}",

It looks like this might have been introduced in https://github.com/tensorflow/k8s/pull/343/files.

I would like to fix this but I'm not sure if the fix involved specifying a template for the parameter server in the example (where I would probably use the previous default), or if the validation is supposed to allow the parameter server replica to omit the template.

@jose5918 jose5918 changed the title Helm test TFjob does not pass validation Helm test tf-job does not pass validation Jan 29, 2018
@jlewi
Copy link
Contributor

jlewi commented Jan 30, 2018

So there are two issues one is the issue you mentioned and the other is that fact that are tests are reported as passing even though the test is failing.

To fix the issue mentioned above we need to specify the program to invoke for the parameter servers because #343 removed the default program.

So here we need to add a Template for the PS Replica just like template we have for the master and workers.

I'll open a separate issue about why helm test failure isn't properly reported to gubernator.

@karthikvadla
Copy link

I tried adding spec to PS section for https://github.com/tensorflow/k8s/blob/master/examples/tf_job.yaml#L24. It fixed invalid spec issue while creating TFjob example-job.
I did same change in https://github.com/tensorflow/k8s/blob/master/test/e2e/main.go#L90 and
https://github.com/tensorflow/k8s/blob/master/test/e2e/simple_job.yaml.template#L24. But Still helm test tf-job failing with same invalid spec error.

@jose5918 jose5918 mentioned this issue Jan 30, 2018
@jlewi
Copy link
Contributor

jlewi commented Jan 30, 2018

@karthikvadla Can you take a look at #356 looks like the tests are passing with that PR.

jlewi pushed a commit that referenced this issue Jan 31, 2018
* Helm test was failing because validation for a tfjob required that replicaSpecs for a Parameter server specify a template. Helm test failure also was not reported.

Changes made:

* Updated e2e tests and examples to include a template for the PS replicaSpec
* Check for None before concatenating the error.
* Fixes #351 and #355
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants