TFJob not marked as success when master exits but not workers #634
The problem appears to be here. The code doesn't appear to implement the termination policy logic as discussed in the v1alpha2 spec. In particular, if there is a chief, the job should be marked as completed when the chief exits.
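A minimal Go sketch of that policy, assuming hypothetical status types and map keys (not the operator's actual API):

```go
package main

import "fmt"

// ReplicaStatus holds per-replica-type pod counters; the field names are
// hypothetical stand-ins for whatever the controller actually tracks.
type ReplicaStatus struct {
	Active, Succeeded, Failed int32
}

// isJobSucceeded sketches the chief-based termination policy: if a master
// (chief) replica exists, the job completes as soon as the chief exits
// successfully, even while workers are still running.
func isJobSucceeded(replicas map[string]ReplicaStatus) bool {
	if master, ok := replicas["MASTER"]; ok {
		return master.Succeeded > 0
	}
	// With no chief, fall back to waiting for all workers to finish.
	w, ok := replicas["WORKER"]
	return ok && w.Active == 0 && w.Failed == 0 && w.Succeeded > 0
}

func main() {
	// The scenario from this issue: master succeeded, workers still active.
	status := map[string]ReplicaStatus{
		"MASTER": {Succeeded: 1},
		"WORKER": {Active: 2},
	}
	fmt.Println(isJobSucceeded(status)) // true: the job should be marked complete
}
```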
Support for exit policies was discussed here in the original proposal.
We have no logic for master replicas now, and I don't think exit-policy support is the only thing we are missing. I think it could be added in 0.3. My opinion is to focus on making PS/worker replicas highly available, WDYT?
* The tests are currently disabled because they aren't passing yet, because termination policy isn't handled correctly (kubeflow#634).
* Changed the v1alpha2 test to use the same smoke test as v1alpha1, as opposed to using mnist. mnist was causing problems because of issues downloading the data; see kubeflow/kubeflow#974.
  * We want a simpler test that allows for more direct testing of the distributed communication pattern.
  * Also, mnist is expensive in that it tries to download data.
* Add a parameter tfJobVersion to the deploy script so we can control whether we deploy v1alpha1 or v1alpha2.
* Parameterize the E2E test workflow by the TFJob version we want to run.
* Update test-app - we need to pull in a version of the app which has the TFJobVersion flag.
* Create a script to regenerate the test-app for future use. Related to kubeflow#589.
And FYI, the master spec is using:
I implemented the logic in #637, and the code will generate a cluster spec like this:
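A minimal Go sketch of the TF_CONFIG-style shape such a cluster spec takes; the job name, port, and service addresses below are made up for illustration, not the operator's exact naming scheme:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Map each replica type to the addresses of its replicas, then wrap it
	// in the structure TensorFlow expects in the TF_CONFIG environment variable.
	tfConfig := map[string]interface{}{
		"cluster": map[string][]string{
			"master": {"myjob-master-0:2222"},
			"ps":     {"myjob-ps-0:2222"},
			"worker": {"myjob-worker-0:2222", "myjob-worker-1:2222"},
		},
		"task": map[string]interface{}{"type": "master", "index": 0},
	}
	b, err := json.Marshal(tfConfig)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // the JSON each pod would see in TF_CONFIG
}
```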
Thus we need to write an example for this case.
I don't think we can wait for 0.3 to implement exit policy. Without exit policy, most TF jobs won't work, and this would be a major regression, so we effectively couldn't launch v1alpha2 in 0.2; that would be a major blow to our plans to be v1 by the end of the year. Lots of TF jobs either have an explicit chief or use worker 0 as the chief. Can you open an issue on high availability? It's not clear to me what this means or why it's important.
/cc @ddysher @DjangoPeng
OK, SGTM. We have 3 days before the deadline, and I will work on it.
* Changes to support v1alpha2 testing in presubmits.
* The tests are currently disabled because they aren't passing yet, because termination policy isn't handled correctly (#634).
* Changed the v1alpha2 test to use the same smoke test as v1alpha1, as opposed to using mnist. mnist was causing problems because of issues downloading the data; see kubeflow/kubeflow#974.
  * We want a simpler test that allows for more direct testing of the distributed communication pattern.
  * Also, mnist is expensive in that it tries to download data.
* Add a parameter tfJobVersion to the deploy script so we can control whether we deploy v1alpha1 or v1alpha2.
* Parameterize the E2E test workflow by the TFJob version we want to run.
* Update test-app - we need to pull in a version of the app which has the TFJobVersion flag.
* Create a script to regenerate the test-app for future use. Related to #589.
* Fix versionTag logic; we need to allow for the case where versionTag is an empty string.
I think it is closed by #637.
Thanks! @gaocegege
* Fix kubeflow#634.
* Speed up the E2E test by running the build and setup-cluster steps in parallel.
  * To do this we split the setup step into two steps: 1. setting up the cluster and 2. setting up Kubeflow. Fix kubeflow#659.
* Shorten the name of the workflow for v1alpha2; otherwise the label for the workflow pod becomes too long and Argo can't run it.
* Pin the test worker image so that we don't get broken when someone updates the latest image. Make it a parameter in prow_config.yaml.
* Use a file lock to ensure only one instance of test_runner is modifying the ksonnet app at a time; this should help with various test flakes.
@gaocegege Can you provide a sample training with
@372046933 I do not really train a model; I just use busybox to test the functionality.
I ran the simple TFJob we use for smoke tests, but modified it to have PS, Workers, and a Master:
https://github.com/kubeflow/tf-operator/blob/master/test/workflows/components/simple_tfjob.jsonnet
The master is marked as succeeded but the job is still marked as running. See below.
This is different behavior from v1alpha1 and seems like a regression.
@gaocegege @ScorpioCPH